In this notebook we will examine if hits from Cdc14 mass spec are enriched in binding site for Cdc28. Data for the hits is provided in an excel sheet with hits for yeast and hyphae. We will download protein sequences as search through them for matches to the Cdc28 binding consensus. Really we should be doing this using a Cdc28 PWM and proper motif matching, but in the absence of such a PWM, we will use consensus sequences provided by Pete : S/T-P-X-K/R or P-X-S/T-X-(K/R){2-5}.
First we download the Candida protein sequences....
! wget http://www.candidagenome.org/download/sequence/C_albicans_SC5314/Assembly22/current/C_albicans_SC5314_A22_current_orf_trans_all.fasta.gz
Now we look for motif in each of the protein sequences.
from CGAT import IOTools, FastaIterator
import re
import pandas
regex = re.compile("[ST]P.[KR]")
hits_table=dict()
for protein in FastaIterator.iterate(
IOTools.openFile("C_albicans_SC5314_A22_current_orf_trans_all.fasta.gz")):
fname = protein.title.split(" ")[0][:-2]
hits_table[fname] = len(regex.findall(protein.sequence))
hits_table = pandas.Series(hits_table)
hits_table = hits_table.reset_index()
hits_table.columns = ["fname","nMotifs"]
hits_table.head()
Now to get hold of the excel table:
ms_hits = pandas.read_excel("CA_hits_annotatedb.xlsx")
ms_hits.fname = ms_hits.fname.apply(lambda x: x[:-2])
ms_hits.head()
Now we combine these two datasets to create a single table containing the number of motif hits for each protein and whether it was a hit in yeast, hyphae or either yeast or hyphae.
hits_table = hits_table.merge(ms_hits[["fname","Yeast", "Hyphae"]], on="fname", how='left')
hits_table.Yeast = hits_table.Yeast == "+"
hits_table.Hyphae = hits_table.Hyphae == "+"
hits_table["MS Hit"] = (hits_table.Yeast) | (hits_table.Hyphae)
hits_table.head()
We are now ready to look at the association between the two.
pandas.crosstab(hits_table.nMotifs>0, hits_table["MS Hit"])
In other words 65/(61+65) = 52% of the mass spec hits have at least one Cdc28 motif, while 1422/(1422+4778) = 23% of proteins that are not a mass spec hit have at least one Cdc28 motif. This is approximately a 2.2 fold enrichment on expectation. To test if this is significant we can do a fisher's exact test:
%load_ext rpy2.ipython
%%R -i hits_table
fisher.test(hits_table$nMotif > 0, hits_table$MS.Hit)
Thus we can conclude that this enrichment is very statistically significant, with an odds ration of 3.5 and a p-value of $1 \times 10^{-11}$.
Maybe we should go one step further - many Cdc28 targets have multiple binding sites. Lets have a look at the association of proteins that have more 2 or more Cdc28 targets and the hits from the MS.
pandas.crosstab(hits_table.nMotifs>1, hits_table["MS Hit"])
Thus 29% of the MS hits have 2 or more Cdc28 sites, while only 6% of non-MS hits have 2 or more Cdc28 sites, a 4.3 fold enrichment.
%%R
fisher.test(hits_table$nMotif > 1, hits_table$MS.Hit)
This is of course even more significant, with an odds ratio greater than 6.
We can repeat the same analysis, but only considering hits in the yeast.
pandas.crosstab(hits_table.nMotifs>0, hits_table["Yeast"])
Thus 51% of hits the yeast have Cdc28 motifs, and 23% overall, this looks very similar to the overall figures.
hits_table.dtypes
%%R -i hits_table
fisher.test(hits_table$nMotif > 0, hits_table$Yeast)
Overall the comparison for yeast looks very similar to that for yeast or hyphae.
Now we turn to mass spec hits that were found in the hyphae.
pandas.crosstab(hits_table.nMotifs>0, hits_table["Hyphae"])
Thus 62% of mass spec hits from the hyphae have the Cdc28 motif, which is an enrichment of 2.7 fold.
%%R
fisher.test(hits_table$nMotif > 0, hits_table$Hyphae)
The p-value here is a little less significant due to the smaller numbers, but $1 \times 10^{-7}$ is still very significant in anyones books. The Odds ratio of 5.2 is also higher than that for the yeast hits.
We can also look at the enrichment of proteins with two Cdc28 motifs in the hyphae hits
pandas.crosstab(hits_table.nMotifs>1, hits_table["Hyphae"])
Exactly half of the hyphae hits have two or more Cdc28 motifs. This compares to only 6.6% of those that are not hyphae hits, a 7.6fold enrichment.
%%R
fisher.test(hits_table$nMotif > 1, hits_table$Hyphae)
This is the most significant result yet, with a p-value of $3.2 \times 10^{-14}$ and an odds ratio of over 14.
Overall the MS hits are enriched for proteins that carry a perfect match to the Cdc28 binding consensus, with a 2.3 fold enrichment, an odds ratio of 3.5 and a p-value of $1\times10^{-11}$. The enrichment is even stronger for proteins that carry two seperate matches to the motif (4.3X, OR 6, p-value $1\times10^{15}$). These results are similar in yeast, but are even stronger in hyphae, with there being a 7.3 fold enrichment of hyphae hits with 2 seperate Cdc28 binding motifs (OR 14, $p=3.2\times10^{-14}$).