DISCOVERY OF FUNCTIONAL PROTEIN LINEAR MOTIFS USING A GREEDY ALGORITHM AND INFORMATION THEORY

LEANDRO G. RADUSKY§, JULIANA GLAVINA§, MARIA FATIMA LADELFA¶, MARTIN MONTE¶ AND IGNACIO E. SANCHEZ§

§PROTEIN PHYSIOLOGY LABORATORY, DEPARTAMENTO DE QUIMICA BIOLOGICA, FACULTAD DE CIENCIAS EXACTAS Y NATURALES-UNIVERSIDAD DE BUENOS AIRES, ARGENTINA
¶MOLECULAR AND CELL BIOLOGY LABORATORY, DEPARTAMENTO DE QUIMICA BIOLOGICA, FACULTAD DE CIENCIAS EXACTAS Y NATURALES-UNIVERSIDAD DE BUENOS AIRES, ARGENTINA

INTRODUCTION

The molecular basis of many protein-protein interactions reported in the literature is unknown, especially for those observed in high-throughput studies [1]. Many
globular domains bind in a specific manner to short (5-15 residues) sequences embedded within intrinsically disordered regions, the so-called "linear motifs" [1]. It is
likely that recognition of as yet unknown linear motifs underlies many protein-protein complexes of biological interest. We present an algorithm that extracts linear
motifs from protein-protein interaction datasets.
  


ALGORITHM

1. DATASET

The algorithm takes as input the sequences of all the protein targets bound by the protein under study. The hypothesis is that any linear motif mediating the interaction will be overrepresented in the sequences of these proteins.

The user also sets the length of the putative linear motif to search for, e.g., ten residues.

[Diagram: the protein under study physically interacts with several protein targets.]

2. INPUT FILTERS

1.  The presence of homologous proteins in the dataset would lead to spurious motif overrepresentation. We use the CD-HIT algorithm [2] to identify this kind of redundancy and remove it from the input.
2.  Most functional linear motifs are located within disordered protein domains [1]. Disordered regions are identified using the VSL software [3] and kept for analysis.

3. MOTIF SEARCH

Our software is an adaptation of a method used for motif search in DNA sequences [4], implemented in Python.

It first calculates all possible alignments of two k-words in the dataset.

Next, we offer all possible k-words to each growing alignment and incorporate the one resulting in the highest score.

We repeat this procedure until incorporation of new k-words does not increase the score of any alignment.

Last, we sort the alignments by their scores. The sorted list is the output of the search.

    Input
        Matrix M: sequences to be analyzed
        Integer L: motif length
    Output
        Matrix Res: all k-word alignments
    Algorithm
    {
        M' = ObtainAllKWords(M)
        Res = CreateAlignmentsOfTwoKWords(M')
        While Res has changed
        {
            CurrentKWords = ObtainAllKWords(M)
            For all alignments A in Res
            {
                AddBestKWord(A, CurrentKWords)
            }
        }
        SortByScore(Res)
        Print Res
    }

4. MOTIF SCORING

We use the information content [5] of each alignment to quantify the overrepresentation of the motif contained in that alignment.

The uncertainty at position l of the alignment, where f(aa,l) is the frequency of amino acid aa at that position, is:

$$H(l) = -\sum_{aa} f(aa,l) \, \log_2 f(aa,l) \quad \text{(bits)}$$

The information content at a position is the decrease in uncertainty between a random sequence and the observed sequences, with a correction e(n) for the sampling of a finite number n of sequences:

$$R_{sequence}(l) = \log_2 20 + \sum_{aa} f(aa,l) \, \log_2 f(aa,l) - e(n) \quad \text{(bits)}$$

The information content of an alignment is the sum over all positions:

$$R_{sequence} = \sum_{l} R_{sequence}(l) \quad \text{(bits)}$$

5. OUTPUT

We measure the similarity between two motifs as the Pearson correlation coefficient R between the corresponding amino acid frequencies. We group alignments whose similarity is above the desired value of R.

Finally, we use sequence logos [5] to picture the motifs in the highest scoring alignments.

RESULTS

VALIDATION: SEARCH FOR KNOWN MOTIFS

We have tested the ability of our algorithm to identify known functional linear motifs in sequence sets taken from the ELM database [6].

Motif        14-3-3 type 1   Gamma-adaptin              Clathrin box        Mannosylation   CtBP             Dynein
ELM          R(SFYW)xSxP     (DE)(DES)xFx(DE)(LVIMFD)   L(ILM)x(ILMF)(DE)   WxxW            Px(DEN)L(VAST)   (QR)xTQT
Dilimot      RSxSxP          DDxFxxF                    LIxLD               DGxW            DxPxDL           KxTQT
Our method   [logo]          [logo]                     [logo]              [logo]          [logo]           [logo]

Motif        Integrin   TRAF6
ELM          RGD        PxE
Dilimot      RxDV       PQE
Our method   [logo]     [logo]

Motif        NR box      EH1                 HP1
ELM          LxLL        Fx(IV)xx(IL)(ILM)   PxVx(LM)
Dilimot      Not found   FxIxNI              KVPxVxL
Our method   Not found   Not found           Not found

Our algorithm captures the known motif in six cases (first table), suggesting significant sequence specificity in positions marked as "x" in the consensus. There is a partial match with the known consensus in two cases (second table) and no match in three cases (third table). The performance is comparable to that of Dilimot [1], a similar program that describes motifs as consensus sequences.

CASE STUDY: NUCLEOLAR LOCALIZATION OF MAGE PROTEINS

The proteins of the MAGE (melanoma-associated antigen) family are plausible targets for anticancer therapy [7]. The MAGE-A2 protein localizes to the nucleus, while the MAGE-B2 protein is observed in both the nucleus and the nucleolus.

Our algorithm extracted a putative nucleolar localization motif from a database of nucleolar proteins [8,9]. The motif matches the Lys/Arg-rich N-terminus of MAGE-B2 (red) but not that of MAGE-A2. A truncated MAGE-B2 variant that retains the motif localizes to the nucleolus.

[Figure: GFP-MAGE-A2, GFP-MAGE-B2 and truncated MAGE-B2-GFP in transfected U2OS cells. Green: GFP tag; blue: DAPI. Magnification 100x.]

CONCLUDING REMARKS

•  We have implemented an algorithm for the discovery of novel protein functional motifs within sets of unaligned sequences.
•  The algorithm shows good performance in the recovery of known motifs.
•  We propose a putative motif responsible for localization of MAGE proteins in the nucleolus.

REFERENCES

[1] Neduva V et al. Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biology 2005, 3:e405.
[2] Huang Y et al. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 2010, 26:680-682.
[3] Obradovic Z et al. Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins 2005, 61:S176-182.
[4] Stormo GD, Hartzell GW 3rd. Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci U S A 1989, 86:1183-1187.
[5] Schneider TD, Stephens RM. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res 1990, 18:6097-6100.
[6] Gould CM et al. ELM: the status of the 2010 eukaryotic linear motif resource. Nucleic Acids Res 2010, 38:D167-180.
[7] Simpson AJ et al. Cancer/testis antigens, gametogenesis and cancer. Nat Rev Cancer 2005, 5:615-625.
[8] Emmott E, Hiscox JA. Nucleolar targeting: the hub of the matter. EMBO Rep 2009, 10:231-238.
[9] Scott MS et al. Characterization and prediction of protein nucleolar localization sequences. Nucleic Acids Res 2010, 38:7388-7399.
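
The information-content score defined under MOTIF SCORING can be sketched in a few lines of Python. This is an illustration, not the authors' code: the function names are hypothetical, and the poster does not state the form of e(n), so the sketch assumes the common large-sample approximation e(n) = (s - 1) / (2 ln 2 · n) with s = 20 amino acids. That approximation is crude for very small n, where it can push the score below zero.

```python
import math
from collections import Counter

ALPHABET_SIZE = 20  # number of amino acids


def position_uncertainty(column):
    """H(l) = -sum f(aa,l) * log2 f(aa,l), in bits, for one alignment column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())


def small_sample_correction(n, alphabet_size=ALPHABET_SIZE):
    """Assumed approximation of e(n); the poster does not give its form."""
    return (alphabet_size - 1) / (2 * math.log(2) * n)


def information_content(alignment, correct=True):
    """Rsequence: sum over positions of log2(20) - H(l) - e(n), in bits.

    `alignment` is a list of equal-length words, one per aligned sequence.
    """
    n = len(alignment)
    e_n = small_sample_correction(n) if correct else 0.0
    total = 0.0
    for column in zip(*alignment):  # iterate over alignment positions
        total += math.log2(ALPHABET_SIZE) - position_uncertainty(column) - e_n
    return total
```

With the correction switched on, the same perfectly conserved motif scores higher when it is supported by more sequences, which is what lets a growing alignment gain score as matching k-words are added.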
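
The greedy procedure described under MOTIF SEARCH can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: all function names are hypothetical, the score uses the assumed approximation e(n) = 19 / (2 ln 2 · n), and each sequence is allowed to contribute at most one k-word to an alignment (an assumption that also guarantees termination). Seeding from every pair of k-words is quadratic in the dataset size, so the sketch is only suitable for small inputs.

```python
import math
from collections import Counter
from itertools import combinations


def obtain_all_kwords(sequences, length):
    """Return (sequence index, word) for every window of the given length."""
    return [(i, seq[s:s + length])
            for i, seq in enumerate(sequences)
            for s in range(len(seq) - length + 1)]


def score(words):
    """Information content in bits, with an assumed small-sample correction."""
    n = len(words)
    e_n = 19 / (2 * math.log(2) * n)  # assumption: approximate e(n)
    total = 0.0
    for column in zip(*words):
        h = -sum((c / n) * math.log2(c / n) for c in Counter(column).values())
        total += math.log2(20) - h - e_n
    return total


def greedy_motif_search(sequences, length):
    """Grow every two-word alignment greedily; return alignments, best first."""
    kwords = obtain_all_kwords(sequences, length)
    # Seed with all pairs of k-words taken from two different sequences.
    alignments = [[a, b] for a, b in combinations(kwords, 2) if a[0] != b[0]]
    changed = True
    while changed:
        changed = False
        for alignment in alignments:
            used = {i for i, _ in alignment}
            words = [w for _, w in alignment]
            # Assumption: one k-word per sequence in any alignment.
            candidates = [kw for kw in kwords if kw[0] not in used]
            if not candidates:
                continue
            best = max(candidates, key=lambda kw: score(words + [kw[1]]))
            # Keep the best k-word only if it raises the alignment score.
            if score(words + [best[1]]) > score(words):
                alignment.append(best)
                changed = True
    return sorted(alignments,
                  key=lambda a: score([w for _, w in a]), reverse=True)
```

A small usage example: `greedy_motif_search(["ADKRKRGE", "GTKRKRAS", "PLKRKRVM", "SEKRKRTD"], 4)` returns a ranked list in which each alignment is a list of (sequence index, word) pairs.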
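
The motif comparison described under OUTPUT, a Pearson correlation between amino acid frequencies, might look like the sketch below. The names are hypothetical, and the sketch assumes that the two alignments have the same length and are compared position by position without any shift; the poster does not say how motifs are registered against each other.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_vector(alignment):
    """Flatten per-position amino acid frequencies into one vector (20 * L)."""
    n = len(alignment)
    vector = []
    for column in zip(*alignment):
        counts = Counter(column)
        vector.extend(counts[aa] / n for aa in AMINO_ACIDS)
    return vector


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def motif_similarity(alignment_a, alignment_b):
    """R between the frequency profiles of two same-length alignments."""
    return pearson(frequency_vector(alignment_a), frequency_vector(alignment_b))
```

Alignments whose pairwise R exceeds a chosen threshold could then be grouped, for example by single-linkage clustering; the grouping rule itself is not specified in the poster.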
