Text mining for protein and small molecule relations
Lars Juhl Jensen
EMBL
Why?
Overview
  Entity recognition and identification
    Recognition: find the words that are names of entities
    Identification: figure out which entities they refer to
  Information extraction
    Simple statistical co-occurrence methods
    Natural Language Processing (NLP)
  Text mining
    Mining text for overlooked relations
    Discovery of global trends from text alone
Entity recognition
  Features
    Morphological: mixes letters and digits or ends in -ase
    Context: followed by “protein” or “gene”
    Grammar: should occur as a noun
  Methodologies
    Manually crafted rule-based systems
    Machine learning (SVMs)
  But what can it be used for?
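
The morphological and context cues listed on this slide can be sketched as a small feature function. This is only an illustration under assumptions: the function name and feature keys are hypothetical, and the grammar cue (occurring as a noun) is left out because it would need a part-of-speech tagger. Features like these could feed either a hand-written rule or a classifier such as an SVM.

```python
import re


def recognition_features(token, next_token=""):
    """Return the slide's morphological and context cues for one token.

    Hypothetical helper for illustration; not the system described in the talk.
    """
    has_letter = re.search(r"[A-Za-z]", token) is not None
    has_digit = re.search(r"\d", token) is not None
    return {
        # Morphological: mixes letters and digits (e.g. Cdc28)
        "mixes_letters_and_digits": has_letter and has_digit,
        # Morphological: ends in -ase (e.g. kinase)
        "ends_in_ase": token.lower().endswith("ase"),
        # Context: followed by "protein" or "gene"
        "followed_by_cue": next_token.lower() in {"protein", "gene"},
    }


print(recognition_features("Cdc28", "protein"))
print(recognition_features("kinase"))
```
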
Entity identification
  A good synonyms list is the key
    Combine many sources
    Curate to eliminate stop words
  Flexible matching to handle orthographic variation
    Case variation: CDC28, Cdc28, and cdc28
    Prefixes: myc and c-myc
    Postfixes: Cdc28 and Cdc28p
    Spaces and hyphens: cdc28 and cdc-28
    Latin vs. Greek letters: TNF-alpha and TNFA
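
One way to handle the orthographic variation listed above is to normalize both the synonym list and each mention to a common key before lookup. The sketch below is a simplification under assumptions: the function names, the short Greek-letter table, and the identifiers MYC_ID and TNF_ID are placeholders, and blindly adding a "c-" prefix or "p" postfix to every name would cause collisions in a real dictionary, which is why curation matters.

```python
import re

# Assumed, deliberately incomplete table of spelled-out Greek letters.
GREEK = {"alpha": "a", "beta": "b", "gamma": "g"}


def normalize(name):
    """Lowercase, map spelled-out Greek letters to Latin, drop spaces and hyphens."""
    key = name.lower()
    for spelled, latin in GREEK.items():
        key = key.replace(spelled, latin)
    return re.sub(r"[\s-]+", "", key)


def build_index(synonyms):
    """Map normalized names, plus prefix and postfix variants, to identifiers."""
    index = {}
    for name, identifier in synonyms.items():
        index[normalize(name)] = identifier
    for name, identifier in synonyms.items():
        key = normalize(name)
        # Variants never overwrite an exact name that is already in the index.
        index.setdefault(key + "p", identifier)   # postfix: Cdc28 / Cdc28p
        index.setdefault("c" + key, identifier)   # prefix: myc / c-myc
    return index


def identify(mention, index):
    """Return the identifier for a mention, or None if it is not in the index."""
    return index.get(normalize(mention))


# YBR160W is taken from the example slide; the other two IDs are placeholders.
index = build_index({"CDC28": "YBR160W", "myc": "MYC_ID", "TNFA": "TNF_ID"})
for mention in ["Cdc28p", "cdc-28", "c-myc", "TNF-alpha", "Swe1"]:
    print(mention, identify(mention, index))
```
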
Example
  Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
  Entities identified
    S. cerevisiae proteins: Clb2 (YPR119W), Cdc28 (YBR160W), Swe1 (YJL187C), and Cdc5 (YMR001C)
Identification of small molecules
  We have compiled a list of 14 million synonyms for 4 million chemicals
  This list was compiled from many resources: PubChem, KEGG, ChEBI, and SuperDrug
  A stop word list was manually curated from synonyms that occur 2000+ times in Medline
  Searching Medline with this list gives 12.5 million hits in 4.6 million abstracts
  Precision and recall have not been evaluated yet
  However, stop word curation has eliminated the most critical errors, so fairly high precision is likely
Co-occurrence
  Relations are extracted for co-occurring entities
    Relations are always symmetric
    The type of relation is not given
  Scoring the relations
    More co-occurrences → more significant
    Ubiquitous entities → less significant
    Same sentence vs. same paragraph
  Simple, good recall, poor precision
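
The three scoring ideas on this slide can be combined in many ways; the slide does not give a formula. The sketch below is one possible combination and every constant in it is an assumption: same-sentence pairs get a higher weight than same-paragraph pairs, the weighted count rewards frequent co-occurrence, and an observed-over-expected ratio penalizes ubiquitous entities.

```python
from collections import Counter
from itertools import combinations

SENTENCE_WEIGHT = 1.0    # assumed weight for a same-sentence co-occurrence
PARAGRAPH_WEIGHT = 0.5   # assumed weight for a same-paragraph-only co-occurrence
ALPHA = 0.6              # assumed balance between raw count and enrichment


def cooccurrence_scores(paragraphs):
    """Score entity pairs.

    paragraphs: list of paragraphs; each paragraph is a list of sentences;
    each sentence is the set of entity names found in it.
    """
    pair_counts = Counter()
    for sentences in paragraphs:
        weights = {}
        mentioned = set().union(*sentences)
        for pair in combinations(sorted(mentioned), 2):
            weights[pair] = PARAGRAPH_WEIGHT
        for entities in sentences:
            for pair in combinations(sorted(entities), 2):
                weights[pair] = SENTENCE_WEIGHT  # same sentence overrides
        pair_counts.update(weights)  # each pair counted once per paragraph

    entity_counts = Counter()
    for (a, b), count in pair_counts.items():
        entity_counts[a] += count
        entity_counts[b] += count
    total = sum(pair_counts.values())

    scores = {}
    for (a, b), count in pair_counts.items():
        # Observed weighted count relative to what the entity totals would suggest.
        enrichment = count * total / (entity_counts[a] * entity_counts[b])
        scores[(a, b)] = count ** ALPHA * enrichment ** (1 - ALPHA)
    return scores


# Placeholder entity names A-D; three tiny "paragraphs".
demo = [[{"A", "B"}], [{"A", "B"}], [{"A", "C"}, {"D"}]]
for pair, score in sorted(cooccurrence_scores(demo).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

In this toy input the pair seen twice in the same sentence ranks first, and the pair that only shares a paragraph with the frequent entity A ranks last.
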
Example
  Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
  Relations
    Correct: Clb2–Cdc28, Clb2–Swe1, Cdc28–Swe1, and Cdc5–Swe1
    Wrong: Clb2–Cdc5 and Cdc28–Cdc5
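
The counts on this slide follow directly from pairing every entity in the sentence with every other one. A minimal sketch, using only the entities and the correct/wrong labels given above (the variable names are illustrative):

```python
from itertools import combinations

# Entities identified in the example sentence.
entities = ["Clb2", "Cdc28", "Swe1", "Cdc5"]

# Relations the slide marks as correct.
correct = {
    frozenset(pair)
    for pair in [("Clb2", "Cdc28"), ("Clb2", "Swe1"), ("Cdc28", "Swe1"), ("Cdc5", "Swe1")]
}

# Simple co-occurrence: every unordered pair in the sentence becomes a relation.
extracted = [frozenset(pair) for pair in combinations(entities, 2)]
wrong = [sorted(pair) for pair in extracted if pair not in correct]
precision = (len(extracted) - len(wrong)) / len(extracted)

print(len(extracted), "pairs extracted")
print("wrong:", wrong)
print(f"precision on this sentence: {precision:.2f}")
```

All four correct relations are recovered, but two of the six extracted pairs are wrong, which is the "good recall, poor precision" trade-off in miniature.
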
NLP
  Information is extracted based on parsing and interpreting phrases or full sentences
  Good at extracting specific types of relations
  Handles directed relations
  Complex, good precision, poor recall
Example
  Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
  Relations
    Complex: Clb2–Cdc28
    Phosphorylation: Clb2 → Swe1, Cdc28 → Swe1, and Cdc5 → Swe1
Syntacto-semantic tagging
  Part-of-speech
  Gene and protein names
  Cue words for entity recognition
  Cue words for relation extraction
Named entity chunking
  A CASS grammar recognizes noun chunks related to gene expression:
    [nxgene The GAL4 gene]
Relation chunking
  Our CASS grammar also extracts relations between entities:
    [nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
 
Extraction of relations between proteins and small molecules
  Over 650,000 protein–chemical relations were identified using simple co-occurrence on Medline
  Benchmarking on DrugBank
    Of the 959 protein–drug interactions in DrugBank, literature mining evidence was found for 299
    About 30% recall is thus our current upper limit
    Precision has not yet been evaluated
  We are also working on adapting our NLP system to extract protein–chemical relations
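
The recall figure can be checked from the two counts quoted on this slide. The helper name below is illustrative; the numbers are the slide's own.

```python
def recall(true_positives, total_known):
    """Fraction of known interactions that were recovered."""
    return true_positives / total_known


# Figures quoted on the slide.
drugbank_interactions = 959
with_literature_evidence = 299

value = recall(with_literature_evidence, drugbank_interactions)
print(f"recall = {value:.1%}")  # prints "recall = 31.2%"
```

Precision cannot be computed the same way from these figures, because the slide gives no count of extracted relations that were checked and found wrong.
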
Text mining
  New relations can be inferred from published ones
    This can lead to actual discoveries if no person knows all the facts required for making the inference
    Combine facts from disconnected literatures
  Global trends can be discovered from literature
    Although all the detailed data is in the text, people may have missed the big picture
    Identify significant correlations
    Find temporal trends
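
Inferring new relations from published ones can be illustrated with the simplest possible case: if A is linked to B in one paper and B to C in another, A–C becomes a candidate worth checking. The sketch below assumes undirected relations and uses placeholder names; it is not the method used in the talk, and real candidates would still need scoring and expert review.

```python
from collections import defaultdict


def infer_indirect(relations):
    """Suggest unreported pairs that share at least one intermediate entity.

    relations: iterable of (a, b) tuples, treated as undirected.
    Returns {(a, c): set of intermediates} for pairs not directly reported.
    """
    relations = list(relations)
    known = {frozenset(pair) for pair in relations}
    neighbors = defaultdict(set)
    for a, b in relations:
        neighbors[a].add(b)
        neighbors[b].add(a)

    candidates = defaultdict(set)
    for intermediate, linked in neighbors.items():
        for a in linked:
            for c in linked:
                if a < c and frozenset((a, c)) not in known:
                    candidates[(a, c)].add(intermediate)
    return dict(candidates)


# Placeholder entities: A-B and B-C come from one literature, C-D from another.
published = [("A", "B"), ("B", "C"), ("C", "D")]
print(infer_indirect(published))
```
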
 
Correlations
  “Customers who bought this item also bought …”
  Correlating protein roles in networks
    Transcription factors are themselves transcriptionally regulated
    Kinases are themselves phosphorylated
    Many proteins are regulated both transcriptionally and post-translationally
Temporal trends
Buzzwords
Acknowledgments
  Charité
    Mathias Dunkel
    Robert Preißner
  EML Research
    Jasmin Saric
    Isabel Rojas
  EMBL Heidelberg
    Rossitza Ouzounova
    Peer Bork
    Rob Russell
    Reinhard Schneider
Thank you!
