Biological Literature Mining Lars Juhl Jensen EMBL
Why?
Overview
- Information retrieval and text categorization
  - Methodologies for finding and classifying texts
- Entity recognition and information extraction
  - Identification of gene/protein/drug names in text
  - Statistical and NLP methods for relation extraction
- Text- and data-mining
  - Making discoveries from text alone
  - Integration of text and other data types
Status
- IR, ER, and simple IE methods are fairly well established
- Advanced NLP-based IE systems are rapidly being improved
- Methods for text mining and text/data integration are still in their infancy
Evaluation
- Computational linguist lingo
  - Recall = sensitivity
  - Precision = positive predictive value (often loosely called “specificity”)
  - F-score = 2 × recall × precision / (recall + precision)
  - Best F-score    best method
- CASP-like assessments
  - IR: TREC
  - ER: BioCreAtIvE task 1
  - (IE: BioCreAtIvE task 2)
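As a small illustration of the evaluation measures above, the following sketch computes precision, recall, and F-score from raw counts. The function name and the example counts are hypothetical and do not come from the talk.

```python
def precision_recall_f(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F-score) from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * recall * precision / (recall + precision)
    return precision, recall, f_score


# Hypothetical counts: 8 correct hits, 2 spurious hits, 8 missed items.
p, r, f = precision_recall_f(tp=8, fp=2, fn=8)
```

With these made-up counts, precision is 0.8, recall is 0.5, and the F-score is about 0.62.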
Corpora
- Plain text
  - Publication abstracts: MEDLINE
  - Full text papers: PubMed Central / Highwire Press
  - Gene summaries: SGD, The Interactive Fly, OMIM, …
  - Patent descriptions: various patent databases
- Tagged text
  - Categorization: MEDLINE MeSH terms
  - Syntactic tagging: GENIA
  - Semantic tagging: GENETAG
Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
Information Retrieval and Text Categorization Lars Juhl Jensen EMBL
Overview
- Ad hoc information retrieval
  - The user enters a query/a set of keywords
  - The system attempts to retrieve the relevant texts from a large text corpus (typically MEDLINE)
- Text categorization
  - A training set of texts is created in which texts are manually assigned to classes (often only yes/no)
  - A machine learning method is trained to classify texts
  - This method can subsequently be used to classify a much larger text corpus
Ad hoc IR
- These systems are very useful since the user can provide any query
  - The query is typically Boolean (yeast AND cell cycle)
  - A few systems instead allow the relative weight of each search term to be specified by the user
- The art is to find the relevant papers even if they do not actually match the query
  - Ideally our example sentence should be extracted by the query “yeast cell cycle” although none of these words are mentioned
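A minimal sketch of Boolean AND retrieval over an inverted index may make the limitation concrete. The three toy documents and both function names are assumptions for illustration, and "cell cycle" is treated as two separate words rather than a phrase.

```python
from collections import defaultdict


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,;:()")].add(doc_id)
    return index


def boolean_and(index: dict[str, set[str]], terms: list[str]) -> set[str]:
    """Return the documents that contain every query term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Toy corpus (hypothetical); d2 paraphrases the example sentence.
docs = {
    "d1": "Yeast cell cycle regulation by cyclins",
    "d2": "Cdc28 directly phosphorylated Swe1",
    "d3": "Cell cycle arrest in human cells",
}
index = build_index(docs)
hits = boolean_and(index, ["yeast", "cell", "cycle"])
```

Only d1 is returned here: d2 is about the yeast cell cycle but contains none of the query words, which is the gap that query expansion tries to close.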
Automatic query expansion
- In a typical query, the user will not have provided all relevant words and variants thereof
- By automatically expanding queries with additional search terms, recall can be improved
  - Stemming removes common endings (yeast / yeasts)
  - Thesauri can be used to expand queries with synonyms and/or abbreviations (yeast / S. cerevisiae)
  - The next logical step is to use ontologies to make complex inferences (yeast cell cycle / Cdc28)
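The first two expansion steps can be sketched as follows. The stemmer is a deliberately crude plural stripper and the one-entry thesaurus is hypothetical; a real system would use a proper stemming algorithm and a curated synonyms resource.

```python
# Hypothetical thesaurus with a single entry, for illustration only.
THESAURUS = {"yeast": ["s. cerevisiae"]}


def crude_stem(word: str) -> str:
    """Strip a trailing plural 's' (a toy stand-in for a real stemmer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def expand_query(terms: list[str]) -> list[str]:
    """Stem each term, then add its thesaurus synonyms, without duplicates."""
    expanded = []
    for term in terms:
        stem = crude_stem(term.lower())
        for candidate in [stem] + THESAURUS.get(stem, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


expanded = expand_query(["yeasts"])
```

Here the query "yeasts" becomes "yeast" plus "s. cerevisiae". The ontology-based step (yeast cell cycle to Cdc28) is not covered by this sketch.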
Document similarity
- The similarity of two documents can be defined based on their word content
  - Each document can be represented by a word vector
  - Words should be weighted based on their frequency and background frequency
  - The most commonly used scheme is tf*idf weighting
- Document similarity can be used in ad hoc IR
  - Rather than only matching the query against each document, the N documents most similar to each hit are also considered
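As a rough sketch of the idea, the code below builds tf*idf word vectors and compares them by cosine similarity. It uses one common tf*idf variant (raw term count times log inverse document frequency); the talk does not specify which variant or similarity measure was meant, and the three toy documents are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Represent each document as a sparse word -> tf*idf weight mapping."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf})
    return vectors


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "cdc28 phosphorylates swe1 in yeast",
    "swe1 is degraded after phosphorylation by cdc28",
    "fish oil and blood viscosity",
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
```

The first two documents share the words cdc28 and swe1 and therefore get a positive similarity, while the third shares no words with the first and scores zero.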
Document clustering
- Unsupervised clustering algorithms can be applied to a document similarity matrix
  - All pairwise document similarities are calculated
  - Clusters of “similar documents” can be constructed using one of numerous standard clustering methods
- Practical uses of document clustering
  - The “related documents” function in PubMed
  - Logical organization of the documents found by IR
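One of the simplest standard methods is single-linkage clustering with a similarity cutoff, sketched below. The choice of method, the threshold, and the 4 × 4 similarity matrix are all assumptions for illustration; the talk does not name a specific algorithm.

```python
def cluster_by_threshold(sim: list[list[float]], threshold: float) -> list[list[int]]:
    """Single-linkage clustering: documents i and j are linked if
    sim[i][j] >= threshold, and each connected component is a cluster."""
    unvisited = set(range(len(sim)))
    clusters = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        cluster, stack = [start], [start]
        while stack:
            i = stack.pop()
            linked = [j for j in unvisited if sim[i][j] >= threshold]
            for j in linked:
                unvisited.remove(j)
                cluster.append(j)
                stack.append(j)
        clusters.append(sorted(cluster))
    return clusters


# Hypothetical pairwise similarity matrix for four documents.
sim = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
]
clusters = cluster_by_threshold(sim, threshold=0.5)
```

With a cutoff of 0.5 the four documents fall into two clusters, documents 0 and 1 and documents 2 and 3.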
Text categorization
- These systems are a lot less flexible than ad hoc systems but can attain better accuracy
  - Works on a pre-defined set of document classes
  - Each class is defined by manually assigning a number of documents to it
- Method
  - Rules may be manually crafted based on a very small set of manually classified documents
  - Statistical machine learning methods can be trained on a large number of classified documents
Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
- Hints in the text
  - Strong: Cdc28 and Swe1 (“cell cycle” and “yeast”)
  - Weaker: mitotic cyclin, Clb2, and Cdk1 (“cell cycle”)
Machine learning
- Input features
  - Word content or bi-/tri-grams
  - Part-of-speech tags
  - Filtering (stop words, part-of-speech)
  - Singular value decomposition
- Training
  - Support vector machines are best suited
  - Choice of kernel function
  - Separate training and evaluation sets, cross validation
Summary
- Pros and cons of ad hoc IR systems
  - Highly flexible as they are not limited by a training data set
  - Can be very fast if the corpus is properly indexed
  - The accuracy and recall depend strongly on the ability of the user to select the right keywords
  - Some topics are not easily described by a query
- Pros and cons of text categorization methods
  - Very high accuracy and recall can be attained
  - Requires a separate training set for each category
Entity Recognition and Information Extraction Lars Juhl Jensen EMBL
Overview
- Entity recognition (ER)
  - Finding the genes/proteins/drugs mentioned in a text
  - Word sense disambiguation
- Information extraction (IE)
  - Simple statistical co-occurrence methods
  - Combining co-occurrence and text categorization
  - Natural Language Processing (NLP)
Entity recognition
- An important but boring problem
  - The genes/proteins/drugs mentioned within a given text must be identified
- Recognition vs. identification
  - Recognition: find the words that are names of entities
  - Identification: figure out which entities they refer to
  - Recognition without identification is of limited use
Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
- Entities identified
  - S. cerevisiae proteins: Clb2 (YPR119W), Cdc28 (YBR160W), Swe1 (YJL187C), and Cdc5 (YMR001C)
Recognition
- Features
  - Morphological: mixes letters and digits or ends in -ase
  - Context: followed by “protein” or “gene”
  - Grammar: should occur as a noun
- Methodologies
  - Manually crafted rule-based systems
  - Machine learning (SVMs)
- But what can it be used for?
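The morphological and context features can be sketched with a few string tests. The function name, the feature names, and the example sentence are hypothetical; the grammar feature is left out because it would need a part-of-speech tagger.

```python
import re


def name_features(tokens: list[str], i: int) -> dict[str, bool]:
    """Compute simple recognition features for the token at position i."""
    token = tokens[i]
    next_token = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    return {
        # Morphological: contains both a letter and a digit, e.g. Cdc28
        "mixes_letters_digits": bool(
            re.search(r"[A-Za-z]", token) and re.search(r"\d", token)
        ),
        # Morphological: ends in -ase (also fires on words like "disease")
        "ends_in_ase": token.lower().endswith("ase"),
        # Context: followed by a cue word
        "followed_by_cue": next_token in {"protein", "gene"},
    }


tokens = "The Cdc28 protein is a kinase".split()
features = [name_features(tokens, i) for i in range(len(tokens))]
```

Such feature dictionaries could feed either hand-written rules or a machine learning method; none of the features is reliable on its own.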
Identification
- A good synonyms list is the key
  - Combine many sources
  - Curate to eliminate stop words
- Flexible matching to handle orthographic variation
  - Case variation: CDC28, Cdc28, and cdc28
  - Prefixes: myc and c-myc
  - Postfixes: Cdc28 and Cdc28p
  - Spaces and hyphens: cdc28 and cdc-28
  - Latin vs. Greek letters: TNF-alpha and TNFA
Disambiguation
- The same word may mean many different things
  - Entity names may also be common English words (hairy) or technical terms (SDS)
  - Protein names may refer to related or unrelated proteins in other species (cdc2)
- The meaning can be resolved from the context
  - ER can distinguish between names and common words
  - Disambiguating non-unique names is a hard problem
  - Ambiguity between orthologs can safely be ignored
Co-occurrence
- Relations are extracted for co-occurring entities
  - Relations are always symmetric
  - The type of relation is not given
- Scoring the relations
  - More co-occurrences → more significant
  - Ubiquitous entities → less significant
  - Same sentence vs. same paragraph
- Simple, good recall, poor precision
Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
- Relations
  - Correct: Clb2–Cdc28, Clb2–Swe1, Cdc28–Swe1, and Cdc5–Swe1
  - Wrong: Clb2–Cdc5 and Cdc28–Cdc5
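The co-occurrence step for this example can be sketched as counting every unordered pair of entities mentioned in the same sentence. The function name is hypothetical, and the input is assumed to be the output of an entity recognition step (one list of names per sentence).

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences: list[list[str]]) -> Counter:
    """Count, for each unordered entity pair, the sentences mentioning both."""
    counts = Counter()
    for entities in sentences:
        # Each entity counts once per sentence; sorting makes pairs symmetric.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts


# Entities recognized in the example sentence (Swe1 is mentioned twice).
sentences = [["Clb2", "Cdc28", "Swe1", "Cdc5", "Swe1"]]
counts = cooccurrence_counts(sentences)
```

The single sentence yields all six pairs, including the two wrong ones, which illustrates the good recall and poor precision of the approach. Over a large corpus, the counts would additionally be weighed against how ubiquitous each entity is.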
Categorization
- Extracting specific types of relations
  - Text categorization methods can be used to identify sentences that mention a certain type of relation
  - Filtering can be done before or after relation extraction
- Well suited for database curation
  - Text categorization can be reused
  - High recall is most important
  - Curators can compensate for the lack of precision
NLP
- Information is extracted based on parsing and interpreting phrases or full sentences
  - Good at extracting specific types of relations
  - Handles directed relations
- Complex, good precision, poor recall
Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
- Relations
  - Complex: Clb2–Cdc28
  - Phosphorylation: Clb2 → Swe1, Cdc28 → Swe1, and Cdc5 → Swe1
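To give a feel for directed extraction, the sketch below uses a single hand-written pattern: the nearest entity before the active verb "phosphorylated" is taken as the kinase and the first entity after it as the substrate. This is a strong simplification and not the grammar-based approach of the talk; the entity list and function name are assumptions.

```python
import re

# Entities are assumed to come from a prior entity recognition step.
ENTITIES = ["Clb2", "Cdc28", "Swe1", "Cdc5"]
NAME_RE = re.compile(r"\b(" + "|".join(map(re.escape, ENTITIES)) + r")\b")


def extract_phosphorylation(sentence: str) -> list[tuple[str, str]]:
    """Return directed (kinase, substrate) pairs around 'phosphorylated'."""
    relations = []
    for verb in re.finditer(r"\bphosphorylated\b", sentence):
        before = NAME_RE.findall(sentence[: verb.start()])
        after = NAME_RE.search(sentence[verb.end():])
        if before and after:
            relations.append((before[-1], after.group(1)))
    return relations


sentence = (
    "Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated "
    "Swe1 and this modification served as a priming step to promote subsequent "
    "Cdc5-dependent Swe1 hyperphosphorylation and degradation"
)
relations = extract_phosphorylation(sentence)
```

The pattern recovers only Cdc28 → Swe1. Finding the Clb2–Cdc28 complex and the Cdc5-dependent hyperphosphorylation requires actual parsing of the noun phrases, which is where the complexity and the limited recall of NLP systems come from.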
Architecture
- Tokenization
  - Entity recognition with synonyms list
  - Word boundaries (multi words)
  - Sentence boundaries (abbreviations)
- Part-of-speech tagging
  - TreeTagger trained on GENIA
- Semantic labeling
  - Dictionary of regular expressions
- Entity and relation chunking
  - Rule-based system implemented in CASS
Semantic labeling
- Gene and protein names
- Cue words for entity recognition
- Cue words for relation extraction
- Named entity chunking
  - A CASS grammar recognizes noun chunks related to gene expression: [nxgene The GAL4 gene]
- Relation chunking
  - Our CASS grammar also extracts relations between entities: [nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
[expression_repression_active Btk regulates the IL-2 gene]
[dephosphorylation_nominal Dephosphorylation of Syk and Btk mediated by SHP-1]
[phosphorylation_nominal phosphorylation of Shc by the hematopoietic cell-specific tyrosine kinase Syk]
[phosphorylation_nominal the phosphorylation of the adapter protein SHC by the Src-related kinase Lyn]
[phosphorylation_active Lyn also participates in [phosphorylation the tyrosine phosphorylation and activation of syk]]
[phosphorylation_active Lyn, [negation but not Jak2] phosphorylated CrkL]
[expression_repression_active IL-10 also decreased [expression mRNA expression of IL-2 and IL18 cytokine receptors]]
[expression_activation_passive [expression IL-13 expression] induced by IL-2 + IL-18]
MedScan
Summary
- Entity recognition
  - The best methods rely on curated synonyms lists
- Co-occurrence methods
  - High recall but typically poor precision
  - Cannot deal with directed interactions
- Natural language processing
  - High precision but typically poor recall
  - Rule development is time consuming
Text- and Data-mining Lars Juhl Jensen EMBL
Overview
- Pure text-mining
  - Discovery of global trends
  - Inference of overlooked relations
- Integration of text and other data sources
  - Augmented text-mining methods
  - Automated annotation of high-throughput data
Trends
- Most similar to existing data mining approaches
  - Although all the detailed data is in the text, people may have missed the big picture
- Temporal trends
  - Historical summaries
  - Forecasting
- Correlations
  - “Customers who bought this item also bought …”
Time
Successful genes
Buzzwords
Correlations
- “Customers who bought this item also bought …”
- Protein networks
  - “Proteins that regulate expression …”
  - “Proteins that control phosphorylation …”
  - “Proteins that are phosphorylated …”
- Co-author networks
Transcriptional networks
- Regulates vs. Regulated: counts 32, 79, 83, 3592; P < 9 × 10^-9
Signaling pathways
- Phosphorylates vs. Phosphorylated: counts 11, 27, 44, 3704; P < 2 × 10^-7
Multiple regulation
- Expression vs. Phosphorylation: counts 8, 107, 47, 3625; P < 5 × 10^-4
Nuggets
- New relations can be inferred from published ones
  - This can lead to actual discoveries if no person knows all the facts required for making the inference
  - Combining facts from disconnected literatures
- Swanson’s pioneering work
  - Fish oil and Raynaud's disease
  - Magnesium and migraine
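The inference pattern behind this kind of discovery can be sketched as joining two sets of published relations on a shared intermediate term. The function name is hypothetical, and the two intermediate terms are illustrative examples of the kind of link involved in the fish oil case, not data extracted from the literature.

```python
from collections import defaultdict


def infer_links(a_to_b, b_to_c, known):
    """Infer A-C links supported by a shared B term, skipping known A-C pairs.

    Returns a mapping from each inferred (A, C) pair to its supporting B terms.
    """
    c_by_b = defaultdict(set)
    for b, c in b_to_c:
        c_by_b[b].add(c)
    inferred = defaultdict(set)
    for a, b in a_to_b:
        for c in c_by_b.get(b, ()):
            if (a, c) not in known:
                inferred[(a, c)].add(b)
    return dict(inferred)


# Illustrative relations from two literatures that do not cite each other.
a_to_b = [("fish oil", "blood viscosity"), ("fish oil", "platelet aggregation")]
b_to_c = [
    ("blood viscosity", "Raynaud's disease"),
    ("platelet aggregation", "Raynaud's disease"),
]
inferred = infer_links(a_to_b, b_to_c, known=set())
```

An inferred pair supported by several independent intermediate terms is a more promising candidate than one supported by a single term.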
Integration
- Automatic annotation of high-throughput data
  - Loads of fairly trivial methods
- Protein interaction networks
  - Can unify many types of interactions
  - Powerful as exploratory visualization tools
- More creative strategies
  - Identification of candidate genes for genetic diseases
  - Linking genes to traits based on species distributions
RCCs
Disease candidate genes
- Rank the genes within a chromosomal region to which a disease has been mapped
- Methods
  - G2D
    - Gene → Function → Chemical → Phenotype → Disease
    - Uses MEDLINE but not the text
  - BITOLA
    - Gene → Words → Disease (similar to ARROWSMITH)
  - Hide and co-workers
    - Gene → Tissue → Disease
G2D
Genotype–phenotype
- Genes can be linked to traits by comparing the species distributions of both
  - Mainly works for prokaryotes
  - Traits are represented by keywords
- Finding the species profiles
  - Gene profiles are found by sequence similarity
  - Keyword profiles are based on co-occurrence with the species name in MEDLINE
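A minimal sketch of the profile comparison, assuming each profile is simply the set of species in which a gene or keyword is found. The Jaccard index used here is an assumed similarity measure, since the talk does not say how profiles are compared, and all gene and species names are placeholders.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two species sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Placeholder profiles: species with a homolog of each gene, and species
# whose name co-occurs with the trait keyword.
gene_profiles = {
    "geneA": {"species_1", "species_2", "species_3"},
    "geneB": {"species_4"},
}
keyword_profile = {"species_1", "species_2", "species_3", "species_5"}

scores = {gene: jaccard(profile, keyword_profile) for gene, profile in gene_profiles.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy case geneA, found in three of the four species associated with the keyword, ranks above geneB.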
Annotation
- High-throughput experiments often result in groups of related genes
  - ER is used to find the associated abstracts
  - The frequency of each word is counted in the abstracts
  - Background frequencies of all words are pre-calculated
  - A statistical test is used to rank the words (typically Fisher’s exact test)
- The same strategy can be applied to find MeSH terms associated with a gene cluster
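The ranking step can be sketched with a one-sided Fisher's exact test, computed here as a hypergeometric tail. The sketch assumes that frequencies are counted as the number of abstracts containing a word; the function name and all counts are hypothetical.

```python
from math import comb


def fisher_right_tail(k: int, n: int, K: int, N: int) -> float:
    """One-sided Fisher's exact test: probability of at least k of the n
    cluster abstracts containing a word that occurs in K of N abstracts."""
    total = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


# Hypothetical corpus: 1000 abstracts, 20 of them linked to the gene cluster.
N_ABSTRACTS, N_CLUSTER = 1000, 20
# word -> (abstracts in the cluster with the word, abstracts overall with the word)
word_counts = {"kinase": (10, 50), "cell": (12, 600)}

p_values = {
    word: fisher_right_tail(k, N_CLUSTER, K, N_ABSTRACTS)
    for word, (k, K) in word_counts.items()
}
ranked = sorted(p_values, key=p_values.get)
```

The word "kinase" is strongly over-represented in the cluster and ranks first, whereas "cell" occurs about as often as expected from its background frequency.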
Summary
- Mining for overlooked relations
  - Few overlooked relations can be found from text alone
  - Methods that combine text and other data types have much better discovery potential
- Annotation/interpretation of high-throughput data
  - Molecular networks can be useful for gaining an overview of large expression data sets
  - Literature can be used to find keywords for a group of genes, but this has few advantages over using GO terms
Outlook Lars Juhl Jensen EMBL
Death?
- Literature mining will not be made obsolete by <insert your favorite new technology here>
  - Repositories are always made too late
  - There will always be new types of relations
- Semantically tagged XML may replace ER (hopefully!)
  - Semantically tagged XML will never tag everything
- Specific IE problems will become obsolete
  - Protein function
  - Physical protein interactions
Permission denied
- Open access
  - Literature mining methods cannot retrieve, extract, or correlate information from text unless it is accessible
  - Restricted access is already the primary problem
- Standard formats
  - Getting the text out of a PDF file is not trivial
  - Many journals now store papers in XML format
- Where do I get all the patent text?!
Innovation
- The basic tools are now in place for IR, ER, and IE
  - Development was driven by computational linguists
- Text- and data-mining
  - Biologists are needed
  - Collaboration with linguists
- Lack of innovation
  - Very few new ideas
  - Text should be combined with other data
Acknowledgments
- EML Research
  - Jasmin Saric
  - Isabel Rojas
- EMBL Heidelberg
  - Peer Bork
  - Miguel Andrade
  - Rossitza Ouzounova
  - Jan Korbel
  - Tobias Doerks
Exercises Lars Juhl Jensen EMBL
Information retrieval
- PubFinder
  - http://www.glycosciences.de/tools/PubFinder/
- Ideas
  - Do a very specific search on PubMed that retrieves only around 10–20 relevant papers
  - See if PubFinder is able to retrieve more
  - Compare this with using the “Related Articles” function in PubMed
Entity recognition
- iHOP
  - http://www.pdg.cnb.uam.es/UniPub/iHOP/
- Ideas
  - Compare iHOP vs. PubMed for finding papers related to a particular gene
  - Use iHOP to construct a small literature-based network
Information extraction
- Relation extraction
  - iProLINK (http://pir.georgetown.edu/iprolink/)
  - PreBIND (http://prebind.bind.ca)
  - PubGene (http://www.pubgene.org)
- Ideas
  - Check how complex sentences iProLINK can handle
  - Check how well PreBIND can discriminate between physical and other interactions (other interactions can be found with PubGene, ProLinks, or STRING)
Text mining
- ARROWSMITH
  - http://arrowsmith.psych.uic.edu
- Ideas
  - Fish oil and Raynaud's disease
  - Magnesium and migraine
  - Arginine and somatomedin C
  - Estrogen and Alzheimer's disease
Integration 1
- Protein networks
  - STRING beta version (http://string.embl.de:8080)
  - ProLinks (http://dip.doe-mbi.ucla.edu/pronav/)
- Ideas
  - Use both tools to find functions for proteins of known and unknown function
  - Use STRING to construct a network for a set of proteins
  - Try to reproduce the Ssn3–Msn2–Hsp104 link
Integration 2
- Finding candidate disease genes
  - G2D (http://www.ogic.ca/projects/g2d_2/)
  - BITOLA (http://www.mf.uni-lj.si/bitola/)
- Ideas
  - Take a look at the G2D results for some diseases where you know which types of genes would be sensible to suggest
  - Compare the results with BITOLA (if you have the patience to figure out their interface!)
Integration 3
- Annotation of expression data
  - MedMiner (http://discover.nci.nih.gov/textmining/)
- Ideas
  - Stating the obvious … do the one thing that MedMiner can do …

Biomedical literature mining

  • 1. Biological Literature Mining Lars Juhl Jensen EMBL
  • 3. Overview Information retrieval and text categorization Methodologies for finding and classifying texts Entity recognition and information extraction Identification of gene/protein/drug names in text Statistical and NLP methods for relation extraction Text- and data-mining Making discoveries from text alone Integration of text and other data types
  • 4. Status IR, ER, and simple IE methods are fairly well established Advanced NLP-based IE systems are rapidly being improved Methods for text mining and text/data integration are still in their infancy
  • 5. Evaluation Computational linguist lingo Recall = sensitivity Precision = specificity F-score = 2  recall  precision/(recall+precision) Best F-score  best method CASP-like assessments IR: TREC ER: BioCreAtIvE task 1 (IE: BioCreAtIvE task 2)
  • 6. Corpora Plain text Publication abstracts: M EDLINE Full text papers: PubMed Central / Highwire Press Gene summaries: SGD, The Interactive Fly, OMIM, … Patents descriptions: various patent databases Tagged text Categorization: M EDLINE MeSH terms Syntactic tagging: G ENIA Semantic tagging: G ENETAG
  • 7. Example Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
  • 8. Information Retrieval and Text Categorization Lars Juhl Jensen EMBL
  • 9. Overview Ad hoc information retrieval The user enters a query/a set of keywords The system attempts to retrieve the relevant texts from a large text corpus (typically Medline) Text categorization A training set of texts is created in which texts are manually assigned to classes (often only yes/no) A machine learning methods is trained to classify texts This method can subsequently be used to classify a much larger text corpus
  • 10. Ad hoc IR These systems are very useful since the user can provide any query The query is typically Boolean ( yeast AND cell cycle ) A few systems instead allow the relative weight of each search term to be specified by the user The art is to find the relevant papers even if they do not actually match the query Ideally our example sentence should be extracted by the query yeast cell cycle although none of these words are mentioned
  • 11.  
  • 12.  
  • 13.  
  • 14.  
  • 15.  
  • 16.  
  • 17. Automatic query expansion In a typical query, the user will not have provided all relevant words and variants thereof By automatically expanding queries with additional search terms, recall can be improved Stemming removes common endings ( yeast / yeasts ) Thesauri can be used to expand queries with synonyms and/or abbreviations ( yeast / S. cerevisiae ) The next logical step is to use ontologies to make complex inferences ( yeast cell cycle / Cdc28 )
  • 18.  
  • 19. Document similarity The similarity of two documents can be defined based on their word content Each document can be represented by a word vector Words should be weighted based on their frequency and background frequency The most commonly used scheme is tf*idf weighting Document similarity can be used in ad hoc IR Rather than matching the query against each document only, the N most similar documents are also considered
  • 20. Document clustering Unsupervised clustering algorithms can be applied to a document similarity matrix All pairwise document similarities are calculated Clusters of “similar documents” can be constructed using one of numerous standard clustering methods Practical uses of document clustering The “related documents” function in PubMed Logical organization of the documents found by IR
  • 21. Text categorization These systems are a lot less flexible than ad hoc systems but can attain better accuracy Works on a pre-defined set of document classes Each class is defined by manually assigning a number of documents to it Method Rules may be manually crafted based on a very small set of manually classified documents Statistical machine learning methods can be trained on a large number of classified documents
  • 22. Example Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation Hints in the text Strong: Cdc28 and Swe1 (“cell cycle” and “yeast”) Weaker: mitotic cyclin , Clb2 , and Cdk1 ( “cell cycle)
  • 23. Machine learning Input features Word content or bi-/tri-grams Part-of-speech tags Filtering (stop words, part-of-speech) Singular value decomposition Training Support vector machines are best suited Choice of kernel function Separate training and evaluation sets, cross validation
  • 24.  
  • 25. Summary Pros and cons of ad hoc IR systems Highly flexible as it is not limited by a training data set Can be very fast if the corpus is properly indexed The accuracy and recall depends strongly on the ability of the user to select the right keywords Some topics are not easily described by a query Pros and cons of text categorization methods Very high accuracy and recall can be attained Requires a separate training set for each category
  • 26. Entity Recognition and Information Extraction Lars Juhl Jensen EMBL
  • 27. Overview Entity recognition (ER) Finding the genes/proteins/drugs mentioned in a text Word sense disambiguation Information extraction (IE) Simple statistical co-occurrence methods Combining co-occurrence and text categorization Natural Language Processing (NLP)
  • 28. Entity recognition An important but boring problem The genes/proteins/drugs mentioned within a given text must be identified Recognition vs. identification Recognition: find the words that are names of entities Identification: figure out which entities they refer to Recognition without identification is of limited use
  • 29. Example Mitotic cyclin ( Clb2 )-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5 -dependent Swe1 hyperphosphorylation and degradation Entities identified S. cerevisiae proteins: Clb2 (YPR119W), Cdc28 (YBR160W), Swe1 (YJL187C), and Cdc5 (YMR001C)
  • 30. Recognition Features Morphological: mixes letters and digits or ends in -ase Context: followed by “protein” or “gene” Grammar: should occur as a noun Methodologies Manually crafted rule-based systems Machine learning (SVMs) But what can it be used for?
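The morphological and context features listed on this slide could be computed per token roughly as follows. The grammar feature is omitted because it would need a part-of-speech tagger. The function name, feature names, and the toy sentence are assumptions for illustration; such features would typically feed a rule-based system or a classifier rather than be used on their own.

```python
import re

def recognition_features(tokens, i):
    """Return simple name-recognition features for tokens[i]."""
    word = tokens[i]
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    return {
        # Morphological: the token contains both a letter and a digit.
        "mixes_letters_digits": bool(
            re.search(r"[A-Za-z]", word) and re.search(r"\d", word)
        ),
        # Morphological: enzyme-like suffix.
        "ends_in_ase": word.lower().endswith("ase"),
        # Context: the next token is a cue word.
        "followed_by_cue": next_word in {"protein", "gene"},
    }

tokens = "The Cdc28 protein is a kinase".split()
f_cdc28 = recognition_features(tokens, 1)
f_kinase = recognition_features(tokens, 5)
```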
  • 31. Identification A good synonyms list is the key Combine many sources Curate to eliminate stop words Flexible matching to handle orthographic variation Case variation: CDC28 , Cdc28 , and cdc28 Prefixes: myc and c-myc Postfixes: Cdc28 and Cdc28p Spaces and hyphens: cdc28 and cdc-28 Latin vs. Greek letters: TNF-alpha and TNFA
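A minimal sketch of flexible matching: every name is reduced to a normalized key before it is looked up in the synonyms list. The sketch handles case, spaces and hyphens, a few Greek letter names, and the yeast-style "p" postfix; prefixes such as c-myc are deliberately left out. The `GREEK` table, the normalization rules, and the two-entry synonyms list are assumptions; only the two identifiers are taken from the example slide.

```python
import re

# Assumed, incomplete mapping from spelled-out Greek letters to Latin ones.
GREEK = {"alpha": "a", "beta": "b", "gamma": "g"}

def normalize(name):
    """Reduce a gene/protein name to a key that ignores orthographic variation."""
    key = name.lower()                      # case variation
    key = re.sub(r"[\s\-]+", "", key)       # spaces and hyphens
    for greek, latin in GREEK.items():      # Latin vs. Greek letters
        key = key.replace(greek, latin)
    key = re.sub(r"(?<=\d)p$", "", key)     # postfix: Cdc28p -> cdc28
    return key

# Toy synonyms list keyed by identifier.
SYNONYMS = {"YBR160W": ["Cdc28"], "YJL187C": ["Swe1"]}
INDEX = {
    normalize(synonym): identifier
    for identifier, names in SYNONYMS.items()
    for synonym in names
}

def identify(name):
    """Return the identifier for a name, or None if it is not in the list."""
    return INDEX.get(normalize(name))
```

For example, `identify("cdc-28")` and `identify("Cdc28p")` both resolve to the same entry as `identify("CDC28")`. Aggressive normalization of this kind increases recall but also creates new ambiguities, which is one reason the synonyms list needs curation.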
  • 32. Disambiguation The same word may mean many different things Entity names may also be common English words ( hairy ) or technical terms ( SDS ) Protein names may refer to related or unrelated proteins in other species ( cdc2 ) The meaning can be resolved from the context ER can distinguish between names and common words Disambiguating non-unique names is a hard problem Ambiguity between orthologs can safely be ignored
  • 38. Co-occurrence Relations are extracted for co-occurring entities Relations are always symmetric The type of relation is not given Scoring the relations More co-occurrences → more significant Ubiquitous entities → less significant Same sentence vs. same paragraph Simple, good recall, poor precision
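The scoring idea can be sketched as follows: pairs that co-occur more often score higher, while pairs involving entities that occur everywhere are scaled down. The specific formula (pair count divided by the geometric mean of the two entity counts) is an assumption chosen for simplicity, not the scoring scheme of any particular system, and the second and third toy sentences are invented.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_scores(sentences):
    """sentences: list of sets of entity names recognized in each sentence."""
    single = Counter()
    pair = Counter()
    for entities in sentences:
        single.update(entities)
        # Unordered pairs: co-occurrence relations are symmetric.
        pair.update(frozenset(p) for p in combinations(sorted(entities), 2))
    return {
        p: n / math.sqrt(math.prod(single[e] for e in p))
        for p, n in pair.items()
    }

sentences = [
    {"Clb2", "Cdc28", "Swe1", "Cdc5"},  # entities of the example sentence
    {"Clb2", "Cdc28"},                  # invented
    {"Cdc28", "Swe1"},                  # invented
]
scores = cooccurrence_scores(sentences)
```

Every pair in a sentence receives a score, including pairs with no real relation between them, which is why co-occurrence has good recall but poor precision.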
  • 39. Example Mitotic cyclin ( Clb2 )-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5 -dependent Swe1 hyperphosphorylation and degradation Relations Correct: Clb2–Cdc28 , Clb2–Swe1 , Cdc28–Swe1 , and Cdc5–Swe1 Wrong: Clb2–Cdc5 and Cdc28–Cdc5
  • 41. Categorization Extracting specific types of relations Text categorization methods can be used to identify sentences that mention a certain type of relation Filtering can be done before or after relation extraction Well suited for database curation Text categorization can be reused High recall is most important Curators can compensate for the lack of precision
  • 43. NLP Information is extracted based on parsing and interpreting phrases or full sentences Good at extracting specific types of relations Handles directed relations Complex, good precision, poor recall
  • 44. Example Mitotic cyclin ( Clb2 )-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5 -dependent Swe1 hyperphosphorylation and degradation Relations: Complex: Clb2–Cdc28 Phosphorylation: Clb2 → Swe1 , Cdc28 → Swe1 , and Cdc5 → Swe1
  • 45. Architecture Tokenization Entity recognition with synonyms list Word boundaries (multi words) Sentence boundaries (abbreviations) Part-of-speech tagging TreeTagger trained on GENIA Semantic labeling Dictionary of regular expressions Entity and relation chunking Rule-based system implemented in CASS
  • 46. Semantic labeling Gene and protein names Cue words for entity recognition Cue words for relation extraction Named entity chunking A CASS grammar recognizes noun chunks related to gene expression: [ nxgene The GAL4 gene ] Relation chunking Our CASS grammar also extracts relations between entities: [ nxexpr The expression of [ nxgene the cytochrome genes [ nxpg CYC1 and CYC7 ]]] is controlled by [ nxpg HAP1 ]
  • 47. [ expression_repression_active Btk regulates the IL-2 gene ] [ dephosphorylation_nominal Dephosphorylation of Syk and Btk mediated by SHP-1 ] [ phosphorylation_nominal phosphorylation of Shc by the hematopoietic cell-specific tyrosine kinase Syk ] [ phosphorylation_nominal the phosphorylation of the adapter protein SHC by the Src-related kinase Lyn ] [ phosphorylation_active Lyn also participates in [ phosphorylation the tyrosine phosphorylation and activation of syk ]] [ phosphorylation_active Lyn , [ negation but not Jak2 ] phosphorylated CrkL ] [ expression_repression_active IL-10 also decreased [ expression mRNA expression of IL-2 and IL18 cytokine receptors ]] [ expression_activation_passive [ expression IL-13 expression ] induced by IL-2 + IL-18 ]
  • 50. Summary Entity recognition The best methods rely on curated synonyms lists Co-occurrence methods High recall but typically poor precision Cannot deal with directed interactions Natural language processing High precision but typically poor recall Rule development is time consuming
  • 51. Text- and Data-mining Lars Juhl Jensen EMBL
  • 52. Overview Pure text-mining Discovery of global trends Inference of overlooked relations Integration of text and other data sources Augmented text-mining methods Automated annotation of high-throughput data
  • 53. Trends Most similar to existing data mining approaches Although all the detailed data is in the text, people may have missed the big picture Temporal trends Historical summaries Forecasting Correlations “ Customers who bought this item also bought …”
  • 54. Time
  • 57. Correlations “ Customers who bought this item also bought …” Protein networks “ Proteins that regulate expression …” “ Proteins that control phosphorylation …” “ Proteins that are phosphorylated …” Co-author networks
  • 58. Transcriptional networks 32 79 83 3592 Regulates Regulated P < 9 × 10⁻⁹
  • 59. Signaling pathways 11 27 44 3704 Phosphorylates Phosphorylated P < 2 × 10⁻⁷
  • 60. Multiple regulation 8 107 47 3625 Expression Phosphorylation P < 5 × 10⁻⁴
  • 62. Nuggets New relations can be inferred from published ones This can lead to actual discoveries if no person knows all the facts required for making the inference Combining facts from disconnected literatures Swanson’s pioneering work Fish oil and Raynaud's disease Magnesium and migraine
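The inference pattern behind this kind of discovery can be sketched as follows: if A is linked to B in one literature and B to C in another, but A and C are never linked directly, then A–C is a candidate for an overlooked relation. The function name and data layout are assumptions, and the intermediate term "blood viscosity" is used purely as an illustrative bridging concept.

```python
from collections import defaultdict
from itertools import combinations

def infer_indirect_links(published):
    """published: iterable of (x, y) pairs stated in the literature.

    Returns {(a, c): {b, ...}} for pairs a, c that are not directly linked
    but share at least one intermediate b.
    """
    neighbours = defaultdict(set)
    for x, y in published:
        neighbours[x].add(y)
        neighbours[y].add(x)

    inferred = defaultdict(set)
    for b, linked in neighbours.items():
        for a, c in combinations(sorted(linked), 2):
            if c not in neighbours[a]:   # no direct link published
                inferred[(a, c)].add(b)
    return dict(inferred)

published = [
    ("fish oil", "blood viscosity"),
    ("blood viscosity", "Raynaud's disease"),
]
candidates = infer_indirect_links(published)
```

On real literature this produces a very large number of candidate pairs, so some form of ranking or filtering of the intermediates is needed before anything resembling a discovery emerges.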
  • 65. Integration Automatic annotation of high-throughput data Loads of fairly trivial methods Protein interaction networks Can unify many types of interactions Powerful as exploratory visualization tools More creative strategies Identification of candidate genes for genetic diseases Linking genes to traits based on species distributions
  • 71. RCCs
  • 72. Disease candidate genes Rank the genes within a chromosomal region to which a disease has been mapped Methods G2D Gene → Function → Chemical → Phenotype → Disease Uses MEDLINE but not the text BITOLA Gene → Words → Disease (similar to ARROWSMITH) Hide and co-workers Gene → Tissue → Disease
  • 73. G2D
  • 77. Genotype–phenotype Genes can be linked to traits by comparing the species distributions of both Mainly works for prokaryotes Traits are represented by keywords Finding the species profiles Gene profiles are found by sequence similarity Keyword profiles are based on co-occurrence with the species name in MEDLINE
  • 80. Annotation High-throughput experiments often result in groups of related genes ER is used to find the associated abstracts The frequency of each word is counted in the abstracts Background frequencies of all words are pre-calculated A statistical test is used to rank the words (typically Fisher’s exact test) The same strategy can be applied to find MeSH terms associated with a gene cluster
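The ranking step can be sketched with a one-sided Fisher's exact test, computed here as a hypergeometric tail probability: given N abstracts in the corpus, K of which contain a word, how likely is it that at least k of the n abstracts associated with the gene cluster contain it? The function names and all counts are invented for illustration; the background counts are assumed to include the cluster's own abstracts.

```python
from math import comb

def enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test: P(X >= k) under the hypergeometric model.

    k: cluster abstracts containing the word
    n: abstracts associated with the gene cluster
    K: abstracts in the whole corpus containing the word
    N: abstracts in the whole corpus
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

def rank_words(cluster_counts, background_counts, n, N):
    """Return words sorted from most to least over-represented."""
    scored = [
        (enrichment_p(k, n, background_counts[word], N), word)
        for word, k in cluster_counts.items()
    ]
    return [word for p, word in sorted(scored)]

# Invented counts: 10 cluster abstracts out of a corpus of 1000.
cluster_counts = {"kinase": 8, "the": 10}
background_counts = {"kinase": 50, "the": 1000}
ranked = rank_words(cluster_counts, background_counts, n=10, N=1000)
```

A word like "the" occurs in every cluster abstract but is not enriched, because it also occurs in every background abstract; this is why raw frequency alone is not used for ranking.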
  • 81. Summary Mining for overlooked relations Few overlooked relations can be found from text alone Methods that combine text and other data types have much better discovery potential Annotation/interpretation of high-throughput data Molecular networks can be useful for gaining an overview of large expression data sets Literature can be used to find keywords for a group of genes, but this has few advantages over using GO terms
  • 82. Outlook Lars Juhl Jensen EMBL
  • 83. Death? Literature mining will not be made obsolete by <insert your favorite new technology here> Repositories are always made too late There will always be new types of relations Semantically tagged XML may replace ER (hopefully!) Semantically tagged XML will never tag everything Specific IE problems will become obsolete Protein function Physical protein interactions
  • 84. Permission denied Open access Literature mining methods cannot retrieve, extract, or correlate information from text unless it is accessible Restricted access is already the primary problem Standard formats Getting the text out of a PDF file is not trivial Many journals now store papers in XML format Where do I get all the patent text?!
  • 85. Innovation The basic tools are now in place for IR, ER, and IE Development was driven by computational linguists Text- and data-mining Biologists are needed Collaboration with linguists Lack of innovation Very few new ideas Text should be combined with other data
  • 86. Acknowledgments EML Research Jasmin Saric Isabel Rojas EMBL Heidelberg Peer Bork Miguel Andrade Rossitza Ouzounova Jan Korbel Tobias Doerks
  • 87. Exercises Lars Juhl Jensen EMBL
  • 88. Information retrieval PubFinder http://www.glycosciences.de/tools/PubFinder/ Ideas Do a very specific search on PubMed that retrieves only around 10–20 relevant papers See if PubFinder is able to retrieve more Compare this with using the “Related Articles” function in PubMed
  • 89. Entity recognition iHOP http://www.pdg.cnb.uam.es/UniPub/iHOP/ Ideas Compare iHOP vs. PubMed for finding papers related to a particular gene Use iHOP to construct a small literature-based network
  • 90. Information extraction Relation extraction iProLINK ( http://pir.georgetown.edu/iprolink/ ) PreBIND ( http://prebind.bind.ca ) PubGene ( http://www.pubgene.org ) Ideas Check how complex a sentence iProLINK can handle Check how well PreBIND can discriminate between physical and other interactions (other interactions can be found with PubGene, ProLinks, or STRING)
  • 91. Text mining ARROWSMITH http://arrowsmith.psych.uic.edu Ideas Fish oil and Raynaud's disease Magnesium and migraine Arginine and somatomedin C Estrogen and Alzheimer's disease
  • 92. Integration 1 Protein networks STRING beta version ( http://string.embl.de:8080 ) ProLinks ( http://dip.doe-mbi.ucla.edu/pronav/ ) Ideas Use both tools to find functions for proteins of known and unknown function Use STRING to construct a network for a set of proteins Try to reproduce the Ssn3–Msn2–Hsp104 link
  • 93. Integration 2 Finding candidate disease genes G2D ( http://www.ogic.ca/projects/g2d_2/ ) BITOLA ( http://www.mf.uni-lj.si/bitola/ ) Ideas Take a look at the G2D results for some diseases where you know which types of genes would be sensible to suggest Compare the results with BITOLA (if you have the patience to figure out their interface!)
  • 94. Integration 3 Annotation of expression data MedMiner ( http://discover.nci.nih.gov/textmining/ ) Ideas Stating the obvious … do the one thing that MedMiner can do …