Mining text and data on chemicals




Lars Juhl Jensen
three parts
text mining
data integration
medical records
Part 1
text mining
exponential growth
some things are constant
~45 seconds per paper
information retrieval
find the relevant papers
still too much to read
computer
as smart as a dog
teach it specific tricks
named entity recognition
identify the concepts
small molecules
proteins
diseases
comprehensive lexicon
synonyms
orthographic variation
“black list”
unfortunate names
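
A minimal sketch of dictionary-based tagging along the lines of these slides. The lexicon entries, identifiers, and blacklist are made up for illustration and are not the actual Reflect dictionary; the normalization rule (lowercase, drop hyphens and spaces) is only one assumed way to absorb orthographic variation.

```python
import re

# Hypothetical lexicon: synonym -> (entity type, identifier). Illustrative only.
LEXICON = {
    "aspirin": ("small molecule", "CHEM:1"),
    "acetylsalicylic acid": ("small molecule", "CHEM:1"),
    "cyc1": ("protein", "PROT:1"),
    "hap1": ("protein", "PROT:2"),
    "schizophrenia": ("disease", "DIS:1"),
    "was": ("protein", "PROT:3"),  # an "unfortunate name": also a common English word
}

# Names considered too ambiguous to tag, even though they are in the lexicon.
BLACKLIST = {"was"}


def normalize(name):
    """Collapse orthographic variation: case, hyphens, and whitespace."""
    return re.sub(r"[\s\-]+", "", name.lower())


# Index by normalized form so that "CYC-1" and "Cyc1" both match "cyc1".
INDEX = {normalize(k): v for k, v in LEXICON.items() if k not in BLACKLIST}
MAX_WORDS = max(len(k.split()) for k in LEXICON)


def tag(text):
    """Return (matched text, entity type, identifier), longest match first, left to right."""
    tokens = re.findall(r"[A-Za-z0-9\-]+", text)
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            entry = INDEX.get(normalize(candidate))
            if entry:
                hits.append((candidate, *entry))
                i += n
                break
        else:
            i += 1  # no lexicon entry starts at this token
    return hits
```

Calling `tag("Aspirin was given; CYC-1 is controlled by HAP1.")` tags the drug and both proteins but skips the blacklisted word "was".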
Reflect
augmented browsing
browser add-on
Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
O’Donoghue et al., Journal of Web Semantics, 2010
Firefox
Internet Explorer
Google Chrome
Safari
Utopia Documents
web services
collaboration
SciVerse
information extraction
formalize the facts
co-mentioning
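
Co-mentioning can be sketched as counting how many documents mention two entities together. The example assumes the documents have already been tagged and reduced to lists of entity names; the toy input is illustrative, and a real system would also score the counts against what is expected by chance.

```python
from collections import Counter
from itertools import combinations


def comention_counts(tagged_docs):
    """Count, for every unordered entity pair, the number of documents mentioning both."""
    pair_counts = Counter()
    for entities in tagged_docs:
        # A sorted set counts each pair once per document, in a stable order.
        for a, b in combinations(sorted(set(entities)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Toy input: entities recognized in each of four abstracts (illustrative only).
docs = [
    ["HAP1", "CYC1", "CYC7"],
    ["HAP1", "CYC1"],
    ["CYC7", "aspirin"],
    ["HAP1", "HAP1", "CYC1"],
]
counts = comention_counts(docs)
```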
NLP
Natural Language Processing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction

[nxexpr The expression of
       [nxgene the cytochrome genes
           [nxpg CYC1 and CYC7]]]
   is controlled by
   [nxpg HAP1]
Part 2
data integration
STITCH
Kuhn et al., Nucleic Acids Research, 2012
~300,000 small molecules
~2.6 million proteins
1100+ genomes
experimental data
physical binding
chemical–protein
protein–protein
curated knowledge
drug targets
complexes
pathways
Letunic & Bork, Trends in Biochemical Sciences, 2008
text mining
co-mentioning
NLP
Natural Language Processing
many data types
many databases
different formats
different identifiers
variable quality
not comparable
spread over many genomes
quality scores
von Mering et al., Nucleic Acids Research, 2005
calibrate vs. gold standard
von Mering et al., Nucleic Acids Research, 2005
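
Calibration against a gold standard can be illustrated by binning raw scores and measuring, per bin, the fraction of pairs that the gold standard confirms. This is a simplified sketch and not the published benchmarking procedure: the function name, the bin edges, and the toy data are assumptions, and every pair missing from the gold standard is counted as wrong.

```python
def calibrate(scored_pairs, gold_positives, bin_edges):
    """Estimate, per raw-score bin, the fraction of pairs found in the gold standard.

    scored_pairs: list of (pair, raw_score)
    gold_positives: set of pairs the gold standard considers true
    bin_edges: ascending lower edges of the score bins
    Returns a list of (lower_edge, fraction_confirmed) for the non-empty bins.
    """
    totals = [0] * len(bin_edges)
    correct = [0] * len(bin_edges)
    for pair, score in scored_pairs:
        # Pick the highest bin whose lower edge does not exceed the score.
        idx = max((i for i, edge in enumerate(bin_edges) if score >= edge), default=None)
        if idx is None:
            continue  # score lies below the lowest bin edge
        totals[idx] += 1
        correct[idx] += pair in gold_positives
    return [(bin_edges[i], correct[i] / totals[i]) for i in range(len(bin_edges)) if totals[i]]


# Toy data (illustrative only).
pairs = [
    (("a", "b"), 0.9), (("a", "c"), 0.8), (("b", "c"), 0.3),
    (("c", "d"), 0.2), (("a", "d"), 0.1), (("b", "d"), 0.35),
]
gold = {("a", "b"), ("a", "c"), ("b", "c")}
curve = calibrate(pairs, gold, [0.0, 0.5])
```

The resulting (score bin, fraction confirmed) points are what a calibration curve would be fitted to, so that a raw score can be translated into a probability.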
probabilistic scores
orthology transfer
combine the evidence
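
One common way to combine probabilistic scores from several evidence types is a noisy-OR, which assumes the evidence channels are independent. The slides do not state the formula, so treat this as an assumed illustration; a production scoring scheme may apply further corrections.

```python
def combine(scores):
    """Combine per-channel probabilities, assuming the channels are independent."""
    p_all_wrong = 1.0
    for s in scores:
        p_all_wrong *= 1.0 - s  # probability that this channel's evidence is wrong
    return 1.0 - p_all_wrong


# Hypothetical scores from experiments, curated knowledge, and text mining.
combined = combine([0.6, 0.7, 0.2])
```

Under this assumption, several weak but independent lines of evidence add up to a higher combined score than any one of them alone.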
Part 3
patient records
a hard problem
in Danish
by busy doctors
about psychiatric patients
no lexicon
acronyms
typos
delusions
domain specific system
patient record excerpt (annotations: Negation, F20, F200, Family)
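
Negation handling in clinical text can be sketched as a search for a cue word shortly before a recognized term. The cue list, the window size, and the English example are assumptions for illustration only; the records discussed here are in Danish and would need their own cues, and this sketch ignores sentence boundaries.

```python
import re

# Assumed English cue words; a system for Danish records would need its own list.
NEGATION_CUES = {"no", "not", "without", "denies"}
WINDOW = 3  # how many tokens before a term to search for a cue


def find_mentions(text, terms):
    """Return (term, negated) for every occurrence of a known term in the text."""
    tokens = re.findall(r"\w+", text.lower())
    mentions = []
    for i, token in enumerate(tokens):
        if token in terms:
            preceding = tokens[max(0, i - WINDOW):i]
            mentions.append((token, any(t in NEGATION_CUES for t in preceding)))
    return mentions


mentions = find_mentions(
    "Patient denies hallucinations but reports insomnia.",
    {"hallucinations", "insomnia"},
)
```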
medication
adverse drug events
diagnoses
pharmacovigilance
patient stratification
Roque et al., PLoS Computational Biology, 2011
disease comorbidity
Roque et al., PLoS Computational Biology, 2011
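
Disease comorbidity can be illustrated as the ratio of observed to expected co-occurrence of two diagnosis codes across patients. This is a generic sketch, not the scoring used in the cited paper; the patient data are made up, and only the code F20 appears in the slides.

```python
from collections import Counter
from itertools import combinations


def comorbidity_ratios(patient_diagnoses):
    """Return observed/expected co-occurrence for each pair of codes seen together."""
    n_patients = len(patient_diagnoses)
    single = Counter()
    paired = Counter()
    for codes in patient_diagnoses:
        unique = sorted(set(codes))
        single.update(unique)                   # patients per code
        paired.update(combinations(unique, 2))  # patients per code pair
    # Expected count under independence is n_a * n_b / n_patients.
    return {
        (a, b): n_ab * n_patients / (single[a] * single[b])
        for (a, b), n_ab in paired.items()
    }


# Toy cohort of four patients (illustrative only).
patients = [{"F20", "F32"}, {"F20", "F32"}, {"F20"}, {"F10"}]
ratios = comorbidity_ratios(patients)
```

A ratio above 1 means the two diagnoses occur together more often than expected if they were independent; with so few patients the value is of course not meaningful.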
DNA sequencing
genotype
phenotype
Acknowledgments

Reflect                    STITCH                 EPJ-mining
Sune Frankild              Michael Kuhn           Francisco S Roque
Heiko Horn                 Damian Szklarczyk      Peter B Jensen
Evangelos Pafilis          Andrea Franceschini    Robert Eriksson
Juan-Carlos Silla-Castro   Milan Simonovic        Henriette Schmock
Michael Kuhn               Alexander Roth         Marlene Dalgaard
Reinhardt Schneider        Pablo Minguez          Massimo Andreatta
Sean O’Donoghue            Tobias Doerks          Thomas Hansen
                           Manuel Stark           Karen Søeby
                           Christian von Mering   Søren Bredkjær
                           Peer Bork              Anders Juul
                                                  Thomas Werge
                                                  Søren Brunak