Large-scale integration of data and text
Lars Juhl Jensen
data integration
text mining
molecular biology
medicine
association networks
guilt by association
STRING
Szklarczyk et al., Nucleic Acids Research, 2015
string-db.org
2000+ genomes
genomic context
gene fusion
Korbel et al., Nature Biotechnology, 2004
operons
Korbel et al., Nature Biotechnology, 2004
bidirectional promoters
Korbel et al., Nature Biotechnology, 2004
phylogenetic profiles
Korbel et al., Nature Biotechnology, 2004
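A minimal sketch of the phylogenetic-profile idea, assuming each gene is described by the set of genomes in which it occurs. The gene names, genome labels, and the choice of Jaccard similarity are illustrative assumptions, not the scoring scheme used in STRING.

```python
from itertools import combinations

# Hypothetical presence/absence data: gene -> set of genomes where it is found.
profiles = {
    "geneA": {"g1", "g2", "g4", "g5"},
    "geneB": {"g1", "g2", "g4", "g5"},
    "geneC": {"g3", "g6"},
}

def jaccard(a, b):
    """Fraction of genomes shared by two presence/absence profiles."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def profile_similarities(profiles):
    """Score every gene pair by how similar their phylogenetic profiles are."""
    return {
        (x, y): jaccard(profiles[x], profiles[y])
        for x, y in combinations(sorted(profiles), 2)
    }

scores = profile_similarities(profiles)
```

Genes that are gained and lost together across genomes (geneA and geneB here) get a high score, which is the "guilt by association" signal.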
a real example
Cell
Cellulosomes
Cellulose
experimental data
gene coexpression
physical interactions
Jensen & Bork, Science, 2008
genetic interactions
Beyer et al., Nature Reviews Genetics, 2007
curated knowledge
pathways
Letunic & Bork, Trends in Biochemical Sciences, 2008
many databases
different formats
different identifiers
variable quality
not comparable
not same species
hard work
(Ph.D. students)
quality scores
von Mering et al., Nucleic Acids Research, 2005
calibrate vs. gold standard
von Mering et al., Nucleic Acids Research, 2005
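One way to picture calibration against a gold standard is sketched below, assuming every scored pair can be labeled as a true or false association. The binning approach and the function names are assumptions for illustration, not the published procedure.

```python
def calibrate(raw_scores, is_true, bin_size):
    """Map raw scores to the fraction of gold-standard positives per bin.

    raw_scores: list of floats, one per scored pair
    is_true:    list of bools, True if the pair is in the gold standard
    Returns (lowest_raw_score_in_bin, precision) tuples, best bin first.
    """
    ranked = sorted(zip(raw_scores, is_true), reverse=True)
    table = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        precision = sum(1 for _, ok in chunk if ok) / len(chunk)
        table.append((chunk[-1][0], precision))
    return table

def to_probability(raw, table):
    """Look up the calibrated score for a new raw score."""
    for threshold, precision in table:
        if raw >= threshold:
            return precision
    return table[-1][1]  # below every bin: use the lowest bin's precision

table = calibrate(
    [0.9, 0.8, 0.7, 0.6, 0.2, 0.1],
    [True, True, True, False, False, False],
    bin_size=2,
)
```

Because every evidence type is mapped onto the same scale (the estimated fraction of correct associations), scores from different sources become comparable.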
homology-based transfer
Franceschini et al., Nucleic Acids Research, 2013
missing most of the data
text mining
>10 km
too much to read
computer
as smart as a dog
teach it specific tricks
named entity recognition
comprehensive lexicon
cyclin dependent kinase 1
CDC2
flexible matching
cyclin dependent kinase 1
cyclin-dependent kinase 1
orthographic variation
CDC2
hCdc2
“black list”
SDS
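The lexicon, flexible matching, orthographic variation, and black list can be sketched together as follows. The tiny lexicon, the placeholder identifier for SDS, and the set of species prefixes are assumptions for illustration only.

```python
import re

# Illustrative lexicon: synonym -> identifier. "SDS_ID" is a placeholder.
LEXICON = {
    "cyclin dependent kinase 1": "CDK1",
    "CDC2": "CDK1",
    "SDS": "SDS_ID",
}
# Names that are in the lexicon but usually mean something else in text.
BLACKLIST = {"SDS"}
# Assumed one-letter species prefixes, e.g. "h" for human, "m" for mouse.
SPECIES_PREFIXES = ("h", "m")

def normalize(name):
    """Lowercase and drop spaces and hyphens so spelling variants collapse."""
    return re.sub(r"[\s\-]+", "", name.lower())

NORMALIZED = {normalize(k): v for k, v in LEXICON.items()}
BLOCKED = {normalize(b) for b in BLACKLIST}

def lookup(mention):
    """Return the identifier for a mention, or None if unknown or blacklisted."""
    key = normalize(mention)
    if key in BLOCKED:
        return None
    if key in NORMALIZED:
        return NORMALIZED[key]
    # Orthographic variation: try again without a species prefix (hCdc2 -> cdc2).
    if key[:1] in SPECIES_PREFIXES and key[1:] in NORMALIZED:
        return NORMALIZED[key[1:]]
    return None
```

With this sketch, `lookup("cyclin-dependent kinase 1")`, `lookup("CDC2")`, and `lookup("hCdc2")` all resolve to the same identifier, while `lookup("SDS")` is suppressed.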
information extraction
co-mentioning
within documents
within paragraphs
within sentences
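Co-mentioning at the three levels above can be sketched as a weighted count. The weights, the naive substring matching, and the way text is split into paragraphs and sentences are all assumptions chosen to keep the example short.

```python
from collections import Counter
from itertools import combinations

# Illustrative weights (assumed): closer co-mentions count more.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 3.0}

def pairs_in(text, names):
    """All unordered pairs of names that occur in the given text."""
    found = sorted(n for n in names if n in text)
    return set(combinations(found, 2))

def comention_scores(documents, names):
    """Sum weighted co-mentions over documents, paragraphs, and sentences."""
    scores = Counter()
    for doc in documents:
        for pair in pairs_in(doc, names):
            scores[pair] += WEIGHTS["document"]
        for paragraph in doc.split("\n\n"):
            for pair in pairs_in(paragraph, names):
                scores[pair] += WEIGHTS["paragraph"]
            for sentence in paragraph.split(". "):
                for pair in pairs_in(sentence, names):
                    scores[pair] += WEIGHTS["sentence"]
    return scores

docs = ["HAP1 controls CYC1. CYC7 is also mentioned.\n\nUnrelated text."]
scores = comention_scores(docs, {"HAP1", "CYC1", "CYC7"})
```

Here HAP1 and CYC1 share a sentence and so outscore the pairs that only share a paragraph.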
NLP
Natural Language Processing
grammatical analysis
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
Saric et al., Proceedings of ACL, 2004
more precise
worse recall
related web resources
STITCH
STRING + 300k chemicals
Kuhn et al., Nucleic Acids Research, 2014
stitch-db.org
COMPARTMENTS
Binder et al., Database, 2014
compartments.jensenlab.org
TISSUES
tissues.jensenlab.org Santos et al., submitted, 2015
DISEASES
diseases.jensenlab.org Frankild et al., Methods, 2015
general framework
curated knowledge
experimental data
text mining
computational predictions
common identifiers
quality scores
visualization
web resources
download files
why so many?
Swiss army knife syndrome
targeted resources
common infrastructure
medical data mining
Jensen et al., Nature Reviews Genetics, 2012
opt-out
opt-in
structured data
Jensen et al., Nature Reviews Genetics, 2012
civil registration system
established in 1968
Jensen et al., Nature Reviews Genetics, 2012
national discharge registry
14 years
6.2 million patients
119 million diagnoses
Jensen et al., Nature Reviews Genetics, 2012
guilt by association
naïve approach
comorbidity
Jensen et al., Nature Reviews Genetics, 2012
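The naïve approach to comorbidity can be sketched as a relative risk: the observed number of patients with both diagnoses divided by the number expected if the two were independent. The function name and the counts in the example are made up for illustration.

```python
def relative_risk(n_both, n_a, n_b, n_total):
    """Observed co-occurrence of diagnoses A and B over that expected by chance."""
    expected = n_a * n_b / n_total
    return n_both / expected

# Made-up counts: 1000 patients with A, 500 with B, 50 with both, 100000 in total.
rr = relative_risk(n_both=50, n_a=1000, n_b=500, n_total=100000)
```

A value well above 1 suggests that the two diagnoses occur together more often than chance, but it says nothing yet about why.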
confounding factors
“known knowns”
gender
age
type of hospital encounter
Jensen et al., Nature Communications, 2014
“known unknowns”
smoking
diet
“unknown unknowns”
reporting biases
matched controls
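To illustrate why known confounders matter, the sketch below computes the expected co-occurrence within each stratum (for example gender and age group) before summing. Note that this uses stratification as a simpler stand-in for sampling matched controls; the data layout and names are assumptions.

```python
from collections import defaultdict

def stratified_relative_risk(patients, diag_a, diag_b):
    """patients: iterable of (stratum, set_of_diagnoses) tuples."""
    # Per stratum: [total, with A, with B, with both]
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for stratum, diagnoses in patients:
        c = counts[stratum]
        has_a, has_b = diag_a in diagnoses, diag_b in diagnoses
        c[0] += 1
        c[1] += has_a
        c[2] += has_b
        c[3] += has_a and has_b
    observed = sum(c[3] for c in counts.values())
    expected = sum(c[1] * c[2] / c[0] for c in counts.values())
    return observed / expected if expected else float("nan")

# Toy data: both diagnoses occur only in the "old" stratum.
patients = [("old", {"A", "B"})] * 4 + [("young", set())] * 4
rr_stratified = stratified_relative_risk(patients, "A", "B")
rr_naive = 4 / (4 * 4 / 8)  # same data, ignoring the strata
```

In this toy data the naïve relative risk is 2.0, while the stratified one is 1.0: the apparent association is explained entirely by age.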
temporal correlations
trajectories
Jensen et al., Nature Communications, 2014
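Temporal correlations can be sketched by counting, for each ordered pair of diagnoses, how many patients received the first before the second. The data layout (ISO date strings, placeholder codes "X" and "Y") is an assumption, and a real analysis would also test each direction for significance.

```python
from collections import Counter

def directed_pair_counts(histories):
    """Count patients in whom diagnosis a was first seen before diagnosis b.

    histories: one list per patient of (iso_date, diagnosis) tuples, any order.
    """
    counts = Counter()
    for events in histories:
        first_seen = {}
        for date, diagnosis in sorted(events):
            first_seen.setdefault(diagnosis, date)  # keep the earliest date
        for a, date_a in first_seen.items():
            for b, date_b in first_seen.items():
                if date_a < date_b:
                    counts[(a, b)] += 1
    return counts

histories = [
    [("2001-03-01", "X"), ("2003-06-15", "Y")],
    [("2002-01-10", "X"), ("2005-09-30", "Y")],
    [("2004-02-20", "Y"), ("2006-11-05", "X")],
]
counts = directed_pair_counts(histories)
```

Pairs with a clear preferred direction (here X before Y) are the building blocks that can be chained into longer trajectories.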
trajectory networks
Jensen et al., Nature Communications, 2014
complex networks
key diagnoses
Jensen et al., Nature Communications, 2014
direct medical implications
medical text mining
pharmacovigilance
unstructured data
Danish
comprehensive lexicon
drugs
Clozapine
clozapin
clossapin
klozapine
chlosapin
chlosapine
chlozapin
chlozapine
klossapin
closapine
klozapin
klosapin
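One way to make a lexicon tolerant of such spelling variants is to reduce every word to a spelling key before lookup. The substitution rules below were read off the variants listed above and are an assumption for illustration, not the method used in the cited work.

```python
import re

def spelling_key(word):
    """Collapse common spelling variation into a single lookup key."""
    key = word.lower()
    key = key.replace("ch", "c").replace("k", "c").replace("z", "s")
    key = re.sub(r"s+", "s", key)  # "ss" -> "s"
    return key[:-1] if key.endswith("e") else key

VARIANTS = [
    "Clozapine", "clozapin", "clossapin", "klozapine", "chlosapin",
    "chlosapine", "chlozapin", "chlozapine", "klossapin", "closapine",
    "klozapin", "klosapin",
]
keys = {spelling_key(v) for v in VARIANTS}
```

All twelve spellings collapse to the same key, so a single lexicon entry is enough to recognize them.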
adverse drug events
rule-based system
Eriksson et al., Drug Safety, 2014
Figure legend: Drug introduction; Drug discontinuation; Identification start; Adverse event; Negative modifier; Indication; Pre-existing condition; Adverse drug reaction; Possible adverse drug reaction; ADR of additional drug
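A rule-based classifier using the categories in the figure legend might look like the sketch below. The order of the rules, the input fields, and the example terms are guesses made for illustration; the actual rules are described in Eriksson et al.

```python
def classify_mention(mention, drug_start, indications, known_adrs):
    """Assign one label to a symptom mention found in a patient record.

    mention:     dict with 'term', 'date' (ISO string), and 'negated' (bool)
    drug_start:  ISO date of drug introduction
    indications: set of terms the drug is prescribed for
    known_adrs:  set of terms already listed as reactions to the drug
    """
    if mention["negated"]:
        return "negated"  # e.g. preceded by a negative modifier
    if mention["term"] in indications:
        return "indication"
    if mention["date"] < drug_start:
        return "pre-existing condition"
    if mention["term"] in known_adrs:
        return "adverse drug reaction"
    return "possible adverse drug reaction"

# Hypothetical example inputs.
label = classify_mention(
    {"term": "nystagmus", "date": "2010-05-02", "negated": False},
    drug_start="2010-04-20",
    indications={"anxiety"},
    known_adrs={"drowsiness"},
)
```

Mentions that survive every filter without being a known reaction end up as possible adverse drug reactions, which is where new candidates come from.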
new adverse drug reactions
Eriksson et al., Drug Safety, 2014
Drug substance        ADE                  p-value
Chlordiazepoxide      Nystagmus            4.0e-8
Simvastatin           Personality changes  8.4e-8
Dipyridamole          Visual impairment    4.4e-4
Citalopram            Psychosis            8.8e-4
Bendroflumethiazide   Apoplexy             8.5e-3
estimate ADR frequencies
Eriksson et al., Drug Safety, 2014
Acknowledgments
STRING/STITCH
Michael Kuhn
Damian Szklarczyk
Andrea Franceschini
Milan Simonovic
Alexander Roth
Sune Pletscher-Frankild
Jianyi Lin
Pablo Minguez
Christian von Mering
Peer Bork
Text mining
Sune Pletscher-Frankild
Jasmin Saric
Evangelos Pafilis
Alberto Santos
Janos Binder
Kalliopi Tsafou
Heiko Horn
Michael Kuhn
Reinhardt Schneider
Sean O’Donoghue
EHR mining
Anders Boeck Jensen
Robert Eriksson
Peter Bjødstrup Jensen
Andreas Bok Andersen
Sabrina Gade Ellesøe
Henriette Schmock
Tudor Oprea
Pope Moseley
Thomas Werge
Søren Brunak
