Text-mining practical
Lars Juhl Jensen
unix primer
the command line
some useful commands
cat
less
head -10
tail -10
grep 'needle'
cut -f 2
sort
sort -nr
uniq -c
redirecting output
write to file
command > filename
using pipes
command1 | command2
putting it all together
cut -f 4 infile | sort | uniq -c | sort -nr | head -100 > outfile
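To see what the combined pipeline does, here is a small sketch that can be pasted into a terminal. The input file, its contents, and the GENE_A/GENE_B/GENE_C values are invented for illustration; only the pipeline itself comes from the slide.

```shell
# Work in a scratch directory so no existing file is overwritten.
workdir=$(mktemp -d)

# Invented tab-delimited input with four columns; only column 4 matters here.
{
    printf '1\tx\ty\tGENE_A\n'
    printf '2\tx\ty\tGENE_B\n'
    printf '3\tx\ty\tGENE_A\n'
    printf '4\tx\ty\tGENE_C\n'
    printf '5\tx\ty\tGENE_A\n'
    printf '6\tx\ty\tGENE_B\n'
} > "$workdir/infile"

# Take column 4, sort so identical values are adjacent, count them,
# sort by count (largest first), and keep at most the top 100 lines.
cut -f 4 "$workdir/infile" | sort | uniq -c | sort -nr | head -100 > "$workdir/outfile"

# Show the result: one line per distinct value, count first.
cat "$workdir/outfile"
```

The first `sort` is needed because `uniq -c` only merges identical lines that are adjacent. The second `sort -nr` orders the counts numerically in reverse, so the most frequent value ends up on top.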
the task
disease gene finding
named entity recognition
human genes
gene prioritization
what I have done
information retrieval
two diseases
prostate cancer
schizophrenia
two sets of documents
62,755 abstracts
65,588 abstracts
one directory with each set
one file with each abstract
dictionary
tab-delimited file
human genes
22,523 entities
synonyms
from many databases
orthographic variation
prefixes and suffixes
automatically generated
2,726,495 names
tagdir program
flexible matching
upper- and lower-case
spaces and hyphens
tab-delimited output
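The idea behind flexible matching can be sketched with standard tools. This is not how tagdir is implemented; it is a minimal illustration, assuming that ignoring case, spaces, and hyphens is equivalent to comparing normalized strings. The `normalize` function is a hypothetical helper.

```shell
# Hypothetical helper: lower-case the input, then drop spaces and hyphens.
normalize() {
    tr '[:upper:]' '[:lower:]' | sed 's/[- ]//g'
}

# Three spellings of the same name collapse to one normalized form.
printf 'Protein kinase B\n' | normalize
printf 'protein-kinase B\n' | normalize
printf 'PROTEIN KINASE-B\n' | normalize
```

After normalization, all three variants become the same string, so a single dictionary entry can match them all.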
what you will do
named entity recognition
find unfortunate names
create “black list”
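One possible way to approach the black list with the commands from the unix primer is sketched below. The column layout of the tagger output (abstract file, matched name, gene identifier) is an assumption, and the data, including the name "and" and the identifier GENE_X, is invented.

```shell
workdir=$(mktemp -d)

# Invented tagger output. Assumed columns: abstract file, matched name, gene.
{
    printf 'a1.txt\tPKB\tAKT1\n'
    printf 'a1.txt\tand\tGENE_X\n'
    printf 'a2.txt\tand\tGENE_X\n'
    printf 'a3.txt\tand\tGENE_X\n'
    printf 'a3.txt\tAkt\tAKT1\n'
} > "$workdir/tagged.tsv"

# Step 1: count how often each name matched; frequent names are worth inspecting.
cut -f 2 "$workdir/tagged.tsv" | sort | uniq -c | sort -nr | head -100 > "$workdir/name_counts"
cat "$workdir/name_counts"

# Step 2: after manual inspection, write unfortunate names to a black list,
# one name per line.
printf 'and\n' > "$workdir/blacklist.txt"

# Step 3: drop every match whose name (column 2) is on the black list.
awk -F '\t' 'NR == FNR { bad[$1]; next } !($2 in bad)' \
    "$workdir/blacklist.txt" "$workdir/tagged.tsv" > "$workdir/filtered.tsv"
cat "$workdir/filtered.tsv"
```

Deciding which frequent names are unfortunate remains a manual step; the counts only show where to look first.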
information extraction
co-mentioning
within abstracts
rank genes for each disease
find shared genes
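The ranking and comparison steps could look roughly like this. The sketch assumes that each disease has its own tab-delimited match file with the abstract in column 1 and the gene in column 2, and that a gene counts as co-mentioned with a disease once per abstract in that disease's document set. File names, the `rank_genes` helper, and all gene identifiers are invented.

```shell
export LC_ALL=C   # make sort and comm agree on ordering
workdir=$(mktemp -d)

# Invented matches for the two document sets. Assumed columns: abstract, gene.
{
    printf 'p1.txt\tGENE_A\n'
    printf 'p1.txt\tGENE_A\n'
    printf 'p2.txt\tGENE_A\n'
    printf 'p2.txt\tGENE_B\n'
    printf 'p3.txt\tGENE_C\n'
} > "$workdir/prostate.tsv"
{
    printf 's1.txt\tGENE_A\n'
    printf 's2.txt\tGENE_D\n'
    printf 's3.txt\tGENE_D\n'
    printf 's3.txt\tGENE_C\n'
} > "$workdir/schizophrenia.tsv"

# Hypothetical helper: count each gene once per abstract,
# then rank genes by the number of abstracts mentioning them.
rank_genes() {
    cut -f 1,2 "$1" | sort -u | cut -f 2 | sort | uniq -c | sort -nr
}

rank_genes "$workdir/prostate.tsv" > "$workdir/prostate.ranked"
rank_genes "$workdir/schizophrenia.tsv" > "$workdir/schizophrenia.ranked"

# Shared genes: names present in both rankings (comm needs sorted input).
awk '{ print $2 }' "$workdir/prostate.ranked" | sort > "$workdir/prostate.genes"
awk '{ print $2 }' "$workdir/schizophrenia.ranked" | sort > "$workdir/schizophrenia.genes"
comm -12 "$workdir/prostate.genes" "$workdir/schizophrenia.genes" > "$workdir/shared.genes"
cat "$workdir/shared.genes"
```

The `sort -u` step keeps repeated mentions within one abstract from inflating a gene's count. In practice, the rankings could be truncated with `head` before comparing, so that only top-ranked genes are considered shared.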
a helping hand
“black list”
100+ matches
10+ matches
wrap up
Protein kinase B
PKB
Akt
AKT1
same protein
synonyms matter
“black list” is crucial
text mining is useful
not black magic
Thanks for your attention!