Text-mining practical
Lars Juhl Jensen
unix primer
the command line
some useful commands
cat
less
head -10
tail -10
grep 'needle'
cut -f 2
sort
sort -nr
uniq -c
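As a rough illustration of the commands listed above, the sketch below runs several of them on two tiny made-up files. The file names and contents (`sample.tsv`, `genes.txt`) are assumptions for this example and are not the practical's data.

```shell
# Hypothetical sample data; the practical's real files are not shown here.
workdir=$(mktemp -d)
printf 'doc1\tAKT1\ndoc2\tTP53\ndoc3\tAKT1\n' > "$workdir/sample.tsv"
printf 'AKT1\nAKT1\nTP53\n' > "$workdir/genes.txt"   # already sorted

cat "$workdir/sample.tsv"                        # print the whole file
first_line=$(head -1 "$workdir/sample.tsv")      # first line only
last_line=$(tail -1 "$workdir/sample.tsv")       # last line only
matches=$(grep 'AKT1' "$workdir/sample.tsv")     # lines containing AKT1
column2=$(cut -f 2 "$workdir/sample.tsv")        # second tab-separated column

# uniq -c counts runs of identical adjacent lines, so its input must be sorted.
uniq -c "$workdir/genes.txt"
```

`less` is left out because it is interactive: it pages through a file on screen and waits for key presses.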
redirecting output
write to file
command > filename
using pipes
command1 | command2
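A minimal sketch of the difference between the two: `>` sends output to a file, while `|` feeds it to the next command. The file names here (`letters.txt`, `distinct.txt`) are placeholders invented for the example.

```shell
workdir=$(mktemp -d)

# Redirect: the output of printf goes into a file instead of the terminal.
printf 'b\na\nb\n' > "$workdir/letters.txt"

# Pipe: the output of sort becomes the input of uniq -c, with no file in between.
sort "$workdir/letters.txt" | uniq -c

# Both together: pipe two commands, then write the final result to a file.
sort "$workdir/letters.txt" | uniq > "$workdir/distinct.txt"
distinct=$(cat "$workdir/distinct.txt")
echo "$distinct"
```

Note that `>` replaces the contents of an existing file without asking.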
putting it all together
cut -f 4 infile | sort | uniq -c |
sort -nr | head -100 > outfile
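To see what each stage of this pipeline contributes, here is the same command run on a small made-up input, with one comment per stage. The six-line `infile` and its column contents are assumptions for illustration; only the fourth column matters.

```shell
workdir=$(mktemp -d)
# Hypothetical four-column input; columns 1 to 3 are filler.
printf '1\tx\ty\tAKT1\n2\tx\ty\tTP53\n3\tx\ty\tAKT1\n4\tx\ty\tAKT1\n5\tx\ty\tTP53\n6\tx\ty\tAR\n' > "$workdir/infile"

cut -f 4 "$workdir/infile" |       # keep only the fourth column
  sort |                           # bring identical values together
  uniq -c |                        # count each distinct value
  sort -nr |                       # largest counts first
  head -100 > "$workdir/outfile"   # keep at most the top 100 lines

# uniq -c pads its counts with spaces; awk normalizes that for display.
top=$(awk '{print $1, $2}' "$workdir/outfile")
echo "$top"
```

The first `sort` is needed because `uniq -c` only merges adjacent identical lines; the second `sort -nr` orders the counts numerically, highest first.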
the task
disease gene finding
named entity recognition
human genes
gene prioritization
what I have done
information retrieval
two diseases
prostate cancer
schizophrenia
two sets of documents
82,373 abstracts
89,904 abstracts
one file with each set
one line per abstract
dictionary
tab-delimited file
human genes
21,929 entities
synonyms
from many databases
orthographic variation
prefixes and suffixes
automatically generated
2,920,042 names
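With the figures on the slides, the dictionary holds about 133 names per entity on average (2,920,042 / 21,929). The sketch below shows how such counts could be obtained from a tab-delimited dictionary. The column layout (identifier, then name) and the identifiers `G1` and `G2` are assumptions; the slides do not show the actual file format.

```shell
workdir=$(mktemp -d)
# Assumed layout (not confirmed by the slides): identifier <TAB> name.
printf 'G1\tAKT1\nG1\tPKB\nG1\tprotein kinase B\nG2\tTP53\nG2\tp53\n' > "$workdir/dict.tsv"

# Number of distinct entities: unique values in column 1.
entities=$(cut -f 1 "$workdir/dict.tsv" | sort | uniq | wc -l | tr -d ' ')

# Number of names: one per line of the file.
names=$(wc -l < "$workdir/dict.tsv" | tr -d ' ')

echo "$entities entities, $names names"
```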
tagcorpus program
flexible matching
upper- and lower-case
spaces and hyphens
tab-delimited output
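The idea of flexible matching can be approximated with `grep`. This is only a sketch of the concept, not how tagcorpus works internally, and the sample sentences are invented.

```shell
workdir=$(mktemp -d)
printf 'AKT1 is a kinase\nakt-1 was mutated\nAkt 1 signalling\nno gene here\n' > "$workdir/abstracts.txt"

# -i ignores upper/lower case; [ -]? allows an optional space or hyphen
# inside the name; -c prints the number of matching lines.
hits=$(grep -c -i -E 'akt[ -]?1' "$workdir/abstracts.txt")
echo "$hits matching lines"
```

A single pattern like this does not scale to millions of names, which is one reason a dedicated tagger is used instead.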
what you will do
named entity recognition
find unfortunate names
create “black list”
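One way to apply a black list, sketched under assumptions: the matches are taken to be in a two-column file (document, matched name) and the black list to hold one name per line. Both layouts and all entries are invented for illustration and may differ from the practical's real files.

```shell
workdir=$(mktemp -d)
# Assumed layout of the matches: document <TAB> matched name (hypothetical).
printf 'doc1\tAKT1\ndoc1\tT\ndoc2\tIMPACT\ndoc3\tTP53\n' > "$workdir/matches.tsv"
# Names judged too ambiguous to keep, one per line (example entries only).
printf 'T\nIMPACT\n' > "$workdir/blacklist.txt"

# Read the black list first, then print only matches whose name is not on it.
kept=$(awk -F '\t' 'NR == FNR { bad[$1] = 1; next } !($2 in bad)' \
  "$workdir/blacklist.txt" "$workdir/matches.tsv")
echo "$kept"
```

The comparison here is exact and case-sensitive. Candidates for the list can be spotted by counting the most frequent matched names and checking the top of that list by eye.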
information extraction
co-mentioning
within abstracts
rank genes for each disease
find shared genes
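The ranking and comparison steps could look roughly like this. The input layout (abstract, gene), the file names, and the placeholder genes `GENE_A` to `GENE_C` are all assumptions, and ranking by the number of mentioning abstracts is just one simple scoring choice.

```shell
workdir=$(mktemp -d)
# Assumed input: one line per (abstract, gene) co-mention for each disease set.
printf 'a1\tGENE_A\na2\tGENE_A\na3\tGENE_B\na4\tGENE_B\na5\tGENE_B\n' > "$workdir/prostate.tsv"
printf 'b1\tGENE_A\nb2\tGENE_C\nb3\tGENE_C\n' > "$workdir/schizophrenia.tsv"

# Rank genes by the number of distinct abstracts that mention them.
# sort -u first, so a gene repeated within one abstract counts once.
rank() { sort -u "$1" | cut -f 2 | sort | uniq -c | sort -nr | awk '{print $2}'; }
rank "$workdir/prostate.tsv" > "$workdir/prostate.rank"
rank "$workdir/schizophrenia.tsv" > "$workdir/schizophrenia.rank"
prostate_top=$(head -1 "$workdir/prostate.rank")

# comm -12 prints the lines common to two sorted files: genes found for both diseases.
sort "$workdir/prostate.rank" > "$workdir/p.sorted"
sort "$workdir/schizophrenia.rank" > "$workdir/s.sorted"
shared=$(comm -12 "$workdir/p.sorted" "$workdir/s.sorted")
echo "top prostate gene: $prostate_top; shared: $shared"
```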
Text-mining practical
wrap up
Protein kinase B
PKB
Akt
AKT1
same protein
synonyms matter
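To make the point concrete, the sketch below maps the four names above onto a single identifier before counting. The dictionary layout and the identifier `ENTITY_1` are placeholders invented for the example.

```shell
workdir=$(mktemp -d)
# Assumed dictionary layout: identifier <TAB> name (the identifier is a placeholder).
printf 'ENTITY_1\tProtein kinase B\nENTITY_1\tPKB\nENTITY_1\tAkt\nENTITY_1\tAKT1\n' > "$workdir/dict.tsv"
# Four mentions as they might appear in text, one per line.
printf 'PKB\nAkt\nAKT1\nProtein kinase B\n' > "$workdir/mentions.txt"

# Load name -> identifier from the dictionary, translate each mention,
# then count mentions per identifier.
ids=$(awk -F '\t' 'NR == FNR { id[$2] = $1; next } $1 in id { print id[$1] }' \
  "$workdir/dict.tsv" "$workdir/mentions.txt" | sort | uniq -c | awk '{print $1, $2}')
echo "$ids"
```

Counted by raw name, each of the four would appear once; counted by identifier, the same protein gets all four mentions.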
“black list” is crucial
text mining is useful
not black magic
Thanks for your attention
