Text-mining practical

Lars Juhl Jensen

The document provides an overview of using text-mining techniques like named entity recognition and information extraction to identify disease genes from biomedical abstracts. It describes retrieving over 170,000 abstracts related to prostate cancer and schizophrenia and using tools like grep, cut, sort, and uniq to analyze the text and identify human genes mentioned in the abstracts as well as their co-occurrence between diseases. The goal is to leverage these methods to help prioritize candidate genes for further study.

Text-mining practical
Lars Juhl Jensen

unix primer

the command line

some useful commands

cat

less

head -10

tail -10

grep 'needle'

cut -f 2

sort

sort -nr

uniq -c
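A minimal sketch of these commands on toy input. The file names demo.tsv and names.txt and their contents are made up for illustration; less is interactive and is left out.

```shell
# Toy tab-delimited file (hypothetical contents): number <TAB> word
printf '1\tAKT1\n2\tTP53\n3\tAKT1\n4\tneedle\n' > demo.tsv
# Toy list in which identical lines are already adjacent
printf 'AKT1\nAKT1\nTP53\n' > names.txt

cat demo.tsv              # print the whole file
head -2 demo.tsv          # first 2 lines
tail -2 demo.tsv          # last 2 lines
grep 'needle' demo.tsv    # only lines containing "needle"
cut -f 2 demo.tsv         # second tab-separated column
sort demo.tsv             # lines in sorted order
sort -nr demo.tsv         # numeric order, largest first
uniq -c names.txt         # count runs of identical adjacent lines
```

uniq only merges adjacent lines, which is why it is normally run on sorted input.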

redirecting output

write to file

command > filename

using pipes

command1 | command2
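A small sketch of both ideas, assuming a made-up file letters.txt:

```shell
# Hypothetical input: three short lines
printf 'b\na\nb\n' > letters.txt

# > sends the output to a file instead of the screen (overwriting the file)
sort letters.txt > sorted.txt
cat sorted.txt

# | sends the output of the first command to the second as its input
sort letters.txt | uniq -c
```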

putting it all together

cut -f 4 infile | sort | uniq -c |
sort -nr | head -100 > outfile
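The same pipeline, step by step, on a toy input. The contents of infile are invented here; only the command itself comes from the slide. Note that this overwrites any existing infile and outfile in the current directory.

```shell
# Hypothetical 4-column tab-delimited file; column 4 holds the value to count
printf 'd1\t0\t4\tAKT1\nd2\t5\t8\tTP53\nd3\t2\t6\tAKT1\n' > infile

cut -f 4 infile |   # keep only column 4
  sort |            # bring identical values together (uniq needs this)
  uniq -c |         # count each distinct value
  sort -nr |        # most frequent first
  head -100 > outfile   # keep the top 100 and write them to a file

cat outfile
```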

the task

disease gene finding

named entity recognition

human genes

gene prioritization

what I have done

information retrieval

two diseases

prostate cancer

schizophrenia

two sets of documents

82,373 abstracts

89,904 abstracts

one file with each set

one line per abstract

dictionary

tab-delimited file

human genes

21,929 entities

synonyms

from many databases

orthographic variation

prefixes and suffixes

automatically generated

2,920,042 names

tagcorpus program

flexible matching

upper- and lower-case

spaces and hyphens

tab-delimited output
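The idea of flexible matching can be approximated with grep. This is only an illustration of ignoring case and optional spaces or hyphens; it is not how tagcorpus is implemented or invoked, and abstracts.txt is a made-up file.

```shell
# Hypothetical text, one "abstract" per line
printf 'AKT1 is activated\nAkt-1 signalling\nakt 1 knockout\nno gene here\n' > abstracts.txt

# -i ignores upper/lower case; [ -]? allows an optional space or hyphen
grep -i -E 'akt[ -]?1' abstracts.txt
```

The first three lines match; the last does not.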

what you will do

named entity recognition

find unfortunate names

create “black list”
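One possible way to apply a black list, sketched with awk. The two-column layout of matches.tsv (abstract number, matched name) and both file names are assumptions, not the actual tagcorpus output format.

```shell
# Hypothetical tagger output: abstract number <TAB> matched name
printf '1\tAKT1\n1\tCAT\n2\tSDS\n2\tTP53\n' > matches.tsv
# Hypothetical black list: one unfortunate name per line
printf 'CAT\nSDS\n' > blacklist.txt

# First file: remember black-listed names. Second file: print rows
# whose name (column 2) is not among them.
awk -F '\t' 'NR == FNR { bad[$1]; next } !($2 in bad)' blacklist.txt matches.tsv > filtered.tsv
cat filtered.tsv
```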

information extraction

co-mentioning

within abstracts

rank genes for each disease

find shared genes
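A sketch of ranking and finding shared genes with the commands from the primer plus comm. The file names, the two-column layout (abstract number, gene) and the placeholder gene names are all invented for illustration.

```shell
# Hypothetical per-disease matches: abstract number <TAB> gene
printf '1\tgeneA\n1\tgeneA\n2\tgeneA\n2\tgeneB\n3\tgeneC\n' > prostate.tsv
printf '1\tgeneD\n2\tgeneB\n3\tgeneD\n' > schizophrenia.tsv

# Rank genes by number of abstracts mentioning them
# (sort -u first, so a gene counts once per abstract)
sort -u prostate.tsv | cut -f 2 | sort | uniq -c | sort -nr > prostate.rank
sort -u schizophrenia.tsv | cut -f 2 | sort | uniq -c | sort -nr > schizophrenia.rank
cat prostate.rank

# Shared genes: names present in both sets (comm needs sorted input)
cut -f 2 prostate.tsv | sort -u > prostate.genes
cut -f 2 schizophrenia.tsv | sort -u > schizophrenia.genes
comm -12 prostate.genes schizophrenia.genes
```

comm -12 suppresses the lines unique to either file and leaves only those common to both.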

Text-mining practical

wrap up

Protein kinase B

PKB

Akt

AKT1

same protein

synonyms matter
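Why synonyms matter, as a small sketch: mentions under different names only add up if they are mapped to one identifier. The dictionary layout (identifier, name) and the file names are assumptions; matching here is exact, unlike the flexible matching described earlier.

```shell
# Hypothetical dictionary: identifier <TAB> name
printf 'AKT1\tProtein kinase B\nAKT1\tPKB\nAKT1\tAkt\nAKT1\tAKT1\n' > dict.tsv
# Hypothetical names found in text, one per line
printf 'PKB\nAkt\nPKB\n' > mentions.txt

# First file: load name -> identifier. Second file: print the identifier
# for each mention, then count mentions per identifier.
awk -F '\t' 'NR == FNR { id[$2] = $1; next } $1 in id { print id[$1] }' dict.tsv mentions.txt |
  sort | uniq -c > counts.txt
cat counts.txt
```

Counted by name, PKB and Akt would appear as two separate entries; counted by identifier, all three mentions go to AKT1.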

“black list” is crucial

text mining is useful

not black magic

Thanks for your attention
