TheContentMine: Mining for Everyone
Peter Murray-Rust
BL_Labs, London, 2014-11-27
The Right to Read is the Right to Mine
http://contentmine.org
ContentMine
• 1-2 year Shuttleworth Funding from 2014-03
• Free to everyone, Open Source, updated daily
• Structured Text, and Image/Diagram Mining
• Workshops for training and training trainers
• Bottom-up community development
– Bioscience (EuropePMC, BBSRC)
– Disease (Ebola)
– Astrophysics (Stray Toaster)
– Chemistry (TSB, EBI, PennState - Citeseer)
• We fight for Justice and Freedom
ContentMine People
• Jenny Molloy
• Ross Mounce
• Peter Murray-Rust + volunteers (Bioscience, disease)
• Richard Smith-Unna + 20 quickscrape volunteers
• Steph Unna
• Cottage Labs (Mark MacGillivray, Emanuil Tolev, Richard Jones)
• Prof Charles Oppenheim
• Karien Bezuidenhout (Shuttleworth)
• Advisory Board RSN
ContentMine Workshops
(1-hour -> full day or more)
2014-May->Nov
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• SciDataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Washington DC, US
Upcoming
• JISC
• LIBER
• BL
• Wellcome Trust
• WHO
Ebola Collaborators (Atlanta)
Roxanne Further Moore, Jessie
Gunter, April Clyburne-Sherin
Regular Expressions
(Easier than Crosswords or Sudoku)
Ebola                              Ebola
Mali (not Malicious)               Mali\W (end of word)
Bat or bat                         [Bb]at (alternatives)
bat or bats                        bats? (optional letter)
Bat or Bats or bat or bats         [Bb]ats?
Sudden onset                       [Ss]udden\s+onset (space/s)
Panthera leo or Gorilla gorilla    [A-Z][a-z]+\s+[a-z]+ (ranges of letters)
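The patterns above can be tried out directly. A minimal sketch in Python, assuming the backslashes (`\W`, `\s`) that the slide text implies; the sample sentences are made up for illustration and are not from the talk.

```python
import re

# (pattern, text, should_match) triples. The patterns are the ones on the slide;
# the sample texts are invented for illustration.
CASES = [
    (r"Ebola", "reported Ebola cases", True),
    (r"Mali\W", "cases in Mali, Guinea", True),
    (r"Mali\W", "a Malicious rumour", False),
    (r"[Bb]at", "Bat", True),
    (r"bats?", "bats", True),
    (r"[Bb]ats?", "Bats", True),
    (r"[Ss]udden\s+onset", "sudden  onset of fever", True),
    (r"[A-Z][a-z]+\s+[a-z]+", "Panthera leo", True),
    (r"[A-Z][a-z]+\s+[a-z]+", "Gorilla gorilla", True),
]


def check(cases):
    """Return one boolean per case: does re.search agree with the expectation?"""
    return [bool(re.search(p, t)) == expected for p, t, expected in cases]


if __name__ == "__main__":
    for (pattern, text, expected), ok in zip(CASES, check(CASES)):
        print(f"{pattern!r:28} {text!r:28} expected={expected} ok={ok}")
```

Note that `Mali\W` needs a character after the word, so it does not match "Mali" at the very end of a text; `\b` (word boundary) is a common alternative in that situation.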
Ebola regex
<compoundRegex title="ebola">
  <regex weight="1.0" fields="ebola" case="">(Ebola)</regex>
  <regex weight="1.0" fields="marburg">(Marburg)</regex>
  <regex weight="1.0" fields="hemorrhagic_fever">([Hh]a?emorrhagic\s+fever)</regex>
  <regex weight="0.8" fields="sudden_onset">([Ss]udden\s+onset)</regex>
  <regex weight="0.6" fields="vomiting_diarrhoea">([Vv]omiting\s+diarrho?ea)</regex>
  <regex weight="0.5" fields="guinea">(Guinea)</regex>
  <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
  <regex weight="0.5" fields="liberia">(Liberia)</regex>
  <regex weight="0.5" fields="mali">(Mali)\W</regex>
  <regex weight="0.6" fields="contact_tracing">([Cc]ontact\s+tracing)</regex>
  <regex weight="0.5" fields="bat">\W([Bb]ats?\W)</regex>
  <regex weight="0.5" fields="bushmeat">([Bb]ushmeat)</regex>
  <regex weight="0.5" fields="drc">(Democratic Republic\s*(\s*of)?(\s*the)?\s*Congo)(DRC)</regex>
  <regex weight="0.6" fields="safe_burial">([Ss]afe\s+burial\s+practice?s)</regex>
  <regex weight="1.0" fields="etu">([Ee]bola\s+treatment\s+units?)(ETU)</regex>
</compoundRegex>
15 mins to create, 15 mins to install and test
Or run online at CottageLabs
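To show how a file like this can drive a search, here is a rough sketch in Python. It is not the AMI implementation: `load_patterns` and `scan` are hypothetical helper names, only a few of the rules are included, and the weight is simply carried along with each hit rather than used for scoring.

```python
import re
import xml.etree.ElementTree as ET

# A cut-down copy of the compound regex (illustrative subset of the rules).
COMPOUND_REGEX = r"""
<compoundRegex title="ebola">
  <regex weight="1.0" fields="ebola">(Ebola)</regex>
  <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
  <regex weight="0.5" fields="mali">(Mali)\W</regex>
  <regex weight="0.5" fields="bat">\W([Bb]ats?\W)</regex>
</compoundRegex>
"""


def load_patterns(xml_text):
    """Read (field, weight, compiled pattern) triples from a compoundRegex document."""
    root = ET.fromstring(xml_text.strip())
    return [
        (el.get("fields"), float(el.get("weight")), re.compile(el.text))
        for el in root.findall("regex")
    ]


def scan(lines, patterns):
    """Apply every pattern to every line; report (line number, field, weight, captured text)."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for field, weight, pattern in patterns:
            for match in pattern.finditer(line):
                hits.append((number, field, weight, match.group(1)))
    return hits


if __name__ == "__main__":
    text = [
        "There have been 14 413 reported Ebola cases in eight countries since the outbreak",
        "Case incidence continues to increase in Sierra Leone, and transmission also remains",
    ]
    for hit in scan(text, load_patterns(COMPOUND_REGEX)):
        print(hit)
```

Keeping the rules in XML rather than in code means a domain expert can add or reweight a term without touching the program.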
Results of Regex on Ebola
<resultsList xmlns="http://www.xml-cml.org/ami">
  <results xmlns="">
    <source xmlns="http://www.xml-cml.org/ami"
        name="/Users/pm286/workspace/ami-core/./docs/ebola/text/14Nov.txt" />
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7"
          lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak ">
        <regex xmlns="" weight="1.0" fields="[ebola]">
          <pattern>(Ebola)</pattern>
        </regex>
        <hits xmlns="">
          <hit ebola="Ebola" />
        </hits>
      </regex>
    </result>
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9"
          lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains ">
        <regex xmlns="" weight="0.5" fields="[sierra_leone]">
          <pattern>(Sierra\s+Leone)</pattern>
        </regex>
        <hits xmlns="">
          <hit sierra_leone="Sierra Leone" />
        </hits>
      </regex>
    </result>
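Results in this shape are easy to post-process. A small sketch in Python, assuming the excerpt is completed with closing `</results>` and `</resultsList>` tags (the slide shows only the start of the file); `collect_hits` is a hypothetical helper name, not part of AMI.

```python
import xml.etree.ElementTree as ET

AMI = "{http://www.xml-cml.org/ami}"  # namespace used by the outer elements

# The excerpt from the slide, with closing tags added as an assumption.
SAMPLE = """
<resultsList xmlns="http://www.xml-cml.org/ami">
  <results xmlns="">
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7"
          lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak ">
        <regex xmlns="" weight="1.0" fields="[ebola]"><pattern>(Ebola)</pattern></regex>
        <hits xmlns=""><hit ebola="Ebola" /></hits>
      </regex>
    </result>
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9"
          lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains ">
        <regex xmlns="" weight="0.5" fields="[sierra_leone]"><pattern>(Sierra\\s+Leone)</pattern></regex>
        <hits xmlns=""><hit sierra_leone="Sierra Leone" /></hits>
      </regex>
    </result>
  </results>
</resultsList>
"""


def collect_hits(xml_text):
    """Return (line number, field, matched text) for every <hit> in a results document."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    # Only the outer, namespaced <regex> elements carry lineNumber.
    for outer in root.iter(AMI + "regex"):
        line_number = int(outer.get("lineNumber"))
        for hit in outer.iter("hit"):
            for field, value in hit.attrib.items():
                rows.append((line_number, field, value))
    return rows


if __name__ == "__main__":
    for row in collect_hits(SAMPLE):
        print(row)
```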
Demo of Content Mining
ChemicalTagger (Lezan Hawizy): a shallow, domain-specific, semantic parser for un/natural language.
Bacterial WP_phylogenetic tree
Our machines have read and interpreted 4300 trees in an hour with > 95% accuracy
Trees from http://ijs.sgmjournals.org/, used under new UK legislation (Hargreaves)
WP: Clostridium_butyricum
Genbank ID
American Type
Culture Collection
RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
[Diagram: scientific literature, queues, repos, science plugins, science volunteers]
Collaboration with
Open Access Button
