The Content Mine
Peter Murray-Rust[*]
University of Cambridge, Open Knowledge,
& Shuttleworth Fellow
OKFest, Berlin, 2014-07-15, DE
[*] and Michelle Brook, Jenny Molloy, Ross Mounce,
Richard Smith-Unna, Mark MacGillivray, Emanuel
Toliv
Liberating facts for humanity*
• Public science 500,000,000,000 USD per year
• 85% of medical research is wasted (bad design,
lost data, non-communication)
• ContentMine will liberate 100,000,000 facts per
year from scientific literature
• Crawl, Scrape, Extract, Republish
• Open Data (CC0), Open Standards, Open Source
• COLLABORATIVE, any data-rich discipline
• [*] Closed data means people die
But we can now turn PDFs into Science
We can’t turn a hamburger into a cow
[Figure: a plot in a "dumb PDF" with its parts labelled UNITS, TICKS, QUANTITY, SCALE, TITLES and DATA!! (2000+ points)]
Dumb PDF, CSV, Semantic Spectrum
2nd Derivative, Smoothing, Gaussian Filter
Automatic extraction
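A minimal sketch of how Gaussian smoothing followed by a 2nd derivative can locate peaks once the data points have been extracted to CSV. This is an illustration only, not the ContentMine code: the function names (gaussian_kernel, smooth, second_derivative, find_peaks) and the default sigma are assumptions.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Return a normalised 1-D Gaussian kernel as a list of weights."""
    if radius is None:
        radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(ys, sigma=2.0):
    """Convolve ys with a Gaussian kernel, clamping indices at the edges."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(ys)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * ys[j]
        out.append(acc)
    return out


def second_derivative(ys):
    """Central-difference 2nd derivative (unit spacing); endpoints are 0."""
    d2 = [0.0] * len(ys)
    for i in range(1, len(ys) - 1):
        d2[i] = ys[i - 1] - 2.0 * ys[i] + ys[i + 1]
    return d2


def find_peaks(ys, sigma=2.0):
    """Indices where the smoothed 2nd derivative has a negative local minimum."""
    d2 = second_derivative(smooth(ys, sigma))
    return [i for i in range(1, len(d2) - 1)
            if d2[i] < 0 and d2[i] < d2[i - 1] and d2[i] <= d2[i + 1]]
```

On a synthetic spectrum built from two Gaussian bands centred at indices 30 and 70, find_peaks should report those two indices. Real spectra are noisier, so sigma and a minimum-depth threshold would need tuning.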
Chemical Computer Vision
1 sec to turn this into semantic science
PROPERTIES (Name-Value-Units-Error)
[Figure: text annotated with N (Name), V (Value), U (Units) and E (Error) labels]
Note CML supports value ranges and errors
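A rough sketch of Name-Value-Units-Error extraction with a regular expression, including value ranges and errors. It is an assumption-laden illustration, not how CML or the ContentMine tools do it: the Property class, the small list of property names and the list of units are all hypothetical and far from complete.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

NUMBER = r"[-+]?\d+(?:\.\d+)?"
NAMES = r"melting point|boiling point|yield|mp|bp"   # hypothetical vocabulary
UNITS = r"°C|K|mg|g|kg|mL|L|mmol|mol|h|min|%"        # hypothetical unit list

PATTERN = re.compile(
    rf"(?:\b(?P<name>{NAMES})\b\s*(?:[:=]|of|was|is)?\s*)?"  # optional name
    rf"(?P<value>{NUMBER})"
    rf"(?:\s*(?:-|–|to)\s*(?P<upper>{NUMBER}))?"             # optional range
    rf"(?:\s*(?:±|\+/-)\s*(?P<error>{NUMBER}))?"             # optional error
    rf"\s*(?P<units>{UNITS})(?![A-Za-z])"                    # units, not a word prefix
)


@dataclass
class Property:
    name: Optional[str]
    value: float
    upper: Optional[float]   # upper bound when a range such as 20-25 is given
    error: Optional[float]   # the number after a plus-or-minus sign
    units: str


def extract_properties(text: str) -> List[Property]:
    """Return every value-with-units found in text, in reading order."""
    props = []
    for m in PATTERN.finditer(text):
        props.append(Property(
            name=m.group("name"),
            value=float(m.group("value")),
            upper=float(m.group("upper")) if m.group("upper") else None,
            error=float(m.group("error")) if m.group("error") else None,
            units=m.group("units"),
        ))
    return props
```

Called on a sentence such as "melting point of 123.4 ± 0.5 °C", the sketch yields one Property with a name, a value, an error and units; a quantity with no recognised name keeps name as None.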
“Nuggets” in a scientific paper: quantity, units, value ranges, chemical, project places
Humans aren’t designed to mine this …
Parsing chemical sentences
http://wwmm.ch.cam.ac.uk/chemicaltagger
Typical chemical synthesis
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have processed 500,000 patents. There are more than 3,000,000 reactions per year. Added value: more than 1 billion EUR.
Evolution of ultraviolet
vision in the largest avian
radiation - the passerines
Anders Ödeen 1* , Olle
Håstad 2,3 and Per Alström 4
PDF 
HTML 
Styles, superscripts and diåcritics preserved!
AMI
PDF 
Turdus iliacus
Taeniopygia guttata
Serinus canaria
Lanius excubitor
Melopsittacus undulatus
Pavo cristatus
Sturnus vulgaris
Dolichonyx oryzivorus
Ficedula hypoleuca
Vaccinium myrtillus
Falco tinnunculus
Turdus
Pomatostomus
Leothrix
Amytornis
Acanthisitta
Orthonyx x 2
Malurus
Cnemophilus x 4
Philesturnus x 2
Motacilla x 2
Toxorhampus x 2
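A minimal sketch of how species names like those listed above might be picked out of running text. It assumes a list of known genera is available (here a hypothetical known_genera set) and is not a description of how AMI works; abbreviated forms such as "T. iliacus" are expanded using the most recent genus seen with that initial.

```python
import re

# A capitalised word (or a single-letter abbreviation) followed by a
# lower-case epithet of at least three letters.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+|[A-Z]\.)\s+([a-z]{3,})\b")


def find_species(text, known_genera):
    """Return distinct 'Genus epithet' names in order of first appearance."""
    found = []
    abbreviations = {}  # initial letter -> most recent full genus
    for match in BINOMIAL.finditer(text):
        genus, epithet = match.group(1), match.group(2)
        if genus.endswith("."):
            genus = abbreviations.get(genus[0])
            if genus is None:
                continue  # abbreviation with no earlier full mention
        elif genus not in known_genera:
            continue      # e.g. an ordinary sentence-initial word
        abbreviations[genus[0]] = genus
        name = f"{genus} {epithet}"
        if name not in found:
            found.append(name)
    return found
```

The genus lookup is what keeps ordinary capitalised words from being reported as species; without it, any "Word word" pair at the start of a sentence would match.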
Linked Open Data – the world’s knowledge
very little physical science 
http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png
[Figure: LOD cloud diagram, with regions labelled DBPedia, BIO, Comp, Lib, PDB, Ontologies, GOV, GOV.uk, Music, Art, Literature, Social, Knowledge bases, RDF triples]
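A small sketch of what publishing an extracted fact as RDF triples could look like, using the N-Triples line format. The base IRI http://example.org/ and the subject and predicate names are placeholders invented for this illustration, and every object is written as a plain string literal for simplicity.

```python
from urllib.parse import quote


def iri(value):
    """Wrap an absolute IRI in angle brackets."""
    return f"<{value}>"


def literal(text):
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n"))
    return f'"{escaped}"'


def to_ntriples(facts, base="http://example.org/"):
    """Serialise (subject, predicate, object) string tuples, one triple per line."""
    lines = []
    for subject, predicate, obj in facts:
        lines.append(
            f"{iri(base + quote(subject))} {iri(base + quote(predicate))} {literal(obj)} ."
        )
    return "\n".join(lines)
```

For example, to_ntriples([("Turdus iliacus", "genus", "Turdus")]) gives a single line with a percent-encoded subject IRI, a predicate IRI and a quoted literal.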
Acanthisittidae, Acanthizidae, Acrocephalidae, Callaeidae, Campephagidae, Cnemophilidae, Corvidae
0.84, 0.91, 0.93, 0.95
Acanthisitta, Acrocephalus, Ailuroedus, Ailuroedus, Amytornis, Camptostoma
AMI
23.12, 34.54, 37.21, 38.55
Posterior probability
AMI can MEASURE branch lengths!
NexML
Genus Family
HTML
We can do any data…
… pixel analysis …
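One way pixel analysis could measure branch lengths in a tree diagram, sketched under strong assumptions: the image has already been reduced to a grid of 0/1 pixels, branches are drawn as horizontal lines, and a scale factor (units_per_pixel, hypothetical) is known. This is not AMI's algorithm, only an illustration of the idea.

```python
def horizontal_runs(grid, min_length=3):
    """Find horizontal runs of set pixels.

    grid is a list of rows, each a list of 0/1 values.
    Returns (row, start_column, length_in_pixels) tuples.
    """
    runs = []
    for r, row in enumerate(grid):
        start = None
        # A trailing 0 acts as a sentinel so a run touching the edge is closed.
        for c, pixel in enumerate(list(row) + [0]):
            if pixel and start is None:
                start = c
            elif not pixel and start is not None:
                if c - start >= min_length:
                    runs.append((r, start, c - start))
                start = None
    return runs


def branch_lengths(runs, units_per_pixel):
    """Convert pixel lengths to branch lengths with a known scale factor."""
    return [length * units_per_pixel for _, _, length in runs]
```

Vertical connector lines show up as runs of length 1 in each row, so the min_length filter drops them and keeps only the horizontal branches.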

Csvconf