Csvconf

The Content Mine
Peter Murray-Rust[*]
University of Cambridge, Open Knowledge,
& Shuttleworth Fellow
OKFest, Berlin, 2014-07-15, DE
[*] and Michelle Brook, Jenny Molloy, Ross Mounce,
Richard Smith-Unna, Mark MacGillivray, Emanuel
Toliv

Liberating facts for humanity*
• Public science 500,000,000,000 USD per year
• 85% of medical research is wasted (bad design,
lost data, non-communication)
• ContentMine will liberate 100,000,000 facts per
year from scientific literature
• Crawl, Scrape, Extract, Republish
• Open Data (CC0), Open Standards, Open Source
• COLLABORATIVE, any data-rich discipline
• [*] Closed data means people die

We can’t turn a hamburger into a cow
But we can now turn PDFs into Science

[Figure: plot from a PDF annotated with UNITS, TICKS, QUANTITY, SCALE, TITLES and DATA!! (2000+ points)]

Dumb PDF → CSV → Semantic Spectrum
Automatic extraction: Smoothing (Gaussian Filter), 2nd Derivative
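The two processing steps named on this slide, Gaussian smoothing and the 2nd derivative, can be sketched on extracted CSV points as follows. This is a minimal illustration, not AMI's own code: the function names, the `x,y` column layout and the default `sigma` are assumptions, and evenly spaced x values are assumed for the derivative.

```python
import csv
import io
import math


def gaussian_kernel(sigma, radius):
    """Return normalised Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=2.0):
    """Gaussian-filter a list of y values, clamping indices at both ends."""
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(range(-radius, radius + 1), kernel):
            j = min(max(i + k, 0), n - 1)  # repeat the edge value outside the data
            acc += w * values[j]
        out.append(acc)
    return out


def second_derivative(values, step=1.0):
    """Central-difference 2nd derivative for evenly spaced points (interior only)."""
    return [
        (values[i - 1] - 2 * values[i] + values[i + 1]) / step ** 2
        for i in range(1, len(values) - 1)
    ]


def read_xy(csv_text):
    """Parse 'x,y' rows (after one header line) into two lists of floats."""
    rows = list(csv.reader(io.StringIO(csv_text)))[1:]
    return [float(r[0]) for r in rows], [float(r[1]) for r in rows]
```

Smoothing before differentiating matters because finite differences amplify the pixel-level noise that comes with points recovered from a drawing.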

Chemical Computer Vision
1 sec to turn this into semantic science

PROPERTIES (Name-Value-Units-Error)
[Figure: example text annotated with Name (N), Value (V), Units (U) and Error (E) labels]
Note: CML supports value ranges and errors
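A Name-Value-Units-Error property, with an optional value range, could be held in a record like the one below. This is only a sketch: the class and field names are hypothetical, are not CML element or attribute names, and the sample values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Property:
    """One extracted property; field names are illustrative, not CML's."""
    name: str
    units: str
    value: Optional[float] = None    # single measured value, if given
    error: Optional[float] = None    # uncertainty on the value, if given
    minimum: Optional[float] = None  # lower bound when a range is reported
    maximum: Optional[float] = None  # upper bound when a range is reported

    def to_row(self):
        """Flatten to a list of strings suitable for one CSV row."""
        cells = [self.value, self.error, self.minimum, self.maximum]
        return [self.name, self.units] + ["" if c is None else str(c) for c in cells]


# Made-up example values: a point value with an error, and a range.
point = Property("melting point", "K", value=451.0, error=0.5)
span = Property("boiling range", "K", minimum=350.0, maximum=355.0)
```

Keeping the range bounds separate from the single value avoids squeezing "350-355" into one numeric column.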

“nuggets” in a scientific paper
quantity
units
Value ranges
Humans aren’t designed to mine this … 
chemical
project places
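Quantity, units and value-range "nuggets" of this kind are the sort of thing a machine can pick out with a pattern. The sketch below uses a plain regular expression. The unit list, the pattern and the sample sentence are assumptions made for illustration, and a real miner would need a far larger unit vocabulary.

```python
import re

# A small, assumed unit vocabulary; extend as needed.
UNITS = r"(?:mg|g|kg|mL|L|mmol|mol|°C|K|h|min)"
NUMBER = r"\d+(?:\.\d+)?"

# A number, optionally followed by "-number" (a range), then a unit.
PATTERN = re.compile(
    rf"(?<![\w.])(?P<low>{NUMBER})(?:\s*[-–]\s*(?P<high>{NUMBER}))?\s*(?P<units>{UNITS})\b"
)


def find_quantities(text):
    """Return (low, high, units) tuples; high is None for single values."""
    return [
        (m.group("low"), m.group("high"), m.group("units"))
        for m in PATTERN.finditer(text)
    ]


sample = "The mixture was stirred at 60-65 °C for 2 h, then 1.5 g of product was isolated."
quantities = find_quantities(sample)
```

For the sample sentence, `quantities` holds one range (60 to 65 °C) and two single values (2 h and 1.5 g).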

http://wwmm.ch.cam.ac.uk/chemicaltagger
• Typical chemical synthesis

Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents.
There are > 3,000,000 reactions/year. Added value > 1B EUR.

Evolution of ultraviolet vision in the largest avian radiation - the passerines
Anders Ödeen 1*, Olle Håstad 2,3 and Per Alström 4
PDF → HTML (AMI)
Styles, superscripts and diåcritics preserved!

PDF 
Turdus iliacus
Taeniopygia guttata
Serinus canaria
Lanius excubitor
Melopsittacus undulatus
Pavo cristatus
Sturnus vulgaris
Dolichonyx oryzivorus
Ficedula hypoleuca
Vaccinium myrtillus
Falco tinnunculus
Turdus
Pomatostomus
Leothrix
Amytornis
Acanthisitta
Orthonyx x 2
Malurus
Cnemophilus x 4
Philesturnus x 2
Motacilla x 2
Toxorhamphus x 2
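Names like those listed above can be found in running text by matching the "Genus species" shape and checking the genus against a known list. The sketch below is a naive illustration and not AMI's method: the lookup set contains only a few genera from this slide, and the function name and sample sentence are hypothetical.

```python
import re

# A few genera taken from the list above; a real lookup would be far larger.
KNOWN_GENERA = {
    "Turdus", "Taeniopygia", "Serinus", "Lanius", "Melopsittacus",
    "Pavo", "Sturnus", "Dolichonyx", "Ficedula", "Falco",
}

# Capitalised word followed by a lower-case word of three or more letters.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


def find_binomials(text):
    """Return (genus, species) pairs whose genus is in KNOWN_GENERA."""
    return [
        (genus, species)
        for genus, species in BINOMIAL.findall(text)
        if genus in KNOWN_GENERA
    ]


sample = "Sequences from Turdus iliacus and Pavo cristatus were compared."
names = find_binomials(sample)
```

The genus check is what rejects ordinary capitalised words at the start of a sentence, which the shape pattern alone would accept.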

Linked Open Data – the world’s knowledge
very little physical science 
http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png
[Figure: LOD cloud diagram with regions labelled DBPedia, BIO, Comp, Lib, PDB, Ontologies, GOV, GOV.uk, Music, Art, Literature, Social, Knowledge bases, RDF triples]
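An extracted fact becomes linkable once it is written as an RDF triple. The sketch below serialises one triple as an N-Triples line using only the standard library. The `example.org` URIs and the function name are placeholders invented for illustration, not identifiers used by ContentMine.

```python
def ntriple(subject, predicate, obj, literal=False):
    """Format one RDF triple as an N-Triples line.

    subject and predicate are URIs; obj is a URI unless literal=True,
    in which case it is written as a quoted string.
    """
    if literal:
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        return f'<{subject}> <{predicate}> "{escaped}" .'
    return f"<{subject}> <{predicate}> <{obj}> ."


# Placeholder URIs (assumed): genus Acanthisitta belongs to family Acanthisittidae.
GENUS = "http://example.org/genus/Acanthisitta"
link = ntriple(GENUS, "http://example.org/vocab/family",
               "http://example.org/family/Acanthisittidae")
label = ntriple(GENUS, "http://www.w3.org/2000/01/rdf-schema#label",
                "Acanthisitta", literal=True)
```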

[Figure: phylogenetic tree processed by AMI, PDF to NexML and HTML]
Family: Acanthisittidae, Acanthizidae, Acrocephalidae, Callaeidae, Campephagidae, Cnemophilidae, Corvidae
Genus: Acanthisitta, Acrocephalus, Ailuroedus, Ailuroedus, Amytornis, Camptostoma
Posterior probability: 0.84, 0.91, 0.93, 0.95
AMI can MEASURE branch lengths! 23.12, 34.54, 37.21, 38.55
