Royal Society 2018,
London, UK, 2018-06-12
Extracting Data from Early Scientific Diagrams
Peter Murray-Rust [1,2]
[1] University of Cambridge
[2] TheContentMine
ContentMine extracts data from modern diagrams on a high-throughput scale.
Do the same tools work for 19th-century scientific diagrams?
Examples: text, data on maps, chemical formulae, plots, phylogenetics.
Images from ContentMine (CC BY), Wikimedia (CC BY-SA) and ProcRoySoc (public domain)
pm286@cam.ac.uk
peter@contentmine.org
Modern Diagram Mining
4500 separate images
Phylogenetic tree supertree
"A machine-compiled microbial supertree from figure-mining thousands of papers",
Ross Mounce, Peter Murray-Rust, Matthew A Wills, 2017
https://riojournal.com/article/13589/
Original scanned bitmap
Extracted by Tesseract
Errors all due to untrained punctuation
Text Mining bitmap
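One way to check a claim such as "errors all due to punctuation" is to diff the OCR output against a hand-typed transcription and inspect the characters that differ. The sketch below uses Python's standard `difflib`; the function names and the sample strings are hypothetical illustrations, not text from the slide or part of the ContentMine tools.

```python
import difflib
import string

# Characters we treat as "punctuation-like" when classifying OCR errors.
PUNCT = set(string.punctuation) | set(string.whitespace)


def ocr_differences(truth, ocr):
    """Return (tag, truth_segment, ocr_segment) for every non-matching span."""
    matcher = difflib.SequenceMatcher(None, truth, ocr, autojunk=False)
    return [
        (tag, truth[i1:i2], ocr[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def only_punctuation_errors(truth, ocr):
    """True if every differing character on either side is punctuation-like."""
    return all(
        set(t_seg + o_seg) <= PUNCT
        for _tag, t_seg, o_seg in ocr_differences(truth, ocr)
    )


if __name__ == "__main__":
    # Hypothetical transcription and OCR output, for illustration only.
    truth = "Vol. XII, p. 45; 1854"
    ocr = "Vol XII, p 45: 1854"
    for diff in ocr_differences(truth, ocr):
        print(diff)
    print(only_punctuation_errors(truth, ocr))
```

If the check returns False, the listed differences show which alphanumeric characters were misread, which would point to a problem beyond punctuation training.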
Plot Mining
Original: V.D. vs Temperature
Automatic extraction: grid, curve and points
About 50% accurate
Probably hand-drawn plot
Binarization, segmentation, OCR
Extraction of chemistry: lines correct, some atom corruption
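The binarization step can be illustrated with a global threshold. The slide does not say which method was used; the sketch below assumes Otsu's method, a common choice for scanned pages, and works on a small greyscale image held as a list of rows. The function names are illustrative and are not ContentMine APIs.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance."""
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below the candidate threshold
    sum0 = 0    # sum of grey levels at or below the candidate threshold
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break               # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(pixels):
    """Map dark pixels (ink) to 1 and light pixels (paper) to 0."""
    t = otsu_threshold(pixels)
    return [[1 if value <= t else 0 for value in row] for row in pixels]


if __name__ == "__main__":
    # Tiny synthetic "scan": a dark 2x2 blob on light paper.
    scan = [
        [220, 220, 220, 220],
        [220, 20, 30, 220],
        [220, 30, 20, 220],
        [200, 220, 220, 220],
    ]
    for row in binarize(scan):
        print(row)
```

A single global threshold is the simplest option; unevenly lit or stained pages may need a locally adaptive threshold instead.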
John Snow's map of deaths in Broad Street, 1854
Onion-ring pixel analysis
Segmentation and area/object count
[Figure: map segments labelled with counts 3, 4, 15, 3, 4, 5, 21, 1, 1]
https://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak
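"Segmentation and area/object count" can be sketched as connected-component labelling on a binarized image: each connected group of ink pixels is one object, and its pixel count is its area. The example below is a minimal sketch assuming 4-connectivity and a 0/1 mask; it does not reproduce the onion-ring analysis named on the slide, and `component_areas` is an illustrative name.

```python
from collections import deque


def component_areas(mask):
    """Return the pixel area of each 4-connected group of 1-pixels.

    Areas are listed in the order the components are first met when
    scanning the mask row by row.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    areas = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            areas.append(area)
    return areas


if __name__ == "__main__":
    # Synthetic mask with three separate marks.
    mask = [
        [1, 1, 0, 0, 0],
        [1, 0, 0, 1, 0],
        [0, 0, 0, 1, 0],
        [0, 1, 0, 0, 0],
    ]
    areas = component_areas(mask)
    print(len(areas), areas)  # 3 [3, 2, 1]
```

On a map such as Snow's, marks that touch each other would merge into one component, so dividing a component's area by the typical area of a single mark is one possible way to estimate how many marks it contains.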
Darwin’s Phylogenetic Tree
Binarized notebook
Segmentation
Topological tree
https://en.wikipedia.org/wiki/Tree_of_life_(biology)
European Copyright Made Simple
Mine no Content
Link no Content
Upload no Content
Julia Reda’s explanation: https://juliareda.eu/eu-copyright-reform/ and write to your MEP
Image: https://commons.wikimedia.org/wiki/File:See_No_Evil,_Hear_No_Evil,_Speak_No_Evil.jpg, parodied by User:petermr, CC BY-SA
Peter Murray-Rust
All ContentMine Software is Free/Open
• Main site: Contentmine.org
• Software: http://github.com/contentmine (production) and http://github.com/petermr (development forks)
• Discussion and open notebooks: http://discuss.contentmine.org
– http://discuss.contentmine.org/t/d3-text-processing-infrastructure/486/14
– http://discuss.contentmine.org/t/extracting-science-from-early-scientific-documents/613/
– http://discuss.contentmine.org/t/extracting-data-from-early-scientific-maps/614

Extracting science from the archive
