Content Mining of Science in Europe
Peter Murray-Rust,
ContentMine.org, University of Cambridge & Open Forum Europe
OFA, Brussels, BE 2015-10-22
What is mining?
Why is it useful?
How YOU can do it without using publishers’ APIs
Copyright and restrictive practices are still a major problem
The Right to Read is the Right to Mine (Peter Murray-Rust, 2011)
http://contentmine.org
My European Heroes
Young People (ContentMine)
NEELIE KROES
Use Cases of ContentMining
• Epidemiology of obesity (Cambridge U)
• (OKF, OpenTrials) Mapping clinical trials repositories to reports in scientific literature
• Mining chemical reactions from patents
• Creating a bacterial supertree-of-life from 4500 papers
Polly has 20 seconds to read this paper…
…and 10,000 more
ContentMine software can do this in a few minutes
Polly: “there were 10,000 abstracts and due to time pressures, we split this between 6 researchers. It took about 2-3 days of work (working only on this) to get through ~1,600 papers each. So, at a minimum this equates to 12 days of full-time work (and would normally be done over several weeks under normal time pressures).”
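The arithmetic behind Polly's estimate can be checked with a few lines. This is only a sketch of the calculation using the figures quoted above; the variable names are illustrative.

```python
# Figures quoted by Polly
abstracts = 10_000
researchers = 6
days_each_min, days_each_max = 2, 3   # "about 2-3 days of work" per researcher

# Share of abstracts per researcher (quoted as "~1,600")
per_researcher = abstracts / researchers

# Total effort in person-days of full-time work
person_days_min = researchers * days_each_min   # the "12 days" minimum
person_days_max = researchers * days_each_max

print(round(per_researcher), person_days_min, person_days_max)
```

The lower bound reproduces the 12 days in the quote; the upper bound follows from the same figures if each researcher needs 3 days.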
400,000 Clinical Trials
In 10 government registries
Mapping trials => papers
http://www.trialsjournal.com/content/16/1/80
2009 => 2015. What’s happened in the last 6 years?
Search the whole scientific literature for “2009-0100068-41”
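Searching for a trial identifier like the one above is, at its simplest, an exact string match over every paper's text. The following is a minimal sketch, assuming the papers are already available as plain text; the `documents` dictionary and the function name are hypothetical and not part of ContentMine.

```python
import re

def find_identifier(documents, identifier):
    """Return the ids of documents whose text contains the identifier verbatim."""
    # Digit guards stop the identifier matching inside a longer number.
    pattern = re.compile(r"(?<!\d)" + re.escape(identifier) + r"(?!\d)")
    return [doc_id for doc_id, text in documents.items() if pattern.search(text)]

# Hypothetical mini-corpus: document id -> full text
documents = {
    "paperA": "The trial was registered as 2009-0100068-41 before enrolment.",
    "paperB": "No registry number is reported in this paper.",
    "paperC": "Serial 12009-0100068-410 is a longer, unrelated number.",
}

hits = find_identifier(documents, "2009-0100068-41")
print(hits)
```

On real literature the hard part is not the match but having the full text of every paper to run it on, which is the point of the slide.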
ContentMine-ing strategy
• Discover. Crawl the COMPLETE relevant literature => bibliography
• Scrape (download) ALL papers
• Index papers => Facts
• Search/analyze papers => complex science
• Extract, Annotate, Aggregate (“Transformative”)
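The first four steps of the strategy can be sketched as a toy pipeline. This is an illustration only, run over an in-memory corpus: the function names, the `LITERATURE` dictionary and the word-level index are assumptions made for the example and are not the ContentMine tools themselves.

```python
import re
from collections import defaultdict

# Hypothetical "literature": paper id -> full text
LITERATURE = {
    "p1": "Obesity and diet in adults",
    "p2": "Bacterial phylogeny tree",
    "p3": "Diet, obesity and clinical trials",
}

def discover(literature, keyword):
    """Discover: list every relevant paper id (the bibliography)."""
    return sorted(pid for pid, text in literature.items()
                  if keyword.lower() in text.lower())

def scrape(literature, bibliography):
    """Scrape: fetch the content of ALL papers in the bibliography."""
    return {pid: literature[pid] for pid in bibliography}

def index(papers):
    """Index: map each word to the set of papers that contain it."""
    idx = defaultdict(set)
    for pid, text in papers.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            idx[word].add(pid)
    return idx

def search(idx, term):
    """Search: look a term up in the index."""
    return sorted(idx.get(term.lower(), set()))

bibliography = discover(LITERATURE, "obesity")
papers = scrape(LITERATURE, bibliography)
idx = index(papers)
print(bibliography, search(idx, "trials"))
```

The last step (extract, annotate, aggregate) is domain-specific and is left out of the sketch.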
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these
[Diagram] CONTENTMINE: Complete OPEN Platform for Mining Scientific Literature
• getpapers (query, daily crawl): EuPMC, arXiv, CORE, HAL, (UNIV repos), ToC services
• quickscrape (crawl, scrapers): publisher sites (PLoS ONE, BMC, PeerJ… Nature, IEEE, Elsevier…)
• catalogue: URLs, DOIs
• Formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS
• norma (Normalizer, Structurer, Semantic Tagger): text, data, figures => Semantic ScholarlyHTML (abstract, methods, references, captioned figures, HTML tables)
• ami (search, lookup; queries, taggers) with COMMUNITY plugins: Chem, Phylo, Trials, Crystal, Plants => Facts
• Visualization and Analysis
• 30,000 pages/day
http://chemicaltagger.ch.cam.ac.uk/
• Typical chemical synthesis
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B EUR.
Facts in context
daily IUCN endangered species news
en.wikipedia.org CC BY-SA
ContentMine Fact of The Day
• Fact of the day
• Endangered species in recent science
• Facts
• Bubbles
https://en.wikipedia.org/wiki/Tree_of_life CC BY-SA
“Root”
4500 papers, each with 1 tree
OCR (Tesseract)
Norma (image analysis)
(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);
Semantic re-usable/computable output (ca 4 secs/image)
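The string above is in Newick notation, which is what makes the output computable. As a minimal sketch of that, the snippet below pulls the taxon names and branch lengths out of a Newick string with a regular expression. It assumes integer branch lengths and unnamed internal nodes, as in the slide; it is not a full Newick parser, and the `leaves` helper is hypothetical.

```python
import re

# A leaf looks like Name:length. Internal nodes in this output are unnamed
# (for example "):86"), so requiring a name before the colon skips them.
LEAF = re.compile(r"([A-Za-z_]+):(\d+)")

def leaves(newick):
    """Return (taxon, branch_length) pairs for every named node."""
    return [(name, int(length)) for name, length in LEAF.findall(newick)]

# A balanced sub-tree copied from the slide's output, closed with ";"
fragment = ("((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,"
            "Synergistes_jonesii:301);")

taxa = leaves(fragment)
for name, length in taxa:
    print(name, length)
```

Taxon lists extracted this way from each paper's tree are the kind of input a supertree method needs.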
Supertree for 924 species
Tree
Supertree created from 4300 papers
Copyright and Mining
• PMR-premise: You cannot do reproducible scientific mining and avoid violating copyright.
• UK (“Hargreaves”) 2014 legislation:
– “personal” “non-commercial*” “research” “data analytics”
– legitimizes copying (?to disk), but not publishing
*teaching, textbooks, etc. may be “commercial”
Publishing and ICT
Trust these (publishing)      as much as you trust these (ICT)
Elsevier                      Microsoft
Mendeley (Elsevier)           Facebook
Digital Science/Macmillan     Apple
Wiley                         etc.
etc.
STM Publishers prevent Mining
• FUD & disinformation about legality (Elsevier)
• Monopolies on infrastructure (“API”s, CCC Rightfind)
• Technical obstruction (Wiley Captcha, Macmillan Readcube)
• Restrictive contracts with libraries (ALL) [1]
• Wasting my/our time (ALL)
[1] [You may not] utilize the TDM Output to enhance … subject repositories in a way that would [… ] have the potential to substitute and/or replicate any other existing Elsevier products, services and/or solutions.
WILEY … “new security feature… to prevent systematic download of content”
“[limit of] 100 papers per day”
“essential security feature … to protect both parties (sic)”
CAPTCHA
User has to type words
ContentMine working with Libraries
• Cambridge: Library, Plant Sciences, Epidemiology, Chemistry
• Cochrane Collaboration on Systematic Reviews of Clinical Trials
• FutureTDM (H2020, LIBER)
• Running workshops and training

Editor's Notes

  • #2: Hi, I’m here to talk about AMI, a data extraction framework and tool. First, I just want to highlight some of the key contributors to the project: Andy for his work on the ChemistryVisitor and Peter for the overall architecture. In this talk, I’m going to stress the importance of data in a specific format and its utility to automated machine processing. Then I’m going to demonstrate AMI’s architecture and the transformation of data as it flows through the process. I’m going to dwell a little on a core format used, Scalable Vector Graphics (SVG), before introducing the concept of visitors, which are pluggable, context-specific data extractors. Next, I’m going to introduce Andy’s ChemVisitor, for extracting semantic chemistry data, along with a few other visitors that can process non-chemistry-specific data. Finally, I will demonstrate some uses of the ChemVisitor within the realm of validation and metabolism.