Content Mining of Science in Cambridge
Peter Murray-Rust,
Dept of Chemistry, University of Cambridge
libraries@cambridge, Cambridge, UK 2016-01-07
What is mining?
Why is it useful?
Open Access and UK “Hargreaves” legislation
How Cambridge can become a world leader
“The Right to Read is the Right to Mine” (Peter Murray-Rust, 2011)
http://contentmine.org
Use Cases of ContentMining
• Epidemiology of obesity (Cambridge U)
• (OKF, OpenTrials) Mapping clinical trials repositories to reports in scientific literature
• Mining chemical reactions from patents
• Creating a bacterial supertree-of-life from 4500 papers
Polly has 20 seconds to read this paper…
…and 10,000 more
ContentMine software can do this in a few minutes
Polly: “there were 10,000 abstracts and due
to time pressures, we split this between 6
researchers. It took about 2-3 days of work
(working only on this) to get through
~1,600 papers each. So, at a minimum this
equates to 12 days of full-time work (and
would normally be done over several weeks
under normal time pressures).”
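The figures in Polly's account can be checked with a short calculation. This is only a sketch of the arithmetic: the 8-hour working day is an assumption that does not appear on the slides, and all variable names are illustrative.

```python
# Figures quoted on the slides.
ABSTRACTS = 10_000
RESEARCHERS = 6
DAYS_PER_RESEARCHER = (2, 3)   # "about 2-3 days of work" each
SECONDS_PER_PAPER = 20         # "Polly has 20 seconds to read this paper"

# Assumption (not from the slides): one working day is 8 hours.
HOURS_PER_WORKDAY = 8

# Share of abstracts per researcher when the load is split evenly.
per_researcher = ABSTRACTS / RESEARCHERS

# Total effort in person-days for the low and high estimates.
person_days = tuple(RESEARCHERS * d for d in DAYS_PER_RESEARCHER)

# Pure reading time for one person at 20 seconds per paper.
reading_hours = ABSTRACTS * SECONDS_PER_PAPER / 3600
reading_days = reading_hours / HOURS_PER_WORKDAY

print(f"abstracts per researcher: {per_researcher:.0f}")
print(f"person-days (min, max):   {person_days}")
print(f"reading time, one person: {reading_hours:.1f} h = {reading_days:.1f} working days")
```

The lower bound of 12 person-days matches the quote; the even split gives slightly more than the "~1,600 papers each" that Polly recalls.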
400,000 Clinical Trials
In 10 government registries
Mapping trials => papers
http://www.trialsjournal.com/content/16/1/80
2009 => 2015. What’s happened in the last 6 years?
Search the whole scientific literature
For “2009-0100068-41”
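Searching a set of papers for one registry identifier could look roughly like this. It is a minimal sketch: the three documents are invented for illustration, `find_trial_mentions` is a hypothetical helper, and a real search would run over the downloaded literature rather than an in-memory dictionary.

```python
import re

TRIAL_ID = "2009-0100068-41"  # identifier quoted on the slide


def find_trial_mentions(documents, trial_id):
    """Return {document name: [context snippets]} for texts that mention trial_id."""
    # Lookarounds stop the pattern matching inside a longer identifier.
    pattern = re.compile(r"(?<![\w-])" + re.escape(trial_id) + r"(?![\w-])")
    hits = {}
    for name, text in documents.items():
        snippets = [
            text[max(0, m.start() - 30): m.end() + 30]
            for m in pattern.finditer(text)
        ]
        if snippets:
            hits[name] = snippets
    return hits


# Invented example texts, not real papers.
documents = {
    "paper_a.txt": "The trial (registry number 2009-0100068-41) enrolled 120 patients.",
    "paper_b.txt": "No registry identifier is reported in this article.",
    "paper_c.txt": "See 2009-0100068-410 for a different, longer identifier.",
}

hits = find_trial_mentions(documents, TRIAL_ID)
for name, snippets in hits.items():
    print(name, snippets)
```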
ContentMine-ing strategy
• Discover. Crawl the COMPLETE relevant literature.
=> bibliography
• Scrape (download). ALL papers
• Index papers => Facts
• Search/analyze papers => complex science
• Extract, Annotate, Aggregate (“Transformative”)
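The stages above can be sketched as a chain of small functions. This is a toy illustration, not the ContentMine software: the corpus, the paper identifiers, the species dictionary and every function name are assumptions made up for the example.

```python
from collections import Counter

# Invented in-memory "literature"; real input would come from repositories.
CORPUS = {
    "paper-1": "Borrelia burgdorferi was isolated from ticks.",
    "paper-2": "We sequenced Mycobacterium tuberculosis isolates.",
    "paper-3": "Mycobacterium tuberculosis and Bifidobacterium longum were both recorded.",
}

# Hypothetical dictionary of terms to index (names taken from the tree slide).
SPECIES = ["Bifidobacterium longum", "Borrelia burgdorferi", "Mycobacterium tuberculosis"]


def discover(query):
    """Discover: list every paper whose text matches the query (the bibliography)."""
    return [pid for pid, text in CORPUS.items() if query.lower() in text.lower()]


def scrape(paper_ids):
    """Scrape: fetch the content of ALL discovered papers."""
    return {pid: CORPUS[pid] for pid in paper_ids}


def index(papers):
    """Index: turn each paper into a list of facts (here, dictionary terms found)."""
    return {pid: [s for s in SPECIES if s in text] for pid, text in papers.items()}


def aggregate(facts):
    """Aggregate: count how many papers mention each fact."""
    return Counter(term for terms in facts.values() for term in terms)


counts = aggregate(index(scrape(discover("tuberculosis"))))
print(counts)
```

Keeping each stage as a separate function means a stage can be swapped (a different source, a different tagger) without touching the others.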
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these
CONTENTMINE: Complete OPEN Platform for Mining Scientific Literature
[Architecture diagram, summarized]
• Sources: daily crawl of EuPMC, arXiv, CORE, HAL, (UNIV repos), ToC services; publisher sites (PLoS ONE, BMC, PeerJ… Nature, IEEE, Elsevier…)
• Discovery and download: catalogue, getpapers (query), crawl, quickscrape (scrapers) => URLs, DOIs
• Input formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS
• norma: Normalizer, Structurer, Semantic Tagger => Text, Data, Figures (abstract, methods, references, captioned figures, HTML tables) as Semantic ScholarlyHTML
• ami: search, lookup (queries, taggers) => Facts
• CONTENT MINING COMMUNITY plugins: Chem, Phylo, Trials, Crystal, Plants; Visualization and Analysis
• Throughput: 30,000 pages/day
http://chemicaltagger.ch.cam.ac.uk/
• Typical chemical synthesis
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have processed 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B EUR.
Facts in context
daily IUCN endangered species news
en.wikipedia.org CC BY-SA
ContentMine Fact of The Day
• Fact of the day
• Endangered species in recent science
• Facts
• Bubbles
https://en.wikipedia.org/wiki/Tree_of_life CC BY-SA
“Root”
4500 papers each
with 1 tree
OCR (Tesseract)
Norma (image analysis)
(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);
Semantic re-usable/computable output (ca 4 secs/image)
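The string above is in Newick format, which is what makes the output computable. As a rough illustration, a few lines of Python can read such a string back into a tree. This is a minimal sketch with hypothetical function names, not the ContentMine code; it handles only names and branch lengths, and it is run on the first complete subtree from the slide because the full string contains an elision.

```python
def parse_newick(text):
    """Parse a Newick string into nested (name, length, children) tuples."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos:pos + 1] == "(":
            pos += 1                          # consume "("
            children.append(parse_node())
            while text[pos:pos + 1] == ",":
                pos += 1                      # consume ","
                children.append(parse_node())
            if text[pos:pos + 1] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                          # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]                # empty for unnamed internal nodes
        length = None
        if text[pos:pos + 1] == ":":
            pos += 1                          # consume ":"
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    root = parse_node()
    if pos != len(text):
        raise ValueError(f"unexpected text at position {pos}")
    return root


def leaf_names(node):
    """Collect the names of all leaves (species) in left-to-right order."""
    name, _, children = node
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]


# First complete subtree of the string shown on the slide.
subtree = ("((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,"
           "Synergistes_jonesii:301);")
tree = parse_newick(subtree)
print(leaf_names(tree))
```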
Supertree for 924 species
Tree
Supertree created from 4300 papers
Copyright and Mining
• UK (“Hargreaves”) 2014 legislation:
– “personal” “non-commercial*” “research” “data
analytics”
– legitimizes copying (?to disk), but not publishing
*teaching, textbooks, etc. may be “commercial”
STM Publishers prevent Mining
• FUD & disinformation about legality (Elsevier)
• Monopolies on infrastructure (“API”s, CCC
Rightfind)
• Technical obstruction (Wiley Captcha,
Macmillan Readcube)
• Restrictive contracts with libraries (ALL) [1]
• Wasting my/our time (ALL)
[1] [You may not] utilize the TDM Output to enhance … subject repositories
in a way that would [… ] have the potential to substitute and/or replicate
any other existing Elsevier products, services and/or solutions.
WILEY … “new security feature… to prevent systematic download of content”
“[limit of] 100 papers per day”
“essential security feature … to protect both parties (sic)”
CAPTCHA
User has to type words
ContentMine working with Libraries
• Cambridge: Library, Plant Sciences, Public Health,
Chemistry
• Cochrane Collaboration on Systematic Reviews of
Clinical Trials
• FutureTDM (H2020, LIBER)
• Running workshops and training
• We have dedicated servers running in chemistry
My European Heroes
Young People (ContentMine)
NEELIE KROES

Editor's Notes

  • #2: Hi, I’m here to talk about AMI, a data extraction framework and tool. First, I want to highlight some of the key contributors to the project: Andy for his work on the ChemistryVisitor and Peter for the overall architecture. In this talk, I’m going to stress the importance of having data in a specific format and its utility for automated machine processing. Then I’m going to demonstrate AMI’s architecture and the transformation of data as it flows through the process. I’m going to dwell a little on a core format used, Scalable Vector Graphics (SVG), before introducing the concept of visitors, which are pluggable, context-specific data extractors. Next, I’m going to introduce Andy’s ChemVisitor, for extracting semantic chemistry data, along with a few other visitors that can process non-chemistry-specific data. Finally, I will demonstrate some uses of the ChemVisitor in the areas of validation and metabolism.
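
The idea of visitors as pluggable, context-specific extractors can be illustrated with a small sketch. Everything here is an assumption made for illustration: the class names, the `visit` method and the plain-text input are hypothetical and are not AMI's actual interface, and the formula pattern is deliberately crude.

```python
import re
from abc import ABC, abstractmethod


class Visitor(ABC):
    """Hypothetical plug-in interface: one extractor per kind of content."""

    name = "visitor"

    @abstractmethod
    def visit(self, text):
        """Return the list of items this visitor finds in the text."""


class SpeciesVisitor(Visitor):
    """Finds species names by dictionary lookup."""

    name = "species"

    def __init__(self, dictionary):
        self.dictionary = list(dictionary)

    def visit(self, text):
        return [term for term in self.dictionary if term in text]


class FormulaVisitor(Visitor):
    """Finds simple chemical formulae that contain at least one digit."""

    name = "formula"
    PATTERN = re.compile(r"\b(?=[A-Za-z]*\d)(?:[A-Z][a-z]?\d*){2,}\b")

    def visit(self, text):
        return self.PATTERN.findall(text)


def run_visitors(text, visitors):
    """Apply every registered visitor to the same document."""
    return {visitor.name: visitor.visit(text) for visitor in visitors}


sample = "Glucose (C6H12O6) was dissolved in H2O and fed to Bifidobacterium longum. DNA was extracted."
results = run_visitors(sample, [SpeciesVisitor(["Bifidobacterium longum"]), FormulaVisitor()])
print(results)
```

Adding a new kind of extraction then means writing one more `Visitor` subclass and passing it to `run_visitors`; the rest of the pipeline stays unchanged.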