MIOSS 2016, EBI, UK, 2016-05-17
Open Content +
Open Programs
Peter Murray-Rust [1,2]
[1]University of Cambridge
[2]TheContentMine
pm286 AT cam DOT ac DOT uk
Open
Software
Articles
Data
Infrastructure
Themes
• Open :
– Faster
– Better
– Agile
– Inclusive
– Re-usable
Pomerantz, J. and Peek, R. 2016. Fifty shades of open
http://dx.doi.org/10.5210/fm.v21i5.6360
Open source. Open access. Open society. Open knowledge. Open government. Even open food. The word “open” has been applied to a wide variety of
words to create new terms, some of which make sense, and some not so much. This essay disambiguates the many meanings of the word “open” as
it is used in a wide range of contexts.
• Demos:
– OSCAR, OPSIN, etc.
– Content Mining and Annotation
Open Source Demos
• Centre for Molecular Informatics
– OSCAR (chemical entity recognition)
– Opsin (name2structure)
– ChemicalTagger (chemical language parsing)
– [OSRIC] (chemical image interpretation)
• ContentMine
– getpapers, quickscrape, norma, ami, canary
– Mining the complete scholarly literature
• 10,000 articles per day
• > 1 million facts per day
Open: [state+private] investment
Rufus Pollock “The Open Information age” TBP 2016 [1]
CMI Software (OSCAR, OPSIN, ChemicalTagger, [OSRIC]): sponsors
– 2006 …
– Unilever
– Nature PG, RSC, Int Union of Cryst.
– EPSRC
– OMII
– NCI
– [Microsoft]
– JISC
– CambridgeIP
– Linguamatics
– … 2016
• Peter Corbett, Joe Townsend, Chris Waudby, Sam Adams, David Jessop, Lezan
Hawizy, Nico Adams, Mark Williamson, Andy Howlett, Daniel Lowe…
[1] https://www.youtube.com/watch?v=D2oNxhn6POA
Community
Mat Todd (Sydney) and MANY collaborators
http://opensourcemalaria.org/ (Chrome for interactivity)
Mat Todd, Univ Sydney, runs an Open Notebook community
to create new antimalarials.
Interactive OPEN chemical search tool from cheminfo.org
Interactive OPEN molecular display Jmol (Bob Hanson et al)
Interactive OPEN chemical search tool from cheminfo.org
data is associated with the proposed
scientific endeavour prior to or at the
point of creation rather than by
annotating the data with commentary
after the experiment has taken place
University of Southampton
Natural Language Processing
Part of speech tagging (Wordnet, Brown Corpus, etc.)
http://chemicaltagger.ch.cam.ac.uk/
• Typical chemical synthesis
Automatic semantic markup of chemistry
Could be used for analytical, crystallization, etc.
Parsing chemical sentences
This could be extended to much other scientific language
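As a rough illustration of what tagging a chemical sentence means, here is a minimal sketch in Python. It is not ChemicalTagger's method: the lexicon, the tag names and the `tag` function are all hypothetical, invented for this example, and a real tagger relies on much richer lexical resources and a grammar.

```python
import re

# Hypothetical mini-lexicon (assumption): maps lower-cased words to invented tags.
LEXICON = {
    "added": "VB-ADD",
    "stirred": "VB-STIR",
    "heated": "VB-HEAT",
    "was": "VBD",
    "at": "IN-AT",
    "for": "IN-FOR",
    "the": "DT",
    "and": "CC",
}

# Plain integers or decimals count as quantities.
QUANTITY = re.compile(r"^\d+(\.\d+)?$")

# Hypothetical unit list (assumption), stored lower-cased.
UNITS = {"g", "mg", "ml", "mmol", "h", "min", "°c"}


def tag(sentence):
    """Return (token, tag) pairs using lexicon lookup with simple fallbacks."""
    pairs = []
    for token in sentence.rstrip(".").split():
        lower = token.lower()
        if lower in LEXICON:
            label = LEXICON[lower]
        elif QUANTITY.match(token):
            label = "CD"        # cardinal number
        elif lower in UNITS:
            label = "NN-UNIT"   # unit of measurement
        else:
            label = "NN"        # default: treat as a noun
        pairs.append((token, label))
    return pairs


for token, label in tag("The mixture was stirred at 80 °C for 2 h."):
    print(f"{token}\t{label}")
```

Once action verbs, quantities and units carry tags, phrases such as "at 80 °C" or "for 2 h" can be grouped into temperature and time expressions by a later parsing step.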
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have mined 500,000 patents. There are >
3,000,000 reactions/year. Added value > 1 billion EUR.
AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
AMI reads the complete diagram,
recognizes the paths and
generates the molecules. Then
she creates a stop-frame animation
showing how the 12 reactions
lead into each other
CLICK HERE FOR ANIMATION
(may be browser dependent)
“OSRIC”
Is anyone interested
In taking this further?
@Senficon (Julia Reda) :Text & Data mining in times of
#copyright maximalism:
"Elsevier stopped me doing my research"
http://onsnetwork.org/chartgerink/2015/11/16/elsevier-stopped-me-doing-my-research/ … #opencon #TDM
Elsevier stopped me doing my research
Chris Hartgerink
I am a statistician interested in detecting potentially problematic research such as data fabrication,
which results in unreliable findings and can harm policy-making, confound funding decisions, and
hampers research progress.
To this end, I am content mining results reported in the psychology literature. Content mining the
literature is a valuable avenue of investigating research questions with innovative methods. For
example, our research group has written an automated program to mine research papers for errors in
the reported results and found that 1/8 papers (of 30,000) contains at least one result that could
directly influence the substantive conclusion [1].
In new research, I am trying to extract test results, figures, tables, and other information reported in
papers throughout the majority of the psychology literature. As such, I need the research papers
published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research
papers from, for instance, Sciencedirect. I was doing this for scholarly purposes and took into account
potential server load by limiting the amount of papers I downloaded per minute to 9. I had no intention
to redistribute the downloaded materials, had legal access to them because my university pays a
subscription, and I only wanted to extract facts from these papers.
Full disclosure, I downloaded approximately 30GB of data from Sciencedirect in approximately 10 days.
This boils down to a server load of 0.0021GB/[min], 0.125GB/h, 3GB/day.
Approximately two weeks after I started downloading psychology research papers, Elsevier notified my
university that this was a violation of the access contract, that this could be considered stealing of
content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading
(which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university.
I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly
hampering me in my research.
[1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The
prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22.
doi: 10.3758/s13428-015-0664-2
Chris Hartgerink’s blog post
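The download rates quoted in the blog post can be checked with a few lines of arithmetic. This is a minimal sketch that takes the post's approximate totals (30 GB over 10 days) at face value; the variable names are illustrative only.

```python
# Totals as stated in the blog post (both approximate).
total_gb = 30.0
days = 10.0

per_day = total_gb / days    # GB per day
per_hour = per_day / 24      # GB per hour
per_minute = per_hour / 60   # GB per minute

print(f"{per_day:.0f} GB/day, {per_hour:.3f} GB/h, {per_minute:.4f} GB/min")
# prints: 3 GB/day, 0.125 GB/h, 0.0021 GB/min
```

The figures agree with those in the post: about 2 MB per minute on average.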
Julia Reda, Pirate MEP, running ContentMine
software to liberate science 2016-04-16
The Right to Read is the Right to Mine
http://contentmine.org
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these
[Diagram: the ContentMine pipeline. Latest 20150908.]
Sources: EuPMC, arXiv, CORE, HAL, university repositories, ToC services; publisher sites (PLoS ONE, BMC, PeerJ… Nature, IEEE, Elsevier…)
Inputs: URLs, DOIs; PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS
Tools: getpapers (query, daily crawl, catalogue); quickscrape (crawl, scrapers); norma (normalizer, structurer, semantic tagger); ami (search, lookup)
Intermediate form: Semantic ScholarlyHTML (text, data, figures; abstract, methods, references, captioned figures, HTML tables); 30,000 pages/day
Content mining community plugins (queries, taggers): Chem, Phylo, Trials, Crystal, Plants
Outputs: facts; visualization and analysis
Mining for phytochemicals
• getpapers -q carvone -o carvone -x -k 100
Search for “carvone” (-q), write to output directory “carvone” (-o), download XML (-x), limit hits to 100 (-k)
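When getpapers is driven from a script, passing the arguments as a list helps avoid the typographic dashes that creep into commands copied from slides. The sketch below only builds and prints the command; it does not run getpapers. The helper name `getpapers_command` is hypothetical, and only the flags shown on the slide (-q, -o, -x, -k) are used.

```python
import shlex


def getpapers_command(query, outdir, limit, xml=True):
    """Build the argument list for a getpapers search (illustrative helper)."""
    args = ["getpapers", "-q", query, "-o", outdir]
    if xml:
        args.append("-x")        # request fulltext XML
    args += ["-k", str(limit)]   # limit the number of hits
    return args


cmd = getpapers_command("carvone", "carvone", 100)
print(shlex.join(cmd))
# prints: getpapers -q carvone -o carvone -x -k 100
```

Assuming getpapers is installed and on the PATH, the same list could be passed to `subprocess.run(cmd, check=True)` to run the search.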

Open software and knowledge for MIOSS
