xmlsummerschool.com Slide 1
Licensed under a Creative Commons
Attribution-Noncommercial-Share Alike 4.0 International License
summer school
“Is publishing destroying science?”
XML: Liberating science for citizens
Peter Murray-Rust
ContentMine.org
XMLSummerSchool
Trends and Transients September 2019
xmlsummerschool.com Slide 2
Citizen OpenNoteBook science
World citizens spend 500 Billion USD on STEM[1] research.
85% is wasted
Most citizens don’t have access
We are facing existential threats
• Climate
• Antibiotic resistance
• Biodiversity loss
The scientific literature may already contain answers
Let’s look for them…
OpenNoteBook is doing Everything on the web Immediately
[1] Scientific, Technical, Engineering, Medical
xmlsummerschool.com Slide 3
Thanks and Volunteers
• Ambarish Kumar (New Delhi, IN)
• Daniel Nuest (WWU, DE)
• Simon Worthington (TiB, DE)
• Tiago Lubiana (USP, BR)
• Emanuel Faria (Brasilia)
• Graham Klyne (Oxford, UK)
• Peter Flynn (UCC, IE)
xmlsummerschool.com Slide 4
Resources
Resources for XMLSS19
• http://github.com/petermr/xmlopensci
• http://github.com/petermr/CEVOpen
• http://github.com/petermr/climate
• http://github.com/petermr/opennotebook
• http://github.com/petermr/ami3
• Contains code, resources and Issues
(including conversations)
• https://media.ed.ac.uk/media/1_46h85ltt
(Disrupting the Publisher-Academic Complex)
• Scientific search for Everyone
(https://www.slideshare.net/petermurrayrust/scientific-search-for-everyone)
xmlsummerschool.com Slide 5
Trends: Good and Ungood
Existential threats (climate, disease)
Good
• Open science and Open Access
• Free/Open Source
• Distributed collaboration
• Github
• Wikidata, Linked Open Data and universal identifiers
Ungood
• Publisher-Academic Complex
• North-South divide (knowledge neocolonialism)
xmlsummerschool.com Slide 6
Politics of Publishing
xmlsummerschool.com Slide 7
http://www.budapestopenaccessinitiative.org/read
“… an unprecedented public good. … completely free and unrestricted access to [peer-reviewed literature] by all scientists, scholars, teachers, students, and other curious minds. … Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge.”
(Budapest Open Access Initiative, 2002)
xmlsummerschool.com Slide 8
The Publisher-Academic complex[1]
[Diagram: Publishers, Academia, Taxpayer, Student, Researcher, Infrastructure and “The scholarly poor”, linked by flows labelled “Glory+?”, “$$, MS”, “review”, “$$”, “$$” and “in-kind”]
[1] The Military-Industrial-Academic complex (1961) (Dwight D Eisenhower, US President)
xmlsummerschool.com Slide 10
Demo: Plants, oils, medicines
xmlsummerschool.com Slide 11
Citizen OpenNoteBook science
About 2 million STEM articles per year (excl. theses, reports)
70% behind paywalls
Most in PDF (arghhh!)
BUT…
NLM (US NIH) requires its output to be Open and XML
So we’re going to read a few million articles.
Start at http://europepmc.org
And use http://ContentMine.org software packages
And http://github.com/petermr/opennotebook
xmlsummerschool.com Slide 12
Citizen OpenNoteBook science (plant oils)
Plants emit “essential oils” (volatile chemicals) which often
have antibacterial activities: http://github.com/petermr/CEVOpen
NOTE: all items are identified by a Wikidata item (Q number)
xmlsummerschool.com Slide 13
Search EuropePMC with getpapers for plant oils
Download 200 papers (30 sec) from EuropePMC
getpapers
-q "((essential oil) AND (compound))" #EPMC
--xml #output format
--outdir oil200/ #makes directory (CProject)
-k 200 #cutoff
http://github.com/ContentMine/getpapers
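For readers who want to see roughly what such a query looks like on the wire, here is a minimal Python sketch that only builds a search URL. It assumes the Europe PMC REST search endpoint and its `query`, `format` and `pageSize` parameters; the function name `build_search_url` is hypothetical, and this is not a description of how `getpapers` itself is implemented.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint of the Europe PMC REST search service; check the current
# Europe PMC documentation before relying on it.
EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(query: str, limit: int = 200) -> str:
    """Return a search URL for `query`, asking for at most `limit` hits as JSON."""
    params = {"query": query, "format": "json", "pageSize": limit}
    return EPMC_SEARCH + "?" + urlencode(params)


url = build_search_url("((essential oil) AND (compound))", limit=200)
print(url)
# Round-trip the query string to confirm the encoding is lossless.
print(parse_qs(urlsplit(url).query)["query"][0])
```

No request is sent; fetching and saving the results into a CProject is what `getpapers` does for you.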
xmlsummerschool.com Slide 14
Download EuropePMC XML into CProject and CTrees
Oil186 CProject
├── PMC4391421 CTree
│ ├── eupmc_result.json download metadata
│ ├── fulltext.xml full JATS text of article
├── PMC5080681 CTree
│ ├── eupmc_result.json download metadata
│ ├── fulltext.xml full JATS text of article
├── PMC5132230 CTree
│ ├── eupmc_result.json download metadata
│ ├── fulltext.xml full JATS text of article
├── PMC5203915 CTree
│ ├── eupmc_result.json download metadata
│ ├── fulltext.xml full JATS text of article
NOTE: Tree automatically created by `getpapers`
Data can be translated automatically by jats2html.xsl
CTrees are extensible. Let’s create HTML...
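Because a CProject is just a directory of CTree directories, it can be inspected with ordinary file-system code. The sketch below is an illustration only: the helper `list_ctrees` is hypothetical, and the tiny project it builds in a temporary directory stands in for a real download.

```python
import tempfile
from pathlib import Path


def list_ctrees(cproject: Path) -> list[str]:
    """Return the names of CTree directories that contain a fulltext.xml file."""
    return sorted(
        child.name
        for child in cproject.iterdir()
        if child.is_dir() and (child / "fulltext.xml").is_file()
    )


# Build a throwaway CProject so the sketch runs without downloading anything.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "oil186"
    for pmcid in ("PMC4391421", "PMC5080681"):
        ctree = project / pmcid
        ctree.mkdir(parents=True)
        (ctree / "fulltext.xml").write_text("<article/>", encoding="utf-8")
        (ctree / "eupmc_result.json").write_text("{}", encoding="utf-8")
    found = list_ctrees(project)
    print(found)
```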
xmlsummerschool.com Slide 15
Citizen OpenNoteBook science (JATS XML)
AMI is ContentMine’s Open XML toolkit
Converts FROM XML and TO XML where possible
Transform JATS-XML to HTML (normally integrated)
ami-transform -p oil200
--transform nlm2html.xsl #stylesheet
--input fulltext.xml
--output scholarly.html
http://github.com/petermr/ami3
NOTE: JATS used to be called NLM. In our demo of
ami-search this is all hidden.
xmlsummerschool.com Slide 16
PART of ContentMine stylesheet for JATS2HTML
<xsl:template match="jats:article">
  <html:head>
    <html:style type="text/css">
      a {background : #ffffff;}
      article {border-style : dotted;}
    </html:style>
  </html:head>
  <html:body> <!-- PULL for main sections -->
    <xsl:apply-templates select="jats:front"/>
    <xsl:apply-templates select="jats:body"/>
    <xsl:apply-templates select="jats:back"/>
    <!-- PUSH for unknown elements -->
    <xsl:for-each select="*[not(
        self::jats:front | self::jats:body | self::jats:back)]">
      <xsl:call-template name="unknownElement"/>
    </xsl:for-each>
  </html:body>
</xsl:template>
NOTE: JATS has ca 250 elements, not always used wisely
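The same split between expected sections and everything else can be sketched in Python. This is only an analogy to the stylesheet fragment, not part of the ContentMine code: the function `split_children` is hypothetical, and the sample article is un-namespaced for brevity.

```python
import xml.etree.ElementTree as ET

# The three main sections the stylesheet handles explicitly.
KNOWN_SECTIONS = ("front", "body", "back")


def split_children(article_xml: str) -> tuple[list[str], list[str]]:
    """Return (known, unknown) child element names of the root article element."""
    root = ET.fromstring(article_xml)
    known = [child.tag for child in root if child.tag in KNOWN_SECTIONS]
    unknown = [child.tag for child in root if child.tag not in KNOWN_SECTIONS]
    return known, unknown


sample = "<article><front/><body/><floats-group/><back/></article>"
known, unknown = split_children(sample)
print(known)
print(unknown)
```

Reporting the unknown elements rather than silently dropping them is what makes it possible to notice markup the stylesheet does not yet cover.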
xmlsummerschool.com Slide 17
ScholarlyHTML showing JATS markup
[Screenshot annotated with the JATS elements <table>, <caption>, <th>, <tr>, <xref>, <p>, <div>, <ref>]
xmlsummerschool.com Slide 18
Result of applying NLM2HTML
Oil186 CProject
├── PMC4391421 CTree
│ ├── eupmc_result.json download metadata
│ ├── fulltext.xml full JATS text of article
│ ├── scholarly.html created by jats2html.xsl
├── PMC5080681 CTree
│ ├── eupmc_result.json download metadata
│ ├── fulltext.xml full JATS text of article
│ ├── scholarly.html created by jats2html.xsl
├── PMC5132230 CTree
│ ├── eupmc_result.json download metadata
│ ├── fulltext.xml full JATS text of article
│ ├── scholarly.html created by jats2html.xsl
NOTE: lazy creation (a la “make”)
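One common way to get make-like laziness is to rebuild a target only when it is missing or older than its source. The sketch below shows that rule; it is an assumption for illustration, since the slides do not say whether AMI compares timestamps or merely checks that the file exists. The name `needs_rebuild` is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def needs_rebuild(source: Path, target: Path) -> bool:
    """True if target is missing or older than source (the rule make applies)."""
    if not target.exists():
        return True
    return target.stat().st_mtime < source.stat().st_mtime


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "fulltext.xml"
    dst = Path(tmp) / "scholarly.html"
    src.write_text("<article/>", encoding="utf-8")
    before = needs_rebuild(src, dst)   # target does not exist yet
    dst.write_text("<html/>", encoding="utf-8")
    os.utime(src, (1_000, 1_000))      # force the source to look older
    os.utime(dst, (2_000, 2_000))
    after = needs_rebuild(src, dst)    # target is newer than source
    os.utime(src, (3_000, 3_000))      # source touched again
    stale = needs_rebuild(src, dst)    # target is now out of date
    print(before, after, stale)
```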
xmlsummerschool.com Slide 19
ContentMine Dictionaries for searching
<dictionary title="myterpenes">
  <desc>terpenes searched from Wikipedia</desc>
  <entry id="C127"
    term="(−)-menthol"
    wikipedia="/wiki/menthol"
    wikidata="Q407418" />
  <entry id="C2198"
    term="thymol"
    wikipedia="/wiki/thymol"
    wikidata="Q408883" />
</dictionary>
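A dictionary in this form is easy to load with a standard XML parser. The following sketch reads the two entries shown above and does a naive, case-insensitive substring match against a sentence. The helper names and the test sentence are invented for illustration, and real searching in AMI is presumably more careful about word boundaries.

```python
import xml.etree.ElementTree as ET

DICTIONARY_XML = """\
<dictionary title="myterpenes">
  <desc>terpenes searched from Wikipedia</desc>
  <entry id="C127" term="(−)-menthol" wikipedia="/wiki/menthol" wikidata="Q407418"/>
  <entry id="C2198" term="thymol" wikipedia="/wiki/thymol" wikidata="Q408883"/>
</dictionary>
"""


def load_terms(xml_text: str) -> dict[str, str]:
    """Map each lower-cased term to its Wikidata Q number."""
    root = ET.fromstring(xml_text)
    return {e.get("term").lower(): e.get("wikidata") for e in root.iter("entry")}


def find_terms(text: str, terms: dict[str, str]) -> list[tuple[str, str]]:
    """Return (term, Q number) for every dictionary term that occurs in text."""
    lowered = text.lower()
    return [(term, qid) for term, qid in terms.items() if term in lowered]


terms = load_terms(DICTIONARY_XML)
hits = find_terms("The oil was rich in thymol and carvacrol.", terms)
print(hits)
```

Carrying the Q number alongside each hit is what lets a match be linked straight back to Wikidata.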
xmlsummerschool.com Slide 20
Search scholarly.html for dictionary words
Search for compounds, species…
ami-search
-p oil1000 #CProject with 1000 papers
--dictionary #list of dictionaries
species #species by regex
country #200 countries
funders #15000 funders
disease #5000 diseases
mydictionary/compounds.xml #2100 compounds
mydictionary/plants.xml #1800 plants
xmlsummerschool.com Slide 21
ContentMine CProject and CTrees
Oil186 CProject
├── PMC4391421 CTree
│ ├── eupmc_result.json download metadata
│ ├── fulltext.xml full JATS text of article
│ ├── scholarly.html transformed text (to be searched)
│ ├── results search tree
│ │ ├── search dictionary search
│ │ │ ├── compound compound dictionary
│ │ │ │ └── results.xml results of search
│ │ │ ├── country country dictionary
│ │ │ │ └── results.xml results of search
│ │ │ ├── disease disease dictionary
│ │ │ │ └── empty.xml no results of search
...
│ │ ├── species hardcoded pseudo-search
│ │ │ └── binomial species (Rattus norvegicus)
│ │ │ └── results.xml results of search
│ │ └── word hardcoded word searches
│ │ └── frequencies word cloud
│ │ ├── results.xml results of search
│ │ └── results.html display of search
xmlsummerschool.com Slide 22
ContentMine XML Search results
<snippetsTree> <!-- CProject -->
  <snippets
    file="oil186/PMC5080681/results/species/binomial/results.xml"
  > <!-- CTree -->
    <!-- the xpath attribute locates the snippet -->
    <result
      pre="ify the chemical composition, and to assess anthelmintic, antimicrobial and antioxidant effects of "
      exact="Thymus bovei"
      xpath="/html[1]/body[1]/div[2]/div[5]/p[1]"
      match="Thymus bovei"
      post=" essential oil. "
      name="binomial"/>
…
NOTE: Each snippet with
• semantic origin “species/binomial/” links to dictionary
• separate XML result element
• W3C annotations (pre, exact, post)
• XPath locates it in document (W3C Annotations Rec.)
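Because every snippet is a separate XML element, the results can be read back with a few lines of code. The sketch below uses a shortened, closed-off copy of the result shown above; the function `read_snippets` is hypothetical and assumes the `file` attribute always has the form project/CTree/results/....

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed sample based on the slide; closing tags added here.
RESULTS_XML = """\
<snippetsTree>
  <snippets file="oil186/PMC5080681/results/species/binomial/results.xml">
    <result pre="antioxidant effects of " exact="Thymus bovei"
            xpath="/html[1]/body[1]/div[2]/div[5]/p[1]"
            match="Thymus bovei" post=" essential oil. " name="binomial"/>
  </snippets>
</snippetsTree>
"""


def read_snippets(xml_text: str):
    """Yield (ctree, exact, context) for each result element."""
    root = ET.fromstring(xml_text)
    for snippets in root.iter("snippets"):
        # Assumed layout: the CTree name is the second path segment.
        ctree = snippets.get("file").split("/")[1]
        for result in snippets.iter("result"):
            context = result.get("pre") + result.get("exact") + result.get("post")
            yield ctree, result.get("exact"), context


rows = list(read_snippets(RESULTS_XML))
for row in rows:
    print(row)
```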
xmlsummerschool.com Slide 23
Word frequencies
xmlsummerschool.com Slide 24
ami-search: 100 ARTICLES, 8 DICTIONARIES
xmlsummerschool.com Slide 25
Ami-search
xmlsummerschool.com Slide 26
WIKIPEDIA
xmlsummerschool.com Slide 27
WIKIDATA
xmlsummerschool.com Slide 28
Cooccurrence in 186 documents:
Plant oils and antibacterial activity
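A co-occurrence table of this kind can be built by counting, for each document, every pairing of a term from one dictionary with a term from another. The sketch below shows the counting step only; the three documents and their hits are invented for illustration and are not results from the 186-paper corpus.

```python
from collections import Counter
from itertools import product


def cooccurrence(docs):
    """Count how many documents mention each (plant, activity) pair together."""
    counts = Counter()
    for plants, activities in docs:
        # Use sets so a pair is counted once per document, however often it occurs.
        for pair in product(sorted(set(plants)), sorted(set(activities))):
            counts[pair] += 1
    return counts


# Hypothetical per-document dictionary hits, invented for illustration only.
docs = [
    (["Thymus bovei"], ["antibacterial", "antioxidant"]),
    (["Thymus bovei", "Mentha piperita"], ["antibacterial"]),
    (["Mentha piperita"], []),
]
matrix = cooccurrence(docs)
for pair, n in sorted(matrix.items()):
    print(pair, n)
```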
xmlsummerschool.com Slide 29
CONSEQUENCES: Good and Bad
Universally available ! …
… or not 
xmlsummerschool.com Slide 30
All the world’s 5 million FAIR Open Scientific articles (5 million × 0.1 MB = 0.5 TB), indexed by ContentMine.
Disk: 30 GBP. Raspberry Pi 3: 50 GBP.
[Photo, CC BY, PeterMR; labels: Disk, Raspberry Pi, Power]
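The storage estimate is simple arithmetic, spelled out here in decimal units (1 TB = 10^9 kB) and in whole kilobytes to avoid rounding noise. The 0.1 MB average article size is the slide's own figure.

```python
ARTICLES = 5_000_000       # open articles, from the slide
KB_PER_ARTICLE = 100       # 0.1 MB average size, from the slide

total_kb = ARTICLES * KB_PER_ARTICLE
total_tb = total_kb / 1_000_000_000   # decimal units: 1 TB = 10^9 kB
print(total_tb)
```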
xmlsummerschool.com Slide 31
[Photos: cc by-nc-sa license, LabHack and Alliance Earth]
1 APC = 1900 USD
1 bioreactor = 25 USD
1 Raspberry Pi = 55 USD
1 submission to bioRxiv = free (10 USD hidden)
“a PCR machine in the UK is around £6000 but in Zimbabwe about $33000 - try convincing someone to pay APCs when they have to try …”
CITIZENS!
Zimbabwe. LabHack team from Harare Institute of Technology.
xmlsummerschool.com Slide 32
@Senficon (Julia Reda): Text & Data mining in times of #copyright maximalism: "Elsevier stopped me doing my research" http://onsnetwork.org/chartgerink/2015/11/16/elsevier-stopped-me-doing-my-research/ … #opencon #TDM
Elsevier stopped me doing my research
Chris Hartgerink
xmlsummerschool.com Slide 33
Other XML possibilities
Chemistry!
And
“Turning PDFs into XML … is like
turning hamburgers back into a cow”
anon => Michael Kay (==> PeterMR)
xmlsummerschool.com Slide 34
Chem4Word creates CML (XML for chemistry)
Text linked directly to chemistry
xmlsummerschool.com Slide 35
Chemistry Add-In for Word (Chem4Word)
Chemical structures in Word documents
xmlsummerschool.com Slide 36
So why is Chem4Word different?
Open Source!
Easy to use
‘Open Chemistry’
CML-based
Mineable!
Semantically-rich!
xmlsummerschool.com Slide 37
Worldwide Usage
400K+ downloads
Colours show Word Versions in use
xmlsummerschool.com Slide 38
New ACME Editor
Advanced CML-based Molecule Editor
Drop-in component
Can be used in any WPF or Winforms app.
Apache 2.0 License
Will soon support reactions!
xmlsummerschool.com Slide 39
Who makes Chem4Word?
Now a small team of volunteers
Clyde Davies, Mike Williams, Andy Wright, Joe Townsend
https://www.chem4word.co.uk
xmlsummerschool.com Slide 40
BUT only 10% of scholarly articles are XML
The rest are HTML (usually of the worst kind)
Or PDF
And the images are PNGs and JPGs
You can’t do anything with that???
Oh yes we can…
We can turn Hamburgers into Cows!
But for that you’ll have to come to PMR’s presentation

XML for science; its huge potential; but are publishers preventing it?

  • 1. xmlsummerschool.com Slide 1 Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 Unported License summer school Is “Publishing destroying science?” XML: Liberating science for citizens Peter Murray-Rust ContentMine.org XMLSummerSchool Trends and Transients September 2019
  • 2. summer school xmlsummerschool.com Slide 2 Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 International License Citizen OpenNoteBook science World citizens spend 500 Billion USD on STEM[1] research. 85% is wasted Most citizens don’t have access We are facing existential threats • Climate • Antibiotic resistance • Biodiversity loss The scientific literature may already contain answers Let’s look for them… OpenNoteBook is doing Everything on the web Immediately [1] Scientific Technical Engineering Medical
  • 3. summer school xmlsummerschool.com Slide 3 Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 International License Thanks and Volunteers • Ambarish Kumar (New Delhi, IN) • Daniel Nuest (WWU, DE) • Simon Worthington (TiB, DE) • Tiago Lubiana (USP, BR) • Emanuel Faria (Brasilia) • Graham Klyne (Oxford, UK) Peter Flynn (UCC, IE)
  • 4. summer school xmlsummerschool.com Slide 4 Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 International License Resources Resources for XMLSS19 • http://github.com/petermr/xmlopensci • http://github.com/petermr/CEVOpen • http://github.com/petermr/climate • http://github.com/petermr/opennotebook • http://github.com/petermr/ami3 • Contains code, resources and Issues (including conversations) • https://media.ed.ac.uk/media/1_46h85ltt (Disrupting the Publisher-Academic Complex) • Scientific search for Everyone (https://www.slideshare.net/petermurrayrust/scientific-search-for-everyone )
  • 5. summer school xmlsummerschool.com Slide 5 Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 International License Trends: Good and Ungood Existential threats (climate, disease) Good • Open science and Open Access • Free/Open Source • Distributed collaboration • Github • Wikidata, Linked Open Data and universal identifiers Ungood • Publisher-Academic Complex • North-South divide (knowledge neocolonialism)
  • 6. summer school xmlsummerschool.com Slide 6 Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 International License Politics of Publishing
  • 7. summer school xmlsummerschool.com Slide 7 Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 International License http://www.budapestopenaccessinitiative.org/read … an unprecedented public good. … … completely free and unrestricted access to [peer- reviewed literature] by all scientists, scholars, teachers, students, and other curious minds. … …Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge. (Budapest Open Access Initiative, 2003)
  • 8. summer school xmlsummerschool.com Slide 8 Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 International License [1] The Military-Industrial-Academic complex (1961) (Dwight D Eisenhower, US President) Publishers Academia Glory+? $$, MS review Taxpayer Student Researcher $$ $$ in-kind The Publisher-Academic complex[1] Infrastructure “The scholarly poor”
  • 9. summer school xmlsummerschool.com Slide 9 Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 International License
  • 10. summer school xmlsummerschool.com Slide 10 Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 International License Demo: Plants, oils, medicines
  • 11. summer school xmlsummerschool.com Slide 11 Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 International License Citizen OpenNoteBook science About 2 million STEM articles per year (excl. theses, reports) 70% behind paywalls Most in PDF (arghhh!) BUT… NLM (US NIH) requires its output to be Open and XML So we’re going to read a few million articles. Start at http://europepmc.org And use http://ContentMine.org software packages And http://github.com/petermr/opennotebook
  • 12. summer school xmlsummerschool.com Slide 12 Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 International License Citizen OpenNoteBook science (plant oils) Plants emit “essential oils” (volatile chemicals) which often have antibacterial activities http://github.com/petermr/CEVOpen NOTE: all items are identified by a Wikidata item (Q number)
  • 13. summer school xmlsummerschool.com Slide 13 Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 International License Search EuropePMC with getpapers for plant oils Download 200 papers (30 sec) from EuropePMC getpapers –q “((essential oil) AND (compound))” #EPMC --xml #output format –-outdir oil200/ #makes directory (CProject) –k 200 #cutoff http://github.com/ContentMine/getpapers
  • 14. summer school xmlsummerschool.com Slide 14 Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 International License Download EuropePMC XML into CProject and CTrees Oil186 CProject ├── PMC4391421 CTree │ ├── eupmc_result.json download metadata │ ├── fulltext.xml full JATS text of article ├── PMC5080681 CTree │ ├── eupmc_result.json download metadata │ ├── fulltext.xml full JATS text of article ├── PMC5132230 CTree │ ├── eupmc_result.json download metadata │ ├── fulltext.xml full JATS text of article ├── PMC5203915 CTree │ ├── eupmc_result.json download metadata │ ├── fulltext.xml full JATS text of article NOTE: Tree automatically created by `getpapers` Data can be translated automatically by jats2html.xsl Ctrees are extensible. Let’s create HTML...
  • 15. summer school xmlsummerschool.com Slide 15 Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 International License Citizen OpenNoteBook science (JATS XML) AMI is ContentMine’s Open XML toolkit Converts FROM XML and TO XML where possible Transform JATS-XML to HTML (normally integrated) ami-transform –p oil200 --transform nlm2html.xsl #stylesheet --input fulltext.xml --output scholarly.html http://github.com/petermr/ami3 NOTE: JATS used to be called NLM. In our demo of ami-search this is all hidden.
  • 16. summer school xmlsummerschool.com Slide 16 Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 International License PART of ContentMine stylesheet for JATS2HTML <xsl:template match=“jats:article"> <html:head> <html:style type="text/css” > a {background : #ffffff;} article {border-style : dotted;} </html:style> </html:head> <html:body> <!-- PULL for main sections --> <xsl:apply-templates select=”jats:front"/> <xsl:apply-templates select=”jats:body"/> <xsl:apply-templates select=”jats:back"/> <!-- PUSH for unknown elements --> <xsl:for-each select="*[not( self::jats:front | self::jats:body | self::jats:back)]"> <xsl:call-template name=“unknownElement”/> </xsl:for-each> </html:body> </xsl:template> NOTE: JATS has ca 250 elements, not always used wisely
  • 17. ScholarlyHTML showing JATS markup: <table> <caption> <th> <tr> <xref> <p> <div> <ref>
  • 18. Result of applying NLM2HTML
      Oil186                        CProject
      ├── PMC4391421                CTree
      │   ├── eupmc_result.json     download metadata
      │   ├── fulltext.xml          full JATS text of article
      │   ├── scholarly.html        created by jats2html.xsl
      ├── PMC5080681                CTree
      │   ├── eupmc_result.json     download metadata
      │   ├── fulltext.xml          full JATS text of article
      │   ├── scholarly.html        created by jats2html.xsl
      ├── PMC5132230                CTree
      │   ├── eupmc_result.json     download metadata
      │   ├── fulltext.xml          full JATS text of article
      │   ├── scholarly.html        created by jats2html.xsl
      NOTE: lazy creation (a la “make”)
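"Lazy creation (a la make)" means a derived file is only rebuilt when it is missing or out of date. How AMI decides this internally is not stated on the slide; the sketch below assumes the classic make rule (compare modification times), and `needs_rebuild` and `stale_ctrees` are hypothetical helper names.

```python
from pathlib import Path


def needs_rebuild(source, target):
    """True if target is missing or older than source (the 'make' rule)."""
    source, target = Path(source), Path(target)
    if not target.exists():
        return True
    return target.stat().st_mtime < source.stat().st_mtime


def stale_ctrees(cproject_dir):
    """List CTrees whose scholarly.html should be (re)generated."""
    stale = []
    for ctree in sorted(Path(cproject_dir).iterdir()):
        source = ctree / "fulltext.xml"
        if source.is_file() and needs_rebuild(source, ctree / "scholarly.html"):
            stale.append(ctree.name)
    return stale
```

With this rule, rerunning a transformation over a large CProject only touches the articles that were newly downloaded or changed.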
  • 19. ContentMine Dictionaries for searching
      <dictionary title="myterpenes">
        <desc>terpenes searched from Wikipedia</desc>
        <entry id="C127" term="(−)-menthol" wikipedia="/wiki/menthol" wikidata="Q407418"/>
        <entry id="C2198" term="thymol" wikipedia="/wiki/thymol" wikidata="Q408883"/>
      </dictionary>
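A dictionary in this format is easy to read with a standard XML parser. The sketch below loads the two entries from the slide and does a naive, case-insensitive substring match against a sentence; this matching strategy is an assumption for illustration and is not how ami-search is documented to work. `load_terms` and `find_terms` are hypothetical names.

```python
import xml.etree.ElementTree as ET

DICTIONARY_XML = """
<dictionary title="myterpenes">
  <desc>terpenes searched from Wikipedia</desc>
  <entry id="C127" term="(−)-menthol" wikipedia="/wiki/menthol" wikidata="Q407418"/>
  <entry id="C2198" term="thymol" wikipedia="/wiki/thymol" wikidata="Q408883"/>
</dictionary>
"""


def load_terms(xml_text):
    """Map each lower-cased term to its Wikidata identifier."""
    root = ET.fromstring(xml_text)
    return {
        entry.get("term").lower(): entry.get("wikidata")
        for entry in root.iter("entry")
    }


def find_terms(text, terms):
    """Return the dictionary terms that occur in text (naive substring match)."""
    lowered = text.lower()
    return sorted(term for term in terms if term in lowered)


terms = load_terms(DICTIONARY_XML)
hits = find_terms("Thymol and carvacrol were detected in the oil.", terms)
print([(term, terms[term]) for term in hits])
```

Because every entry carries a Wikidata identifier, a hit can be linked straight to structured data about the compound.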
  • 20. Search scholarly.html for dictionary words
      Search for compounds, species…
      ami-search -p oil1000               #CProject with 1000 papers
          --dictionary                    #list of dictionaries
              species                     #species by regex
              country                     #200 countries
              funders                     #15000 funders
              disease                     #5000 diseases
              mydictionary/compounds.xml  #2100 compounds
              mydictionary/plants.xml     #1800 plants
  • 21. ContentMine CProject and CTrees
      Oil186                                CProject
      ├── PMC4391421                        CTree
      │   ├── eupmc_result.json             download metadata
      │   ├── fulltext.xml                  full JATS text of article
      │   ├── scholarly.html                transformed text (to be searched)
      │   ├── results                       search tree
      │   │   ├── search                    dictionary search
      │   │   │   ├── compound              compound dictionary
      │   │   │   │   └── results.xml       results of search
      │   │   │   ├── country               country dictionary
      │   │   │   │   └── results.xml       results of search
      │   │   │   ├── disease               disease dictionary
      │   │   │   │   └── empty.xml         no results of search
      ...
      │   │   ├── species                   hardcoded pseudo-search
      │   │   │   └── binomial              species (Rattus norvegicus)
      │   │   │       └── results.xml       results of search
      │   │   └── word                      hardcoded word searches
      │   │       └── frequencies           word cloud
      │   │           ├── results.xml       results of search
      │   │           └── results.html      display of search
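Because search output lives at predictable paths inside each CTree, the successful searches can be collected with a glob. This is a sketch under the assumption that the layout is exactly as drawn on the slide (results.xml for hits, empty.xml for none); `result_files` is a hypothetical helper, not an AMI function.

```python
from pathlib import Path


def result_files(cproject_dir):
    """Map each CTree name to the relative paths of its results.xml files."""
    found = {}
    root = Path(cproject_dir)
    # Assumed layout: <cproject>/<ctree>/results/<...>/results.xml
    for path in sorted(root.glob("*/results/**/results.xml")):
        ctree = path.relative_to(root).parts[0]
        found.setdefault(ctree, []).append(
            path.relative_to(root / ctree / "results").as_posix()
        )
    return found
```

Searches that produced only an empty.xml are skipped, so the returned mapping lists just the dictionaries that matched something in each article.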
  • 22. ContentMine XML Search results
      <snippetsTree> <!-- CProject -->
        <snippets file="oil186/PMC5080681/results/species/binomial/results.xml"> <!-- CTree -->
          <!-- the xpath attribute locates the snippet -->
          <result pre="ify the chemical composition, and to assess anthelmintic, antimicrobial and antioxidant effects of "
                  exact="Thymus bovei"
                  xpath="/html[1]/body[1]/div[2]/div[5]/p[1]"
                  match="Thymus bovei"
                  post=" essential oil. "
                  name="binomial"/>
          …
      NOTE: Each snippet has
        • a semantic origin: “species/binomial/” links to the dictionary
        • a separate XML result element
        • W3C annotations (pre, exact, post)
        • an XPath that locates it in the document (W3C Annotations Rec.)
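The pre/exact/post attributes make each hit readable on its own, and the file attribute says which article it came from. Below is a minimal sketch of reading such a file with the standard library. The sample is a shortened copy of the slide's snippet with closing tags added (the slide elides the rest), and deriving the CTree name from the second path segment is an assumption; `read_snippets` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Shortened from the slide; closing tags added so the sample parses.
RESULTS_XML = """
<snippetsTree>
  <snippets file="oil186/PMC5080681/results/species/binomial/results.xml">
    <result pre="antioxidant effects of " exact="Thymus bovei"
            xpath="/html[1]/body[1]/div[2]/div[5]/p[1]"
            match="Thymus bovei" post=" essential oil. " name="binomial"/>
  </snippets>
</snippetsTree>
"""


def read_snippets(xml_text):
    """Yield one dict per <result>, tagged with the CTree it came from."""
    root = ET.fromstring(xml_text)
    for snippets in root.iter("snippets"):
        # Assumed path layout: <cproject>/<ctree>/results/...
        ctree = snippets.get("file").split("/")[1]
        for result in snippets.iter("result"):
            yield {
                "ctree": ctree,
                "exact": result.get("exact"),
                "context": result.get("pre") + result.get("exact") + result.get("post"),
                "xpath": result.get("xpath"),
            }


for row in read_snippets(RESULTS_XML):
    print(row["ctree"], "|", row["context"])
```

Keeping the XPath alongside the text means a viewer can jump back to the exact paragraph in scholarly.html.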
  • 23. Word frequencies
  • 24. Ami-search: 100 ARTICLES, 8 DICTIONARIES
  • 25. Ami-search
  • 26. WIKIPEDIA
  • 27. WIKIDATA
  • 28. Cooccurrence in 186 documents: Plant oils and antibacterial activity
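A co-occurrence table of this kind counts, for every pair of terms, how many documents mention both. The slide does not show how the figure was computed, so the following is only a generic sketch; `cooccurrence` is a hypothetical name and the per-document hits are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(doc_terms):
    """Count, for each pair of terms, the documents that mention both."""
    counts = Counter()
    for terms in doc_terms.values():
        # Deduplicate and sort so each pair is counted once per document.
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts


# Hypothetical hits per document, for illustration only.
hits = {
    "doc1": ["thymol", "Thymus bovei", "antibacterial"],
    "doc2": ["thymol", "antibacterial", "thymol"],
    "doc3": ["menthol"],
}

for pair, n in cooccurrence(hits).most_common():
    print(pair, n)
```

In practice the term lists would come from the search results of each CTree, one list per article.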
  • 29. CONSEQUENCES: Good and Bad. Universally available! … or not
  • 30. All the world’s 5 million FAIR Open Scientific articles (× 0.1 MB each = 0.5 TB), indexed by ContentMine. Disk: 30 GBP. Raspberry Pi 3: 50 GBP. CC BY, PeterMR. Image labels: Disk, Raspberry Pi, Power
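The storage estimate on this slide is simple arithmetic and can be checked directly. The sketch uses only the figures quoted on the slide, with decimal units (1 TB = 10^9 kB) as an assumption, and integer kilobytes to avoid floating-point rounding.

```python
ARTICLES = 5_000_000          # open articles, from the slide
KB_PER_ARTICLE = 100          # 0.1 MB per article, from the slide
KB_PER_TB = 1_000_000_000     # decimal units assumed

total_tb = ARTICLES * KB_PER_ARTICLE / KB_PER_TB

DISK_GBP = 30                 # from the slide
RASPBERRY_PI_GBP = 50         # from the slide
hardware_gbp = DISK_GBP + RASPBERRY_PI_GBP

print(f"{total_tb} TB of articles on about {hardware_gbp} GBP of hardware")
```

At these prices the whole open literature fits on hardware costing less than a single article processing charge.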
  • 31. cc by-nc-sa license. LabHack and Alliance Earth
      1 APC = 1900 USD
      1 bioreactor = 25 USD
      1 Raspberry Pi = 55 USD
      1 submission to bioRxiv = free (10 USD hidden)
      “a PCR machine in the UK is around £6000 but in Zimbabwe about $33000 - try convincing someone to pay APCs when they have to try
      CITIZENS! Zimbabwe. LabHack team from Harare Institute of Technology.
  • 32. @Senficon (Julia Reda): Text & Data mining in times of #copyright maximalism: "Elsevier stopped me doing my research" http://onsnetwork.org/chartgerink/2015/11/16/elsevier-stopped-me-doing-my-research/ … #opencon #TDM
      “Elsevier stopped me doing my research”, Chris Hartgerink
  • 33. Other XML possibilities: Chemistry! And “Turning PDFs into XML … is like turning hamburgers back into a cow” (anon => Michael Kay ==> PeterMR)
  • 34. Chem4Word creates CML (XML for chemistry). Text linked directly to chemistry
  • 35. Chemistry Add-In for Word (Chem4W Chemical structures in Word documents
  • 36. So why is Chem4Word different? Open Source! Easy to use. ‘Open Chemistry’. CML-based. Mineable! Semantically-rich!
  • 37. Worldwide Usage: 400K+ downloads. Colours show Word versions in use
  • 38. New ACME Editor: Advanced CML-based Molecule Editor. Drop-in component; can be used in any WPF or WinForms app. Apache 2.0 License. Will soon support reactions!
  • 39. Who makes Chem4Word? Now a small team of volunteers: Clyde Davies, Mike Williams, Andy Wright, Joe Townsend. https://www.chem4word.co.uk
  • 40. BUT only 10% of scholarly articles are XML. The rest are HTML (usually of the worst kind) or PDF, and the images are PNGs and JPGs. You can’t do anything with that??? Oh yes we can… We can turn Hamburgers into Cows! But for that you’ll have to come to PMR’s presentation