Data Enhancing the RSC Archive
Colin Batchelor, Ken Karapetyan, Alexey Pshenichov, Dave Sharpe, Jon Steele, Valery Tkachenko and Antony Williams
ACS New Orleans, April 2013
Overview
•   The big picture
•   Where we’ve been
•   Statistics as well as semantics
•   New directions in experimental data
•   Where we’re going
The big picture
We have journal articles going back to 1841 and the
aim is to extract:
•Every small molecule we can (graphics and text)
•Reactions
•Spectra
•Data in tables
and classify every paper in a way that makes sense
to the reader.
Background
• RSC Publishing moved to an all-XML workflow
  at the turn of the millennium.
• We digitized the backfile (to 1841) in 2005.
• We launched Project Prospect in 2007.
• We acquired ChemSpider in 2009.
RSC Advances

New high-volume journal covering all of chemistry
  launched in 2011.

Need a sensible way of navigating all this.

http://www.rsc.org/advances
http://www.rsc.org/RSCAdvancesSubjects
Strategy

• Use topic modelling: latent Dirichlet allocation (LDA)
  and Gibbs sampling to determine a set of “true” topics
  (Thomas L. Griffiths and Mark Steyvers, “Finding scientific topics”, Proc. Natl. Acad. Sci. USA, 2004, 101, 5228–5235)
• Publishing expertise gives us 12 broad subjects that
  will be intuitive to users
• Merge the first set to form the second
• Tweak
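
The topic-modelling step can be illustrated with a toy collapsed Gibbs sampler for LDA. This is a minimal sketch and not the code used on the RSC corpus: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count and the four tiny documents are all assumptions made for illustration.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents (toy scale)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Count tables: document-topic, topic-word, and per-topic totals.
    n_dk = [[0] * n_topics for _ in docs]
    n_kw = [[0] * V for _ in range(n_topics)]
    n_k = [0] * n_topics

    # Random initial topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = z[d][i]
                # Remove the current token from the counts.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                # Unnormalised conditional probability of each topic for this token.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][v] + beta) / (n_k[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1

    # Rank the words in each topic by how often they were assigned to it.
    top_words = []
    for k in range(n_topics):
        ranked = sorted(range(V), key=lambda v: n_kw[k][v], reverse=True)
        top_words.append([vocab[v] for v in ranked[:3]])
    # Smoothed per-document topic proportions.
    doc_topic = [
        [(n_dk[d][k] + alpha) / (len(doc) + n_topics * alpha) for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    return top_words, doc_topic

# Hypothetical toy corpus; real input would be tokenised article text.
docs = [
    ["ligand", "complex", "metal", "ligand"],
    ["rate", "kinetics", "rate", "constant"],
    ["metal", "complex", "ligand", "metal"],
    ["kinetics", "constant", "rate", "kinetics"],
]
top_words, doc_topic = lda_gibbs(docs, n_topics=2)
```

The per-topic word lists play the role of the word clouds shown to staff, and the per-document proportions are what a later merging step would consume.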
Classify that classification
Generated 128 topics based on 2009 and 2010’s
 articles (> 20000 papers).

Generated Wordle images (www.wordle.net) of
 the topics for internal staff.
Digitally enabling the RSC archive
Classify that classification: results
7 topics (75, 57, 65, 67, 82, 113, 123) were
  rejected for being nonsense.
1 topic (127) was rejected for being too general.
120 topics were classified under the 12 headings
  and given names.

Examples…
Examples
1: “kinetics” → Physical
2: “coordination complexes” → Inorganic
3: “general materials” → Materials
4: “misc. organic” → Organic
5: “bacteria” → Biological + Food and health
6: “theoretical” → Physical
7: “cells” → Biological
8: “water and solution chemistry” → Physical
9: “gels” → Materials
10: “inorganic material properties” → Physical + Inorganic + Materials
11: “general organic” → Organic
12: “coordination chemistry” → Inorganic
13: “photochemistry” → Inorganic + Materials + Energy
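
One way to picture the "merge first set to form second" step is to sum a paper's topic weights into subject scores. The slides do not say how papers are actually assigned to subjects, so the summing rule, the `threshold` value, the function name `subjects_for_paper` and the example paper are assumptions; only the topic-to-subject pairs come from the examples above.

```python
from collections import defaultdict

# Topic number -> subject headings, copied from a few of the examples above.
TOPIC_SUBJECTS = {
    1: ["Physical"],                              # kinetics
    2: ["Inorganic"],                             # coordination complexes
    5: ["Biological", "Food and health"],         # bacteria
    10: ["Physical", "Inorganic", "Materials"],   # inorganic material properties
    13: ["Inorganic", "Materials", "Energy"],     # photochemistry
}

def subjects_for_paper(topic_weights, threshold=0.2):
    """Sum topic weights into subject scores; keep subjects at or above the threshold."""
    scores = defaultdict(float)
    for topic, weight in topic_weights.items():
        # Rejected topics (for example 75) have no entry and contribute nothing.
        for subject in TOPIC_SUBJECTS.get(topic, []):
            scores[subject] += weight
    return sorted(s for s, score in scores.items() if score >= threshold)

# Hypothetical topic proportions for one paper.
paper = {2: 0.5, 13: 0.3, 1: 0.1, 75: 0.1}
print(subjects_for_paper(paper))  # ['Energy', 'Inorganic', 'Materials']
```

Because one topic can feed several headings, a paper can land under more than one subject, which matches the multi-heading examples in the list.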
“Very useful!”
“Superb!”
“… will make it easier for readers to identify papers which might be interesting to them.”
What now?
Shortly rolling out the subject classification to
other general journals:
•Chemical Communications
•Chemical Science
•Journal of Materials Chemistry A, B and C
•New Journal of Chemistry
Beyond Prospect: further steps in text-mining
Migration to Oscar 4
https://bitbucket.org/wwmm/oscar4/wiki/Home
Multiple name-to-structure engines: OPSIN, ACD/Labs, Lexichem
ACD/Labs Dictionary
Better disambiguation
Parallelization with Hadoop
Structure validation and standardization (see later)
Reaction extraction from text (see later)
On an experimental run with names from Organic and Biomolecular Chemistry:

Is any structure returned at all by a given name-to-structure (n2s) engine?

Lexichem = a (2798)
ACD = b (3049)
OPSIN = c (3309)
Structure disagreements

Out of 2588 names where at least one of the engines differed or didn’t return a result:

A = ACD (1538 in total)
B = Lexichem (1301 in total)
C = OPSIN (2097 in total)
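
The two tallies above (how many names each engine resolves, and which names are disputed) can be sketched as follows. This is an illustration only: it calls none of the real engines, the function name `compare_engines` is hypothetical, and the names and structure identifiers are placeholders. A real comparison would need canonical identifiers (for example standard InChI) so that string equality means structural equality.

```python
from collections import Counter

def compare_engines(results):
    """results maps a chemical name to {engine: structure identifier or None}."""
    returned = Counter()   # number of names for which each engine returned a structure
    disputed = []          # names where engines differ or at least one returned nothing
    for name, by_engine in results.items():
        structures = list(by_engine.values())
        for engine, structure in by_engine.items():
            if structure is not None:
                returned[engine] += 1
        if None in structures or len(set(structures)) > 1:
            disputed.append(name)
    return returned, disputed

# Placeholder data: "S1", "S2" and so on stand in for canonical structure identifiers.
results = {
    "name-1": {"OPSIN": "S1", "ACD": "S1", "Lexichem": "S1"},
    "name-2": {"OPSIN": "S2", "ACD": "S2b", "Lexichem": None},
    "name-3": {"OPSIN": "S3", "ACD": None, "Lexichem": None},
}
returned, disputed = compare_engines(results)
print(dict(returned))  # {'OPSIN': 3, 'ACD': 2, 'Lexichem': 1}
print(disputed)        # ['name-2', 'name-3']
```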
Iterations
With the Hadoop cluster, we can mine
thousands of articles a night.

We’re initially iterating over the material back to
2000, for which we have native XML. Then it’s a
case of going back and testing out the OCRed
material.
http://cv.beta.rsc-us.org/
This is the beta site for
•Extracting chemical structures from ChemDraw files
•Most importantly: structure validation and standardization

We will be using this for all of the extracted
structures.
Reaction extraction from text



We have had some preliminary experience of this with the ChemicalTagger
work of Daniel Lowe (NextMove, formerly Cambridge).

To go to ChemSpider Reactions:
       http://csr.dev.rsc-us.org/
Experimental data
We’ve already seen the possibilities for
extracting data from organic experimental
sections, but what about other sorts of data?

Given chemical structures and extracted data
we may be able to start building models and
making them available.
New directions in experimental data (1)
We are working with William Brouwer (Penn
State) to extract data from graphs.

Obviously this is faute de mieux and we’d rather
have the original data, but we’re giving a flavour
of what might be possible.
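
To give a feel for the last step of graph digitization, the sketch below converts pixel positions on a plot image into data values by linear axis calibration. It is not the method used in the collaboration described above: it assumes linear axes, assumes the curve pixels have already been located, and all tick positions, values and the helper name `make_axis_map` are hypothetical.

```python
def make_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping a pixel coordinate to a data value on a linear axis."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale

# Hypothetical calibration: two tick marks read off each axis of a spectrum image.
x_of = make_axis_map(100, 4000.0, 900, 400.0)  # x axis values decrease left to right
y_of = make_axis_map(500, 0.0, 50, 100.0)      # pixel rows grow downwards

# Hypothetical pixel positions along the traced curve.
curve_pixels = [(100, 50), (500, 275), (900, 500)]
points = [(x_of(px), y_of(py)) for px, py in curve_pixels]
```

The recovered points are only as good as the calibration ticks and the image resolution, which is one reason the original data would always be preferable.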
Recent Work
Digitized Spectrum
Comparison of Spectra
And now on ChemSpider…
New directions in experimental data (2)
Dye solar cell data is every bit as systematic as
organic experimental sections.
Human curation of results
Previously: built into partly-manual annotation workflow.

Currently: macro-scale, iterative.

Coming: Challenger
DERA
• DERA will unveil from our archive
  – Chemicals
  – Reactions
  – Figures
  – Spectra/Analytical Data
  – Property Data

  – And yes… it will need curation and filtering!
