ChemExtractor:
Enhanced Rule-Based Capture and
Identification of PDF Based Property Data
Stuart J. Chalk, Department of Chemistry
University of North Florida
schalk@unf.edu
253rd ACS Meeting April 2017
Outline
 Motivation
 Research Approach
 Analyzing Tabular Data
 Types of Data
 Regular Expressions (Regex)
 Rules and Rulesets
 Examples of Data Extraction
 Contextualizing Data
 Data Storage (MySQL)
 Data Representation (SciData)
 Conclusion
Funding for this project provided by
Motivation
 The Landolt-Börnstein Database is 450+ volumes of
curated chemical property data (18?? to date)
 With the move to data-driven science it is imperative we
leverage the time invested in, and scientific quality of, the
curation of this data
 This high-value data is locked in PDF files
 This data can be made more useful if it is extracted with
its metadata (chemical system, original reference,
physical property with unit, etc.)
Research Approach
 Optical Character Recognition (OCR) is a standard
process to extract text from scanned images
 Accurate extraction of text using OCR in a page layout
format captures not only the text but also the inferred
relationships between tabular data
 Use regular expression (regex) analysis of tabular
text to capture data and its position relative to other
data
 Contextualize captured data with metadata encoded
in the layout of the table and the string format
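As a rough illustration of capturing data together with its position, the sketch below records each block of text in a layout-preserved line along with its column offset. The sample line and variable names are hypothetical, and this is not necessarily how ChemExtractor does it.

```python
import re

# Hypothetical line of layout-preserved OCR text (not taken from the source tables).
line = "T/K    298.15    0.9970    81V1"

# Capture every non-whitespace block together with its starting column,
# so values can later be aligned with the column headers above them.
blocks = [(m.start(), m.group()) for m in re.finditer(r"\S+", line)]

for offset, text in blocks:
    print(f"column {offset:2d}: {text}")
```

Keeping the offset alongside the text is what allows a later step to relate a value to the header or series it sits under.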
Analyzing Tabular Data
[Figure labels: Chemical Metadata; Series of Property Data; Condition; Series Condition; Property Data; Reference]
Types of Data
 Properties and Units
 Conditions, data, supplemental data
 Equation coefficients and variables
 Chemical properties (MW, BP, MP)
 Annotations
 Table headers, column headers, data notes
 Chemical Metadata
 Formula, name, CASRN
 Context Metadata
 Table #, refcodes, property headers, component #
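One of the metadata types listed above, the CASRN, has a fixed string format plus a check digit, so it can be recognized with some confidence. The sketch below is an illustration only (the function name is hypothetical and the talk does not say how ChemExtractor identifies CASRNs): it matches the `digits-2 digits-1 digit` layout and verifies the standard CAS check digit.

```python
import re

# CASRN layout: 2 to 7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CASRN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")

def is_valid_casrn(text: str) -> bool:
    """Return True if text has the CASRN layout and a correct check digit."""
    m = CASRN.match(text)
    if not m:
        return False
    digits = m.group(1) + m.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(m.group(3))

print(is_valid_casrn("7732-18-5"))  # water
```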
Regular Expressions (Regex)
 Implemented in every programming language
 Relatively uniform implementation
 Uses syntax to create a regex string that matches
and/or captures characters in a string
 Groups of characters – ., [A-Z], [a-z], [0-9]
 Character classes – \w, \d, \s, \h
 Repetition – ? (optional), + (1 or more), * (0 or more)
 Capture group – ()
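The syntax elements above can be seen together in a short Python sketch. The sample line and pattern are hypothetical and are not among the rules used in the talk. One implementation difference worth noting: Python's `re` module does not support `\h` (horizontal whitespace), so `[ \t]` is used in its place.

```python
import re

# Hypothetical sample line; the pattern is illustrative only.
line = "NaCl  298.15  35.9"

# [A-Z] and [a-z] are groups of characters, \d is a character class,
# ?, + and * are repetition, and each (...) is a capture group.
# (?:...) groups without capturing. [ \t] stands in for \h.
pattern = re.compile(r"^((?:[A-Z][a-z]?\d*)+)[ \t]+(\d+\.?\d*)[ \t]+(\d+\.?\d*)$")

match = pattern.match(line)
formula, temperature, value = match.groups()
print(formula, temperature, value)
```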
Regex Example 1 (http://regex101.com)
Regex Example 2 (http://regex101.com)
Regex Example 3 (http://regex101.com)
Rules and Rulesets
 We write rules for specific line formats, constructed using
   Rule templates – define the basic structure of a line, i.e.
how many blocks of text need to be captured
   Rule snippets – define small regex strings that capture a particular
format of text
 Rulesets are lists of rules with associated actions that
sequentially process text line by line
^@B1@\h+@B2@\h*@B3@\h*@B4@\h*@B5@\h*@B6@$
(?:[A-Z][a-z]{0,2}\d{0,3}\h?)+(?:.?\d*.?\d*H2O)
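A minimal sketch of how a template and snippets might combine into a rule, and how a ruleset might then be applied line by line. Everything here is an assumption made for illustration: the snippet names, the snippet patterns, the three-block template, and the helper functions are hypothetical. Only the `@Bn@` placeholder style and the rule name are taken from the slides. Python's `re` has no `\h`, so `[ \t]` is substituted.

```python
import re

# Horizontal whitespace; stands in for PCRE's \h, which Python's re lacks.
H = r"[ \t]"

# Hypothetical snippets: small regex fragments for one format of text each.
SNIPPETS = {
    "property": r"[A-Za-z]+/[A-Za-z]+",  # e.g. T/K
    "number": r"\d+\.?\d*",              # e.g. 298.15
    "refcode": r"\d{2}[A-Z]\d+",         # e.g. 81V1 (assumed format)
}

def build_rule(template: str, blocks: dict) -> re.Pattern:
    """Replace each @Bn@ placeholder with a capture group around a snippet."""
    def fill(m: re.Match) -> str:
        return "(" + SNIPPETS[blocks[m.group(1)]] + ")"
    return re.compile(re.sub(r"@(B\d+)@", fill, template))

# A ruleset as an ordered list of (rule name, compiled rule).
RULESET = [
    ("Temperature (K) & refcode",
     build_rule("^@B1@" + H + "+@B2@" + H + "+@B3@$",
                {"B1": "property", "B2": "number", "B3": "refcode"})),
]

def apply_ruleset(lines):
    """Return (rule name, captured blocks) for the first rule matching each line."""
    results = []
    for line in lines:
        for name, rule in RULESET:
            m = rule.match(line.strip())
            if m:
                results.append((name, m.groups()))
                break
    return results

print(apply_ruleset(["T/K 298.15 81V1", "unrelated text"]))
```

Separating templates from snippets means one line structure can be reused with different block formats. The actions attached to each rule are not modelled here beyond reporting which rule fired.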
Examples of Data Extraction
Contextualizing the Data
 This data…
T/K 298.15 () () () () () () () () 81V1 ()
 … is detected by rule ‘Temperature (K) & refcode’
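To make the contextualization step concrete, here is a sketch that turns the captured line into labeled fields. It assumes that `()` marks an empty capture block and that `T/K` reads as a property symbol divided by its unit. Both readings are assumptions, and the function name and output keys are hypothetical.

```python
def contextualize(line: str) -> dict:
    """Label the non-empty blocks of a 'property/unit value refcode' line."""
    # Assumption: "()" denotes a block that captured nothing.
    blocks = [token for token in line.split() if token != "()"]
    label, value, refcode = blocks
    symbol, unit = label.split("/")
    return {
        "property": symbol,
        "unit": unit,
        "value": float(value),
        "refcode": refcode,
    }

record = contextualize("T/K 298.15 () () () () () () () () 81V1 ()")
print(record)
```

A record of this shape carries the value together with its unit and literature reference, which is the metadata the Motivation slide argues should travel with the data.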
Data Storage (MySQL)
Data Presentation
Data Representation (SciData)
Conclusion
 Higher accuracy capture of property data and equation data
(350,000 property data points, 10,000 equations)
 Integrated chemical and reference metadata
 Can be applied to other PDF-based curated datasets
 Open website for community use this summer
 Implementation for research article capture of chemical
property data
Questions?
 schalk@unf.edu
 Phone: 904-620-1938
 Skype: stuartchalk
 LinkedIn: https://www.linkedin.com/in/stuchalk
 ORCID: http://orcid.org/0000-0002-0703-7776
