Using publicly available resources
to build a comprehensive knowledgebase of chemical information
by B. Sattarov, R. Zakharov and V. Tkachenko
Science Data Software
Abstract
A variety of public resources on the Internet contain information about various
aspects of the chemical, biological and pharmaceutical domains. The quality and
maturity of these data resources, and the hosting organizations and team sizes behind
them, vary widely. As a consequence, their content cannot always be trusted, and the
effort of extracting information and preparing it for reuse is repeated again and again
at various levels. This problem is especially serious in applications for QSAR, QSPR
and QNAR modeling. On the other hand, the authors of this poster believe, based on
decades of their own experience building various types of chemical, analytical and
biological databases, that the process of building such a knowledgebase can be
systematically described and automated in a tool for building a comprehensive
knowledgebase of chemical information.
We have developed a data mining workflow to collect and standardize chemical data
from open sources, using several simple Python scripts which will be included in an
open source library. Data collection was carried out by HTML parsing and by using the
ChemSpider API. We also used the Python version of the Chemical Validation and
Standardization Platform, which we developed, to standardize the collected data.
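
As an illustration of the HTML parsing step, here is a minimal sketch that uses only the Python standard library. The page markup (a table whose `td` cells with class `name` hold compound names), the `CompoundNameParser` class and the `extract_names` helper are all assumptions made for this example; the poster does not describe the pages that were parsed or the scripts that parsed them.

```python
from html.parser import HTMLParser


class CompoundNameParser(HTMLParser):
    """Collect the text of every <td class="name"> cell (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._in_name_cell = False
        self._buffer = []
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Start buffering text when a name cell opens.
        if tag == "td" and dict(attrs).get("class") == "name":
            self._in_name_cell = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_name_cell:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Close the cell: normalize whitespace and keep non-empty names.
        if tag == "td" and self._in_name_cell:
            name = " ".join("".join(self._buffer).split())
            if name:
                self.names.append(name)
            self._in_name_cell = False


def extract_names(html):
    """Return the compound names found in an HTML string, in page order."""
    parser = CompoundNameParser()
    parser.feed(html)
    parser.close()
    return parser.names


# Inline sample page, so the sketch runs without network access.
SAMPLE_PAGE = """
<table>
  <tr><td class="name">Aspirin</td><td>50-78-2</td></tr>
  <tr><td class="name"> Caffeine </td><td>58-08-2</td></tr>
</table>
"""

print(extract_names(SAMPLE_PAGE))
```

In practice the selector logic would have to be adapted to each source site, which is part of why this extraction effort tends to be repeated.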
Our ChemScrapper allowed us to resolve 19.85% of the names of
biologically active compounds from the MeSH 2017 dataset and to save
these data in JSON and the convenient SDF format.
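
The sketch below shows one way a resolved record could be written to both JSON and SDF. The record layout (`molblock`, `properties`, `synonyms`) is an assumption modeled on the example output caption on this poster, not the actual ChemScrapper schema. The methane mol block is an illustrative placeholder, and `to_sdf_record` and `save` are hypothetical helper names.

```python
import json

# Placeholder record; field names are assumed, not taken from ChemScrapper.
METHANE_MOLBLOCK = (
    "methane\n"
    "  sketch\n"
    "\n"
    "  1  0  0  0  0  0  0  0  0  0999 V2000\n"
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
    "M  END"
)
RECORD = {
    "molblock": METHANE_MOLBLOCK,
    "properties": {"formula": "CH4"},
    "synonyms": ["methane", "marsh gas"],
}


def to_sdf_record(record):
    """Render one record as an SDF entry: mol block, data items, '$$$$'."""
    lines = [record["molblock"].rstrip("\n")]
    fields = dict(record["properties"])
    if record["synonyms"]:
        # A data item may span several lines; one synonym per line here.
        fields["synonyms"] = "\n".join(record["synonyms"])
    for key, value in fields.items():
        # Each data item is a header line, its value, then a blank line.
        lines.extend([f"> <{key}>", str(value), ""])
    lines.append("$$$$")
    return "\n".join(lines) + "\n"


def save(records, json_path, sdf_path):
    """Write the same records to a JSON file and to an SDF file."""
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, indent=2)
    with open(sdf_path, "w", encoding="utf-8") as handle:
        handle.writelines(to_sdf_record(r) for r in records)


print(to_sdf_record(RECORD))
```

Keeping the mol block as an opaque string means the same record serializes to either format without re-parsing the structure.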
The Chemical Validation and Standardization Platform (CVSP),
which we used to standardize chemical structures, can also be used as a
stand-alone platform for SMIRKS-based standardization of any dataset,
thanks to the visual implementation of its Python version's functionality in
Jupyter.
You can see every standardization
rule applied as a SMIRKS string simply
by clicking the SMIRKS button, and
you can download the standardized
dataset as an *.sdf file by checking
the corresponding folder.
Example JSON output with mol block, properties and synonyms
Example CLI
Example input
One of the most productive data mining tools we have created works with the ChemSpider web
API. It allows a user looking for chemical structures or data to work entirely through a
convenient command line interface written in Python, in order to resolve chemical names and
identifiers or to find new data for QSAR/QSPR analysis or for any other purpose.
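
A command line interface of this kind can be sketched with `argparse`. This is an illustration under stated assumptions, not the authors' tool: `resolve_name` is a hypothetical placeholder that looks names up in a small in-memory dictionary where a real tool would query the ChemSpider web API, and the option names and report layout are invented for the example.

```python
import argparse
import json

# Hypothetical stand-in for a web API lookup: maps a name to a SMILES string.
_DEMO_LOOKUP = {
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "water": "O",
}


def resolve_name(name):
    """Placeholder resolver; a real tool would call the web API here."""
    return _DEMO_LOOKUP.get(name.strip().lower())


def build_parser():
    parser = argparse.ArgumentParser(
        description="Resolve chemical names (sketch).")
    parser.add_argument("names", nargs="+", help="chemical names to resolve")
    parser.add_argument("--output", default="-",
                        help="JSON output path, or '-' for stdout")
    return parser


def run(argv):
    """Resolve the names given in argv and report the share resolved."""
    args = build_parser().parse_args(argv)
    results = {name: resolve_name(name) for name in args.names}
    resolved = sum(1 for value in results.values() if value is not None)
    report = {
        "results": results,
        "resolved_percent": round(100.0 * resolved / len(results), 2),
    }
    text = json.dumps(report, indent=2)
    if args.output == "-":
        print(text)
    else:
        with open(args.output, "w", encoding="utf-8") as handle:
            handle.write(text + "\n")
    return report


# Demo call with an explicit argument list instead of sys.argv.
run(["Aspirin", "unobtainium"])
```

Reporting the percentage of names resolved alongside the results gives the same kind of summary figure as the resolution rate quoted above.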
Workflow diagram: data collection (API, HTML parsing), then standardization (CVSP), then the comprehensive knowledgebase of chemical information
Open Science Data Repository (OSDR)
Comprehensive distributed semantic knowledgebase of scientific information
with built-in Machine Learning capabilities
