Using publicly available resources to build a comprehensive knowledgebase of chemical information


The document discusses the development of a comprehensive knowledgebase for chemical information using publicly available resources while addressing the variability in data quality and reliability. It outlines a systematic and automated data mining workflow utilizing Python scripts, HTML parsing, and the ChemSpider API to standardize and validate chemical data. The authors have created tools for data collection, such as a command line interface for resolving chemical names and identifiers, as well as a chemical validation and standardization platform.


Using publicly available resources
to build a comprehensive knowledgebase of chemical information
by B. Sattarov, R. Zakharov and V. Tkachenko
Science Data Software
Abstract
A variety of public resources on the Internet contain information about
various aspects of the chemical, biological and pharmaceutical domains. The quality and
maturity of these data resources, their hosting organizations and the sizes of the teams
behind them vary widely. As a consequence, their content cannot always be trusted, and
the effort of extracting information and preparing it for reuse is repeated again and
again at various levels. This problem is especially serious in applications for QSAR,
QSPR and QNAR modeling.
On the other hand, the authors of this poster believe, based on decades of experience
building various types of chemical, analytical and biological databases, that the
process of building such a knowledgebase can be systematically described and automated
in a tool for building a comprehensive knowledgebase of chemical information.
We have developed a data mining workflow to collect and standardize chemical data
from open sources, using several simple Python scripts which will be included in an
open source library. Data collection was carried out by HTML parsing and by using the
ChemSpider API. To standardize the collected data, we also used the Python version of
the Chemical Validation and Standardization Platform, which we developed.
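The HTML parsing step of the data collection described above could look roughly like the sketch below. It is not the authors' script: the page layout (a table with one compound per row, name in the first cell and CAS number in the second), the class name CompoundTableParser and the helper parse_compound_table are all assumptions made for illustration, and only the Python standard library is used.

```python
from html.parser import HTMLParser


class CompoundTableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a <td> (headers, whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows without <td> cells, e.g. header rows
                self.rows.append(self._row)
            self._row = None


def parse_compound_table(html):
    """Return one {"name", "cas"} record per table row (assumed layout)."""
    parser = CompoundTableParser()
    parser.feed(html)
    parser.close()
    return [{"name": r[0], "cas": r[1]} for r in parser.rows if len(r) >= 2]


# Hypothetical page fragment standing in for a downloaded HTML document.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>CAS</th></tr>
  <tr><td>Aspirin</td><td>50-78-2</td></tr>
  <tr><td>Caffeine</td><td>58-08-2</td></tr>
</table>
"""

records = parse_compound_table(SAMPLE_HTML)
print(records)
```

In a real workflow the HTML would be fetched from each source page, and the row-to-record mapping would need to be adapted per site, since every resource lays out its data differently.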
Our ChemScrapper allowed us to resolve 19.85% of the names of
biologically active compounds from the MeSH 2017 dataset and to save
this data in JSON and the convenient SDF format.
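As a rough illustration of saving resolved compounds in both formats, the sketch below serializes records holding a mol block, properties and synonyms to JSON and to SDF. The record layout, the function names to_json and to_sdf, the SYNONYMS field name and the single-atom methane mol block are assumptions for this example, not the output schema of the authors' tool. The SDF layout itself (mol block, then data items introduced by a "> <FIELD>" header and closed by a blank line, then a "$$$$" record delimiter) follows the standard SD file convention.

```python
import json

# Illustrative V2000 mol block for methane (single carbon atom, no bonds).
METHANE_MOLBLOCK = "\n".join([
    "methane",
    "  sketch",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])

# Assumed record layout: mol block, properties and synonyms per compound.
records = [
    {
        "molblock": METHANE_MOLBLOCK,
        "properties": {"FORMULA": "CH4", "CAS": "74-82-8"},
        "synonyms": ["methane", "marsh gas"],
    }
]


def to_json(recs):
    """Serialize the records as a JSON array."""
    return json.dumps(recs, indent=2)


def to_sdf(recs):
    """Serialize the records as SDF text, one '$$$$'-terminated entry each."""
    chunks = []
    for rec in recs:
        lines = [rec["molblock"].rstrip("\n")]
        for key, value in rec["properties"].items():
            # Each data item: header line, value line, blank terminator.
            lines += [f">  <{key}>", str(value), ""]
        # Synonyms go into one multi-line data item, one synonym per line.
        lines += [">  <SYNONYMS>", *rec["synonyms"], ""]
        lines.append("$$$$")
        chunks.append("\n".join(lines))
    return "\n".join(chunks) + "\n"


json_text = to_json(records)
sdf_text = to_sdf(records)
print(sdf_text)
```

Keeping one in-memory record shape and deriving both files from it means the JSON and SDF outputs cannot drift apart.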
The Chemical Validation and Standardization Platform (CVSP),
which we used to standardize chemical structures, can also be used as a
stand-alone platform for SMIRKS-based standardization of any dataset,
thanks to a visual implementation of the functionality of its Python version in
Jupyter.
You can see every standardization rule that was applied as a SMIRKS string
simply by clicking the SMIRKS button, and you can download the standardized
dataset as an *.sdf file by checking the corresponding folder.
Figure: Example JSON output with mol block, properties and synonyms
Figure: Example CLI
Figure: Example input
One of the most productive data mining tools we have created works with the ChemSpider web
API. It lets a user who is looking for chemical structures or data work entirely through a
convenient command line interface written in Python, in order to resolve chemical names and
identifiers or to find new data for QSAR/QSPR analysis or any other purpose that requires such data.
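A command line interface of the kind described above might be structured as in the following sketch. It deliberately does not call the ChemSpider web API: the lookup is a small in-memory table standing in for the remote service, and the program name resolve-names, the --indent option and the functions resolve, build_parser and main are hypothetical names chosen for this example.

```python
import argparse
import json

# Stand-in for the remote service; a real tool would query a web API here.
_DEMO_TABLE = {
    "aspirin": {"formula": "C9H8O4", "cas": "50-78-2"},
    "caffeine": {"formula": "C8H10N4O2", "cas": "58-08-2"},
}


def resolve(name, lookup=_DEMO_TABLE.get):
    """Resolve one chemical name; `lookup` can be swapped for an API client."""
    hit = lookup(name.strip().lower())
    return {"query": name, "resolved": hit is not None, "data": hit}


def build_parser():
    parser = argparse.ArgumentParser(
        prog="resolve-names",
        description="Resolve chemical names to identifiers (demo sketch).",
    )
    parser.add_argument("names", nargs="+", help="chemical names to resolve")
    parser.add_argument("--indent", type=int, default=None,
                        help="pretty-print the JSON output with this indent")
    return parser


def main(argv=None):
    """Parse arguments, resolve every name and print the results as JSON."""
    args = build_parser().parse_args(argv)
    results = [resolve(name) for name in args.names]
    print(json.dumps(results, indent=args.indent))
    return results


# Demo invocation with explicit arguments instead of sys.argv.
results = main(["Aspirin", "unobtainium", "--indent", "2"])
```

Passing the lookup as a parameter keeps the command line layer independent of the data source, so the same interface could sit in front of a web API client or a local cache.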
Workflow diagram: Data collection (API, HTML Parsing) -> Standardization (CVSP) -> Comprehensive knowledgebase of chemical information
Open Science Data Repository (OSDR)
Comprehensive distributed semantic knowledgebase of scientific information
with built-in Machine Learning capabilities
