Chemistry Validation and Standardization Platform
Modularization and “Hadoop”ization
Kenneth Karapetyan, Colin Batchelor, Valery Tkachenko, Antony Williams
ACS New Orleans, April 2013
Overview
•   Motivation
•   What we support
•   Modularization
•   Parallelization
•   Examples
Motivation: validation
Open and free chemical validation system for:
• Structure validation
   – Warn on query atoms, pseudo atoms, polymers, etc.
   – Nonsensical stereo
• SDF field mapping for validating depositor-provided names, InChI, SMILES
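As an illustration of the structure checks listed above, here is a minimal sketch that scans a V2000 molfile for query atoms, R groups and query bonds. It is not CVSP code: the function name, the warning texts and the (partial) symbol list are assumptions made for this example, and it relies only on the fixed-column layout of V2000 atom and bond lines.

```python
from typing import List

# Symbols treated as query atoms in this sketch (an assumed, partial list).
QUERY_ATOMS = {"A", "Q", "*", "L"}
R_GROUP = "R#"
# V2000 bond types 5-8 are query bonds; 8 means "any".
QUERY_BOND_TYPES = {5, 6, 7, 8}


def validate_molfile(molfile: str) -> List[str]:
    """Return warnings for query atoms, R groups and query bonds (hypothetical helper)."""
    lines = molfile.splitlines()
    counts = lines[3]  # the counts line is the fourth line of a molfile
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    warnings = []
    for i, line in enumerate(lines[4:4 + n_atoms], start=1):
        symbol = line[31:34].strip()  # atom symbol occupies columns 32-34
        if symbol in QUERY_ATOMS:
            warnings.append(f"atom {i}: query atom '{symbol}'")
        elif symbol == R_GROUP:
            warnings.append(f"atom {i}: R group")
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        a, b, bond_type = int(line[0:3]), int(line[3:6]), int(line[6:9])
        if bond_type in QUERY_BOND_TYPES:
            warnings.append(f"bond {a}-{b}: query bond type {bond_type}")
    return warnings


def _atom(symbol: str) -> str:
    """Build a fixed-width V2000 atom line at the origin (for the example only)."""
    return f"{0.0:10.4f}{0.0:10.4f}{0.0:10.4f} {symbol:<3} 0  0  0  0  0  0  0  0  0  0  0  0"


EXAMPLE = "\n".join([
    "example", "  sketch", "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    _atom("C"), _atom("A"), _atom("R#"),
    "  1  2  8  0",
    "  1  3  1  0",
    "M  END",
])

for warning in validate_molfile(EXAMPLE):
    print(warning)
```

A real validator would also need to handle V3000 files, atom lists in property lines, polymers (S-groups) and stereo checks, none of which are attempted here.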
Motivation: standardization
Users can run the default CVSP standardization workflow (or another, such as FDA or Open PHACTS).
Users can also assemble their own workflow from the modules provided:
• Apply default CVSP or user-defined SMIRKS rules
• Layout
• Neutralize
• Get canonical tautomer using ChemAxon’s algorithms
• Get biggest organic fragment
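The idea of assembling a workflow from modules can be sketched as an ordered list of functions, each taking a structure and returning a modified one. Everything below is hypothetical: the two modules are toy stand-ins that manipulate SMILES text directly, whereas real modules would work on parsed structures with a cheminformatics toolkit.

```python
from typing import Callable, List

# Each module maps a SMILES string to a SMILES string (an assumption of this sketch).
Module = Callable[[str], str]


def biggest_fragment(smiles: str) -> str:
    """Keep the dot-separated component with the most letters (a crude size proxy)."""
    return max(smiles.split("."), key=lambda frag: sum(ch.isalpha() for ch in frag))


def neutralize_oxyanion(smiles: str) -> str:
    """Toy neutralization: rewrite [O-] as O and nothing else."""
    return smiles.replace("[O-]", "O")


def run_workflow(smiles: str, modules: List[Module]) -> str:
    """Apply the modules one after another, in the order given."""
    for module in modules:
        smiles = module(smiles)
    return smiles


my_workflow = [biggest_fragment, neutralize_oxyanion]
print(run_workflow("CC(=O)[O-].[Na+]", my_workflow))  # prints CC(=O)O
```

Because a workflow is just a list, reordering, removing or adding a module does not require touching the other modules.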
What we support
• SD files and mol files
• ChemDraw files (in-house code)
• Tab-delimited text files of names, InChIs, SMILES
• Zipped files
• GZipped files
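Handling the compressed inputs listed above might look like the following sketch, which detects gzip and zip uploads by their leading bytes and then splits SD text into records at the "$$$$" delimiter. The function names are invented for this example and UTF-8 text is assumed; ChemDraw files are not covered.

```python
import gzip
import io
import zipfile
from typing import List


def unpack(data: bytes) -> List[str]:
    """Return the text of every file in an upload, decompressing gzip or zip input."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        return [gzip.decompress(data).decode("utf-8")]
    if data[:4] == b"PK\x03\x04":  # zip local file header
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return [archive.read(name).decode("utf-8") for name in archive.namelist()]
    return [data.decode("utf-8")]  # plain, uncompressed text


def split_sd_records(text: str) -> List[str]:
    """Split SD file text into records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


sd_text = "mol A\nM  END\n$$$$\nmol B\nM  END\n$$$$\n"
packed = gzip.compress(sd_text.encode("utf-8"))
print(len(split_sd_records(unpack(packed)[0])))  # prints 2
```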
CVSP: modularization
Reusable workflows
SMIRKS-based rules
The RSC chemical validation and standardization platform, a potential path to quality-conscious databases
“Hadoop”ization
Apache Hadoop is a framework for the distributed processing of large data
sets across clusters of computers.

CVSP is written in C#. To run it on Linux machines we use Mono (a cross-platform .NET runtime environment).

Farm:
• 28 CPU cores
• 42 GB memory
• 2 TB disk space

Processor-intensive tasks
• Tautomerization
Processing flow: Input file → Deposit ID in database → Convert to SD format → Upload to farm for processing on Hadoop → Hadoop processing → Download results → Upload results to database for user preview
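The slides do not say how CVSP records are handed to Hadoop. One common approach for non-Java code is Hadoop Streaming, where a mapper reads records from standard input and writes tab-separated key/value lines to standard output. The sketch below assumes that approach and an "id, tab, SMILES" input layout; `standardize` is a hypothetical placeholder for the real per-record processing.

```python
import io
import sys
from typing import Iterable, Iterator


def standardize(smiles: str) -> str:
    """Hypothetical placeholder for the per-record CVSP processing."""
    return smiles.strip()


def map_records(lines: Iterable[str]) -> Iterator[str]:
    """Turn 'id<TAB>SMILES' lines into 'id<TAB>status<TAB>result' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        record_id, _, smiles = line.partition("\t")
        try:
            yield f"{record_id}\tOK\t{standardize(smiles)}"
        except Exception as exc:  # one bad record should not fail the whole task
            yield f"{record_id}\tERROR\t{exc}"


def main(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Streaming entry point: read records from stdin, write results to stdout."""
    for out in map_records(stdin):
        stdout.write(out + "\n")


# Local dry run with in-memory streams instead of a cluster.
result = io.StringIO()
main(io.StringIO("1\tCCO \n2\tc1ccccc1\n"), result)
print(result.getvalue(), end="")
```

Since every record is processed independently, the same mapper can run on any number of input splits in parallel, which is what makes slow per-record steps such as tautomerization a good fit for a cluster.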
Hadoop queues
Three Hadoop queues (capacity queues) are used to prioritize CVSP submissions by size:
•“Small” submission queue for submissions under 500 records
•Large submissions queue
•Internal queue
    – For internal projects, e.g. tautomer analysis of ChemSpider or
       ChemSpider standardization

All records must be processed on Hadoop before the user can see the results (no partial preview).
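The routing rule described above can be written as a small function. Only the 500-record threshold comes from the slide; the queue names and the `internal` flag are assumptions made for this sketch.

```python
SMALL_LIMIT = 500  # from the slide: "small" means under 500 records


def choose_queue(record_count: int, internal: bool = False) -> str:
    """Pick a queue name (names are hypothetical) for a submission."""
    if internal:
        return "internal"
    return "small" if record_count < SMALL_LIMIT else "large"


print(choose_queue(120))                 # prints small
print(choose_queue(6516))                # prints large
print(choose_queue(10, internal=True))   # prints internal
```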
Examples
DrugBank
•~6500 records, approximately 2 records per
second
PubMed
•~100 000 records, about 9 h
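To compare the two examples on the same scale, the quoted figures can be converted into a run time and a rate. This is plain arithmetic on the rounded numbers from the slide, so the results are approximate.

```python
drugbank_records = 6500
drugbank_rate = 2  # records per second, as quoted
drugbank_minutes = drugbank_records / drugbank_rate / 60

pubmed_records = 100_000
pubmed_hours = 9
pubmed_rate = pubmed_records / (pubmed_hours * 3600)

print(f"DrugBank: about {drugbank_minutes:.0f} min")    # about 54 min
print(f"PubMed: about {pubmed_rate:.1f} records/s")     # about 3.1 records/s
```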
Rate-limiting step?
Canonical tautomerization
This molecule took 45 min to canonicalize.
DrugBank dataset (6516 records)
Errors
• 2 records with a query (any) bond
• 2 records with R groups
• 3 polymers
• 18 porphyrins with a metal coordinated inside, with one of the metal-nitrogen bonds marked as stereogenic
• Unusual valence: ~20

Warnings
• InChI not matching structure (100+)
• SMILES not matching structure (100+)
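A mismatch warning of this kind comes from comparing a depositor-provided SD field with the value computed from the structure. The sketch below parses SD data fields and does that comparison. The helper names are invented, the computed values are passed in rather than generated (a toolkit would be needed for that), and the plain string comparison is only adequate for identifiers that are already canonical, such as standard InChI.

```python
from typing import Dict, List


def parse_sd_fields(record: str) -> Dict[str, str]:
    """Collect '> <NAME>' data fields from one SD record."""
    fields, name, value = {}, None, []
    for line in record.splitlines():
        if line.startswith(">") and "<" in line:
            name = line[line.index("<") + 1:line.rindex(">")]
            value = []
        elif name is not None:
            if line.strip() == "":  # a blank line ends the field value
                fields[name] = "\n".join(value)
                name = None
            else:
                value.append(line)
    if name is not None:  # field not followed by a blank line
        fields[name] = "\n".join(value)
    return fields


def check_fields(record: str, computed: Dict[str, str]) -> List[str]:
    """Warn when a provided field differs from the value computed from the structure."""
    provided = parse_sd_fields(record)
    return [f"{key} not matching structure"
            for key, value in computed.items()
            if key in provided and provided[key].strip() != value]


RECORD = "\n".join([
    "M  END",
    "> <InChI>",
    "InChI=1S/CH4O/c1-2/h2H,1H3",  # methanol, deliberately wrong
    "",
    "> <SMILES>",
    "CCO",
    "",
    "$$$$",
])
# Values assumed to have been computed from the structure (ethanol).
COMPUTED = {"InChI": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "SMILES": "CCO"}
print(check_fields(RECORD, COMPUTED))  # prints ['InChI not matching structure']
```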
DrugBank ID: DB00755
InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+




DrugBank ID: DB00614
Stereo issues




J. Brecher, Pure Appl. Chem., 2008, doi:10.1351/pac200880020277




DB08128     DB06287
Please try CVSP at

http://cv.beta.rsc-us.org



Thank you

E-mail: karapetyank@rsc.org, batchelorc@rsc.org
