Living in a world of federated knowledge:
Challenges, principles, tools and solutions
Fall ACS 2017, Washington, DC
Rick Zakharov¹, Valery Tkachenko¹
¹ Science Data Software, Rockville, MD, United States
We live in a hyperconnected World
Data repositories
Dimensions and complexity of scientific data
Standards and authorities
Traditional data – relational
Chemical data[base]
Why is it so hard to…
● Competitors?
● What’s the structure?
● Are they in our file?
● What’s similar?
● What’s the target?
● Pharmacology data?
● Known pathways?
● Working on now?
● Connections to disease?
● Expressed in right cell type?
● IP?
Big Data Integration
OpenPHACTS
FAIR Data Principles
Virtual Standard FAIR Data Bus
Other Registries
Data Lake
[Diagram labels: Data; Social Media; Electronic Notebooks; Databases; Sensor; Med Dev; IoT; Curated Repository; Models; Curation & Integration; Validation; Decision Support; Analysis & Modeling; Open Data Science Platform; Mining; USERS; Model-driven experimental studies]
Organize your data in a natural way
● Now-natural folder structure
● Organize your data into collections
● You have the option to download anything to your local drive, as long as the security context allows
Chemical processing
● Support for chemical formats
● Chemistry validation and standardization
● Automatic processing and visualization
OSDR - documents
• Integrated text-mining
Other formats
Convert between formats
● Integrated format transformation
● 50+ data formats
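The slides do not show how OSDR implements format transformation. As a rough illustration of the general idea only, the sketch below keeps a registry that maps a (source, target) format pair to a converter function. The CSV and JSON converters, the `CONVERTERS` registry, the `convert` function and the sample record are all hypothetical and are not taken from OSDR.

```python
import csv
import io
import json


def csv_to_json(text: str) -> str:
    """Convert CSV text with a header row into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return json.dumps(rows, indent=2)


def json_to_csv(text: str) -> str:
    """Convert a JSON array of flat objects back into CSV text."""
    rows = json.loads(text)
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical registry: (source format, target format) -> converter function.
CONVERTERS = {
    ("csv", "json"): csv_to_json,
    ("json", "csv"): json_to_csv,
}


def convert(text: str, source: str, target: str) -> str:
    """Look up the converter registered for the requested pair and apply it."""
    try:
        converter = CONVERTERS[(source, target)]
    except KeyError:
        raise ValueError(f"no converter registered for {source} -> {target}")
    return converter(text)


# Made-up sample record, used only to exercise the round trip.
sample = "name,logS\ncompound_a,-4.2\n"
as_json = convert(sample, "csv", "json")
round_trip = convert(as_json, "json", "csv")
```

With a registry like this, supporting another format means registering one more function rather than changing the callers.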
OSDR - mapping and conversion
OSDR - import
OSDR - export
Predefined or custom metadata
Tagging
Attributes
Taxonomies
Ontologies
Metadata Harvesting
Industry Standards
Metadata
Collaborative data authoring and curation
● DataCite.org support
● Other formats
● Audit trail
● Notifications
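The deck only states that DataCite is supported. A minimal sketch of what a completeness check before deposition could look like follows, assuming the six mandatory properties of DataCite Metadata Schema 4 (Identifier, Creator, Title, Publisher, PublicationYear, ResourceType). The flat key names, the `missing_fields` helper, the DOI and the publisher value are hypothetical and do not reflect the DataCite JSON layout or the OSDR data model.

```python
# Assumed mandatory properties, written as hypothetical flat dictionary keys.
REQUIRED_FIELDS = (
    "identifier",
    "creators",
    "titles",
    "publisher",
    "publicationYear",
    "resourceType",
)


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in the record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Example record; the identifier and publisher are placeholders.
record = {
    "identifier": "10.1234/example-doi",
    "creators": ["Rick Zakharov", "Valery Tkachenko"],
    "titles": ["Living in a world of federated knowledge"],
    "publisher": "Example Repository",
    "publicationYear": 2017,
    "resourceType": "Presentation",
}

incomplete = {key: value for key, value in record.items() if key != "publisher"}
print(missing_fields(record))
print(missing_fields(incomplete))
```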
Extensive search options
● Search language
● Elasticsearch technology
● Domain-specific search modules
● Search ranking
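To make the Elasticsearch bullet concrete, here is a sketch of a request body in the Elasticsearch Query DSL that combines a ranked full-text match with exact tag filters. The field names `description` and `tags` and the `build_query` helper are assumptions; the slides do not describe the actual OSDR index mapping or search language.

```python
import json


def build_query(text: str, tags: list, size: int = 10) -> dict:
    """Build a Query DSL body: a scored full-text match plus unscored tag filters."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Clauses under "must" contribute to the relevance score (ranking).
                "must": [{"match": {"description": text}}],
                # Clauses under "filter" only include or exclude documents.
                "filter": [{"term": {"tags": tag}} for tag in tags],
            }
        },
    }


body = build_query("aqueous solubility", ["hERG", "curated"])
print(json.dumps(body, indent=2))
```

The `term` filters assume `tags` is indexed as an exact-value (keyword) field; an analyzed text field would need a `match` clause instead.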
Built-in Machine Learning
● Automated ML pipeline
● Pre-built ML modules
● Comparison between different ML algorithms
● NB, NN, RF, SVM, LR
● DNN
Model Training Pipeline
Datasets used for evaluating multiple computational methods for activity and chemical property prediction

| Model | Datasets used and references | Cutoff for active | Number of molecules and ratio |
|---|---|---|---|
| Solubility | Huuskonen J. J Chem Inf Comput Sci 2000 | Log solubility = −5 | 1144 active, 155 inactive, ratio 7.38 |
| Probe-like | Litterman N. et al. J Chem Inf Model 2014 | Described in reference | 253 active, 69 inactive, ratio 3.67 |
| hERG | Wang S. et al. Mol Pharm 2012 | Described in reference | 373 active, 433 inactive, ratio 0.86 |
| KCNQ1 | PubChem BioAssay: AID 2642 (ref. 98) | Using actives assigned in PubChem | 301,737 active, 3878 inactive, ratio 77.81 |
| Bubonic plague (Yersinia pestis) | PubChem single-point screen BioAssay: AID 898 | Active when inhibition ≥50% | 223 active, 139,710 inactive, ratio 0.0016 |
| Chagas disease (Trypanosoma cruzi) | PubChem BioAssay: AID 2044 | EC50 <1 μM and >10-fold difference in cytotoxicity counted as active | 1692 active, 2363 inactive, ratio 0.72 |
| TB (Mycobacterium tuberculosis) | In vitro bioactivity and cytotoxicity data from MLSMR, CB2, kinase, and ARRA datasets | Mtb activity and acceptable Vero cell cytotoxicity: selectivity index = (MIC or IC90)/CC50 ≥10 | 1434 active, 5789 inactive, ratio 0.25 |
| Malaria (Plasmodium falciparum) | CDD Public datasets (MMV, St. Jude, Novartis, and TCAMS) | 3D7 EC50 <10 nM | 175 active, 19,604 inactive, ratio 0.0089 |

Note: the active/inactive ratios for hERG and KCNQ1 are reversed because we are trying to obtain compounds that are more desirable (active = non-inhibitors).
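The ratio column is simply the number of actives divided by the number of inactives. The short sketch below recomputes it from the counts in the table and also reports the fraction of actives, which makes the class imbalance easier to see. The counts come from the slide; the `DATASETS` dictionary and the helper names are illustrative.

```python
# (active, inactive) counts copied from the dataset table.
DATASETS = {
    "solubility": (1144, 155),
    "probe-like": (253, 69),
    "hERG": (373, 433),
    "KCNQ1": (301737, 3878),
    "bubonic plague": (223, 139710),
    "Chagas disease": (1692, 2363),
    "tuberculosis": (1434, 5789),
    "malaria": (175, 19604),
}


def active_ratio(active: int, inactive: int) -> float:
    """Ratio of active to inactive molecules, as reported in the table."""
    return active / inactive


def active_fraction(active: int, inactive: int) -> float:
    """Share of actives among all molecules in the dataset."""
    return active / (active + inactive)


for name, (active, inactive) in DATASETS.items():
    ratio = active_ratio(active, inactive)
    fraction = active_fraction(active, inactive)
    print(f"{name:16s} ratio={ratio:.4f} active fraction={fraction:.4f}")
```

Ratios far from 1, as for bubonic plague and malaria, indicate heavily imbalanced classes, which is one reason a threshold-independent metric such as AUC is useful when comparing models.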
Solubility dataset: selected ROC
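For readers unfamiliar with the metric behind the ROC plots, the area under the ROC curve (AUC) equals the probability that a randomly chosen active receives a higher score than a randomly chosen inactive, with ties counted as one half. The sketch below computes it directly from that definition. It is a plain-Python illustration with made-up labels and scores, not the implementation used to produce the results in this deck.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    labels: 1 for active, 0 for inactive; scores: higher means more likely active.
    Ties between an active and an inactive count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one active and one inactive example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Made-up example: one inactive (0.7) is scored above one active (0.6).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in dataset size, which is fine for a demonstration; production code would normally use a sort-based computation from an established library.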
Solubility dataset: polar plots of the model evaluation metrics
BNB - Bernoulli Naive Bayes, LLR - Logistic linear regression, ABDT - AdaBoost Decision Trees, RF - Random Forest, SVM - Support Vector Machines, DNN-N - Deep Neural Network (N is the number of hidden layers).
AUC for all tested datasets (FCFP6, 1024)
Clark et al. J Chem Inf Model 2015
| Dataset | Split | BNB | LLR | ABDT | RF | SVM | DNN-2 | DNN-3 | DNN-4 | DNN-5 | Clark et al. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Solubility | train | 0.959 | 0.991 | 0.996 | 0.934 | 0.983 | 1.000 | 1.000 | 1.000 | 1.000 | 0.866 |
| Solubility | test | 0.862 | 0.938 | 0.932 | 0.874 | 0.927 | 0.935 | 0.934 | 0.934 | 0.933 | |
| Probe-like | train | 0.989 | 0.932 | 1.000 | 0.984 | 0.995 | 1.000 | 1.000 | 1.000 | 1.000 | 0.757 |
| Probe-like | test | 0.636 | 0.662 | 0.658 | 0.571 | 0.665 | 0.559 | 0.563 | 0.565 | 0.563 | |
| hERG | train | 0.930 | 0.916 | 0.992 | 0.922 | 0.960 | 1.000 | 1.000 | 1.000 | 1.000 | 0.849 |
| hERG | test | 0.842 | 0.853 | 0.844 | 0.834 | 0.864 | 0.840 | 0.841 | 0.841 | 0.840 | |
| KCNQ | train | 0.795 | 0.864 | 0.809 | 0.764 | 0.864 | 1.000 | 1.000 | 1.000 | 1.000 | 0.842 |
| KCNQ | test | 0.786 | 0.826 | 0.801 | 0.732 | 0.832 | 0.861 | 0.856 | 0.852 | 0.848 | |
| Bubonic plague | train | 0.956 | 0.946 | 0.985 | 0.895 | 0.992 | 1.000 | 1.000 | 1.000 | 1.000 | 0.810 |
| Bubonic plague | test | 0.681 | 0.767 | 0.643 | 0.706 | 0.758 | 0.754 | 0.752 | 0.753 | 0.753 | |
| Chagas disease | train | 0.812 | 0.847 | 0.865 | 0.815 | 0.926 | 1.000 | 1.000 | 1.000 | 1.000 | 0.800 |
| Chagas disease | test | 0.731 | 0.763 | 0.768 | 0.732 | 0.789 | 0.790 | 0.791 | 0.790 | 0.789 | |
| Tuberculosis | train | 0.721 | 0.737 | 0.760 | 0.735 | 0.800 | 1.000 | 1.000 | 1.000 | 1.000 | 0.727 |
| Tuberculosis | test | 0.671 | 0.681 | 0.676 | 0.679 | 0.695 | 0.687 | 0.684 | 0.688 | 0.685 | |
| Malaria | train | 0.994 | 0.993 | 0.999 | 0.979 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 0.977 |
| Malaria | test | 0.984 | 0.982 | 0.966 | 0.953 | 0.975 | 0.975 | 0.975 | 0.974 | 0.974 | |
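One way to read a table like this is to pick, for each dataset, the method with the highest test-set AUC. The sketch below does that using the test rows copied from the table; the `TEST_AUC` dictionary and the `best_method` helper are only an illustration of the bookkeeping and are not part of the original analysis.

```python
METHODS = ["BNB", "LLR", "ABDT", "RF", "SVM", "DNN-2", "DNN-3", "DNN-4", "DNN-5"]

# Test-set AUC values copied from the table, in the column order of METHODS.
TEST_AUC = {
    "solubility": [0.862, 0.938, 0.932, 0.874, 0.927, 0.935, 0.934, 0.934, 0.933],
    "probe-like": [0.636, 0.662, 0.658, 0.571, 0.665, 0.559, 0.563, 0.565, 0.563],
    "hERG": [0.842, 0.853, 0.844, 0.834, 0.864, 0.840, 0.841, 0.841, 0.840],
    "KCNQ": [0.786, 0.826, 0.801, 0.732, 0.832, 0.861, 0.856, 0.852, 0.848],
    "bubonic plague": [0.681, 0.767, 0.643, 0.706, 0.758, 0.754, 0.752, 0.753, 0.753],
    "Chagas disease": [0.731, 0.763, 0.768, 0.732, 0.789, 0.790, 0.791, 0.790, 0.789],
    "tuberculosis": [0.671, 0.681, 0.676, 0.679, 0.695, 0.687, 0.684, 0.688, 0.685],
    "malaria": [0.984, 0.982, 0.966, 0.953, 0.975, 0.975, 0.975, 0.974, 0.974],
}


def best_method(scores):
    """Return (method name, AUC) for the highest score; ties go to the first column."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return METHODS[best_index], scores[best_index]


best = {name: best_method(scores) for name, scores in TEST_AUC.items()}
for name, (method, auc) in best.items():
    print(f"{name:16s} {method:6s} {auc:.3f}")
```

Many of the differences between methods are in the third decimal place, so a ranking like this is a reading aid rather than evidence of a significant difference.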
Prediction pipeline
Extensible micro-service based architecture
Micro-service
● Single responsibility
● Simple API
● One-pizza-size team
● Independent development
● Independent deployment and scaling
● Different services can be implemented using different technologies
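The properties listed above can be illustrated with a deliberately tiny service that has one responsibility and one endpoint. The sketch uses only Python's standard library; the `/ratio` endpoint, the parameter names and the port are hypothetical, and nothing here is taken from the OSDR code base. The request logic lives in a pure function so it can be tested without starting a server.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs, urlparse


def ratio_response(query: str):
    """Pure request logic: return (HTTP status, JSON-serializable body)."""
    params = parse_qs(query)
    try:
        active = int(params["active"][0])
        inactive = int(params["inactive"][0])
        if active < 0 or inactive <= 0:
            raise ValueError("counts out of range")
    except (KeyError, ValueError):
        return 400, {"error": "expected integers: active >= 0 and inactive > 0"}
    return 200, {"ratio": active / inactive}


class RatioHandler(BaseHTTPRequestHandler):
    """Single responsibility: expose the active/inactive ratio over HTTP."""

    def do_GET(self):
        url = urlparse(self.path)
        if url.path == "/ratio":
            status, body = ratio_response(url.query)
        else:
            status, body = 404, {"error": "unknown path"}
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Run the service until interrupted; deliberately not called on import."""
    ThreadingHTTPServer(("127.0.0.1", port), RatioHandler).serve_forever()
```

Calling `serve()` and requesting `/ratio?active=1144&inactive=155` would exercise the endpoint. Because the service only speaks HTTP and JSON, another team could replace it with an implementation in a different language without its clients noticing.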
Technologies
● Mix of technologies connected through microservices architecture
● Open source toolkits and libraries with permissive licenses
● NoSQL Databases
● Containerization
● Leading practices in CI/CD
● Automated testing, rapid development
Summary
• OSDR is a chemistry data platform
• Supports FAIR data principles
• Can handle specific use cases via modules
• Integrated Machine Learning
• Removes proprietary software barriers
• Uses open source toolkits
• Evolves and improves continuously
Thank you!
On Web:
scidatasoft.com
Slides:
https://www.slideshare.net/valerytkachenko16
Contact us:
info@scidatasoft.com






Editor's Notes

  • #5: What about science and chemistry in particular?
  • #9: Remember this: some of these questions are easier to answer than others.
  • #10: Open PHACTS was developed to support the key questions of drug discovery. Business questions have been at the heart of Open PHACTS and have driven the development of the platform. Mx/psa: how was it calculated, and who did it? It is a mash-up, with your data too: the top layer joins the sources together, but you need them all, including commercial ones. Data was provided by many publishers, originally in many formats: relational, SD files and RDF. We worked closely with publishers. Data licensing was a major issue. Over 5 billion triples across 14 datasets, and growing. Hosted on beefy hardware, with data in memory (the aim). Extensive memcaching. Pose complex queries to extract data.
  • #30: The representative polar plots of the model evaluation metrics for the Solubility dataset.
  • #31: In general the DNN models performed well for predictions, except for the AUC performance on the probe-like dataset. By the test-set AUC values in the table, DNN-3 outperforms BNB on 5 of 8 datasets, with hERG essentially tied (0.841 vs. 0.842).