Living in a world of federated knowledge: challenges, principles, tools and solutions

Fall ACS 2017, Washington, DC
Rick Zakharov¹, Valery Tkachenko¹
¹ Science Data Software, Rockville, MD, United States

We live in a hyperconnected world

Dimensions and complexity of scientific data

Traditional data – relational

Why is it so hard to…
● Competitors?
● What’s the structure?
● Are they in our file?
● What’s similar?
● What’s the target?
● Pharmacology data?
● Known pathways?
● Working on now?
● Connections to disease?
● Expressed in right cell type?
● IP?

Big Data Integration
OpenPHACTS

Virtual Standard FAIR Data Bus
● Other Registries


Open Data Science Platform
● Data sources: social media, electronic notebooks, databases, sensors, medical devices, IoT
● Storage: Data Lake, Curated Repository, Models
● Processing: Curation & Integration, Mining, Analysis & Modeling, Validation
● Outputs: Decision Support for users, model-driven experimental studies

Organize your data in a natural way
● Now-natural folder structure
● Organize your data into collections
● Download anything to your local drive, as long as the security context allows

Chemical processing
● Support for chemical formats
● Chemistry validation and standardization
● Automatic processing and visualization

OSDR - documents
• Integrated text-mining

Convert between formats
● Integrated format transformation
● 50+ data formats
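
One way to picture integrated format transformation is a registry of converters keyed by source and target format. The sketch below is only an illustration using the Python standard library: the registry, the `register` and `convert` names, and the CSV/JSON pair are assumptions made for this example, not OSDR's actual API or its list of supported formats.

```python
import csv
import io
import json
from typing import Callable, Dict, Tuple

# Hypothetical registry: (source_format, target_format) -> converter function.
CONVERTERS: Dict[Tuple[str, str], Callable[[str], str]] = {}


def register(source: str, target: str):
    """Decorator that adds a converter to the registry."""
    def wrap(func: Callable[[str], str]) -> Callable[[str], str]:
        CONVERTERS[(source, target)] = func
        return func
    return wrap


@register("csv", "json")
def csv_to_json(text: str) -> str:
    """Turn CSV text with a header row into a JSON list of objects."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return json.dumps(rows, indent=2)


@register("json", "csv")
def json_to_csv(text: str) -> str:
    """Turn a JSON list of flat objects back into CSV text."""
    rows = json.loads(text)
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def convert(text: str, source: str, target: str) -> str:
    """Look up a converter and apply it, failing loudly for unknown pairs."""
    try:
        converter = CONVERTERS[(source, target)]
    except KeyError:
        raise ValueError(f"no converter registered for {source} -> {target}") from None
    return converter(text)
```

Adding a new format pair then means registering one more function, which keeps each converter small and independently testable.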

Predefined or custom metadata
● Tagging
● Attributes
● Taxonomies
● Ontologies
● Metadata harvesting
● Industry standards

Collaborative data authoring and curation
● Datacite.org support
● Other formats
● Audit trail
● Notifications

Extensive search options
● Search language
● Elasticsearch technology
● Domain-specific search modules
● Search ranking
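
As a rough illustration of how a ranked, filtered search could be expressed on top of Elasticsearch, the sketch below builds a request body in the Elasticsearch Query DSL (`bool`, `multi_match`, `term`). The field names `title`, `description` and `file_type`, the `^2` title boost and the helper name are assumptions made for this example; the slides do not describe the real index mapping or search language.

```python
from typing import Any, Dict, Optional


def build_search_body(text: str, file_type: Optional[str] = None, size: int = 10) -> Dict[str, Any]:
    """Build an Elasticsearch search body for a free-text query.

    Field names are hypothetical. "title^2" boosts title matches so they
    rank above matches found only in the description.
    """
    bool_query: Dict[str, Any] = {
        "must": [
            {"multi_match": {"query": text, "fields": ["title^2", "description"]}}
        ]
    }
    if file_type is not None:
        # Filters narrow the result set without affecting the relevance score.
        bool_query["filter"] = [{"term": {"file_type": file_type}}]
    return {"query": {"bool": bool_query}, "size": size}
```

The returned dictionary can be serialized to JSON and sent to a search endpoint by whichever client the deployment uses.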

Built-in Machine Learning
● Automated ML pipeline
● Pre-built ML modules
● Comparison between different ML algorithms
● NB, NN, RF, SVM, LR
● DNN

Datasets used for evaluating multiple computational methods for activity and chemical property prediction
Model | Datasets used and references | Cutoff for active | Number of molecules and ratio
solubility | Huuskonen J. J Chem Inf Comput Sci 2000 | Log solubility = −5 | 1144 active, 155 inactive, ratio 7.38
probe-like | Litterman N. et al. J Chem Inf Model 2014 | described in reference | 253 active, 69 inactive, ratio 3.67
hERG | Wang S. et al. Mol Pharm 2012 | described in reference | 373 active, 433 inactive, ratio 0.86
KCNQ1 | PubChem BioAssay: AID 2642 [98] | using actives assigned in PubChem | 301,737 active, 3878 inactive, ratio 77.81
Bubonic plague (Yersinia pestis) | PubChem single-point screen BioAssay: AID 898 | active when inhibition ≥50% | 223 active, 139,710 inactive, ratio 0.0016
Chagas disease (Trypanosoma cruzi) | PubChem BioAssay: AID 2044 | EC50 <1 μM and >10-fold difference in cytotoxicity as active | 1692 active, 2363 inactive, ratio 0.72
TB (Mycobacterium tuberculosis) | in vitro bioactivity and cytotoxicity data from MLSMR, CB2, kinase, and ARRA datasets | Mtb activity and acceptable Vero cell cytotoxicity; selectivity index = (MIC or IC90)/CC50 ≥10 | 1434 active, 5789 inactive, ratio 0.25
malaria (Plasmodium falciparum) | CDD Public datasets (MMV, St. Jude, Novartis, and TCAMS) | 3D7 EC50 <10 nM | 175 active, 19,604 inactive, ratio 0.0089
Note: the active/inactive ratios for hERG and KCNQ1 are reversed because we are trying to obtain compounds that are more desirable (active = non-inhibitors).
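
The ratio column is simply the number of actives divided by the number of inactives. As a quick check, the sketch below recomputes it from the counts listed above; the dictionary layout and function name are chosen for this example only.

```python
# (active, inactive, reported ratio) copied from the dataset table.
DATASETS = {
    "solubility": (1144, 155, 7.38),
    "probe-like": (253, 69, 3.67),
    "hERG": (373, 433, 0.86),
    "KCNQ1": (301_737, 3878, 77.81),
    "Bubonic plague": (223, 139_710, 0.0016),
    "Chagas disease": (1692, 2363, 0.72),
    "TB": (1434, 5789, 0.25),
    "malaria": (175, 19_604, 0.0089),
}


def active_ratio(active: int, inactive: int) -> float:
    """Return the active/inactive class ratio."""
    return active / inactive


for name, (active, inactive, reported) in DATASETS.items():
    # Ratios below 0.01 are reported to four decimals, the rest to two.
    digits = 4 if reported < 0.01 else 2
    computed = round(active_ratio(active, inactive), digits)
    print(f"{name}: computed {computed}, reported {reported}")
```

The ratios span several orders of magnitude (from 0.0016 to 77.81), which is worth keeping in mind when comparing metrics across datasets.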

Solubility dataset: selected ROC

Solubility dataset: polar plots of the model evaluation metrics
BNB - Bernoulli Naive Bayes, LLR - Logistic linear regression, ABDT - AdaBoost Decision Trees, RF - Random Forest, SVM - Support Vector Machines, DNN-N - deep neural network with N hidden layers.

AUC for all tested datasets (FCFP6, 1024)
Clark et al. J Chem Inf Model 2015
AUC values | BNB | LLR | ABDT | RF | SVM | DNN-2 | DNN-3 | DNN-4 | DNN-5 | Clark et al.
solubility train | 0.959 | 0.991 | 0.996 | 0.934 | 0.983 | 1.000 | 1.000 | 1.000 | 1.000 | 0.866
solubility test | 0.862 | 0.938 | 0.932 | 0.874 | 0.927 | 0.935 | 0.934 | 0.934 | 0.933 |
probe-like train | 0.989 | 0.932 | 1.000 | 0.984 | 0.995 | 1.000 | 1.000 | 1.000 | 1.000 | 0.757
probe-like test | 0.636 | 0.662 | 0.658 | 0.571 | 0.665 | 0.559 | 0.563 | 0.565 | 0.563 |
hERG train | 0.930 | 0.916 | 0.992 | 0.922 | 0.960 | 1.000 | 1.000 | 1.000 | 1.000 | 0.849
hERG test | 0.842 | 0.853 | 0.844 | 0.834 | 0.864 | 0.840 | 0.841 | 0.841 | 0.840 |
KCNQ train | 0.795 | 0.864 | 0.809 | 0.764 | 0.864 | 1.000 | 1.000 | 1.000 | 1.000 | 0.842
KCNQ test | 0.786 | 0.826 | 0.801 | 0.732 | 0.832 | 0.861 | 0.856 | 0.852 | 0.848 |
Bubonic plague train | 0.956 | 0.946 | 0.985 | 0.895 | 0.992 | 1.000 | 1.000 | 1.000 | 1.000 | 0.810
Bubonic plague test | 0.681 | 0.767 | 0.643 | 0.706 | 0.758 | 0.754 | 0.752 | 0.753 | 0.753 |
Chagas disease train | 0.812 | 0.847 | 0.865 | 0.815 | 0.926 | 1.000 | 1.000 | 1.000 | 1.000 | 0.800
Chagas disease test | 0.731 | 0.763 | 0.768 | 0.732 | 0.789 | 0.790 | 0.791 | 0.790 | 0.789 |
Tuberculosis train | 0.721 | 0.737 | 0.760 | 0.735 | 0.800 | 1.000 | 1.000 | 1.000 | 1.000 | 0.727
Tuberculosis test | 0.671 | 0.681 | 0.676 | 0.679 | 0.695 | 0.687 | 0.684 | 0.688 | 0.685 |
Malaria train | 0.994 | 0.993 | 0.999 | 0.979 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 0.977
Malaria test | 0.984 | 0.982 | 0.966 | 0.953 | 0.975 | 0.975 | 0.975 | 0.974 | 0.974 |
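
For readers who want to see what the AUC numbers measure, here is a minimal pure-Python sketch of ROC AUC computed from ranks (the Mann-Whitney formulation, with tied scores given their average rank). It is a generic textbook method shown for illustration, not the code used to produce the table; the function name and the toy labels and scores are made up for this example.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC: the probability that a random active outscores a random inactive.

    labels are 1 (active) or 0 (inactive); tied scores count as half a win.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one active and one inactive example")

    # Assign 1-based ranks by ascending score, averaging ranks within ties.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy data, not taken from any dataset in the table.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Train/test gap for probe-like DNN-2, using the values in the table.
probe_like_gap = round(1.000 - 0.559, 3)
```

A large gap between train and test AUC, as in the probe-like DNN rows, is the usual sign that a model fits its training set much better than it generalizes.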

Extensible micro-service based architecture

Micro-service
● Single responsibility
● Simple API
● One-pizza size team
● Independent development
● Independent deployment and scaling
● Different services can be implemented using different technologies
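
To make "single responsibility" and "simple API" concrete, here is a minimal sketch of a service that does exactly one thing: it checks whether a JSON record carries a set of required metadata fields. Everything in it is hypothetical (the `/validate` endpoint, the `REQUIRED_FIELDS` list, the class and function names), and it uses only the Python standard library rather than whatever stack OSDR services actually run on.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Any, Dict

# Hypothetical list of fields every record must carry.
REQUIRED_FIELDS = ("title", "creator")


def validate_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """The service's single responsibility: report missing or empty fields."""
    missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
    return {"valid": not missing, "missing": missing}


class ValidationHandler(BaseHTTPRequestHandler):
    """Exposes validate_record as POST /validate with a JSON body."""

    def do_POST(self) -> None:
        if self.path != "/validate":
            self.send_error(404, "unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            record = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "body must be JSON")
            return
        if not isinstance(record, dict):
            self.send_error(400, "body must be a JSON object")
            return
        body = json.dumps(validate_record(record)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 8080) -> HTTPServer:
    """Create the server; call serve_forever() on the result to run it."""
    return HTTPServer(("127.0.0.1", port), ValidationHandler)
```

Because the logic lives in a plain function and the HTTP layer is a thin wrapper, the service can be tested, deployed and scaled on its own, and another team could replace it with a different technology as long as the endpoint contract stays the same.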

Technologies
● Mix of technologies connected through microservices architecture
● Open source toolkits and libraries with permissive licenses
● NoSQL Databases
● Containerization
● Leading practices in CI/CD
● Automated testing, rapid development

Summary
• OSDR is a chemistry data platform
• Supports FAIR data principles
• Can handle specific use cases via modules
• Integrated Machine Learning
• Removes proprietary software barriers
• Uses open source toolkits
• Evolves and improves continuously

Thank you!
On Web: scidatasoft.com
Slides: https://www.slideshare.net/valerytkachenko16
Contact us: info@scidatasoft.com
