Desperately seeking curated D-A-R-C-P:
Assessing the past to predict the future
Introduction
Bioscientists reading papers or patents on bioactive chemistry strive to discern the
key relationships reported within a document “D” (e.g. with a PubMed ID) where a
bioactivity “A” with a quantitative result “R” (e.g. an IC50) is reported for chemical
structure “C” that modulates (e.g. inhibits) a protein target “P” (e.g. a UniProt ID).
D – A – R – C – P
While it cannot encompass all mechanistic cases, a useful shorthand for this
connectivity thus becomes DARCP. Biocuration for extraction and structured capture
of this relationship chain in databases has high value that can be explored both
manually and computationally, viz.:
• “D”: clustering by relatedness, entity content, citation networks, connections
via authors and institutions
• “A”: classified by various assay ontologies
• “R”: log transformations (e.g. pIC50 or pKi) for potency ranking and SAR,
sorting by molecular mechanism of action (mmoa), (e.g. where A-R indicates
C to be a potent inhibitor of P)
• “C”: the full range of cheminformatic analysis including 2D/3D clustering,
property prediction, substructures, analogue searching and chemical ontologies
• “P”: the full range of bioinformatic analysis including target classes, Gene
Ontology (GO) assignments, pathway annotation, structural homology, disease
associations and genetic variation (e.g. for target validation).
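As an illustration of the structured capture described above, the following is a minimal sketch of one D-A-R-C-P statement as a record, together with the log transformation mentioned under “R”. The class name, field names and the choice of nanomolar input units are assumptions made for this example and do not reflect the schema of any particular database; the identifiers are placeholders.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DarcpRecord:
    """One D-A-R-C-P statement; all field names are illustrative assumptions."""
    document: str      # D, e.g. a PubMed ID
    assay: str         # A, e.g. an assay description or ontology term
    result_type: str   # R, e.g. "IC50" or "Ki"
    result_nm: float   # R value, assumed here to be in nanomolar
    compound: str      # C, e.g. a PubChem CID
    protein: str       # P, e.g. a UniProt accession

    def p_value(self) -> float:
        """Return -log10 of the molar value (i.e. pIC50, pKi and so on)."""
        if self.result_nm <= 0:
            raise ValueError("result must be positive")
        return -math.log10(self.result_nm * 1e-9)


def rank_by_potency(records):
    """Sort records from most to least potent (highest p-value first)."""
    return sorted(records, key=lambda r: r.p_value(), reverse=True)


# Placeholder identifiers, not real curated data.
records = [
    DarcpRecord("PMID:0000001", "enzyme inhibition", "IC50", 250.0, "CID:1", "P00000"),
    DarcpRecord("PMID:0000001", "enzyme inhibition", "IC50", 10.0, "CID:2", "P00000"),
]
for rec in rank_by_potency(records):
    print(rec.compound, round(rec.p_value(), 2))
```

Keeping all five entities in one record is what preserves the connectivity: each of D, A, R, C and P can still be analysed separately, but the relationship between them is not lost.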
The problem the community faces is that we have spent millions burying DARCP in
paywalled PDFs (a.k.a. “Hamburgerisation”) over many decades but must now
spend millions more trying to get it back out.
Assessing the past
The table below shows the statistics of DARCP entity accumulation from three
manually curated resources over approximately the last decade. In the table these are
compared with PubChem wherein these four are integrated as submitting sources
(GtoPdb = IUPHAR/BPS Guide to Pharmacology, PMID 31691834).
Statistical comparisons between databases can be confounded by differences in their
data models, publication selectivity, curatorial practice and activity thresholds.
Nonetheless, discrete entity count can be informative for assessing relative
extraction capture of documents, structures and proteins. The DCP counts are shown
below for the three sources.
[Figure: entity counts for the three sources, with panels for PubMed IDs, PubChem CIDs and Swiss-Prot human IDs]
Christopher Southan, TW2Informatics, Göteborg, Sweden
Interpreting entity count differences
The capture of PMIDs shows a pattern of intersects and differences that is to some
extent also reflected in chemistry and protein targets. Each source has some unique
capture but ChEMBL and BindingDB overlap for ~25K papers (partially due to
collaborative mirroring between them). The total from all four is ~75K PMIDs.
The chemistry (as PubChem identifiers) shows a similar disproportion, with
ChEMBL, as expected, dominating with unique content of ~1.2 million. While
this is skewed by their BioAssay subsumption of ~0.5 million, most has been
extracted from ~35K unique papers. In BindingDB unique structures are mainly
from SAR curation of US patents. In terms of interpreting differences we should
also note that GtoPdb extracts on average ~1 lead compound per paper, ChEMBL
~14 per paper and BindingDB ~40 per patent.
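The intersects and differences discussed above amount to set operations on identifier lists. The sketch below shows one way to compute them, assuming each source has already been reduced to a set of identifiers (PMIDs, CIDs or protein accessions). The function name and the toy sets are invented for illustration and are not the data behind the counts quoted here.

```python
def overlap_summary(sources):
    """Return the union size, unique-to-source counts and pairwise overlaps.

    `sources` maps a source name to a set of identifiers (e.g. PMIDs).
    """
    names = sorted(sources)
    summary = {"union": len(set().union(*sources.values()))}
    for name in names:
        # Everything held by the other sources, to isolate unique content.
        others = set().union(*(sources[n] for n in names if n != name))
        summary[f"unique:{name}"] = len(sources[name] - others)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            summary[f"shared:{a}&{b}"] = len(sources[a] & sources[b])
    return summary


# Toy identifier sets, invented for illustration only.
toy = {
    "ChEMBL": {1, 2, 3, 4},
    "BindingDB": {3, 4, 5},
    "GtoPdb": {4, 6},
}
print(overlap_summary(toy))
```

The same function applies unchanged to structures and proteins, provided the identifiers have first been mapped to a common namespace (PubChem CIDs or UniProt accessions, respectively).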
For the differences in target coverage (i.e. as “P” in DARCP) further work is needed
to know what selectivity causes this divergence (e.g. journal choice) but some
BindingDB unique proteins are patent-only. While exploring further causes of target
divergence is outside the scope of this work, the total of 3745 human proteins
(with A-R-C modulating chemistry) covered by these three represents ~18% of the
UniProt human proteome of 20,365.
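The coverage figure can be checked directly from the two counts given in the text. This is only the arithmetic; the variable names are illustrative.

```python
covered_proteins = 3745   # human proteins with A-R-C chemistry (from the text)
human_proteome = 20365    # UniProt human proteome size (from the text)

coverage = covered_proteins / human_proteome
print(f"{coverage:.1%}")  # prints 18.4%
```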
So how much could be captured?
While an upper limit is difficult to assess, commercial DARCP extraction sources
such as Excelra GOSTAR and Reaxys Medicinal Chemistry declare curated entity
counts in the range of 6-8 million activity-mapped compounds from ~200-350,000
papers plus ~70-130,000 patents. They also count over 10,000 targets (but not all as
protein identifiers). While there are caveats with comparisons (i.e. not counting the
entities in exactly the same way and no disclosure of entities-in-common) the
indication is that these two sources have captured (very roughly) 4-fold more
DARCP than public efforts, largely due to the larger number of curators employed
or contracted. However, in terms of upper limits for public capture, we must not
overlook issues of data reproducibility arising from the increasingly patchy quality
of PubMed (i.e. many papers from which DARCP should perhaps not be extracted).
Predicting the future
The future flow of DARCP into databases is constrained by the following factors:
• The three resources that continue to capture the majority of open DARCP are to
be congratulated and we hope their funding will be sustained. However, their
capacity is limited by the number of biocurators in the face of increasing
bioactivity publications (an increase that cheminformatics AI may accelerate).
• Progress in entity recognition via Natural Language Processing now means that
the extraction of discrete D, A, R, C and P per se can be automated with
reasonable specificity as well as indexed by resource look-ups in European
PubMed Central (EPMC). However, this has not been achieved for D-A-R-C-P
relationships that biocurators can discern and extract from documents in minutes.
• The good news on the journal front is that we have J. Med. Chem. supplementary
SMILES listings (occasionally even with activities), Nat. Chem. Biol. pointing to
PubChem entries and Br. J. Pharmacol. incorporating GtoPdb out-links and
(via those) links to PubChem. The bad news is we will move into 2020 without
even a single journal (from 1000s across the domains of medicinal chemistry,
drug discovery, pharmacology and chemical biology) facilitating author-specified
explicit DARCP automatically piped to databases (e.g. PubChem BioAssay).
• The FAIR initiative (Findable, Accessible, Interoperable, Reusable) is gaining
momentum and should lead to at least discrete D, A, R, C, P annotations flowing
into various repositories. However, the proportion of fully connected D-A-R-C-P
may be low and it is unclear technically how this might flow through to major
databases. For example, there is currently neither push nor pull for DARCP to
flow from Figshare into PubChem BioAssay.
• While Open Access and Plan S are also gaining momentum, paywalls still
seriously impede extraction. The legacy problem is that only 14% of the ~62K
papers extracted by ChEMBL (as indexed in EPMC) are free full text.
• For the future we need publications to facilitate FAIR data extraction.
Non-document surfacing (e.g. open Electronic Notebooks and Wikidata) also needs
encouraging as an alternative to journals. Both trends should increase DARCP
flow into open databases to enable big data mining and knowledge distillation.
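To make the idea of author-specified explicit DARCP more concrete, the sketch below shows one hypothetical form it could take: a row per D-A-R-C-P statement that is checked for well-formed identifiers before being serialised for onward transfer. The column names, the nanomolar unit and the JSON Lines output are assumptions made for illustration, and no journal or database is known to accept this exact format. The accession pattern is the one given in the UniProt help pages.

```python
import json
import re

# Accession pattern as published in the UniProt help pages.
UNIPROT_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

# Hypothetical column names for one D-A-R-C-P row.
REQUIRED = ("pmid", "assay", "result_type", "value_nm", "cid", "uniprot")


def validate_row(row):
    """Return a list of problems found in one author-supplied row."""
    problems = [f"missing {k}" for k in REQUIRED if row.get(k) in (None, "")]
    if problems:
        return problems
    if not str(row["pmid"]).isdigit():
        problems.append("pmid is not numeric")
    if not str(row["cid"]).isdigit():
        problems.append("cid is not numeric")
    if not UNIPROT_RE.fullmatch(str(row["uniprot"])):
        problems.append("uniprot accession is malformed")
    try:
        if float(row["value_nm"]) <= 0:
            problems.append("value_nm must be positive")
    except (TypeError, ValueError):
        problems.append("value_nm is not a number")
    return problems


def to_json_lines(rows):
    """Serialise only the rows that pass validation, one JSON object per line."""
    return "\n".join(
        json.dumps(r, sort_keys=True) for r in rows if not validate_row(r)
    )


# Placeholder rows, not real data.
good = {"pmid": "0000001", "assay": "enzyme inhibition", "result_type": "IC50",
        "value_nm": 10.0, "cid": "1", "uniprot": "P00000"}
bad = dict(good, uniprot="not-an-accession")
print(validate_row(bad))
print(to_json_lines([good, bad]))
```

Syntactic checks of this kind only confirm that identifiers are well-formed. Whether the stated relationship is correct would still need a curator, or a look-up against the source databases.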
N.B. Additional details from this work are given in a ChemRxiv preprint
(10.6084/m9.figshare.11295323) that is under consideration by a journal.
https://sites.google.com/view/tw2informatics/home
