Desperately seeking DARCP

Desperately seeking curated D-A-R-C-P:
Assessing the past to predict the future
Introduction
Bioscientists reading papers or patents on bioactive chemistry strive to discern the
key relationships reported within a document “D” (e.g. with a PubMed ID) where a
bioactivity “A” with a quantitative result “R” (e.g. an IC50) is reported for chemical
structure “C” that modulates (e.g. inhibits) a protein target “P” (e.g. a UniProt ID).
D – A – R – C – P
While it cannot encompass all mechanistic cases, a useful shorthand for this
connectivity thus becomes DARCP. Biocuration for extraction and structured capture
of this relationship chain in databases has high value that can be explored both
manually and computationally, viz.:
• “D”: clustering by relatedness, entity content, citation networks, connections
via authors and institutions
• “A”: classified by various assay ontologies
• “R”: log transformations (e.g. pIC50 or pKi) for potency ranking and SAR,
sorting by molecular mechanism of action (mmoa), e.g. where A-R indicates
C to be a potent inhibitor of P
• “C”: the full range of cheminformatic analysis including 2D/3D clustering,
property prediction, substructures, analogue searching and chemical ontologies
• “P”: the full range of bioinformatic analysis, including target classes, Gene
Ontology (GO) assignments, pathway annotation, structural homology, disease
associations and genetic variation (e.g. for target validation).
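As a minimal sketch of how one curated D-A-R-C-P chain might be held as a structured record, and how the log transformation under “R” supports potency ranking, consider the following. The class name `DarcpRecord`, its field names, the unit table and all identifiers in the example rows are hypothetical placeholders for illustration; they do not reflect the data model of any of the databases discussed here.

```python
import math
from dataclasses import dataclass

# Multipliers that convert common concentration units to mol/L.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


@dataclass(frozen=True)
class DarcpRecord:
    """One D-A-R-C-P chain (hypothetical structure, for illustration only)."""
    document: str   # D, e.g. a PubMed ID
    activity: str   # A, e.g. "IC50" or "Ki"
    result: float   # R, numeric value in `unit`
    unit: str       # concentration unit of R
    compound: str   # C, e.g. a PubChem CID
    protein: str    # P, e.g. a UniProt ID

    def p_value(self) -> float:
        """Return -log10 of R in mol/L (pIC50, pKi and so on)."""
        molar = self.result * UNIT_TO_MOLAR[self.unit]
        if molar <= 0:
            raise ValueError("result must be a positive concentration")
        return -math.log10(molar)


# Placeholder rows: none of these identifiers refer to real entries.
records = [
    DarcpRecord("PMID:EXAMPLE-1", "Ki", 1.0, "uM", "CID:EXAMPLE-2", "UNIPROT:EXAMPLE"),
    DarcpRecord("PMID:EXAMPLE-1", "IC50", 10.0, "nM", "CID:EXAMPLE-1", "UNIPROT:EXAMPLE"),
    DarcpRecord("PMID:EXAMPLE-2", "IC50", 250.0, "nM", "CID:EXAMPLE-3", "UNIPROT:EXAMPLE"),
]

# Rank by potency: a higher p-value means a lower concentration, i.e. more potent.
ranked = sorted(records, key=lambda r: r.p_value(), reverse=True)
for r in ranked:
    print(f"{r.compound}  {r.activity}  p-value = {r.p_value():.2f}")
```

Putting all results on the same negative log molar scale is what allows values reported in different units to be sorted together; comparing different activity types (e.g. IC50 against Ki) on that scale still needs curatorial judgement.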
The problem the community faces is that we have spent millions burying DARCP in
paywalled PDFs (a.k.a. “Hamburgerisation”) over many decades but must now
spend millions more trying to get it back out.
Assessing the past
The table below shows the statistics of DARCP entity accumulation from three
manually curated resources over approximately the last decade. In the table these are
compared with PubChem wherein these four are integrated as submitting sources
(GtoPdb = IUPHAR/BPS Guide to Pharmacology, PMID 31691834).
Statistical comparisons between databases can be confounded by differences in their
data models, publication selectivity, curatorial practice and activity thresholds.
Nonetheless, discrete entity count can be informative for assessing relative
extraction capture of documents, structures and proteins. The DCP counts are shown
below for the three sources.
[Figure: entity count comparisons for the three sources, with panels for PubMed IDs, PubChem CIDs and Swiss-Prot human IDs]
Christopher Southan, TW2Informatics, Göteborg, Sweden
Interpreting entity count differences
The capture of PMIDs shows a pattern of intersects and differences that is to some
extent also reflected in chemistry and protein targets. Each source has some unique
capture but ChEMBL and BindingDB overlap for ~25K papers (partially due to
collaborative mirroring between them). The total from all four is ~75K PMIDs.
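The intersects and differences described here are plain set operations over each source's identifier list. The sketch below shows one way they could be computed; the function name `venn_regions` and the toy PMID values are hypothetical and are not the real content of any source.

```python
from typing import Dict, Set


def venn_regions(sources: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return the IDs unique to each source, those shared by all, and the union."""
    regions: Dict[str, Set[str]] = {}
    for name, ids in sources.items():
        # Everything held by any *other* source.
        others = set().union(*(s for n, s in sources.items() if n != name))
        regions[f"unique:{name}"] = ids - others
    regions["shared_by_all"] = set.intersection(*sources.values())
    regions["union"] = set().union(*sources.values())
    return regions


# Toy identifier sets (placeholders, not real PMIDs).
pmids = {
    "ChEMBL": {"pmid1", "pmid2", "pmid3", "pmid4"},
    "BindingDB": {"pmid3", "pmid4", "pmid5"},
    "GtoPdb": {"pmid4", "pmid6"},
}

regions = venn_regions(pmids)
pairwise = pmids["ChEMBL"] & pmids["BindingDB"]  # overlap between two sources

for label, ids in regions.items():
    print(f"{label}: {len(ids)}")
print(f"ChEMBL and BindingDB overlap: {len(pairwise)}")
```

The same function applies unchanged to compound or protein identifiers, provided each source's entries have first been mapped to a common identifier type.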
The chemistry (as PubChem identifiers) shows a similar disproportion, with
ChEMBL, as expected, dominating with unique content of ~1.2 million. While
this is skewed by their BioAssay subsumption of ~0.5 million, most has been
extracted from ~35K unique papers. In BindingDB unique structures are mainly
from SAR curation of US Patents. In interpreting these differences we should
also note that GtoPdb extracts on average ~1 lead compound per paper, ChEMBL
~14 per paper and BindingDB ~40 per patent.
For the differences in target coverage (i.e. as “P” in DARCP) further work is needed
to know what selectivity causes this divergence (e.g. journal choice) but some
BindingDB unique proteins are patent-only. While exploring further causes of target
divergence is outside the scope of this work, the total of 3745 human proteins
(with A-R-C modulating chemistry) covered by these three represents ~18% of the
UniProt proteome of 20,365.
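The coverage figure can be checked directly from the two counts quoted above. This is only the arithmetic; the variable names are illustrative.

```python
# Counts as quoted in the text above.
covered_proteins = 3745    # human proteins with A-R-C modulating chemistry
human_proteome = 20365     # UniProt human proteome size used in the text

coverage_pct = 100 * covered_proteins / human_proteome
uncovered = human_proteome - covered_proteins

print(f"Coverage: {coverage_pct:.1f}% of the proteome")
print(f"Proteins without captured A-R-C chemistry: {uncovered}")
```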
So how much could be captured?
While an upper limit is difficult to assess, commercial DARCP extraction sources
such as Excelra GOSTAR and Reaxys Medicinal Chemistry declare curated entity
counts in the range of 6-8 million activity-mapped compounds from ~200-350,000
papers plus ~70-130,000 patents. They also count over 10,000 targets (but not all as
protein identifiers). While there are caveats with comparisons (i.e. not counting the
entities in exactly the same way and no disclosure of entities-in-common) the
indication is that these two sources have captured (very roughly) 4-fold more
DARCP than public efforts, largely due to the larger number of curators employed
or contracted. However, in terms of upper limits for public capture, we must not
overlook issues of data reproducibility arising from the increasingly patchy quality
of PubMed (i.e. many papers from which DARCP should perhaps not be extracted).
Predicting the future
The future flow of DARCP into databases is constrained by the following factors:
• The three resources that continue to capture the majority of open DARCP are to
be congratulated and we hope their funding will be sustained. However, their
capacity is limited by the number of biocurators in the face of increasing
bioactivity publications (an increase that cheminformatics AI may accelerate).
• Progress in entity recognition via Natural Language Processing now means that
the extraction of discrete D, A, R, C and P per se can be automated with
reasonable specificity as well as indexed by resource look-ups in European
PubMed Central (EPMC). However, this has not been achieved for the D-A-R-C-P
relationships that biocurators can discern and extract from documents in minutes.
• The good news on the journal front is that we have J. Med. Chem. supplementary
SMILES listings (occasionally even with activities), Nat. Chem. Biol. pointing to
PubChem entries and Brit. J. Pharmacol. incorporating GtoPdb out-links and
(via those) links to PubChem. The bad news is we will move into 2020 without
even a single journal (from 1000s across the domains of medicinal chemistry,
drug discovery, pharmacology and chemical biology) facilitating author-specified
explicit DARCP automatically piped to databases (e.g. PubChem BioAssay).
• The FAIR initiative (Findable, Accessible, Interoperable, Reusable) is gaining
momentum and should lead to at least discrete D, A, R, C, P annotations flowing
into various repositories. However, the proportion of fully connected D-A-R-C-P
may be low and it is unclear technically how this might flow through to major
databases. For example, there is currently neither push nor pull for DARCP to
flow from Figshare into PubChem BioAssay.
• While Open Access and Plan S are also gaining momentum, paywalls still
seriously impede extraction. The legacy problem is that only 14% of the ~62K
papers extracted by ChEMBL (as indexed in EPMC) are free full text.
• For the future we need publications to facilitate FAIR data extraction. Non-
document surfacing (e.g. open Electronic Notebooks and Wikidata) also needs
encouraging as an alternative to journals. Both trends should increase DARCP
flow into open databases to enable big data mining and knowledge distillation.
N.b. additional details from this work are given in a ChemRxiv preprint
(10.6084/m9.figshare.11295323) that is under consideration by a journal.
https://sites.google.com/view/tw2informatics/home
