253rd ACS National Meeting, San Francisco CA, USA 4th April 2017
Automatic extraction of bioactivity
data from patents
Daniel Lowe*, Stefan Senger† and Roger Sayle*
*NextMove Software Cambridge, UK
†GlaxoSmithKline, Stevenage, UK
Example Use cases
• “A patent has recently come out on a topic of
interest, can the key compounds be extracted
with their activity data?”
• “Which compounds have been found to be
active against this target?”
US Patent data freely available
patents.reedtech.com
(Or from the USPTO: bulkdata.uspto.gov)
[Figure: patent page in which the highlighted terms are text-mined]
What are these compounds?
Understanding table semantics
SureChEMBL Google Patents
After text-mining for chemical entities:
Green = substituent
Purple = molecule
Source: US20170050925A9
SureChEMBL
Google Patents
Patent PDF
PatFetch (NextMove Software)
Source: US20010016661A1
Understanding table semantics
5 columns
6 columns
• Columns are merged so that the header and body
have the same number of columns
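One way to picture the column merging is the minimal Python sketch below. It assumes each cell comes with a horizontal extent (left and right edge), which is an assumption made for illustration and not something the slides state. The names `Cell`, `overlap` and `merge_row_to_header` are hypothetical, and this is not presented as the method used in the talk.

```python
from typing import List, Tuple

# Hypothetical cell representation: (left edge, right edge, text).
Cell = Tuple[float, float, str]


def overlap(a: Cell, b: Cell) -> float:
    """Length of the horizontal overlap between two cells (0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def merge_row_to_header(header: List[Cell], row: List[Cell]) -> List[str]:
    """Assign each body cell to the header column it overlaps most,
    joining the text of body cells that land in the same header column."""
    merged: List[List[str]] = [[] for _ in header]
    for cell in row:
        best = max(range(len(header)), key=lambda i: overlap(header[i], cell))
        if cell[2]:
            merged[best].append(cell[2])
    return [" ".join(parts) for parts in merged]


# Toy example: a 3-column header over a 4-column body row.
header = [(0, 10, "Ex."), (10, 30, "Name"), (30, 50, "IC50 (nM)")]
row = [(0, 10, "1"), (10, 30, "aspirin"), (30, 38, "<"), (38, 50, "100")]
print(merge_row_to_header(header, row))
```

In this toy row the "<" and "100" cells both fall under the "IC50 (nM)" header, so they are merged into a single value and the row ends up with as many columns as the header.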
Getting the compound
structures
• Chemical names
• Chemical sketches
• R-group tables
• Compound identifier associated with any of
the above
Chemical names
• OPSIN (Open Parser for Systematic IUPAC
Nomenclature)
• Dictionaries (ChEMBL/PubChem/NextMove)
• Chemical line formula parsing, especially
useful for peptide names and R-group
definitions
Chemical sketches
• Utilize the ChemDraw sketches provided by
the USPTO
• Detection and handling of repeat brackets and
positional variation
• Fixing obvious errors e.g. undervalent
nitrogen near to H atom with no bond
• Labels reinterpreted
Formula Interpretation
[Table with columns: Input, ChemDraw 15, This work]
Inputs: HATU, C4F9, H3PO4, CON(cHex)2, III-2
[A "No result" entry appears on the rows for CON(cHex)2 and III-2; the structure depictions are not recoverable from the transcript]
R-group tables
Resolving Identifiers
• Need to “name space” identifiers
– “Compound 1”, “Reference compound 1”,
“Example 1”
– But “Compound 1” = “cmpd 1” = “cpd. #1”
• Where a column is just called “#”, is it a
compound number, an example number or just a
table row number?
• Identifier may be defined multiple times e.g.
as a sketch and chemical name
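The name-spacing idea can be illustrated with a short Python sketch. It assumes a small, hand-written synonym table (`NAMESPACES`) and a regular expression, both hypothetical and chosen only to cover the spellings quoted on the slide. The talk does not describe its actual implementation.

```python
import re
from typing import Optional, Tuple

# Hypothetical synonym table: spelling variant -> canonical name space.
NAMESPACES = {
    "compound": "compound",
    "cmpd": "compound",
    "cpd": "compound",
    "reference compound": "reference compound",
    "example": "example",
    "ex": "example",
}

# Label, optional full stop, optional "#" or "No.", then the number itself.
_ID_PATTERN = re.compile(
    r"^\s*(?P<label>[A-Za-z][A-Za-z ]*?)\.?\s*(?:#|no\.?)?\s*"
    r"(?P<num>\d+[A-Za-z]?)\s*$",
    re.IGNORECASE,
)


def normalize_identifier(text: str) -> Optional[Tuple[str, str]]:
    """Return (name space, number) for a compound identifier, or None
    when the text has no recognisable label (e.g. a bare "#")."""
    match = _ID_PATTERN.match(text)
    if match is None:
        return None
    label = " ".join(match.group("label").lower().split())
    namespace = NAMESPACES.get(label)
    if namespace is None:
        return None
    return namespace, match.group("num").lower()


for raw in ["Compound 1", "cmpd 1", "cpd. #1", "Reference compound 1", "Example 1"]:
    print(raw, "->", normalize_identifier(raw))
```

With this table, "Compound 1", "cmpd 1" and "cpd. #1" all map to the same key, while "Reference compound 1" and "Example 1" stay in their own name spaces. A bare "#" column header returns `None`, which is the ambiguous case the slide warns about.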
Resolving Identifiers
(text-mining)
Resolving Identifiers
(Sketches)
Resolving Identifiers
(Tables)
Extracting compound-activity
relationships
Excel table export
Extracting compound-activity
relationships
What is the
target?
Assay identification
• Naïve Bayes classifier trained from assay
descriptions identified by BindingDB curators
• 10-fold cross validation: 98.9% recall, 94.7%
precision
• Paragraph associated with next table or table
mentioned in paragraph
• Target/organism detected
• Care taken to avoid common irrelevant
organisms/proteins e.g. bovine serum albumin
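As a rough illustration of the classification step, the sketch below implements a tiny multinomial Naïve Bayes classifier over bag-of-words features in plain Python. The feature choice, the Laplace smoothing and the four training sentences are all assumptions made for this example. The real classifier was trained on assay descriptions identified by BindingDB curators, which are not reproduced here.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = {}
        self.doc_counts: Counter = Counter()
        self.vocab: set = set()

    def fit(self, examples: List[Tuple[str, str]]) -> "NaiveBayes":
        for text, label in examples:
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for token in tokenize(text):
                counts[token] += 1
                self.vocab.add(token)
        return self

    def predict(self, text: str) -> str:
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for token in tokenize(text):
                if token in self.vocab:  # words never seen in training are ignored
                    score += math.log(
                        (counts[token] + 1) / (total_words + len(self.vocab))
                    )
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up toy training data, for illustration only.
TRAIN = [
    ("Inhibition of kinase activity was measured and IC50 values determined", "assay"),
    ("Compounds were tested in a binding assay against the human receptor", "assay"),
    ("The mixture was stirred at room temperature and concentrated in vacuo", "other"),
    ("The residue was purified by column chromatography to give the title compound", "other"),
]

model = NaiveBayes().fit(TRAIN)
print(model.predict("IC50 values were measured in an enzyme inhibition assay"))
```

A paragraph classified as an assay description would then be linked to the next table, or to the table it mentions, as described in the bullets above.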
Results
Results From US Patent
applications (2001-Mar 2017)
Red = Bioactivity
Activities with associated
structures per year
[Chart: activity-structure relationships extracted (y-axis, 0 to 600,000) by publication year (x-axis)]
Comparison with BindingDB
• Activity data from ~1500 US patent grants (2013-
2016) manually extracted over the course of 3 years
• ~150,000 activities
• Comparison done on the subset that was made
available in ChEMBL 22_1 (98,898 activity values,
1012 patents)
• As some assay results are missed by the automatic
extraction, and some are considered out of scope by
BindingDB, it is difficult to distinguish differences in
coverage from genuine disagreements
Comparison with BindingDB
• Values normalized into nM
– 1000s of instances of measurements in nanometers!
• Midpoint of ranges taken
• Structures compared by StdInChI
• Target name normalized to ChEMBL target ID
(organism specific), using either:
– ChEMBL target synonyms
– Normalize to HGNC symbol and check if HGNC symbol is a
ChEMBL target synonym
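The value normalization can be sketched in a few lines of Python. The unit table, the regular expression and the `assume_nm_is_nanomolar` flag are hypothetical names introduced for this example. In particular, whether "nm" should be read as nM is an assumption left to the caller here, since the slides only note that such cases are common.

```python
import re
from typing import Optional

# Multipliers that convert a molar concentration unit to nanomolar.
TO_NANOMOLAR = {
    "M": 1e9, "mM": 1e6, "uM": 1e3, "µM": 1e3, "μM": 1e3, "nM": 1.0, "pM": 1e-3,
}

# A number, an optional "-" or "to" followed by a second number, then a unit.
_VALUE = re.compile(
    r"^\s*(?P<low>\d+(?:\.\d+)?)\s*"
    r"(?:(?:-|to)\s*(?P<high>\d+(?:\.\d+)?))?\s*(?P<unit>\S+)\s*$"
)


def to_nanomolar(text: str, assume_nm_is_nanomolar: bool = False) -> Optional[float]:
    """Convert a value such as "0.5 uM" or "10-100 nM" to nM.

    Ranges are replaced by their midpoint. Unknown units return None.
    """
    match = _VALUE.match(text)
    if match is None:
        return None
    unit = match.group("unit")
    if unit == "nm" and assume_nm_is_nanomolar:
        unit = "nM"  # "nanometers" written where nanomolar was meant
    factor = TO_NANOMOLAR.get(unit)
    if factor is None:
        return None
    low = float(match.group("low"))
    high = float(match.group("high")) if match.group("high") else low
    return (low + high) / 2 * factor


print(to_nanomolar("0.5 uM"))     # single value
print(to_nanomolar("10-100 nM"))  # range, midpoint taken
print(to_nanomolar("12 nm", assume_nm_is_nanomolar=True))
```

Keeping the "nm" handling behind an explicit flag makes the guess visible instead of silently rewriting the unit.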
Comparison
Expected values found: 75%
Expected structures found: 65%
Expected value + structure found: 53%
Expected value + structure + target: 18%
Unclear structure assignment
Stereochemistry and salts
[Figure: structure depictions labelled Patent, BindingDB and This work; the drawings are not recoverable from the transcript]
Long tail of difficult cases
What does this
superscript term
mean?
What are the
units?
Targets of patent data compared
to journal data
[Two charts of common target classes, one for ChEMBL 22_1 (excluding BindingDB) and one for US patent applications; y-axis: % per year (0% to 40%), x-axis: 2002 to 2016]
Target classes shown: Kinase, GPCR (Family A), Protease, Nuclear receptor, Voltage-gated ion channel, Electrochemical transporter, Oxidoreductase
Upcoming target classes
[Chart: percentage of documents with activity values against target class (y-axis, 0.0% to 3.0%) by year (x-axis, 2002 to 2016)]
Series: Epigenetic writer (Patents), Epigenetic reader (Patents), Epigenetic writer (ChEMBL ex BindingDB), Epigenetic reader (ChEMBL ex BindingDB)
Future work
• Support for more complex R-group tables
• Improve recognition and resolution of protein
target names
• Support for activities specified in text e.g.
Example 1 has an IC50 of 12 nM measured at rat EP4
• Resolution of symbols for activity ranges e.g.
“A” indicates an IC50 value of less than 100 nM
• Improve assay metadata extraction
cf. BioAssay Express
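The talk lists activities stated in running text as future work, so the following is only an illustration of what such extraction might look like, not something it reports. The pattern is a hypothetical regular expression written around the example sentence on the slide, and it would miss most real-world phrasings.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern covering sentences shaped like the slide's example.
_ACTIVITY = re.compile(
    r"(?P<identifier>(?:Example|Compound)\s+\d+[A-Za-z]?)\s+has\s+an?\s+"
    r"(?P<measure>IC50|EC50|Ki|Kd)\s+of\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[munpµμ]?M)\b"
    r"(?:\s+measured\s+(?:at|against)\s+(?P<target>[^.;,]+))?"
)


def extract_activity(sentence: str) -> Optional[Dict[str, str]]:
    """Return the identifier, measure, value, unit and (if present) target
    found in a sentence, or None when the sentence does not match."""
    match = _ACTIVITY.search(sentence)
    if match is None:
        return None
    return {
        key: value.strip()
        for key, value in match.groupdict().items()
        if value is not None
    }


print(extract_activity("Example 1 has an IC50 of 12 nM measured at rat EP4"))
```

The extracted identifier still has to be resolved to a structure, and the target text ("rat EP4") to an organism-specific target, using the same steps as for table data.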
Disambiguation of Conflicting
structure descriptions
Image from
original filing
Redrawn by US
patent office in
ChemDraw
Intended
structure from
chemical name
Conclusions
• Processing all US patents from 2001 to present
can be done in less than a day on a desktop PC
• Technique applicable to chemical properties
other than activity values
• Compound number <-> structure relationships
useful for key compound identification
• For the majority of patents, extracting
structure-activity relationships can be
significantly expedited
Acknowledgements
• Noel O'Boyle
• John Mayfield
• Funding provided by:
Thank you for your time!
http://nextmovesoftware.com
http://nextmovesoftware.com/blog
daniel@nextmovesoftware.com
