Advanced grammars for state-of-the-art named entity recognition (NER)
Roger Sayle and Daniel Lowe
NextMove Software, Cambridge, UK
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Overview
• NextMove Software’s LeadMine text-mining engine
internally uses “CaffeineFix” (.cfx) technology for
specifying and efficiently matching important terms.
• In addition to case-sensitive and case-insensitive
term matching, CaffeineFix/LeadMine also supports
spelling correction (fuzzy matching).
• The most common usage is to simply compile
dictionaries into binary form for fast matching.
• Advanced users can specify “regular expressions”.
• In this presentation, we go beyond REGEXPs.
LeadMine v2 entity types
1. Chemicals
   1.1 Dictionary Names
       1.1.1 Abbreviations
       1.1.2 CAS RN Numbers
       1.1.3 Registry Numbers
   1.2 Systematic Names
       1.2.1 Functional Groups
       1.2.2 Elements
       1.2.3 Acids
       1.2.4 SMILES
       1.2.5 InChIs
   1.3 Generic Classes
   1.4 Polymers
   1.5 Formulae
2. Biomolecules
   2.1 Proteins
       2.1.1 Targets
       2.1.2 P450s
   2.2 Genes
   2.3 E.C. Numbers
   2.4 PDB Codes
3. Anatomy
   3.1 Cell Types
   3.2 Cytogenetic Loci
4. Cell Lines
5. Diseases
6. Symptoms
7. Mechanisms of Action
8. Species/Organisms
9. Companies
10. Named Reactions
11. Regions
12. Languages/Possessives
Named entity normal forms
• Chemicals: SMILES and/or InChI
• Proteins: UniProt
• Genes: Entrez GeneID/HGNC
• Targets: ChEMBL
• Species/Organism: NCBI Taxonomy ID
• Diseases/Symptoms: ICD-10
• Named Reactions: RXNO
• Mechanism of Action: ATC
• Many of these can also use NLM MeSH Terms.
Example entity dictionary as DAG
• Nitrogen-containing heterocycles as minimal DFA:
– Pyrrole, Pyrazole, Imidazole, Pyridine, Pyridazine,
Pyrimidine, Pyrazine
• CaffeineFix supports (very large) user dictionaries.
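A minimal sketch of the idea, not CaffeineFix's own implementation: the seven names are loaded into a trie, then nodes with identical outgoing behaviour are merged bottom-up, which for a finite word list gives the minimal DFA. The function names and the dict-based node layout are assumptions made for illustration.

```python
WORDS = ["pyrrole", "pyrazole", "imidazole", "pyridine",
         "pyridazine", "pyrimidine", "pyrazine"]

def new_node():
    return {"final": False, "next": {}}

def build_trie(words):
    """Insert each word character by character, sharing common prefixes."""
    root = new_node()
    for word in words:
        node = root
        for ch in word:
            node = node["next"].setdefault(ch, new_node())
        node["final"] = True
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node["next"].values())

def minimise(node, registry):
    """Give one id to every distinct (final, outgoing edges) signature.

    Children are numbered first, so two nodes share an id exactly when
    they accept the same set of suffixes.
    """
    edges = tuple(sorted((ch, minimise(child, registry))
                         for ch, child in node["next"].items()))
    return registry.setdefault((node["final"], edges), len(registry))

def accepts(registry, start, word):
    """Run the minimised automaton over a word."""
    states = {state_id: sig for sig, state_id in registry.items()}
    state = start
    for ch in word:
        transitions = dict(states[state][1])
        if ch not in transitions:
            return False
        state = transitions[ch]
    return states[state][0]

trie = build_trie(WORDS)
registry = {}
start = minimise(trie, registry)
print(count_nodes(trie), "trie states ->", len(registry), "minimal DFA states")
```

The saving comes from shared suffixes such as "ole" and "ine", which the trie stores once per branch and the DFA stores once overall.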
OBO ontologies as dictionaries
• In addition to regular TSV (tab-separated value) files
for storing dictionaries, LeadMine’s obo2dict also
supports OBO ontologies, a convenient method for
tracking synonyms and foreign language forms.
[Term]
id: RXNO:0000006
name: Diels-Alder reaction
synonym: "Diels-Alder cycloaddition" EXACT []
synonym: "ディールス・アルダー反応" EXACT Japanese []
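The conversion from a stanza like this to dictionary rows might look as follows. This is a sketch only: obo2dict's real behaviour is not shown in the slides, so the function name `obo_to_rows` and the term/tab/identifier output layout are assumptions, and the parser handles just the `id`, `name` and `synonym` tags.

```python
import re

OBO_TEXT = '''[Term]
id: RXNO:0000006
name: Diels-Alder reaction
synonym: "Diels-Alder cycloaddition" EXACT []
synonym: "ディールス・アルダー反応" EXACT Japanese []
'''

def obo_to_rows(text):
    """Return (term, id) pairs for the name and synonyms of each [Term]."""
    rows = []
    term_id = None
    in_term = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest.
            in_term = (line == "[Term]")
            term_id = None
        elif in_term and line.startswith("id:"):
            term_id = line[3:].strip()
        elif in_term and term_id and line.startswith("name:"):
            rows.append((line[5:].strip(), term_id))
        elif in_term and term_id and line.startswith("synonym:"):
            # The synonym text is the quoted string after the tag.
            match = re.match(r'synonym:\s*"((?:[^"\\]|\\.)*)"', line)
            if match:
                rows.append((match.group(1), term_id))
    return rows

for term, term_id in obo_to_rows(OBO_TEXT):
    print(f"{term}\t{term_id}")
```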
Plural form generation
• LeadMine’s pluralize automatically generates English
plural forms from singular dictionary entries.
diels-alder couplings RXNO:0000006
diels-alder cycloadditions RXNO:0000006
diels-alder reactions RXNO:0000006
acridine syntheses RXNO:0000518
acyclic beckmann rearrangements RXNO:0000564
acyloin condensations RXNO:0000085
olefin metatheses RXNO:0000280
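A few suffix rules are enough to reproduce the plurals listed above. The sketch below is an illustration with an assumed rule set, not LeadMine's pluralize tool, and it only inflects the last word of each term.

```python
def pluralize(term):
    """Pluralise the last word of a term using a handful of English rules."""
    head, _, last = term.rpartition(" ")
    if last.endswith("is"):
        # Greek-derived nouns: synthesis -> syntheses, metathesis -> metatheses
        plural = last[:-2] + "es"
    elif last.endswith(("s", "x", "z", "ch", "sh")):
        plural = last + "es"
    elif last.endswith("y") and len(last) > 1 and last[-2] not in "aeiou":
        plural = last[:-1] + "ies"
    else:
        plural = last + "s"
    return f"{head} {plural}" if head else plural

for singular in ["diels-alder reaction", "acridine synthesis", "olefin metathesis"]:
    print(pluralize(singular))
```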
Unusual entities
• ISBN, URL, PubMed
• Roman Numerals, Date
• ColorState, Zip codes
• Katakana
• HELM, InChI, SMILES, v2000
• Credit Card Numbers
• Region
• Person
• Disease
• Journal
• SQL statement
• Solvent Mixture
• Hearst Patterns
• Unknown acid
• Unknown antibody
• Unknown disease
• Unknown INN
• Ordinal numbers
• Cardinal numbers
• de, es, fr, it, sv
Grammars within grammars
• LeadMine grammars are specified constructively,
effectively producing even more entity types.
• Region = City + Continent + Country + Island + Lake +
Mountain + Ocean + River + Sea + State/Province +
OtherFeature + OtherRegion.
• City = CityAlbania + CityAndorra + CityAustralia + CityAustria +
… + CityUS + …
• CityUS = CityUS_AK + CityUS_AL + CityUS_AR + CityUS_AZ +
CityUS_CA + CityUS_CO + …
Pharma registry numbers
• CaffeineFix v2.0 supports sets of user-defined
regular expressions as dictionaries.
• One application is specifying the format of
registry numbers, such as GSK204454A
• Prefix: “A” | “AZ” | “BMY” | “GSK” | “LY” | …
• Number: d{3-7}
• Suffix: (“.” d) | [“a” .. “z”]
• RegistryNumber: Prefix [“ ” | “-”] Number [Suffix]
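The grammar above translates fairly directly into a conventional regular expression. In this sketch, "d" is read as a digit and "{3-7}" as three to seven repetitions. The prefix list is only the part shown on the slide, and the suffix letter is allowed in either case so that the GSK204454A example matches; that last choice is an assumption.

```python
import re

# Only the prefixes shown on the slide; the real list is longer ("…").
PREFIXES = ["A", "AZ", "BMY", "GSK", "LY"]

# Longest prefixes first, so "AZ" is tried before "A".
prefix = "|".join(sorted(map(re.escape, PREFIXES), key=len, reverse=True))

REGISTRY_NUMBER = re.compile(
    rf"\b(?:{prefix})"       # Prefix
    rf"[ -]?"                # optional space or hyphen
    rf"\d{{3,7}}"            # Number: three to seven digits
    rf"(?:\.\d|[A-Za-z])?"   # optional Suffix (case-insensitive: assumption)
    rf"\b"
)

text = "The compound GSK204454A was compared with AZ-1234."
print([m.group() for m in REGISTRY_NUMBER.finditer(text)])
```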
Cardinal numbers
• English
– One, ten, two thousand and forty eight, ten million
• German
– Eins, Zehn, Hundert, Million, Viermillion
– Vierhundertsiebenundzwanzigtausendfünfhundertvierunddreißig
• French
– Trois cents, un mille, mille neuf cent quatre-vingts dix-huit
• Italian
– Uno, due, trenta, ottocentosessantamila settecentoottantanove
• Swedish
– en miljon trehundrasjuttiåtta tusen niohundrasjuttiett
CAS registry number grammar
• Two to seven digits, followed by a hyphen, two digits,
a hyphen and a final check digit
– e.g. 7732-18-5
• Regular Expression: (([1-9]\d{2,5})|([5-9]\d))-\d\d-\d
CAS check digit calculation
• More generally CaffeineFix’s finite state machines
can do limited processing...
• The final check digit of a CAS number is calculated by
series term summation modulo 10.
• The last digit before the check digit is multiplied by 1,
the previous digit by 2, the one before that by 3, and so
on; the check digit is the sum modulo 10.
• The CAS number for water is 7732-18-5.
• The checksum 5 is calculated as (1x8 + 2x1 + 3x2 +
4x3 + 5x7 + 6x7) mod 10 = 5.
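The calculation can be checked with ordinary code. The sketch below validates the format first and then the check digit. It uses the regular expression from the previous slide with the digit escapes written as `\d`; as written, that pattern admits two to six leading digits. The function names are illustrative, and this is not the finite state machine formulation that CaffeineFix uses.

```python
import re

# Pattern from the "CAS registry number grammar" slide.
CAS_PATTERN = re.compile(r"(([1-9]\d{2,5})|([5-9]\d))-\d\d-\d")

def cas_check_digit(body):
    """Weighted digit sum modulo 10 for the part before the check digit."""
    digits = [int(ch) for ch in body if ch.isdigit()]
    # Rightmost digit has weight 1, the next weight 2, and so on.
    return sum(w * d for w, d in enumerate(reversed(digits), start=1)) % 10

def is_valid_cas(text):
    """True if the text has CAS format and a matching check digit."""
    if not CAS_PATTERN.fullmatch(text):
        return False
    body, check = text[:-2], int(text[-1])
    return cas_check_digit(body) == check

print(cas_check_digit("7732-18"))   # water: (8 + 2 + 6 + 12 + 35 + 42) mod 10
print(is_valid_cas("7732-18-5"), is_valid_cas("7732-18-8"))
```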
FSM for matching CAS check digits
CAS number correction example
• 7732-18-8? Did you mean...
– 7732-18-5
– 7732-11-8
– 77328-18-8
– 7733-18-8
– 77342-18-8
– 77392-18-8
– 71732-18-8
– 76732-18-8
– 97732-18-8
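One way to reproduce a suggestion list like this is brute force: generate every string one edit away from the input and keep those that pass the format and check digit tests. CaffeineFix does this inside its automaton rather than by enumeration, so this is only a sketch. The edit model (single insertion, deletion or substitution over digits and hyphens) is an assumption.

```python
import re

CAS_PATTERN = re.compile(r"(([1-9]\d{2,5})|([5-9]\d))-\d\d-\d")
ALPHABET = "0123456789-"

def is_valid_cas(text):
    """CAS format plus a correct check digit."""
    if not CAS_PATTERN.fullmatch(text):
        return False
    digits = [int(ch) for ch in text[:-2] if ch.isdigit()]
    total = sum(w * d for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(text[-1])

def single_edits(text):
    """All strings at edit distance one from text."""
    edits = set()
    for i in range(len(text) + 1):
        for ch in ALPHABET:
            edits.add(text[:i] + ch + text[i:])            # insertion
        if i < len(text):
            edits.add(text[:i] + text[i + 1:])             # deletion
            for ch in ALPHABET:
                edits.add(text[:i] + ch + text[i + 1:])    # substitution
    edits.discard(text)
    return edits

def suggest(text):
    """Valid CAS numbers one edit away from a (possibly mistyped) input."""
    return sorted(c for c in single_edits(text) if is_valid_cas(c))

for candidate in suggest("7732-18-8"):
    print(candidate)
```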
Roman numerals
One useful operator is NonEmpty, which removes the empty string
from the set of valid matches, so that at least one character
must match.
Thousands: M, MM, MMM
Hundreds: C, CC, CCC, CD, D, DC, DCC, DCCC, CM
Tens: X, XX, XXX, XL, L, LX, LXX, LXXX, XC
Units: I, II, III, IV, V, VI, VII, VIII, IX
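In a conventional regex the same construction looks like the sketch below. Each of the four parts (thousands, hundreds, tens, units) is optional, so their concatenation also accepts the empty string. A lookahead stands in for the NonEmpty operator here, which is a substitution made for this example and not CaffeineFix syntax.

```python
import re

THOUSANDS = r"M{0,3}"
HUNDREDS = r"(?:CM|CD|D?C{0,3})"
TENS = r"(?:XC|XL|L?X{0,3})"
UNITS = r"(?:IX|IV|V?I{0,3})"

# Every part can match nothing, so the bare concatenation matches "".
# The lookahead (?=.) requires at least one character, like NonEmpty.
ROMAN = re.compile(rf"(?=.){THOUSANDS}{HUNDREDS}{TENS}{UNITS}")

def is_roman(text):
    return ROMAN.fullmatch(text) is not None

for candidate in ["MCMXCVIII", "XLII", "", "IIII"]:
    print(repr(candidate), is_roman(candidate))
```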
Unknown acid
• Another operator allows wildcards with exceptions,
effectively a “not” operator.
• An unknown acid is “[a-z’-]+ acid” where the first
word excludes:
– Stop words: a, the, and, any, is, in, was, etc.
– Common qualifiers: acceptable, preferred, etc.
– Adjectives: battery, free, inorganic, strong, etc.
– Known acids: acetic, nitric, amino, carboxylic, etc.
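Outside a grammar compiler, a wildcard with exceptions can be approximated by a regex match followed by a set lookup. The exclusion set below holds only the words named on the slide (the real lists are longer), and the function name is illustrative.

```python
import re

EXCLUDED = {
    "a", "the", "and", "any", "is", "in", "was",     # stop words
    "acceptable", "preferred",                       # common qualifiers
    "battery", "free", "inorganic", "strong",        # adjectives
    "acetic", "nitric", "amino", "carboxylic",       # known acids
}

# One word of letters, apostrophes or hyphens, followed by " acid".
CANDIDATE = re.compile(r"(?<![a-z'-])([a-z'-]+) acid\b")

def unknown_acids(text):
    """Return '<word> acid' phrases whose first word is not excluded."""
    return [m.group(0) for m in CANDIDATE.finditer(text.lower())
            if m.group(1) not in EXCLUDED]

print(unknown_acids("Add acetic acid, then a strong acid or obeticholic acid."))
```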
Unknown INN
• A variation on this theme allows LeadMine to
recognize novel (recently announced) kinase
inhibitors and antibodies based on the structure of
their INN names.
• An unknown kinase inhibitor is “[a-z]+inib” and an
unknown antibody is “[a-z]+mab” where the words
exclude previously known/reported INN names and
“colliding” English words.
april != captopril, KappaB != rozrolimupab, yuletide != exenatide,
triumvir != zanamivir, etc.
Person grammar
• The named person grammar matches:
1. [Salutation] FirstName [Initials] Surname [Suffix]
2. [Salutation] FirstName [Initials] UnknownSurname [Suffix]
3. [Salutation] UnknownFirstName [Initials] Surname [Suffix]
• where
Salutation includes Mr., Mrs., Dr., Sir, His Highness, …
FirstName includes David, John, Sarah, Tom, Angela, …
Surname includes Smith, Jones, Overington, …
UnknownFirstName excludes Big, Lake, The, Outer, etc.
UnknownSurname excludes Avenue, Bridge, Street, etc.
List construction operator
• Another frequently used idiom is the set of operators for
constructing comma-separated lists.
• These turn the grammar matching “X” into the
grammar matching things like “X, X, X and X”.
• More specifically:
X (“,” “ ”? X)* ((“ and ” | “ or ” | “ and/or ”) X)?
• Another variation of this allows “other”, “similar”
and “related” to precede the final X if the list is non-empty.
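As a regex combinator, the list operator might be sketched like this. The function `list_of` and the colour example are hypothetical. The sketch reads the pattern as one X, any number of comma-separated X, and an optional final conjunction plus X, and it leaves out the “other”/“similar”/“related” variation.

```python
import re

def list_of(item):
    """Turn a regex for one item into a regex for a list of such items."""
    return (
        rf"(?:{item})"                                   # first X
        rf"(?:, ?(?:{item}))*"                           # ", X" repeated
        rf"(?:(?: and | or | and/or )(?:{item}))?"       # optional "and X"
    )

COLOUR = r"red|yellow|green|blue"
COLOUR_LIST = re.compile(list_of(COLOUR))

for text in ["red", "red, yellow, green and blue", "red and/or blue", "red and"]:
    print(repr(text), COLOUR_LIST.fullmatch(text) is not None)
```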
Hearst pattern grammars
• An example use of list constructions is in the
recognition of Hearst patterns.
1. X such as Y [“including”, “especially” etc.]
2. Y and other X [“and related”, “or similar” etc.]
3. such X as Y
• Where X is a category or classification term;
• And Y is a list of exemplified terms.
• Marti A. Hearst, “Automatic Acquisition of Hyponyms from Large Text Corpora”, Proceedings
of the 14th International Conference on Computational Linguistics, Nantes, France, July 1992.
Complex object builder
• An application of the list construction operator is in
our “complex object builder” construction operator.
ComplexObjectBuilder cob;
cob.insert("red", "lorry", "lorries");
cob.insert("yellow", "lorry", "lorries");
• Allows matching not only of
“red lorry”, “red lorries”, “yellow lorry” and “yellow lorries”
• But also of…
“red and yellow lorries”, “yellow and red lorries”, etc.
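The behaviour described above can be imitated with a small class that compiles its entries to a regex. This is a hypothetical re-creation for illustration: only the class name and the three-argument `insert` call come from the slide, while the `compile` method and the rule that a modifier list must be followed by the plural form are assumptions.

```python
import re
from collections import defaultdict

class ComplexObjectBuilder:
    """Collects (modifier, singular, plural) triples and builds a matcher."""

    def __init__(self):
        self._modifiers = defaultdict(list)   # (singular, plural) -> modifiers

    def insert(self, modifier, singular, plural):
        self._modifiers[(singular, plural)].append(modifier)

    def compile(self):
        alternatives = []
        for (singular, plural), modifiers in self._modifiers.items():
            mod = "(?:" + "|".join(map(re.escape, modifiers)) + ")"
            # Two or more modifiers joined as a list: "red, blue and yellow"
            mod_list = rf"{mod}(?:, ?{mod})*(?: and | or | and/or ){mod}"
            sing, plur = re.escape(singular), re.escape(plural)
            alternatives.append(rf"{mod} (?:{sing}|{plur})")   # "red lorry"
            alternatives.append(rf"{mod_list} {plur}")         # "red and yellow lorries"
        return re.compile("|".join(alternatives))

cob = ComplexObjectBuilder()
cob.insert("red", "lorry", "lorries")
cob.insert("yellow", "lorry", "lorries")
matcher = cob.compile()

for text in ["red lorry", "red and yellow lorries", "red and yellow lorry"]:
    print(repr(text), matcher.fullmatch(text) is not None)
```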
Complex disease examples
• Adenomatous polyps of the colon and rectum.
• Fibroepithelial or epithelial hyperplasias.
• Inherited spinocerebellar ataxia.
• Stage II or stage III colorectal cancer.
• Inherited breast and ovarian cancers.
• Argentinian, Bolivian and Korean haemorrhagic
fevers.
• Dermatitis due to heat, cold, radiation, cosmetics,
fungi and shellfish.
Grammars for Safety text mining
• “May cause lung damage if swallowed”
– “may” → “can”, “could”, “may”, “might”, “will”, etc.
– “cause” → “lead to”, “result in”, “trigger”, “bring on”, …
– “lung damage” → “explosion”, “cancer”, “injury”, …
– “if” → “when”, “once”…
– “swallowed” → “heated”, “shaken”, “dried”, “ignited”…
• “Highly toxic”
– “highly” → “very”, “extremely”, “unusually”, “intensely”…
– “toxic” → “explosive”, “carcinogenic”, “poisonous”…
Efficient protein variant naming
• CaffeineFix technology can also be applied to naming
peptides and arbitrary protein variants/mutants.
• Consider a database of the following 11 peptides:
– CFFQNCPRG phenylpressin
– CFVRNCPTG annetocin
– CFWTSCPIG octopressin
– CYFQNCPRG argipressin
– CYFQNCPKG lypressin
– CYFRNCPIG cephalotocin
– CYIQNCPLG oxytocin
– CYIQNCPPG prol-oxytocin
– CYIQNCPRG vasotocin
– CYIQSCPIG seritocin
– CYISNCPIG isotocin
DAG representation of sequences
These 11 peptides may be efficiently represented and
searched as a “directed acyclic graph” [38 vs. 99 states]
Entirety of UniProt/Swiss-Prot
• Using this representation, all 540546 protein
sequences in uniprot_sprot, which together contain over
192M amino acids, require 142M states (1.4Gb).
• This data structure allows close analogues to be
identified much faster than using NCBI blastp.
• For example, all 540546 sequences can be queried
against this database (i.e. all-against-all) in ~9m30s
on a single core on a laptop.
• The sequence from PDB 1CRN (crambin 46AA) is
canonically named as [L25I]P01542 in 0.002s.
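The naming convention can be shown on the 11-peptide database from the earlier slide. The sketch below uses a plain linear scan and handles substitutions only. It is not the DAG search described above and says nothing about its speed; the function name `name_variant` is illustrative.

```python
PEPTIDES = {
    "CFFQNCPRG": "phenylpressin", "CFVRNCPTG": "annetocin",
    "CFWTSCPIG": "octopressin",   "CYFQNCPRG": "argipressin",
    "CYFQNCPKG": "lypressin",     "CYFRNCPIG": "cephalotocin",
    "CYIQNCPLG": "oxytocin",      "CYIQNCPPG": "prol-oxytocin",
    "CYIQNCPRG": "vasotocin",     "CYIQSCPIG": "seritocin",
    "CYISNCPIG": "isotocin",
}

def name_variant(query, database):
    """Name a sequence as [substitutions]reference using the closest entry.

    Only same-length references are considered, and differences are
    written as <old residue><1-based position><new residue>.
    """
    best = None
    for reference, name in database.items():
        if len(reference) != len(query):
            continue
        subs = [f"{old}{pos}{new}"
                for pos, (old, new) in enumerate(zip(reference, query), start=1)
                if old != new]
        if best is None or len(subs) < len(best[0]):
            best = (subs, name)
    if best is None:
        return None
    subs, name = best
    return name if not subs else f"[{','.join(subs)}]{name}"

print(name_variant("CYIQNCPLG", PEPTIDES))   # exact match
print(name_variant("CFVRNCPTA", PEPTIDES))   # one substitution at position 9
```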
Application to precision medicine
• A more realistic example is the sequence of the
gene “spastic paraplegia 4” with six mutations from
OMIM:604277, which can be canonically named as
[I344K,S362C,N386S,D441G,C448Y,R499C]Q9UBP0
• Run-time for this query is 0.2s.
• By comparison, blastp 2.2.29+ takes about 6s.
– With default arguments, NCBI blastp run time is 7s.
– Only 6s with -num_descriptions 1 -num_alignments 1.
Summary
• LeadMine’s .cfx files can do far more than efficiently
match very large dictionaries of terms.
• Indeed, many of the grammars used at NextMove
Software potentially match an infinite number of
terms.
• Construction of domain-specific grammars can be
done in collaboration with LeadMine customers.