Building a bridge between human-readable and machine-readable representations of biopolymers
Noel O’Boyle and Roger Sayle
NextMove Software
256th ACS National Meeting Washington Aug 2018
Introduction
Names preferred by MACHINES
SMILES:
CC(C)C[C@H](NC(=O)[C@H](CCCN)NC(=O)[C@H](CCCCNC(=O)[C@H](Cc1ccccc1)NC(=O)[C@@H](N)CCCCN)NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@@H](N)CCCCN)C(=O)NCCCC[C@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CCCN)NC(=O)[C@H](CCCCNC(=O)[C@H](Cc1ccccc1)NC(=O)[C@@H](N)CCCCN)NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@@H](N)CCCCN)C(=O)N[C@@H](CCCCN)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCCN)C(N)=O
HELM:
PEPTIDE1{K.F.K.[Orn].L.K.K.L.K.[am]}|PEPTIDE2{K.F.K.[Orn].L}|PEPTIDE3{K.F}|PEPTIDE4{K.F}$PEPTIDE1,PEPTIDE2,6:R3-5:R2|PEPTIDE2,PEPTIDE3,3:R3-2:R2|PEPTIDE1,PEPTIDE4,3:R3-2:R2$$$
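To make the layout of the HELM string on this slide concrete, here is a minimal Python sketch that splits it into its simple polymers and its connections. The function name `split_helm` is hypothetical, and the sketch assumes that no monomer contains `$`, `|`, `{` or `}`. That holds for this example but not for inline SMILES monomers, so this is not a full HELM parser.

```python
import re

def split_helm(helm):
    """Split a HELM string into (polymers, connections).

    Assumption: the first '$'-delimited section lists the simple polymers
    and the second lists the connections between them.
    """
    sections = helm.split("$")
    polymers = {}
    for name, body in re.findall(r"(\w+)\{([^}]*)\}", sections[0]):
        polymers[name] = body.split(".")
    connections = []
    if len(sections) > 1 and sections[1]:
        for conn in sections[1].split("|"):
            source, target, bond = conn.split(",")
            connections.append((source, target, bond))
    return polymers, connections

helm = ("PEPTIDE1{K.F.K.[Orn].L.K.K.L.K.[am]}|PEPTIDE2{K.F.K.[Orn].L}"
        "|PEPTIDE3{K.F}|PEPTIDE4{K.F}$PEPTIDE1,PEPTIDE2,6:R3-5:R2"
        "|PEPTIDE2,PEPTIDE3,3:R3-2:R2|PEPTIDE1,PEPTIDE4,3:R3-2:R2$$$")
polymers, connections = split_helm(helm)
```

Here `polymers` holds four chains and `connections` holds three bonds, each naming the two chains and the monomer positions and attachment points involved.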
Names preferred by Humans
• IUPAC/IUBMB recommendations from 1983 describe
a three-letter system for peptides [1]
– L- by default, D-/DL- must be specified
– Side-chain substitutions like Ser(Ac), Asp(OMe)
– Terminal modifications like Ac-Tyr-OMe, Me2-Lys
– Backbone N substitution like Ala-(Me)Ala
– Cyclic peptides like cyclo(-Val-Orn-Leu-)
– Peptide analogs like Ala-[psi](NH-CO)-Ala
• See also Bachem nomenclature guide [2]
[1] http://www.sbcs.qmul.ac.uk/iupac/AminoAcid/
[2] http://www.bachem.com/service-support/faq/nomenclature/
adrenorphin as H-Tyr-Gly-Gly-Phe-Met-Arg-Arg-Val-NH2
Describing Cycles without Lines
• The recommendations use drawings of bonds to indicate heterodetic cyclic peptides
• In practice, people either use free text or cyclo
– cyclo only handles simple situations; cannot handle overlapping disulphide bridges for example
• We use ring closure bonds like in SMILES [1]
– H-Cys(1)-Tyr-Ile-Gln-Asn-Cys(1)-Pro-Leu-Gly-NH2
– H-Thr(1)-Gly-Gly-Gly-(1)
[1] Similar to that described in “Abbreviated nomenclature for cyclic and branched homo- and hetero-detic peptides.” J. Peptide Res. 2005, 65, 550.
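The ring-closure labels in the two examples on this slide can be paired up mechanically. The following is a rough Python sketch: the function name `ring_closures` is made up for illustration, and it assumes a single chain whose tokens are separated by hyphens, with no hyphens inside parentheses (so a name such as Phe(4-Cl) would need a real tokenizer).

```python
import re
from collections import defaultdict

# A ring-closure label is a number in parentheses at the end of a token.
LABEL = re.compile(r"\((\d+)\)$")

def ring_closures(condensed):
    """Map each ring-closure label to the (position, token) pairs carrying it."""
    closures = defaultdict(list)
    for position, token in enumerate(condensed.split("-")):
        match = LABEL.search(token)
        if match:
            closures[int(match.group(1))].append((position, token))
    for label, ends in closures.items():
        if len(ends) != 2:
            raise ValueError("label %d appears %d times" % (label, len(ends)))
    return dict(closures)

disulfide = ring_closures("H-Cys(1)-Tyr-Ile-Gln-Asn-Cys(1)-Pro-Leu-Gly-NH2")
head_to_side = ring_closures("H-Thr(1)-Gly-Gly-Gly-(1)")
```

In the second example the bare `(1)` token stands in the C-terminal position, so the label connects the chain end back to the Thr side chain.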
NAMES SUITABLE FOR HUMANS *AND* MACHINES
IUPAC Condensed:
H-Lys-Phe-(1).H-Lys-Phe-Lys(1)-Orn-Leu-Lys(2)-Lys-Leu-Lys-NH2.H-Lys-Phe-Lys(3)-Orn-Leu-(2).H-Lys-Phe-(3)
Human-Readable
Monomer names
Roger’s Recommendations [1]
• Don’t use a dictionary; create monomers from building blocks and use systematic names
– Stereo, parent, backbone/sidechain substituents
• H-Ala-D-N(Bu)Phe(4-Cl)-OH
• Retain widely used 3-letter codes and use substituent abbreviations or line formulae
• Consider amino acids to have default substitution locants, but retain the ability to specify them
– Ser(Me), Phe(4-Cl)
• Implicit leaving groups
– e.g. Asp(OMe): OH for acids, H otherwise
[1] https://www.slideshare.net/NextMoveSoftware/dallas-monomers
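The idea of building a monomer name from stereo, parent and substituents can be illustrated with a small regular expression. This is only a sketch under stated assumptions: the pattern and the function `parse_monomer` are hypothetical, they cover just the shapes shown on this slide (an optional D-/L-/DL- prefix, an optional N(...) backbone substituent, a three-letter parent and an optional bracketed side-chain substituent), and they are not the grammar used by the authors.

```python
import re

MONOMER = re.compile(
    r"^(?:(?P<stereo>DL|D|L)-)?"         # optional stereo prefix
    r"(?:N\((?P<n_subst>[^()]+)\))?"     # optional backbone N substituent, e.g. N(Bu)
    r"(?P<parent>[A-Z][a-z]{2})"         # three-letter parent, e.g. Phe
    r"(?:\((?P<side_subst>[^()]+)\))?$"  # optional side-chain substituent, e.g. (4-Cl)
)

def parse_monomer(name):
    """Split a monomer name into its building blocks, or return None."""
    match = MONOMER.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    # L- is the default when no stereo prefix is given.
    parts["stereo"] = parts["stereo"] or "L"
    return parts

substituted = parse_monomer("D-N(Bu)Phe(4-Cl)")
plain = parse_monomer("Ser(Me)")
```

The side-chain field keeps any explicit locant (`4-Cl`) as written; resolving a default locant for `Ser(Me)` would need per-residue chemistry knowledge that this sketch does not attempt.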
Monomer names, Choose wisely
• Sugar&Splice (S&S) used some monomer names adapted from PDB monomer codes
– Sometimes not used in practice in the field
• For cysteine sulfinic acid
– PDB has CSD, and S&S had Csd
– Now changed to Cys(O2H)
• For selenomethionine
– PDB has MSE, and S&S had Mse
– Now changed to SeMet
Monomer names, Choose wisely
• 1-amino-cyclopentyl carboxylic acid
– HELM 1.0 has Spg
– S&S has Ac5c (Ac6c, etc.)
– Vendors use Ac5c but also Cle
• HELM 1.0 has Glc as a peptide N-terminal modifier
– Representing glycolic acid
– In sugar nomenclature, Glc is glucose, the most common
monosaccharide
Text-Mining Human Representations
• What three (or more) letter codes do people use?
– For non-standard amino acids
– For sidechain substituents
– For C- and N-terminal modifications
– For non-standard connections between amino acids
• We can answer these questions by text-mining
PubMed Abstracts or the patent literature
– Using a grammar for IUPAC condensed notation
Mined the Gaps
• Text-mining is usually for matching phrases that you
know
– How do you text-mine phrases that you don’t know?
• Look at the gaps between text that is recognised
(“entity extension”)
– Filter for gaps that do not contain space
– Sort the results by frequency to identify common
abbreviations that were missed
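The gap-mining steps above can be sketched in a few lines of Python. Everything named here is illustrative: `KNOWN` is a toy recogniser that knows only four residues and stands in for a real grammar, and `mine_gaps` is a hypothetical helper that takes the character offsets of recognised text.

```python
import re
from collections import Counter

# Toy recogniser: a stand-in for a real grammar, knowing only four residues.
KNOWN = re.compile(r"\b(?:Ala|Phe|Gly|Leu)\b")

def mine_gaps(text, spans):
    """Count whitespace-free gaps between consecutive recognised spans.

    `spans` is a sorted list of non-overlapping (start, end) offsets.
    """
    gaps = Counter()
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
        gap = text[prev_end:next_start].strip("-")
        # Keep only non-empty gaps that contain no whitespace.
        if gap and not any(ch.isspace() for ch in gap):
            gaps[gap] += 1
    return gaps

text = "Boc-Ala-Igl-Phe-OMe and H-Gly-Igl-Leu-OH"
spans = [m.span() for m in KNOWN.finditer(text)]
gap_counts = mine_gaps(text, spans)
```

Sorting `gap_counts` by frequency (for example with `most_common()`) surfaces `Igl` as the abbreviation the toy recogniser missed, while the gap containing " and " is filtered out because it spans a space.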
Igl: 2-indan-2-yl-glycine (19 times)
Aca: ε-aminocaproic acid (9 times)
Dhb: dehydrobutyrine (21 times)
Bpa: p-benzoyl phenylalanine (10 times)
Dpg: α,α-diisopropylglycine (19 times)
Unrecognised Peptide Analogs
Found in place of the peptide bond, e.g. Ala-psi(CH2NH)-Ala instead of Ala-Ala:
-psi(CH2NH)- (9 times)
-psi-(CH2S)- (10 times)
-psi(PO2CH2)- (7 times)
-NH-CO-NH- (25 times)
-psi[CH(OH)CH2] (7 times)
Unrecognised N-terminus Prefixes
• Extract text preceding recognised text
– Must end with a hyphen
– Stop at the first space
dansyl- (121 times)
hippuryl- (39 times)
benzyloxycarbonyl- (530 times)
Suc-/succinyl- (329/167 times)
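The prefix rule on this slide (text that precedes a recognised peptide, ends with a hyphen, and stops at the first space) might look like this in Python. The `PEPTIDE` pattern is a toy recogniser over four residues, and both it and the helper `n_terminal_prefix` are assumptions made for illustration rather than the authors' code.

```python
import re
from collections import Counter

RESIDUES = r"(?:Ala|Arg|Gly|Phe)"
# Toy recogniser: two or more known residues joined by hyphens.
PEPTIDE = re.compile(r"%s(?:-%s)+" % (RESIDUES, RESIDUES))

def n_terminal_prefix(text, start):
    """Return the hyphen-terminated text immediately before `start`, or None."""
    if start == 0 or text[start - 1] != "-":
        return None
    begin = start - 1
    # Walk backwards until the first whitespace character.
    while begin > 0 and not text[begin - 1].isspace():
        begin -= 1
    prefix = text[begin:start]
    return prefix if len(prefix) > 1 else None

text = ("The substrate dansyl-Gly-Phe-Ala was compared with "
        "hippuryl-Phe-Arg and dansyl-Ala-Arg.")
prefix_counts = Counter()
for match in PEPTIDE.finditer(text):
    prefix = n_terminal_prefix(text, match.start())
    if prefix:
        prefix_counts[prefix] += 1
```

Counting the returned prefixes over a large corpus gives a frequency table of the kind shown on the slide.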
Unrecognised C-terminus Suffixes
• Extract text following any recognised text
– First pass focused on text beginning with a hyphen up to the
first space
– Later analyses identified common space-separated phrases
for esters
-chloromethylketone (and variants): 270 times
-fluoromethylketone (and variants): 607 times
ethyl ester (already had –OEt): 30 times
-beta-naphthylamide (and variants): 193 times
3-letter codes
• “A foolish consistency is the hobgoblin of little minds” –
Ralph Waldo Emerson
• No particular reason to stick to 3-letters
– Eventually leads to ambiguities
• Hyp as hydroxyproline or hypoxanthine
• Xan as xanthen-9-yl or xanthosine
• Unless an abbreviation is very common, favor longer
names that are more descriptive
– Igl vs Gly(indan-2-yl)
– Dhb vs Abu(2,3-dehydro)
– Bpa vs Phe(4-Bz)
Unrecognised Substituents
• Search for text containing any amino acid followed by a bracketed expression
– I used a regular expression
Pxy, e.g. Lys(Pxy): pyridoxyl (7 times)
Mbh, e.g. Asn(Mbh): 4,4'-dimethoxybenzhydryl (8 times)
Mts, e.g. Arg(Mts): mesitylene-2-sulfonyl (7 times)
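The slide does not show the regular expression that was used, so the following is only a guess at its general shape. It matches a standard three-letter code followed by a bracketed expression and counts the bracket contents that are not already known. The `KNOWN_SUBSTITUENTS` set is a hypothetical stand-in for whatever the recogniser already handles.

```python
import re
from collections import Counter

AMINO_ACIDS = ["Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His",
               "Ile", "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp",
               "Tyr", "Val"]
# An amino acid code followed directly by a bracketed expression.
SUBSTITUTED = re.compile(r"\b(%s)\(([^()\s]+)\)" % "|".join(AMINO_ACIDS))
# Hypothetical list of substituents that are already recognised.
KNOWN_SUBSTITUENTS = {"Ac", "Me", "OMe"}

def unknown_substituents(text):
    """Count bracketed substituents that are neither known nor ring-closure labels."""
    counts = Counter()
    for _residue, substituent in SUBSTITUTED.findall(text):
        if substituent.isdigit() or substituent in KNOWN_SUBSTITUENTS:
            continue
        counts[substituent] += 1
    return counts

text = "Boc-Arg(Mts)-Gly-OH, Z-Asn(Mbh)-OH and H-Ser(Ac)-Arg(Mts)-Cys(1)-NH2"
substituent_counts = unknown_substituents(text)
```

Purely numeric brackets are skipped so that ring-closure labels such as Cys(1) are not reported as substituents.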
Some Care Needed
• Not everything that looks like a peptide is a peptide
– Ile-de-France (sheep)
– Glyol, Leuol and Pheol
• but not Tyrol, Lysol or Metol
– Argal and Proal
• but not Metal, Ileal, Penal or Seral
• Even where it definitely is a peptide, it is necessary to
check the details:
– Is this abbreviation widely used?
– Does it occur in different peptides?
– Is it unambiguous?
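One simple way to guard against the look-alikes listed above is a stop list that is checked before a generated name is accepted. The sketch below is an assumption about how that could be done, not a description of the authors' tool. The stop list contains only the words from this slide, and `is_peptide_token` is a hypothetical helper.

```python
import re

PARENTS = ["Arg", "Gly", "Ile", "Leu", "Lys", "Met", "Pen", "Phe", "Pro",
           "Ser", "Tyr"]
# A parent code followed by the "ol" or "al" ending seen on the slide.
MODIFIED = re.compile(r"^(?:%s)(?:ol|al)$" % "|".join(PARENTS))
# Words and trade names from the slide that collide with generated names.
STOP_WORDS = {"Tyrol", "Lysol", "Metol", "Metal", "Ileal", "Penal", "Seral"}

def is_peptide_token(token):
    """Accept parent+ol / parent+al tokens unless they are on the stop list."""
    if token in STOP_WORDS:
        return False
    return MODIFIED.match(token) is not None

accepted = [t for t in ["Leuol", "Tyrol", "Argal", "Metal"] if is_peptide_token(t)]
```

A stop list only covers collisions that someone has already noticed, which is why the manual checks in the bullets above are still needed.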
Some Care Needed
• Dpa was found to occur 21 times in the gaps
– Many times (but not always) in the same peptide
• Inspection of the papers behind the abstracts and googling “Boc-Dpa-OH” indicated two potential meanings
– diaminopropionic acid, aka Dap
– 3-(2,4-dinitrophenyl)-L-2,3-diaminopropionyl
HUMAN-READABLE → MACHINE-READABLE
• 16.4K oligopeptides* textmined from PubMed
Abstracts
* Containing at least 5 monomers
Tyr-d-Ala-Phe-Gly-Tyr-Pro-Ser-NH(2)
Asp-Thr(P)-Pro-Ala-Lys
Pyr-Gly-Pro-Pro-Ile-Ser-Ile-Asp-Leu-Ser-Leu-Glu-Leu-Leu-Arg-Lys-Met-Ile-Glu-Ile
Gly-Ala-Aib-Pro-Ala-Aib-Aib-Glu
Nle-Leu-Phe-Nle-Tyr-Lys
L-Ala-D-Glu-L-Lys-D-Ala-D-Ala
D-Phe-Cys-Tyr-D-Trp-Orn-Thr-Pen-Thr-NH2
Glu-Asp-Pro-Gln-Gly-Asx-Ala-Ala
Ac-Phe-Leu-Val-His-NH2
Gly-Asx-Glx-Ser-Thr-Cys
Ac-Met-Glu-Glu-Lys-Leu-Lys-Lys-Thr-Lys-Ile-Ile-Phe-Val-Val-Gly-Gly-Pro-Gly-Ser-Gly-Lys-Gly-Thr-Gln-Cys-Glu-Lys-Ile-Val-Gln-Lys-Tyr-Gly-Tyr-Thr-His-Leu-Ser-Thr-Gly-Asp-Leu-Leu-Arg-Ser-Glu-Val-Ser-Ser-Gly-Ser-Ala-Arg-Gly-Lys-Lys-Leu-Ser-Glu-Ile-Met-Glu-Lys-Gly-Gln-Leu-Val-Pro-Leu-Glu-Thr-Val-Leu-Asp-Met-Leu-Arg-Asp-Ala-Met-Val-Ala-Lys-Val-Asn-Thr-Ser-Lys-Gly-Phe-Leu-Ile-Asp-Gly-…….
HUMAN-READABLE → MACHINE-READABLE
• 16.4K oligopeptides* text-mined from PubMed Abstracts and converted to HELM
– 2.2K not converted, 1.6K as inline HELM, 12.6K as regular HELM
* Containing at least 5 monomers
PEPTIDE1{Y.[dA].F.G.Y.P.S.[am]}$$$$
PEPTIDE1{D.[*C(=O)[C@H]([C@@H](C)OP(=O)(O)O)N* |$_R2;;;;;;;;;;;;_R1$|].P.A.K}$$$$
PEPTIDE1{[Glp].G.P.P.I.S.I.D.L.S.L.E.L.L.R.K.M.I.E.I}$$$$
PEPTIDE1{G.A.[Aib].P.A.[Aib].[Aib].E}$$$$
PEPTIDE1{[Nle].L.F.[Nle].Y.K}$$$$
PEPTIDE1{A.[dE].K.[dA].[dA]}$$$$
PEPTIDE1{[dF].C.Y.[dW].[Orn].T.[Pen].T.[am]}$$$$
PEPTIDE1{E.D.P.Q.G.(D,N).A.A}$$$$V2.0
PEPTIDE1{[ac].F.L.V.H.[am]}$$$$
PEPTIDE1{G.(D,N).(E,Q).S.T.C}$$$$V2.0
PEPTIDE1{[ac].M.E.E.K.L.K.K.T.K.I.I.F.V.V.G.G.P.G.S.G.K.G.T.Q.C.E.K.I.V.Q.K.Y.G.Y.T.H.L.S.T.G.D.L.L.R.S.E.V.S.S.G.S.A.R.G.K.K.L.S.E.I.M.E.K.G.Q.L.V.P.L.E.T.V.L.D.M.L.R.D.A.M.V.A.K.V.N.T.S.K.G.F.L.I.D.G.Y.P.R.E.V.Q.Q.G.E.E.F.E.R.R.I.G.Q.P.T.L.L.L.Y.V.D.A.G.P.E.T.M.T.R.R.L.L.K.R.G.E.T.S.G.R.V.D.N.E.E.T.I.K.K.R.L.E.T.Y.Y.K.A.T.E.P.V.I.A.F.Y.E.K.R.G.I.V.R.K.V.N.A.E.G.S.V.D.E.V.F.S.Q.V.C.T.H.L.D.A.L.K}$$$$
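For the simplest linear cases, the conversion shown on these two slides can be approximated with lookup tables. The sketch below is not how Sugar&Splice works. The function `condensed_to_helm` and the tables are illustrative assumptions, the HELM monomer IDs are copied from the examples on the slide, and anything outside these tables (side-chain substituents, ring closures, ambiguous residues such as Asx) raises an error instead of being converted.

```python
ONE_LETTER = {"Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
              "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
              "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
              "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V"}
# Non-standard monomers, mapped to the HELM IDs used on the slide.
BRACKETED = {"Nle": "Nle", "Aib": "Aib", "Orn": "Orn", "Pen": "Pen", "Pyr": "Glp"}
N_CAPS = {"Ac": "[ac]", "H": None}    # None: unmodified terminus, nothing emitted
C_CAPS = {"NH2": "[am]", "OH": None}

def condensed_to_helm(name):
    """Convert a simple linear IUPAC condensed name to a HELM string."""
    tokens = name.split("-")
    last = len(tokens) - 1
    monomers = []
    stereo = "L"
    for index, token in enumerate(tokens):
        if token.upper() in ("D", "L"):
            stereo = token.upper()        # applies to the next residue only
            continue
        if index == 0 and token in N_CAPS:
            helm_id = N_CAPS[token]
        elif index == last and token in C_CAPS:
            helm_id = C_CAPS[token]
        elif token in ONE_LETTER:
            letter = ONE_LETTER[token]
            helm_id = letter if stereo == "L" else "[d%s]" % letter
        elif token in BRACKETED and stereo == "L":
            helm_id = "[%s]" % BRACKETED[token]
        else:
            raise ValueError("cannot convert %r" % token)
        if helm_id:
            monomers.append(helm_id)
        stereo = "L"
    return "PEPTIDE1{%s}$$$$" % ".".join(monomers)

capped = condensed_to_helm("Ac-Phe-Leu-Val-His-NH2")
mixed_stereo = condensed_to_helm("L-Ala-D-Glu-L-Lys-D-Ala-D-Ala")
```

Run on the inputs from the first slide, the sketch reproduces the corresponding HELM strings from the second. Inputs such as Thr(P) are rejected, which is the kind of case the slide reports as inline HELM or as not converted.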
Human-Readable
Peptide names
IUPAC Names for Peptides
• (S)-N-((S)-1-((2-amino-2-oxoethyl)amino)-4-methyl-1-oxopentan-2-yl)-1-((4R,7S,10S,13S,16S,19R)-19-amino-7-(2-amino-2-oxoethyl)-10-(3-amino-3-oxopropyl)-16-(4-hydroxybenzyl)-13-isobutyl-6,9,12,15,18-pentaoxo-1,2-dithia-5,8,11,14,17-pentaazacycloicosane-4-carbonyl)pyrrolidine-2-carboxamide
• (2S)-1-[(4R,7S,10S,13S,16S,19R)-19-amino-7-(2-amino-2-oxo-ethyl)-10-(3-amino-3-oxo-propyl)-16-[(4-hydroxyphenyl)methyl]-13-isobutyl-6,9,12,15,18-pentaoxo-1,2-dithia-5,8,11,14,17-pentazacycloicosane-4-carbonyl]-N-[(1S)-1-[(2-amino-2-oxo-ethyl)carbamoyl]-3-methyl-butyl]pyrrolidine-2-carboxamide
• (2S)-2-{[(2S)-1-[(4R,7S,10S,13S,16S,19R)-19-amino-10-(2-carbamoylethyl)-7-(carbamoylmethyl)-16-[(4-hydroxyphenyl)methyl]-13-(2-methylpropyl)-6,9,12,15,18-pentaoxo-1,2-dithia-5,8,11,14,17-pentaazacycloicosane-4-carbonyl]pyrrolidin-2-yl]formamido}-N-(carbamoylmethyl)-4-methylpentanamide
• L-cysteinyl-L-tyrosyl-L-leucyl-L-glutaminyl-L-asparagyl-L-cysteinyl-L-prolyl-L-leucyl-glycinamide (1->6)-disulphide
• [Leu3]oxytocin
Names that show Relationships
• Often better to describe a structure as a delta (or
modification) of a known structure
– Earlier, I showed Bpa versus Phe(4-Bz); reduces cognitive load;
similarity of Phe(4-X) follows intuitively
• If we apply this to whole peptides:
– Name a peptide as a delta from a reference set of peptides –
e.g. ‘known peptides’, or an in-house database
• Modification nomenclature described in 1983 IUPAC-IUBMB recommendations
Analysis of PubChem
• Curated database of oligopeptides of biological
interest (currently 452 entries)
• 10.5% of the 170,708 peptides of length 5 or greater
in PubChem can be named as variants of these
– argipressin (1-8) vs H-Cys(1)-Tyr-Phe-Gln-Asn-Cys(1)-Pro-Arg-OH
– Cbz-cholecystokinin octapeptide (2-7) amide vs Cbz-Tyr(SO3H)-Met-Gly-Trp-Met-Asp-NH2
– [Ile1,Ser2,Ser8]cyphokinin vs H-Ile-Ser-Arg-Pro-Pro-Gly-Phe-Ser-Pro-Phe-Arg-OH
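The substitution style of variant name, as in [Leu3]oxytocin, can be generated by comparing a query against a reference sequence position by position. This is a minimal sketch with a hypothetical function `variant_name`. It assumes equal-length sequences with identical termini and ring closures, and it does not cover fragments such as "(1-8)" or terminal modifications.

```python
def variant_name(query, reference, reference_name):
    """Name `query` as a substitution variant of `reference`, or return None."""
    if len(query) != len(reference):
        return None
    # Positions are 1-based, counted from the N-terminus.
    changes = ["%s%d" % (q, position)
               for position, (q, r) in enumerate(zip(query, reference), start=1)
               if q != r]
    if not changes:
        return reference_name
    return "[%s]%s" % (",".join(changes), reference_name)

OXYTOCIN = "Cys Tyr Ile Gln Asn Cys Pro Leu Gly".split()
query = "Cys Tyr Leu Gln Asn Cys Pro Leu Gly".split()
name = variant_name(query, OXYTOCIN, "oxytocin")
```

Against a curated reference set, the same comparison would be run for each candidate reference and the name with the fewest changes kept. That selection step is left out here.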
In Conclusion
• Machines can bridge the gap to humans by reading/writing human-readable representations of biopolymers
• Appropriate monomer names can be found by mining the literature
TOOLS USED
• Textmining using LeadMine
• Biologic representations using Sugar&Splice
THANKS!
noel@nextmovesoftware.com

