Naming algorithms for
derivatives of peptide-like
natural products
Roger Sayle & Noel O’Boyle
NextMove Software, Cambridge, UK
Christopher Southan,
IUPHAR/BPS Guide to Pharmacology
250th ACS National Meeting, Boston, MA. Sunday 16th August 2015
30 second overview
• This talk describes the development of software for
both naming peptides and reading peptide names
matching the de facto standard practices currently
followed by biochemists.
• Unlike computer representations such as Pistoia HELM
or InChI keys, these names and identifiers match
those typically found in the scientific literature and
vendor catalogues.
• A significant application of this technology is to check
and correct peptide representations in databases.
Problem motivation
• Maintaining a database of biological activities of
mature proteins and peptides presents a significant
technical challenge.
• IUPHAR/BPS’ “Guide to Pharmacology” and EBI’s
ChEMBL represent current state-of-the-art efforts to
capture/represent peptide-like ligands.
• These ligands require more than (FASTA) sequence
bioinformatics: disulfide bridging architecture,
non-standard amino acids, post-translational
modifications, N- and C-terminal modifications, etc.
IUPHAR/BPS GtoP peptides
• The “Guide to Pharmacology” database contains:
– The common name “oxytocin”
– The species, e.g. “human”
– The UniProt ID “P01178”
– The 1-letter Sequence “CYIQNCPLG”
– The 3-letter sequence “Cys-Tyr-Ile-…-Leu-Gly-NH2”
– A text description “Post-translation modification”, e.g. “A
disulfide bond is formed between cysteine residues at
positions 1 and 5 and the C-terminal glycine is amidated”.
– Often SMILES and standard InChIKey.
Problem 1: consistency
• The challenge with these advanced formats is that
the names, three-letter codes and modification
descriptions are text-locked, unreadable by software.
Examples of errors and inconsistency:
Ligand #4463: “PheGlnThrSerGluAlaIleLeuPro…”
Ligand #1335: “…Leu-Arg-AlaPro-Leu-Lys...”
Ligand #8263: “val-leu-gln-glu-leu-asn-val-thr-val”
Ligand #5873: “…Pr-oGl-yGl-ySe-rMe-tLy-sLe-u…”
Ligand #3591: “PHQLLRVPro-His-Ala-Gln-Leu…”
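A check of this kind can be sketched in a few lines of Python. This is a minimal illustration rather than the software described in the talk: the helper name `normalize_three_letter` is hypothetical, only the 20 standard residues are recognised, and the inputs are the ligand fragments above with their ellipses dropped.

```python
import re

# Standard three-letter to one-letter codes for the 20 proteinogenic amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def normalize_three_letter(text):
    """Return (normalized, problems) for a plain three-letter sequence.

    Hypothetical helper: it ignores terminal groups and modifications and
    only knows the 20 standard residues.
    """
    problems = []
    # Keep letters only, so misplaced or missing hyphens do not matter.
    letters = re.sub(r"[^A-Za-z]", "", text)
    if len(letters) % 3 != 0:
        problems.append("length is not a multiple of three")
    residues = []
    usable = len(letters) - len(letters) % 3
    for i in range(0, usable, 3):
        code = letters[i:i + 3].capitalize()
        if code not in THREE_TO_ONE:
            problems.append(f"unknown residue code {code!r} at position {i // 3 + 1}")
        residues.append(code)
    normalized = "-".join(residues)
    if not problems and normalized != text:
        problems.append("non-canonical punctuation or capitalization")
    return normalized, problems


if __name__ == "__main__":
    for fragment in ("val-leu-gln-glu-leu-asn-val-thr-val",
                     "Pr-oGl-yGl-ySe-rMe-tLy-sLe-u",
                     "PHQLLRVPro-His-Ala-Gln-Leu"):
        print(fragment, "->", normalize_three_letter(fragment))
```

The first two fragments normalize cleanly to hyphen-separated three-letter codes and are flagged only as non-canonical. The mixed one-letter/three-letter fragment from Ligand #3591 is reported as unparseable.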
Problem 2: ambiguity
• Ligand #3630 (neuropeptide B29) “(Br)Trp” with note
“The n-terminal tryptophan is brominated”.
– Suggested replacement Trp(6-Br)
• In Ligand #1036, “(Ac)Ala” means N2-acetyl but
“(Ac)Lys” means N6-acetyl; in Ligand #1188, “Ac-” appears
without parentheses; in Ligand #3853, “AcPhe-”
appears without a hyphen…
– Suggest Ac- at N-term, -N(Ac)Phe infix, Lys(Ac) sidechain
– This even allows Ac-N(Ac)Lys(Ac)-OH, a.k.a. N(Ac2)Lys(Ac).
Problem 3: disulfide bridging
• Capturing the disulfide bridging architecture in the
three-letter (condensed) representation allows it to
be read/checked for errors.
• This is done in some places but not in others.
• Disulfide bridges are particularly tricky even for the
folks at UniProt: Annexin I (ligand #1031, P04083)
isn’t annotated as disulfide bridged, despite the 3D
structure in PDB 1HM6, and the experimental
evidence of an intramolecular disulfide described in
PubMed 7663390.
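To illustrate why bridge labels in the condensed form are machine-checkable, here is a minimal sketch that extracts disulfide pairs from names written in the `Cys(n)` style used elsewhere in this talk. The function names and the small set of terminal groups are assumptions made for the example, not part of any published tool.

```python
import re
from collections import defaultdict

STEREO_PREFIXES = {"D", "L", "DL"}            # prefixes that belong to the next residue
TERMINAL_GROUPS = {"H", "OH", "NH2", "Ac"}    # small illustrative subset


def split_condensed(name):
    """Split a condensed name on hyphens that lie outside parentheses."""
    tokens, depth, current = [], 0, ""
    for ch in name:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "-" and depth == 0:
            tokens.append(current)
            current = ""
        else:
            current += ch
    tokens.append(current)
    return tokens


def disulfide_pairs(name):
    """Return [(i, j), ...] residue positions bridged by matching Cys(n) labels."""
    tokens = split_condensed(name)
    if tokens and tokens[0] in TERMINAL_GROUPS:
        tokens = tokens[1:]
    if tokens and tokens[-1] in TERMINAL_GROUPS:
        tokens = tokens[:-1]
    labels = defaultdict(list)
    position = 0
    for token in tokens:
        if token in STEREO_PREFIXES:
            continue                      # "D-Cys(1)" arrives as "D", "Cys(1)"
        position += 1
        match = re.fullmatch(r"Cys\((\d+)\)", token)
        if match:
            labels[int(match.group(1))].append(position)
    pairs = []
    for label, positions in sorted(labels.items()):
        if len(positions) != 2:
            raise ValueError(f"bridge label {label} used {len(positions)} time(s)")
        pairs.append((positions[0], positions[1]))
    return pairs


if __name__ == "__main__":
    print(disulfide_pairs("Cys(1)-Tyr-Ile-Gln-Asp-Cys(1)-Pro-Leu-Gly-NH2"))
```

An unpaired label raises an error, which is the kind of inconsistency that free-text descriptions cannot expose automatically.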
Types of peptide name/identifier
• Sequence: CYIQDCPLG
• Peptide Name: [Asp5]oxytocin or [5-L-aspartic acid]oxytocin
• Chemical IUPAC Name:
2-[(4R,7S,10S,13S,16S,19R)-19-amino-4-[(2S)-2-[[(1S)-1-[(2-amino-2-oxo-ethyl)carbamoyl]-3-methyl-butyl]carbamoyl]pyrrolidine-1-carbonyl]-10-(3-amino-3-oxo-propyl)-16-[(4-hydroxyphenyl)methyl]-13-[(1S)-1-methylpropyl]-6,9,12,15,18-pentaoxo-1,2-dithia-5,8,11,14,17-pentazacycloicos-7-yl]acetic acid
• Biological IUPAC Name:
L-cysteinyl-L-tyrosyl-L-isoleucyl-L-glutaminyl-L-alpha-aspartyl-L-cysteinyl-L-prolyl-L-leucyl-glycinamide (1->6)-disulfide
• Condensed: Cys(1)-Tyr-Ile-Gln-Asp-Cys(1)-Pro-Leu-Gly-NH2
• Pistoia HELM:
PEPTIDE1{C.Y.I.Q.D.C.P.L.G.[am]}$PEPTIDE1,PEPTIDE1,1:R3-6:R3$$$
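As a rough illustration of how these identifiers relate, the sketch below derives the condensed form and a HELM-style string from a one-letter sequence plus explicit bridge and terminus information. The function names are hypothetical, and the HELM output simply mirrors the layout of the string shown on this slide. Joining several connections with `|` is an assumption about the notation, not something the slide shows.

```python
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_condensed(sequence, disulfides=(), amidated=False):
    """Condensed three-letter form with numbered Cys(n) bridge labels."""
    residues = [ONE_TO_THREE[aa] for aa in sequence]
    for label, pair in enumerate(disulfides, start=1):
        for pos in pair:
            if sequence[pos - 1] != "C":
                raise ValueError(f"position {pos} is not a cysteine")
            residues[pos - 1] += f"({label})"
    residues.append("NH2" if amidated else "OH")
    return "-".join(residues)


def to_helm(sequence, disulfides=(), amidated=False):
    """HELM-style string laid out like the example on the slide."""
    monomers = list(sequence) + (["[am]"] if amidated else [])
    polymer = "PEPTIDE1{" + ".".join(monomers) + "}"
    # Assumption: multiple connections are separated by "|".
    links = "|".join(f"PEPTIDE1,PEPTIDE1,{i}:R3-{j}:R3" for i, j in disulfides)
    return f"{polymer}${links}$$$"


if __name__ == "__main__":
    # [Asp5]oxytocin: sequence CYIQDCPLG, 1-6 disulfide, C-terminal amide.
    print(to_condensed("CYIQDCPLG", [(1, 6)], amidated=True))
    print(to_helm("CYIQDCPLG", [(1, 6)], amidated=True))
```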
HELM teething problems
• Pistoia’s HELM notation marks a significant advance
over the limitations of one-letter bioinformatics.
• Alas, its original goals didn’t include data exchange,
which has only recently been addressed by the
extensions of inlineHELM and XHELM [and fixes from
NextMove Software for improved interoperability].
• Alas, this still doesn’t address some core limitations, e.g. the
same compound (Fmoc-Ala) is encoded differently by different monomer libraries:
– Pistoia Monomer Library: PEPTIDE1{[fmoc].A}$$$$
– EBI ChEMBL Monomers: PEPTIDE1{[Fmoc_A]}$$$$
IUPAC condensed names (ChEMBL)
• The following names are machine-generated:
• H-Cys-Pro-Trp-His-Leu-Leu-Pro-Phe-Cys-OH CHEMBL501567
• H-Tyr-Pro-Phe-Phe-OtBu CHEMBL500195
• cyclo[Ala-Tyr-Val-Orn-Leu-D-Phe-Pro-Phe-D-Phe-Asn] CHEMBL438006
• H-Nle(Et)-Tyr-Pro-Trp-Phe-NH2 CHEMBL500704
• H-DL-hPhe-Val-Met-Tyr(PO3H2)-Asn-Leu-Gly-Glu-OH CHEMBL439086
• cyclo[Phe-D-Trp-Tyr(Me)-D-Pro] CHEMBL507127
• H-D-Pyr-D-Leu-pyrrolidide CHEMBL1181307
• Ac-DL-Phe-aThr-Leu-Asp-Ala-Asp-DL-Phe(4-Cl)-OH CHEMBL1791047
• H-D-Cys(1)-D-Asp-Gly-Tyr(3-NO2)-Gly-Hyp-Asp-D-Cys(1)-NH2 CHEMBL583516
• Boc-Tyr-Tyr(3-Br)-OMe CHEMBL1976073
Peptide names (ChEMBL)
• The following names are machine-generated:
• [15-L-arginine]nociceptin CHEMBL526333
• [2-4-chloro-L-phenylalanine]neuropeptide S [human] CHEMBL441576
• [1-L-threonine]cyclosporin A CHEMBL2370014
• [6-L-tryptophan]sermorelin free acid CHEMBL440438
• angiotensin II (3-8) CHEMBL261120
• nociceptin amide CHEMBL389521
• acetyl-alpha-MSH (4-10) amide CHEMBL410411
• [2-L-cysteine,13-L-cysteine]neurotensin disulfide CHEMBL3278512
• myristoyl-[1-L-lysine,4-L-tryptophan]tetrapandin 2 amide CHEMBL3288219
• [2-(4RS)-thiazolidine-4-carboxylic acid,4-L-proline]endomorphin-2 CHEMBL126611
• [22-L-serine]kalata B1 CHEMBL1801140
Advanced Peptide names
• Named peptides imply not only sequence but also
N-terminal acetylation, C-terminal amidation and
disulfide bridge topology.
• Example derivative naming operations:
– gastrin (14-17)
– motilin amide
– oxytocin free-acid
– acetyl-oxytocin
– deacetyl-abarelix
– oxytocin reduced
– endothelin-1 (1→3),(11→15)-bis(disulfide)
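The derivative operations listed above can be modelled as small transformations on a record holding the sequence and its implied modifications. The following is only a sketch under stated assumptions: the `Peptide` record and the function names are hypothetical, and the rule that a fragment loses its terminal modifications and keeps only bridges lying wholly inside the range is a design choice made for this example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Peptide:
    sequence: str
    n_acetyl: bool = False
    c_amide: bool = False
    disulfides: tuple = ()      # tuple of (i, j) residue positions, 1-based


def amide(p):
    return replace(p, c_amide=True)


def free_acid(p):
    return replace(p, c_amide=False)


def acetyl(p):
    return replace(p, n_acetyl=True)


def deacetyl(p):
    return replace(p, n_acetyl=False)


def reduced(p):
    """Open all disulfide bridges."""
    return replace(p, disulfides=())


def fragment(p, first, last):
    """Residues first..last (1-based, inclusive) with free termini (assumption)."""
    kept = tuple((a - first + 1, b - first + 1)
                 for a, b in p.disulfides if first <= a and b <= last)
    return Peptide(p.sequence[first - 1:last], disulfides=kept)


# The name "oxytocin" implies a C-terminal amide and a 1-6 disulfide.
OXYTOCIN = Peptide("CYIQNCPLG", c_amide=True, disulfides=((1, 6),))

if __name__ == "__main__":
    print(free_acid(OXYTOCIN))       # "oxytocin free-acid"
    print(reduced(OXYTOCIN))         # "oxytocin reduced"
    print(acetyl(OXYTOCIN))          # "acetyl-oxytocin"
    print(fragment(OXYTOCIN, 1, 6))  # "oxytocin (1-6)"
```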
Homodetic cycles #1
cyclo[Leu-D-Phe-Pro-Val-Orn-Leu-D-Phe-Pro-Val-Orn]
gramicidin S
Homodetic cycles #2
cyclo[OAla-Val-D-OVal-D-Val-OAla-Val-D-OVal-D-Val-OAla-Val-D-OVal-D-Val]
valinomycin
Ambiguous/Preferred forms
• [3-L-isoleucine]lypressin vs. [8-L-lysine]vasotocin
• [2-L-phenylalanine]lypressin vs. [8-L-lysine]phenypressin
• [2-L-phenylalanine]ornipressin vs. [8-L-ornithine]phenypressin
• [3-L-isoleucine]ornipressin vs. [8-L-ornithine]vasotocin
• [4-L-methionine]afamelanotide vs. [7-D-phenylalanine]α-MSH
• [Thr1,Lys2]endomorphin-1 vs. [Trp3,Phe4]tuftsin amide
• [Gln3]thyrotropin-releasing hormone vs. [Pro3]eisenin amide
• [Trp1,Val2]endomorphin-2 vs. [Val2,Phe3]gastrin tetrapeptide
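The ambiguity is easy to reproduce: two differently named derivatives can resolve to the same sequence. The sketch below uses parent sequences taken from the peptide list later in this talk and the short `[Xaa n]` form. The function names are hypothetical, and the code only enumerates the competing single-substitution names; it does not choose a preferred one.

```python
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

# Parent sequences as listed in this talk.
PARENTS = {
    "lypressin": "CYFQNCPKG",
    "vasotocin": "CYIQNCPRG",
    "phenypressin": "CFFQNCPRG",
}


def substitute(sequence, position, residue):
    """Replace the residue at a 1-based position."""
    return sequence[:position - 1] + residue + sequence[position:]


def single_substitution_names(sequence, parents=PARENTS):
    """All '[Xaa n]parent' names that reproduce the sequence with one change."""
    names = []
    for parent, ref in sorted(parents.items()):
        if len(ref) != len(sequence):
            continue
        diffs = [(i, q) for i, (r, q) in enumerate(zip(ref, sequence), start=1) if r != q]
        if len(diffs) == 1:
            pos, new = diffs[0]
            names.append(f"[{ONE_TO_THREE[new]}{pos}]{parent}")
    return names


if __name__ == "__main__":
    a = substitute(PARENTS["lypressin"], 3, "I")   # [3-L-isoleucine]lypressin
    b = substitute(PARENTS["vasotocin"], 8, "K")   # [8-L-lysine]vasotocin
    print(a, b, a == b)
    print(single_substitution_names(a))
```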
Named cyclic peptide derivatives
• Mutants of named cyclic peptides are identified by
comparing against all “rotational” permutations.
Example line notation query (CHEMBL478596)
cyclo[Ala-Gly-Thr-Phe-Val-Tyr]
Reference database line notations:
cyclo[Gly-Thr-Phe-Leu-Tyr-Thr] dichotomin B
cyclo[Ala-Gly-Thr-Phe-Leu-Tyr] dichotomin C
Resulting Sugar & Splice peptide name:
[5-L-valine]dichotomin C
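The rotational comparison can be sketched as follows. This is a brute-force illustration rather than the Sugar & Splice implementation: the function name is hypothetical, the output uses the short `[Val5]` form instead of `[5-L-valine]`, and ties are broken by lowest locants and then by reference name as an arbitrary choice.

```python
def rotations(residues):
    """All cyclic rotations of a residue list."""
    return [residues[i:] + residues[:i] for i in range(len(residues))]


def name_cyclic_variant(query, references):
    """Name a cyclic peptide as a substituted form of the closest reference.

    query: list of residue symbols; references: dict of name -> residue list.
    Locants follow the numbering of the reference line notation.
    """
    best = None
    for ref_name, ref in references.items():
        if len(ref) != len(query):
            continue
        for rotated in rotations(query):
            subs = [(pos, q) for pos, (r, q) in enumerate(zip(ref, rotated), start=1)
                    if r != q]
            key = (len(subs), [pos for pos, _ in subs], ref_name)
            if best is None or key < best[0]:
                best = (key, subs, ref_name)
    if best is None:
        return None
    _, subs, ref_name = best
    if not subs:
        return ref_name
    return "[" + ",".join(f"{res}{pos}" for pos, res in subs) + "]" + ref_name


REFERENCES = {
    "dichotomin B": ["Gly", "Thr", "Phe", "Leu", "Tyr", "Thr"],
    "dichotomin C": ["Ala", "Gly", "Thr", "Phe", "Leu", "Tyr"],
}

if __name__ == "__main__":
    query = ["Ala", "Gly", "Thr", "Phe", "Val", "Tyr"]   # CHEMBL478596
    print(name_cyclic_variant(query, REFERENCES))
```

Because every rotation of the query is tried, the result does not depend on which residue the query line notation happens to start with.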
Lower locants in cyclic peptides
• Symmetric cyclic peptides provide an interesting
challenge, where substitutions at different locants can
potentially be synonymous.
• CHEMBL1934531
– [3-(4S)-4-amino-L-proline]gramicidin S preferred
– [8-(4S)-4-amino-L-proline]gramicidin S acceptable
• CHEMBL1934536
– [3-(4R)-4-amino-L-proline]gramicidin S preferred
– [8-(4R)-4-amino-L-proline]gramicidin S acceptable
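A small sketch shows how the synonymous locants arise and how the lowest one can be selected. The function name is hypothetical, and `Xaa` is a placeholder standing in for the substituted 4-amino-proline residue.

```python
GRAMICIDIN_S = ["Leu", "D-Phe", "Pro", "Val", "Orn"] * 2


def equivalent_locant_sets(reference, query):
    """Locant tuples for every rotation of query that needs the fewest substitutions.

    Assumes reference and query have the same length. The result is sorted,
    so the first entry carries the lowest locants.
    """
    found = set()
    for i in range(len(query)):
        rotated = query[i:] + query[:i]
        locants = tuple(pos for pos, (r, q) in enumerate(zip(reference, rotated), start=1)
                        if r != q)
        found.add(locants)
    fewest = min(len(locants) for locants in found)
    return sorted(locants for locants in found if len(locants) == fewest)


if __name__ == "__main__":
    query = list(GRAMICIDIN_S)
    query[7] = "Xaa"                       # substitution drawn at position 8
    options = equivalent_locant_sets(GRAMICIDIN_S, query)
    print("equivalent:", options)
    print("preferred:", options[0])
```

For the two-fold symmetric gramicidin S, a substitution drawn at position 8 is reported as equivalent to position 3, and the lower locant is listed first.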
Scaling-up protein variant naming
• The algorithm described for naming peptides can
also be applied to naming arbitrary protein variants.
• Consider a database of the following 11 peptides:
– CFFQNCPRG phenypressin
– CFVRNCPTG annetocin
– CFWTSCPIG octopressin
– CYFQNCPRG argipressin
– CYFQNCPKG lypressin
– CYFRNCPIG cephalotocin
– CYIQNCPLG oxytocin
– CYIQNCPPG prol-oxytocin
– CYIQNCPRG vasotocin
– CYIQSCPIG seritocin
– CYISNCPIG isotocin
DAG representation of sequences
These 11 peptides may be efficiently represented and
searched as a “directed acyclic graph” [38 vs. 99 states]
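One common way to obtain such a graph is to build a prefix tree and then merge identical subtrees. The sketch below does that for the 11 peptides and searches the result while tolerating a bounded number of substitutions. It is an illustration under assumptions, not the data structure used in the talk: all names are hypothetical, and its state counts follow its own convention, so they need not match the figures quoted above.

```python
PEPTIDES = {
    "CFFQNCPRG": "phenypressin", "CFVRNCPTG": "annetocin",
    "CFWTSCPIG": "octopressin", "CYFQNCPRG": "argipressin",
    "CYFQNCPKG": "lypressin", "CYFRNCPIG": "cephalotocin",
    "CYIQNCPLG": "oxytocin", "CYIQNCPPG": "prol-oxytocin",
    "CYIQNCPRG": "vasotocin", "CYIQSCPIG": "seritocin",
    "CYISNCPIG": "isotocin",
}


class Node:
    def __init__(self):
        self.children = {}
        self.final = False


def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.final = True
    return root


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())


def minimize(node, registry, states):
    """Merge identical subtrees; return the id of the canonical state."""
    edges = {ch: minimize(child, registry, states)
             for ch, child in node.children.items()}
    signature = (node.final, tuple(sorted(edges.items())))
    if signature not in registry:
        registry[signature] = len(registry)
        states[registry[signature]] = (node.final, edges)
    return registry[signature]


def search(states, start, query, max_mismatches):
    """Return (sequence, mismatches) for stored sequences within the bound."""
    hits = []

    def walk(state, depth, mismatches, prefix):
        final, edges = states[state]
        if depth == len(query):
            if final:
                hits.append((prefix, mismatches))
            return
        for ch, nxt in edges.items():
            cost = mismatches + (ch != query[depth])
            if cost <= max_mismatches:       # prune branches that exceed the bound
                walk(nxt, depth + 1, cost, prefix + ch)

    walk(start, 0, 0, "")
    return sorted(hits)


if __name__ == "__main__":
    trie = build_trie(PEPTIDES)
    states = {}
    start = minimize(trie, {}, states)
    print("trie nodes:", count_nodes(trie), "DAG states:", len(states))
    for seq, distance in search(states, start, "CYIQDCPLG", 1):
        print(seq, PEPTIDES[seq], distance)
```

Shared prefixes are stored once by the trie and shared suffixes once by the merge step. The mismatch bound lets the search abandon a branch as soon as it has accumulated too many substitutions.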
Entirety of UniProt/Swiss-Prot
• Using this representation, all 540546 protein
sequences in uniprot_sprot, which contain over
192M amino acids, require 142M states (1.4 GB).
• This data structure allows close analogues to be
identified much faster than using NCBI blastp.
• For example, all 540546 sequences can be queried
against this database (i.e. all-against-all) in ~9m30s
on a single core on a laptop.
• The sequence from PDB 1CRN (crambin, 46 AA) is
canonically named as [L25I]P01542 in 0.002s.
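The variant notation itself is straightforward to generate once the closest reference has been found. The sketch below uses a plain linear scan with Hamming distance in place of the graph search, so it shows only the naming step and says nothing about the performance figures. The function names and the accessions `REF_OXT` and `REF_AVT` are hypothetical placeholders.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def name_variant(query, references):
    """Name query as '[<ref><pos><new>,...]ACCESSION' against the closest reference.

    references: dict of accession -> sequence. Only equal-length references
    are considered (substitutions only, no insertions or deletions).
    """
    candidates = [(hamming(seq, query), acc)
                  for acc, seq in references.items() if len(seq) == len(query)]
    if not candidates:
        return None
    _, accession = min(candidates)          # fewest substitutions, then accession
    ref = references[accession]
    subs = [f"{r}{pos}{q}"
            for pos, (r, q) in enumerate(zip(ref, query), start=1) if r != q]
    return accession if not subs else "[" + ",".join(subs) + "]" + accession


# Hypothetical accessions for two short reference sequences.
REFERENCE_SEQUENCES = {"REF_OXT": "CYIQNCPLG", "REF_AVT": "CYIQNCPRG"}

if __name__ == "__main__":
    print(name_variant("CYIQDCPLG", REFERENCE_SEQUENCES))
```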
Application to precision medicine
• A more realistic example is the sequence of the
gene “spastic paraplegia 4” with six mutations from
OMIM:604277, which can be canonically named as
[I344K,S362C,N386S,D441G,C448Y,R499C]Q9UBP0
• Run-time for this query is 0.2s.
• By comparison, blastp 2.2.29+ takes about 6s.
– With default arguments, NCBI blastp run time is 7s.
– Only 6s with -num_descriptions 1 -num_alignments 1.
30 second summary
• This talk describes the development of software for
both naming peptides and reading peptide names
matching the de facto standard practices currently
followed by biochemists.
• Unlike computer representations such as Pistoia HELM
or InChI keys, these names and identifiers match
those typically found in the scientific literature and
vendor catalogues.
• A significant application of this technology is to check
and correct peptide representations in databases.
Acknowledgements
• Joanna Sharman, IUPHAR, Edinburgh University, UK.
• Lisa Sach-Peltason, Hoffmann-La Roche, Basel.
• Joann Prescott-Roy, Novartis, Boston, MA.
• Greg Landrum, Novartis, Basel, Switzerland.
• Evan Bolton, NCBI PubChem project, Bethesda, MD.
• Daniel Lowe, NextMove Software, Cambridge, UK.
• John May, NextMove Software, Cambridge, UK.
