What is a protein?
Antibody
ATP synthase
Hemoglobin
Catalase
RNA polymerase
Actin
Insulin
Collagen
Rhodopsin
p53
GFP
InterPro
Protein classification resources at the EBI
• InterPro is the main resource
for protein classification at the
EBI.
• It is a single searchable resource for:
• Patterns
• Profiles
• Fingerprints
• HMMs
• From a number of different
member databases
Protein classification
Proteins can be classified into groups according to:
• The FAMILIES to which they belong
• The DOMAINS they contain
• The SEQUENCE FEATURES they possess
Based on structural properties:
• Domains
• Sequence features
Tools/Models:
• Protein signatures
Evaluated through:
• Sequence similarity
• Structural similarity
What are protein FAMILIES?
Protein Family
Group of proteins that
share a common
evolutionary origin:
• Related functions
• Sequence similarity
• Structural similarity
Arranged into hierarchies:
Superfamily > Family > Subfamily
Set of homologous proteins
What are protein DOMAINS?
Domains are structural units in a protein.
• Continuous or discontinuous
(multiple motifs)
• Often associated with different
functions.
• Between 40 and 500 residues.
Domains may exist in a variety of biological
contexts, where similar domains can be found in
proteins with different functions.
Catalytic activity
Subcellular targeting
DNA binding
ACP-binding domain
Catalytic
domain
E. coli - Malonyl-CoA:acyl carrier protein
PDB: 1MLA
SEQUENCE ANNOTATIONS:
Groups of amino acids that confer
certain characteristics upon a
protein.
• A few amino acids long.
• Often nested within domains
Examples:
• Active sites
• Binding sites
• Signals
• Post-translational modification
• Repeats
• Motifs
What are SEQUENCE FEATURES?
[Figure: a protein sequence annotated with an active site, a signal, repeats, domains and a PDZ-binding domain]
Predictive models used to classify proteins into
families and to predict the presence of important domains
or sequence features
What are PROTEIN SIGNATURES?
MGAFGHGFG
TYHKLAALED
GTLKHHAKLQ
PHLSLLCMSKL
DCKLPQGFFF
Workflow:
Protein family/domain → MSA → Build predictive model → Database search → Model refinement (maturation) → Protein signature
Unknown protein → Protein signature → Significant match
Different approaches can be used
to generate signatures.
1. Patterns
2. Profiles
3. Fingerprints
4. Hidden Markov Models
(HMMs)
Signature Types
Single motif methods
Multiple motif methods
Full alignment methods
1) Patterns
Regular expressions to model:
⚙ → Sequence features
MSA (motif):
A L V K L I S G K
A I V H E S A T K
C H V R D L S C K
C P V E S T I S K
A F V T I Y E F R
C G V R Y T D K R
Pattern/Regular expression:
[AC]-x-V-x(4)-{ED}-[KR]
2) Profiles
MSA to PSSM (position-specific scoring matrix):
⚙ → Families and Domains
PSSM (rows: amino acids; columns: sequence positions):
      1    2    3    4    5    6    7    ..   n
A     -    -    0.2  -    -    -    -    -    -
R     -    -    0.3  -    -    -    -    -    -
F     -    -    0.5  -    -    -    -    -    -
G     -    -    0    -    -    -    -    -    -
..    -    -    0.8  -    -    -    -    -    -
Y     -    -    1.2  -    -    -    -    -    -
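As an illustration of how a pattern signature works, the sketch below translates the PROSITE-style pattern [AC]-x-V-x(4)-{ED}-[KR] into a regular expression and checks it against the six aligned motif sequences from the slide. The function name `prosite_to_regex` is hypothetical, and the translation covers only the syntax elements used here (residue classes, `x`, exclusions in braces, and repeat counts); the full PROSITE syntax has more features, such as anchors.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    tokens = []
    for element in pattern.split("-"):
        # Separate the residue specification from an optional repeat, e.g. x(4).
        parsed = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = parsed.group(1), parsed.group(2)
        if core == "x":
            token = "."                        # x: any residue
        elif core.startswith("{"):
            token = "[^" + core[1:-1] + "]"    # {ED}: anything except E or D
        else:
            token = core                       # V or [AC]: used as-is
        if repeat:
            token += "{" + repeat + "}"        # x(4) becomes .{4}
        tokens.append(token)
    return "".join(tokens)


# The six motif sequences from the slide's alignment.
MSA = ["ALVKLISGK", "AIVHESATK", "CHVRDLSCK",
       "CPVESTISK", "AFVTIYEFR", "CGVRYTDKR"]

regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}-[KR]")
matches = [bool(re.fullmatch(regex, seq)) for seq in MSA]
print(regex)
print(matches)
```

Every sequence in the alignment satisfies the pattern, while a sequence with E or D in the second-to-last position (for example ALVKLISEK) is rejected by the `{ED}` exclusion.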
3) Fingerprints
MSA → Profiles → Fingerprint:
⚙ → Families
MSA: Motif 1, Motif 2, Motif 3 → Profiles 1, 2, 3 → Fingerprint signature
Correct order and spacing
4) HMMs
"Probabilistic profiles"
Convert multiple sequence alignments
into position-specific scoring matrices
(PSSMs).
MSA to PSSM to model:
⚙ → Homologous sequences
      1    2    3    4    5    6    7    ..   n
A     -    -    0.2  -    -    -    -    -    -
R     -    -    0.3  -    -    -    -    -    -
F     -    -    0.5  -    -    -    -    -    -
G     -    -    0    -    -    -    -    -    -
..    -    -    0.8  -    -    -    -    -    -
Y     -    -    1.2  -    -    -    -    -    -
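Converting an MSA into a PSSM can be sketched as follows. This is a minimal log-odds version under stated assumptions, not the scoring scheme of any particular member database: the alignment is gap-free, the background frequency is uniform (1/20), and a pseudocount of 1 is added to every residue count. The function names `build_pssm` and `score_sequence` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per alignment column."""
    n_seqs = len(alignment)
    background = 1.0 / len(AMINO_ACIDS)      # assumed uniform background
    pssm = []
    for pos in range(len(alignment[0])):
        column = [seq[pos] for seq in alignment]
        scores = {}
        for aa in AMINO_ACIDS:
            # Smoothed frequency of this residue in this column.
            freq = (column.count(aa) + pseudocount) / (
                n_seqs + pseudocount * len(AMINO_ACIDS))
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, sequence):
    """Sum the position-specific scores of a sequence of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))


# Gap-free alignment taken from the pattern example on the slides.
alignment = ["ALVKLISGK", "AIVHESATK", "CHVRDLSCK",
             "CPVESTISK", "AFVTIYEFR", "CGVRYTDKR"]
pssm = build_pssm(alignment)

member_score = score_sequence(pssm, "ALVKLISGK")
unrelated_score = score_sequence(pssm, "WWWWWWWWW")
print(round(member_score, 2), round(unrelated_score, 2))
```

Unlike a pattern, the PSSM gives a graded score: the fully conserved V in column 3 contributes the largest positive value, and residues never seen in a column contribute negative values.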
When to use InterPro?
You can use InterPro if you have an amino acid sequence or set
of sequences and you want to know:
www.ebi.ac.uk/interpro/
• What they are, what family
they belong to
• What their function is and how
it can be explained in
structural terms
• You can also use InterPro for a
variety of other purposes, such
as examining the structural or
functional predictions for any
sequence already in the
UniProt database.
[Figure: InterPro and its 13 member databases]
Signature methods: Hidden Markov Models, Profiles, Patterns, Composition, Prediction
Member databases: CATH, Pfam, TIGRFAM, HAMAP, PRINTS, Prosite patterns, MobiDB, SUPERFAMILY, SMART, PIRSF, Prosite profiles, SFLD, CDD, PANTHER
Biological entities: Domains & Families, Homologous Superfamilies, Features & Sites, Intrinsic Disorder
InterPro data types: 1) InterPro entries 2) Database signatures 3) Structure 4) Protein 5) Taxonomy 6) Proteome
Also shown: UniProt, PDB
UniProt: Universal Protein
resource
1. UniProtKB (Knowledgebase):
1. Swiss-Prot ⭐: Manually
annotated entries
2. TrEMBL: Automatically
annotated entries
2. UniParc (archive):
• Non-redundant database
• Protein sequences from
public databases
3. UniRef (Reference Clusters):
• Three databases: UniRef100, UniRef90, UniRef50
• Sequences are clustered by
their sequence identity (%)
UniProt
Protein Data Bank
Database for the three-dimensional
structural data of large biological
molecules.
• X-ray crystallography
• Nuclear Magnetic Resonance
Spectroscopy
• Cryogenic electron microscopy
~175,000 structures
UCSF Chimera (recommendation)
• Structure visualization
• Molecular graphics
• Topology and structural analysis
• UniProt annotation
• MSA and Blastp
• Interface:
• 3D-modelling
• Pocket data
• Molecular docking
• Molecular dynamics
3D protein structure prediction
Potential energy
landscape
Accuracy
Difficulty
• Comparative modelling
– Requires: Known fold + clear
homology
• Fold recognition (threading)
– Requires: Known fold
• Ab initio / new fold methods
– Requires: only target sequence
Homology modeling
INPUT: query sequence Q
INPUT: Database of protein structures
1. Find a protein P with high sequence similarity to Q
2. Return P’s structure as an approximation to Q’s structure
Threading & Fragment assembly
INPUT: query sequence Q
INPUT: Database of known folds or structure fragments
1. Find a set of fragments F that Q can be aligned with
2. Return F as an approximation to Q’s structure
Molecular dynamics
INPUT: query sequence Q
1. Use the laws of physics to simulate the folding of Q
Protein structure prediction
[Flowchart, starting from a protein sequence]
• Database similarity search (BLAST): does the sequence align with a protein of known structure? (YES/NO)
• Protein family sequence search (InterPro): relationship to known structure? (YES/NO)
• Secondary structure prediction and 3D-structure prediction: hints for domain assignment? Function?
• Homology modeling → Predicted 3D structural model
• 3D structural analysis in laboratory
AlphaFold
An overview
AlphaFold
MGAFGHGFG
TYHKLAALED
GTLKHHAKLQ
PHLSLLCMF…
What is it?
- AlphaFold (AF) is an artificial intelligence program
- Developed by DeepMind (Google)
The Goal:
- Predicting the three-dimensional
structure that a protein will adopt
based solely on its amino acid sequence
It “solves” two main problems:
1. Sequence-Structure gap
2. Protein folding
Why solve these problems?
Jumper, J., Evans, R., Pritzel, A. et al. Nature 596, 583–589 (2021).
Sequence-Structure gap
- 1958: determination of
the first protein
structure.
- John Kendrew & Max
Perutz
- Structure determination
(experimental):
- NMR
- X-ray crystallography
- Cryo-Electron
microscopy
- Protein Data Bank:
- Total: ~170,000
- Unique: ~100,000
AlphaFold 1
The protein folding problem
- 1972: Christian Anfinsen, Nobel Prize in
Chemistry.
- “It should be possible to determine a
protein’s three-dimensional shape based
solely on its sequence”
- A typical protein could adopt
10^300 different configurations
- Sampling them all would take longer than the age of the universe
- However, in nature, proteins spontaneously fold
into their functional shape.
- Cyrus Levinthal’s paradox (1969)
- An open research problem for 50 years
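The scale of Levinthal's paradox can be checked with a few lines of arithmetic. The 10^300 figure is the one quoted on the slide; the sampling rate of 10^13 conformations per second and the age of the universe (about 13.8 billion years) are assumptions added here for illustration only.

```python
CONFORMATIONS = 10 ** 300            # figure quoted on the slide
SAMPLES_PER_SECOND = 10 ** 13        # assumed sampling rate, for illustration only
AGE_OF_UNIVERSE_YEARS = 13.8e9       # approximate age of the universe
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Integer arithmetic keeps the huge numbers exact.
search_seconds = CONFORMATIONS // SAMPLES_PER_SECOND
age_seconds = int(AGE_OF_UNIVERSE_YEARS * SECONDS_PER_YEAR)
universe_lifetimes = search_seconds // age_seconds

print(f"Exhaustive search: about 10^{len(str(search_seconds)) - 1} seconds")
print(f"About 10^{len(str(universe_lifetimes)) - 1} times the age of the universe")
```

Even with a far more generous sampling rate the conclusion is unchanged: a random exhaustive search cannot explain folding, so proteins must follow a guided path through the energy landscape.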
The protein folding problem
CASP
Critical Assessment of
Techniques for Protein
Structure prediction
• The protein folding Olympics
• The state of the art in
protein structure prediction
- The competition:
- Since 1994
- Takes place every two years
- Last competition: CASP14 – 2020
- Organizers:
- Know both the sequence and the
structure
Participants:
- Receive only the protein’s
sequence
- Must blindly predict the
structure of the proteins
- Predictions: compared with
the experimental data
Homology modeling
INPUT: query sequence Q
INPUT: Database of protein structures
1. Find a protein P with high sequence similarity to Q
2. Return P’s structure as an approximation to Q’s structure
Threading & Fragment assembly
INPUT: query sequence Q
INPUT: Database of known folds or structure fragments
1. Find a set of fragments F that Q can be aligned with
2. Return F as an approximation to Q’s structure
Molecular dynamics
INPUT: query sequence Q
1. Use the laws of physics to simulate the folding of Q
• Force field
• Molecular mechanics
CASP before AlphaFold
The metric:
- How well does the prediction agree
with the experimental data?
GDT: Global Distance Test
- Compares two structures
- From 0 to 100 (%)
- Greater is better
- Uses distance cutoffs
- Uses alpha carbons (Cα)
- Less sensitive to local outliers than RMSD
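A simplified version of the GDT score can be sketched as below. It uses the common GDT_TS convention of averaging over the 1, 2, 4 and 8 Å cutoffs, which is not stated on the slide, and it assumes the two structures are already superimposed; the real CASP evaluation also searches over many superpositions to maximise the score. The function name `gdt_ts` and the toy coordinates are hypothetical.

```python
import math


def gdt_ts(predicted, experimental, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS for two lists of already-superimposed Cα coordinates."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("Both structures need the same, non-zero number of residues")
    # Distance between each predicted Cα and its experimental counterpart.
    distances = [math.dist(p, e) for p, e in zip(predicted, experimental)]
    # Fraction of residues within each distance cutoff.
    fractions = [sum(d <= cutoff for d in distances) / len(distances)
                 for cutoff in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy example: four residues displaced by 0.5, 1.5, 3.0 and 10.0 Å along x.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.5, 0.0, 0.0), (5.3, 0.0, 0.0), (10.6, 0.0, 0.0), (21.4, 0.0, 0.0)]

score = gdt_ts(predicted, experimental)
perfect = gdt_ts(experimental, experimental)
print(score, perfect)
```

The single residue that is 10 Å off lowers the score but cannot dominate it, because it simply falls outside every cutoff, whereas it would inflate an RMSD quadratically.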
Homology
modeling
Threading &
Fragment assembly
Molecular
dynamics
CASP and AlphaFold
CASP14: 152 targets
Jumper, J., Evans, R., Pritzel, A. et al. Nature 596, 583–589 (2021).
AlphaFold 1
The model:
• CASP13 (2018)
• Convolutional neural network
Training:
• Structures: 31,247 domains
• Sequences: UniClust30
Senior, et al. (2020). Nature, 577(7792), 706–710.
X: Sequence → y: Structure
MGAFGHGFG
TYHKLAALED
GTLKHHAKLQ
PHLSLLCMF…
AlphaFold 1
• Input:
• Protein amino acid sequence
• Multiple Sequence Alignments (MSA):
• Profile features
MSA to Profiles - PSSM (position-specific scoring matrix):
⚙ → Families and Domains
PSSM (rows: amino acids; columns: sequence positions):
      1    2    3    4    5    6    7    ..   n
A     -    -    0.2  -    -    -    -    -    -
R     -    -    0.3  -    -    -    -    -    -
F     -    -    0.5  -    -    -    -    -    -
G     -    -    0    -    -    -    -    -    -
..    -    -    0.8  -    -    -    -    -    -
Y     -    -    1.2  -    -    -    -    -    -
Senior, et al. (2020). Nature, 577(7792), 706–710.
MSA → Profile → Structure optimization
AlphaFold 1
Senior, et al. (2020). Nature, 577(7792), 706–710.
Input (X): sequence (MGAFGHGFG TYHKLAALED GTLKHHAKLQ PHLSLLCMF…), MSA, profile
The ML model: convolutional neural network
Output (y): the distogram
The central component:
• A convolutional neural network
• Trained on PDB structures
• It predicts the distances d_ij
between the Cβ atoms of residue pairs
(i, j) of a protein.
AlphaFold 1
Senior, et al. (2020). Nature, 577(7792), 706–710.
AlphaFold 1
The distogram
Residue 29
The predicted probability distributions for
distances of residue 29 to all other residues (41)
Senior, et al. (2020). Nature, 577(7792), 706–710.
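To make the idea of a distogram concrete, the sketch below turns the network output for one residue pair into a probability distribution over distance bins and summarises it. The binning (64 equal bins between 2 and 22 Å) is an assumption here, as are the function names and the made-up logits; this is not DeepMind's code.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bin_centers(n_bins=64, d_min=2.0, d_max=22.0):
    """Centres of equally spaced distance bins (assumed binning)."""
    width = (d_max - d_min) / n_bins
    return [d_min + (i + 0.5) * width for i in range(n_bins)]


def summarize_pair(logits, centers):
    """Return the distribution, its expected distance and its most likely distance."""
    probs = softmax(logits)
    expected = sum(p * c for p, c in zip(probs, centers))
    best_bin = max(range(len(probs)), key=probs.__getitem__)
    return probs, expected, centers[best_bin]


centers = bin_centers()
# Made-up logits for one residue pair, peaked at bin 10.
logits = [-((i - 10) ** 2) / 8.0 for i in range(64)]
probs, expected_distance, most_likely_distance = summarize_pair(logits, centers)
print(round(expected_distance, 2), most_likely_distance)
```

A full distogram holds one such distribution for every residue pair (i, j); a sharp peak indicates a confident distance prediction, while a flat distribution indicates uncertainty.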
Input (X): sequence (MGAFGHGFG TYHKLAALED GTLKHHAKLQ PHLSLLCMF…), MSA, profile
The ML model: convolutional neural network
Output (y): the distogram
Gradient descent:
• Rotate the phi and psi angles
• Match the predicted Cβ–Cβ
distances
AlphaFold 1 Protein folding
Senior, et al. (2020). Nature, 577(7792), 706–710.
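The gradient-descent step can be illustrated with a toy model. The sketch below is a stand-in, not AlphaFold's procedure: it folds a 2D chain with fixed bond lengths by adjusting its turning angles (playing the role of phi and psi), uses a plain squared-error loss against target distances rather than a potential derived from the distogram, and estimates gradients by finite differences with a simple step-halving rule. All names and numbers are hypothetical.

```python
import math


def build_chain(angles, bond=1.0):
    """Place points in 2D; each angle is the turn taken before the next bond."""
    x, y, heading = 0.0, 0.0, 0.0
    points = [(x, y)]
    for angle in angles:
        heading += angle
        x += bond * math.cos(heading)
        y += bond * math.sin(heading)
        points.append((x, y))
    return points


def distance_matrix(points):
    return [[math.dist(p, q) for q in points] for p in points]


def loss(angles, target):
    """Squared error between the chain's pairwise distances and the targets."""
    current = distance_matrix(build_chain(angles))
    n = len(current)
    return sum((current[i][j] - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def descend(angles, target, lr=0.05, steps=300, eps=1e-6):
    """Gradient descent on the angles with finite-difference gradients."""
    angles = list(angles)
    current = loss(angles, target)
    for _ in range(steps):
        grad = []
        for k in range(len(angles)):
            bumped = angles[:]
            bumped[k] += eps
            grad.append((loss(bumped, target) - current) / eps)
        trial = [a - lr * g for a, g in zip(angles, grad)]
        trial_loss = loss(trial, target)
        if trial_loss < current:
            angles, current = trial, trial_loss   # accept the improving step
        else:
            lr *= 0.5                             # otherwise shrink the step size
    return angles, current


# "Predicted" distances come from a known toy fold; start from a different shape.
target = distance_matrix(build_chain([0.0, 0.6, -0.4, 0.8, 0.5]))
start = [0.0, 0.1, 0.1, 0.1, 0.1]
initial_loss = loss(start, target)
folded_angles, final_loss = descend(start, target)
print(round(initial_loss, 4), round(final_loss, 6))
```

Working in angle space keeps bond lengths fixed by construction, which is the same reason the slide describes rotating phi and psi rather than moving atoms freely.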
AlphaFold
References:
1. Senior, A. W., Evans, R., Jumper, J. et al. Improved
protein structure prediction using potentials from
deep learning. Nature 577, 706–710 (2020).
2. Jumper, J., Evans, R., Pritzel, A. et al. Highly
accurate protein structure prediction with
AlphaFold. Nature 596, 583–589 (2021).

An Overview to Protein bioinformatics
