Validation of Time Series Technique for Prediction of Conformational States of Amino Acids

Dr. Sangeeta Sawant, Bioinformatics Centre, UoP, Pune (Guide)

Dr. Mohan Kale, Dept. of Statistics, UoP, Pune (Co-guide)
Concepts Used
  Ramachandran plot
  Time series
  AR, ARMA, ARIMA models
  Akaike information criterion (AIC)
  Euclidean distance
  Potential values for AA residues
  Feynman problem-solving algorithm
Ramachandran Plot
Time Series
A time series is a sequence of data points or observations, typically measured at successive time instants spaced at uniform intervals.

It is analysed to identify patterns and variations, and to forecast future values.
Time Series Models (probability model)


Autoregressive (AR) models
Autoregressive moving average (ARMA) models
Autoregressive integrated moving average (ARIMA) models

In each of these models, the current value depends linearly on previous data points.
Materials & Methods
Language: R
Editors: RStudio, Tinn-R
R packages: bio3d, itsmr, forecast, tseries, timsac, wordcloud
Standalone software: ITSM 2000
Help forums: R Nabble, BioStars, stats.stackexchange
Methods

A) Calculation of potential values for AA residues
B) Forecasting of AA states
C) Clustering
Calculation of Potential values for AA residues

Dataset-I

3829 proteins selected from the PDB (Protein Data Bank) using the PDBSelect list (25% sequence similarity)

Experimental method: X-ray; R-factor: 0 to 0.25 (best-resolved structures)

Structures screened for chain breaks and for entries containing only CA atoms

Phi-psi values computed with torsion.pdb() of "bio3d" and verified via PDBGoodies (IISc, Bangalore) and the Protein Angle Descriptor utility (IIT, Delhi)

Each amino-acid residue assigned conformational state 1, 2, or 3 according to the region (I, II, or III) of the Ramachandran plot in which its phi-psi values fall
Figure No. 2: Ramachandran plot (psi against phi) showing the three conformational regions I, II and III

   I: closely/tightly packed conformations, phi -140 to 0, psi -100 to 0
   II: extended conformations, phi -180 to 0, psi 80 to 180
   III: all remaining conformations
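
The region boundaries above translate directly into a lookup from a (phi, psi) pair to a state. The following is a minimal Python sketch, not the code used in the project (which worked in R with bio3d); the function name `assign_state` and the choice to treat the boundaries as inclusive are assumptions made for illustration.

```python
def assign_state(phi, psi):
    """Map a (phi, psi) pair in degrees to conformational state 1, 2 or 3.

    Region limits are taken from the figure caption; inclusive
    boundaries are an assumption of this sketch.
    """
    if -140 <= phi <= 0 and -100 <= psi <= 0:
        return 1  # region I: closely/tightly packed conformations
    if -180 <= phi <= 0 and 80 <= psi <= 180:
        return 2  # region II: extended conformations
    return 3      # region III: everything else


# Hypothetical example angles (not taken from the dataset)
states = [assign_state(phi, psi) for phi, psi in [(-60, -45), (-120, 130), (60, 45)]]
```

Applied along a chain, this turns the list of torsion angles into a sequence of states indexed by residue position.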
Frequencies of single residues in the three states were calculated and normalized following Kolaskar, A.S. & Sawant, S.V. (1996):

$$P_{ik} = \frac{n_{ik}\, N}{\sum_{i} n_{ik} \; \sum_{k} n_{ik}}$$

$n_{ik}$: number of times the AA of type i occurs in state k (k = 1, 2, 3)
$N$: total number of residues
$P_{ik}$: potential value of the AA of type i in state k
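
As a worked illustration of the normalization, the sketch below computes P_ik from a table of counts. It assumes the two sums in the denominator run over residue types (total residues in state k) and over states (total occurrences of residue type i); the function name `potentials` and the example counts are hypothetical, not values from the dataset.

```python
def potentials(counts):
    """counts maps residue type -> [n_i1, n_i2, n_i3]; returns the same shape of P_ik."""
    n_states = 3
    total = sum(sum(row) for row in counts.values())              # N
    per_state = [sum(row[k] for row in counts.values())           # sum over i of n_ik
                 for k in range(n_states)]
    result = {}
    for aa, row in counts.items():
        per_residue = sum(row)                                    # sum over k of n_ik
        result[aa] = [row[k] * total / (per_state[k] * per_residue)
                      for k in range(n_states)]
    return result


# Hypothetical counts for two residue types
example = potentials({"A": [6, 2, 2], "G": [2, 4, 4]})
```

A value above 1 means the residue type occurs in that state more often than expected from the overall state frequency; a value below 1 means less often.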

Potential values (table provided as a separate PDF)
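Time Series
[Figure: time series plot]
ACF Plot
[Figure: ACF plot]
ACF: Stationary vs. Non-stationary
[Figure: ACF plots, panels labelled Stationary and Non-stationary]
Time Series and ACF Plot
[Figure: panels labelled Stationary, Non-stationary, Stationary]
Stationary TS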
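
The ACF panels contrast stationary and non-stationary series. A minimal sketch of the sample autocorrelation function is given below for illustration; the deck produced its plots in R, and the function name `sample_acf` and both example series are assumptions of this sketch. The sample ACF of a trending (non-stationary) series decays slowly, while that of a stationary series drops off quickly.

```python
def sample_acf(x, max_lag):
    """Sample autocorrelations r(0..max_lag) of a sequence x."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    return [
        sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]


# Hypothetical series: a linear trend (non-stationary) and an alternating sequence
trend_acf = sample_acf(list(range(50)), 5)        # stays close to 1 at small lags
alternating_acf = sample_acf([1, -1] * 25, 5)     # flips sign at every lag
```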
TS model building

AR(p)
ARMA(p, q)
ARIMA(p, d, q)
Best model selection

Candidate AR(p), ARMA(p, q) and ARIMA(p, d, q) models are compared by AIC; the model with the minimum AIC is taken as the best.
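
To make the selection step concrete, here is a minimal sketch for the AR case only. It fits AR(p) by the Yule-Walker equations (Levinson-Durbin recursion) and scores each order with the Gaussian approximation AIC ≈ n·ln(σ̂²) + 2(p + 1). Both choices are assumptions of this sketch: the deck used R packages, whose likelihood-based criteria will give different numbers. All function names are hypothetical.

```python
import math


def autocovariances(x, max_lag):
    """Biased sample autocovariances gamma(0..max_lag)."""
    n = len(x)
    mean = sum(x) / n
    return [
        sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / n
        for k in range(max_lag + 1)
    ]


def levinson_durbin(gamma, p):
    """Yule-Walker AR(p) coefficients and innovation variance from autocovariances."""
    phi, sigma2 = [], gamma[0]
    for k in range(1, p + 1):
        acc = gamma[k] - sum(phi[j] * gamma[k - 1 - j] for j in range(k - 1))
        refl = acc / sigma2  # partial autocorrelation at lag k
        phi = [phi[j] - refl * phi[k - 2 - j] for j in range(k - 1)] + [refl]
        sigma2 *= 1 - refl * refl
    return phi, sigma2


def ar_aic_table(x, max_p):
    """Approximate AIC for AR(1)..AR(max_p)."""
    gamma = autocovariances(x, max_p)
    n = len(x)
    return {
        p: n * math.log(levinson_durbin(gamma, p)[1]) + 2 * (p + 1)
        for p in range(1, max_p + 1)
    }


def select_ar_order(x, max_p):
    """Order with the minimum approximate AIC."""
    table = ar_aic_table(x, max_p)
    return min(table, key=table.get)
```

The same pattern (fit every candidate, keep the one with the smallest criterion) extends to ARMA and ARIMA candidates, which need a numerical likelihood fit rather than this closed-form recursion.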
Forecasting of AA states for best models

E.g. for an AR(1) process,

$$X_t = \phi X_{t-1} + Z_t, \qquad t = 0, \pm 1, \dots$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $|\phi| < 1$.

The observed potential values of the AAs are the data points, and the residue index serves as t. Prediction starts from the 2nd position and runs up to the last index, using forecast() from "itsmr".
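
A minimal sketch of this position-by-position prediction for the AR(1) case follows. It is not the itsmr forecast() routine: it assumes a mean-corrected AR(1) with φ estimated by the lag-1 sample autocorrelation (the Yule-Walker estimate), and it reads "prediction starts from the 2nd position" as in-sample one-step-ahead prediction. The function names and the example potentials are hypothetical.

```python
def fit_ar1(x):
    """Return (mean, phi) with phi the lag-1 sample autocorrelation."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    c1 = sum((x[t] - mean) * (x[t + 1] - mean) for t in range(n - 1))
    return mean, c1 / c0


def one_step_predictions(x):
    """Predict x[t] from x[t-1] for t = 1..len(x)-1 (2nd position to last index)."""
    mean, phi = fit_ar1(x)
    return [mean + phi * (x[t - 1] - mean) for t in range(1, len(x))]


# Hypothetical potential values along a short chain
series = [1.10, 0.95, 1.30, 1.25, 0.80, 0.90, 1.15, 1.05]
predicted = one_step_predictions(series)
```

Each predicted potential can then be compared with the observed one at the same index.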
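Similarly, for an ARMA(1,1) process (ARIMA applies the same form to the differenced series),

$$X_t = \phi X_{t-1} + Z_t + \theta Z_{t-1}, \qquad \theta + \phi \neq 0$$

Forecasting quality is measured by the coefficient of determination ($R^2$):

$$R^2 = 1 - \frac{\sum_i (Y_i - F_i)^2}{\sum_i (Y_i - \bar{Y})^2}$$

$Y_i$ = true (observed) value
$F_i$ = forecasted (predicted) value
$\bar{Y}$ = mean of the observed values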
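
The coefficient of determination can be computed directly from the two sequences. A short sketch, with the function name `r_squared` chosen here for illustration:

```python
def r_squared(observed, forecast):
    """R^2 = 1 - sum((Y_i - F_i)^2) / sum((Y_i - mean(Y))^2)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, forecast))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1 - ss_res / ss_tot
```

A perfect forecast gives 1, a forecast no better than the mean of the observations gives 0, and a worse forecast gives a negative value.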
Clustering
Dataset-II

SCOP domain-specific PDB-style files (ATOM & HETATM records) downloaded from the ASTRAL Compendium for Sequence and Structure Analysis, release 1.75 (June 2009)

Files scanned for chain breaks and for the presence of CA atoms only; files with breaks were set aside

Domains of length 100-110 AA residues, e.g. file 10gsa1_a_133_pot.txt
The potential values of each domain were treated as a time series and classed as a stationary (506) or non-stationary (1692) process

Non-stationary series were set aside for further transformations

AR, ARMA & ARIMA models fitted

Best model chosen by the minimum-AIC criterion

Best models: AR (22), ARMA (484), ARIMA (none)

Distance matrices computed from the AR(p) and ARMA(p, q) models (Euclidean distance)

Dendrograms built by neighbour-joining (PHYLIP package)
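
The slides do not say how models of different orders were placed in a common space before the Euclidean distance was taken. The sketch below assumes one simple option, padding shorter coefficient vectors with zeros; that padding rule, the function names, and the example coefficients are all hypothetical.

```python
import math


def pad(vector, length):
    """Right-pad a coefficient vector with zeros (an assumption of this sketch)."""
    return list(vector) + [0.0] * (length - len(vector))


def distance_matrix(models):
    """models maps a domain name to its coefficient vector.

    Returns the sorted names and the symmetric matrix of pairwise
    Euclidean distances between the zero-padded vectors.
    """
    names = sorted(models)
    width = max(len(models[name]) for name in names)
    padded = {name: pad(models[name], width) for name in names}
    matrix = [[math.dist(padded[a], padded[b]) for b in names] for a in names]
    return names, matrix


# Hypothetical AR coefficients for three domains
names, matrix = distance_matrix({
    "dom1": [0.5],
    "dom2": [0.5, 0.0],
    "dom3": [0.5, 3.0, 4.0],
})
```

A matrix of this kind, written out in PHYLIP's distance-matrix format, is the usual input for a neighbour-joining program.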
[Figure: dendrogram of time series, AR models (22)]
[Figure: dendrogram of time series, ARMA models (484)]

• Phylowidget link
Results & Discussion

For each AA of every protein, the 3D Cartesian coordinates were transformed into 2D information, i.e. the conformational state of the AA. Potential values were then computed and used, with the residue index playing the role of time, to build a statistical time series model for forecasting.
AR values


The autoregressive order (p) ranged from 1 to 18.

Both short- and long-range dependence are present, reflecting variations in protein structural arrangements.

These variations indicate the diversity exhibited by structural components.
Table No. II: Forecasting results for AR models (44 of the best 90 models; for 46 models, class information was not found in the SCOP database). All values are % accuracy.

SCOP class (no. of models)     AA seq (%)         States (%)
                               Max      Min       Max      Min
All α (a), 12                  26.82    2.41      55.68    21.77
All β (b), 5                   16.30    8.88      51.11    44.76
α/β (c), 9                     27.77    1.47      54.76    30.64
α+β (d), 13                    28.57    7.04      51.70    19.04
Small proteins (g), 1          19.51    -         48.78    -
Coiled-coil (h), 3             22.5     5.88      26       15
Designed proteins (k), 1       29.03    -         26.88    -


Prediction accuracy is higher for conformational states than for AA residues, owing to the low resolution of the forecasted potential values.
Table No. III: Forecasting results for ARMA models (557 of the best 1239 models; for 682 models, class information was not found in the SCOP database). All values are % accuracy.

SCOP class (no. of models)          AA seq (%)         States (%)
                                    Max      Min       Max      Min
All α (a), 123                      32.55    2.63      65.77    8.06
All β (b), 146                      32.81    3.96      65.01    17.94
α/β (c), 120                        43.47    5         62.89    8.97
α+β (d), 127                        37.96    2.70      68.15    11.11
Multi-domain proteins (e), 13       24.39    6.034     50       17.80
Membrane & cell surface (f), 3      12.65    7.01      34.33    11.42
Small proteins (g), 17              30.64    6.60      64.51    14.28




Because the dataset is not representative and class information is incomplete, it cannot be said for any particular class that i) prediction accuracy is higher or lower, or ii) the series mostly follow an ARMA process.
Discussion
TS graphs open a new door in the scientific visualization of proteins without 3D structure information: a specific AA can be shown on a line plot, with its value proportional to its frequency of occurrence in the allowed regions of the Ramachandran plot.

The potential value of each AA provides a new candidate feature for feature selection in machine learning techniques.

The order p of an AR model tells how many past values the current value is linearly related to.

Intra-dependency among AAs is shown by TS models, e.g. AR(4), ARMA(1,3).
CONCLUSIONS
A new way of looking at protein structure prediction was found.

The TS technique was applied to predict conformational states from conformational state potentials instead of secondary structure.

With time series, conformational states of AAs are predicted more accurately than the AA residues themselves.

To increase prediction accuracy, multivariate time series may be more useful than univariate time series.

Intra-fluctuations within proteins, due to the arrangement of AAs, can be traced through the stationary and non-stationary groups.
FUTURE WORK
The AR and MA orders of TS models could serve as a source of genetic information (distances) for predicting evolutionary relationships between different proteins.

The TS concept can be used to predict the conformational states of residues missing from PDB data files.

Hierarchical clustering/classification of protein TS gives rise to the new concepts of time-dependent clustering (pseudo-clustering) and pseudo-phylogeny.

Development of synthetic proteins to combat seasonal diseases and to counter chemical warfare attacks.

TS fluctuations for a specific class of proteins can be used as a "pattern" for data analysis and pattern-dependent classification of proteins.
References

Blundell, T.L., Sibanda, B.L., Sternberg, M.J., Thornton, J.M. (1987). Knowledge-based prediction of protein structures and the design of novel molecules. Nature 326(6111): 347-352.

Kolaskar, A.S., Sawant, S.V. (1996). Prediction of conformational states of amino acids using a Ramachandran plot. Int. J. Peptide Protein Res.: 110-116.

Alessandro G., Romualdo B. (2000). Nonlinear methods in the analysis of protein sequences: a case study in rubredoxins. Biophysical Journal: 136-148.
Questions
Thank You !
