ABSTRACT
To improve crop yield, agriculture companies have successfully developed insect-resistant
crops by expressing insecticidal (insect-killing) proteins in plants. As a leader in the
agricultural biotechnology industry, Bayer tests hundreds of genes every year for insecticidal
activity in its proprietary pipeline to develop the next generation of insect control
solutions. Identifying and nominating insecticidal proteins with traditional methods such as
BLAST and structure similarity has drawbacks: more than 90% of the nominated proteins end up
displaying little or no activity against insects. Testing these proteins consumes an enormous
amount of time and resources, so we adopted a machine learning (ML) approach to identify
them. We generated numerous features for more than 5000 amino acid sequences using iFeature,
a Python toolkit developed by Chen et al. (2018), and built ML models to identify proteins
with insecticidal activity. Proteins identified using this method are tested in the pipeline
to check their efficacy against insect pests. This presentation discusses the challenges
faced while building the model and the methods used to overcome them.
HOW WE BUILT AN ML MODEL
TO PREDICT PROTEINS WITH
INSECTICIDAL ACTIVITY
Karnam Vasudeva Rao,
Senior Scientist, Data Science,
Monsanto (A Subsidiary of Bayer)
CONTENTS
▰ What are insecticidal proteins?
▰ Why machine learning for protein activity identification?
▰ Different approaches used by researchers
▰ Why not general methods?
▰ iFeature Python toolkit
▰ Why did we choose iFeature?
▰ What features does iFeature have?
▰ How did we adapt it to our needs?
▰ What were the challenges?
▰ How did we overcome those?
▰ Key learnings
IMPROVING CROP YIELD BY DEVELOPING PEST-RESISTANT
CROPS THAT EXPRESS INSECTICIDAL PROTEINS
WHY DO WE NEED ML FOR GENE NOMINATIONS?
What?
Predict protein activity against insect pests from amino acid sequence features, to enable
higher-quality nominations to the insect control pipeline at Bayer.
Why?
Hundreds of proteins are nominated and analyzed each year. Many nominations have turned out
to be inactive proteins / toxins. The goal is to develop a model that predicts the propensity
for toxicity.
How?
Extract features from >5000 protein (amino acid) sequences and develop a predictive model
using historical data to predict inactive toxins.
[Diagram labels: Current state, Future state, Pipeline]
THREE MAJOR APPROACHES ARE USED BY
RESEARCHERS TO PREDICT PROTEIN FUNCTIONS
1. Sequence similarity between AA sequences
2. Protein structure comparison
3. Sequence and structure derived features
Disadvantages with traditional methods:
▰ A high-similarity BLAST hit does not always imply homology.
▰ Proteins with the same function can have different structures.
▰ Proteins that have diverged from a common ancestral gene may have the same function but
different sequences.
▰ Sequence similarity-based approaches are often inadequate in the absence of similar
sequences or when the sequence similarity among known protein sequences is statistically
weak (called the "twilight zone" or "midnight zone") (reference: Proteome Science 2009,
7:27).
▰ Biological experiments for protein identification are time consuming and resource
intensive.
iFeature - AN OPEN-SOURCE PYTHON TOOLKIT FOR
PREDICTION OF PROTEIN ACTIVITY
iFeature
▰ http://iFeature.erc.monash.edu/
▰ https://github.com/Superzchen/iFeature/
▰ Features:
▰ Protein length, molecular weight, number of atoms,
grand average of hydropathicity (GRAVY), amino
acid composition, periodicity, physicochemical
properties, predicted secondary structures,
subcellular location, sequence motifs or highly
conserved regions, classification of protein function,
hydrophobicity, solvent accessibility, secondary
structure, surface tension, charge, polarisability,
polarity, and normalized van der Waals volume and
annotations in protein databases.
DPPI
• Predicting protein–protein interactions through sequence-based deep learning.
• Bioinformatics, 34, 2018, i802–i810
DeepGO
• Predicting protein functions from sequence and interactions using a deep ontology-aware
classifier.
• Bioinformatics, 34(4), 2018, 660–668
Classifiers
• Predicting protein function by machine learning on amino acid sequences – a critical
evaluation
• BMC Genomics 2007, 8:78
iFeature - AN OPEN-SOURCE PYTHON TOOLKIT
GitHub repository with code, usage instructions and examples.
SAMPLE DATA
toxin          sequence                                                                                       score
protein3345    MNSYQNQYEILESSSNNTNMPNRYPFANDPNIFPINLDACQGRPWQDTWKSVSDIVTIGTYLIQFLREPGIGGIPVILSIINKLIPSSG 0
protein10357   MSDLEVKIGVNPADVRYTANFKVAPNDGYVMYEKNTPIIPEIGVNITVINTGREEMEVHYEWAPPFGGWQCASTTIIPPDGKPVYIA 0
protein7062    MSINIDPSKEFVKVSNFAGYEIATSQDSEEEGANLIIYYTADPYLLFYLDEERNNGILVSRRTGFVIGVKSGSNKDGELIIQCEWDGEPYS 0
protein000023  MKICVVNILLGLLMIVGESAANIGYADLTTNVYFVATIKSSTCQMSLEGGTAGGGDSYTIPVGSNGKVGAIDIINGTENAMANFSLDI 0
protein3518    MKSISKKVMAGLLVGATSLSIWAPISEAAAPENNRYYNIALKSNTKKVWNVSQASNDNDRAIVLWQGGSADHERFAFFQLDGGA 0
protein10355   MGIKKTIKFILCLSISLCILNYPSISFAETLDTNSSSVKSKSDIDTGIANLNYNNREVLAVNGDRVDSFVPKEGLNSNDKFIVVERNKKSL 0
protein5481    MENSNYFEKNNFSQEDSALDSLLNTFLVIQNKKTNQVIGRPEHYIQKGIITYYFINLENEADIPEQQLILYKLDNKSYYIVSRNKSAYYSF 0
protein000025  MKRIFFFIPLILGLVACADDDSFSTSTGLRLDFPSDTIKLDTVFSRTASSTYTFWVNNRNDNGVKLQSVRLKRGNQTGFRVNVDGMY 0
protein3918    MNGGKNMNQNNQNEMQIIDSSSNDFSQSNRYPRYPLAKESNYKDWLASCDESNVDTLSTTSDVKGSVSRVLGIVNQILGFLGLGF 0
protein000021  MSNDIYGSSTELIANSIYETDYHVLLGIRNSNILFMTPHGGGVETGATELSIASGGTDHNYYCFEGWRTSNNGDMHVTSANFNEPVC 0
protein9439    MKKKVSMMLTCVLLAPLFLNGNAPVAHAGDPFLITSIDEPTIDREGLIGYYYREDQFKNLQLFTPTRNHTLVYDQGTARDLLADSQQQ 1
protein8184    MNQKKYIFMKPISILSIVCFCVSITPTSSLADMYRSRGNFTSKNENTKHTNEYYPRAIFNPYIEPAPEIITETRFASIKSTDTIAITTKNHPK 0
protein2126    MTKNHKKILSMTLVTSMLAGTYIPTAYTAFAETEQKEGSQENQTGLINKGSLPLDSYGLFENPYKGVTFDQFMNAFNNNTWNPLLV 0
protein9438    MKKKITKTLLCATMGISILTPLAVSAKTEDNNEQQLITQINQRENSFPNVGLGTQWLFQYYDKYLRANGLLRVAPVVTVEDLEVKNSY 0
• 5000+ amino acid sequences with activity scores 0-5.
• Scores range from 0 (inactive) to 5 (highly active).
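iFeature reads protein sequences in FASTA format, so a table like the sample data has to be converted before feature extraction. The following is a minimal sketch of that step, assuming each row holds `toxin sequence score` separated by whitespace. The function name `table_to_fasta`, the row layout, and the shortened example rows are hypothetical and only illustrate the idea; they are not taken from the original workflow.

```python
def table_to_fasta(rows):
    """Convert 'toxin sequence score' rows into FASTA text plus a score lookup."""
    fasta_lines = []
    scores = {}
    for row in rows:
        fields = row.split()
        # Skip the header row and anything that does not have exactly three columns.
        if len(fields) != 3 or fields[0] == "toxin":
            continue
        name, sequence, score = fields
        fasta_lines.append(f">{name}")
        fasta_lines.append(sequence.upper())
        scores[name] = int(score)
    return "\n".join(fasta_lines) + "\n", scores


# Hypothetical rows with shortened sequences, for illustration only.
rows = [
    "toxin sequence score",
    "proteinA MNSYQNQYEILESS 0",
    "proteinB MKKKVSMMLTCVLL 1",
]
fasta_text, scores = table_to_fasta(rows)
print(fasta_text)
print(scores)
```

Keeping the scores in a separate lookup keyed by protein ID means the FASTA headers stay simple, and the labels can be joined back to the feature table by ID after extraction.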
iFeature POSSESSES 37 FEATURE DESCRIPTORS
• pcaAnalysis.py: dimensionality reduction (three algorithms are available: PCA, LDA and
t-SNE)
• feaSelector.py: program used to implement the feature selection algorithms
• cluster.py: program used for running the feature or sample clustering algorithms
• iFeaturePseKRAAC.py: program used to extract the 16 types of pseudo K-tuple reduced amino
acid composition (PseKRAAC) feature descriptors
• iFeature.py: descriptor extraction, for example k-spaced amino acid pairs (CKSAAP):
python iFeature.py --file examples/test-protein.txt --type CKSAAP
python iFeature.py --file examples/test-protein.txt --type DDE
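To make the k-spaced amino acid pair (CKSAAP) descriptor concrete, here is a small from-scratch sketch of the idea: for each gap k, count every ordered pair of residues separated by k positions and normalize by the number of pairs counted. It is not iFeature's own code; the function name `cksaap`, the `max_gap=5` default, and the `AA.gap0` style feature names are assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def cksaap(sequence, max_gap=5):
    """Return normalized k-spaced amino acid pair frequencies for k = 0..max_gap."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = {}
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(pairs, 0)
        total = 0
        # Pair residue i with the residue gap + 1 positions further along.
        for i in range(len(sequence) - gap - 1):
            pair = sequence[i] + sequence[i + gap + 1]
            if pair in counts:  # ignore non-standard residues
                counts[pair] += 1
                total += 1
        for pair in pairs:
            features[f"{pair}.gap{gap}"] = counts[pair] / total if total else 0.0
    return features


features = cksaap("MKKKVSMMLTCVLLAPLFLNGNAPVAHAGD")
print(len(features))  # 400 pairs x 6 gap values
```

With gaps 0 to 5 this yields 400 x 6 = 2400 values per sequence, which is where the large feature counts discussed later come from.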
LIST OF VARIOUS DESCRIPTORS
CALCULATED BY iFeature
Descriptor group         Descriptor                                          Dimension
AA composition           Amino acid composition (AAC)                        20
                         Enhanced amino acid composition (EAAC)              -
                         Composition of k-spaced AA pairs (CKSAAP)           2400
                         Dipeptide composition (DPC)                         400
                         Dipeptide deviation from expected mean (DDE)        400
                         Tripeptide composition (TPC)                        8000
Grouped AA composition   Grouped amino acid composition (GAAC)               5
                         Enhanced grouped AA composition (EGAAC)             -
                         Composition of k-spaced AA group pairs (CKSAAGP)    150
                         Grouped dipeptide composition (GDPC)                25
                         Grouped tripeptide composition (GTPC)               125
Binary                   Binary (BINARY)                                     -
Autocorrelation          Moran (Moran)                                       240
                         Geary (Geary)                                       240
                         Normalized Moreau-Broto (NMBroto)                   240
C/T/D                    Composition (CTDC)                                  39
                         Transition (CTDT)                                   39
                         Distribution (CTDD)                                 195
Conjoint triad           Conjoint triad (CTriad)                             343
                         Conjoint k-spaced triad (KSCTriad)                  343x(k+1)
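Most of the dimensions listed for the descriptors follow from simple counting over 20 amino acids, 5 residue groups, or 7 conjoint-triad classes. The sketch below reproduces that arithmetic. The gap range (k = 0..5), the 8 physicochemical properties with a maximum lag of 30 for the autocorrelation descriptors, and the 13 property groupings for C/T/D are assumptions about the toolkit's default settings, chosen because they match the listed dimensions.

```python
N_AA = 20         # standard amino acids
N_GROUPS = 5      # residue groups used by the grouped descriptors
N_CLASSES = 7     # residue classes used by the conjoint triad
MAX_GAP = 5       # assumed default: gaps k = 0..5
N_PROPS = 8       # assumed number of physicochemical properties (autocorrelation)
N_LAG = 30        # assumed maximum lag (autocorrelation)
N_CTD_PROPS = 13  # assumed number of property groupings (C/T/D)

dimensions = {
    "AAC": N_AA,
    "DPC": N_AA ** 2,
    "DDE": N_AA ** 2,
    "TPC": N_AA ** 3,
    "CKSAAP": N_AA ** 2 * (MAX_GAP + 1),
    "GAAC": N_GROUPS,
    "GDPC": N_GROUPS ** 2,
    "GTPC": N_GROUPS ** 3,
    "CKSAAGP": N_GROUPS ** 2 * (MAX_GAP + 1),
    "Moran": N_PROPS * N_LAG,
    "CTDC": N_CTD_PROPS * 3,      # 3 classes per property
    "CTDT": N_CTD_PROPS * 3,      # 3 class-to-class transitions per property
    "CTDD": N_CTD_PROPS * 3 * 5,  # 5 positions per class per property
    "CTriad": N_CLASSES ** 3,
}

for name, dim in dimensions.items():
    print(f"{name:8s} {dim}")
```

Summing only a few of these groups already gives well over ten thousand columns per protein, which is far more than the roughly 5000 labeled sequences available.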
CHALLENGES IN USING SEQUENCE BASED
ML APPROACHES
1. Data preparation
What to explore in the data? Only 2 variables:
• Sequences
• Assay values
2. Feature extraction (iFeature)
No independent variables! Need to generate features from the sequences.
3. Feature selection
1000s of features; which ones to select? What do these features explain?
4. Model building
Which model to choose?
5. Performance of the models
Confusion matrix. Does it make sense biologically?
Meaningful features for protein function prediction
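One simple first pass at the "1000s of features" question in the feature selection step is to discard near-constant columns before ranking the remainder by importance or domain relevance. The sketch below is an illustrative assumption, not the selection method used in this work. The function name `drop_low_variance`, the threshold, and the toy feature values are hypothetical.

```python
from statistics import pvariance


def drop_low_variance(feature_table, threshold=1e-4):
    """Keep only the feature columns whose population variance exceeds threshold."""
    kept = {}
    for name, values in feature_table.items():
        if pvariance(values) > threshold:
            kept[name] = values
    return kept


# Hypothetical feature columns (one value per protein), for illustration only.
feature_table = {
    "AA.gap0": [0.00, 0.00, 0.00, 0.00],  # constant, carries no information
    "AK.gap0": [0.02, 0.00, 0.05, 0.01],
    "KK.gap1": [0.10, 0.00, 0.00, 0.08],
}
selected = drop_low_variance(feature_table)
print(sorted(selected))
```

A filter like this only removes uninformative columns. Deciding which of the surviving features are biologically meaningful still needs the domain knowledge the slides call for.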
NUMEROUS SEQUENCE FEATURES
WERE GENERATED USING iFeature
MODEL EVALUATION
RANDOM FOREST WAS THE FAVORITE
KEY LEARNINGS
FEATURES
▰ iFeature is an ‘all in one’ package.
▰ Very few independent variables before using iFeature and too many after using it.
▰ Use not only feature importance but also domain knowledge to choose input variables
(e.g. k-spaced pairs, conjoint triad).
DATA
▰ Data bias can be overcome using domain knowledge: 0 = inactive; 1-5 = active
(multinomial to binomial).
MODEL BUILDING
▰ Build multiple models instead of one or two and choose the best based on business needs
and parameters.
▰ Where multiple models perform equally well, select the model based on business needs /
domain knowledge (false positives vs. false negatives), i.e. sensitivity and specificity.
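The two data and model-selection points above can be sketched in a few lines: collapse the 0-5 activity score into a binary label, then compare models by sensitivity and specificity computed from the confusion matrix. This is a minimal illustration under stated assumptions. The function names, the assay scores, and the predictions are hypothetical, and a score of 0 is treated as inactive with 1-5 treated as active.

```python
def binarize(score):
    """Collapse an activity score (0-5) into 0 = inactive, 1 = active."""
    return 0 if score == 0 else 1


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Hypothetical assay scores and model predictions, for illustration only.
assay_scores = [0, 0, 3, 0, 5, 1, 0, 0]
y_true = [binarize(s) for s in assay_scores]
y_pred = [0, 1, 1, 0, 1, 0, 0, 0]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(confusion_counts(y_true, y_pred), round(sens, 3), round(spec, 3))
```

Which of the two metrics to favor is the business decision the slide refers to: a missed active protein (false negative) and a wasted assay on an inactive one (false positive) carry different costs.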
OTHER APPLICATIONS
▰ iFeature and the above approach can also be applied to identify disease-related proteins
and to protein-protein interaction studies.
THANKS!
https://www.linkedin.com/in/karnam-vasudeva-rao-phd-9032759/
vasukarnam@gmail.com
vkarnam@monsanto.com; vasudevarao.karnam@bayer.com
Senior Scientist - Data Science,
Monsanto (Subsidiary of
Bayer), Bengaluru, India.

Prediction of proteins for insecticidal activity using python toolkit iFeature
