Computational methods for
metabolite identification from
tandem mass spectrometry
Dai Hai Nguyen
Kyoto University, Japan
20/07/2018 D. H. Nguyen, Kyoto University 1
Background of metabolite identification
Metabolites
• Intermediate or end products of metabolism
• Small molecules with important functions: energy transport, building blocks of cells, etc.
• Many applications, e.g., drug discovery
• Identifying or profiling them is challenging
Background of metabolite identification
Tandem mass spectrometry (MS/MS)
• Breaks a compound into many fragments
• Each fragment corresponds to a peak in the spectrum
• Peaks interact: certain peaks tend to co-occur
[Figure: peak interaction]
Background of metabolite identification
• Task: given a query spectrum, find similar molecules in a database.
• Approaches:
1) MS library
2) In silico fragmentation
3) Machine learning
I. Mass spectra library
• Compare the query spectrum directly with the spectra in a reference library
• The best-matching candidates are returned
• Drawback: the size of the library is limited
• E.g., Human Metabolome Database ~ 2000 compounds
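The library search above can be sketched as follows. This is a minimal illustration, assuming spectra are lists of (m/z, intensity) pairs, that peaks are binned to unit m/z width, and that cosine similarity is the matching score; the slides do not name a specific score, and the compound names and peak values are made up.

```python
import math

def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (sparse vector as a dict)."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned

def cosine_similarity(a, b):
    """Cosine similarity between two sparse binned spectra."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def search_library(query_peaks, library, top_k=3):
    """Return the top_k (score, name) pairs, best match first."""
    query = bin_spectrum(query_peaks)
    scored = [(cosine_similarity(query, bin_spectrum(peaks)), name)
              for name, peaks in library.items()]
    scored.sort(reverse=True)
    return scored[:top_k]

# Toy library with made-up peak lists; not real reference spectra.
toy_library = {
    "compound_A": [(85.0, 40.0), (163.1, 100.0)],
    "compound_B": [(44.0, 100.0), (90.1, 20.0)],
}
toy_query = [(85.0, 35.0), (163.1, 100.0)]
best_matches = search_library(toy_query, toy_library)
```

A query can only be matched to compounds already in the library, which is the coverage limitation noted on the slide.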
II. In silico fragmentation
• Mitigates the limited coverage of spectral libraries by taking advantage of structural databases.
• Methods can be divided into three groups:
1) rule-based
2) combinatorial
3) machine learning based
II. In silico fragmentation (1)
1) Rule-based fragmentation, e.g., Mass Frontier
• Uses a set of fragmentation rules to predict spectra from compound structures.
• Rules are extracted from the literature.
• Not preferred in practice because:
  • the fragmentation process can change with small changes in molecular structure
  • the number of rules is insufficient to identify fragments with high accuracy
  • peak intensities are ignored
II. In silico fragmentation (2)
2) Combinatorial fragmentation, e.g., FiD
• From the molecular structure, generate a graph of all connected substructures.
• Find the most likely fragmentation tree, i.e., the one that best matches the spectrum.
• Drawbacks:
  • computationally expensive, so applied only to small molecules
  • fragment intensities are ignored
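To see why the combinatorial approach is expensive, the sketch below enumerates every connected substructure of a small molecular graph by brute force. It is only an illustration of the search space, assuming atoms are integer node ids and bonds are given as an adjacency dict; it is not FiD's actual algorithm, which also has to score fragments against the spectrum.

```python
from itertools import combinations

def is_connected(nodes, adjacency):
    """Depth-first search restricted to the chosen atoms."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        atom = stack.pop()
        for neighbor in adjacency[atom]:
            if neighbor in nodes and neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == nodes

def connected_substructures(adjacency):
    """All atom subsets that induce a connected subgraph."""
    atoms = sorted(adjacency)
    found = []
    for size in range(1, len(atoms) + 1):
        for subset in combinations(atoms, size):
            if is_connected(subset, adjacency):
                found.append(subset)
    return found

# Hypothetical 4-atom skeletons: a chain and a ring.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
n_chain = len(connected_substructures(chain))
n_ring = len(connected_substructures(ring))
```

The loop visits all 2^n - 1 non-empty atom subsets, so the cost grows exponentially with molecule size, which matches the drawback listed above.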
II. In silico fragmentation (3)
3) Machine learning based fragmentation
• Uses ML to learn the fragmentation process from data.
• Peak intensities are taken into account and learned.
• Very few methods exist so far.
II. In silico fragmentation (3)
Competitive Fragmentation Modeling (CFM)
• Models fragmentation as a Markov process of state transitions between fragments
• Two components:
1. Transition model
2. Observation model
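The two components can be illustrated with a toy Markov chain. This is a minimal sketch, assuming hand-picked transition probabilities and a Gaussian observation model around each fragment mass; in CFM the transition probabilities are learned from data, and the state names, masses, and `sigma` below are hypothetical.

```python
import math

def propagate(initial, transitions, steps):
    """Transition model: push a distribution over fragments through `steps` transitions."""
    dist = dict(initial)
    for _ in range(steps):
        nxt = {}
        for state, prob in dist.items():
            for target, q in transitions[state].items():
                nxt[target] = nxt.get(target, 0.0) + prob * q
        dist = nxt
    return dist

def peak_likelihood(observed_mz, fragment_masses, dist, sigma=0.01):
    """Observation model: Gaussian mixture centered on the fragment masses."""
    total = 0.0
    for fragment, prob in dist.items():
        diff = observed_mz - fragment_masses[fragment]
        density = math.exp(-0.5 * (diff / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        total += prob * density
    return total

# Hypothetical precursor M and two fragments; each row of probabilities sums to 1.
transitions = {
    "M": {"M": 0.5, "F1": 0.3, "F2": 0.2},
    "F1": {"F1": 0.6, "F2": 0.4},
    "F2": {"F2": 1.0},
}
masses = {"M": 180.06, "F1": 162.05, "F2": 84.02}
final_dist = propagate({"M": 1.0}, transitions, steps=2)
```

After two transitions the probability mass has shifted toward the smaller fragments, and `peak_likelihood` scores an observed m/z value under that distribution.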
III. Machine learning approach
a) Supervised ML for substructure prediction
b) Unsupervised ML for substructure annotation
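III. Machine learning approach
Supervised ML for substructure prediction
Step 1: Fingerprint prediction (predict a binary vector of substructure indicators from the spectrum)
Step 2: Candidate retrieval (rank database molecules by how well their fingerprints match the prediction)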
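Step 2 can be sketched as a similarity ranking. The example assumes fingerprints are equal-length 0/1 lists and uses the Tanimoto coefficient as the similarity; the slides do not fix a particular score, and the candidate names and bit vectors are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two binary fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0

def rank_candidates(predicted, candidates):
    """Sort candidate names by similarity to the predicted fingerprint, best first."""
    return sorted(candidates,
                  key=lambda name: tanimoto(predicted, candidates[name]),
                  reverse=True)

# Fingerprint predicted from a query spectrum (Step 1) and a toy structure database.
predicted_fp = [1, 0, 1, 1, 0]
database = {
    "candidate_A": [1, 0, 1, 1, 0],
    "candidate_B": [1, 1, 0, 1, 0],
    "candidate_C": [0, 1, 0, 0, 1],
}
ranking = rank_candidates(predicted_fp, database)
```

Because candidates come from a structure database rather than a spectral library, this step can return molecules that have no reference spectrum.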
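Machine learning approach
Supervised ML for substructure prediction
FingerID (Bioinformatics, 2012)
Kernel method
• Define a probability product kernel (PPK) for spectra. Each spectrum is represented as a mixture of per-peak densities:
$$p_X = \frac{1}{n_X} \sum_{k=1}^{n_X} p_{X(k)}, \qquad p_Y = \frac{1}{n_Y} \sum_{k=1}^{n_Y} p_{Y(k)}$$
$$K(X, Y) = \frac{1}{n_X n_Y} \sum_{i,j} p_{X(i)} \, p_{Y(j)}$$
  Here $n_X$ and $n_Y$ are the numbers of peaks in spectra X and Y, and the product of two peak densities stands for its integral over the peak domain.
• Then, use an SVM for classification.
• Drawbacks:
  • Peak interactions are ignored.
  • Limited accuracy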
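The kernel above has a closed form when every peak density is a Gaussian. The sketch below is a simplification that assumes one-dimensional Gaussians over m/z only, with a shared width `sigma` (a hypothetical parameter); intensities are left out to keep the example short. It uses the identity that the integral of the product of two Gaussians with means m1, m2 and variance sigma^2 is a Gaussian in (m1 - m2) with variance 2 sigma^2.

```python
import math

def peak_kernel(m1, m2, sigma):
    """Integral of the product of N(m1, sigma^2) and N(m2, sigma^2)."""
    variance = 2.0 * sigma ** 2
    return math.exp(-((m1 - m2) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def ppk(spectrum_x, spectrum_y, sigma=0.01):
    """Probability product kernel: average peak kernel over all peak pairs."""
    total = sum(peak_kernel(a, b, sigma) for a in spectrum_x for b in spectrum_y)
    return total / (len(spectrum_x) * len(spectrum_y))

# Toy spectra given as lists of m/z values (made-up numbers).
spec_x = [85.03, 163.06]
spec_y = [85.03, 120.00]
k_xx = ppk(spec_x, spec_x)
k_xy = ppk(spec_x, spec_y)
```

Each peak pair contributes independently to the sum, which is why this kernel cannot capture peak interactions.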
Machine learning approach
Supervised ML for substructure prediction
CSI:FingerID (Bioinformatics, 2014)
• Improved version of FingerID
• Spectra are compared with the PPK
• Additional kernels are defined on fragmentation trees and combined with the PPK via multiple kernel learning (MKL)
• An SVM is then used for classification
Machine learning approach
Supervised ML for substructure prediction
CSI:FingerID (Bioinformatics, 2014)
Fragmentation trees
• Model the fragmentation of a molecule in MS/MS
• Nodes ~ peaks ~ molecular formulas of fragments
• Edges ~ losses ~ uncharged fragments that are not detected
• Trees can be predicted from spectra and provide structural information about them
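A fragmentation tree can be held in a very small data structure. The sketch below assumes nodes are molecular formula strings and derives each edge label (the loss) as the formula difference between parent and child; the formulas and tree shape are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import re
from collections import Counter

def parse_formula(formula):
    """Turn a formula such as 'C6H12O6' into element counts."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def neutral_loss(parent, child):
    """Elements lost on the edge parent -> child."""
    diff = parse_formula(parent)
    diff.subtract(parse_formula(child))
    if any(count < 0 for count in diff.values()):
        raise ValueError("child is not a sub-formula of parent")
    return {element: count for element, count in diff.items() if count > 0}

# Hypothetical tree: each key is a node (fragment formula), values are its children.
tree = {
    "C6H12O6": ["C6H10O5"],
    "C6H10O5": ["C5H10O4"],
    "C5H10O4": [],
}
edge_losses = {
    (parent, child): neutral_loss(parent, child)
    for parent, children in tree.items()
    for child in children
}
```

Kernels on such trees can then compare, for instance, which losses or node formulas two trees share.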
Machine learning approach
Supervised ML for substructure prediction
CSI:FingerID (Bioinformatics, 2014)
Pros & Cons
• Pro: improved accuracy thanks to the additional structural information provided by trees
• Con: computationally expensive, because trees must be computed from spectra
• Con: lack of interpretability
Machine learning approach
Supervised ML for substructure prediction
SIMPLE (Bioinformatics, 2018)
• Idea: introduce an interaction term into the model (two-way interaction model)
• Prediction model:
$$f(x) = b + w^T x + x^T W x, \qquad y(x) = \operatorname{sgn}(f(x))$$
  The term $w^T x$ covers individual peaks and $x^T W x$ covers peak interactions.
• Objective function:
$$\min_{b, w, W} \sum_{i=1}^{n} [1 - y_i f(x_i)]_+ + \alpha \|w\|_1 + \beta \|W\|_*$$
  The first term is the hinge loss, $\|w\|_1$ promotes sparsity, and $\|W\|_*$ promotes a low-rank W.
• Convexity guarantees that a globally optimal solution can be found.
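The prediction model and objective can be evaluated directly for fixed parameters. This is a sketch of the formulas only, not of the optimizer used in the paper. It assumes W is symmetric positive semidefinite, in which case the trace norm equals the ordinary trace; the toy values of `b`, `w`, `W`, `alpha`, and `beta` are hypothetical.

```python
def predict_score(x, b, w, W):
    """f(x) = b + w^T x + x^T W x."""
    n = len(x)
    linear = sum(wi * xi for wi, xi in zip(w, x))
    interaction = sum(W[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    return b + linear + interaction

def predict_label(x, b, w, W):
    """y(x) = sgn(f(x)), mapping a zero score to +1."""
    return 1 if predict_score(x, b, w, W) >= 0 else -1

def objective(X, y, b, w, W, alpha, beta):
    """Hinge loss + alpha * ||w||_1 + beta * ||W||_* (trace used, valid for PSD W only)."""
    hinge = sum(max(0.0, 1.0 - yi * predict_score(xi, b, w, W)) for xi, yi in zip(X, y))
    l1_norm = sum(abs(wi) for wi in w)
    trace_norm = sum(W[i][i] for i in range(len(W)))
    return hinge + alpha * l1_norm + beta * trace_norm

# Toy data: three peak bins, two spectra; W = v v^T with v = [1, 0, 1] (rank one, PSD).
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
y = [1, -1]
b, w = -1.0, [0.5, 0.0, 0.0]
W = [[1.0, 0.0, 1.0], [0.0, 0.0, 0.0], [1.0, 0.0, 1.0]]
value = objective(X, y, b, w, W, alpha=0.1, beta=0.01)
```

In the toy W, the off-diagonal entry linking bins 0 and 2 rewards spectra in which both peaks are present, which is the interaction effect the linear term alone cannot express.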
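Machine learning approach
Supervised ML for substructure prediction
SIMPLE (Bioinformatics, 2018)
• Idea: use background knowledge (interactions from fragmentation trees) to regularize W.
• Laplacian regularization:
$$x^T W x = \sum_{i,j} w_{ij} x_i x_j = \sum_{i,j} (v_i^T v_j) \, x_i x_j$$
  W can be decomposed as $V^T V$ (low-rank decomposition), where $v_i$ is the i-th column of V.
$$R(V) = \sum_{i,j} A_{ij} \|v_i - v_j\|^2 = \operatorname{trace}(WL),$$
  where L is the Laplacian matrix of the interaction graph with adjacency matrix A. The equality is exact when each pair (i, j) is counted once; counting ordered pairs adds a constant factor of 2.
• New objective function:
$$\min_{b, w, W} \sum_{i=1}^{n} [1 - y_i f(x_i)]_+ + \alpha \|w\|_1 + \beta \|W\|_* + \gamma \operatorname{trace}(WL)$$
• Still convex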
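The identity between the pairwise penalty and trace(WL) can be checked numerically. The sketch below assumes L = D - A with D the diagonal degree matrix, a symmetric A with zero diagonal, and a sum that counts each unordered pair once; the latent vectors and edge weights are made up.

```python
def gram_matrix(vectors):
    """W = V^T V, where vectors[i] is the latent vector v_i of peak i."""
    return [[sum(a * b for a, b in zip(vi, vj)) for vj in vectors] for vi in vectors]

def laplacian(adjacency):
    """L = D - A for a symmetric adjacency matrix with zero diagonal."""
    n = len(adjacency)
    return [[(sum(adjacency[i]) if i == j else 0.0) - adjacency[i][j]
             for j in range(n)] for i in range(n)]

def pairwise_penalty(vectors, adjacency):
    """Sum over unordered pairs i < j of A_ij * ||v_i - v_j||^2."""
    total = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            squared = sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j]))
            total += adjacency[i][j] * squared
    return total

def trace_product(W, L):
    """trace(W L) without forming the product matrix."""
    n = len(W)
    return sum(W[i][j] * L[j][i] for i in range(n) for j in range(n))

# Three peaks with 2-dimensional latent vectors; edges 0-1 (weight 1) and 1-2 (weight 2).
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adjacency = [[0.0, 1.0, 0.0], [1.0, 0.0, 2.0], [0.0, 2.0, 0.0]]
penalty = pairwise_penalty(vectors, adjacency)
trace_value = trace_product(gram_matrix(vectors), laplacian(adjacency))
```

The penalty is small when peaks linked in the interaction graph have similar latent vectors, which is how the prior knowledge shapes W.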
Machine learning approach
Supervised ML for substructure prediction
Input Output Kernel Regression (IOKR) (Bioinformatics, 2017)
• Idea: use IOKR to learn the mapping between spectra and molecular structures.
• Two steps:
1. Estimation of the output feature map by solving a regression problem [equation missing in the source]
2. Solution of the pre-image problem: find the candidate molecule that best matches the predicted output features
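The two steps can be sketched with precomputed kernel values. This is an illustration under stated assumptions rather than the published implementation: step 1 is taken to be kernel ridge regression with regularization `lam`, the output kernel is assumed normalized so that the pre-image step reduces to maximizing a weighted sum of output-kernel values, and all kernel numbers are made up.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def iokr_scores(K_train, k_query, K_out_candidates, lam):
    """Step 1: alpha = (K_train + lam I)^-1 k_query. Step 2: score each candidate."""
    n = len(K_train)
    A = [[K_train[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    alpha = solve_linear(A, k_query)
    return [sum(a * k for a, k in zip(alpha, row)) for row in K_out_candidates]

# Two training pairs (spectrum_i, molecule_i); the query spectrum resembles spectrum 0.
K_train = [[1.0, 0.1], [0.1, 1.0]]        # input kernel between training spectra
k_query = [0.9, 0.2]                      # input kernel: query vs. training spectra
K_out_candidates = [[1.0, 0.1],           # output kernel: candidate 0 vs. training molecules
                    [0.1, 1.0]]           # output kernel: candidate 1 vs. training molecules
scores = iokr_scores(K_train, k_query, K_out_candidates, lam=0.1)
best_candidate = max(range(len(scores)), key=lambda c: scores[c])
```

The candidate whose output-kernel profile looks like training molecule 0 wins, because the query spectrum is closest to training spectrum 0.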
Machine learning approach
Unsupervised ML for substructure annotation
• Metabolites/molecules may share common substructures, which yield similar fragments/peaks in their spectra.
• Such substructures are related to biochemical processes.
• Shared substructures allow metabolites to be grouped.
• This can improve the accuracy of metabolite identification.
III. Machine learning approach
Unsupervised ML for substructure annotation
MS2LDA (Bioinformatics 2017)
• Automatically extracts relevant substructures of metabolites based on the co-occurrence of fragments and losses.
• Motivated by topic modeling for text, e.g., Latent Dirichlet Allocation (LDA)
• LDA for MS data (MS2LDA):
  • Peaks ~ words
  • Sets of peaks (substructures) ~ topics
• LDA decomposes a text into topics, while MS2LDA decomposes a molecule into substructures.
• Drawback: extracted substructures need to be annotated using expert knowledge (a complex and time-consuming process)
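The text analogy starts with turning each spectrum into a bag of words. The sketch below shows only that preprocessing step, assuming word names of the form `fragment_<m/z>` and `loss_<mass>`, two-decimal rounding, and rounded intensities as word counts; these conventions and the peak values are hypothetical. Fitting the topic model itself is left to any LDA implementation.

```python
from collections import Counter

def spectrum_to_document(precursor_mz, peaks, decimals=2):
    """Convert one MS/MS spectrum into word counts over fragment and loss features."""
    document = Counter()
    for mz, intensity in peaks:
        count = int(round(intensity))
        document[f"fragment_{mz:.{decimals}f}"] += count
        loss = precursor_mz - mz
        if loss > 0:
            # A loss is the mass difference between the precursor and the fragment.
            document[f"loss_{loss:.{decimals}f}"] += count
    return document

# Made-up spectrum: precursor m/z and (m/z, intensity) peaks.
doc = spectrum_to_document(181.07, [(163.06, 100.0), (85.03, 40.0)])
vocabulary = sorted(doc)
```

A collection of such documents forms the corpus, and each learned topic is then a weighted set of fragment and loss words that tend to co-occur.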
Machine learning approach
Unsupervised ML for substructure annotation
Automated recommendation of substructures from MS/MS (Aida Mrzic et al., bioRxiv)
• Automatically extracts relevant substructures of molecules based on the co-occurrence of fragments and losses
• Applies frequent itemset mining to extract association rules.
• Given a query spectrum, recommends the substructures likely to be present in it by applying the extracted rules.
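The rule-mining idea can be sketched with a brute-force search over small itemsets. This is a minimal illustration, assuming each training record pairs a set of spectral features with a set of known substructures, and that rules are kept when they pass support and confidence thresholds; the feature names, labels, and thresholds are hypothetical, and the paper's actual mining algorithm may differ.

```python
from itertools import combinations

def mine_rules(records, min_support=0.4, min_confidence=0.8, max_size=2):
    """Find rules {features} -> substructure that are frequent and reliable."""
    n = len(records)
    features = sorted(set().union(*(feats for feats, _ in records)))
    labels = sorted(set().union(*(subs for _, subs in records)))
    rules = []
    for size in range(1, max_size + 1):
        for itemset in combinations(features, size):
            items = frozenset(itemset)
            covered = [subs for feats, subs in records if items <= feats]
            if not covered or len(covered) / n < min_support:
                continue
            for label in labels:
                hits = sum(1 for subs in covered if label in subs)
                confidence = hits / len(covered)
                if hits / n >= min_support and confidence >= min_confidence:
                    rules.append((items, label, confidence))
    return rules

def recommend(query_features, rules):
    """Substructures whose rule antecedent is fully contained in the query."""
    return sorted({label for items, label, _ in rules if items <= query_features})

# Toy training records: (spectral features, annotated substructures).
records = [
    ({"frag_91.05", "loss_18.01"}, {"benzyl", "hydroxyl"}),
    ({"frag_91.05", "frag_65.04"}, {"benzyl"}),
    ({"loss_18.01", "frag_57.07"}, {"hydroxyl"}),
    ({"frag_91.05", "loss_18.01", "frag_65.04"}, {"benzyl", "hydroxyl"}),
    ({"frag_57.07"}, set()),
]
rules = mine_rules(records)
suggested = recommend({"frag_91.05", "frag_40.00"}, rules)
```

Unlike topic models, each recommendation traces back to an explicit rule, so the reason for suggesting a substructure can be inspected.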
Conclusion
• Metabolite identification is an essential part of metabolomics and helps broaden our knowledge of biological systems.
• Many techniques and software tools with different approaches have been proposed for this task; they can be categorized into three groups: MS library search, in silico fragmentation, and machine learning.
• ML methods are key to recent progress in metabolite identification.


IBSB tutorial
