Introduction to Bioinformatics:
      Mining Your Data

          Gerry Lushington
         Lushington in Silico

  modeling / informatics consultant
What is Data Mining?
 Use of computational methods to perceive trends in data that
 can be used to explain or predict important outcomes or
 properties

Applicable across many disciplines:
Molecular bioinformatics

Medical Informatics

Health Informatics

Biodiversity informatics
Example Applications:
     Find relationships between:

Convenient Observables                vs.    Important Outcomes
a)    Relative gene expression data          1.   Disease susceptibility
b)    Relative protein abundance data        2.   Drug efficacy
c)    Relative lipid & metabolite profiles   3.   Toxin susceptibility
d)    Glycosylation variants                 4.   Immunity
e)    SNPs, alleles                          5.   Genetic disorders
f)    Cellular traits                        6.   Microbial virulence
g)    Organism traits                        7.   Species adaptive success
h)    Behavioral traits                      8.   Species complementarity
i)    Case history
Goals for this lecture:
Focus on Data Mining: how to approach your data and use it to
understand biology

Overview of available techniques

Understanding model validation

Try to think about data you’ve seen: what techniques might be
useful?




        Don’t worry about grasping everything:
     K-INBRE Bioinformatics Core is here to help!!
Basic Data Mining:
Find relationships between:
a) Easy to measure properties   vs.
b) Important (but harder to measure) outcomes or attributes

Use relationships to understand the conceptual basis for
outcomes in b)

Use relationships to predict outcomes in new cases where
outcome has not yet been measured
Basic Data Mining: simple measurables
Basic Data Mining: general observation

[Figure: samples sorted into Unhappy and Happy groups]
Basic Data Mining: relationship (#1)

[Figure: Unhappy vs. Happy samples, colored blue or red]

Blue = happy; Red = unhappy (accuracy = 12/20 = 60%)
Basic Data Mining: relationship (#2)

[Figure: Unhappy vs. Happy samples, colored and sized]

Blue + BIG Red = happy; little red = unhappy (accuracy = 17/20 = 85%)
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: an instrument trace; annotations ask "Peak heights?" and "Peak positions?"]

Key issues include:
a) format conversion from instrument
b) any necessary mathematical manipulations (e.g., Density = M/V)
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration


Key issues include:
a) Normalization to account for experimental bias
b) Statistical detection of flagrant outliers
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

Key issues include:
a) Normalization to account for experimental bias (use controls to scale data)
b) Statistical detection of flagrant outliers

[Figure: bar charts of a control well (C) and samples 1-3 across plates, before and after scaling to the control]
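The control-based scaling idea can be sketched in a few lines of Python. The plate layout, well names, and values below are invented for illustration, and the outlier cut is a crude z-score rule:

```python
from statistics import mean, stdev

# Hypothetical plate data: each plate has a control well "C" and three
# sample wells; plate2 reads ~2x brighter overall (experimental bias).
plates = {
    "plate1": {"C": 2.0, "s1": 1.0, "s2": 3.0, "s3": 2.5},
    "plate2": {"C": 4.0, "s1": 2.0, "s2": 6.0, "s3": 5.0},
}

def normalize_to_control(plate):
    """Scale every sample well by the plate's control, removing plate bias."""
    control = plate["C"]
    return {well: v / control for well, v in plate.items() if well != "C"}

normalized = {name: normalize_to_control(p) for name, p in plates.items()}

def flag_outliers(values, z_cut=2.5):
    """Flag values more than z_cut standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if s > 0 and abs(v - m) / s > z_cut]
```

After scaling, the two plates agree well by well; more robust outlier detection would use the median and MAD rather than mean and standard deviation.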
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

Key issues include:
a) Normalization to account for experimental bias
b) Statistical detection of flagrant outliers
Note: these steps are partly subjective (they require experience and/or domain knowledge).
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration


Which out of many measurable properties relate to outcome of interest?
a)   Intrinsic information content
b)   Redundancy relative to other properties
c)   Correlation with target attribute
d)   Iterative model training
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: bar-chart profiles of four candidate features over samples 1-4; two are crossed out for carrying little intrinsic information]

Which out of many measurable properties relate to outcome of interest?
a)   Intrinsic information content
b)   Redundancy relative to other properties
c)   Correlation with target attribute
d)   Iterative model training
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: the same feature profiles; one feature crossed out as redundant with another]

Which out of many measurable properties relate to outcome of interest?
a)   Intrinsic information content
b)   Redundancy relative to other properties
c)   Correlation with target attribute
d)   Iterative model training
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: the same feature profiles plus the target-attribute profile (bottom); one feature crossed out for poor correlation with the target]

Which out of many measurable properties relate to outcome of interest?
a)   Intrinsic information content
b)   Redundancy relative to other properties
c)   Correlation with target attribute
d)   Iterative model training
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

Iterative model training:
•   Train preliminary models based on random sets of properties
•   Evaluate models according to correlative or predictive performance
•   Experiment with promising sets, adding or deleting descriptors to gauge impact on performance

Which out of many measurable properties relate to outcome of interest?
a)   Intrinsic information content
b)   Redundancy relative to other properties
c)   Correlation with target attribute
d)   Iterative model training
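Criteria (b) and (c) can be made concrete with a small Pearson-correlation sketch. The gene names, values, and the greedy keep-or-skip strategy below are illustrative, not from the lecture:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def select_features(features, target, redundancy_cut=0.95):
    """Greedy selection: best-correlated with target first; skip a feature
    if it is nearly a copy of one already chosen (redundancy)."""
    ranked = sorted(features,
                    key=lambda f: abs(pearson(features[f], target)),
                    reverse=True)
    chosen = []
    for f in ranked:
        if all(abs(pearson(features[f], features[g])) < redundancy_cut
               for g in chosen):
            chosen.append(f)
    return chosen

# Invented example: geneB is a scaled copy of geneA, geneC is weak noise.
features = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 10],
    "geneC": [5, 1, 4, 2, 3],
}
target = [1.1, 2.0, 2.9, 4.2, 5.0]
selected = select_features(features, target)
```

Only one of the redundant pair survives, and the weakly correlated feature is kept only because it adds non-redundant information; a real pipeline would also threshold on |r| with the target.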
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration


 Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: scatter plot of outcome y against feature x with a fitted trend line]

Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: the fitted y-vs-x line with a band of width ±n around a y cutoff; samples predicted below the band are NO, above it YES]

Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability
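A minimal version of the correlative approach: fit y against x by least squares, then classify a new sample by whether its predicted y clears a cutoff. The data and the cutoff are invented, and the slide's ±n uncertainty band is omitted for brevity:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope*x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Invented training data with a roughly linear trend.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]
slope, intercept = fit_line(xs, ys)

def predict_class(x, threshold=5.0):
    """YES if the regression predicts y above the (assumed) threshold."""
    return "YES" if slope * x + intercept > threshold else "NO"
```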
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: samples plotted in (x1, x2) space, falling into distinct clusters]

Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: four clusters labeled y1-y4 in (x1, x2) space]

Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability

Cluster interpretations:
y1 = resistant to types I & II diabetes
y2 = susceptible only to type II
y3 = susceptible only to type I
y4 = susceptible to types I & II
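The simplest distance-based assignment puts a new sample in the cluster with the nearest centroid. The centroid coordinates below are invented to match the four-cluster picture:

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Assumed cluster centroids in (x1, x2) space, labeled as on the slide.
centroids = {
    "y1": (1.0, 8.0),   # resistant to types I & II
    "y2": (8.0, 8.0),   # susceptible only to type II
    "y3": (8.0, 2.0),   # susceptible only to type I
    "y4": (1.0, 2.0),   # susceptible to types I & II
}

def assign_cluster(sample):
    """Label a sample by its nearest cluster centroid."""
    return min(centroids, key=lambda c: dist(sample, centroids[c]))
```

Full k-means would additionally re-estimate the centroids from the assigned samples and iterate; the assignment step shown here is the prediction half of that loop.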
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: (x1, x2) plot with a boundary separating samples resistant to type I (above) from susceptible samples (below)]

Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: the same (x1, x2) plot partitioned by thresholds a and b on x2 and c on x1]

Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability

Learned rules:
If x1 < c and x2 > a then resistant
Else if x1 > c and x2 > b then resistant
Else susceptible
(E = 9 misclassified samples)
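The learned rule set translates directly to code. The thresholds a, b, c would come from training; the values below are placeholders for illustration:

```python
# Assumed threshold values; a real rule learner would fit these to data.
A, B, C = 3.0, 6.0, 5.0

def classify_by_rules(x1, x2, a=A, b=B, c=C):
    """The two-branch rule set from the slide: two 'resistant' regions."""
    if x1 < c and x2 > a:
        return "resistant"
    if x1 > c and x2 > b:
        return "resistant"
    return "susceptible"
```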
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: 1-D projections: a threshold a on x1 and a threshold b on x2 each separate Resistant from Susceptible only imperfectly, while the weighted combination Fx1 - Gx2 with threshold c separates them cleanly]

Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability

If Fx1 - Gx2 < c then resistant
Else susceptible
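The weighted-combination rule is a one-line linear discriminant. The weights F, G and cutoff c below are placeholders; in practice they are fitted so that the projection Fx1 - Gx2 separates the classes better than either axis alone:

```python
# Assumed weights and cutoff for illustration only.
F, G, C = 1.0, 1.0, 0.0

def classify_weighted(x1, x2):
    """Slide rule: if F*x1 - G*x2 < c then resistant, else susceptible."""
    return "resistant" if F * x1 - G * x2 < C else "susceptible"
```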
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration


Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: (x1, x2) plot with a classification boundary; Resistant samples are the negative class, Susceptible samples the positive class]

Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation

Accuracy = (TP + TN) / (TP + TN + FP + FN) = 142 / 154
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: the same (x1, x2) plot and boundary; Resistant = negative class, Susceptible = positive class]

Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation

Sensitivity = TP / (TP + FN) = 67 / 72
FPR = FP / (TN + FP) = 6 / 81
Note: Specificity = 1 - FPR
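These formulas are straightforward to compute from confusion-matrix counts; the counts below are chosen to reproduce the slide's 67/72 sensitivity and 6/81 FPR example:

```python
# Confusion counts (Susceptible = positive class, Resistant = negative):
TP, FN = 67, 5    # 72 actually-positive samples
TN, FP = 75, 6    # 81 actually-negative samples

accuracy = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)   # fraction of true positives caught
fpr = FP / (TN + FP)           # false positive rate
specificity = 1 - fpr
```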
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: the same plot with the boundary shifted to make the model less stringent]

Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation

With the less stringent boundary:
Sensitivity = TP / (TP + FN) = 69 / 72
FPR = FP / (TN + FP) = 19 / 81
Note: Specificity = 1 - FPR
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: ROC plot axes: Sensitivity vs. FPR]

Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: ROC curve of Sensitivity vs. FPR traced out as model stringency varies]

Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation

Area under the curve is an excellent measure of model performance:
1.0 = perfect model; 0.5 = random
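A sketch of how the curve and its area are built: sweep a score threshold, record (FPR, Sensitivity) at each setting, and integrate with the trapezoid rule. Scores and labels below are invented (1 = positive class):

```python
def roc_auc(scores, labels):
    """Trapezoid AUC over thresholds taken at each distinct score."""
    p = sum(labels)            # number of positives
    n = len(labels) - p        # number of negatives
    pts = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / n, tp / p))   # (FPR, Sensitivity)
    pts.append((1.0, 1.0))
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

A model that ranks all positives above all negatives scores 1.0; a model that ranks them all below scores 0.0; shuffled rankings land near 0.5.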
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

Predictions are imperfect due to:
•   Imperfect algorithms
•   Imperfect data

Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation
Cross-Validation:

•   Carefully monitor features that are useful across different independent data subsets
•   This can be accomplished with N-fold cross-validation:

[Figure: the data split into five folds; in each of Trials 1-5 a different fold is held out as the Test set and the remaining folds are used to Train; model performance = mean predictive performance over the 5 trials]

•   The best feature selection and classification algorithms will yield the best consistent performance across independent trials
•   The best features will be consistently important across trials
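The five-trial scheme can be sketched as a generic N-fold loop. To keep the example self-contained, the "model" here is just a mean predictor; any classifier slots into the same train/test loop:

```python
def kfold_indices(n, k=5):
    """Split indices 0..n-1 into k contiguous folds; yield (train, test)."""
    size = n // k
    idx = list(range(n))
    for i in range(k):
        # Last fold absorbs any remainder when n is not divisible by k.
        test = idx[i * size:(i + 1) * size] if i < k - 1 else idx[(k - 1) * size:]
        train = [j for j in idx if j not in test]
        yield train, test

def cv_error(ys, k=5):
    """Mean absolute error of a mean-predictor, scored only on held-out folds."""
    errors = []
    for train, test in kfold_indices(len(ys), k):
        model = sum(ys[j] for j in train) / len(train)   # "training"
        errors += [abs(ys[j] - model) for j in test]     # "testing"
    return sum(errors) / len(errors)
```

Because every sample is held out exactly once, the averaged error estimates how the model generalizes rather than how well it memorized the training set.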
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration


Analysis is only useful if it is used, and it only improves if it is tested
a) Good validation requires successful new predictions
b) Imperfect predictions can lead to method refinement and
   greater understanding
Questions?


      Lushington in Silico
Geraldlushington3117 at aol.com
     Geraldlushington.org

More Related Content

PPTX
PAM matrices evolution
PPTX
Entrez databases
PDF
LECTURE NOTES ON BIOINFORMATICS
PDF
Ab Initio Protein Structure Prediction
PDF
Structural databases
PPTX
Sequence alig Sequence Alignment Pairwise alignment:-
PPTX
Protein database
PPTX
(Expasy)
PAM matrices evolution
Entrez databases
LECTURE NOTES ON BIOINFORMATICS
Ab Initio Protein Structure Prediction
Structural databases
Sequence alig Sequence Alignment Pairwise alignment:-
Protein database
(Expasy)

What's hot (20)

PDF
Tools and database of NCBI
PDF
Data mining
PPT
Directed Evolution
PDF
Bioinformatics data mining
PPT
An Introduction to "Bioinformatics & Internet"
DOCX
Bioinformatics on internet
PPTX
CODON BIAS
PPTX
Next Generation Sequencing of DNA
PPTX
Sequence homology search and multiple sequence alignment(1)
PPT
PPTX
Expression and purification of recombinant proteins in Bacterial and yeast sy...
PPTX
protein sequence analysis
PPTX
YEAST TWO HYBRID SYSTEM
PPTX
Protein fold recognition and ab_initio modeling
PPTX
Protein purification
PPTX
Express sequence tags
PPT
Protein protein interaction
PPTX
Protein Databases
PPTX
L21. techniques for selection, screening and characterization of transformants
Tools and database of NCBI
Data mining
Directed Evolution
Bioinformatics data mining
An Introduction to "Bioinformatics & Internet"
Bioinformatics on internet
CODON BIAS
Next Generation Sequencing of DNA
Sequence homology search and multiple sequence alignment(1)
Expression and purification of recombinant proteins in Bacterial and yeast sy...
protein sequence analysis
YEAST TWO HYBRID SYSTEM
Protein fold recognition and ab_initio modeling
Protein purification
Express sequence tags
Protein protein interaction
Protein Databases
L21. techniques for selection, screening and characterization of transformants
Ad

Viewers also liked (20)

PDF
Classification using L1-Penalized Logistic Regression
PPTX
Cross-validation aggregation for forecasting
PPTX
Data mining ppt
PDF
Machine learning group computer science department ULB - Lab'InSight Artifici...
PDF
Learning from data: data mining approaches for Energy & Weather/Climate appli...
PDF
Lecture7 cross validation
PDF
100505 koenig biological_databases
PDF
Introduction to Bioinformatics
PPTX
Introduction to Bioinformatics
PPT
1.bioinformatics introduction 32.03.2071
PPT
B.sc biochem i bobi u-1 introduction to bioinformatics
PDF
Bioinformatics issues and challanges presentation at s p college
PPTX
Bioinformatics
PDF
Nucleic Acid Sequence databases
PPT
Biological Databases
PPTX
Major databases in bioinformatics
PPT
Biological databases
PPTX
blast bioinformatics
PPTX
databases in bioinformatics
PPT
Ch 1 Intro to Data Mining
Classification using L1-Penalized Logistic Regression
Cross-validation aggregation for forecasting
Data mining ppt
Machine learning group computer science department ULB - Lab'InSight Artifici...
Learning from data: data mining approaches for Energy & Weather/Climate appli...
Lecture7 cross validation
100505 koenig biological_databases
Introduction to Bioinformatics
Introduction to Bioinformatics
1.bioinformatics introduction 32.03.2071
B.sc biochem i bobi u-1 introduction to bioinformatics
Bioinformatics issues and challanges presentation at s p college
Bioinformatics
Nucleic Acid Sequence databases
Biological Databases
Major databases in bioinformatics
Biological databases
blast bioinformatics
databases in bioinformatics
Ch 1 Intro to Data Mining
Ad

Similar to Introduction to Data Mining / Bioinformatics (20)

PDF
Online Chemical Modeling Environment: Models
PPTX
Geog2 question 2
PPT
clustering.ppt
PPT
Machine Learning Techniques for the Evaluating of External ...
KEY
Software Engineering Course 2009 - Mining Software Archives
PPTX
About Data Science Introduction about Data mining
PPT
DataMining and Knowledge Discovery in DB.ppt
PPTX
Machine Learning Challenges For Automated Prompting In Smart Homes
PDF
Model-driven decision support for monitoring network design based on analysis...
PDF
Prostate Cancer Diagnosis using Deep Learning with 3D Multiparametric MRI: Pr...
PDF
Introduction to Data Mining
PPTX
Web Security and Privacy: Privacy of Genomic Data
PDF
A_Study_of_K-Nearest_Neighbour_as_an_Imputation_Me.pdf
PPTX
Metabolomics and Beyond Challenges and Strategies for Next-gen Omic Analyses
PPTX
Randomization Approach in Case-Based Reasoning: Case of study of mammography ...
PDF
solved mcq- project cycle ARTIFICIAL INTELLIGENCE.pdf
PDF
Data analytcis-first-steps
PDF
Interscience discovering knowledge in data an introduction to data mining
PDF
Address common data analytics challenges with effective solutions.
PDF
Cluster Analysis : Assignment & Update
Online Chemical Modeling Environment: Models
Geog2 question 2
clustering.ppt
Machine Learning Techniques for the Evaluating of External ...
Software Engineering Course 2009 - Mining Software Archives
About Data Science Introduction about Data mining
DataMining and Knowledge Discovery in DB.ppt
Machine Learning Challenges For Automated Prompting In Smart Homes
Model-driven decision support for monitoring network design based on analysis...
Prostate Cancer Diagnosis using Deep Learning with 3D Multiparametric MRI: Pr...
Introduction to Data Mining
Web Security and Privacy: Privacy of Genomic Data
A_Study_of_K-Nearest_Neighbour_as_an_Imputation_Me.pdf
Metabolomics and Beyond Challenges and Strategies for Next-gen Omic Analyses
Randomization Approach in Case-Based Reasoning: Case of study of mammography ...
solved mcq- project cycle ARTIFICIAL INTELLIGENCE.pdf
Data analytcis-first-steps
Interscience discovering knowledge in data an introduction to data mining
Address common data analytics challenges with effective solutions.
Cluster Analysis : Assignment & Update

More from Gerald Lushington (6)

PDF
A Biclustering Method for Rationalizing Chemical Biology Mechanisms of Action
PDF
Report ghl20130320
PDF
Gerald Lushington presentation on Biologically Relevant Chemical Diversity An...
PDF
LiS services
PDF
Open source
PDF
Personalized medicine via molecular interrogation, data mining and systems bi...
A Biclustering Method for Rationalizing Chemical Biology Mechanisms of Action
Report ghl20130320
Gerald Lushington presentation on Biologically Relevant Chemical Diversity An...
LiS services
Open source
Personalized medicine via molecular interrogation, data mining and systems bi...

Recently uploaded (20)

PDF
Empathic Computing: Creating Shared Understanding
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PPTX
sap open course for s4hana steps from ECC to s4
PDF
Unlocking AI with Model Context Protocol (MCP)
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Approach and Philosophy of On baking technology
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PPTX
Big Data Technologies - Introduction.pptx
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Review of recent advances in non-invasive hemoglobin estimation
Empathic Computing: Creating Shared Understanding
Diabetes mellitus diagnosis method based random forest with bat algorithm
20250228 LYD VKU AI Blended-Learning.pptx
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
sap open course for s4hana steps from ECC to s4
Unlocking AI with Model Context Protocol (MCP)
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
The Rise and Fall of 3GPP – Time for a Sabbatical?
Approach and Philosophy of On baking technology
Spectral efficient network and resource selection model in 5G networks
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
Chapter 3 Spatial Domain Image Processing.pdf
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Reach Out and Touch Someone: Haptics and Empathic Computing
Big Data Technologies - Introduction.pptx
The AUB Centre for AI in Media Proposal.docx
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Network Security Unit 5.pdf for BCA BBA.
Review of recent advances in non-invasive hemoglobin estimation

Introduction to Data Mining / Bioinformatics

  • 1. Introduction to Bioinformatics: Mining Your Data Gerry Lushington Lushington in Silico modeling / informatics consultant
  • 2. What is Data Mining? Use of computational methods to perceive trends in data that can be used to explain or predict important outcomes or properties Applicable across many disciplines: Molecular bioinformatics Medical Informatics Health Informatics Biodiversity informatics
  • 3. Example Applications: Find relationships between: Convenient Observables vs. Important Outcomes a) Relative gene expression data 1. Disease susceptibility b) Relative protein abundance data 2. Drug efficacy c) Relative lipid & metabolite profiles 3. Toxin susceptibility d) Glycosylation variants 4. Immunity e) SNPs, alleles 5. Genetic disorders f) Cellular traits 6. Microbial virulence g) Organism traits 7. Species adaptive success h) Behavioral traits 8. Species complementarity i) Case history
  • 4. Goals for this lecture: Focus on Data Mining: how to approach your data and use it to understand biology Overview of available techniques Understanding model validation Try to think about data you’ve seen: what techniques might be useful? Don’t worry about grasping everything: K-INBRE Bioinformatics Core is here to help!!
  • 5. Basic Data Mining: Find relationships between: a) Easy to measure properties vs. b) Important (but harder to measure) outcomes or attributes Use relationships to understand the conceptual basis for outcomes in b) Use relationships to predict outcomes in new cases where outcome has not yet been measured
  • 6. Basic Data Mining: simple measureables
  • 7. Basic Data Mining: general observation Unhappy Happy
  • 8. Basic Data Mining: relationship (#1) Unhappy Happy Blue = happy; Red = unhappy accuracy = 12/20 = 60%
  • 9. Basic Data Mining: relationship (#2) Unhappy Happy Blue + BIG Red = happy; little red = unhappy accuracy = 17/20 = 85%
  • 10. Data Mining: procedure 1. Data Acquisition 2. Data Preprocessing 3. Feature Selection 4. Classification 5. Validation 6. Prediction & Iteration
  • 11. Data Mining: procedure 1. Data acquisition 2. Data Preprocessing Peak heights? 3. Feature Selection 4. Classification 5. Validation 6. Prediction & Iteration Peak positions? Key issues include: a) format conversion from instrument b) any necessary mathematical manipulations (e.g., Density = M/V)
  • 12. Data Mining: procedure 1. Data acquisition 2. Data Preprocessing 3. Feature Selection 4. Classification 5. Validation 6. Prediction & Iteration Key issues include: a) Normalization to account for experimental bias b) Statistical detection of flagrant outliers
  • 13. Data Mining: procedure 1. Data acquisition 2. Data Preprocessing 3. Feature Selection Use controls to 4. Classification scale data 5. Validation 6. Prediction & Iteration Key issues include: a) Normalization to account for experimental bias b) Statistical detection of flagrant outliers C C 1 2 3 C 1 2 3 C 1 2 3 C 1 2 3
  • 14. Data Mining: procedure 1. Data acquisition 2. Data Preprocessing 3. Feature Selection Subjective 4. Classification (requires experience 5. Validation and/or domain 6. Prediction & Iteration knowledge) Key issues include: a) Normalization to account for experimental bias b) Statistical detection of flagrant outliers
  • 15. Data Mining: procedure 1. Data acquisition 2. Data Preprocessing 3. Feature Selection 4. Classification 5. Validation 6. Prediction & Iteration Which out of many measurable properties relate to outcome of interest? a) Intrinsic information content b) Redundancy relative to other properties c) Correlation with target attribute d) Iterative model training
  • 16. Data Mining: procedure 1. Data acquisition 2. 3. Data Preprocessing Feature Selection x x 4. Classification 5. Validation 6. Prediction & Iteration 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 Which out of many measurable properties relate to outcome of interest? a) Intrinsic information content b) Redundancy relative to other properties c) Correlation with target attribute d) Iterative model training
  • 17. Data Mining: procedure 1. Data acquisition 2. 3. Data Preprocessing Feature Selection x 4. Classification 5. Validation 6. Prediction & Iteration 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 Which out of many measurable properties relate to outcome of interest? a) Intrinsic information content b) Redundancy relative to other properties c) Correlation with target attribute d) Iterative model training
  • 18. Data Mining: procedure 1. Data acquisition 2. 3. Data Preprocessing Feature Selection x 4. Classification 5. Validation 6. Prediction & Iteration 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 Which out of many measurable properties relate to outcome of interest? a) Intrinsic information content b) Redundancy relative to other properties c) Correlation with target attribute d) Iterative model training 1 2 3 4
  • 19. Data Mining: procedure 1. Data acquisition • Train preliminary models based on random sets of properties 2. Data Preprocessing • Evaluate models according to 3. Feature Selection correlative or predictive performance 4. Classification • Experiment with promising sets adding 5. Validation or deleting descriptors to gauge impact 6. Prediction & Iteration on performance Which out of many measurable properties relate to outcome of interest? a) Intrinsic information content b) Redundancy relative to other properties c) Correlation with target attribute d) Iterative model training
  • 20. Data Mining: procedure 1. Data acquisition 2. Data Preprocessing 3. Feature Selection 4. Classification 5. Validation 6. Prediction & Iteration Predict which sample will have which outcome? a) Correlative methods b) Distance-based clustering c) Boundary detection d) Rule learning e) Weighted probability
  • 21. Data Mining: procedure y 1. Data acquisition 2. Data Preprocessing x 3. Feature Selection 4. Classification 5. Validation 6. Prediction & Iteration Predict which sample will have which outcome? a) Correlative methods b) Distance-based clustering c) Boundary detection d) Rule learning e) Weighted probability
  • 22. Data Mining: procedure y 1. Data acquisition 2. Data Preprocessing x 3. Feature Selection 4. Classification 5. Validation 6. Prediction & Iteration -n y +n Predict which sample will have which outcome? NO YES a) Correlative methods b) Distance-based clustering c) Boundary detection d) Rule learning e) Weighted probability
• 23. Data Mining: procedure (Classification: distance-based clustering)
[scatter plot: samples grouped into clusters in the x1–x2 plane]
• 24. Data Mining: procedure (Classification: distance-based clustering)
[scatter plot: four clusters y1–y4 in the x1–x2 plane]
y1 = resistant to types I & II diabetes
y2 = susceptible only to type II
y3 = susceptible only to type I
y4 = susceptible to types I & II
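Distance-based clustering groups samples that sit near each other in feature space; a cluster's dominant outcome (such as the y1–y4 susceptibility classes above) can then be assigned to new samples that fall into it. A minimal k-means sketch with invented 2-D points (the slide itself does not specify an algorithm):

```python
from statistics import mean

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=10):
    """Minimal k-means: assign each sample to the nearest centroid,
    recompute centroids, repeat."""
    centroids = points[:k]  # naive initialization: first k samples
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [tuple(mean(p[d] for p in cl) for d in range(len(points[0])))
                     if cl else centroids[j]
                     for j, cl in enumerate(clusters)]
    return centroids, clusters

# Two well-separated blobs in the (x1, x2) plane
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, cls = kmeans(pts, 2)
```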
• 25. Data Mining: procedure (Classification: boundary detection)
[scatter plot: boundary in the x1–x2 plane separating samples resistant to type I from samples susceptible to type I]
• 26. Data Mining: procedure (Classification: rule learning)
[scatter plot: thresholds a and b on x2 and c on x1 partitioning the x1–x2 plane into resistant and susceptible regions]
If x1 < c and x2 > a then resistant
Else if x1 > c and x2 > b then resistant
Else susceptible
(E = 9)
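The learned rule on this slide translates directly into code. The threshold values a, b, c below are placeholders, since the slide defines them only graphically:

```python
def rule_classify(x1, x2, a, b, c):
    """Decision rule from slide 26:
    if x1 < c and x2 > a      -> resistant
    else if x1 > c and x2 > b -> resistant
    else                      -> susceptible"""
    if x1 < c and x2 > a:
        return "resistant"
    if x1 > c and x2 > b:
        return "resistant"
    return "susceptible"

# Hypothetical thresholds, purely for illustration
a, b, c = 1.0, 2.0, 5.0
r1 = rule_classify(3.0, 1.5, a, b, c)  # x1 < c and x2 > a -> "resistant"
r2 = rule_classify(6.0, 1.5, a, b, c)  # fails both branches -> "susceptible"
```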
• 27. Data Mining: procedure (Classification: weighted probability)
[1-D projections: threshold a on x1 (Resistant | Susc.), threshold b on x2 (Susc. | Resistant), threshold c on the combined score F·x1 − G·x2 (Resistant | Susc.)]
If F·x1 − G·x2 < c then resistant
Else susceptible
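The weighted combination here folds two observables into a single score before thresholding, i.e. a simple linear discriminant. The weights F, G and cutoff c below are placeholders, as the slide does not give numeric values:

```python
def linear_classify(x1, x2, F, G, c):
    """Slide 27's combined rule: project each sample onto the
    weighted score F*x1 - G*x2 and threshold at c."""
    return "resistant" if F * x1 - G * x2 < c else "susceptible"

# Hypothetical weights and cutoff, purely for illustration
r1 = linear_classify(1.0, 2.0, F=1.0, G=1.0, c=0.0)  # score -1 -> "resistant"
r2 = linear_classify(3.0, 1.0, F=1.0, G=1.0, c=0.0)  # score +2 -> "susceptible"
```

A single threshold on the combined score can separate classes that no threshold on x1 or x2 alone separates cleanly.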
• 28. Data Mining: procedure (Validation)
Define criteria and tests to prove model validity:
a) Accuracy
b) Sensitivity vs. Specificity
c) Receiver Operating Characteristic (ROC) plot
d) Cross-validation
• 29. Data Mining: procedure (Validation: accuracy)
[scatter plot in the x1–x2 plane: Resistant (Neg.) vs. Susceptible (Pos.) regions]
Accuracy = (TP + TN) / (TP + TN + FP + FN) = 142 / 154
• 30. Data Mining: procedure (Validation: sensitivity vs. specificity)
[same classification plot: Resistant (Neg.) vs. Susceptible (Pos.)]
Sensitivity = TP / (TP + FN) = 67 / 72
FPR = FP / (TN + FP) = 6 / 81
Note: Specificity = 1 − FPR
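These criteria all follow from the four confusion-matrix counts. A sketch using counts inferred from the fractions on slide 30 (67/72 positives implies FN = 5; 6/81 negatives implies TN = 75):

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard validation criteria from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                   # true-positive rate
    fpr = fp / (tn + fp)                           # false-positive rate
    specificity = 1 - fpr
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, fpr, specificity, accuracy

# Counts inferred from slide 30's fractions
sens, fpr, spec, acc = confusion_metrics(tp=67, fn=5, fp=6, tn=75)
# sens = 67/72 ≈ 0.93, fpr = 6/81 ≈ 0.07
```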
• 31. Data Mining: procedure (Validation: varying model stringency)
[same classification plot with the decision boundary shifted from more to less stringent]
Sensitivity = TP / (TP + FN) = 69 / 72
FPR = FP / (TN + FP) = 19 / 81
Note: Specificity = 1 − FPR
• 32. Data Mining: procedure (Validation: ROC plot)
[ROC plot: Sensitivity vs. FPR]
• 33. Data Mining: procedure (Validation: ROC plot)
[ROC plot: Sensitivity vs. FPR]
Area under the curve is an excellent measure of model performance:
1.0 = perfect model; 0.5 = random
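The ROC area under the curve equals the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one, which gives a compact way to compute it without tracing the curve (the example scores below are invented):

```python
def roc_auc(scores, labels):
    """Rank-based AUC: fraction of (positive, negative) pairs in
    which the positive outscores the negative; ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc_perfect = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])  # 1.0
auc_mixed = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 0, 1, 0])    # 0.75
```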
• 34. Data Mining: procedure (Validation)
Predictions are imperfect due to:
• Imperfect algorithms
• Imperfect data
• 35. Cross-Validation:
• Carefully monitor features that are useful across different independent data subsets
• This can be accomplished with N-fold cross-validation:
[diagram: Trials 1–5, each holding out a different fifth of the data for testing and training on the rest]
Model performance = mean predictive performance over the 5 trials
• The best feature-selection and classification algorithms will yield consistent performance across independent trials
• The best features will be consistently important across trials
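The 5-trial scheme above can be sketched as a generic N-fold splitter; model performance is then the mean test score over the trials (the scoring of each trial is left abstract here):

```python
def n_fold_splits(samples, n=5):
    """Partition samples into n folds; each trial holds one fold out
    for testing and trains on the remaining n-1 folds."""
    folds = [samples[i::n] for i in range(n)]
    for i in range(n):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

# 10 samples, 5 folds: every sample is tested exactly once
splits = list(n_fold_splits(list(range(10)), n=5))
```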
• 36. Data Mining: procedure (Prediction & Iteration)
Analysis is only useful if it is used, and it only improves if it is tested:
a) Good validation requires successful new predictions
b) Imperfect predictions can lead to method refinement and greater understanding
• 37. Questions?
Lushington in Silico
Geraldlushington3117 at aol.com
Geraldlushington.org