Introduction to Bioinformatics:
      Mining Your Data

          Gerry Lushington
         Lushington in Silico

  modeling / informatics consultant
What is Data Mining?
 Use of computational methods to perceive trends in data that
 can be used to explain or predict important outcomes or
 properties

Applicable across many disciplines:
Molecular bioinformatics

Medical Informatics

Health Informatics

Biodiversity informatics
Example Applications:
     Find relationships between:

Convenient Observables                vs.    Important Outcomes
a)    Relative gene expression data          1.   Disease susceptibility
b)    Relative protein abundance data        2.   Drug efficacy
c)    Relative lipid & metabolite profiles   3.   Toxin susceptibility
d)    Glycosylation variants                 4.   Immunity
e)    SNPs, alleles                          5.   Genetic disorders
f)    Cellular traits                        6.   Microbial virulence
g)    Organism traits                        7.   Species adaptive success
h)    Behavioral traits                      8.   Species complementarity
i)    Case history
Goals for this lecture:
Focus on Data Mining: how to approach your data and use it to
understand biology

Overview of available techniques

Understanding model validation

Try to think about data you’ve seen: what techniques might be
useful?




        Don’t worry about grasping everything:
     K-INBRE Bioinformatics Core is here to help!!
Basic Data Mining:
Find relationships between:
a) Easy to measure properties   vs.
b) Important (but harder to measure) outcomes or attributes

Use relationships to understand the conceptual basis for
outcomes in b)

Use relationships to predict outcomes in new cases where
outcome has not yet been measured
Basic Data Mining: simple measurables
Basic Data Mining: general observation

[Figure: samples sorted into Unhappy and Happy groups]
Basic Data Mining: relationship (#1)

[Figure: Unhappy vs. Happy samples, colored blue or red]

Blue = happy; Red = unhappy (accuracy = 12/20 = 60%)
Basic Data Mining: relationship (#2)

[Figure: Unhappy vs. Happy samples, colored and sized]

Blue + BIG Red = happy; little red = unhappy (accuracy = 17/20 = 85%)
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: an instrument trace; annotations ask "Peak heights?" and "Peak positions?"]

Key issues include:
a) format conversion from instrument
b) any necessary mathematical manipulations (e.g., Density = M/V)
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration


Key issues include:
a) Normalization to account for experimental bias
b) Statistical detection of flagrant outliers
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

Key issues include:
a) Normalization to account for experimental bias (use controls to scale data)
b) Statistical detection of flagrant outliers

[Figure: bar charts of a control well (C) and samples 1-3 across plates, before and after scaling to the control]
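The control-based scaling idea can be sketched in a few lines of Python. The plate layout, well names, and values below are invented for illustration, and the outlier cut is a crude z-score rule:

```python
from statistics import mean, stdev

# Hypothetical plate data: each plate has a control well "C" and three
# sample wells; plate2 reads ~2x brighter overall (experimental bias).
plates = {
    "plate1": {"C": 2.0, "s1": 1.0, "s2": 3.0, "s3": 2.5},
    "plate2": {"C": 4.0, "s1": 2.0, "s2": 6.0, "s3": 5.0},
}

def normalize_to_control(plate):
    """Scale every sample well by the plate's control, removing plate bias."""
    control = plate["C"]
    return {well: v / control for well, v in plate.items() if well != "C"}

normalized = {name: normalize_to_control(p) for name, p in plates.items()}

def flag_outliers(values, z_cut=2.5):
    """Flag values more than z_cut standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if s > 0 and abs(v - m) / s > z_cut]
```

After scaling, the two plates agree well by well; more robust outlier detection would use the median and MAD rather than mean and standard deviation.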
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

Key issues include:
a) Normalization to account for experimental bias
b) Statistical detection of flagrant outliers
Note: these steps are partly subjective (they require experience and/or domain knowledge).
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration


Which out of many measurable properties relate to outcome of interest?
a)   Intrinsic information content
b)   Redundancy relative to other properties
c)   Correlation with target attribute
d)   Iterative model training
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: bar-chart profiles of four candidate features over samples 1-4; two are crossed out for carrying little intrinsic information]

Which out of many measurable properties relate to outcome of interest?
a)   Intrinsic information content
b)   Redundancy relative to other properties
c)   Correlation with target attribute
d)   Iterative model training
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: the same feature profiles; one feature crossed out as redundant with another]

Which out of many measurable properties relate to outcome of interest?
a)   Intrinsic information content
b)   Redundancy relative to other properties
c)   Correlation with target attribute
d)   Iterative model training
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: the same feature profiles plus the target-attribute profile (bottom); one feature crossed out for poor correlation with the target]

Which out of many measurable properties relate to outcome of interest?
a)   Intrinsic information content
b)   Redundancy relative to other properties
c)   Correlation with target attribute
d)   Iterative model training
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

Iterative model training:
•   Train preliminary models based on random sets of properties
•   Evaluate models according to correlative or predictive performance
•   Experiment with promising sets, adding or deleting descriptors to gauge impact on performance

Which out of many measurable properties relate to outcome of interest?
a)   Intrinsic information content
b)   Redundancy relative to other properties
c)   Correlation with target attribute
d)   Iterative model training
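Criteria (b) and (c) can be made concrete with a small Pearson-correlation sketch. The gene names, values, and the greedy keep-or-skip strategy below are illustrative, not from the lecture:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def select_features(features, target, redundancy_cut=0.95):
    """Greedy selection: best-correlated with target first; skip a feature
    if it is nearly a copy of one already chosen (redundancy)."""
    ranked = sorted(features,
                    key=lambda f: abs(pearson(features[f], target)),
                    reverse=True)
    chosen = []
    for f in ranked:
        if all(abs(pearson(features[f], features[g])) < redundancy_cut
               for g in chosen):
            chosen.append(f)
    return chosen

# Invented example: geneB is a scaled copy of geneA, geneC is weak noise.
features = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 10],
    "geneC": [5, 1, 4, 2, 3],
}
target = [1.1, 2.0, 2.9, 4.2, 5.0]
selected = select_features(features, target)
```

Only one of the redundant pair survives, and the weakly correlated feature is kept only because it adds non-redundant information; a real pipeline would also threshold on |r| with the target.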
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration


 Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: scatter plot of outcome y against feature x with a fitted trend line]

Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: the fitted y-vs-x line with a band of width ±n around a y cutoff; samples predicted below the band are NO, above it YES]

Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability
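A minimal version of the correlative approach: fit y against x by least squares, then classify a new sample by whether its predicted y clears a cutoff. The data and the cutoff are invented, and the slide's ±n uncertainty band is omitted for brevity:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope*x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Invented training data with a roughly linear trend.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]
slope, intercept = fit_line(xs, ys)

def predict_class(x, threshold=5.0):
    """YES if the regression predicts y above the (assumed) threshold."""
    return "YES" if slope * x + intercept > threshold else "NO"
```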
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: samples plotted in (x1, x2) space, falling into distinct clusters]

Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: four clusters labeled y1-y4 in (x1, x2) space]

Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability

Cluster interpretations:
y1 = resistant to types I & II diabetes
y2 = susceptible only to type II
y3 = susceptible only to type I
y4 = susceptible to types I & II
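The simplest distance-based assignment puts a new sample in the cluster with the nearest centroid. The centroid coordinates below are invented to match the four-cluster picture:

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Assumed cluster centroids in (x1, x2) space, labeled as on the slide.
centroids = {
    "y1": (1.0, 8.0),   # resistant to types I & II
    "y2": (8.0, 8.0),   # susceptible only to type II
    "y3": (8.0, 2.0),   # susceptible only to type I
    "y4": (1.0, 2.0),   # susceptible to types I & II
}

def assign_cluster(sample):
    """Label a sample by its nearest cluster centroid."""
    return min(centroids, key=lambda c: dist(sample, centroids[c]))
```

Full k-means would additionally re-estimate the centroids from the assigned samples and iterate; the assignment step shown here is the prediction half of that loop.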
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: (x1, x2) plot with a boundary separating samples resistant to type I (above) from susceptible samples (below)]

Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: the same (x1, x2) plot partitioned by thresholds a and b on x2 and c on x1]

Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability

Learned rules:
If x1 < c and x2 > a then resistant
Else if x1 > c and x2 > b then resistant
Else susceptible
(E = 9 misclassified samples)
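The learned rule set translates directly to code. The thresholds a, b, c would come from training; the values below are placeholders for illustration:

```python
# Assumed threshold values; a real rule learner would fit these to data.
A, B, C = 3.0, 6.0, 5.0

def classify_by_rules(x1, x2, a=A, b=B, c=C):
    """The two-branch rule set from the slide: two 'resistant' regions."""
    if x1 < c and x2 > a:
        return "resistant"
    if x1 > c and x2 > b:
        return "resistant"
    return "susceptible"
```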
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: 1-D projections: a threshold a on x1 and a threshold b on x2 each separate Resistant from Susceptible only imperfectly, while the weighted combination Fx1 - Gx2 with threshold c separates them cleanly]

Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability

If Fx1 - Gx2 < c then resistant
Else susceptible
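The weighted-combination rule is a one-line linear discriminant. The weights F, G and cutoff c below are placeholders; in practice they are fitted so that the projection Fx1 - Gx2 separates the classes better than either axis alone:

```python
# Assumed weights and cutoff for illustration only.
F, G, C = 1.0, 1.0, 0.0

def classify_weighted(x1, x2):
    """Slide rule: if F*x1 - G*x2 < c then resistant, else susceptible."""
    return "resistant" if F * x1 - G * x2 < C else "susceptible"
```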
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration


Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: (x1, x2) plot with a classification boundary; Resistant samples are the negative class, Susceptible samples the positive class]

Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation

Accuracy = (TP + TN) / (TP + TN + FP + FN) = 142 / 154
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: the same (x1, x2) plot and boundary; Resistant = negative class, Susceptible = positive class]

Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation

Sensitivity = TP / (TP + FN) = 67 / 72
FPR = FP / (TN + FP) = 6 / 81
Note: Specificity = 1 - FPR
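These formulas are straightforward to compute from confusion-matrix counts; the counts below are chosen to reproduce the slide's 67/72 sensitivity and 6/81 FPR example:

```python
# Confusion counts (Susceptible = positive class, Resistant = negative):
TP, FN = 67, 5    # 72 actually-positive samples
TN, FP = 75, 6    # 81 actually-negative samples

accuracy = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)   # fraction of true positives caught
fpr = FP / (TN + FP)           # false positive rate
specificity = 1 - fpr
```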
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: the same plot with the boundary shifted to make the model less stringent]

Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation

With the less stringent boundary:
Sensitivity = TP / (TP + FN) = 69 / 72
FPR = FP / (TN + FP) = 19 / 81
Note: Specificity = 1 - FPR
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: ROC plot axes: Sensitivity vs. FPR]

Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Figure: ROC curve of Sensitivity vs. FPR traced out as model stringency varies]

Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation

Area under the curve is an excellent measure of model performance:
1.0 = perfect model; 0.5 = random
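A sketch of how the curve and its area are built: sweep a score threshold, record (FPR, Sensitivity) at each setting, and integrate with the trapezoid rule. Scores and labels below are invented (1 = positive class):

```python
def roc_auc(scores, labels):
    """Trapezoid AUC over thresholds taken at each distinct score."""
    p = sum(labels)            # number of positives
    n = len(labels) - p        # number of negatives
    pts = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / n, tp / p))   # (FPR, Sensitivity)
    pts.append((1.0, 1.0))
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

A model that ranks all positives above all negatives scores 1.0; a model that ranks them all below scores 0.0; shuffled rankings land near 0.5.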
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

Predictions are imperfect due to:
•   Imperfect algorithms
•   Imperfect data

Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation
Cross-Validation:

•   Carefully monitor features that are useful across different independent data subsets
•   This can be accomplished with N-fold cross-validation:

[Figure: the data split into five folds; in each of Trials 1-5 a different fold is held out as the Test set and the remaining folds are used to Train; model performance = mean predictive performance over the 5 trials]

•   The best feature selection and classification algorithms will yield the best consistent performance across independent trials
•   The best features will be consistently important across trials
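The five-trial scheme can be sketched as a generic N-fold loop. To keep the example self-contained, the "model" here is just a mean predictor; any classifier slots into the same train/test loop:

```python
def kfold_indices(n, k=5):
    """Split indices 0..n-1 into k contiguous folds; yield (train, test)."""
    size = n // k
    idx = list(range(n))
    for i in range(k):
        # Last fold absorbs any remainder when n is not divisible by k.
        test = idx[i * size:(i + 1) * size] if i < k - 1 else idx[(k - 1) * size:]
        train = [j for j in idx if j not in test]
        yield train, test

def cv_error(ys, k=5):
    """Mean absolute error of a mean-predictor, scored only on held-out folds."""
    errors = []
    for train, test in kfold_indices(len(ys), k):
        model = sum(ys[j] for j in train) / len(train)   # "training"
        errors += [abs(ys[j] - model) for j in test]     # "testing"
    return sum(errors) / len(errors)
```

Because every sample is held out exactly once, the averaged error estimates how the model generalizes rather than how well it memorized the training set.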
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration


Analysis is only useful if it is used, and it only improves if it is tested
a) Good validation requires successful new predictions
b) Imperfect predictions can lead to method refinement and
   greater understanding
Questions?


      Lushington in Silico
Geraldlushington3117 at aol.com
     Geraldlushington.org

More Related Content

PPTX
PAM matrices evolution
PPTX
Entrez databases
PDF
LECTURE NOTES ON BIOINFORMATICS
PDF
Ab Initio Protein Structure Prediction
PDF
Structural databases
PPTX
Sequence alig Sequence Alignment Pairwise alignment:-
PPTX
Protein database
PPTX
(Expasy)
PAM matrices evolution
Entrez databases
LECTURE NOTES ON BIOINFORMATICS
Ab Initio Protein Structure Prediction
Structural databases
Sequence alig Sequence Alignment Pairwise alignment:-
Protein database
(Expasy)

What's hot (20)

PDF
Tools and database of NCBI
PDF
Data mining
PPT
Directed Evolution
PDF
Bioinformatics data mining
PPT
An Introduction to "Bioinformatics & Internet"
DOCX
Bioinformatics on internet
PPTX
CODON BIAS
PPTX
Next Generation Sequencing of DNA
PPTX
Sequence homology search and multiple sequence alignment(1)
PPT
PPTX
Expression and purification of recombinant proteins in Bacterial and yeast sy...
PPTX
protein sequence analysis
PPTX
YEAST TWO HYBRID SYSTEM
PPTX
Protein fold recognition and ab_initio modeling
PPTX
Protein purification
PPTX
Express sequence tags
PPT
Protein protein interaction
PPTX
Protein Databases
PPTX
L21. techniques for selection, screening and characterization of transformants
Tools and database of NCBI
Data mining
Directed Evolution
Bioinformatics data mining
An Introduction to "Bioinformatics & Internet"
Bioinformatics on internet
CODON BIAS
Next Generation Sequencing of DNA
Sequence homology search and multiple sequence alignment(1)
Expression and purification of recombinant proteins in Bacterial and yeast sy...
protein sequence analysis
YEAST TWO HYBRID SYSTEM
Protein fold recognition and ab_initio modeling
Protein purification
Express sequence tags
Protein protein interaction
Protein Databases
L21. techniques for selection, screening and characterization of transformants
Ad

Viewers also liked (20)

PDF
Classification using L1-Penalized Logistic Regression
PPTX
Cross-validation aggregation for forecasting
PPTX
Data mining ppt
PDF
Machine learning group computer science department ULB - Lab'InSight Artifici...
PDF
Learning from data: data mining approaches for Energy & Weather/Climate appli...
PDF
Lecture7 cross validation
PDF
100505 koenig biological_databases
PDF
Introduction to Bioinformatics
PPTX
Introduction to Bioinformatics
PPT
1.bioinformatics introduction 32.03.2071
PPT
B.sc biochem i bobi u-1 introduction to bioinformatics
PDF
Bioinformatics issues and challanges presentation at s p college
PPTX
Bioinformatics
PDF
Nucleic Acid Sequence databases
PPT
Biological Databases
PPTX
Major databases in bioinformatics
PPT
Biological databases
PPTX
blast bioinformatics
PPTX
databases in bioinformatics
PPT
Ch 1 Intro to Data Mining
Classification using L1-Penalized Logistic Regression
Cross-validation aggregation for forecasting
Data mining ppt
Machine learning group computer science department ULB - Lab'InSight Artifici...
Learning from data: data mining approaches for Energy & Weather/Climate appli...
Lecture7 cross validation
100505 koenig biological_databases
Introduction to Bioinformatics
Introduction to Bioinformatics
1.bioinformatics introduction 32.03.2071
B.sc biochem i bobi u-1 introduction to bioinformatics
Bioinformatics issues and challanges presentation at s p college
Bioinformatics
Nucleic Acid Sequence databases
Biological Databases
Major databases in bioinformatics
Biological databases
blast bioinformatics
databases in bioinformatics
Ch 1 Intro to Data Mining
Ad

Similar to Introduction to Data Mining / Bioinformatics (20)

PDF
Online Chemical Modeling Environment: Models
PPTX
Geog2 question 2
PPT
clustering.ppt
PPT
Machine Learning Techniques for the Evaluating of External ...
KEY
Software Engineering Course 2009 - Mining Software Archives
PPTX
About Data Science Introduction about Data mining
PPT
DataMining and Knowledge Discovery in DB.ppt
PPTX
Machine Learning Challenges For Automated Prompting In Smart Homes
PDF
Model-driven decision support for monitoring network design based on analysis...
PDF
Prostate Cancer Diagnosis using Deep Learning with 3D Multiparametric MRI: Pr...
PDF
Introduction to Data Mining
PPTX
Web Security and Privacy: Privacy of Genomic Data
PDF
A_Study_of_K-Nearest_Neighbour_as_an_Imputation_Me.pdf
PPTX
Metabolomics and Beyond Challenges and Strategies for Next-gen Omic Analyses
PPTX
Randomization Approach in Case-Based Reasoning: Case of study of mammography ...
PDF
solved mcq- project cycle ARTIFICIAL INTELLIGENCE.pdf
PDF
Data analytcis-first-steps
PDF
Interscience discovering knowledge in data an introduction to data mining
PDF
Address common data analytics challenges with effective solutions.
PDF
Cluster Analysis : Assignment & Update
Online Chemical Modeling Environment: Models
Geog2 question 2
clustering.ppt
Machine Learning Techniques for the Evaluating of External ...
Software Engineering Course 2009 - Mining Software Archives
About Data Science Introduction about Data mining
DataMining and Knowledge Discovery in DB.ppt
Machine Learning Challenges For Automated Prompting In Smart Homes
Model-driven decision support for monitoring network design based on analysis...
Prostate Cancer Diagnosis using Deep Learning with 3D Multiparametric MRI: Pr...
Introduction to Data Mining
Web Security and Privacy: Privacy of Genomic Data
A_Study_of_K-Nearest_Neighbour_as_an_Imputation_Me.pdf
Metabolomics and Beyond Challenges and Strategies for Next-gen Omic Analyses
Randomization Approach in Case-Based Reasoning: Case of study of mammography ...
solved mcq- project cycle ARTIFICIAL INTELLIGENCE.pdf
Data analytcis-first-steps
Interscience discovering knowledge in data an introduction to data mining
Address common data analytics challenges with effective solutions.
Cluster Analysis : Assignment & Update

More from Gerald Lushington (6)

PDF
A Biclustering Method for Rationalizing Chemical Biology Mechanisms of Action
PDF
Report ghl20130320
PDF
Gerald Lushington presentation on Biologically Relevant Chemical Diversity An...
PDF
LiS services
PDF
Open source
PDF
Personalized medicine via molecular interrogation, data mining and systems bi...
A Biclustering Method for Rationalizing Chemical Biology Mechanisms of Action
Report ghl20130320
Gerald Lushington presentation on Biologically Relevant Chemical Diversity An...
LiS services
Open source
Personalized medicine via molecular interrogation, data mining and systems bi...

Recently uploaded (20)

PDF
Empathic Computing: Creating Shared Understanding
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PPTX
sap open course for s4hana steps from ECC to s4
PDF
Unlocking AI with Model Context Protocol (MCP)
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Approach and Philosophy of On baking technology
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PPTX
Big Data Technologies - Introduction.pptx
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Review of recent advances in non-invasive hemoglobin estimation
Empathic Computing: Creating Shared Understanding
Diabetes mellitus diagnosis method based random forest with bat algorithm
20250228 LYD VKU AI Blended-Learning.pptx
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
sap open course for s4hana steps from ECC to s4
Unlocking AI with Model Context Protocol (MCP)
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
The Rise and Fall of 3GPP – Time for a Sabbatical?
Approach and Philosophy of On baking technology
Spectral efficient network and resource selection model in 5G networks
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
Chapter 3 Spatial Domain Image Processing.pdf
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Reach Out and Touch Someone: Haptics and Empathic Computing
Big Data Technologies - Introduction.pptx
The AUB Centre for AI in Media Proposal.docx
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Network Security Unit 5.pdf for BCA BBA.
Review of recent advances in non-invasive hemoglobin estimation

Introduction to Data Mining / Bioinformatics

  • 1. Introduction to Bioinformatics: Mining Your Data Gerry Lushington Lushington in Silico modeling / informatics consultant
  • 2. What is Data Mining? Use of computational methods to perceive trends in data that can be used to explain or predict important outcomes or properties Applicable across many disciplines: Molecular bioinformatics Medical Informatics Health Informatics Biodiversity informatics
  • 3. Example Applications: Find relationships between: Convenient Observables vs. Important Outcomes a) Relative gene expression data 1. Disease susceptibility b) Relative protein abundance data 2. Drug efficacy c) Relative lipid & metabolite profiles 3. Toxin susceptibility d) Glycosylation variants 4. Immunity e) SNPs, alleles 5. Genetic disorders f) Cellular traits 6. Microbial virulence g) Organism traits 7. Species adaptive success h) Behavioral traits 8. Species complementarity i) Case history
  • 4. Goals for this lecture: Focus on Data Mining: how to approach your data and use it to understand biology Overview of available techniques Understanding model validation Try to think about data you’ve seen: what techniques might be useful? Don’t worry about grasping everything: K-INBRE Bioinformatics Core is here to help!!
  • 5. Basic Data Mining: Find relationships between: a) Easy to measure properties vs. b) Important (but harder to measure) outcomes or attributes Use relationships to understand the conceptual basis for outcomes in b) Use relationships to predict outcomes in new cases where outcome has not yet been measured
  • 6. Basic Data Mining: simple measureables
  • 7. Basic Data Mining: general observation Unhappy Happy
  • 8. Basic Data Mining: relationship (#1) Unhappy Happy Blue = happy; Red = unhappy accuracy = 12/20 = 60%
  • 9. Basic Data Mining: relationship (#2) Unhappy Happy Blue + BIG Red = happy; little red = unhappy accuracy = 17/20 = 85%
  • 10. Data Mining: procedure 1. Data Acquisition 2. Data Preprocessing 3. Feature Selection 4. Classification 5. Validation 6. Prediction & Iteration
  • 11. Data Mining: procedure 1. Data acquisition 2. Data Preprocessing Peak heights? 3. Feature Selection 4. Classification 5. Validation 6. Prediction & Iteration Peak positions? Key issues include: a) format conversion from instrument b) any necessary mathematical manipulations (e.g., Density = M/V)
  • 12. Data Mining: procedure 1. Data acquisition 2. Data Preprocessing 3. Feature Selection 4. Classification 5. Validation 6. Prediction & Iteration Key issues include: a) Normalization to account for experimental bias b) Statistical detection of flagrant outliers
  • 13. Data Mining: procedure 1. Data acquisition 2. Data Preprocessing 3. Feature Selection Use controls to 4. Classification scale data 5. Validation 6. Prediction & Iteration Key issues include: a) Normalization to account for experimental bias b) Statistical detection of flagrant outliers C C 1 2 3 C 1 2 3 C 1 2 3 C 1 2 3
  • 14. Data Mining: procedure 1. Data acquisition 2. Data Preprocessing 3. Feature Selection Subjective 4. Classification (requires experience 5. Validation and/or domain 6. Prediction & Iteration knowledge) Key issues include: a) Normalization to account for experimental bias b) Statistical detection of flagrant outliers
  • 15. Data Mining: procedure 1. Data acquisition 2. Data Preprocessing 3. Feature Selection 4. Classification 5. Validation 6. Prediction & Iteration Which out of many measurable properties relate to outcome of interest? a) Intrinsic information content b) Redundancy relative to other properties c) Correlation with target attribute d) Iterative model training
  • 16. Data Mining: procedure 1. Data acquisition 2. 3. Data Preprocessing Feature Selection x x 4. Classification 5. Validation 6. Prediction & Iteration 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 Which out of many measurable properties relate to outcome of interest? a) Intrinsic information content b) Redundancy relative to other properties c) Correlation with target attribute d) Iterative model training
  • 17. Data Mining: procedure 1. Data acquisition 2. 3. Data Preprocessing Feature Selection x 4. Classification 5. Validation 6. Prediction & Iteration 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 Which out of many measurable properties relate to outcome of interest? a) Intrinsic information content b) Redundancy relative to other properties c) Correlation with target attribute d) Iterative model training
  • 18. Data Mining: procedure 1. Data acquisition 2. 3. Data Preprocessing Feature Selection x 4. Classification 5. Validation 6. Prediction & Iteration 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 Which out of many measurable properties relate to outcome of interest? a) Intrinsic information content b) Redundancy relative to other properties c) Correlation with target attribute d) Iterative model training 1 2 3 4
  • 19. Data Mining: procedure 1. Data acquisition • Train preliminary models based on random sets of properties 2. Data Preprocessing • Evaluate models according to 3. Feature Selection correlative or predictive performance 4. Classification • Experiment with promising sets adding 5. Validation or deleting descriptors to gauge impact 6. Prediction & Iteration on performance Which out of many measurable properties relate to outcome of interest? a) Intrinsic information content b) Redundancy relative to other properties c) Correlation with target attribute d) Iterative model training
  • 20. Data Mining: procedure 1. Data acquisition 2. Data Preprocessing 3. Feature Selection 4. Classification 5. Validation 6. Prediction & Iteration Predict which sample will have which outcome? a) Correlative methods b) Distance-based clustering c) Boundary detection d) Rule learning e) Weighted probability
  • 21. Data Mining: procedure y 1. Data acquisition 2. Data Preprocessing x 3. Feature Selection 4. Classification 5. Validation 6. Prediction & Iteration Predict which sample will have which outcome? a) Correlative methods b) Distance-based clustering c) Boundary detection d) Rule learning e) Weighted probability
  • 22. Data Mining: procedure y 1. Data acquisition 2. Data Preprocessing x 3. Feature Selection 4. Classification 5. Validation 6. Prediction & Iteration -n y +n Predict which sample will have which outcome? NO YES a) Correlative methods b) Distance-based clustering c) Boundary detection d) Rule learning e) Weighted probability
• 23. Data Mining: procedure (Classification: distance-based clustering)
[scatter plot: samples grouped into clusters in the x1–x2 plane]
• 24. Data Mining: procedure (Classification: distance-based clustering)
[scatter plot: four clusters y1–y4 in the x1–x2 plane]
y1 = resistant to types I & II diabetes
y2 = susceptible only to type II
y3 = susceptible only to type I
y4 = susceptible to types I & II
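Distance-based clustering groups samples that sit near each other in feature space; a cluster's dominant outcome (such as the y1–y4 susceptibility classes above) can then be assigned to new samples that fall into it. A minimal k-means sketch with invented 2-D points (the slide itself does not specify an algorithm):

```python
from statistics import mean

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=10):
    """Minimal k-means: assign each sample to the nearest centroid,
    recompute centroids, repeat."""
    centroids = points[:k]  # naive initialization: first k samples
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [tuple(mean(p[d] for p in cl) for d in range(len(points[0])))
                     if cl else centroids[j]
                     for j, cl in enumerate(clusters)]
    return centroids, clusters

# Two well-separated blobs in the (x1, x2) plane
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, cls = kmeans(pts, 2)
```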
• 25. Data Mining: procedure (Classification: boundary detection)
[scatter plot: boundary in the x1–x2 plane separating samples resistant to type I from samples susceptible to type I]
• 26. Data Mining: procedure (Classification: rule learning)
[scatter plot: thresholds a and b on x2 and c on x1 partitioning the x1–x2 plane into resistant and susceptible regions]
If x1 < c and x2 > a then resistant
Else if x1 > c and x2 > b then resistant
Else susceptible
(E = 9)
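The learned rule on this slide translates directly into code. The threshold values a, b, c below are placeholders, since the slide defines them only graphically:

```python
def rule_classify(x1, x2, a, b, c):
    """Decision rule from slide 26:
    if x1 < c and x2 > a      -> resistant
    else if x1 > c and x2 > b -> resistant
    else                      -> susceptible"""
    if x1 < c and x2 > a:
        return "resistant"
    if x1 > c and x2 > b:
        return "resistant"
    return "susceptible"

# Hypothetical thresholds, purely for illustration
a, b, c = 1.0, 2.0, 5.0
r1 = rule_classify(3.0, 1.5, a, b, c)  # x1 < c and x2 > a -> "resistant"
r2 = rule_classify(6.0, 1.5, a, b, c)  # fails both branches -> "susceptible"
```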
• 27. Data Mining: procedure (Classification: weighted probability)
[1-D projections: threshold a on x1 (Resistant | Susc.), threshold b on x2 (Susc. | Resistant), threshold c on the combined score F·x1 − G·x2 (Resistant | Susc.)]
If F·x1 − G·x2 < c then resistant
Else susceptible
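The weighted combination here folds two observables into a single score before thresholding, i.e. a simple linear discriminant. The weights F, G and cutoff c below are placeholders, as the slide does not give numeric values:

```python
def linear_classify(x1, x2, F, G, c):
    """Slide 27's combined rule: project each sample onto the
    weighted score F*x1 - G*x2 and threshold at c."""
    return "resistant" if F * x1 - G * x2 < c else "susceptible"

# Hypothetical weights and cutoff, purely for illustration
r1 = linear_classify(1.0, 2.0, F=1.0, G=1.0, c=0.0)  # score -1 -> "resistant"
r2 = linear_classify(3.0, 1.0, F=1.0, G=1.0, c=0.0)  # score +2 -> "susceptible"
```

A single threshold on the combined score can separate classes that no threshold on x1 or x2 alone separates cleanly.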
• 28. Data Mining: procedure (Validation)
Define criteria and tests to prove model validity:
a) Accuracy
b) Sensitivity vs. Specificity
c) Receiver Operating Characteristic (ROC) plot
d) Cross-validation
• 29. Data Mining: procedure (Validation: accuracy)
[scatter plot in the x1–x2 plane: Resistant (Neg.) vs. Susceptible (Pos.) regions]
Accuracy = (TP + TN) / (TP + TN + FP + FN) = 142 / 154
• 30. Data Mining: procedure (Validation: sensitivity vs. specificity)
[same classification plot: Resistant (Neg.) vs. Susceptible (Pos.)]
Sensitivity = TP / (TP + FN) = 67 / 72
FPR = FP / (TN + FP) = 6 / 81
Note: Specificity = 1 − FPR
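These criteria all follow from the four confusion-matrix counts. A sketch using counts inferred from the fractions on slide 30 (67/72 positives implies FN = 5; 6/81 negatives implies TN = 75):

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard validation criteria from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                   # true-positive rate
    fpr = fp / (tn + fp)                           # false-positive rate
    specificity = 1 - fpr
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, fpr, specificity, accuracy

# Counts inferred from slide 30's fractions
sens, fpr, spec, acc = confusion_metrics(tp=67, fn=5, fp=6, tn=75)
# sens = 67/72 ≈ 0.93, fpr = 6/81 ≈ 0.07
```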
• 31. Data Mining: procedure (Validation: varying model stringency)
[same classification plot with the decision boundary shifted from more to less stringent]
Sensitivity = TP / (TP + FN) = 69 / 72
FPR = FP / (TN + FP) = 19 / 81
Note: Specificity = 1 − FPR
• 32. Data Mining: procedure (Validation: ROC plot)
[ROC plot: Sensitivity vs. FPR]
• 33. Data Mining: procedure (Validation: ROC plot)
[ROC plot: Sensitivity vs. FPR]
Area under the curve is an excellent measure of model performance:
1.0 = perfect model; 0.5 = random
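The ROC area under the curve equals the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one, which gives a compact way to compute it without tracing the curve (the example scores below are invented):

```python
def roc_auc(scores, labels):
    """Rank-based AUC: fraction of (positive, negative) pairs in
    which the positive outscores the negative; ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc_perfect = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])  # 1.0
auc_mixed = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 0, 1, 0])    # 0.75
```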
• 34. Data Mining: procedure (Validation)
Predictions are imperfect due to:
• Imperfect algorithms
• Imperfect data
• 35. Cross-Validation:
• Carefully monitor features that are useful across different independent data subsets
• This can be accomplished with N-fold cross-validation:
[diagram: Trials 1–5, each holding out a different fifth of the data for testing and training on the rest]
Model performance = mean predictive performance over the 5 trials
• The best feature-selection and classification algorithms will yield consistent performance across independent trials
• The best features will be consistently important across trials
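The 5-trial scheme above can be sketched as a generic N-fold splitter; model performance is then the mean test score over the trials (the scoring of each trial is left abstract here):

```python
def n_fold_splits(samples, n=5):
    """Partition samples into n folds; each trial holds one fold out
    for testing and trains on the remaining n-1 folds."""
    folds = [samples[i::n] for i in range(n)]
    for i in range(n):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

# 10 samples, 5 folds: every sample is tested exactly once
splits = list(n_fold_splits(list(range(10)), n=5))
```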
• 36. Data Mining: procedure (Prediction & Iteration)
Analysis is only useful if it is used, and it only improves if it is tested:
a) Good validation requires successful new predictions
b) Imperfect predictions can lead to method refinement and greater understanding
• 37. Questions?
Lushington in Silico
Geraldlushington3117 at aol.com
Geraldlushington.org