A Method to facilitate cancer detection and type classification from gene expression using a deep auto-encoder and neural network

A Method to Facilitate Cancer Detection and Type Classification from Gene Expression Data using a Deep Autoencoder and Neural Network
By Xi Chen
March 27, 2019

Gene Expression Data Properties.
• Genes are expressed differently depending on factors such as cell type, environment, and disease condition.
• Gene expression data are widely available due to the increased affordability of sequencing technology.
• Gene expression data are multimodal and high dimensional, with a small number of observations (#rows << #columns).
• Gene expression data can be used for disease detection and classification, and for drug suggestion.

Gene Expression Data With Dimension Reduction
• Use dimension reduction methods, such as PCA, for feature selection, since gene expression data are high dimensional.
• Apply traditional statistical and machine learning methods for applications such as disease detection or classification.
• Problem: the selected features are hard to explain. E.g. each principal component (PC) is a linear combination of the gene expression features.
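The interpretability problem can be illustrated with a minimal NumPy sketch. It is not the author's code: the toy matrix size (6 samples by 20 genes) and the random values are assumptions chosen only to show that a principal component score is a weighted sum over every gene.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression matrix (assumed sizes): 6 samples (rows) x 20 genes (columns).
X = rng.normal(size=(6, 20))

# Center each gene, then take the SVD; the rows of Vt are the principal axes.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Score of every sample on the first principal component.
pc1_scores = Xc @ Vt[0]

# The same score written out as an explicit weighted sum over all genes.
manual = sum(Vt[0, j] * Xc[:, j] for j in range(Xc.shape[1]))

print(np.allclose(pc1_scores, manual))
print(np.count_nonzero(Vt[0]), "of", X.shape[1], "genes carry a nonzero weight in PC1")
```

Because every gene contributes a weight, no single gene can be read off as "the" feature behind a component.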

Proposed Drug Suggestion Scheme.
[Figure: 2D gene expression representation; axes: Feature 1, Feature 2; legend: drug sensitivity for Drug A, Drug B, Drug C, Drug D]
Cluster approaches:
• K-means
• Gaussian Mixture Models
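As a rough illustration of the clustering step, the sketch below runs plain k-means (Lloyd's algorithm) on synthetic 2D points. Everything here is an assumption made for illustration: the four synthetic groups, the farthest-point initialization, and the hypothetical helper name `kmeans`. It does not use real expression data, it does not cover Gaussian Mixture Models, and it makes no claim about which drug a cluster would map to.

```python
import numpy as np

def kmeans(points, k, n_iter=50):
    """Plain Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen center.
        diffs = points[:, None] - np.array(centers)[None]
        nearest = np.min(np.linalg.norm(diffs, axis=2), axis=1)
        centers.append(points[np.argmax(nearest)])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        # Assignment step: each point joins its closest center.
        dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: each center moves to the mean of its points.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Synthetic stand-in for a 2D gene expression representation (assumed data).
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0], [5.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(25, 2)) for c in blob_centers])

labels, centers = kmeans(points, k=4)
print(np.bincount(labels, minlength=4))
```

In the proposed scheme, each resulting cluster would then be associated with a drug sensitivity profile, which requires drug response data for the samples.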

Problem: Current Gene Expression Data Don’t Include Drug Results.
• Most gene expression data aren’t associated with well-documented medical records.
• Available records often lack drug information and patient disease outcomes.

Solve The Harder Classification Problem First, Then Infer That The Cluster Approach Works
• In general, a classification problem is similar to a clustering problem, e.g. the k-Nearest Neighbors algorithm.
• If we can achieve highly accurate classification results using gene expression data, clustering gene expression data may also work for drug suggestion.
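The link between classification and clustering can be seen in a minimal k-Nearest Neighbors sketch: a sample is classified purely by its proximity to labeled samples, the same notion of distance that clustering relies on. The 2D points, the labels "normal" and "tumor", and the helper name `knn_predict` are hypothetical and chosen only for illustration.

```python
from collections import Counter

import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    dists = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2D features for six labeled samples.
train_x = np.array([
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.3],   # first group
    [3.0, 3.1], [3.2, 2.9], [2.8, 3.0],   # second group
])
train_y = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

print(knn_predict(train_x, train_y, np.array([0.1, 0.1])))  # prints "normal"
print(knn_predict(train_x, train_y, np.array([3.0, 3.0])))  # prints "tumor"
```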

Data Processing
[Figure: data processing pipeline; values shown: 60,483 and 14,157]

Autoencoder For Feature Learning
Minimize f(Input − Output)

Training Autoencoder
Model configuration (hidden layer sizes):
• 1st hidden layer: 100
• 2nd hidden layer: 50
• 3rd hidden layer: 25
• 4th hidden layer: 50
• 5th hidden layer: 100
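The 100-50-25-50-100 configuration can be sketched as an untrained forward pass in NumPy. Only the hidden layer sizes come from the slide. The input size of 200, the tanh hidden activations, the linear output layer, and mean squared error as the choice of f are all assumptions, and no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 200                        # toy input size (assumed, not the real gene count)
hidden_sizes = [100, 50, 25, 50, 100]   # hidden layer sizes from the slide
sizes = [n_features] + hidden_sizes + [n_features]

# One (weights, bias) pair per layer, randomly initialized and untrained.
params = [
    (rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

def forward(x, params):
    """Return the reconstruction and the list of hidden activations."""
    hidden = []
    h = x
    for i, (w, b) in enumerate(params):
        z = h @ w + b
        if i == len(params) - 1:
            h = z                 # assumed: linear output layer
        else:
            h = np.tanh(z)        # assumed: tanh hidden activation
            hidden.append(h)
    return h, hidden

x = rng.normal(size=(8, n_features))    # batch of 8 toy samples
recon, hidden = forward(x, params)

code = hidden[2]                        # 25-unit bottleneck: the learned features
loss = np.mean((x - recon) ** 2)        # assumed f: mean squared error

print(code.shape, recon.shape, float(loss))
```

Training would adjust `params` to minimize `loss`; the 25-dimensional `code` is what a downstream classifier would consume.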

Learned Feature + Neural Network

Single Type Classification
Lung cancer: abundant and balanced data

Why Not PCA?
• PCA is a descriptive model.
• Each component is a linear combination of all the features.
• The components are hard to explain.

Multiple Type Classification

Cancer Type Acronym    Full Name
LGG                    Lower Grade Glioma
UVM                    Uveal Melanoma
LUSC                   Lung Squamous Cell Carcinoma
GBM                    Glioblastoma Multiforme

• Misclassifications are due to small sample size.
• Misclassifications are sparse, which suggests potential for clustering.

Conclusion
• The autoencoder automatically generates feature representations, thus addressing the very high dimensionality of gene expression data.
• The extracted feature vector captures the non-linearity of the data.
• This approach scales to new data after training, and it can generalize to multi-class classification of different types of cancer.
• We have demonstrated the high accuracy and low false negative and false positive rates (FNR/FPR) of this method for the majority of the abundant cancer types, and its potential for handling sub-classification within certain cancers and identifying metastatic cancers.
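For reference, FNR and FPR for one cancer type can be computed one-vs-rest from true and predicted labels, as in the sketch below. The label lists are made-up illustration values, not results from this work, and `fnr_fpr` is a hypothetical helper name.

```python
def fnr_fpr(y_true, y_pred, positive):
    """One-vs-rest false negative rate and false positive rate for one class."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    fnr = fn / (fn + tp) if (fn + tp) else 0.0   # missed positives / all positives
    fpr = fp / (fp + tn) if (fp + tn) else 0.0   # false alarms / all negatives
    return fnr, fpr

# Made-up labels for illustration only.
y_true = ["LGG", "LGG", "LGG", "LGG", "GBM", "GBM", "GBM", "GBM"]
y_pred = ["LGG", "LGG", "LGG", "GBM", "GBM", "GBM", "GBM", "LGG"]

print(fnr_fpr(y_true, y_pred, positive="LGG"))  # prints (0.25, 0.25)
```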

Other Projects: Deep Learning Behind The Scenes
• Almost all machine learning applications use a similar approach: feature engineering + deep learning.
• E.g. self-driving cars = CNN + DNN
• Feature engineering → CNN
• Deep learning training → DNN
• Deployment

Thank you so much!
Questions?
