Comparative Study of Granger Causality Algorithm for Gene Regulatory Network

INVESTIGATION OF
IMAGE PROCESSING
ALGORITHMS FOR
MEDICAL APPLICATION
Zhafir AglnaTijani
U1120208F
A final year project presentation in partial fulfilment of the
requirements for the degree of Bachelor of Engineering
1

Outline
• Background and Theory
• Implementation
• Result and Discussion
• Conclusion
2

Outline
• Background and Theory
  • Problems and Objectives
  • Gene Regulatory Network
  • Granger Causality
  • 3 Methods of Granger Causality
  • Project Focus
• Implementation
• Result and Discussion
• Conclusion
3

“It is more pragmatic to cure the
cause of disease at its sources
than to handle the actual
diseases”
Gene
4

• The interactions between genes form a Gene Regulatory Network (GRN)
• Discovering this network remains challenging because of the complexity of the network
• Efficient computational tools are required
Objective
To find an effective and efficient means of discovering unknown Gene Regulatory Networks
5

Modelling of GRN
• Nodes and edges
• Depicts the relations between genes
• Obtained from DNA microarray
• Prominent method: Granger Causality
http://img.medicalxpress.com/newman/gfx/news/hires/2013/1-novelnoninva.jpg
6

Granger Causality
• A method for time series analysis
• Uses a Vector Autoregression (VAR) model to calculate causality from time series data
Granger (1969)
[Diagram: time series A and time series B]
$U_t = \sum_{k=1}^{p} A_k U_{t-k} + \varepsilon_t$
$F_{Y \to X} \equiv \ln \frac{|\Sigma'_{xx}|}{|\Sigma_{xx}|}$
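As a rough illustration of the log-ratio statistic on this slide, the sketch below estimates it for two scalar series with a single lag (p = 1). It is not the MVGC toolbox code. The coefficients, the function names `simulate` and `granger_lag1`, and the no-intercept least-squares fit are all assumptions made for the example. With scalar residuals the determinants reduce to plain variances: the numerator comes from the reduced model (own past only), the denominator from the full model (own past plus the other series).

```python
import math
import random


def dot(a, b):
    """Sum of element-wise products of two equal-length lists."""
    return sum(u * v for u, v in zip(a, b))


def simulate(n, seed=0):
    """Hypothetical 2-variable VAR(1) in which y drives x, but x does not drive y."""
    rng = random.Random(seed)
    x, y = [0.0], [0.0]
    for _ in range(n - 1):
        x_next = 0.5 * x[-1] + 0.8 * y[-1] + rng.gauss(0.0, 1.0)
        y_next = 0.5 * y[-1] + rng.gauss(0.0, 1.0)
        x.append(x_next)
        y.append(y_next)
    return x, y


def granger_lag1(source, target):
    """Estimate F_{source -> target} = ln(var_reduced / var_full) for lag p = 1."""
    t_now, t_past, s_past = target[1:], target[:-1], source[:-1]
    n = len(t_now)

    # Reduced model: target_t = a * target_{t-1} + error
    a = dot(t_past, t_now) / dot(t_past, t_past)
    var_reduced = sum((v - a * p) ** 2 for v, p in zip(t_now, t_past)) / n

    # Full model: target_t = b * target_{t-1} + c * source_{t-1} + error,
    # solved through the 2x2 normal equations.
    s_tt, s_ts, s_ss = dot(t_past, t_past), dot(t_past, s_past), dot(s_past, s_past)
    r_t, r_s = dot(t_past, t_now), dot(s_past, t_now)
    det = s_tt * s_ss - s_ts ** 2
    b = (r_t * s_ss - r_s * s_ts) / det
    c = (r_s * s_tt - r_t * s_ts) / det
    var_full = sum(
        (v - b * p - c * q) ** 2 for v, p, q in zip(t_now, t_past, s_past)
    ) / n
    return math.log(var_reduced / var_full)


x, y = simulate(2000, seed=1)
f_y_to_x = granger_lag1(y, x)  # expected to be clearly positive
f_x_to_y = granger_lag1(x, y)  # expected to be close to zero
```

In this toy setup the statistic from y to x should be much larger than the one from x to y, which mirrors the way the data were generated.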
7

Granger Causality
“If past values of A and B can predict the future value of B better than past values of B alone, then time series A Granger-causes time series B.”
Granger (1969)
[Diagram: time series A and time series B]
8

3 Methods of Implementing Granger Causality
• MVGC: Barnett et al. (2013)
• Lasso: Arnold et al. (2007)
• Copula: Liu and Bahadori (2012)
“These 3 methods have been implemented independently, but have never been compared under the same conditions.”
9

Main Focus of the Project
• Comparative study of algorithms
• Focus on the performance of the 3 algorithms
• Finding strengths and weaknesses
• Using control variables and performance metrics
10

Outline
• Background and Theory
• Implementation
  • Control Variables
  • Causality Graph and Matrix
  • Edge Analysis
  • Performance Metrics
  • Data for Analysis
• Result and Discussion
• Conclusion
11

Implementation
Program Flow
Time series input → GC algorithm → Causality matrix and graph → Edge analysis → Data for discussion
• Implementation using MATLAB 2010b
• Based on existing toolboxes:
  • MVGC Toolbox (Barnett, 2013)
  • Lasso Granger
  • Copula Granger (Liu and Bahadori, 2012)
  • GLMnet
12

Implementation
Control Variables: Synthetic Time Series Dataset
• Based on a set of equations
• Linear time series dataset
• Generated by specifying the number of time points
• Advantages:
  • Provides a ground truth network: the actual causality of the time series
  • The ground truth can be compared with the algorithm output to measure the performance of the algorithms
• 2 types of dataset: 3-VAR and 5-VAR time series
• 8 different numbers of time points: 200, 400, 800, 1200, 1600, 2400, 3200, 4000
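A minimal sketch of how such a synthetic dataset could be produced: a small linear VAR(1) system is simulated for a chosen number of time points, and the ground truth matrix is read off the coefficient matrix. The slides do not give the project's equations, so the coefficients in `COEFFS` are hypothetical. The convention that entry (i, j) means "variable j drives variable i" is likewise an assumption.

```python
import random

# Hypothetical VAR(1) coefficients (not the project's equations):
# COEFFS[i][j] is the weight of variable j at time t-1 in the equation
# for variable i at time t.
COEFFS = [
    [0.5, 0.0, 0.0],
    [0.4, 0.3, 0.0],
    [0.0, 0.4, 0.3],
]

TIME_POINTS = [200, 400, 800, 1200, 1600, 2400, 3200, 4000]


def ground_truth(coeffs):
    """Binary matrix: entry (i, j) is 1 when variable j drives variable i (i != j)."""
    n = len(coeffs)
    return [
        [1 if i != j and coeffs[i][j] != 0 else 0 for j in range(n)]
        for i in range(n)
    ]


def generate(coeffs, n_points, seed=None):
    """Simulate the linear system with unit-variance Gaussian noise."""
    rng = random.Random(seed)
    n = len(coeffs)
    state = [0.0] * n
    series = []
    for _ in range(n_points):
        state = [
            sum(coeffs[i][j] * state[j] for j in range(n)) + rng.gauss(0.0, 1.0)
            for i in range(n)
        ]
        series.append(state)
    return series


truth = ground_truth(COEFFS)
datasets = {n: generate(COEFFS, n, seed=n) for n in TIME_POINTS}
```

Because the links are fixed by the equations, the same `truth` matrix serves as the benchmark for every generated series, whatever its length.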
13

3 Granger Causality Algorithms
14

Causality Matrix
• 1 represents: a link exists between the variables
• 0 represents: no link exists
0 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 1
1 0 0 1 0
• The output of a GC algorithm is the causality matrix
• It depicts Granger causality between time series
15

Edge Analysis
• Edge analysis measures the performance of an algorithm by comparing its output with a benchmark
• Benchmark = ground truth
• The output of each algorithm is converted to a binary matrix by masking with a threshold of 0.0001
Ground Truth
0 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 1
1 0 0 1 0
Lasso Method
0 0 1 0 1
1 0 1 1 1
1 1 1 0 1
0 1 0 1 1
1 1 1 0 1
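The binary masking step can be sketched as follows. Only the threshold value 0.0001 comes from the slide. Comparing the absolute value with a strict "greater than" is an assumption, and the `raw` matrix is made-up illustrative output, not a result from the study.

```python
THRESHOLD = 0.0001  # threshold quoted on the slide


def binary_mask(scores, threshold=THRESHOLD):
    """Map each score to 1 if its magnitude exceeds the threshold, else 0."""
    return [[1 if abs(value) > threshold else 0 for value in row] for row in scores]


# Hypothetical raw algorithm output for 3 variables (illustrative values only).
raw = [
    [0.0, 0.00002, 0.31],
    [0.12, 0.0, -0.00005],
    [0.0, 0.27, 0.0],
]

masked = binary_mask(raw)
```

After masking, the output has the same 0/1 form as the ground truth, so the two matrices can be compared cell by cell.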
16

Edge Analysis
• Uses the parameters of the confusion matrix: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN)
Ground Truth
0 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 1
1 0 0 1 0
Lasso Method
0 0 1 0 1
1 0 1 1 1
1 1 1 0 1
0 1 0 1 1
1 1 1 0 1
For the example above:
• TP : 4
• TN : 6
• FP : 13
• FN : 2
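The four counts can be checked with a short cell-by-cell comparison of the two example matrices. This is a sketch with a hypothetical helper name, `confusion_counts`. It counts all 25 cells, diagonal included, since that is what the slide's totals add up to.

```python
GROUND_TRUTH = [
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 1, 0],
]

LASSO = [
    [0, 0, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
]


def confusion_counts(truth, predicted):
    """Return (TP, TN, FP, FN) over every cell of two binary matrices."""
    tp = tn = fp = fn = 0
    for truth_row, predicted_row in zip(truth, predicted):
        for t, p in zip(truth_row, predicted_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 0:
                tn += 1
            elif t == 0 and p == 1:
                fp += 1
            else:
                fn += 1
    return tp, tn, fp, fn


tp, tn, fp, fn = confusion_counts(GROUND_TRUTH, LASSO)
print(tp, tn, fp, fn)  # prints "4 6 13 2"
```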
17

7 Performance Metrics
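$\text{Sensitivity} = \frac{TP}{TP + FN}$
$\text{Specificity} = \frac{TN}{TN + FP}$
$\text{Precision} = \frac{TP}{TP + FP}$
$\text{False Positive Rate} = \frac{FP}{TN + FP}$
$\text{False Discovery Rate} = \frac{FP}{TP + FP}$
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
$\text{F1 Score} = \frac{2TP}{2TP + FP + FN}$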
• Calculated based on the value of TP, TN, FP, and FN
• Used in Past Research in Similar Topic
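As a worked example, the seven formulas can be evaluated on the counts from the Edge Analysis slide (TP = 4, TN = 6, FP = 13, FN = 2). The function name `performance_metrics` is hypothetical. The sketch does not guard against zero denominators, which would need handling if an algorithm returned no links at all.

```python
def performance_metrics(tp, tn, fp, fn):
    """Return the seven edge-analysis metrics as a dictionary of floats."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (tn + fp),
        "false_discovery_rate": fp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1_score": 2 * tp / (2 * tp + fp + fn),
    }


scores = performance_metrics(tp=4, tn=6, fp=13, fn=2)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

For these counts the accuracy is 10/25 = 0.4 and the F1 score is 8/23, roughly 0.348. Specificity and false positive rate always sum to 1, as do precision and false discovery rate.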
18

Data for Analysis
• The result of Granger causality analysis depends on the generated time series
• A few samples are not sufficient, since the generated time series differs each time
• The experiment was repeated 2000 times
• The mean value of each performance metric is the basis for the comparative study
Lasso : 1st Iteration
0 0 1 0 1
1 0 1 1 1
1 1 1 0 1
0 1 0 1 1
1 1 1 0 1
Lasso : 2nd Iteration
0 0 1 0 0
0 0 1 1 0
1 1 1 0 1
0 1 0 1 1
0 1 0 1 1
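The averaging step can be illustrated with the two Lasso iterations shown on this slide. The sketch below scores each iteration against the ground truth matrix from the Edge Analysis slide and takes the mean. Pairing these matrices with that ground truth is an assumption, and only accuracy is computed to keep the example short. The study itself averaged over 2000 iterations and all seven metrics.

```python
GROUND_TRUTH = [
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 1, 0],
]

# The two Lasso outputs shown on the slide (1st and 2nd iteration).
LASSO_RUNS = [
    [
        [0, 0, 1, 0, 1],
        [1, 0, 1, 1, 1],
        [1, 1, 1, 0, 1],
        [0, 1, 0, 1, 1],
        [1, 1, 1, 0, 1],
    ],
    [
        [0, 0, 1, 0, 0],
        [0, 0, 1, 1, 0],
        [1, 1, 1, 0, 1],
        [0, 1, 0, 1, 1],
        [0, 1, 0, 1, 1],
    ],
]


def accuracy(truth, predicted):
    """Fraction of cells on which the two binary matrices agree."""
    cells = [
        (t, p)
        for truth_row, predicted_row in zip(truth, predicted)
        for t, p in zip(truth_row, predicted_row)
    ]
    return sum(1 for t, p in cells if t == p) / len(cells)


per_run = [accuracy(GROUND_TRUTH, run) for run in LASSO_RUNS]
mean_accuracy = sum(per_run) / len(per_run)
```

The two iterations give different scores (10/25 and 12/25 matching cells), which is why a single run is not a reliable basis for comparison.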
19

Outline
• Background and Theory
• Implementation
• Result and Discussion
  • Performance Metrics Scores
  • Specific Result
  • 5-VAR Accuracy
  • 3-VAR and 5-VAR F1 Score
  • Overall Score Result
• Conclusion
20

Scores of Metrics
• Bar charts represent the score of each performance metric for the 3 methods
• X axis : Number of Time Points
• Y axis : Score of Metrics
• 7 Performance Metrics
• 2 Scenarios : 3-VAR and 5-VAR
[Figure: VAR5 F1 Score. Score versus number of time points (200, 400, 800, 1200, 1600, 2400, 3200, 4000) for MVGC, Lasso, and Copula]
21

[Figures: score versus number of time points (200, 400, 800, 1200, 1600, 2400, 3200, 4000) for MVGC, Lasso, and Copula. Panels: VAR5 Specificity, VAR5 Sensitivity, VAR5 Precision, VAR5 False Positive Rate, VAR5 False Discovery Rate, VAR5 Accuracy]
22

[Figure: VAR5 Accuracy. Score versus number of time points (200, 400, 800, 1200, 1600, 2400, 3200, 4000) for MVGC, Lasso, and Copula]
5-VAR Accuracy
• Accuracy
  • Proportion of correct results (true positives and true negatives) among all possible links
• MVGC
  • Increases as the number of time points increases
  • Score range was small (around 0.1)
• Lasso
  • Increases as the number of time points increases
  • Two extreme scores, wide score range
• Copula
  • Performs best at around 400 time points
  • Performs poorly at higher numbers of time points
23

3-VAR and 5-VAR F1 Score
• F1 Score
  • Harmonic mean of precision and recall (sensitivity)
• MVGC
  • Consistent pattern: increases as the number of time points increases
• Lasso
  • Contrasting patterns between 3-VAR and 5-VAR
  • Heavily affected by the number of variables
• Copula
  • Unique pattern
  • Has a certain point or range where performance is best
[Figures: VAR3 F1 Score and VAR5 F1 Score. Score versus number of time points (200, 400, 800, 1200, 1600, 2400, 3200, 4000) for MVGC, Lasso, and Copula]
24

Overall Performance
3-Variable Time Series
Metric type           Best Performance  Average Performance  Worst Performance
Sensitivity           Lasso             Copula               MVGC
Specificity           MVGC              Lasso                Copula
Precision             MVGC              Lasso                Copula
False Positive Rate   MVGC              Lasso                Copula
False Discovery Rate  MVGC              Lasso                Copula
Accuracy              MVGC              Lasso                Copula
F1 Score              MVGC              Lasso                Copula
• Overall performance is based on the average score over all time points
• MVGC outperforms the other two methods in the 3-VAR scenario
• Lasso scores were good at high numbers of time points
• Copula scores were high within a certain range (around 200 to 800 time points), but outside that range they were lower than those of the other methods
25

5-Variable Time Series
Metric type           Best Performance  Average Performance  Worst Performance
Sensitivity           MVGC              Copula               Lasso
Specificity           Lasso             Copula               MVGC
Precision             MVGC              Copula               Lasso
False Positive Rate   Lasso             Copula               MVGC
False Discovery Rate  MVGC              Copula               Lasso
Accuracy              Copula            MVGC                 Lasso
F1 Score              MVGC              Copula               Lasso
• MVGC shows consistency in both the 5-VAR and 3-VAR scenarios
• Copula provides the best accuracy of the three methods, especially at 200 to 800 time points
• Lasso scores highest at high numbers of time points, but its scores at low numbers of time points were low
26

Outline
• Background and Theory
• Implementation
• Result and Discussion
• Conclusion
  • Conclusion
  • Future Works
27

Conclusion
• The 3 methods of GC (MVGC, Lasso, and Copula) can be compared using 7 performance metrics
• MVGC provides consistency under most conditions
• Lasso has an advantage in handling high numbers of time points
• Copula has a certain range in which its performance is best
• Even though the overall scores favour MVGC over the other methods, the results are still conditional
28

Suggestions for Future Work
• Granger causality algorithms for non-linear data
  • Non-linear data provides a better representation of Gene Regulatory Networks
• Application to real datasets
  • Granger causality analysis may be applied to real datasets
• Other algorithms for GRN (Dynamic Bayesian Network)
  • DBN is another prominent method in this topic
29
