Basic course on computer-based methods

Intereg Project
Biomedical Informatics
Ljiljana Majnarić Trtica
II. Basic course on computer-based methods

I. Data Mining
 DM is defined as “the process of seeking interesting or valuable information
(patterns) within the large databases”
 At first glance, this definition seems more like a new name for statistics
 However, DM is actually performed on sets of data that are far larger than
statistical methods can accurately analyze

Data Mining methods
 DM involves methods at the intersection of artificial intelligence, machine
learning, statistics and database systems
 Sometimes, these methods support dimensionality reduction, by mapping the data onto a set of
maximally informative dimensions
 Sometimes, they represent definite mathematical models
 Often, a combination of methods is used for problem solving

Data Mining methods
 Essentially, patterns are defined relative to the overall model of the data set from which they are
derived
 There are many tools involved in data mining that help find these structures
 Some of the most important tools include
 Clustering - the act of partitioning data sets of many random items into subsets of smaller size
that show commonality between them - by looking at such clusters, analysts are able to extract
statistical models from the data fields
 Regression - the method of fitting a curve through a set of points using some goodness-of-fit
criterion - while examining predefined goodness-of-fit parameters - analysts can locate and
describe patterns
 Rule extraction - the method of using relationships between variables to establish some sort of
rule
 Data visualization - a technique that can help us to explain (understand) trends and
complexity in data much more easily

Data Mining methods
most commonly used in health science
 Logistic Regression (LR)
 Support Vector Machine (SVM)
 Apriori and other association rule mining (AR)
 Decision Tree algorithms (DT)
 Clustering and classification algorithms: K-means and SOM (Self-organizing Map) for clustering, Naive Bayes for classification
 Artificial Neural Networks (ANN)

A combination of techniques can, however, elicit a particular mining function
Techniques | Utility
Apriori & FP Growth | Association rule mining for finding frequent item sets (e.g. diseases) in medical databases
ANN & Genetic algorithm | Extracting patterns; detecting trends; classification
Decision Tree algorithms (ID3, C4, C5, CART) | Decision support; classification
Combined use of K-means, SOM & Naive Bayes | Accurate classification
Combination of SVM, ANN & ID3 | Classification

Logistic Regression (LR)
 A popular method for classifying individuals, given the values of a set of explanatory
variables
 Will a subject develop diabetes ?
 Will a subject respond to a treatment ?
 It estimates the probability that an individual is in a particular group
 LR does not make any assumptions of normality, linearity and homogeneity of variance
for the independent variables

Fig. 1. Logistic regression curve
 Value produced by logistic regression is a probability value between 0.0 and 1.0
 If the probability for group membership in the modeled category is above some cut point (the
default is 0.50) - the subject is predicted to be a member of the modeled group
 If the probability is below the cut point - the subject is predicted to be a member of the other
group
(Fig. 1 axes: horizontal axis from -7.5 to 7.5, vertical axis probability from 0 to 1)
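The cut-point rule described for Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the function names `logistic` and `predict_group` are assumptions, and the input `z` stands for the linear score produced by a fitted model.

```python
import math

def logistic(z):
    """Map a linear score z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_group(z, cut_point=0.5):
    """Return 1 (modeled group) if the probability is above the cut point, else 0."""
    return 1 if logistic(z) > cut_point else 0

# Evaluate the curve at a few scores within the range shown on the figure axis.
for z in (-7.5, -2.5, 0.0, 2.5, 7.5):
    print(z, round(logistic(z), 3), predict_group(z))
```

Raising or lowering `cut_point` trades sensitivity against specificity, which is why the cut-off-dependent measures are reported separately from the probability-based ones.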

Testing the LR model performances (a fit to a series of data)
 Testing the models depending on the probability p
 ROC curve
 C statistics
 GINI coefficient
 KS test
 Testing the models depending on the cut-off values
 Sensitivity (true positive rate)
 Specificity (true negative rate)
 Accuracy
 Type I error (misclassification of diabetic)
 Type II error (misclassification of healthy)
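As a rough sketch of how these measures are computed, the Python below derives sensitivity, specificity and accuracy at a cut-off, and the C statistic (area under the ROC curve) with the Gini coefficient obtained as 2 x C - 1. The labels and probabilities are made-up illustrative values, and the function names are assumptions.

```python
def cutoff_metrics(y_true, probs, cutoff=0.5):
    """Sensitivity, specificity and accuracy at a given cut-off value."""
    y_pred = [1 if p > cutoff else 0 for p in probs]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "accuracy": (tp + tn) / len(y_true),
    }

def c_statistic(y_true, probs):
    """Share of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [p for t, p in zip(y_true, probs) if t == 1]
    neg = [p for t, p in zip(y_true, probs) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data: 1 = diabetic, 0 = healthy, with model probabilities.
y_true = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]

metrics = cutoff_metrics(y_true, probs)
auc = c_statistic(y_true, probs)
gini = 2 * auc - 1
print(metrics, auc, gini)
```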

Linear vs Logistic regression model
 In linear regression - the outcome (dependent variable) is continuous - it can have any
of an infinite number of possible values.
 In logistic regression - the outcome (dependent variable) has only a limited number of
possible values - it is used when the response variable is categorical in nature
 The logistic model is unavoidable if it fits the data much better than the linear model
 In many situations - the linear model fits just as well, or almost as well as the logistic
model
 In fact, in many situations, the linear and logistic model give results that are practically
indistinguishable

Fig. 2. Linear vs Logistic regression model
The linear model assumes that the probability p is a linear function of the regressors
The logistic model assumes that the log of the odds p/(1-p) is a linear function of the regressors
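Written out, the two assumptions look as follows. The symbols are notation assumed here for illustration: $x_1,\dots,x_k$ are the regressors and $\beta_0,\dots,\beta_k$ the coefficients.

```latex
% Linear model: the probability itself is linear in the regressors
p = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k

% Logistic model: the log of the odds is linear in the regressors
\ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
```

The linear form can produce values of $p$ outside the interval from 0 to 1, while the logistic form cannot.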

Support Vector Machine
 Supervised ML method
 For classification and regression challenges (mostly for classification)
 The algorithm is based on the following principle:
 Each data item is plotted as a point in n-dimensional space (n = number of features the
data item has) with the value of each feature being the value of a particular coordinate
 Then, classification is performed - by finding the hyper-plane that best separates the two
classes
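The idea of finding a separating hyper-plane can be sketched without any library. The code below is a simplified linear SVM trained by sub-gradient descent on the hinge loss, which is one common way to fit such a model and not necessarily the variant used in the course. The data points, learning rate `lr` and regularisation `lam` are illustrative assumptions.

```python
def train_linear_svm(points, labels, lr=0.1, lam=0.01, epochs=2000):
    """Fit w and b so that sign(w.x + b) separates labels +1 and -1."""
    n_features = len(points[0])
    n = len(points)
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]      # gradient of the L2 penalty
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:                   # point inside the margin: hinge loss is active
                for j in range(n_features):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def classify(w, b, x):
    """Return the side of the hyper-plane on which point x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical two-feature data: each item is a point in 2-dimensional space.
points = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(points, labels)
print(w, b, [classify(w, b, p) for p in points])
```

In practice a tested library implementation would normally be preferred. The sketch only shows that the classifier is fully described by the hyper-plane parameters `w` and `b`.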

Supervised ML vs Unsupervised ML
Supervised ML
The major part of practical ML uses supervised learning
When there are input variables (X) and an output variable (Y) - an algorithm is used to
learn the mapping function from the input to the output: Y = f(X)
The goal is to approximate the mapping function so well that when you have new
input data (X) - you can predict the output variables (Y) for that data
It is called supervised learning because the process of an algorithm learning from the
training dataset can be thought of as a teacher supervising the learning process.
We know the correct answers, the algorithm iteratively makes predictions on the
training data and is corrected by the teacher
Learning stops when the algorithm achieves an acceptable level of performance
Supervised learning problems can be grouped into regression and classification
problems
Classification - when the output variable is a category, such as “disease” and “no
disease”
Regression - when the output variable is a real value, such as “weight”
Usual methods of Supervised ML are:
Linear regression - for regression problems
Random forest - for classification and regression problems
Support vector machines - for classification problems
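A minimal supervised example of learning Y = f(X) is a straight line fitted by least squares. The sketch below uses made-up numbers, and the names `fit_line` and `f` are assumptions. The "teacher" is simply the set of known outputs in the training data.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data with known outputs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)

def f(x):
    """Learned mapping function used to predict the output for new input."""
    return slope * x + intercept

print(slope, intercept, f(5.0))
```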
Unsupervised ML
When there are only input data (X) and no corresponding
output variables
The goal is to model the underlying structure or
distribution in the data - in order to learn more about the
data
It is called unsupervised learning because unlike supervised
learning - there is no known answer and there is no teacher
Algorithms are left to their own devices to discover and
present the interesting structure in the data
Unsupervised learning problems can be grouped into
clustering and association problems
Clustering - when the problem is to discover the inherent
groupings in the data, such as grouping by purchasing
behavior
Association - when the problem is to discover rules that
describe large portions of your data
Usual methods of Unsupervised ML are:
k-means - for clustering problems
Apriori algorithm - for association rule learning problems
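For the clustering case, k-means can be sketched as follows. This is a simplified version with assumptions stated up front: the initial centroids are just the first k points (real implementations usually pick them more carefully), the number of iterations is fixed, and the data are invented.

```python
import math

def k_means(points, k, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    centroids = [points[i] for i in range(k)]   # simplistic initialisation
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                          # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Hypothetical input data only (X); no output variable is provided.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
labels, centroids = k_means(points, k=2)
print(labels, centroids)
```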

Apriori algorithm (AA)
and other Association Rule Mining (ARM)
 ARM - a technique to uncover how items are associated with each other
 AA - mining association rules between frequent sets of items in large databases (Fig. 3.)
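A compact sketch of the Apriori idea is shown below: frequent item sets are built level by level, and a candidate is kept only if all of its smaller subsets are frequent. The patient records, the `min_support` threshold and the function name are illustrative assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return a dict mapping each frequent item set to its support."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-item sets into k-item candidates ...
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # ... and prune candidates that contain an infrequent subset.
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

# Hypothetical diagnoses recorded per patient.
transactions = [
    {"diabetes", "hypertension", "obesity"},
    {"diabetes", "hypertension"},
    {"hypertension", "obesity"},
    {"diabetes", "hypertension", "obesity"},
    {"asthma"},
]

frequent = apriori(transactions, min_support=0.4)

# Confidence of the rule diabetes -> hypertension.
confidence = (frequent[frozenset({"diabetes", "hypertension"})]
              / frequent[frozenset({"diabetes"})])
print(len(frequent), confidence)
```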

Decision Tree (DT) algorithms
 Belong to the supervised learning algorithms
 For classification and regression problems
 The DT algorithm tries to solve the problem by using tree representation (Fig. 4.)
 A flow-chart-like structure (Fig. )
 Each internal node denotes a test on an attribute
 Each branch represents the outcome of a test
 Each leaf (a terminal node) holds a class label
 The topmost node in a tree is the root node
 There are many specific decision-tree algorithms
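The tree representation can be sketched as nested nodes that are walked from the root to a leaf. The tree below is hand-built for illustration only: the attributes, thresholds and class labels are hypothetical and not clinical guidance, and a real DT algorithm would learn them from data.

```python
# Internal nodes test an attribute; leaves hold a class label.
tree = {
    "attribute": "fasting_glucose", "threshold": 7.0,      # root node
    "yes": {"label": "diabetes"},
    "no": {
        "attribute": "bmi", "threshold": 30.0,             # internal node
        "yes": {"label": "at risk"},
        "no": {"label": "healthy"},
    },
}

def classify_with_tree(node, record):
    """Follow the branch chosen by each test until a leaf is reached."""
    while "label" not in node:
        passed = record[node["attribute"]] >= node["threshold"]
        node = node["yes"] if passed else node["no"]
    return node["label"]

print(classify_with_tree(tree, {"fasting_glucose": 7.8, "bmi": 24.0}))
print(classify_with_tree(tree, {"fasting_glucose": 5.2, "bmi": 33.0}))
print(classify_with_tree(tree, {"fasting_glucose": 5.0, "bmi": 22.0}))
```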

Fig. 4. The DT algorithm simulates the branching logic of a tree

Fig.5. DT-based classification results
(the personal archive)

Artificial Neural Networks (ANN)
 A method of artificial intelligence inspired by and structured according to the human brain
 It is a ML & DM method - a method that learns from examples
 Uses retrospective data
 It can be used for prediction, classification and pattern recognition (e.g. association problems)
 Prediction - a numeric value is predicted as the output (e.g. blood pressure, age etc.) and MSE
or RMSE error is used as the evaluation measure of model performance
 Classification - cases are assigned into two or more categories of the output (e.g.
presence/absence of a disease, treatment outcome, etc.) and classification rate is used as the
evaluation measure of model performance
 ANNs have shown success in modelling real world situations, so they can be used both for
research purposes and in practice as a decision support or a simulation tool
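The evaluation measures named above can be computed as in this short sketch. The blood pressure values and class labels are invented, and the function names are assumptions.

```python
import math

def mse(actual, predicted):
    """Mean squared error for numeric prediction."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the output."""
    return math.sqrt(mse(actual, predicted))

def classification_rate(actual, predicted):
    """Share of cases assigned to the correct category."""
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)

# Hypothetical prediction task (e.g. blood pressure).
bp_actual = [120.0, 130.0, 140.0]
bp_predicted = [118.0, 133.0, 141.0]
print(mse(bp_actual, bp_predicted), rmse(bp_actual, bp_predicted))

# Hypothetical classification task (presence/absence of a disease).
dx_actual = ["yes", "no", "yes", "no"]
dx_predicted = ["yes", "no", "no", "no"]
print(classification_rate(dx_actual, dx_predicted))
```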

Biological vs Artificial Neural Network
(Fig. 6.)
 Biological neural network - consists of mutually connected biological neurons
 A biological neuron - a cell that receives information from other neurons through dendrites, processes it
and sends impulses through the axon and synapses to other neurons in the network
 Learning - is performed by changing the weights of synaptic connections - millions of neurons
can process information in parallel
 Artificial neural network
 An artificial neuron - a processing unit (variable) that receives weighted input from other variables,
transforms the input according to a formula and sends the output to other variables
 Learning - is performed by changing the weight values of variables (weights wji are the factors by
which the inputs are multiplied)
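A single artificial neuron can be sketched as a weighted sum passed through a transfer function, with learning done by adjusting the weights. The sigmoid transfer function, the delta-rule update, the learning rate and the logical OR training data are all assumptions chosen for illustration. The slides do not specify them.

```python
import math

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs, transformed by a sigmoid transfer function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(samples, targets, lr=0.5, epochs=5000):
    """Learn by repeatedly nudging the weights to reduce the output error."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            out = neuron_output(x, weights, bias)
            delta = (t - out) * out * (1.0 - out)   # error times sigmoid slope
            weights = [w + lr * delta * xi for w, xi in zip(weights, x)]
            bias += lr * delta
    return weights, bias

# Hypothetical supervised task: the logical OR of two inputs.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 1, 1, 1]

weights, bias = train_neuron(samples, targets)
predictions = [round(neuron_output(x, weights, bias)) for x in samples]
print(weights, bias, predictions)
```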

Fig. 6. - Biological vs artificial NN

Fig. 7. - Generalization ability of the ANN model needs to be tested
 It does not rely on results obtained on a single sample - many learning iterations
on the training set take place within the middle (hidden) layer, which lies between
the input and output layers

Criteria for distinguishing ANN algorithms
 Number of layers
 Type of learning
• Supervised - real output values are known from the past and provided in the dataset
• Unsupervised - real output values are not known, and not provided in the dataset, these networks are used
to cluster data in groups by characteristics
 Type of connections among neurons
 Connection among input and output data
 Input and transfer functions
 Time characteristics
 Learning time
 etc.

II. Modern computer-based methods
 Graph-based DM
 Data Visualization and Visual Analytics
 Topological DM
 Similar techniques that can be used to organize highly complex and heterogeneous
data
 Data can be very powerful, if you can actually understand what it's telling you
 It's not easy to get clear takeaways by looking at a slew of numbers and stats - you
need the data presented in a logical, easy-to-understand way - that is where
these techniques come in

Graph-based DM
 In order to apply graph-based data mining techniques, such as classification and
clustering - it is necessary to define proximity measures between data represented in the
graph form (Fig. 8. and 9.)
 There are several within-graph proximity measures
 Hyperlink-Induced Topic Search (HITS)
 The Neumann Kernel (NK)
 Shared Nearest Neighbor (SNN)

Fig. 8. - Defining proximity measures makes structure visible
Scatter plots showing the similarity from -1 to 1

Fig. 9. - Citation graph by using NK-proximity measures
- n1…n8 vertices (articles)
- edges indicate a citation
A citation matrix C can be formed - if an edge exists between two vertices,
the matrix cell = 1, otherwise 0
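Building the citation matrix can be sketched as below. The edges are hypothetical, since the actual graph of Fig. 9 is not reproduced in the text. The co-citation and bibliographic coupling matrices derived from C are shown as an assumption about the kind of input on which proximity measures such as the Neumann Kernel are typically built.

```python
vertices = [f"n{i}" for i in range(1, 9)]            # articles n1 ... n8
index = {v: i for i, v in enumerate(vertices)}

# Hypothetical citation edges: (citing article, cited article).
edges = [("n1", "n3"), ("n2", "n3"), ("n1", "n4"), ("n2", "n4"), ("n5", "n6")]

# Citation matrix C: cell = 1 if an edge exists, otherwise 0.
C = [[0] * len(vertices) for _ in vertices]
for citing, cited in edges:
    C[index[citing]][index[cited]] = 1

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

# Entry (i, j): number of articles that cite both i and j.
cocitation = matmul(transpose(C), C)
# Entry (i, j): number of articles cited by both i and j.
coupling = matmul(C, transpose(C))

print(cocitation[index["n3"]][index["n4"]], coupling[index["n1"]][index["n2"]])
```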

Fig. 10. - How to generalize mathematically
the pattern of a dalmatian dog?

Data Visualization
 The human brain processes visual information better than it processes text - so by
using charts, graphs and design elements - data visualization can help us to explain
(understand) trends and stats much more easily (Fig. 10.)
Fig. 10. - The structure of population by age - a commonly used data
visualisation procedure in the public health domain

Data visualization
 The samples of data being mined are so vast that scatter plots and histograms will
often fall short of representing any information of realistic value (Fig. 11.)
 For that very reason, the analysts concerned with data mining are constantly looking
for better ways to graphically represent data
 No matter what tools analysts have at their fingertips - the patterns and models
being mined will only be as good in quality as the data they are derived from

Fig. 11. - Making a graph simpler and easier to understand

Application domains of Data Visualization and Visual Analytics
techniques
 Visualization of large, complex, multivariate, biological networks
 Visual text analytics to classify relevant related work on biological entities in
publication databases (e.g. PubMed)
 Visualization for exploring heterogeneous data
and data from multiple data sources
 Visual analytics as support for understanding uncertainty
and data quality issues

Fig. 12. - Complex data visual analytics computer-based tool
(the personal archive)

Fig. 13. - First visualization of the human
Protein-Protein-Interaction structure

Topological DM
 Applying topological techniques to DM and KDD is a promising area for future
research.
 Topology has its roots in theoretical mathematics, but within the last decade
computational topology has rapidly gained interest among computer scientists.
 It is a study of abstract shapes and spaces and mappings between them. It originated
from the study of geometry and set theory.
 Topological methods can be applied to data represented by point clouds, that is, finite
subsets of the n-dimensional Euclidean space.
 The input is presented with a sample of some unknown space which one wishes to
reconstruct and understand.
 Distinguishing between the ambient (embedding) dimension n, and the intrinsic
dimension of the data is of primary interest towards understanding the intrinsic
structure of data.

Topological DM
 Geometrical and topological methods are tools allowing us to analyse highly complex data
 Modern data science uses topological methods to find the structural features of data sets before
further supervised or unsupervised analysis
 Mathematical formalism, which has been developed for incorporating geometric and topological
techniques, deals with point cloud data sets, i.e. finite sets of points
 The point clouds are finite samples taken from a geometric object
 Tools from the various branches of geometry and topology are then used to study the point
cloud data sets
 Topology provides a formal language for qualitative mathematics, whereas geometry is mainly
quantitative.
 Topology studies the relationships of proximity or nearness, whereas geometry can be regarded as
the study of distance functions
 These methods create a summary or compressed representation of all of the data features to
help to rapidly uncover particular patterns and relationships in data.
 The idea of constructing summaries of entire domains of attributes involves understanding the
relationship between topological and geometric objects constructed from data using various
features
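One of the simplest topological summaries of a point cloud is the number of connected pieces it falls into when points closer than a chosen scale are linked. The sketch below counts these pieces with a union-find structure. The point cloud, the scale `eps` and the function name are illustrative assumptions, and this covers only the most basic structural feature, not the full machinery of computational topology.

```python
import math

def count_components(points, eps):
    """Number of connected pieces when points within distance eps are linked."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)        # merge the two pieces
    return len({find(i) for i in range(len(points))})

# Hypothetical point cloud: a finite sample from two separate shapes in the plane.
cloud = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]

for eps in (0.5, 1.5, 10.0):
    print(eps, count_components(cloud, eps))
```

Watching how the count changes as `eps` grows gives a qualitative picture of the data that does not depend on the exact distances, which is the sense in which topology is called qualitative above.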

Topological DM
Fig. 14. - Forming the computational structure (below) from the shape which one wishes to
reconstruct and understand (above)
