2. Covariance
Covariance measures the direction of the relationship
between two variables.
A positive covariance means that both variables tend to be
high or low at the same time.
A negative covariance means that when one variable is
high, the other tends to be low.
3. Covariance vs Correlation Coefficient
Covariance measures the direction of a
relationship between two variables.
Correlation measures the strength of that
relationship.
Both correlation and covariance are positive when
the variables move in the same direction, and
negative when they move in opposite directions.
However, a correlation coefficient must always be
between −1 and +1, with the extreme values
indicating a strong relationship.
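The contrast between the two measures can be illustrated with a small sketch. It assumes the standard definition of the Pearson correlation coefficient (covariance divided by the product of the two standard deviations), which this slide does not state, and it uses made-up height and weight values. The function and variable names are illustrative only.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Population covariance: average product of deviations from the means."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

def correlation(a, b):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(a, b) / sqrt(covariance(a, a) * covariance(b, b))

# Made-up illustrative data (not from the lecture).
height_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weight_kg = [50.0, 58.0, 69.0, 77.0, 91.0]
height_cm = [h * 100 for h in height_m]  # same heights, different unit

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)
r_m = correlation(height_m, weight_kg)
r_cm = correlation(height_cm, weight_kg)

# Covariance changes with the unit of measurement; correlation does not.
print(cov_m, cov_cm)
print(r_m, r_cm)
```

Changing the unit from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged. That is why covariance only tells us the direction, while correlation can be read as a strength.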
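4. Computing Covariance
$$\mathrm{Cov}(A, B) = \frac{\sum_{i=1}^{n} (a_i - \bar{A}) \times (b_i - \bar{B})}{n}$$
where $\bar{A}$ is the average of variable A and $\bar{B}$ is the average of variable B.
$$\mathrm{Cov}(A, B) = \frac{\sum_{i=1}^{n} a_i \times b_i}{n} - \bar{A} \times \bar{B}$$
Recall from Lecture 4:
$$\text{Variance } \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$
$$\text{Variance } \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \mu^2$$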
Note: Variance is a special case of covariance in which the two attributes are identical (the covariance of an
attribute with itself).
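As a quick check of the formulas on this slide, the sketch below computes the covariance both ways (average product of deviations, and average product minus product of averages) and confirms that the covariance of an attribute with itself equals its variance. The two value lists are made-up illustrative numbers and the function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def cov_definition(a, b):
    # Average of (a_i - mean(A)) * (b_i - mean(B)).
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

def cov_shortcut(a, b):
    # Average of a_i * b_i, minus mean(A) * mean(B).
    return sum(x * y for x, y in zip(a, b)) / len(a) - mean(a) * mean(b)

def variance(values):
    # Average squared deviation from the mean (divides by n).
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)

# Made-up illustrative data for variables A and B.
a = [6, 5, 4, 3, 2]
b = [20, 10, 14, 5, 5]

print(cov_definition(a, b))  # both forms give about 7.0 for this data
print(cov_shortcut(a, b))
print(cov_definition(a, a), variance(a))  # covariance with itself = variance
```

Both forms divide by n, matching the formulas above. Many libraries divide by n − 1 by default, so their results can differ slightly on small samples.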
6. Correlation
Correlation is a statistical term describing the degree to
which two variables move in coordination with one
another.
If the two variables move in the same direction, then those
variables are said to have a positive correlation.
If they move in opposite directions, then they have a negative
correlation.
The strength of the correlation is determined by the
correlation coefficient, which varies between −1 and +1.
What if correlation is 0?
7. Correlation Between Two Variables
Weak relationships have
small correlation
values (close to 0)…
Strong relationships have
large correlation
values (close to −1 or +1)…
Correlation is 1 if a
straight line with positive
slope can be drawn
through all data
points
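The straight-line case can be checked numerically. The sketch below assumes the standard Pearson formula and uses made-up points: one set lying exactly on a line with positive slope, one on a line with negative slope, and one that only roughly follows a trend. All names are illustrative.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    co_dev = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    dev_a = sum((x - mean_a) ** 2 for x in a)
    dev_b = sum((y - mean_b) ** 2 for y in b)
    return co_dev / sqrt(dev_a * dev_b)

x = [1, 2, 3, 4, 5]
y_up = [2 * v + 3 for v in x]        # every point on a line with positive slope
y_down = [10 - 0.5 * v for v in x]   # every point on a line with negative slope
y_scattered = [2, 1, 4, 3, 5]        # upward trend, but not a perfect line

print(pearson(x, y_up))         # approximately +1
print(pearson(x, y_down))       # approximately -1
print(pearson(x, y_scattered))  # approximately 0.8
```

The steepness of the line does not matter here. Only how tightly the points follow a line determines how close the coefficient gets to −1 or +1.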
16. Correlation Heatmaps: Complex Examples
We can also create heatmaps for Similarity and
Dissimilarity (Distance) between two
objects/records…
But … What is similarity and dissimilarity
between two objects and how to measure it?
18. Correlation does NOT imply causality…
If A and B are correlated, this does not necessarily imply
that A causes B or that B causes A.
For example, in analyzing a demographic database, we may
find that attributes representing the number of hospitals and
the number of car thefts in a region are correlated.
This does not mean that one causes the other.
Both are actually causally linked to a third attribute, namely,
population.
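The hospital and car-theft example can be imitated with a toy dataset. The numbers below are entirely made up for illustration (they are not real demographic data). Both counts are constructed to grow with population, and the sketch shows that they then correlate strongly with each other even though neither causes the other.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    co_dev = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    dev_a = sum((x - mean_a) ** 2 for x in a)
    dev_b = sum((y - mean_b) ** 2 for y in b)
    return co_dev / sqrt(dev_a * dev_b)

# Hypothetical regions: both counts simply scale with population.
population = [10_000, 50_000, 120_000, 300_000, 800_000]
hospitals = [1, 2, 4, 9, 25]
car_thefts = [12, 70, 150, 410, 1000]

r_hospitals_thefts = pearson(hospitals, car_thefts)
r_population_hospitals = pearson(population, hospitals)
r_population_thefts = pearson(population, car_thefts)

print(r_hospitals_thefts)      # high, although hospitals do not cause thefts
print(r_population_hospitals)  # high: more people, more hospitals
print(r_population_thefts)     # high: more people, more thefts
```

A high coefficient between hospitals and car thefts therefore says nothing about causation by itself. Checking each attribute against a plausible common driver, here population, is one simple way to spot this situation.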
19. What Correlation Tells Us?
Some attributes are redundant
Redundancy here means that an attribute (such as
annual_revenue) may be redundant if it can be “derived”
from another attribute or set of attributes.
Inconsistencies in attribute (or dimension) naming can also
cause redundancies in the resulting dataset.
Some redundancies can be
detected by correlation
analysis.
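One possible way to use correlation analysis for redundancy detection is sketched below. It assumes a tiny in-memory table with a hypothetical monthly_revenue attribute from which annual_revenue is derived, plus an unrelated attribute. The 0.95 cut-off is an arbitrary choice for illustration, not a recommended value.

```python
from itertools import combinations
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    co_dev = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    dev_a = sum((x - mean_a) ** 2 for x in a)
    dev_b = sum((y - mean_b) ** 2 for y in b)
    return co_dev / sqrt(dev_a * dev_b)

def redundant_pairs(table, threshold=0.95):
    """Return (name_a, name_b, r) for attribute pairs with |r| >= threshold."""
    flagged = []
    for name_a, name_b in combinations(table, 2):
        r = pearson(table[name_a], table[name_b])
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Made-up table: annual_revenue is derived as 12 * monthly_revenue.
monthly = [10, 12, 9, 15, 11]
table = {
    "monthly_revenue": monthly,
    "annual_revenue": [12 * m for m in monthly],
    "num_employees": [5, 3, 8, 4, 7],
}

flagged = redundant_pairs(table)
for name_a, name_b, r in flagged:
    print(name_a, name_b, round(r, 3))
```

Only the monthly/annual pair is flagged for this data. A flagged pair is a candidate for removal, not proof of redundancy, so it should still be reviewed before dropping an attribute.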
21. Correlation Analysis
Correlation Analysis:
Given two attributes, how strongly one attribute implies the other,
based on the available data
For Numerical Data:
Correlation coefficient (or Pearson’s Correlation Coefficient)
Covariance
For Nominal Data:
χ² (Chi-Square) test
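For nominal attributes, the chi-square statistic compares observed counts in a contingency table with the counts expected if the two attributes were independent. The sketch below assumes the usual Pearson chi-square formula, sum of (observed − expected)² / expected, and uses a made-up 2×2 table. The row and column meanings are hypothetical.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    return statistic

# Made-up counts. Rows: two values of attribute A; columns: two values of attribute B.
observed = [
    [250, 200],
    [50, 1000],
]

statistic = chi_square(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(statistic, 2), degrees_of_freedom)
```

The larger the statistic relative to the chi-square distribution with the given degrees of freedom, the stronger the evidence that the two nominal attributes are related. The comparison against a critical value is left out of this sketch.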