Fall 2023
Instructor: Ajit Rajwade
Descriptive Statistics
1
Topic Overview
⚫Some important terminology
⚫Methods of data representation: frequency tables,
graphs, pie-charts, scatter-plots
⚫Data mean, median, mode, quantiles
⚫Chebyshev’s inequality
⚫Correlation coefficient
2
Terminology
⚫ Population: the collection of all elements which we wish to study. Example: in a study of the occurrence of tuberculosis all over the world, the population is the set of all people in the world.
⚫ The population is often too large to examine in full.
⚫ So we study a subset of the population, called a sample.
⚫ In an experiment, we collect values for attributes of each member of the sample. Each member is also called a sample point.
⚫ An example of a relevant attribute in the tuberculosis study would be whether or not the patient yielded a positive result on the serum TB Gold test.
⚫ See http://www.who.int/tb/publications/global_report/en/ for more information.
3
Terminology
⚫Discrete data: Data whose values are restricted to a
finite or countably infinite set. Eg: letter grades at
IITB, genders, marital status (single, married,
divorced), income brackets in India for tax purposes
⚫Continuous data: Data whose values belong to an
uncountably infinite set (Eg: a person’s height,
temperature of a place, speed of a car at a time instant).
4
Methods of Data
Representation/Visualization
5
Frequency Tables
⚫For discrete data having a relatively small number of
values, one can use a frequency table.
⚫Each row of the table lists the data value followed by
the number of sample points with that value (frequency
of that value).
⚫The values need not always be numeric!
Grade   Number of students for that grade (total 100)
AA      100
AB      0
BB      0
BC      0
CC      0
The definition of an ideal course (from the student's perspective) at IITB ;-)
6
Frequency Tables
⚫The frequency table can be visualized using a line
graph or a bar graph or a frequency polygon.
Grade Number of students
AA 5
AB 10
BB 30
BC 35
CC 20
A bar graph plots the distinct
data values on the X axis and
their frequency on the Y axis by
means of the height of a thick
vertical bar!
7
A line diagram plots the distinct data values on the X axis and their
frequency on the Y axis by means of the height of a vertical line!
8
A frequency polygon plots the frequency of each data value on the Y
axis, and connects consecutive plotted points by means of a line.
9
Relative frequency tables
⚫Sometimes the actual frequencies are not important.
⚫We may be interested only in the percentage or
fraction of those frequencies for each data value – i.e.
relative frequencies.
Grade Fraction of students
AA 0.05
AB 0.10
BB 0.30
BC 0.35
CC 0.20
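A minimal sketch of how the frequency and relative frequency tables could be built from raw data. The list `grades` is a hypothetical sample constructed to match the grade table used on these slides; it is not data from the course.

```python
from collections import Counter

# Hypothetical sample: letter grades of 100 students, built to match the table.
grades = ["AA"] * 5 + ["AB"] * 10 + ["BB"] * 30 + ["BC"] * 35 + ["CC"] * 20

frequency = Counter(grades)          # data value -> number of sample points
n = len(grades)
relative_frequency = {g: c / n for g, c in frequency.items()}

for grade in ["AA", "AB", "BB", "BC", "CC"]:
    print(grade, frequency[grade], relative_frequency[grade])
```

The data values here are strings, which illustrates that a frequency table does not need numeric values.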
10
Pie charts
⚫For a small number of distinct data values which are
non-numerical, one can use a pie-chart (it can also be
used for numerical values).
⚫It consists of a circle divided into sectors
corresponding to each data value.
⚫The area of each sector is proportional to the relative frequency for that data value.
Example: population of native English speakers, see
https://en.wikipedia.org/wiki/Pie_chart
11
Pie charts can be confusing
Avoid pie charts when there are too many categories.
http://stephenturbek.com/articles/2009/06/better-charts-from-simple-questions.html
12
Dealing with continuous data
⚫Often the data take continuous values (eg: temperature of a place at a time instant, speed of a car at a given time instant, weight or height of an animal, etc.)
⚫In such cases, the range of data values is divided into intervals called bins.
⚫The frequency now refers to the number of sample
points falling into each bin.
⚫The bins are often taken to be of equal length, though
that is not strictly necessary.
13
Dealing with continuous data
⚫Let the sample points be {xi}, 1 <= i <= N.
⚫Let there be some K (K << N) bins, where the jth bin
has interval [aj,bj).
⚫Thus the frequency fj for the jth bin is defined as follows:
$$f_j = \left|\{\, i : a_j \le x_i < b_j \,\}\right|, \quad 1 \le j \le K$$
⚫Such frequency tables are also called histograms and
they can also be used to store relative frequency
instead of frequency.
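The binning rule can be sketched as follows, assuming each bin is the half-open interval [aj, bj) described above. The function name `histogram`, the temperature readings and the bin edges are all hypothetical, chosen only for illustration.

```python
def histogram(samples, edges):
    """Count samples per bin; bin j is the half-open interval [edges[j], edges[j+1])."""
    counts = [0] * (len(edges) - 1)
    for x in samples:
        for j in range(len(counts)):
            if edges[j] <= x < edges[j + 1]:
                counts[j] += 1
                break  # a sample falls into at most one bin
    return counts

# Hypothetical temperature readings (degrees Celsius) and equal-width bins.
temps = [21.5, 22.0, 23.4, 25.0, 25.1, 27.9, 28.0, 29.99]
edges = [20, 22, 24, 26, 28, 30]               # K = 5 bins of width 2
freq = histogram(temps, edges)
rel_freq = [f / len(temps) for f in freq]      # relative-frequency version
print(freq)
```

Note that 22.0 lands in the second bin, not the first, because the right end of each interval is excluded.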
14
Example of a histogram: in image
processing
⚫ A grayscale image is a 2D array of size (say) H x W.
⚫ Each entry of this array is called a pixel and is indexed as
(x,y) where x is the column index and y is the row index.
⚫ At each pixel, we have an intensity value which tells us
how bright the pixel is (smaller values = darker shades,
larger value = brighter shades).
⚫ Commonly, pixel values in grayscale photographic images are 8-bit (ranging from 0 to 255).
⚫ Histograms are widely used in image processing – in fact a
histogram is often used in image retrieval (eg: finding
images from the web that are most similar to a query
image).
15
Example: histogram of the
well-known “barbara
image”, using bins of
length 10. This image has
values from 0 to 255 and
hence there are 26 bins.
16
The histogram binning problem
⚫ If you have too few bins (each bin is very wide), the histogram tells you very little about the data distribution.
⚫ Extreme: only one bin to represent all intensities in an image.
⚫ If you have too many bins (each bin is very narrow), then very few points fall into each bin. Again the histogram tells you very little about the data distribution.
⚫ Extreme: for intensities from a 512 x 512 image, if you had 512² histogram bins.
17
Cumulative frequency plot
⚫The cumulative (relative) frequency plot (also called an ogive) tells you the number (proportion) of sample points whose value is less than or equal to a given data value.
The cumulative frequency
plot for the frequency plot
from two slides back!
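A small sketch of how cumulative and cumulative relative frequencies could be obtained from a histogram by running sums. The bin edges and counts are hypothetical, not the image histogram shown on the slides.

```python
from itertools import accumulate

# Hypothetical histogram: upper edge of each bin and the frequency of that bin.
upper_edges = [22, 24, 26, 28, 30]
freq = [1, 2, 2, 1, 2]

cumulative = list(accumulate(freq))                        # running totals
cumulative_relative = [c / cumulative[-1] for c in cumulative]

# With half-open bins [a, b), the running total at an upper edge counts
# the sample points strictly below that edge.
for edge, c, cr in zip(upper_edges, cumulative, cumulative_relative):
    print(f"below {edge}: {c} points, proportion {cr:.3f}")
```

The cumulative sequence never decreases, and the relative version always ends at 1.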
18
Digression: A curious looking
histogram in image processing
⚫Given the image I(x,y), let’s say we compute the x-
gradient image in the following manner:
⚫And we plot the histogram of the absolute values of
the x-gradient image.
⚫The next slide shows you how these histograms
typically look! What do you observe?
19
20
21
Summarizing the Data
22
Summarizing a sample-set
⚫There are some values that can be considered “representative” of the entire sample-set. Such a value is called a “statistic”.
⚫The most common statistic is the sample (arithmetic) mean:
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$
⚫It is what is commonly regarded as the “average value”.
23
Summarizing a sample-set
⚫Another common statistic is the sample median,
which is the “middle value”.
⚫We sort the data array A from smallest to largest. If N
is odd, then the median is the value at the (N+1)/2
position in the sorted array.
⚫If N is even, the median can take any value in the
interval (A[N/2],A[N/2+1]) – why?
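The sorting procedure for the median can be sketched as below. For even N the code returns the midpoint of the two middle values, which is one common convention for picking a single value from that interval; the function names are illustrative.

```python
def mean(values):
    """Sample (arithmetic) mean."""
    return sum(values) / len(values)

def median(values):
    """Sample median obtained by sorting."""
    a = sorted(values)
    n = len(a)
    if n % 2 == 1:
        # 1-based position (N+1)/2 is 0-based index (N+1)/2 - 1
        return a[(n + 1) // 2 - 1]
    # N even: take the midpoint of the two middle entries (a convention)
    return (a[n // 2 - 1] + a[n // 2]) / 2

print(mean([1, 2, 3, 4, 6]), median([1, 2, 3, 4, 6]), median([1, 2, 3, 4]))
```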
24
Properties of the mean and median
⚫Suppose each sample point xi is replaced by axi + b for some constants a and b.
⚫What happens to the mean? What happens to the median?
⚫Suppose each sample point xi is replaced by its square.
⚫What happens to the mean? What happens to the median?
25
Properties of the mean and median
⚫Question: Consider a set of sample points x1, x2, …, xN. For what value y is the sum total of the squared differences with every sample point the least? That is, what is:
$$\arg\min_{y} \sum_{i=1}^{N} (x_i - y)^2 \qquad \text{(total squared deviation, or total squared loss)}$$
Answer: mean (proof done in class)
⚫Question: For what value y is the sum total of the absolute differences with every sample point the least? That is, what is:
$$\arg\min_{y} \sum_{i=1}^{N} |x_i - y| \qquad \text{(total absolute deviation, or total absolute loss)}$$
Answer: median (two proofs done in class, with and without calculus)
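One standard calculus argument for the first answer is sketched below. It is not necessarily the proof given in class. The symbol $L(y)$ for the total squared loss is introduced here only for illustration.

```latex
L(y) = \sum_{i=1}^{N} (x_i - y)^2
\quad\Rightarrow\quad
\frac{dL}{dy} = -2 \sum_{i=1}^{N} (x_i - y) = 0
\quad\Rightarrow\quad
y = \frac{1}{N} \sum_{i=1}^{N} x_i .
% The second derivative is 2N > 0, so this stationary point is the minimum.
```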
26
Properties of the mean and median
⚫The mean need not be a member of the original
sample-set.
⚫The median is always a member of the original
sample-set if N is odd.
⚫If N is even, the median need not be unique and need not be a member of the set (for example, when the two middle values differ and a value strictly between them is chosen).
27
Properties of the mean and median
⚫Consider a set of sample points x1, x2, …, xN. Let us
say that some of these values get grossly corrupted.
⚫What happens to the mean?
⚫What happens to the median?
28
Example
⚫Let A ={1,2,3,4,6}
⚫Mean (A) = 3.2, median (A) = 3
⚫Now consider A = {1,2,3,4,20}
⚫Mean (A) = 6, median(A) = 3.
29
Concept of percentiles
⚫Informally, the sample 100p percentile (0 ≤ p ≤ 1) is the data value y such that 100p% of the data have a value less than or equal to y, and 100(1-p)% of the data have a larger value.
⚫More precisely, for a data set with n sample points, the sample 100p percentile is a value such that at least np of the values are less than or equal to it, and at least n(1-p) of the values are greater than or equal to it.
30
Concept of quantiles
⚫The sample 25 percentile = first quartile.
⚫The sample 50 percentile = second quartile.
⚫The sample 75 percentile = third quartile.
⚫Quantiles can be inferred from the cumulative relative
frequency plot (how?).
⚫Or by sorting the data values (how?).
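One way to answer the "by sorting" question is sketched below. It assumes the "at least np values at or below, at least n(1-p) values at or above" definition of a percentile. When np is an integer, two adjacent sorted values both qualify and the code takes their midpoint. Other conventions exist, and statistical software often interpolates differently. The function name is hypothetical.

```python
import math

def sample_percentile(values, p):
    """Sample 100p percentile via sorting (0 <= p <= 1)."""
    a = sorted(values)
    n = len(a)
    k = n * p
    if k != int(k):
        # np is not an integer: the ceil(np)-th smallest value is the unique answer
        return a[math.ceil(k) - 1]
    k = int(k)
    if k == 0:
        return a[0]
    if k == n:
        return a[-1]
    # np is an integer: any value between the np-th and (np+1)-th smallest
    # qualifies; return the midpoint
    return (a[k - 1] + a[k]) / 2

data = [1, 2, 3, 4, 5, 6, 7, 8]
quartiles = [sample_percentile(data, p) for p in (0.25, 0.5, 0.75)]
print(quartiles)
```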
31
[Figure: plot with the 1st, 2nd and 3rd quartiles marked]
32
Concept of mode
⚫The value that occurs with the highest frequency is
called the mode.
33
Concept of mode
⚫The mode may not be unique, in which case all the
highest frequency values are called modal values.
Mode at 0
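A short sketch of finding all modal values, including the case where the mode is not unique. The helper name `modal_values` and the sample data are illustrative.

```python
from collections import Counter

def modal_values(values):
    """Return every value that attains the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modal_values([0, 0, 1, 2, 2, 3]))   # two modal values
print(modal_values(["AA", "BB", "BB"]))   # works for non-numeric data too
```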
34
Histogram for finding mean
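⚫Given the histogram, the mean of a sample can be approximated as follows:
$$\mu \approx \frac{1}{N}\sum_{j=1}^{K} f_j \,\frac{a_j + b_j}{2}$$
⚫ Here fj is the frequency of the jth bin [aj,bj), and (aj+bj)/2 is the midpoint of that bin.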
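A sketch of the histogram-based mean, under the assumption that every sample point in a bin is represented by the bin midpoint. The readings and bin edges are hypothetical.

```python
def histogram_mean(edges, freq):
    """Approximate the sample mean from bin edges and bin frequencies."""
    n = sum(freq)
    midpoints = [(edges[j] + edges[j + 1]) / 2 for j in range(len(freq))]
    return sum(f * m for f, m in zip(freq, midpoints)) / n

# Hypothetical readings, and their histogram with bins of width 2.
temps = [21.5, 22.0, 23.4, 25.0, 25.1, 27.9, 28.0, 29.99]
edges = [20, 22, 24, 26, 28, 30]
freq = [1, 2, 2, 1, 2]

approx = histogram_mean(edges, freq)
exact = sum(temps) / len(temps)
print(approx, exact)
```

Since each point lies within half a bin width of its bin's midpoint, the approximation error is at most half a bin width (here, 1).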
35
Histogram for finding median
⚫Given the histogram, the median of a sample is the value at which you can split the histogram into two regions of equal area.
⚫Keep adding areas starting from the leftmost bin until the running total exceeds N/2. This identifies the bin in which the median lies; the median is then approximated by the midpoint of that bin.
⚫This is more useful for histograms whose “bins” contain single values.
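The running-total procedure can be sketched as follows. The bin edges and counts are hypothetical, and the result is only an approximation of the true median.

```python
def histogram_median(edges, freq):
    """Approximate the median as the midpoint of the first bin at which
    the running frequency total exceeds N/2."""
    half = sum(freq) / 2
    running = 0
    for j, f in enumerate(freq):
        running += f
        if running > half:
            return (edges[j] + edges[j + 1]) / 2
    return None  # only reached for an empty histogram

edges = [20, 22, 24, 26, 28, 30]
freq = [1, 2, 2, 1, 2]          # N = 8, so we look for a running total > 4
approx_median = histogram_median(edges, freq)
print(approx_median)
```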
36
Variance and Standard deviation
⚫The variance is (approximately) the average value of the squared distance between the sample points and the sample mean. The formula is:
$$\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \mu)^2$$
⚫The variance measures the “spread of the data around the sample mean”.
⚫Its positive square root is called the standard deviation.
The division by N-1 instead of N is for a technical reason which we will understand after many lectures. In practice, the variance is usually computed when N is large, so the numerical difference is small.
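A minimal sketch of the sample variance with the N-1 divisor, checked against Python's standard `statistics` module (`statistics.variance` divides by N-1, `statistics.pvariance` by N). The data set is the one used in the earlier mean/median example.

```python
import math
import statistics

def sample_variance(values):
    """Sample variance with division by N-1."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / (n - 1)

a = [1, 2, 3, 4, 6]
var = sample_variance(a)
std = math.sqrt(var)             # standard deviation = positive square root
print(var, std, statistics.pvariance(a))
```

For this small sample the two divisors give noticeably different values (3.7 versus 2.96); the gap shrinks as N grows.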
37
Image source
38
Variance and Standard deviation:
Properties
⚫Suppose each sample point xi is replaced by axi + b for some constants a and b. What happens to the standard deviation?
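One way to work this out is sketched below. The primed symbols $\mu'$ and $\sigma'$ denote the mean and standard deviation of the transformed data and are notation introduced here only for illustration.

```latex
\mu' = \frac{1}{N}\sum_{i=1}^{N} (a x_i + b) = a\mu + b,
\qquad
\sigma'^2 = \frac{1}{N-1}\sum_{i=1}^{N} \bigl(a x_i + b - (a\mu + b)\bigr)^2
          = a^2 \sigma^2
\quad\Rightarrow\quad
\sigma' = |a|\,\sigma .
% The shift b cancels; only the scale factor a changes the spread.
```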
39
Standard deviation: practical
application 1
⚫Let us say a factory manufactures a product which is
required to have a certain weight w.
⚫In practice, the weight of each instance of the product
will deviate from w.
⚫In such a case, we need to see whether the average weight is close to (or equal to) w.
⚫But we also need to see that the standard deviation is
small.
⚫In fact, the standard deviation can be used to predict
how likely it is that the product weight will deviate
significantly from the mean.
40
Standard deviation: practical
application 2
⚫The standard deviation appears in the definition of diseases such as osteoporosis (low bone density).
⚫A person whose bone density is more than 2.5σ below the average bone density for that age-group, gender and geographical region is said to be suffering from osteoporosis. Here σ is the standard deviation of the bone density of that particular population.
Image source
41
Chebyshev’s inequality
⚫Suppose I told you that the average marks for this
course was 75 (out of 100). And that the variance of
the marks was 25.
⚫Can you say something about how many students
secured marks from 65 to 85?
⚫You obviously cannot predict the exact number – but
you can say something about this number.
⚫That something is given by Chebyshev’s inequality.
42
Chebyshev’s inequality: and
Chebyshev
Two-sided Chebyshev’s inequality:
The proportion of sample points that lie k or more (k > 0) standard deviations away from the sample mean is less than 1/k²:
$$\frac{\left|\{\, i : |x_i - \mu| \ge k\sigma \,\}\right|}{N} < \frac{1}{k^2}$$
Pafnuty Chebyshev, Russian mathematician: stellar contributions in probability and statistics, geometry, mechanics.
https://en.wikipedia.org/wiki/Pafnuty_Chebyshev
43
Chebyshev’s inequality: and
Chebyshev
Two-sided Chebyshev’s inequality:
The proportion of sample points that lie k or more (k > 0) standard deviations away from the sample mean is less than or equal to 1/k².
Proof: on the board!
And in the book.
44
Chebyshev’s inequality
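⚫Applying this inequality to the previous problem (mean 75, variance 25, so σ = 5 and k = 10/5 = 2), we see that the fraction of students who got less than 65 or more than 85 marks is bounded as follows:
$$\frac{\left|\{\, i : |x_i - 75| > 10 \,\}\right|}{N} \le \frac{1}{k^2} = \frac{1}{2^2} = 0.25$$
⚫So the fraction of students who got from 65 to 85 is at least 1-0.25 = 0.75.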
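The arithmetic for the marks example can be sketched as a small helper. The function name and its handling of asymmetric intervals (using the shorter distance to the mean) are assumptions made for illustration.

```python
import math

def chebyshev_min_fraction_within(mean, variance, low, high):
    """Chebyshev lower bound on the fraction of sample points in [low, high].

    For an interval that is not symmetric about the mean, the shorter
    distance to the mean is used, which keeps the bound valid.
    """
    half_width = min(mean - low, high - mean)
    k = half_width / math.sqrt(variance)
    if k <= 0:
        return 0.0                      # the inequality says nothing useful
    return max(0.0, 1 - 1 / k ** 2)

bound = chebyshev_min_fraction_within(mean=75, variance=25, low=65, high=85)
print(bound)
```

For k ≤ 1 the bound is 0, so the inequality is informative only beyond one standard deviation.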
45
Chebyshev’s inequality
Rank  State/UT                     Literacy rate (%)
1     Kerala                       93.91
2     Lakshadweep                  92.28
3     Mizoram                      91.58
4     Tripura                      87.75
5     Goa                          87.40
6     Daman & Diu                  87.07
7     Puducherry                   86.55
8     Chandigarh                   86.43
9     Delhi                        86.34
10    Andaman & Nicobar Islands    86.27
11    Himachal Pradesh             83.78
12    Maharashtra                  82.91
Source: https://en.wikipedia.org/wiki/Indian_states_ranking_by_literacy_rate
Mean = 87.69
Std. dev. = 3.306
Fraction of states with literacy rate in the range (μ-1.5σ, μ+1.5σ) is 11/12 ≈ 92%
As predicted by Chebyshev’s inequality, it is at least 1-1/(1.5*1.5) ≈ 0.556
The bounds predicted by this inequality are loose, but they are correct!
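The numbers on this slide can be re-derived with a short script. The twelve literacy rates are copied from the table; `statistics.stdev` uses the N-1 divisor.

```python
import statistics

# Literacy rates (%) of the twelve entries listed in the table.
rates = [93.91, 92.28, 91.58, 87.75, 87.40, 87.07,
         86.55, 86.43, 86.34, 86.27, 83.78, 82.91]

mu = statistics.mean(rates)
sigma = statistics.stdev(rates)          # sample standard deviation (N-1)
k = 1.5

inside = sum(1 for r in rates if abs(r - mu) < k * sigma)
observed_fraction = inside / len(rates)
chebyshev_bound = 1 - 1 / k ** 2         # guaranteed minimum fraction

print(round(mu, 2), round(sigma, 3), inside, round(chebyshev_bound, 3))
```

Only Kerala falls outside the interval, so the observed fraction is well above the guaranteed minimum.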
46
One-sided Chebyshev’s inequality
⚫Also called the Chebyshev-Cantelli inequality.
The proportion of sample points that lie k or more (k > 0) standard deviations above the sample mean is less than or equal to 1/(1+k²):
$$\frac{\left|\{\, i : x_i - \mu \ge k\sigma \,\}\right|}{N} \le \frac{1}{1+k^2}$$
Notice: no absolute value!
Proof: on the board!
And in the book.
47
One-sided Chebyshev’s inequality
(Another form)
⚫Also called the Chebyshev-Cantelli inequality.
The proportion of sample points that lie k or more (k > 0) standard deviations below the sample mean is less than or equal to 1/(1+k²):
$$\frac{\left|\{\, i : x_i - \mu \le -k\sigma \,\}\right|}{N} \le \frac{1}{1+k^2}$$
Notice: no absolute value!
Proof: on the board!
And in the book.
48
Correlation between different data
values
⚫Sometimes each sample-point can have a pair of
attributes.
⚫And it may so happen that large values of the first
attribute are accompanied with large (or small) values
of the second attribute for a large number of sample-
points.
49
Correlation between different data
values
⚫Example 1: Populations with higher levels of fat
intake show higher incidence of heart disease.
⚫Example 2: People with higher levels of education
often have higher incomes.
⚫Example 3: Literacy Rate in India as a function of
time?
50
Image source
51
Visualizing such relationships?
⚫Can be done by means of a scatter plot
⚫X axis: values of attribute 1, Y axis: values of attribute
2
⚫Plot a marker at each such data point. The marker may
be a small circle, a +, a *, and so on.
52
Visualizing such relationships?
⚫Image processing example: the intensity value of a pixel versus the intensity value of its right neighbor
53
Correlation coefficient
⚫Let the sample-points be given as (xi,yi), 1 <= i <= N.
⚫Let the sample standard deviations be σx and σy, and
the sample means be μx and μy.
⚫The correlation-coefficient is given as:
$$r = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{(N-1)\,\sigma_x \sigma_y}$$
54
Correlation coefficient
⚫ The correlation-coefficient is given as:
$$r = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{(N-1)\,\sigma_x \sigma_y}$$
⚫ r > 0 means the data are positively correlated (when one attribute is higher, the other tends to be higher)
⚫ r < 0 means the data are negatively correlated (when one attribute is higher, the other tends to be lower)
⚫ r = 0 means the data are uncorrelated (there is no such linear relationship!)
⚫ r is undefined if the standard deviation of either x or y is 0.
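A minimal sketch of the correlation coefficient, assuming the usual definition: the sum of products of deviations from the sample means, divided by (N-1) times the two sample standard deviations (each computed with the N-1 divisor). The function name and the toy data are hypothetical.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r of paired data (xs[i], ys[i])."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    if sx == 0 or sy == 0:
        raise ValueError("r is undefined when either standard deviation is 0")
    cov_sum = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov_sum / ((n - 1) * sx * sy)

print(correlation([1, 2, 3, 4], [3, 5, 7, 9]))    # y = 1 + 2x
print(correlation([1, 2, 3, 4], [9, 7, 5, 3]))    # y = 11 - 2x
```

The explicit error for zero standard deviation mirrors the "undefined" case in the last bullet.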
55
Correlation coefficient: Properties
⚫The correlation-coefficient is given as:
$$r = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{(N-1)\,\sigma_x \sigma_y}$$
⚫-1 <= r <= 1 always!
56
Correlation coefficient values for various toy datasets in 2D: for each dataset, a scatter plot is provided. For the dataset lying on a horizontal line, r is undefined.
https://en.wikipedia.org/wiki/Correlation_and_dependence
57
Correlation coefficient: geometric
interpretation
⚫Consider the N values x1, x2, …, xN. We will assemble
them into a vector x (1D array) of N elements.
⚫We will also create vector y from y1, y2, …, yN.
⚫Now create vectors x-μx and y-μy – by deducting μx
from each element of x, and μy from each element of y.
⚫Note that you may be used to vectors in 2D or 3D, but
in statistics or machine learning, we frequently use
vectors in N-D!
58
Correlation coefficient: geometric
interpretation
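⚫ Then r(x, y) is the cosine of the angle between x-μx and y-μy:
$$r(x, y) = \frac{(x - \mu_x) \cdot (y - \mu_y)}{\|x - \mu_x\|_2 \, \|y - \mu_y\|_2}$$
Here $\|v\|_2 = \sqrt{\sum_{i=1}^{N} v_i^2}$ is the vector magnitude, also called the L2-norm of the vector.
⚫ Note that the cosine of an angle has a value from -1 to +1.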
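The geometric view can be checked numerically: center both vectors, then divide their dot product by the product of their L2-norms. The data below are hypothetical and chosen so that the arithmetic is easy to follow by hand.

```python
import math

def centered_cosine(xs, ys):
    """Cosine of the angle between the mean-centered vectors of xs and ys."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    dot = sum(a * b for a, b in zip(dx, dy))
    norm_x = math.sqrt(sum(a * a for a in dx))   # L2-norm of x - mu_x
    norm_y = math.sqrt(sum(b * b for b in dy))   # L2-norm of y - mu_y
    return dot / (norm_x * norm_y)

xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
cos_angle = centered_cosine(xs, ys)              # dot = 8, both norms = sqrt(10)
angle_degrees = math.degrees(math.acos(cos_angle))
print(cos_angle, angle_degrees)
```

The (N-1) factors of the two standard deviations cancel against the (N-1) in the denominator of r, which is why no N appears here.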
59
Correlation coefficient: Properties
⚫In the following, we have a,b,c,d constant.
⚫If yi = a+bxi where b > 0, then r(x,y) = 1.
⚫If yi = a+bxi where b < 0, then r(x,y) = -1.
⚫If r is the correlation coefficient of data pairs as (xi,yi),
1 <= i <= N, then it is also the correlation coefficient
of data pairs (b+axi,d+cyi) when a and c have the same
sign.
60
Correlation coefficient: a word of
caution
⚫Sensitive to outliers!
r = 1 r = 0.33
61
Caution with correlation: Anscombe’s
quartet
⚫The correlation coefficient can be a misleading value,
and graphical examination of the data is important.
⚫This was illustrated beautifully by a British statistician
named Frank Anscombe – by showing four examples
that graphically appear very different – even though
they produce identical correlation coefficients.
⚫These examples are famously called Anscombe’s
quartet.
62
Caution with correlation: Anscombe’s
quartet
Image source
In each of these examples, the
following quantities were the
same:
• Mean and variance of x
• Mean and variance of y
• Correlation coefficient
r(x,y)
But the data are graphically
very different!
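The claim can be checked numerically. The values below are the commonly published Anscombe data, typed in here by hand, so treat them as an assumption to be verified against a trusted copy. The helper `correlation` is illustrative.

```python
import math
import statistics

def correlation(xs, ys):
    """Sample correlation coefficient of paired data."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

for xs, ys in quartet:
    print(statistics.mean(xs), statistics.variance(xs),
          round(statistics.mean(ys), 2), round(correlation(xs, ys), 3))
```

The summary statistics agree to about two decimal places rather than exactly, which is why plotting the data remains essential.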
63
Reflective (or Uncentered)
correlation coefficient
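⚫A version of the correlation coefficient in which you do not deduct the mean values from the vectors:
$$r_u(x, y) = \frac{\sum_{i=1}^{N} x_i y_i}{\sqrt{\sum_{i=1}^{N} x_i^2}\,\sqrt{\sum_{i=1}^{N} y_i^2}}$$
⚫The uncentered c.c. is not “translation invariant”: in general $r_u(x + a, y + b) \ne r_u(x, y)$ for constants a and b.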
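A small numerical illustration, assuming the uncentered coefficient is the cosine of the angle between the raw (uncentered) vectors. The function name and data are hypothetical.

```python
import math

def uncentered_correlation(xs, ys):
    """Cosine of the angle between the raw vectors (means are NOT deducted)."""
    dot = sum(x * y for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum(x * x for x in xs))
    norm_y = math.sqrt(sum(y * y for y in ys))
    return dot / (norm_x * norm_y)

xs = [1, 2, 3]
ys = [3, 2, 1]                      # perfectly anti-correlated after centering

before = uncentered_correlation(xs, ys)
after = uncentered_correlation([x + 10 for x in xs], [y + 10 for y in ys])
print(before, after)
```

Adding 10 to every value moves the uncentered coefficient from 10/14 to 430/434, whereas the usual (centered) coefficient of these data is -1 both before and after the shift.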
64
Correlation does not necessarily
imply causation
⚫A high correlation between two attributes does not
mean that one causes the other.
⚫Example 1: Fast rotating windmills are observed when
the wind speed is high. Hence can one say that the
windmill rotation produces speedy wind? (a windmill
in the literal sense ☺)
65
Correlation does not necessarily
imply causation
⚫In example 1, the cause and effect were swapped. High
wind speed leads to fast rotation and not vice-versa.
⚫Example 2: High sale of ice-cream is correlated with
larger occurrence of drowning. Hence can one say that
ice-cream causes drowning?
⚫In this case, there is a third factor that is highly
correlated with both – ice-cream sales, as well as
drowning. Ice-cream sales and swimming activities are
on the rise in the summer!
66
Correlation does not necessarily
imply causation
⚫The above statement does not mean that correlation is
never associated with causation (example: increase in
age does cause increase in height in children or
adolescents) – just that it is not sufficient to establish
causation.
⚫Consider the argument: “High correlation between
tobacco usage and lung cancer occurrence does not
imply that smoking causes lung cancer.”
67
Correlation does not necessarily
imply causation – but it may!
⚫ However multiple observational studies that
eliminate other possible causes do lead to the
conclusion that smoking causes cancer!
❑ higher tobacco dosage associated with higher occurrence of cancer
❑ stopping smoking associated with lower occurrence of cancer
❑ higher duration of smoking associated with higher occurrence of
cancer
❑ unfiltered (as opposed to filtered) cigarettes associated with higher
occurrence of cancer
• See https://www.sciencebasedmedicine.org/evidence-in-medicine-correlation-and-causation/ and http://www.americanscientist.org/issues/pub/what-everyone-should-know-about-statistical-correlation for more details.
68

More Related Content

PPTX
An Introduction to Statistics
PPTX
EMA104 Mod 1.1 - Introduction to Statistics.pptx
PPTX
Descriptive statistics
PPTX
Stat-Lesson.pptx
PDF
Lesson2 - chapter 2 Measures of Tendency.pptx.pdf
PDF
Lesson2 - chapter two Measures of Tendency.pptx.pdf
PDF
Lessontwo - Measures of Tendency.pptx.pdf
PPTX
RVO-STATISTICS_Statistics_Introduction To Statistics IBBI.pptx
An Introduction to Statistics
EMA104 Mod 1.1 - Introduction to Statistics.pptx
Descriptive statistics
Stat-Lesson.pptx
Lesson2 - chapter 2 Measures of Tendency.pptx.pdf
Lesson2 - chapter two Measures of Tendency.pptx.pdf
Lessontwo - Measures of Tendency.pptx.pdf
RVO-STATISTICS_Statistics_Introduction To Statistics IBBI.pptx

Similar to Descriptive_Statistics : Introduction to Descriptive_Statistics,Central tendency,Basic of Statistics described (20)

PPT
Manpreet kay bhatia Business Statistics.ppt
PPTX
Statistics
PPT
Statistics.ppt
PPTX
Type of data @ Web Mining Discussion
PPTX
1. Descriptive statistics.pptx engineering
PDF
statistics - Populations and Samples.pdf
PPTX
Lesson2 lecture two in Measures mean.pptx
PPTX
Descriptive Statistics in Biomedical Research .pptx
PPTX
Descriptive Statistics.pptx
PPTX
LECTURE 3 - inferential statistics bmaths
PPTX
2.1 frequency distributions for organizing and summarizing data
PPTX
Statistics and optimization (1)
PPTX
Lesson3 lpart one - Measures mean [Autosaved].pptx
PPT
Statistics for math (English Version)
PPTX
Type of data @ web mining discussion
DOCX
STATISTICS 1
PPTX
Business statistics
PDF
Day2 session i&amp;ii - spss
PPT
Ch.2 ppt - descriptive stat - Larson-fabers.ppt
PDF
BIOSTATICS & RESEARCH METHODOLOGY UNIT-1.pdf
Manpreet kay bhatia Business Statistics.ppt
Statistics
Statistics.ppt
Type of data @ Web Mining Discussion
1. Descriptive statistics.pptx engineering
statistics - Populations and Samples.pdf
Lesson2 lecture two in Measures mean.pptx
Descriptive Statistics in Biomedical Research .pptx
Descriptive Statistics.pptx
LECTURE 3 - inferential statistics bmaths
2.1 frequency distributions for organizing and summarizing data
Statistics and optimization (1)
Lesson3 lpart one - Measures mean [Autosaved].pptx
Statistics for math (English Version)
Type of data @ web mining discussion
STATISTICS 1
Business statistics
Day2 session i&amp;ii - spss
Ch.2 ppt - descriptive stat - Larson-fabers.ppt
BIOSTATICS & RESEARCH METHODOLOGY UNIT-1.pdf
Ad

Recently uploaded (20)

PPT
Image processing and pattern recognition 2.ppt
PDF
Global Data and Analytics Market Outlook Report
PDF
Tetra Pak Index 2023 - The future of health and nutrition - Full report.pdf
PPTX
Business_Capability_Map_Collection__pptx
PPTX
Topic 5 Presentation 5 Lesson 5 Corporate Fin
PDF
Microsoft 365 products and services descrption
PPTX
FMIS 108 and AISlaudon_mis17_ppt_ch11.pptx
PPTX
Leprosy and NLEP programme community medicine
PDF
Microsoft Core Cloud Services powerpoint
PDF
Jean-Georges Perrin - Spark in Action, Second Edition (2020, Manning Publicat...
PPTX
Copy of 16 Timeline & Flowchart Templates – HubSpot.pptx
PDF
Data Engineering Interview Questions & Answers Cloud Data Stacks (AWS, Azure,...
PPTX
Steganography Project Steganography Project .pptx
PPTX
Managing Community Partner Relationships
PPTX
(Ali Hamza) Roll No: (F24-BSCS-1103).pptx
PDF
Data Engineering Interview Questions & Answers Data Modeling (3NF, Star, Vaul...
PPTX
STERILIZATION AND DISINFECTION-1.ppthhhbx
PDF
Navigating the Thai Supplements Landscape.pdf
PPTX
Lesson-01intheselfoflifeofthekennyrogersoftheunderstandoftheunderstanded
PPTX
modul_python (1).pptx for professional and student
Image processing and pattern recognition 2.ppt
Global Data and Analytics Market Outlook Report
Tetra Pak Index 2023 - The future of health and nutrition - Full report.pdf
Business_Capability_Map_Collection__pptx
Topic 5 Presentation 5 Lesson 5 Corporate Fin
Microsoft 365 products and services descrption
FMIS 108 and AISlaudon_mis17_ppt_ch11.pptx
Leprosy and NLEP programme community medicine
Microsoft Core Cloud Services powerpoint
Jean-Georges Perrin - Spark in Action, Second Edition (2020, Manning Publicat...
Copy of 16 Timeline & Flowchart Templates – HubSpot.pptx
Data Engineering Interview Questions & Answers Cloud Data Stacks (AWS, Azure,...
Steganography Project Steganography Project .pptx
Managing Community Partner Relationships
(Ali Hamza) Roll No: (F24-BSCS-1103).pptx
Data Engineering Interview Questions & Answers Data Modeling (3NF, Star, Vaul...
STERILIZATION AND DISINFECTION-1.ppthhhbx
Navigating the Thai Supplements Landscape.pdf
Lesson-01intheselfoflifeofthekennyrogersoftheunderstandoftheunderstanded
modul_python (1).pptx for professional and student
Ad

Descriptive_Statistics : Introduction to Descriptive_Statistics,Central tendency,Basic of Statistics described

  • 2. Topic Overview ⚫Some important terminology ⚫Methods of data representation: frequency tables, graphs, pie-charts, scatter-plots ⚫Data mean, median, mode, quantiles ⚫Chebyshev’s inequality ⚫Correlation coefficient 2
  • 3. Terminology ⚫ Population: The collection of all elements which we wish to study, example: data about occurrence of tuberculosis all over the world ⚫ In this case, “population” refers to the set of people in the entire world. ⚫ The population is often too large to examine/study. ⚫ So we study a subset of the population – called as a sample. ⚫ In an experiment, we basically collect values for attributes of each member of the sample – also called a sample point. ⚫ Example of a relevant attribute in the tuberculosis study would be whether or not the patient yielded a positive result on the serum TB Gold test. ⚫ See http://www.who.int/tb/publications/global_report/en/ for more information. 3
  • 4. Terminology ⚫Discrete data: Data whose values are restricted to a finite or countably infinite set. Eg: letter grades at IITB, genders, marital status (single, married, divorced), income brackets in India for tax purposes ⚫Continuous data: Data whose values belong to an uncountably infinite set (Eg: a person’s height, temperature of a place, speed of a car at a time instant). 4
  • 6. Frequency Tables ⚫For discrete data having a relatively small number of values, one can use a frequency table. ⚫Each row of the table lists the data value followed by the number of sample points with that value (frequency of that value). ⚫The values need not always be numeric! Grade Number of students for that grade (total 100) AA 100 AB 0 BB 0 BC 0 CC 0 The definition of an ideal course (per student perspective) at IITB ;-) 6
  • 7. Frequency Tables ⚫The frequency table can be visualized using a line graph or a bar graph or a frequency polygon. Grade Number of students AA 5 AB 10 BB 30 BC 35 CC 20 A bar graph plots the distinct data values on the X axis and their frequency on the Y axis by means of the height of a thick vertical bar! 7
  • 8. Grade Number of students AA 5 AB 10 BB 30 BC 35 CC 20 A line diagram plots the distinct data values on the X axis and their frequency on the Y axis by means of the height of a vertical line! 8
  • 9. Grade Number of students AA 5 AB 10 BB 30 BC 35 CC 20 A frequency polygon plots the frequency of each data value on the Y axis, and connects consecutive plotted points by means of a line. 9
  • 10. Relative frequency tables ⚫Sometimes the actual frequencies are not important. ⚫We may be interested only in the percentage or fraction of those frequencies for each data value – i.e. relative frequencies. Grade Fraction of number of students AA 0.05 AB 0.10 BB 0.30 BC 0.35 CC 0.20 10
  • 11. Pie charts ⚫For a small number of distinct data values which are non-numerical, one can use a pie-chart (it can also be used for numerical values). ⚫It consists of a circle divided into sectors corresponding to each data value. ⚫The area of each sector = relative frequency for that data value. Population of native English speakers: https://en.wikipedia.org/wiki/Pie_chart 11
  • 12. Pie charts can be confusing A big no-no with too many categories. http://stephenturbek.com/articles/2009/06/better-charts-from-simple-questions.html 12
  • 13. Dealing with continuous data ⚫Many a time the data can acquire continuous values (eg: temperature of a place at a time instant, speed of a car at a given time instant, weight or height of an animal, etc.) ⚫In such cases, the data values are divided into intervals called as bins. ⚫The frequency now refers to the number of sample points falling into each bin. ⚫The bins are often taken to be of equal length, though that is not strictly necessary. 13
  • 14. Dealing with continuous data ⚫Let the sample points be {xi}, 1 <= i <= N. ⚫Let there be some K (K << N) bins, where the jth bin has interval [aj,bj). ⚫Thus frequency fj for the jth bin is defined as follows: ⚫Such frequency tables are also called histograms and they can also be used to store relative frequency instead of frequency. 14
  • 15. Example of a histogram: in image processing ⚫ A grayscale image is a 2D array of size (say) H x W. ⚫ Each entry of this array is called a pixel and is indexed as (x,y) where x is the column index and y is the row index. ⚫ At each pixel, we have an intensity value which tells us how bright the pixel is (smaller values = darker shades, larger value = brighter shades). ⚫ Commonly, pixel values in grayscale photographic are 8 bit (ranging from 0 to 255). ⚫ Histograms are widely used in image processing – in fact a histogram is often used in image retrieval (eg: finding images from the web that are most similar to a query image). 15
  • 16. Example: histogram of the well-known “barbara image”, using bins of length 10. This image has values from 0 to 255 and hence there are 26 bins. 16
  • 17. The histogram binning problem 17 ⚫ If you have too few bins (each bin is very wide), there is very little idea you get about the data distribution from the histogram. ⚫ Extreme: only one bin to represent all intensities in an image. ⚫ If you have many bins (all will be narrow), then there are very points falling into each bin. Again there is very little idea you get about the data distribution from the histogram. ⚫ Extreme: For intensities from a 512 x 512 image, if you had 5122 histogram bins.
  • 18. Cumulative frequency plot ⚫The cumulative (relative) frequency plot (also called ogive) tells you the (proportion) number of sample points whose value is less than or equal to a given data value. The cumulative frequency plot for the frequency plot from two slides back! 18
  • 19. Digression: A curious looking histogram in image processing ⚫Given the image I(x,y), let’s say we compute the x- gradient image in the following manner: ⚫And we plot the histogram of the absolute values of the x-gradient image. ⚫The next slide shows you how these histograms typically look! What do you observe? 19
  • 20. 20
  • 21. 21
  • 23. Summarizing a sample-set ⚫There are some values that can be considered “representative” of the entire sample-set. Such values are called as a “statistic”. ⚫The most common statistic is the sample (arithmetic) mean: ⚫It is basically what is commonly regarded as “average value”. 23
  • 24. Summarizing a sample-set ⚫Another common statistic is the sample median, which is the “middle value”. ⚫We sort the data array A from smallest to largest. If N is odd, then the median is the value at the (N+1)/2 position in the sorted array. ⚫If N is even, the median can take any value in the interval (A[N/2],A[N/2+1]) – why? 24
  • 25. Properties of the mean and median ⚫Consider each sample point xi were replaced by axi + b for some constants a and b. ⚫What happens to the mean? What happens to the median? ⚫Consider each sample point xi were replaced by its square. ⚫What happens to the mean? What happens to the median? 25
  • 26. Properties of the mean and median ⚫Question: Consider a set of sample points x1, x2, …, xN. For what value y, is the sum total of the squared difference with every sample point, the least? That is, what is: ⚫Question: For what value y, is the sum total of the absolute difference with every sample point, the least? That is, what is: Total squared deviation (or total squared loss) Total absolute deviation (or total absolute loss) Answer: mean (proof done in class) Answer: median (two proofs done in class – with and without calculus) 26
  • 27. Properties of the mean and median ⚫The mean need not be a member of the original sample-set. ⚫The median is always a member of the original sample-set if N is odd. ⚫The median is not unique and will not be a member of the set if N is even. 27
  • 28. Properties of the mean and median ⚫Consider a set of sample points x1, x2, …, xN. Let us say that some of these values get grossly corrupted. ⚫What happens to the mean? ⚫What happens to the median?
  • 29. Example ⚫Let A = {1,2,3,4,6} ⚫Mean(A) = 3.2, median(A) = 3 ⚫Now consider A = {1,2,3,4,20} ⚫Mean(A) = 6, median(A) = 3.
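The example can be reproduced with Python's standard `statistics` module. This is a small sketch; the variable names `clean` and `corrupted` are chosen here for illustration.

```python
from statistics import mean, median

clean = [1, 2, 3, 4, 6]
corrupted = [1, 2, 3, 4, 20]  # the largest value has been grossly corrupted

for name, data in [("clean", clean), ("corrupted", corrupted)]:
    print(f"{name:9s} mean = {mean(data)}, median = {median(data)}")
```

The mean moves from 3.2 to 6 while the median stays at 3, which is the sense in which the median is more robust to corrupted values.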
  • 30. Concept of percentiles ⚫The sample 100p percentile (0 ≤ p ≤ 1) is defined as the data value y such that 100p% of the data have a value less than or equal to y, and 100(1-p)% of the data have a larger value. ⚫More precisely, for a data set with n sample points, the sample 100p percentile is a value such that at least np of the values are less than or equal to it, and at least n(1-p) of the values are greater than or equal to it.
  • 31. Concept of quantiles ⚫The sample 25th percentile = first quartile. ⚫The sample 50th percentile = second quartile (the median). ⚫The sample 75th percentile = third quartile. ⚫Quantiles can be inferred from the cumulative relative frequency plot (how?). ⚫Or by sorting the data values (how?).
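One way to obtain percentiles by sorting is sketched below. It assumes a common textbook convention that the slides do not spell out: if np is not an integer, take the ceil(np)-th smallest value; if np is an integer, average the np-th and (np+1)-th smallest values. The function name `sample_percentile` and the tolerance `1e-9` are hypothetical choices for illustration, and other conventions (for example, interpolation schemes used by numerical libraries) give slightly different answers.

```python
import math

def sample_percentile(data, p):
    """Sample 100p percentile for 0 < p < 1, found by sorting.

    Assumed convention: if n*p is an integer, average the (n*p)-th and
    (n*p + 1)-th smallest values; otherwise take the ceil(n*p)-th smallest.
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    xs = sorted(data)
    n = len(xs)
    k = n * p
    k_int = round(k)
    # Treat n*p as an integer if it is one up to floating-point error.
    if abs(k - k_int) < 1e-9 and 1 <= k_int <= n - 1:
        return (xs[k_int - 1] + xs[k_int]) / 2
    return xs[math.ceil(k) - 1]

data = [1, 2, 3, 4, 6]
print("first quartile :", sample_percentile(data, 0.25))
print("median         :", sample_percentile(data, 0.50))
print("third quartile :", sample_percentile(data, 0.75))
```

With this convention, at least np of the values are less than or equal to the returned value and at least n(1-p) are greater than or equal to it.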
  • 33. Concept of mode ⚫The value that occurs with the highest frequency is called the mode.
  • 34. Concept of mode ⚫The mode may not be unique, in which case all the highest frequency values are called modal values. (Figure: a histogram with its mode at 0.)
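  • 35. Histogram for finding mean ⚫Given the histogram, the mean of a sample can be approximated as follows: $\mu \approx \frac{1}{N}\sum_{j} f_j c_j$. ⚫Here $f_j$ is the frequency of the jth bin, $c_j$ is a representative value of the jth bin (for example, its midpoint), and $N = \sum_j f_j$ is the number of sample points.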
  • 36. Histogram for finding median ⚫Given the histogram, the median of a sample is the value at which you can split the histogram into two regions of equal areas. ⚫Keep adding areas from the leftmost bins till you reach more than N/2. You then know the bin in which the median lies, and the midpoint of that bin is taken as the (approximate) median. ⚫More useful for histograms whose “bins” contain single values.
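Both histogram-based approximations can be sketched in a few lines. The bin edges and frequencies below are made up for illustration, and bin midpoints are assumed as the representative value of each bin.

```python
# Hypothetical histogram: 4 bins with edges 0-10, 10-20, 20-30, 30-40
edges = [0, 10, 20, 30, 40]
freqs = [2, 5, 2, 1]           # f_j: number of sample points in bin j

N = sum(freqs)
centers = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]

# Approximate mean: frequency-weighted average of the bin midpoints
approx_mean = sum(f * c for f, c in zip(freqs, centers)) / N

# Approximate median: accumulate frequencies from the left until the
# running total exceeds N/2, then report that bin's midpoint
running_total = 0
approx_median = None
for f, c in zip(freqs, centers):
    running_total += f
    if running_total > N / 2:
        approx_median = c
        break

print("approximate mean  :", approx_mean)
print("approximate median:", approx_median)
```

Both values are approximations: the histogram discards where exactly each sample point lies inside its bin, so the error shrinks as the bins get narrower.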
  • 37. Variance and Standard deviation ⚫The variance is (approximately) the average value of the squared distance between the sample points and the sample mean. The formula is: $\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \mu)^2$, where $\mu$ is the sample mean. ⚫The variance measures the “spread of the data around the sample mean”. ⚫Its positive square-root is called the standard deviation. ⚫The division by N-1 instead of N is for a technical reason which we will understand after many lectures. In practice, the variance is usually computed when N is large, so the numerical difference is small.
  • 39. Variance and Standard deviation: Properties ⚫Suppose each sample point xi is replaced by axi + b for some constants a and b. What happens to the standard deviation?
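A quick numerical experiment suggests the answer to the question above. This is a sketch with arbitrarily chosen constants `a` and `b`; `statistics.stdev` divides by N-1, matching the variance definition used in these slides.

```python
import math
from statistics import mean, stdev  # stdev uses the N-1 denominator

xs = [1, 2, 3, 4, 6]
a, b = -3, 10                       # arbitrary constants for illustration
ys = [a * x + b for x in xs]

print("mean of ys  :", mean(ys), " vs  a*mean(xs) + b   :", a * mean(xs) + b)
print("stdev of ys :", stdev(ys), " vs  |a| * stdev(xs) :", abs(a) * stdev(xs))
```

The shift b moves the mean but leaves the spread unchanged, while the scale factor multiplies the standard deviation by |a| (and the variance by a²).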
  • 40. Standard deviation: practical application 1 ⚫Let us say a factory manufactures a product which is required to have a certain weight w. ⚫In practice, the weight of each instance of the product will deviate from w. ⚫In such a case, we need to see whether the average weight is close to (or equal to) w. ⚫But we also need to see that the standard deviation is small. ⚫In fact, the standard deviation can be used to predict how likely it is that the product weight will deviate significantly from the mean.
  • 41. Standard deviation: practical application 2 ⚫The standard deviation appears in the definition of diseases such as osteoporosis (low bone density). ⚫A person whose bone density is more than 2.5σ below the average bone density for that age-group, gender and geographical region is said to be suffering from osteoporosis. Here σ is the standard deviation of the bone density of that particular population.
  • 42. Chebyshev’s inequality ⚫Suppose I told you that the average marks for this course was 75 (out of 100). And that the variance of the marks was 25. ⚫Can you say something about how many students secured marks from 65 to 85? ⚫You obviously cannot predict the exact number – but you can say something about this number. ⚫That something is given by Chebyshev’s inequality.
  • 43. Chebyshev’s inequality: and Chebyshev ⚫Pafnuty Chebyshev (https://en.wikipedia.org/wiki/Pafnuty_Chebyshev): Russian mathematician with stellar contributions in probability and statistics, geometry, mechanics. ⚫Two-sided Chebyshev’s inequality: The proportion of sample points that are k or more (k>0) standard deviations away from the sample mean is less than $1/k^2$.
  • 44. Chebyshev’s inequality: and Chebyshev ⚫Two-sided Chebyshev’s inequality: The proportion of sample points that are k or more (k>0) standard deviations away from the sample mean is less than or equal to $1/k^2$, that is, $\frac{|\{i : |x_i - \mu| \ge k\sigma\}|}{N} \le \frac{1}{k^2}$. ⚫Proof: on the board! And in the book.
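  • 45. Chebyshev’s inequality ⚫Applying this inequality to the previous problem, we see that the fraction of students who got less than 65 or more than 85 marks is bounded as follows: the standard deviation is $\sigma = \sqrt{25} = 5$, so 65 and 85 lie $k = (85-75)/5 = 2$ standard deviations from the mean, and the fraction is at most $1/k^2 = 1/4 = 0.25$. ⚫So the fraction of students who got from 65 to 85 is at least 1-0.25 = 0.75.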
  • 46. Chebyshev’s inequality ⚫Literacy rates of the Indian states and union territories ranked 1 to 12 (source: https://en.wikipedia.org/wiki/Indian_states_ranking_by_literacy_rate):
      Rank  State / union territory        Literacy rate (%)
      1     Kerala                         93.91
      2     Lakshadweep                    92.28
      3     Mizoram                        91.58
      4     Tripura                        87.75
      5     Goa                            87.40
      6     Daman & Diu                    87.07
      7     Puducherry                     86.55
      8     Chandigarh                     86.43
      9     Delhi                          86.34
      10    Andaman & Nicobar Islands      86.27
      11    Himachal Pradesh               83.78
      12    Maharashtra                    82.91
    ⚫Mean = 87.69, std. dev. = 3.306. ⚫The fraction of states with literacy rate in the range (μ-1.5σ, μ+1.5σ) is 11/12 ≈ 0.92. ⚫As predicted by Chebyshev’s inequality, it is at least 1-1/(1.5*1.5) ≈ 0.56. ⚫The bounds predicted by this inequality are loose – but they are correct!
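The figures on this slide can be recomputed from the twelve literacy rates. The sketch below uses only the standard library; `statistics.stdev` uses the N-1 denominator, and the variable names are chosen here for illustration.

```python
from statistics import mean, stdev

# Literacy rates (%) for ranks 1 to 12, as listed on the slide
rates = [93.91, 92.28, 91.58, 87.75, 87.40, 87.07,
         86.55, 86.43, 86.34, 86.27, 83.78, 82.91]

mu = mean(rates)
sigma = stdev(rates)          # sample standard deviation (N-1 denominator)
k = 1.5

# Count the points strictly inside (mu - k*sigma, mu + k*sigma)
inside = sum(1 for r in rates if abs(r - mu) < k * sigma)
fraction_inside = inside / len(rates)
chebyshev_lower_bound = 1 - 1 / k**2

print(f"mean = {mu:.2f}, std. dev. = {sigma:.3f}")
print(f"fraction inside = {inside}/{len(rates)} = {fraction_inside:.3f}")
print(f"Chebyshev lower bound = {chebyshev_lower_bound:.3f}")
```

The observed fraction is well above the bound, which illustrates that Chebyshev's inequality holds for any data set but is often far from tight.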
  • 47. One-sided Chebyshev’s inequality ⚫Also called the Chebyshev-Cantelli inequality. ⚫The proportion of sample points that lie k or more (k>0) standard deviations above the sample mean is less than or equal to $1/(1+k^2)$, that is, $\frac{|\{i : x_i - \mu \ge k\sigma\}|}{N} \le \frac{1}{1+k^2}$. Notice: no absolute value! ⚫Proof: on the board! And in the book.
  • 48. One-sided Chebyshev’s inequality (Another form) ⚫Also called the Chebyshev-Cantelli inequality. ⚫The proportion of sample points that lie k or more (k>0) standard deviations below the sample mean is less than or equal to $1/(1+k^2)$, that is, $\frac{|\{i : x_i - \mu \le -k\sigma\}|}{N} \le \frac{1}{1+k^2}$. Notice: no absolute value! ⚫Proof: on the board! And in the book.
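Both one-sided forms can be checked empirically on a small skewed sample. This is an illustrative sketch: the sample, the values of k and the helper names `upper_tail_fraction` and `lower_tail_fraction` are assumptions, and a check on one data set is of course not a proof.

```python
from statistics import mean, stdev

def upper_tail_fraction(xs, k):
    """Fraction of points at least k standard deviations ABOVE the mean."""
    mu, sigma = mean(xs), stdev(xs)
    return sum(1 for x in xs if x - mu >= k * sigma) / len(xs)

def lower_tail_fraction(xs, k):
    """Fraction of points at least k standard deviations BELOW the mean."""
    mu, sigma = mean(xs), stdev(xs)
    return sum(1 for x in xs if x - mu <= -k * sigma) / len(xs)

xs = [1, 2, 3, 4, 20]   # illustrative skewed sample

for k in (0.5, 1, 1.5):
    bound = 1 / (1 + k**2)
    print(f"k = {k}: upper = {upper_tail_fraction(xs, k):.2f}, "
          f"lower = {lower_tail_fraction(xs, k):.2f}, bound = {bound:.3f}")
```

Note that the one-sided bound 1/(1+k²) is meaningful even for k ≤ 1, where the two-sided bound 1/k² is at least 1 and therefore says nothing.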
  • 49. Correlation between different data values ⚫Sometimes each sample-point can have a pair of attributes. ⚫And it may so happen that large values of the first attribute are accompanied by large (or small) values of the second attribute for a large number of sample-points.
  • 50. Correlation between different data values ⚫Example 1: Populations with higher levels of fat intake show higher incidence of heart disease. ⚫Example 2: People with higher levels of education often have higher incomes. ⚫Example 3: Literacy Rate in India as a function of time?
  • 52. Visualizing such relationships? ⚫Can be done by means of a scatter plot ⚫X axis: values of attribute 1, Y axis: values of attribute 2 ⚫Plot a marker at each such data point. The marker may be a small circle, a +, a *, and so on.
  • 53. Visualizing such relationships? ⚫Image processing example: the intensity value of a pixel plotted against the intensity value of its right neighbor.
  • 54. Correlation coefficient ⚫Let the sample-points be given as (xi,yi), 1 <= i <= N. ⚫Let the sample standard deviations be σx and σy, and the sample means be μx and μy. ⚫The correlation-coefficient is given as: $r(x,y) = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{(N-1)\,\sigma_x \sigma_y}$.
  • 55. Correlation coefficient ⚫The correlation-coefficient is given as: $r(x,y) = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{(N-1)\,\sigma_x \sigma_y}$. ⚫r > 0 means the data are positively correlated (when one attribute is higher, the other tends to be higher). ⚫r < 0 means the data are negatively correlated (when one attribute is higher, the other tends to be lower). ⚫r = 0 means the data are uncorrelated (there is no such linear relationship!). ⚫r is undefined if the standard deviation of either x or y is 0.
  • 56. Correlation coefficient: Properties ⚫The correlation-coefficient is given as: $r(x,y) = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{(N-1)\,\sigma_x \sigma_y}$. ⚫-1 <= r <= 1 always!
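A direct implementation of the correlation coefficient might look as follows. It is a sketch using the algebraically equivalent form in which the (N-1) factors cancel; the function name `correlation` and the sample data are chosen here for illustration.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r of the pairs (xs[i], ys[i]).

    Uses r = S_xy / sqrt(S_xx * S_yy), which equals
    sum((x - mu_x)(y - mu_y)) / ((N - 1) * sigma_x * sigma_y)
    because the (N - 1) factors cancel.
    """
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mu_x) ** 2 for x in xs)
    s_yy = sum((y - mu_y) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when x or y has zero standard deviation")
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print("r =", round(r, 4))
```

The explicit check for zero spread mirrors the remark that r is undefined when either standard deviation is 0.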
  • 57. (Figure: correlation coefficient values for various toy datasets in 2D; for each dataset, a scatter plot is provided, and the coefficient is marked “undefined” for one of them. Source: https://en.wikipedia.org/wiki/Correlation_and_dependence)
  • 58. Correlation coefficient: geometric interpretation ⚫Consider the N values x1, x2, …, xN. We will assemble them into a vector x (1D array) of N elements. ⚫We will also create vector y from y1, y2, …, yN. ⚫Now create vectors x-μx and y-μy – by deducting μx from each element of x, and μy from each element of y. ⚫Note that you may be used to vectors in 2D or 3D, but in statistics or machine learning, we frequently use vectors in N-D!
  • 59. Correlation coefficient: geometric interpretation ⚫Then r(x, y) is basically the cosine of the angle between x-μx and y-μy: $r(x,y) = \frac{(x - \mu_x)\cdot(y - \mu_y)}{\|x - \mu_x\|_2 \, \|y - \mu_y\|_2}$, where $\|v\|_2 = \sqrt{\sum_i v_i^2}$ is the vector magnitude, also called the L2-norm of the vector. ⚫Note that the cosine of an angle has a value from -1 to +1.
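The geometric interpretation can be verified numerically by computing r in two ways: from the standard-deviation formula, and as the cosine of the angle between the mean-subtracted vectors. The sketch below uses plain Python lists as vectors; the helper `l2_norm` and the sample data are illustrative assumptions.

```python
import math
from statistics import mean, stdev

def l2_norm(v):
    """Vector magnitude (L2-norm)."""
    return math.sqrt(sum(a * a for a in v))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Mean-subtracted vectors x - mu_x and y - mu_y
dx = [x - mean(xs) for x in xs]
dy = [y - mean(ys) for y in ys]
dot = sum(a * b for a, b in zip(dx, dy))

# Route 1: cosine of the angle between the two centered vectors
cosine = dot / (l2_norm(dx) * l2_norm(dy))

# Route 2: definition via sample standard deviations (N-1 denominator)
r = dot / ((n - 1) * stdev(xs) * stdev(ys))

print("cosine =", cosine)
print("r      =", r)
```

The two routes agree because the L2-norm of x-μx equals sqrt(N-1) times σx, and likewise for y.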
  • 60. Correlation coefficient: Properties ⚫In the following, a, b, c and d are constants. ⚫If yi = a+bxi where b > 0, then r(x,y) = 1. ⚫If yi = a+bxi where b < 0, then r(x,y) = -1. ⚫If r is the correlation coefficient of the data pairs (xi,yi), 1 <= i <= N, then it is also the correlation coefficient of the data pairs (b+axi,d+cyi) when a and c are non-zero and have the same sign.
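These properties can be spot-checked numerically. The following is a sketch with arbitrarily chosen constants; the helper `correlation` is a plain re-implementation of r written for this example.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient of the pairs (xs[i], ys[i])."""
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mu_x) ** 2 for x in xs)
    s_yy = sum((y - mu_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)

# y = a + b*x with b > 0 and with b < 0
r_line_up = correlation(xs, [3 + 2 * x for x in xs])
r_line_down = correlation(xs, [3 - 2 * x for x in xs])

# (b + a*x, d + c*y): a and c with the same sign, then with opposite signs
r_same_sign = correlation([7 - 2 * x for x in xs], [1 - 0.5 * y for y in ys])
r_opposite_sign = correlation([7 + 2 * x for x in xs], [1 - 0.5 * y for y in ys])

print("r               =", r)
print("b > 0           =", r_line_up)
print("b < 0           =", r_line_down)
print("same signs      =", r_same_sign)
print("opposite signs  =", r_opposite_sign)
```

The last line goes slightly beyond the slide: when a and c have opposite signs, the magnitude of r is unchanged but its sign flips.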
  • 61. Correlation coefficient: a word of caution ⚫Sensitive to outliers! (Figure: two scatter plots, one with r = 1 and one with r = 0.33.)
  • 62. Caution with correlation: Anscombe’s quartet ⚫The correlation coefficient can be a misleading value, and graphical examination of the data is important. ⚫This was illustrated beautifully by a British statistician named Frank Anscombe, who showed four examples that graphically appear very different even though they produce (nearly) identical correlation coefficients. ⚫These examples are famously called Anscombe’s quartet.
  • 63. Caution with correlation: Anscombe’s quartet ⚫In each of these examples, the following quantities were the same (up to rounding): • Mean and variance of x • Mean and variance of y • Correlation coefficient r(x,y) ⚫But the data are graphically very different!
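  • 64. Reflective (or Uncentered) correlation coefficient ⚫A version of the correlation coefficient in which you do not deduct the mean values from the vectors: $r_u(x,y) = \frac{\sum_{i=1}^{N} x_i y_i}{\sqrt{\sum_{i=1}^{N} x_i^2}\,\sqrt{\sum_{i=1}^{N} y_i^2}}$. ⚫The uncentered c.c. is not “translation invariant”: in general, $r_u(x + a, y + b) \ne r_u(x, y)$ for constants a and b.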
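The lack of translation invariance is easy to see on a toy example. This is a sketch in which the function name `uncentered_correlation`, the data and the shift of 100 are illustrative assumptions.

```python
import math

def uncentered_correlation(xs, ys):
    """Cosine of the angle between x and y WITHOUT subtracting the means."""
    dot = sum(x * y for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum(x * x for x in xs))
    norm_y = math.sqrt(sum(y * y for y in ys))
    return dot / (norm_x * norm_y)

xs = [1, 2, 3]
ys = [3, 2, 1]           # perfectly decreasing in x, so the centered r is -1

original = uncentered_correlation(xs, ys)

# Add the same constant to every x and every y
shift = 100
shifted = uncentered_correlation([x + shift for x in xs],
                                 [y + shift for y in ys])

print("uncentered c.c., original data:", round(original, 4))
print("uncentered c.c., shifted data :", round(shifted, 4))
```

After the shift both vectors point almost along the all-ones direction, so the uncentered coefficient approaches 1 even though the usual (centered) correlation coefficient of this data is -1 before and after the shift.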
  • 65. Correlation does not necessarily imply causation ⚫A high correlation between two attributes does not mean that one causes the other. ⚫Example 1: Fast rotating windmills are observed when the wind speed is high. Hence can one say that the windmill rotation produces speedy wind? (a windmill in the literal sense ☺)
  • 66. Correlation does not necessarily imply causation ⚫In example 1, the cause and effect were swapped. High wind speed leads to fast rotation and not vice-versa. ⚫Example 2: High sale of ice-cream is correlated with larger occurrence of drowning. Hence can one say that ice-cream causes drowning? ⚫In this case, there is a third factor that is highly correlated with both – ice-cream sales, as well as drowning. Ice-cream sales and swimming activities are on the rise in the summer!
  • 67. Correlation does not necessarily imply causation ⚫The above statement does not mean that correlation is never associated with causation (example: increase in age does cause increase in height in children or adolescents) – just that it is not sufficient to establish causation. ⚫Consider the argument: “High correlation between tobacco usage and lung cancer occurrence does not imply that smoking causes lung cancer.”
  • 68. Correlation does not necessarily imply causation – but it may! ⚫However, multiple observational studies that eliminate other possible causes do lead to the conclusion that smoking causes cancer! ❑ higher tobacco dosage associated with higher occurrence of cancer ❑ stopping smoking associated with lower occurrence of cancer ❑ higher duration of smoking associated with higher occurrence of cancer ❑ unfiltered (as opposed to filtered) cigarettes associated with higher occurrence of cancer ⚫See https://www.sciencebasedmedicine.org/evidence-in-medicine-correlation-and-causation/ and http://www.americanscientist.org/issues/pub/what-everyone-should-know-about-statistical-correlation for more details.