1. Presented by,
Dr. J.C. Miraclin Joyce Pamila,
Professor & HOD,
Department of CSE,GCT, CBE-13.
miraclin@gct.ac.in
1 08/11/25
FOUNDATIONS OF DATA SCIENCE
DESCRIBING DATA
2. Level of Measurement
Specifies the extent to which a number (or word or letter)
actually represents some attribute and, therefore, has
implications for the appropriateness of various arithmetic
operations and statistical procedures.
Nominal: Categorized
Ordinal: Categorized, Ranked
Interval: Categorized, Ranked, Equally spaced
Ratio: Categorized, Ranked, Equally spaced, and has a natural zero
Data Types:
Nominal Vs Numeric
Variable Types:
Discrete Vs Continuous
3. DESCRIPTIVE STATISTICS
Describing Data with Tables and Graphs
Describing Data with Averages
Describing Variability
Normal Distributions and Standard (z) Scores
Describing Relationships: Correlation
Regression
4. Descriptive Statistics: statistics provides us with tools—tables, graphs,
averages, ranges, correlations—for organizing and summarizing the
inevitable variability in collections of actual observations or scores.
Inferential Statistics: Statistics also provides tools—a variety of tests and
estimates—for generalizing beyond collections of actual observations.
Examples:
(a) Students in my statistics class are, on average, 23 years old.
(b) The population of the world exceeds 7 billion
(c) Either four or eight years have been the most frequent terms of office
actually served by U.S. presidents.
(d) Sixty-four percent of all college students favor right-to-abortion laws
10. OUTLIERS
• Check for accuracy
• Might exclude from summaries
• Might enhance understanding
11. Relative Frequency Distribution
• Relative Frequency
Distributions show the
frequency of each class
as a part or fraction of
the total frequency for
the entire distribution.
• Percentage or
Proportion?
• Improves understanding
• Proportions may not add up to exactly
one because of rounding.
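A minimal sketch of a relative frequency distribution, using only the Python standard library. The `ratings` data and the helper name `relative_frequencies` are hypothetical and chosen only for illustration; the last lines show how rounded proportions can fail to add up to exactly one.

```python
from collections import Counter

def relative_frequencies(values):
    """Return {class: proportion of the total frequency} for each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cls: counts[cls] / total for cls in sorted(counts)}

# Hypothetical data: six answers on a 1-to-3 rating scale.
ratings = [1, 1, 2, 2, 3, 3]
proportions = relative_frequencies(ratings)

for cls, p in proportions.items():
    # Show each class as a proportion and as a percentage.
    print(cls, round(p, 2), f"{100 * p:.0f}%")

# Exact proportions sum to one; proportions rounded to two places may not.
exact_total = sum(proportions.values())
rounded_total = round(sum(round(p, 2) for p in proportions.values()), 2)
print(exact_total, rounded_total)
```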
13. Cumulative Frequency Distribution
• A frequency distribution showing the total
number of observations in each class and
all lower-ranked classes.
• Cumulative percentages are often referred
to as percentile ranks.
• Add to the frequency of each class the
frequencies of all lower-ranked classes
to obtain the cumulative frequency
distribution.
• Percentile Rank: Percentage of scores in
the entire distribution with similar or
smaller values than that score.
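The procedure can be sketched in a few lines of Python. The `scores` list and the function names are hypothetical; the sketch assumes ungrouped data in which each distinct value is its own class.

```python
from collections import Counter
from itertools import accumulate

def cumulative_distribution(values):
    """Return (class, frequency, cumulative frequency, cumulative percent) rows,
    ordered from the lowest class to the highest."""
    counts = Counter(values)
    classes = sorted(counts)
    freqs = [counts[c] for c in classes]
    cum = list(accumulate(freqs))  # running total of the frequencies
    total = cum[-1]
    return [(c, f, cf, 100 * cf / total) for c, f, cf in zip(classes, freqs, cum)]

def percentile_rank(values, score):
    """Percentage of values that are less than or equal to `score`."""
    return 100 * sum(v <= score for v in values) / len(values)

scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]  # hypothetical scores
rows = cumulative_distribution(scores)
for row in rows:
    print(row)
print(percentile_rank(scores, 3))
```

In this sketch the cumulative percent of a class equals the percentile rank of a score in that class.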
14. FREQUENCY DISTRIBUTION FOR QUALITATIVE DATA
• Ordered Qualitative Data: The ordering of the data needs to be
preserved in the frequency table.
• Relative and Cumulative Frequency Distribution: Can be done as
done with Numeric data.
15. GRAPHS FOR QUANTITATIVE DATA
Histogram:
A Bar Graph for Quantitative Data
Common boundaries between adjacent
bars emphasize the continuity of the data,
as with continuous variables.
X-axis: class intervals;
Y-axis: class frequencies;
wiggly lines show breaks in scale.
16. Frequency Polygon – Line Graph
• A Line graph for quantitative data to
emphasize the continuity of
continuous variables.
• In the histogram, place dots at the
midpoint of each bar top or, in the
absence of bar tops, at the midpoints
on the horizontal axis. Connect all the
dots to get a line graph.
• Extend the lower and upper tails to
the midpoints of the previous and next
classes respectively.
17. Stem and Leaf Displays
• To sort Quantitative data on
the basis of leading and
trailing digits.
• Draw a vertical line to separate
the stem (e.g., the tens digit)
from the leaf (the units digit).
• Selection of stems.(Thousands,
Hundreds, One Tenths….)
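A minimal sketch of a stem and leaf display, assuming non-negative whole numbers with the tens digit as the stem and the units digit as the leaf. The `values` list and the function name are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens) to its sorted list of leaves (units).
    Stems with no observations are kept as empty rows."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 24 -> stem 2, leaf 4
        stems[stem].append(leaf)
    return {s: stems.get(s, []) for s in range(min(stems), max(stems) + 1)}

values = [24, 12, 47, 15, 21, 24]  # hypothetical observations
display = stem_and_leaf(values)
for stem, leaves in display.items():
    # The "|" plays the role of the vertical line in the display.
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```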
28. MEASURES OF VARIABILITY
• Measure the amount by which values are dispersed or scattered in
the distribution
• Range, IQR, Variance and Standard Deviation
30. RANGE & VARIANCE
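• Range: Difference between the maximum and minimum values
• The size of the range tends to vary with the size of the group
• Deviation from the mean: Distance between a value and the
mean.
• Deviations above the mean have positive values, while deviations
below the mean have negative values.
• The sum of all these deviations is always zero, because the positive
and negative deviations cancel each other.
Variance: Mean of all squared deviations from the mean (the sum of
squared deviations divided by n for a population, or by n - 1 for a sample).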
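The sketch below illustrates, on hypothetical data, that deviations from the mean cancel out and that squaring them avoids the cancellation. It takes variance as the mean of the squared deviations, dividing by n for a population and by n - 1 for a sample; the variable names are illustrative only.

```python
values = [2, 4, 6, 8]  # hypothetical observations

n = len(values)
mean = sum(values) / n

# Signed distance of each value from the mean.
deviations = [v - mean for v in values]
sum_of_deviations = sum(deviations)  # positives and negatives cancel

# Squaring removes the signs, so the total no longer cancels.
sum_of_squares = sum(d * d for d in deviations)
population_variance = sum_of_squares / n
sample_variance = sum_of_squares / (n - 1)

print(deviations, sum_of_deviations)
print(population_variance, sample_variance)
```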
31. NOT VARIANCE BUT STANDARD DEVIATION
• Variance is expressed in
squared units, which are
hard to interpret.
• Standard Deviation: Square root of the
variance (the mean of all squared deviations)
• It is a rough measure of the average
amount by which values deviate on
either side of the mean.
32. STANDARD DEVIATION
• For most frequency distributions, a majority (often as many as 68%) of the
values are within one standard deviation on either side of the mean.
• For most frequency distributions, a small minority (often as few as 5%) of the
values deviate more than two standard deviations on either side of the mean.
SOLVE:
You grow 20 crystals from a solution and measure the length of each
crystal in millimeters.
9, 2, 5, 4, 12, 7, 8, 11, 9, 3, 7, 4, 12, 5, 4, 10, 9, 6, 9, 4
Calculate the range, mean, and sample standard deviation of the lengths of the
crystals.
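One way to check a hand calculation for this exercise is Python's standard `statistics` module, where `stdev` is the sample standard deviation (it divides by n - 1). This is only a sketch for verifying the arithmetic, not a required method.

```python
import statistics

# Crystal lengths in millimeters, copied from the exercise.
lengths = [9, 2, 5, 4, 12, 7, 8, 11, 9, 3,
           7, 4, 12, 5, 4, 10, 9, 6, 9, 4]

data_range = max(lengths) - min(lengths)  # maximum minus minimum
mean_length = statistics.mean(lengths)
sample_sd = statistics.stdev(lengths)     # sample SD, divides by n - 1

print("range:", data_range)
print("mean:", mean_length)
print("sample standard deviation:", round(sample_sd, 2))
```

The sum of squared deviations here is 178, so the sample variance is 178/19 and the sample standard deviation is about 3.06 mm.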
39. Properties of the Normal Curve
Obtained from a mathematical equation, the normal curve is a
theoretical curve defined for a continuous variable and noted for its
symmetrical bell-shaped form.
■ The normal curve is symmetrical; its lower half is the mirror image of
its upper half.
■ Being bell shaped, the normal curve peaks above a point midway
along the horizontal spread and then tapers off gradually in either
direction from the peak (without actually touching the horizontal
axis, the tails of a normal curve extend infinitely far).
■ The values of the mean, median and mode, located at a point
midway along the horizontal spread, are the same.
41. Z - SCORE
A z score is a unit-free, standardized score that, regardless of the original units
of measurement, indicates how many standard deviations a score is above or
below the mean of its distribution
A z score consists of two parts:
1. a positive or negative sign indicating whether it’s above or
below the mean; and
2. a number indicating the size of its deviation from the mean in
standard deviation units.
(a) Margaret’s IQ of 135, given a mean of 100 and a standard deviation of 15:
z = (135 - 100)/15 = 2.33
(b) a score of 470 on the SAT math test, given a mean of 500 and a standard
deviation of 100: z = (470 - 500)/100 = -0.30
(c) a daily production of 2100 loaves of bread by a bakery, given a mean of 2180
and a standard deviation of 50: z = (2100 - 2180)/50 = -1.60
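The three conversions can be reproduced with a one-line function. The name `z_score` is an illustrative choice; note that the subtraction is done before the division.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_iq = z_score(135, 100, 15)       # Margaret's IQ
z_sat = z_score(470, 500, 100)     # SAT math score
z_bread = z_score(2100, 2180, 50)  # daily loaves of bread

print(round(z_iq, 2), round(z_sat, 2), round(z_bread, 2))
```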
42. STANDARD NORMAL CURVE
If the original distribution approximates a normal curve, then the shift to
standard or z scores will always produce a new distribution that
approximates the standard normal curve.
Standard Normal Curve: The tabled normal curve for z scores, with a
mean of 0 and a standard deviation of 1.
Although there is an infinite number of different normal curves, each
with its own mean and standard deviation, there is only one standard
normal curve, with a mean of 0 and a standard deviation of 1.
45. FINDING PROPORTIONS
Using Table A in Appendix C, find the proportion of the total area identified with
the following statements:
(a) above a z score of 1.80: 0.0359
(b) between the mean and a z score of –0.43: 0.1664
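When a printed table is not at hand, the same areas can be approximated with `statistics.NormalDist` from the Python standard library, whose `cdf` method gives the proportion of area below a value. This is a sketch for checking table lookups, not a replacement for the table-based procedure.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal curve: mean 0, standard deviation 1

# (a) Area above z = 1.80 is one minus the area below it.
area_above = 1 - std_normal.cdf(1.80)

# (b) Area between the mean (z = 0) and z = -0.43.
area_between = std_normal.cdf(0) - std_normal.cdf(-0.43)

print(round(area_above, 4), round(area_between, 4))
```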
46. Assume that GRE scores
approximate a normal curve with a
mean of 500 and a standard
deviation of 100.
(a) Sketch a normal curve and
shade in the target area described
by each of the following statements:
(i) less than 400
(ii) more than 650
(iii) less than 700
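After sketching the curves, the shaded areas can be checked numerically. The sketch below converts each raw score to a z score and then looks up the area under the standard normal curve with `statistics.NormalDist`; the helper name `area_below` is hypothetical.

```python
from statistics import NormalDist

MEAN, SD = 500, 100  # GRE parameters given in the problem

def area_below(score):
    """Proportion of the normal curve below a raw score."""
    z = (score - MEAN) / SD      # convert the raw score to a z score
    return NormalDist().cdf(z)   # area below z on the standard normal curve

less_than_400 = area_below(400)
more_than_650 = 1 - area_below(650)
less_than_700 = area_below(700)

print(round(less_than_400, 4), round(more_than_650, 4), round(less_than_700, 4))
```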
47. EXERCISE
Assume that SAT math scores approximate a normal curve with a mean of 500
and a standard deviation of 100.
(a) Sketch a normal curve and shade in the target area(s) described by each of
the following statements:
(i) more than 570
(ii) less than 515
(iii) between 520 and 540
48. FINDING SCORES
Exam scores for a large psychology class approximate a normal curve with a
mean of 230 and a standard deviation of 50. Furthermore, students are graded
“on a curve,” with only the upper 20 percent being awarded grades of A. What is
the lowest score on the exam that receives an A?
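Finding a score reverses the usual lookup: start from the target area, find the matching z score, then convert it back to the original scale with score = mean + z × standard deviation. A minimal sketch using `NormalDist.inv_cdf` from the standard library follows; the variable names are illustrative.

```python
from statistics import NormalDist

mean, sd = 230, 50  # exam parameters given in the problem

# The upper 20 percent starts where 80 percent of the area lies below.
z_cutoff = NormalDist().inv_cdf(0.80)

# Convert the z score back to an exam score.
lowest_a = mean + z_cutoff * sd

print(round(z_cutoff, 2), round(lowest_a))
```

With a printed table the z score is read as roughly 0.84, which gives a cutoff of about 272.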
49. EXERCISE
Assume that the annual rainfall in the San Francisco area approximates a
normal curve with a mean of 22 inches and a standard deviation of 4
inches. What are the rainfalls for the more atypical years, defined as the
driest 2.5 percent of all years and the wettest 2.5 percent of all years?