1. Lesson Structure
1. What is partiality, preference and prejudice?
2. How to identify partiality, preference and prejudice?
3. Probability for Statistics
4. The Central Limit Theorem
5. Why is the Central Limit Theorem important?
2. Lesson Plan
Subtopics                                               Method
What is partiality, preference and prejudice?           Theory
How to identify partiality, preference and prejudice?   Theory
Probability for Statistics                              Theory
The Central Limit Theorem                               Theory
Why is the Central Limit Theorem important?             Theory
Exercises                                               Practical & Theory
Teacher's Note:
Discussion: What is partiality, preference and prejudice?
The teacher should encourage the students to think of and come up with practical problems that can arise if the data is biased.
CHAPTER
IDENTIFYING PATTERNS
Studying this chapter should enable you to understand:
• How to identify partiality, preference and prejudice
• What is the Central Limit Theorem?
Discussion: Probability for Statistics
The teacher should encourage the students to formulate statistical investigative questions during the activity.
3. What is partiality, preference and prejudice?
We often come across situations where, if we have a special fondness for a particular thing, we tend to be partial towards it. In most cases this affects the outcome, or rather, it skews the outcome in favor of that thing. Naturally, this is not the right way of dealing with data on a larger scale.
This partiality, preference and prejudice towards a set of data is called bias. In Data Science, bias is a deviation from the expected outcome in the data. Fundamentally, you can also think of bias as error in the data. However, this error is often indistinct and goes unnoticed. So the question to ask is: why does bias occur in the first place?
Bias occurs basically because of sampling and estimation. If we knew everything about all the entities in our data and stored information on every probable entity, our data would never have any bias. However, data science is often not conducted under carefully controlled conditions. It is mostly done on "found data", i.e. data that was collected for a purpose other than modelling. That is the reason why such data is very likely to have biases.
The next question that may arise in your mind is: why does bias really matter? The answer is that predictive models consider only the data used to train them. In fact, they know no reality other than the data fed into their system. Naturally, if the data fed into the system is biased, model accuracy and fidelity are compromised. Biased models can also tend to discriminate against certain groups of people. It is therefore very important to eliminate bias in order to avoid these risks.
4. How to identify partiality, preference and prejudice?
We can categorize common statistical and cognitive biases as follows:
1. Selection Bias
2. Linearity Bias
3. Confirmation Bias
4. Recall Bias
5. Survivor Bias
Selection Bias
This type of bias usually occurs when a model itself influences the creation of the data used to train it. Selection bias is said to occur when the sample data gathered is not representative of the true future population of cases that the model will see. This bias occurs mostly in systems that rank content, such as recommendation systems, polls or personalized advertisements. This is because user responses are collected only for the content that is displayed, while user responses to content that is not displayed remain unknown.
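To make this concrete, here is a minimal simulation sketch of the feedback loop described above; the article counts and appeal scores are made up for illustration.

import random

random.seed(0)

# Hypothetical illustration: 1,000 articles, each with a true "appeal"
# (the probability that a random user clicks on it).
true_appeal = [random.uniform(0.0, 1.0) for _ in range(1000)]

# A ranking system shows only its top 100 articles, so click feedback
# is collected for those alone; the rest are never observed.
displayed = sorted(range(1000), key=lambda i: true_appeal[i], reverse=True)[:100]
observed = [true_appeal[i] for i in displayed]

print(f"True average appeal of all articles: {sum(true_appeal) / 1000:.2f}")
print(f"Average appeal in the training data: {sum(observed) / len(observed):.2f}")
# The observed average is far higher than the true one: a model trained on
# this feedback never learns how users respond to content that was not shown.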
Linearity Bias
Linearity bias assumes that a change in one quantity produces an equal and proportional change in another. Unlike selection bias, linearity bias is a cognitive bias. It is produced not through some statistical process but through how we mistakenly perceive the world around us.
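As a small illustration (with invented numbers), consider assuming that a task's cost scales proportionally when it actually grows quadratically:

# Suppose a task's true completion time grows quadratically with input size,
# but we assume "10x the input means 10x the time".
def true_time(n):
    return 0.01 * n * n  # invented quadratic cost, in seconds

observed = true_time(10)       # measure the task at n = 10 -> 1.0 s
linear_guess = observed * 10   # linear extrapolation to n = 100 -> 10.0 s
actual = true_time(100)        # true cost at n = 100 -> 100.0 s

print(f"Linear-bias estimate: {linear_guess:.0f} s, actual: {actual:.0f} s")
# The proportional assumption underestimates the real cost tenfold.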
Confirmation Bias
Confirmation Bias or Observer Bias is an outcome of seeing what you want to see in the data. This can occur when researchers go into a project with subjective thoughts about their study, whether conscious or unconscious. We can also encounter it when labelers allow their subjective thoughts to control their labeling habits, which results in inaccurate data.
Recall Bias
Recall Bias is a type of measurement bias. It is common at the data labeling stage of any project. This type of bias occurs when similar types of data are labeled inconsistently, resulting in lower accuracy. For example, let us say we have a team labeling images of damaged laptops. The laptops are tagged across the labels damaged, partially damaged, and undamaged. Now, if someone in the team labels one image as damaged and a similar image as partially damaged, your data will obviously be inconsistent.
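One simple way to look for this kind of inconsistency is to have two labelers tag the same items and measure their agreement. A minimal sketch, with made-up labels:

# Two labelers tag the same five laptop images (labels are made up).
labels_a = ["damaged", "partially damaged", "damaged", "undamaged", "damaged"]
labels_b = ["damaged", "damaged", "damaged", "undamaged", "partially damaged"]

# Fraction of items on which the two labelers agree.
agreement = sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)
print(f"Labeler agreement: {agreement:.0%}")  # 60% here

# Low agreement on similar items is a warning sign that the labeling
# guidelines (e.g., "damaged" vs "partially damaged") need tightening.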
Survivor Bias
Survivorship bias is based on the idea that we tend to distort data sets by focusing on successful examples and ignoring the failures. This type of bias also occurs when we look at competitors. For example, while starting a business we usually take the examples of businesses in a similar sector that have performed well, and often ignore the businesses that have incurred heavy losses, gone bankrupt, merged, etc.
While it is arguable that we do not want to copy failure, we can still learn a lot by understanding a range of customer experiences. The only way to avoid survivor bias in our systems is to find as many inputs as possible and to study the failures as well as the average performers.
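The effect is easy to demonstrate with a small simulation; all figures below are invented for illustration.

import random

random.seed(1)

# 1,000 hypothetical businesses, each with an annual return (%); most lose
# money, a few do very well.
returns = [random.gauss(-5, 20) for _ in range(1000)]

# An analyst who studies only businesses still operating (return > 0) sees
# a very different picture from the full population.
survivors = [r for r in returns if r > 0]

print(f"Mean return, all businesses: {sum(returns) / len(returns):.1f}%")
print(f"Mean return, survivors only: {sum(survivors) / len(survivors):.1f}%")
# Averaging only the survivors badly overstates the expected outcome.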
5. Probability for Statistics
Probability is all about quantifying randomness. It is the basis of how we make predictions in statistics. We can use probability to predict how likely or unlikely particular events may be. We can also, if needed, make informal predictions beyond the scope of the data we have analyzed.
Probability is a very essential tool in statistics. The following two problems, and the nature of their solutions, illustrate the difference between probability and statistics.
Problem 1: Assume a coin is "fair".
Question: If the coin is tossed 10 times, how many times will we get "tail" on the top face?
Problem 2: You pick up a coin.
Question: Is this a fair coin? That is, does each face have an equal chance of appearing?
Problem 1 is a mathematical probability problem. Problem 2 is a statistics problem that can use the mathematical probability model determined in Problem 1 as a tool to seek a solution. The answer to neither question is deterministic. Tossing a coin produces random outcomes, which suggests that the answer is probabilistic. The solution to Problem 1 starts with the assumption that the coin is fair, and then logically deduces the numerical probability of each possible count of "tails" resulting from 10 tosses. The possible counts are 0, 1, ..., 10.
The solution to Problem 2 starts with an unfamiliar coin; we do not know whether it is fair or biased. The search for an answer is experimental: toss the coin, see what happens, and examine the resulting data to see whether they look as if they came from a fair coin or a biased coin. One possible approach to making this judgement would be the following: toss the coin 10 times and record the number of "tails". Repeat this 10-toss process 100 times, compiling the number of "tails" in each of these 100 trials. Compare these results to the frequencies predicted by the mathematical model for a fair coin in Problem 1. If the frequencies from the experiment are quite dissimilar from those predicted by the model, and the difference is not likely to be due to chance variability, then we can conclude that the coin is not fair.
In Problem 1, we form our answer from
logical deductions. In Problem 2, we
form our answer by observing
experimental results.
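The two approaches sit naturally side by side in code. In this minimal sketch, the fair-coin model of Problem 1 is computed exactly, while the unknown coin of Problem 2 is simulated (p_tail is a stand-in you can change to mimic a biased coin):

import random
from math import comb

random.seed(2)

# Problem 1 (mathematics): for a fair coin, the count of tails in 10 tosses
# follows a Binomial(10, 0.5) distribution.
model = {k: comb(10, k) * 0.5 ** 10 for k in range(11)}

# Problem 2 (statistics): experiment with the unknown coin. Here the coin is
# simulated; set p_tail to something other than 0.5 to mimic a biased coin.
p_tail = 0.5
trials = [sum(random.random() < p_tail for _ in range(10)) for _ in range(100)]

# Compare the observed frequencies with the fair-coin model.
for k in range(11):
    observed = trials.count(k) / len(trials)
    print(f"{k:2d} tails: model {model[k]:.3f}, observed {observed:.3f}")
# Large, systematic disagreement would suggest the coin is not fair.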
6. The Central Limit Theorem
The Central Limit Theorem states that the distribution of sample means approaches a normal distribution as the sample size gets larger, irrespective of the shape of the population distribution.
The Central Limit Theorem is a statistical theory stating that, given a significantly large sample size drawn from a population with finite variance, the mean of all samples drawn from that population will be roughly equal to the mean of the population. This holds true regardless of whether the source population is normal or skewed, provided that the sample size is significantly large.
A few points to note about the Central Limit Theorem:
✓ The Central Limit Theorem states that the distribution of sample means nears a normal distribution as the sample size gets bigger.
✓ Sample sizes equal to or greater than 30 are generally considered sufficient for the Central Limit Theorem to hold.
✓ A key aspect of the Central Limit Theorem is that the average of the sample means equals the population mean, while the standard deviation of the sample means equals the population standard deviation divided by the square root of the sample size.
✓ A significantly large sample size can predict the characteristics of a population very accurately.
Let us now understand the Central Limit Theorem with the help of an example. Consider that there are 50 houses in your area, and each house has 5 people. Our task is to calculate the average weight of the people in your area.
The usual approach that the majority follow is:
1. Measure the weights of all the people in your area
2. Add all the weights
3. Divide the total sum of weights by the total number of people to calculate the average
However, the question here is: what if the size of the data is enormous? Does this way of calculating the average make sense? Of course, the answer is no. Measuring the weight of every person would be a very tiring and lengthy process.
As a workaround, there is an alternative approach we can take:
1. To start with, draw groups of people at random from your area. We will call each group a sample. We will draw multiple samples, each consisting of 30 people
2. Calculate the individual mean of each sample set
3. Calculate the mean of these sample means
4. Finally, a histogram of the sample mean weights will resemble a normal distribution (a minimal simulation of these steps appears below)
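Here is that procedure as a short simulation sketch. The weights are made up and deliberately drawn from a skewed distribution, to show that the sample means still form a bell shape:

import random
from statistics import mean, stdev

random.seed(3)

# 250 hypothetical people (50 houses x 5 people) with made-up weights drawn
# from a skewed, clearly non-normal distribution.
population = [40 + random.expovariate(1 / 25) for _ in range(250)]

# Steps 1-3: draw many random samples of 30 people and average each one.
sample_means = [mean(random.sample(population, 30)) for _ in range(1000)]

print(f"Population mean:          {mean(population):.1f} kg")
print(f"Mean of sample means:     {mean(sample_means):.1f} kg")
print(f"Std dev of sample means:  {stdev(sample_means):.2f} kg")
# Step 4: a histogram of sample_means would look roughly bell-shaped even
# though the population is skewed, and its spread is close to the population
# standard deviation divided by sqrt(30).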
This is what the Central Limit Theorem
is all about. Now let us move ahead and
understand what the formula for the
central limit theorem is.
μx̄ = μ
And,
σx̄ = σ/√n
Where,
μ = Population mean
σ = Population standard deviation
μx̄ = Sample mean
σx̄ = Sample standard deviation
n = Sample size
Now that we have understood what the central limit theorem is and seen the formula to calculate it, let us have a look at an example of the Central Limit Theorem formula; afterwards, we will learn why the theorem is so important and what its real-life applications are.
Example:
In India, the recorded weights of the male population follow a normal distribution. The mean and the standard deviation are 68 kgs and 10 kgs, respectively. If a sample of 50 males is drawn from the population, what would be the mean and the standard deviation of the chosen sample?
Here,
Mean of Population (μ) – 68 kgs
Population Standard Deviation (σ) – 10 kgs
Sample size (n) – 50
Solution:
The mean of the sample is the same as the mean of the population, since the sample size > 30. Hence, the sample mean is 68 kgs.
The sample standard deviation is calculated using the formula:
σx̄ = σ/√n
Thus, Sample Standard Deviation = 10/√50 ≈ 1.41 kgs.
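A quick check of this arithmetic in code, using the two formulas given above:

from math import sqrt

# Checking the worked example against the formulas above.
mu, sigma, n = 68, 10, 50

sample_mean = mu               # sample mean equals the population mean
sample_std = sigma / sqrt(n)   # standard error = sigma / sqrt(n)

print(f"Sample mean: {sample_mean} kgs")
print(f"Sample standard deviation: {sample_std:.2f} kgs")  # ~1.41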
7. Why is the Central Limit Theorem important?
The Central Limit Theorem states that no matter what the distribution of the population is, the shape of the sampling distribution will approach normality as the sample size increases. This is helpful because a researcher never knows which mean in the sampling distribution matches the population mean; however, by selecting many random samples from the population, the sample means will cluster together, allowing the researcher to make a good estimate of the population mean. Furthermore, as the sample size increases, the error decreases.
Some practical implementations of the Central Limit Theorem include:
1. Voting polls estimate the count of people who support a particular election candidate. The poll results that news channels report with confidence intervals are calculated using the Central Limit Theorem (a sketch of such an interval appears after this list).
Activity 3.1
Read about how the population of your city is measured and how the Central Limit Theorem can help in studying such a large group.
2. The Central Limit Theorem can also be used to estimate the mean family income for a specific region.
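As an illustration of the first application, here is a minimal sketch of a CLT-based 95% confidence interval for a poll; the counts are made up:

from math import sqrt

# Made-up poll: 540 of 1,000 people surveyed support a candidate.
n, supporters = 1000, 540
p = supporters / n

# By the Central Limit Theorem, the sample proportion is approximately
# normal with standard error sqrt(p * (1 - p) / n); 1.96 standard errors
# on each side give a 95% confidence interval.
se = sqrt(p * (1 - p) / n)
low, high = p - 1.96 * se, p + 1.96 * se

print(f"Estimated support: {p:.1%} (95% CI: {low:.1%} to {high:.1%})")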
Recap
• In Data Science, bias is a deviation from the expected outcome in the data.
• Selection bias is said to occur when the sample data gathered is not representative of the true future population of cases that the model will see.
• Linearity bias assumes that a change in one quantity produces an equal and proportional change in another.
• Confirmation Bias or Observer Bias is an outcome of seeing what you want to see in the data.
• Recall bias occurs when similar types of data are labeled inconsistently.
• Survivorship bias is the tendency to distort data sets by focusing on successful examples and ignoring the failures.
• The Central Limit Theorem states that the distribution of sample means approaches a normal distribution as the sample size gets larger, irrespective of the shape of the population distribution.
Exercises
Objective Type Questions
Please choose the correct option in the questions below.
1. What is the Data Science term used to describe partiality, preference, and
prejudice?
a) Bias
b) Favoritism
c) Influence
d) Unfairness
Answer: a
2. Which of the following is NOT a type of bias?
a) Selection Bias
b) Linearity Bias
c) Recall Bias
d) Trial Bias
Answer: d
3. Which of the following is not a correct statement about probability?
a) It must have a value between 0 and 1
b) It can be reported as a decimal or a fraction
c) A value near 0 means that the event is not likely to occur/happen
d) It is the collection of several experiments
Answer: d
4. The central limit theorem states that the sampling distribution of the sample mean is approximately normal if
a) All possible samples are selected
b) The sample size is large
c) The standard error of the sampling distribution is small
Answer: b
5. The central limit theorem says that the mean of the sampling distribution of the
sample mean is
a) Equal to the population mean divided by the square root of the sample size
b) Close to the population mean if the sample size is large
c) Exactly equal to the population mean
Answer: c
6. Samples of size 25 are selected from a population with mean 40 and standard deviation 7.5. The mean of the sampling distribution of the sample mean is
a) 7.5
b) 8
c) 40
Answer: c
Standard Questions
1. Explain what bias is and why it occurs in data science.
2. Explain Selection Bias with the help of an example.
3. Explain Recall Bias with the help of an example.
4. Explain Linearity Bias with the help of an example.
5. Explain Confirmation Bias with the help of an example.
6. What is the central limit theorem?
7. What is the formula for the central limit theorem?
8. What are the real-life applications of the central limit theorem?
9. Why is the central limit theorem important?
10. Coaches of various sports around the world use probability to better their game and create gaming strategies. Can you explain how probability is applied in this case and how it helps players?
Higher Order Thinking Skills
1. As per reports, in October 2019 researchers found that an algorithm used on more than 200 million people in US hospitals to predict which patients would likely need extra medical care heavily favored white patients over black patients. Can you reason about what must have caused this bias and categorize it into the types of bias that you learnt in this chapter?
Teacher's Notes: The teacher should revise the concept and types of bias learnt in this chapter and encourage critical thinking among students to categorize this problem into the right type.