1. Lesson Structure
   1. What is partiality, preference and prejudice?
   2. How to identify partiality, preference and prejudice?
   3. Probability for Statistics
   4. The Central Limit Theorem
   5. Why is the Central Limit Theorem important?
2. Lesson Plan

   Subtopics                                                Method
   What is partiality, preference and prejudice?            Theory
   How to identify partiality, preference and prejudice?    Theory
   Probability for Statistics                               Theory
   The Central Limit Theorem                                Theory
   Why is the Central Limit Theorem important?              Theory
   Exercises                                                Practical & Theory
Teacher’s Note:
Discussion: What is partiality, preference and prejudice?
The teacher should encourage the students to think and come up with practical problems that can arise if the data is biased.
CHAPTER
IDENTIFYING PATTERNS
Studying this chapter should enable you to understand:
• How to identify partiality, preference and prejudice
• What is the Central Limit Theorem?
Discussion: Probability for Statistics
The teacher should encourage the
students to formulate statistical
investigative questions during the
activity.
3. What is partiality, preference and prejudice?
We often come across situations where, if we have a special fondness for a particular thing, we tend to be slightly partial towards it. In most cases this affects the outcome, or, you could say, it skews the outcome in favor of that thing. Naturally, this is not the right way of dealing with data on a larger scale.
This partiality, preference and prejudice towards a set of data is called a bias. In Data Science, bias is a deviation from the expected outcome in the data. Fundamentally, you can also think of bias as an error in the data. However, this error is often indistinct and goes unnoticed. So the question to ask is: why does bias occur in the first place?
Bias basically occurs because of sampling and estimation. If we knew everything about all the entities in our data and stored information on every probable entity, our data would never have any bias. However, data science is often not conducted in carefully controlled conditions. It is mostly done on “found data”, i.e. data that was collected for a purpose other than modelling. That is why this data is very likely to have biases.
The next question that may arise in your mind is: why does bias really matter? Well, the answer is that predictive models only consider the data that is used for training. In fact, they know no reality other than the data that is fed into their system. Naturally, if the data that is fed into the system is biased, model accuracy and fidelity are compromised. Biased models can also tend to discriminate against certain groups of people. Therefore, it is very important to eliminate bias to avoid these risks.
4. How to identify partiality, preference and prejudice?
We can categorize common statistical and cognitive biases in the following ways:
1. Selection Bias
2. Linearity Bias
3. Confirmation Bias
4. Recall Bias
5. Survivor Bias
Selection Bias
This type of bias usually occurs when a model itself influences the creation of the data that is used to train it. Selection bias is said to occur when the sample data that is gathered is not representative of the true future population of cases that the model will see. This bias occurs mostly in systems that rank content, such as recommendation systems, polls or personalized advertisements. This is because user responses to the content that is displayed are collected, while user responses to the content that is not displayed are unknown.
Linearity Bias
Linearity bias is the assumption that a change in one quantity produces an equal and proportional change in another. Unlike selection bias, linearity bias is a cognitive bias. It is produced not through some statistical process but rather through how we mistakenly perceive the world around us.
Confirmation Bias
Confirmation Bias, or Observer Bias, is an outcome of seeing what you want to see in the data. This can occur when researchers go into a project with subjective thoughts about their study, whether conscious or unconscious. We can also encounter it when labelers allow their subjective thoughts to control their labeling habits, which results in inaccurate data.
Recall Bias
Recall bias is a type of measurement bias. It is common at the data labeling stage of a project. This type of bias occurs when you label similar types of data inconsistently, which results in lower accuracy. For example, suppose we have a team labeling images of damaged laptops. The laptops are tagged across the labels damaged, partially damaged, and undamaged. Now, if someone in the team labels one image as damaged and a similar image as partially damaged, your data will obviously be inconsistent.
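To make the idea concrete, here is a minimal sketch in Python, using made-up (image_id, label) pairs, of how a labeling team could automatically flag images that have received conflicting labels:

```python
# A minimal sketch with made-up data: flag images that received
# conflicting labels from the labeling team.
from collections import defaultdict

# (image_id, label) pairs produced by different labelers -- example data only
labels = [
    ("laptop_01", "damaged"),
    ("laptop_01", "partially damaged"),  # same image, conflicting label
    ("laptop_02", "undamaged"),
    ("laptop_02", "undamaged"),
    ("laptop_03", "damaged"),
]

labels_per_image = defaultdict(set)
for image_id, label in labels:
    labels_per_image[image_id].add(label)

# Any image with more than one distinct label is a potential recall-bias problem.
inconsistent = {img: lbls for img, lbls in labels_per_image.items() if len(lbls) > 1}
print("Inconsistently labeled images:", inconsistent)  # flags laptop_01
```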
Survivor Bias
Survivorship bias is based on the idea that we tend to twist data sets by focusing on successful examples and ignoring the failures. This type of bias also occurs when we look at competitors. For example, while starting a business we usually take the examples of businesses in a similar sector that have performed well and often ignore the businesses that have incurred heavy losses, gone bankrupt, merged, and so on.
While it is arguable that we do not want to copy failures, we can still learn a lot by understanding a range of customer experiences. The only way to avoid survivor bias in our systems is to find as many inputs as possible and to study the failures as well as the average performers.
5. Probability for Statistics
Probability is all about quantifying randomness. It is the basis of how we make predictions in statistics. We can use probability to predict how likely or unlikely particular events may be. We can also, if needed, make informal predictions beyond the scope of the data we have analyzed.
Probability is a very essential tool in statistics. The following two problems, and the nature of their solutions, illustrate the difference between probability and statistics.
Problem 1: Assume a coin is “fair”.
Question: If the coin is tossed 10 times, how many times will we get a “tail” on the top face?
Problem 2: You pick up a coin.
Question: Is this a fair coin? That is, does each face have an equal chance of appearing?
Problem 1 is a mathematical probability problem. Problem 2 is a statistics problem that can use the mathematical probability model determined in Problem 1 as a tool to seek a solution.
The answer to neither question is deterministic. Tossing a coin produces random outcomes, which suggests that the answer is probabilistic. The solution to Problem 1 starts with the assumption that the coin is fair and then logically deduces the numerical probability of each possible count of “tails” resulting from the 10 tosses. The possible counts are 0, 1, ..., 10.
The solution to Problem 2 starts with an unfamiliar coin; we do not know if it is fair or biased. The search for an answer is experimental: toss the coin, see what happens, and examine the resulting data to see whether it looks as if it came from a fair coin or a biased coin. One possible approach to making this judgement would be the following: toss the coin 10 times and record the number of times you got a “tail”. Repeat this 10-toss experiment 100 times, and compile the number of “tails” you got in each of these 100 trials. Compare these results to the frequencies predicted by the mathematical model for a fair coin in Problem 1. If the frequencies from the experiment are quite dissimilar from those predicted by the model, and the difference is not likely to be due to chance variability, then we can conclude that the coin is not fair.
In Problem 1, we form our answer from
logical deductions. In Problem 2, we
form our answer by observing
experimental results.
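The experimental approach to Problem 2 can also be tried in code. Below is a minimal simulation sketch (not part of the original problems) that repeats the 10-toss experiment 100 times for a simulated coin and compares the observed counts of “tails” with the counts expected under the fair-coin model from Problem 1:

```python
# A simulation sketch of the coin-tossing experiment described above.
import random
from collections import Counter
from math import comb

TOSSES, TRIALS = 10, 100
p_tail = 0.5  # set to e.g. 0.7 to simulate a biased coin

# Repeat the 10-toss experiment 100 times, recording the number of tails each time.
tail_counts = [sum(random.random() < p_tail for _ in range(TOSSES))
               for _ in range(TRIALS)]
observed = Counter(tail_counts)

# Compare with the fair-coin (binomial) model from Problem 1.
for k in range(TOSSES + 1):
    expected = TRIALS * comb(TOSSES, k) * 0.5 ** TOSSES
    print(f"{k:2d} tails: observed {observed.get(k, 0):3d} times, "
          f"expected about {expected:5.1f} under a fair coin")
```

If the simulated coin is biased (for example, p_tail = 0.7), the observed counts will pile up well away from the fair-coin expectations, which is exactly the kind of evidence the experimental approach looks for.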
6. The Central Limit Theorem
The Central Limit Theorem states that the distribution of sample means approaches a normal distribution as the sample size gets larger, irrespective of the shape of the population distribution.
More formally, the Central Limit Theorem is a statistical theory stating that, given a sufficiently large sample size from a population with finite variance, the mean of all sample means drawn from that population will be roughly equal to the mean of the population. This holds true regardless of whether the source population is normal or skewed, provided that the sample size is sufficiently large.
A few points to note about the Central Limit Theorem are:
✓ The Central Limit Theorem states that the distribution of sample means nears a normal distribution as the sample size gets bigger.
✓ Sample sizes equal to or greater than 30 are generally considered sufficient for the Central Limit Theorem to hold.
✓ A key aspect of the Central Limit Theorem is that the average of the sample means will equal the population mean, and the standard deviation of the sample means will equal the population standard deviation divided by the square root of the sample size.
✓ A sufficiently large sample size can predict the characteristics of a population very accurately.
Let us now understand the Central Limit Theorem with the help of an example. Consider that there are 50 houses in your area, and each house has 5 people. Our task is to calculate the average weight of the people in your area.
The usual approach that the majority would follow is:
1. Measure the weights of all the people in your area
2. Add all the weights
3. Divide the total sum of the weights by the total number of people to calculate the average
However, the question here is: what if the size of the data is enormous? Does this way of calculating the average make sense? Of course, the answer is no. Measuring the weight of all the people would be a very tiring and lengthy process.
As a workaround, there is an alternative approach that we can take:
1. To start with, draw groups of people at random from your area. We will call each group a sample. We will draw multiple samples, each consisting of 30 people
2. Calculate the individual mean of each sample
3. Calculate the mean of these sample means
4. In addition, a histogram of the sample mean weights will resemble a normal distribution (see the simulation sketch below)
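The following is a minimal simulation sketch of these four steps. The weights are made up (drawn uniformly between 40 and 100 kgs) purely for illustration:

```python
# A simulation sketch of the four steps above, using hypothetical weights
# for the 250 people (50 houses x 5 people) in the area.
import random
import statistics

random.seed(1)
population = [random.uniform(40, 100) for _ in range(250)]  # assumed weights in kgs

SAMPLE_SIZE, NUM_SAMPLES = 30, 200

# Steps 1 and 2: draw many random samples and compute the mean of each sample.
sample_means = [statistics.mean(random.sample(population, SAMPLE_SIZE))
                for _ in range(NUM_SAMPLES)]

# Step 3: the mean of the sample means is close to the true population mean.
print("True average weight  :", round(statistics.mean(population), 2))
print("Mean of sample means :", round(statistics.mean(sample_means), 2))

# Step 4: a histogram of sample_means would look roughly bell-shaped (normal),
# even though the individual weights were not drawn from a normal distribution.
```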
This is what the Central Limit Theorem
is all about. Now let us move ahead and
understand what the formula for the
central limit theorem is.
μx̄ = μ
And,
σx̄ = σ/√n
Where,
μ = Population mean
σ = Population standard deviation
μx̄ = Sample mean (the mean of the sampling distribution of the sample mean)
σx̄ = Sample standard deviation (the standard deviation of the sampling distribution of the sample mean)
n = Sample size
Now that we have understood what the Central Limit Theorem is and the formula to calculate it, let us look at a worked example, and then learn why the theorem is so important and what its real-life applications are.
Let us have a look at the example below of the Central Limit Theorem formula:
Example:
In India, the recorded weights of the male population follow a normal distribution. The mean and the standard deviation are 68 kgs and 10 kgs, respectively. If a researcher records the weights of a sample of 50 males from the population, what would be the mean and the standard deviation of the chosen sample?
Here,
Mean of Population (μ) – 68 kgs
Population Standard Deviation (σ) – 10 kgs
Sample size (n) – 50
Solution:
The mean of the sample is the same as the mean of the population, so the sample mean is 68 kgs (the sample size is greater than 30, so the Central Limit Theorem applies).
The sample standard deviation is calculated using the formula below:
σx̄ = σ/√n
Thus, Sample Standard Deviation = 10/√50 ≈ 1.41 kgs.
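A quick check of this arithmetic in code (a small sketch, not part of the original solution):

```python
# Quick check of the worked example: the standard error is sigma / sqrt(n).
from math import sqrt

population_mean = 68   # kgs
population_sd = 10     # kgs
n = 50

sample_mean = population_mean        # CLT: mean of the sampling distribution = population mean
sample_sd = population_sd / sqrt(n)  # standard deviation of the sampling distribution

print(f"Sample mean: {sample_mean} kgs")
print(f"Sample standard deviation: {sample_sd:.2f} kgs")  # about 1.41 kgs
```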
7. Why is the Central Limit Theorem important?
The Central Limit Theorem states that no matter what the distribution of the population is, the shape of the sampling distribution will approach normality as the sample size increases. This is helpful because a researcher never knows whether the mean of any single sample is the same as the population mean; however, by selecting many random samples from the population, the sample means will cluster together, allowing the researcher to make a good estimate of the population mean. Moreover, as the sample size increases, the error in this estimate decreases.
Some practical applications of the Central Limit Theorem include:
1. Voting polls estimate the number of people who support a particular election candidate. The confidence intervals reported by news channels for such polls are calculated using the Central Limit Theorem (a sketch of such a calculation appears after item 2 below).
Activity 3.1
Read about how the population of your city is measured and how the Central Limit Theorem can help in estimating the size of a large population.
2. The Central Limit Theorem can also
be used to calculate the mean
family income for a specific region.
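As an illustration of the first application above, here is a minimal sketch using made-up poll numbers (1,000 respondents, 520 supporters) to show how the Central Limit Theorem justifies the 95% confidence interval reported for a poll:

```python
# A minimal sketch with made-up poll numbers: a 95% confidence interval for
# the proportion of voters supporting a candidate, justified by the CLT.
from math import sqrt

n = 1000           # assumed number of people polled
supporters = 520   # assumed number who said they support the candidate

p_hat = supporters / n                          # sample proportion
standard_error = sqrt(p_hat * (1 - p_hat) / n)

# The CLT lets us treat p_hat as approximately normally distributed, so a
# 95% confidence interval is roughly p_hat +/- 1.96 standard errors.
low, high = p_hat - 1.96 * standard_error, p_hat + 1.96 * standard_error
print(f"Estimated support: {p_hat:.1%} (95% CI: {low:.1%} to {high:.1%})")
```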
Recap
• In Data Science, bias is a deviation from the expected outcome in the data.
• Selection bias is said to occur when the sample data that is gathered is not representative of the true future population of cases that the model will see.
• Linearity bias assumes that a change in one quantity produces an equal and proportional change in another.
• Confirmation Bias or Observer Bias is an outcome of seeing what you want to see in the data.
• Recall bias occurs when you label similar types of data inconsistently.
• Survivorship bias is our tendency to twist data sets by focusing on successful examples and ignoring the failures.
• The Central Limit Theorem states that the distribution of sample means approaches a normal distribution as the sample size gets larger, irrespective of the shape of the population distribution.
Exercises
Objective Type Questions
Please choose the correct option in the questions below.
1. What is the Data Science term used to describe partiality, preference, and
prejudice?
a) Bias
b) Favoritism
c) Influence
d) Unfairness
Answer: a
2. Which of the following is NOT a type of bias?
a) Selection Bias
b) Linearity Bias
c) Recall Bias
d) Trial Bias
Answer: d
3. Which of the following is not a correct statement about probability?
a) It must have a value between 0 and 1
b) It can be reported as a decimal or a fraction
c) A value near 0 means that the event is not likely to occur/happen
d) It is the collection of several experiments
Answer: d
4. The central limit theorem states that the sampling distribution of the sample mean is approximately normal if
a) All possible samples are selected
b) The sample size is large
c) The standard error of the sampling distribution is small
Answer: b
5. The central limit theorem says that the mean of the sampling distribution of the
sample mean is
a) Equal to the population mean divided by the square root of the sample size
b) Close to the population mean if the sample size is large
c) Exactly equal to the population mean
Answer: c
6. Samples of size 25 are selected from a population with mean 40 and standard deviation 7.5. The mean of the sampling distribution of the sample mean is
a) 7.5
b) 8
c) 40
Answer: c
Standard Questions
1. Explain what bias is and why it occurs in data science.
2. Explain Selection Bias with the help of an example.
3. Explain Recall Bias with the help of an example.
4. Explain Linearity Bias with the help of an example.
5. Explain Confirmation Bias with the help of an example.
6. What is the Central Limit Theorem?
7. What is the formula for the Central Limit Theorem?
8. What is a real-life application of the Central Limit Theorem?
9. Why is the Central Limit Theorem important?
10. The coaches of various sports around the world use probability to improve their game and create gaming strategies. Can you explain how probability is applied in this case and how it helps players?
Higher Order Thinking Skills
1. As per reports, in October 2019 researchers found that an algorithm used on more than 200 million people in US hospitals to predict which patients would likely need extra medical care heavily favored white patients over black patients. Can you reason about what must have caused this bias and categorize it into one of the types of bias that you learnt about in this chapter?
Teacher’s Notes: The teacher should revise the concept and types of bias learnt in this chapter and encourage critical thinking among students to categorize this problem into the right type of bias.