Chapter2 slides-part 2-harish complete

STT 200
STATISTICAL METHODS
Chapter 2:
Foundation for Inference – Part 2

2.5 The Central Limit Theorem
■ Here are a few of the null distributions we’ve looked at
throughout this chapter.
■ What patterns can you identify regarding their shape?
The null distribution for testing
independence of precipitation vs.
type of day of the week.
The null distribution for testing
whether a new sales pitch is
better than a current one.
2

Government shutdown
The null distribution for testing whether network
political inclination is independent of reported polling
results about the government shutdown.
3

The Central Limit Theorem (CLT) - 1
■ In Chapter 1, we learned distributions are often left- or right-
skewed.
■ However, these null distributions are symmetric and relatively
bell-shaped.
■ Do you think this is a coincidence?
■ TOP HAT
■ This is NOT a coincidence! The shape of these distributions is,
in fact, mathematically guaranteed by the Central Limit
Theorem.
■ It says that under certain conditions, in the limit, certain
sample statistics are approximately normally distributed!
■ Average behavior is normal.
■ Prof. Shlomo Levental
4

The Central Limit Theorem (CLT) - 2
■ If we look at a proportion (or difference in proportions)
and the scenario meets certain conditions,
■ then the sample proportion (or difference in proportions)
will appear to follow a bell-shaped curve called
a normal distribution.
The equation of the curve is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$
Center is Mean: 𝜇
Spread is SD: 𝜎
5
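To make the curve's equation concrete, here is a minimal Python sketch that evaluates it term by term and compares the result with the standard library's `statistics.NormalDist`. The function name `normal_pdf` and the sample points are illustrative choices, not part of the lecture.

```python
import math
from statistics import NormalDist


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Height of the N(mu, sigma) curve at x, written out term by term."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coefficient * math.exp(exponent)


# Compare the hand-written formula with the standard library at a few points.
for x in (-1.0, 0.0, 2.5):
    by_hand = normal_pdf(x, mu=0.0, sigma=1.0)
    by_library = NormalDist(mu=0.0, sigma=1.0).pdf(x)
    print(f"x = {x:5.1f}   formula = {by_hand:.6f}   library = {by_library:.6f}")
```

The height of the curve is not itself a probability; probabilities come from areas under the curve.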

Conditions for the CLT
In order for the CLT to apply, two conditions must be true:
1. Observations in the sample(s) are independent.
■ Independence is often guaranteed in an observational study by
taking a random sample from a population.
■ It can also be guaranteed in the context of a controlled experiment if
we randomly divide individuals into treatment groups.
2. The sample size is sufficiently large.
■ In order for the null distribution to take on the shape of a normal
curve, we must have gathered a sufficiently large sample of data,
regardless of whether it is an observational study or controlled
experiment.
■ Just how large is large enough?
■ That differs from one context to the next, and we’ll provide
guidelines as we encounter them through the rest of the semester.
6

2.6 The Normal Distribution
Here are three different normal curves. What do they share in common?
Normal curves always have the following five characteristics:
1. Unimodal (single peak)
2. Symmetric
3. Bell Shaped (or Mound Shaped)
4. Center is the mean and Spread is the standard deviation
5. Area under any normal curve is probability and Total area/probability is 1
7

Shape of the Normal Curve
■ Despite these common characteristics, normal
distributions can look quite different, as you can see
above.
■ Specifically, the normal distribution can be adjusted
using two parameters, the mean and the standard
deviation.
■ Change Center: Changing the mean of a normal curve
shifts the curve to the left or right.
■ Change Spread: Changing the standard deviation of a
normal curve
stretches or constricts the curve around the mean.
8

Labelling the Normal Curve
■ If a normal curve has mean 𝜇 and standard deviation 𝜎, statisticians
will write the distribution as the 𝑁(𝜇, 𝜎) distribution.
■ The three distributions above can be written (from left to right) as
the 𝑁(0,1), the 𝑁(1,1.5) and 𝑁(−2, 0.7) distributions.
9

Standard Normal Distribution
■ Because the mean and standard
deviation describe a normal
distribution exactly, they are called
the distribution’s parameters.
■ MATRIX Movie: NEO
■ SPECIAL NORMAL DISTRIBUTION:
When a normal curve has
mean 𝜇 = 0 and standard deviation
𝜎 = 1, we label the curve the
Standard normal or N(0,1) or Z curve.
10

Using Calculator to find Probabilities, Areas
and Percentiles on Page 81
To find a probability if a data value is known:
2nd Vars – “normalcdf” – enter “lower limit, upper limit, mean, sd”
Example: 𝑃(900 ≤ 𝑋 ≤ 1200) where 𝜇 = 1060 and 𝜎 = 195
Enter 2nd Vars – normalcdf(900, 1200, 1060, 195) enter.
Answer: 0.557644
To find data values when given an area (or percentage):
2nd Vars – “invNorm” – enter (area to the left as a decimal, mean, sd)
Example: Find the score or data value corresponding to the 80th
percentile where 𝜇 = 1060 and 𝜎 = 195
2nd Vars – invNorm(0.80, 1060, 195) enter
Answer: 1224
Step-by-step instructions with examples at
https://msu.edu/~fairbour/MSU/CalculatorHelps/NormalCurveCalcInstructions.pdf
Normal tables in olden days!!!
11
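For anyone working without a graphing calculator, the two calculator commands can be sketched in Python with the standard library's `statistics.NormalDist`. The helper names `normalcdf` and `invnorm` simply mirror the calculator menu; they are assumptions for this sketch, not built-in Python functions.

```python
from statistics import NormalDist


def normalcdf(lower: float, upper: float, mean: float, sd: float) -> float:
    """Area under the N(mean, sd) curve between lower and upper."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(upper) - dist.cdf(lower)


def invnorm(area_left: float, mean: float, sd: float) -> float:
    """Data value with the given area (a decimal between 0 and 1) to its left."""
    return NormalDist(mu=mean, sigma=sd).inv_cdf(area_left)


sat_probability = normalcdf(900, 1200, 1060, 195)
sat_80th_percentile = invnorm(0.80, 1060, 195)
print(f"P(900 <= X <= 1200) = {sat_probability:.4f}")
print(f"80th percentile     = {sat_80th_percentile:.1f}")
```

Both results should agree with the calculator answers on the slide up to rounding.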

Example: SAT scores
■ Cumulative SAT scores are approximated well by a
normal model, 𝑁(1060, 195).
■ Provide a sketch of the approximating normal curve.
12

Applying Z Scores: SAT scores 1
1. Approximately what proportion of test takers score between
900 and 1200 on the SAT?
Given: Data Values
To find: Proportion/Probability/Area
Sketch and label center data values
Press 2nd Press Vars
Choose Normalcdf
Lower: 900
Upper: 1200
Mean: 1060
SD: 195
Answer: 0.5576
13

Finding probabilities – known data values
Step 1: Sketch a picture of the area you’re trying to find.
Step 2: Compute the area using a calculator / computer
software.
■ To find a probability if a data value is known:
– 2ND VARS – “normalcdf” – enter “lower limit, upper
limit, mean, sd”
■ Example: 𝑃(900 ≤ 𝑋 ≤ 1200)
– Press 2nd Vars – normalcdf (900, 1200, 1060, 195)
Enter.
– Answer 0.557644
14

2. A randomly-selected SAT test-taker is about to sit for the test.
Nothing is known about her aptitude. What is the probability that
she scores at least 1300 on her SATs?
Given: Data Values
To find: Probability/Area
2nd Vars Normalcdf
Lower: 1300
Upper: 10^10
Mean: 1060
SD: 195
Answer: 0.1092
15

3. Another SAT test-taker is taking the SAT for a second time
after earning a 1100 on his first attempt. What was the
percentile of his first score?
DO THE FOLLOWING NOW!!!
■ Sketch the normal curve
■ Label Center
■ Mark the data values
■ Shade the required area
■ Find the probability using GC
■ Think: What is the lower limit?
■ TOP HAT
16

3. Another SAT test-taker is taking the SAT for a second time
after earning a 1100 on his first attempt. What was the
percentile of his first score?
Answer:
■ Normalcdf
Lower: -10^10
Upper: 1100
Mean: 1060
SD: 195
Answer: 58th Percentile
17
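The "at least 1300" question and the percentile question are both one-sided areas. On the calculator a very large or very small limit stands in for infinity; in software the cumulative area can be used directly. A minimal sketch, assuming the N(1060, 195) SAT model from the slides (variable names are illustrative):

```python
from statistics import NormalDist

sat = NormalDist(mu=1060, sigma=195)

# P(X >= 1300) is the complement of the area to the left of 1300.
p_at_least_1300 = 1.0 - sat.cdf(1300)

# The percentile of a score of 1100 is the area to the left of 1100, as a percentage.
percentile_of_1100 = 100.0 * sat.cdf(1100)

print(f"P(X >= 1300)       = {p_at_least_1300:.4f}")
print(f"Percentile of 1100 = {percentile_of_1100:.0f}")
```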

4. What is the SAT score of someone who scores at the 80th
percentile?
Given: Percentile/Probability
To find: Data value
Should we use Normalcdf???
No, USE 2nd Vars invNorm
Must enter left side area!!!
Area: 0.80
Mean: 1060
SD: 195
Answer: 1224
18

Finding data values with known area
Step 1: Sketch a picture with the data value you’re trying to
find.
Step 2: Compute the data value using a calculator /
computer software.
■ To find data values when given an area (or percentage):
– 2ND VARS– “invnorm” –(enter area to the left as a
decimal, mean, sd)
■ Example: Find data value corresponding to 80th
percentile.
– 2ND VARS – invNorm(0.80, 1060, 195) enter
– Answer: 1224.116
19

Standardizing with Z scores: Formula!
■ Often, it is valuable to quantify how far an observation falls from its
mean or expected value.
■ Recall that the SD gives us the typical distance an
observation falls from its mean or expected value.
■ Standardized score or z-score:
$$z = \frac{\text{observed value} - \text{expected value}}{\text{standard deviation}} = \frac{x - \mu}{\sigma}$$
■ We can interpret a z-score as quantifying the number of standard
deviations an observation falls from its mean or expected value.
■ Applying this formula converts a Normal Variable X into the
Standard Normal Variable Z.
■ There is an interesting connection between the area under the
normal curves of X and Z.
20

ACT vs. SAT
There are two major tests of readiness for college, the ACT and
the SAT.
■ ACT scores are reported on a scale from 1 to 36. The
distribution of ACT scores for more than 1 million students in a
recent high school graduating class was mound-shaped and
symmetric with
– mean = 20.8 and sd = 4.8.
■ SAT scores are reported on a scale from 400 to 1600. The
distribution of SAT scores for 1.4 million students in the same
graduating class was mound-shaped and symmetric with
– mean = 1060 and sd = 195.
■ Tonya scores 1320 on the SAT. Jessie scores 28 on the ACT.
■ Both seem to have done well.
■ But can you tell who did better on their respective test?
21

Who did better on college prep test?
■ ACT: average = 20.8, SD = 4.8
■ SAT: average = 1060, SD = 195
– Tonya scores 1320 on the SAT. Jessie scores 28 on
the ACT. Assuming that both tests measure the same
thing, who has the higher score (relatively)?
■ Question: Who did better?
■ Who is further away from the mean?
■ Calculate the z-scores:
■ TOP HAT
$$z = \frac{x - \mu}{\sigma}$$
22
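The comparison between the two tests comes down to applying the z-score formula once per student. A minimal Python sketch using the means and SDs given on the slide (the function name `z_score` is an illustrative choice):

```python
def z_score(observed: float, mean: float, sd: float) -> float:
    """Number of standard deviations the observation lies from the mean."""
    return (observed - mean) / sd


tonya_z = z_score(1320, mean=1060, sd=195)  # SAT
jessie_z = z_score(28, mean=20.8, sd=4.8)   # ACT

print(f"Tonya  (SAT 1320): z = {tonya_z:.2f}")
print(f"Jessie (ACT 28):   z = {jessie_z:.2f}")
```

The student with the larger z-score sits further above the mean of their own test, measured in standard deviations.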

Interpreting z-scores Extra question
■ Police department salaries in San Francisco have a
mean of $90,702 and an sd of $45,321.
■ A chief of police’s salary has a z-score of z = 4.84.
Interpret this z-score.
– A. The chief of police makes 4.84 times as much as
the average employee salary.
– B. The chief of police’s salary is nearly 5 standard
deviations above the average employee salary.
– C. Only 4.84% of employees make more than the
chief of police.
■ TOP HAT
23

Z-score formula preserves …
■ Use normalcdf to calculate the approximate percentage
of students who scored better than Jessie on the ACT.
■ Use normalcdf to calculate the area above z = 1.5 using the
𝑁(0, 1) distribution. Recall, Jessie’s z-score is 1.5.
PAGE 82: Add this before finding probabilities with z-scores.
■ What do you notice about the probabilities?
■ TOP HAT
24

Z-score formula preserves …
P(Z > 1.5) = 0.0668 and P(ACT > 28) = 0.0668
Recall that Jessie’s ACT score was 28 and her z-score was 1.5.
25

Z-score formula preserves probability
■ Use normalcdf to calculate the approximate percentage of
students who scored better than Jessie on the ACT.
■ Use normalcdf to calculate the area above z = 1.5 using the
𝑁(0, 1) distribution. Recall, Jessie’s z-score is 1.5.
PAGE 82: Add this before finding probabilities with z-scores.
■ What do you notice about the probabilities?
■ The probabilities are the same.
■ Why?
■ Z-score formula preserves area under normal curves
(probability).
P(ACT < a) = P( Z < a*)
where a* is the z-score of a.
26
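The claim that the z-score formula preserves area can be checked numerically. This sketch, which assumes the N(20.8, 4.8) ACT model from the slides, computes the right-tail area once on the ACT scale and once on the standard normal scale:

```python
from statistics import NormalDist

act = NormalDist(mu=20.8, sigma=4.8)  # ACT model from the slides
z = NormalDist(mu=0.0, sigma=1.0)     # standard normal, N(0, 1)

jessie_score = 28
jessie_z = (jessie_score - act.mean) / act.stdev

p_act = 1.0 - act.cdf(jessie_score)   # P(ACT > 28)
p_z = 1.0 - z.cdf(jessie_z)           # P(Z > 1.5)

print(f"P(ACT > 28) = {p_act:.4f}")
print(f"P(Z > 1.5)  = {p_z:.4f}")
```

The two areas match because standardizing only relabels the horizontal axis; it does not change how much of the curve lies beyond the observation.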

Finding probabilities for z scores (1)
1. Find P(−1 ≤ 𝑍 ≤ 1)
In words, find the probability that the standard normal
variable takes values within one standard deviation of the
mean.
Hint: Remember Matrix? Neo!
Use normalcdf with 𝑁(0, 1)
Lower: -1
Upper: 1
Mean: 0
SD: 1
Answer: 0.6827
27

Finding probabilities for z scores (2)
2. What is the probability that a standard normal variable Z
is within 2 standard deviations of mean?
That is, find P(-2 ≤ Z ≤ 2).
■ TOP HAT
28

Finding probabilities for z-scores (3)
2. What is the probability that a standard normal variable Z
is within 2 standard deviations of mean?
That is, find P(-2 ≤ Z ≤ 2).
■ TOP HAT
■ Answer: normalcdf (-2, 2, 0, 1) = 0.9545
3. Find P(-3 ≤ Z ≤ 3).
Answer: normalcdf (-3, 3, 0, 1) = 0.9973
29

Normal curves and the empirical rule
For data that follows a normal distribution,
■ Approximately 68% of the data will have a z-score
between -1 and 1.
■ Approximately 95% of the data will have a z-score
between -2 and 2.
■ Approximately 99.7% of the data will have a z-score
between -3 and 3.
■ So, in general:
■ 68%, 95% and 99.7%
lie within one, two and
three SDs of the
mean.
30
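The three percentages in the empirical rule are rounded areas under the standard normal curve. A short Python sketch that recomputes them (the dictionary name `within` is illustrative):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal, N(0, 1)

within = {}
for k in (1, 2, 3):
    # Area between -k and k standard deviations of the mean.
    within[k] = z.cdf(k) - z.cdf(-k)
    print(f"within {k} SD of the mean: {100 * within[k]:.2f}%")
```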

Finding z-scores for probabilities
4. What z-scores provide the bounds for the middle 50% of the standard
normal distribution?
■ THINK: What do we want to find? Two bounds so that the middle area is 0.5.
■ Let us call the upper bound a.
■ Then the lower bound is -a by symmetry. So, we just need to find a.
■ THINK: Which function should we use in GC? TOP HAT
■ THINK: If the middle area is 0.5, what is the left side area for the
negative bound -a?
■ THINK: If the middle area is 0.5, what is the left side area for the
positive bound a? TOP HAT
Z~N(0,1)
What is this
z-score?
31

4. What z-scores provide the bounds for the middle 50% of
the standard normal distribution?
On the Z curve we have middle 50%
Remaining area is 50%, so 25% on each side!
32

4. What z-scores provide the bounds for the middle 50% of the standard normal
distribution?
The left side area for a is 0.75, not 75; it must be a decimal between 0 and 1.
Answer: invNorm(0.75, 0, 1) = 0.674 and invNorm(0.25, 0, 1) = -0.674
So, -0.674 and 0.674 are the bounds for the middle 50%.
33
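The reasoning above (split the leftover area equally between the two tails, then look up the left-side area) works for any middle area. A minimal Python sketch; the function name `middle_bounds` is a hypothetical helper, not a calculator command:

```python
from statistics import NormalDist


def middle_bounds(middle_area: float):
    """Return (-a, a) so that the area between them under N(0, 1) is middle_area."""
    tail = (1.0 - middle_area) / 2.0          # leftover area in each tail
    upper = NormalDist().inv_cdf(1.0 - tail)  # left-side area for the positive bound
    return -upper, upper


lower_50, upper_50 = middle_bounds(0.50)
print(f"middle 50%: {lower_50:.3f} to {upper_50:.3f}")
```

Calling `middle_bounds(0.95)` repeats the same steps with a left-side area of 0.975.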

What z-scores provide the bounds for the middle 95% of the standard
normal distribution?
■ THINK: Which function?
■ THINK: What is the left side area?
■ TOP HAT
34

■ THINK: Which function?
■ THINK: What is the left side area?
■ TOP HAT
■ Answer: invNorm (0.975, 0, 1) = 1.96
■ So, -1.96 and 1.96 are the bounds for the middle 95% of
the standard normal distribution.
35

Example: IQ scores (1)
■ IQ test scores are formulated to be normally distributed, that
is, they follow the shape of a normal curve.
Suppose we have the following sample of 30 IQ scores:
■ Let us check whether these scores could have come from a Normal model.
The sample has Mean: 97.1 and SD: 11.7.
1. How many of these scores are within 1 standard deviation of
the mean? HINT: Count them!
■ First find the range of IQ scores that are within 1 SD of the
mean.
■ Mean ± 1 x SD = 97.1 ± 11.7. So, between 85.4 and 108.8
65 80 81 83 85 89 90 91 91 92 94 95 97 97 97
97 99 100 101 101 101 102 104 105 106 107 109 112 120 121
36

■ IQ test scores are formulated to be normally distributed,
that is, they follow the shape of a normal curve.
Suppose we have the following sample of 30 IQ scores:
1. How many of these scores are within 1 standard
deviation of the mean? HINT: Count them!
■ We count 21 IQ scores between 85.4 and 108.8
■ So, 21/30 = 70% of IQ scores lie within 1 SD of the
mean
65 80 81 83 85 89 90 91 91 92 94 95 97 97 97
97 99 100 101 101 101 102 104 105 106 107 109 112 120 121
37

■ IQ test scores are known to be normally distributed, that
is, they follow the shape of a normal curve. Suppose we
have the following sample of 30 IQ scores:
■ How many of these scores are within 2 standard
deviations of the mean?
■ Mean ± 2 x SD = 97.1 ± 2 x 11.7
■ That is, between 73.7 and 120.5
■ We count 28 of the scores, so 28/30 = 93.33%
65 80 81 83 85 89 90 91 91 92 94 95 97 97 97
97 99 100 101 101 101 102 104 105 106 107 109 112 120 121
38

■ IQ test scores are known to be normally distributed, that is, they
follow the shape of a normal curve. Suppose we have the following
sample of 30 IQ scores:
■ How many of these scores are within 3 standard
deviations of the mean?
■ Mean ± 3 x SD = 97.1 ± 3 x 11.7
■ All of them. So, 30/30 = 100%.
■ Do you think the data follow the empirical rule well?
■ Yes. The observed 70%, 93.33% and 100% are close to 68%, 95% and
99.7%, so it is reasonable to believe that these data come from a
Normal population.
65 80 81 83 85 89 90 91 91 92 94 95 97 97 97
97 99 100 101 101 101 102 104 105 106 107 109 112 120 121
39
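The counting in this example can be double-checked with a few lines of Python. The sketch below uses the 30 scores and the rounded mean and SD quoted on the slides; the variable names are illustrative.

```python
iq_scores = [
    65, 80, 81, 83, 85, 89, 90, 91, 91, 92, 94, 95, 97, 97, 97,
    97, 99, 100, 101, 101, 101, 102, 104, 105, 106, 107, 109, 112, 120, 121,
]
mean, sd = 97.1, 11.7  # rounded summary values quoted on the slide

counts = {}
for k in (1, 2, 3):
    low, high = mean - k * sd, mean + k * sd
    # Count the scores that fall within k standard deviations of the mean.
    counts[k] = sum(low <= score <= high for score in iq_scores)
    share = 100.0 * counts[k] / len(iq_scores)
    print(f"within {k} SD ({low:.1f} to {high:.1f}): {counts[k]} scores, {share:.1f}%")
```

With only 30 observations the percentages can move in steps of about 3.3 points, so a rough match to 68%, 95% and 99.7% is all one should expect.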

Evaluating the normal approximation
■ Many data sets can be well-approximated by the normal
distribution.
■ We saw earlier that SAT scores, ACT scores and IQ scores
are well-approximated by the normal model.
■ While these models are helpful and convenient, we must
remember that they are only an approximation.
■ Often, it is important to evaluate just how good (or bad)
of an approximation the normal model is when applied to
a scenario.
■ There are two simple visual ways to assess whether a
normal approximation is appropriate:
1. Histogram
2. QQ-plot (normal probability plot)
40

Histograms and QQ plots
■ Here are the histogram and QQ plot for the IQ score example:
■ A QQ-plot is short for quantile-quantile plot
■ (quantile = percentile)
■ The sample’s quantiles/percentiles go on the vertical axis and Z’s
quantiles/percentiles on the horizontal axis
■ For example, the sample Q1 will be matched with Z’s Q1, etc.
■ In a QQ-plot, the closer the dots are to a perfect straight line, the more
confident we can be that our data follow the normal model.
41
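The points of a QQ-plot can be computed without any plotting library. In this sketch each sorted IQ score is paired with a standard normal quantile. The plotting positions (i - 0.5)/n are one common convention and are an assumption here; statistical software may use a slightly different rule.

```python
from statistics import NormalDist

iq_scores = [
    65, 80, 81, 83, 85, 89, 90, 91, 91, 92, 94, 95, 97, 97, 97,
    97, 99, 100, 101, 101, 101, 102, 104, 105, 106, 107, 109, 112, 120, 121,
]
n = len(iq_scores)
z = NormalDist()  # standard normal, N(0, 1)

# Assumed plotting positions: the i-th smallest value sits at quantile (i - 0.5) / n.
theoretical = [z.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
qq_pairs = list(zip(theoretical, sorted(iq_scores)))

# Show the three lowest and three highest points (horizontal, vertical).
for z_quantile, sample_quantile in qq_pairs[:3] + qq_pairs[-3:]:
    print(f"Z quantile = {z_quantile:6.3f}   sample quantile = {sample_quantile}")
```

Plotting `qq_pairs` as dots gives the QQ-plot; points that fall close to a straight line support the normal model.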

Histograms and QQ-plots 1
Consider the following histograms of data sets along
with the QQ-plots:
42

Histograms and QQ-plots 3
For which of the data sets would you recommend using a
normal curve to model the distribution?
TOP HAT
44

2.7 Applying the normal model (1)
Standard Error
■ Sample statistics or Point estimates vary from sample to
sample, and it is often valuable to quantify that variability
with what is called the standard error (SE).
■ The standard error of a point estimate is
approximately equal to the standard deviation associated
with the estimate.
■ For example, if we look at the normal approximation of
the distribution of sample proportions, then the standard
error will be used as the standard deviation of sample
proportions, SE_{\hat{p}}.
45

Applying the normal model (2)
■ For instance, if we had a sample statistic/point estimate
with 𝑆𝐸 = 4.2 units, that would mean that this point
estimate, over many repeated samples, would be
approximately 4.2 units away from the parameter it
estimates, on average.
■ The way we compute the 𝑆𝐸 of a point estimate varies for
different types of point estimates.
■ We will cover these computations in more detail in later
chapters. For now, let’s return to some familiar research
scenarios.
46

Example: “Boomeranging” (1)
■ We are on PAGE 86
■ Recall the research study that investigated whether the rate at
which men ‘boomeranged’ back to their parents’ homes as
adults had changed from its 1997 level of 13%.
■ The researchers took a random sample of 150 adult men and
found that 25 of them had left their parental home and then
returned.
The hypotheses that were tested were:
■ 𝐻0: There has been no change in the rate of ‘boomeranging’
among young men. The percentage is still 13% and any
difference in the sample is due to chance.
■ 𝐻𝑎: There has been a change in the rate of ‘boomeranging’ for
young men.
47

1. Use the information above to report a point estimate for
the current rate of ‘boomeranging’ among young men, the
approximate p-value from this randomization test, and then
evaluate the evidence with regards to the associated
hypotheses.
■ Sample proportion or Point estimate = 25/150 = 0.1667
■ The approximate p-value from two tails is 0.2265
■ There is very little evidence that the null model is not a
good fit for the observed results.
2. Now try to reproduce this p-value using a normal
distribution approach.
49

2. Create a quick sketch in the space below of a normal
distribution centered at 0.13 with a standard error of
0.0275.
50

Interpret the standard error of 0.0275.
Hint for Exam2:
■ Here our sample proportion of boomerangers
has a standard error, SE = 0.0275.
■ INTERPRETATION: This means that point estimates/sample
proportions, on average, over many repeated samples, would be
approximately 0.0275 units away from the parameter/population
proportion of boomerangers of 0.13.
51
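One way to see what a standard error of 0.0275 describes is to simulate many samples under the null model and look at how much the sample proportions spread out. The sketch below assumes independent draws with a 13% chance of boomeranging; the seed and the number of repetitions are arbitrary choices, not values from the lecture.

```python
import random
import statistics

random.seed(200)  # fixed seed so the run is repeatable

P_NULL = 0.13       # population proportion under the null model
SAMPLE_SIZE = 150   # men per sample, as in the study
REPETITIONS = 5000  # number of simulated samples (an arbitrary choice)

sample_proportions = []
for _ in range(REPETITIONS):
    # Each man boomerangs with probability 0.13, independently of the others.
    boomerangers = sum(random.random() < P_NULL for _ in range(SAMPLE_SIZE))
    sample_proportions.append(boomerangers / SAMPLE_SIZE)

simulated_se = statistics.stdev(sample_proportions)
print(f"mean of sample proportions = {statistics.mean(sample_proportions):.4f}")
print(f"SD of sample proportions   = {simulated_se:.4f}")
```

The standard deviation of the simulated proportions should land close to the 0.0275 quoted on the slide.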

2. Calculate the Z score using the observed ‘boomeranging’
rate of 25/150 = 0.1667, along with the mean and standard
error of the normal model. [Notice how we use the standard
error of the statistic as the standard deviation to find the z
score.]
$$z = \frac{x - \mu}{\sigma} = \frac{0.1667 - 0.13}{0.0275} = 1.3345$$
52

3. Identify the p-value corresponding
to this z score.
■ Recall, area under Z curve is the same
as corresponding area under the
normal curve for sample proportions
■ normalcdf(1.3345, 10^10, 0, 1)
■ Right tail area = 0.091.
Or could do,
normalcdf(0.1667, 10^10, 0.13, 0.0275)
This is not recommended, as later on, we
have different formulas for S.E. and it gets
confusing.
53

Recall:
Observed boomeranging rate
is 0.1667
𝑧 = 1.3345
■ What is the definition of p-value?
■ Under the null model, it is the probability of observing a
Boomeranging rate at least as far away from 0.13 as the observed
0.1667.
■ TOP HAT: To find p-value should we multiply the right tail area
by 2? How can we tell whether to multiply by 2 or not???
54

3. Identify the p-value corresponding to this z score. How
does it compare to the p-value from the randomization
simulation? Would we make the same evaluation regarding
the null model?
■ In terms of z-scores, it is the probability of observing a z-score of
1.3345 and beyond, that is, z-scores less than -1.3345 and z-scores
greater than 1.3345.
■ We have a two tailed test, so by symmetry,
p-value = 2 x 0.091 = 0.182
■ For randomization based test
the approximate p-value was 0.2265.
■ Here, we have a slightly smaller p-value
but the evaluation is still the same!
■ There is LITTLE evidence that
the null model IS NOT a good fit
for the observed results.
55
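The z score and the two-tailed p-value for the boomeranging example can be reproduced in a few lines. This sketch uses the rounded rate 0.1667 and the standard error 0.0275 exactly as they appear on the slides; the variable names are illustrative.

```python
from statistics import NormalDist

observed_rate = 0.1667   # 25/150, rounded as on the slides
null_rate = 0.13         # rate under the null model
standard_error = 0.0275  # given on the slides

z_score = (observed_rate - null_rate) / standard_error
right_tail = 1.0 - NormalDist().cdf(abs(z_score))

# The alternative says "a change", so both tails count; by symmetry, double one tail.
two_sided_p_value = 2.0 * right_tail

print(f"z = {z_score:.4f}")
print(f"right tail area   = {right_tail:.3f}")
print(f"two-sided p-value = {two_sided_p_value:.3f}")
```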

Example: Web Design (1)
■ Recall the observational study undertaken by an online
art gallery to see whether investing in a website redesign
would increase the percentage of premium accounts
from its current rate of 25%.
■ The gallery surveyed 500 users and found that 150 of
them say they would continue or purchase a premium
account if the new features were included.
The hypotheses tested were:
■ 𝐻0: The percentage of premium accounts will not change
after including additional features on the website; it will
remain at 25%. Any difference from this rate is due to
chance involved in the sampling process.
■ 𝐻𝑎: The percentage of premium accounts will change
after including additional features on the website.
56

From the randomization test, we could see that the p-value
was small and we were inclined to think that the percentage
of premium accounts might increase.
1. Now try to reproduce this p-value using a normal
distribution approach. Create a quick sketch in the space
below of a normal distribution centered at 0.25 with a
standard error of 0.019.
57

2. Calculate a Z score using the observed premium-account
rate of 150/500 = 0.3, along with the mean and standard error of
the normal model. [Notice how we use the standard error of
the statistic as the standard deviation for the z score.]
■ TOP HAT: To find p-value should we multiply the right tail
area by 2? How can we tell whether to multiply by 2 or
not???
$$z = \frac{x - \mu}{\sigma} = \frac{0.3 - 0.25}{0.019} = 2.6316$$
58

3. Identify the p-value corresponding to this z score. How does it
compare to the p-value from the randomization simulation? Would we
make the same evaluation regarding the null hypothesis?
■ normalcdf(2.6316, 10^10, 0, 1)
■ P-value = 0.0042 (one-tail test)
■ From Section 2.4, for the randomization based test, we had a
p-value less than 0.001, that is, extremely strong evidence that
the null model is not a good fit for our data.
■ Whereas, here we only have very strong evidence that the null
model is not a good fit.
■ In both cases, we will recommend redesign of the website to the
owners.
59
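The one-tail calculation for the web design example can be sketched as a small helper. The function name `right_tail_p_value` is hypothetical; the inputs are the observed rate 0.3, the null rate 0.25 and the standard error 0.019 from the slides.

```python
from statistics import NormalDist


def right_tail_p_value(observed: float, null_value: float, standard_error: float) -> float:
    """One-tail p-value: area to the right of the observed z score under N(0, 1)."""
    z_score = (observed - null_value) / standard_error
    return 1.0 - NormalDist().cdf(z_score)


web_design_p_value = right_tail_p_value(observed=0.30, null_value=0.25, standard_error=0.019)
print(f"one-tail p-value = {web_design_p_value:.4f}")
```

Only the right tail is used here, matching the one-tail label on the slide, so the area is not doubled.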

2.8 Confidence Intervals
We are on PAGE 88
■ A sample statistic provides a single plausible value for a
population parameter using collected sample data.
■ That is why, we call this sample statistic a point estimate.
■ Sometimes, it is more useful to provide a plausible range
of values for that parameter.
■ Statisticians call this plausible range of values a
confidence interval.
■ Suppose we have a large bin filled
with small green and white balls
and we want to know the proportion
of white balls in the
bin/box/population.
60

Example: Estimating the proportion
Question: What is the proportion of white balls in the bin?
In this scenario, we could count all the balls and find the
population proportion but that would take too much time.
■ Instead, let’s take a random sample of, say, 100 balls.
■ If our sample of 100 balls has 59 white and 41 green
balls, what is the sample proportion of white balls?
■ Sample proportion of white balls: p̂_white = 59/100 = 0.59
■ We call this sample proportion a point estimate.
■ Based on this sample point estimate, give a plausible
range of values for the population proportion.
■ How confident are you that your answer is correct? TOP
HAT
61

Sample estimates vary
■ We know based on experience that sample estimates
vary, and that under certain conditions they will follow a
normal model.
■ The graph shows 3000 sample proportions for this
scenario.
■ Our sample proportion of p̂ = 0.59 is somewhere in this
distribution, and likely near the center of the distribution but
not at the center.
62

Standard Error
■ Say, we can also estimate the standard error to be
0.048.
■ Since we’re talking about a sampling model, we call this
standard deviation the standard error.
■ We will learn how to calculate the standard error in
chapter 3.
63

Range of plausible values
■ Sample proportion is p̂ = 0.59
■ Standard error is 0.048
■ Use what you know about the normal distribution to construct
a new range of plausible values for the population proportion
of white balls.
■ Hint: Use empirical rule!
■ How confident are you that your answer is correct? TOP HAT
■ App for confidence intervals
https://shiny.stt.msu.edu/fairbour/Confidence/
■ How likely is it that the interval includes the true population
proportion? TOP HAT
■ How many out of 100 intervals do you expect to contain the
true population proportion? TOP HAT
64

The gist of confidence intervals (1)
1. The value of the sample estimate will vary from one
sample to the next.
■ The values vary around the population parameter.
2. The standard error of the sample estimate provides an
idea of how far away it would tend to vary from the
parameter value (on average).
3. The general format for a confidence interval is given by:
■ Sample (point) estimate ± (a few) standard errors
4. The “few” or number of standard errors we go out each
way from the sample estimate will depend on what coverage
rate (i.e., how confident) we want to be.
65

The gist of confidence intervals (2)
5. The “how confident” we want to be is referred to as the
confidence level.
■ This level reflects how confident we are in the procedure.
■ The confidence level is the percentage of the time we
expect the procedure to produce an interval that
contains the population parameter.
■ Most of the intervals that are made will contain the truth
about the population, but occasionally an interval will be
produced that does not contain the true parameter
value.
■ Each interval either contains the population parameter
or it doesn’t.
67

How many standard errors?
We are at the bottom of PAGE 89
■ This depends on the confidence level.
■ Given the standard normal distribution, what are the
boundaries for the middle 95%?
■ Use invNorm(0.975, 0, 1), see page 83 on Lecture guide.
■ -1.96 and 1.96 are the middle 95% boundaries.
■ If we set “a few” to be z = 1.96, then we can expect that 95%
of the sample proportions will be in the interval
population parameter ± 1.96 standard errors
■ This is a fact.
■ But, recall we are estimating the population parameter,
meaning we do not know the population parameter.
■ We know our point estimate/sample proportion!!!
68

Procedure and confidence
Let us look at this the other way round:
point estimate ± 1.96 standard errors
69

Constructing Confidence Intervals
Calculate the interval
point estimate ± 1.96 standard errors
for each of the following:
Standard error is 0.048
■ Sample 1 has point estimate, p̂ = 0.59
■ Answer: (0.49592, 0.68408)
■ Sample 2 has p̂ = 0.71
■ Answer: (0.61592, 0.80408)
■ Sample 3 has p̂ = 0.75
■ Answer: (0.65592, 0.84408)
70

Which intervals include the parameter?
■ In this example, we actually know the true proportion is
𝑝 = 0.66.
■ Which of the intervals you calculated included this value?
(0.49592, 0.68408)
(0.61592, 0.80408)
(0.65592, 0.84408)
Population proportion = 0.66
71
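The three intervals, and the check of whether each one contains the true proportion, can be reproduced with a short Python sketch. The function name `confidence_interval` and the constant names are illustrative; the numbers are the ones given on the slides.

```python
Z_STAR = 1.96
STANDARD_ERROR = 0.048
TRUE_PROPORTION = 0.66  # known only because this is a constructed example


def confidence_interval(point_estimate: float, standard_error: float, multiplier: float = Z_STAR):
    """Return (lower, upper) for point estimate +/- multiplier * standard error."""
    margin = multiplier * standard_error
    return point_estimate - margin, point_estimate + margin


results = {}
for p_hat in (0.59, 0.71, 0.75):
    lower, upper = confidence_interval(p_hat, STANDARD_ERROR)
    contains_truth = lower <= TRUE_PROPORTION <= upper
    results[p_hat] = (lower, upper, contains_truth)
    print(f"p-hat = {p_hat:.2f}: ({lower:.5f}, {upper:.5f})  contains 0.66? {contains_truth}")
```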

Key Idea
■ Because 95% of sample proportions are within 1.96
standard errors of the population parameter,
approximately 95% of the intervals we create using this
procedure will include the parameter.
■ More practice with this idea at
https://shiny.stt.msu.edu/fairbour/Confidence/
72
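The key idea can also be checked by simulation. The sketch below draws many sample proportions from the normal model N(0.66, 0.048) used in the bin example, builds the interval around each one, and records how often the interval captures 0.66. Drawing directly from the normal model, the seed and the number of repetitions are all assumptions made for this illustration.

```python
import random

random.seed(2024)  # fixed seed so the run is repeatable

TRUE_PROPORTION = 0.66
STANDARD_ERROR = 0.048
Z_STAR = 1.96
REPETITIONS = 10_000  # arbitrary number of simulated samples

hits = 0
for _ in range(REPETITIONS):
    # Assumption: sample proportions follow the normal model N(0.66, 0.048).
    p_hat = random.gauss(TRUE_PROPORTION, STANDARD_ERROR)
    lower = p_hat - Z_STAR * STANDARD_ERROR
    upper = p_hat + Z_STAR * STANDARD_ERROR
    if lower <= TRUE_PROPORTION <= upper:
        hits += 1

coverage = hits / REPETITIONS
print(f"share of intervals containing 0.66: {coverage:.3f}")
```

The share should come out near 0.95, and it describes the procedure over many samples rather than any single interval.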

Example 1: Pass or Fail?
■ Earlier, we encountered a research scenario where an
engineering instructor examined whether students with
an urban/suburban background are more likely to pass
the course than rural/small-town students. The table
below shows the results of this study.
Student Background   Pass   Fail   Total
Urban/Suburban         52     13      65
Rural/Small-town       30     25      55
Total                  82     38     120
73

Example 1: Pass or Fail? Interval
■ The point estimate suggests that students from an urban/suburban
background are more likely to pass the course:
$$\hat{p}_{Urban} - \hat{p}_{Rural} = \frac{52}{65} - \frac{30}{55} = 0.255$$
■ The standard error of this estimate is 𝑆𝐸 = 0.0852.
■ Construct a 95% confidence interval for the true difference in the
proportions of urban and rural students that pass the course.
point estimate ± 1.96 standard errors = 0.255 ± 1.96 x 0.0852
■ The 95% confidence interval for the true difference in the
proportions of urban and rural students that pass the course is
(0.088008, 0.421992)
■ 0.088008 is called the lower bound of the confidence interval.
■ 0.421992 is called the upper bound of the confidence interval.
■ The plausible values for the true difference are between 8.8% and
42.2%.
74
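The arithmetic for this interval can be sketched in Python. The standard error 0.0852 is taken as given from the slide, and the point estimate is rounded to three decimals as the slide does; the variable names are illustrative.

```python
p_urban = 52 / 65  # pass rate, urban/suburban students
p_rural = 30 / 55  # pass rate, rural/small-town students

point_estimate = round(p_urban - p_rural, 3)  # rounded to 0.255 as on the slide
standard_error = 0.0852                       # given on the slide
z_star = 1.96                                 # multiplier for 95% confidence

margin_of_error = z_star * standard_error
lower_bound = point_estimate - margin_of_error
upper_bound = point_estimate + margin_of_error

print(f"point estimate  = {point_estimate}")
print(f"margin of error = {margin_of_error:.6f}")
print(f"95% CI          = ({lower_bound:.6f}, {upper_bound:.6f})")
```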

Example 2: Web Design Interval
■ Recall the observational study undertaken by an online
art gallery to see whether investing in a website redesign
would increase the percentage of premium accounts
from its current rate of 25%. The gallerists surveyed 500
users and found that 150 of them say they would
purchase a premium account if the new features were
included.
■ Earlier, we used the point estimate p̂ = 0.3 to conduct a
hypothesis test based on the normal distribution (there the
standard error was 0.019).
■ Now use p̂ = 0.3 with the standard error 𝑆𝐸 = 0.0205 to create a
95% confidence interval for the proportion of Premium account
users. TOP HAT
75

Example 2: Web Design Interval
Interpretation
■ Answer: The 95% CI for the proportion of Premium users
is (0.25982, 0.34018)
■ Notice that the value 0.25 does not fall within the 95%
confidence interval!
■ We can interpret this to mean the confidence interval
does not consider it to be a reasonable value for the true
percentage of premium accounts – this is consistent with
our evaluation of the evidence for the hypothesis test we
conducted earlier.
76

Interpretation
■ The phrase confidence level is used to describe the likelihood
or chance that a yet-to-be constructed interval will actually
contain the true population value.
■ However, we have to be careful about how to interpret this
level of confidence if we have already computed our interval of
values.
■ The population parameter is not a random quantity; it does not
vary. Once we have “looked” at (computed) the actual interval,
we cannot talk about probability or chance for this particular
interval anymore.
■ Unlike in the movie “Harry Potter”!
■ https://youtu.be/81scFUYQGbU?t=78
■ The 95% confidence level applies to the procedure, not to an
individual interval; it applies “before you look” and not “after
you look” at your data and compute your observed interval of
values.
77

Changing the confidence level - 1
■ Suppose we want to create a range of plausible values for a
parameter that will have more than 95% confidence [i.e.,
create the interval using a process that will capture the
parameter more than 95% of the time].
■ Let’s return to our original formula for a 95% confidence
interval:
𝑝𝑜𝑖𝑛𝑡 𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑒 ± 1.96 ∗ 𝑆𝐸
■ Notice this interval has three components: the point estimate,
the standard error of that estimate, and a multiplier of 1.96.
■ invNorm(0.975, 0, 1) = 1.96
■ Recall that we chose this multiplier of 1.96 earlier after
observing that 95% of observations of a normally-distributed
variable fall within 1.96 standard deviations of the mean.
78

Changing the confidence level - 2
■ invNorm(0.995, 0, 1) = 2.576
■ invNorm(0.95, 0, 1) = 1.645
■ By extension, we could observe that 99% of observations
fall within 2.576 standard deviations of the mean, and
that only 90% of observations fall within 1.645 standard
deviations of the mean.
■ If the point estimate of a parameter follows a normal
model with standard error 𝑆𝐸, then a confidence interval
for that population parameter is:
𝑝𝑜𝑖𝑛𝑡 𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑒 ± 𝑧∗ ∗ 𝑆𝐸
where 𝑧∗ corresponds to the level you’d like the confidence
interval to have.
79

Confidence level multipliers
■ Use the table below to jot down some confidence levels
that are commonly seen in statistical studies, along with
their associated multipliers.
■ After looking at this table, you’ll probably notice a key
idea underlying confidence intervals:
■ If you want to be more confident in your interval of
plausible values, you need to make your interval wider.
Confidence Level   Multiplier 𝑧∗
90%                1.645
95%                1.96
99%                2.576
80
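The multipliers in the table come from the same inverse-normal lookup used earlier: put the confidence level in the middle and split the rest between the two tails. A minimal Python sketch (the function name `multiplier` is a hypothetical helper):

```python
from statistics import NormalDist


def multiplier(confidence_level: float) -> float:
    """z* whose central area under the standard normal curve equals the level."""
    area_left = 1.0 - (1.0 - confidence_level) / 2.0  # e.g. 0.975 for 95%
    return NormalDist().inv_cdf(area_left)


multipliers = {level: multiplier(level) for level in (0.90, 0.95, 0.99)}
for level, z_star in multipliers.items():
    print(f"{level:.0%} confidence: z* = {z_star:.3f}")
```

A higher confidence level gives a larger multiplier, which is why the interval has to get wider.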

The Logic of Confidence Intervals
■ Consider all possible random samples of the same large size n.
■ Each possible random sample provides a possible sample statistic
value.
■ If we made a histogram of all of these possible statistics it would
look like the normal distribution.
■ About 95% of the possible sample statistics will be in the interval
𝑝𝑜𝑝𝑢𝑙𝑎𝑡𝑖𝑜𝑛 𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟 ± 1.96 ∗ 𝑆𝐸
■ and for each one of these sample statistic values, the interval
𝑝𝑜𝑖𝑛𝑡 𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑒 ± 1.96 ∗ 𝑆𝐸 will contain the population parameter.
■ Thus about 95% of the intervals
𝑝𝑜𝑖𝑛𝑡 𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑒 ± 1.96 ∗ 𝑆𝐸
will contain the population parameter.
■ Note: The ± part of the interval, 1.96 ∗ 𝑆𝐸, is called the 95% margin of
error.
81
