Chapter 4 Numerical Methods for Describing Data
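Describing the Center of a Data Set with the arithmetic mean The sample mean of a numerical sample $x_1, x_2, \ldots, x_n$, denoted $\bar{x}$, is the sum of the observations divided by the number of observations: $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x}{n}$.
Describing the Center of a Data Set with the arithmetic mean The population mean, denoted by µ, is the average of all x values in the entire population.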
Example calculations During a two-week period, 10 houses were sold in Fancytown. The “average” or mean price for this sample of 10 houses in Fancytown is $295,000.
Example calculations During a two-week period, 10 houses were sold in Lowtown. The “average” or mean price for this sample of 10 houses in Lowtown is also $295,000, but the sample contains one outlier.
Reflections on the Sample calculations Looking at the dotplots of the samples for Fancytown and Lowtown, we can see that the mean, $295,000, appears to accurately represent the “center” of the data for Fancytown, but it is not representative of the Lowtown data. Clearly, the mean can be greatly affected by the presence of even a single outlier.
Comments In the previous example of the house prices in the sample of 10 houses from Lowtown, the mean was affected very strongly by the one house with the extremely high price. The other 9 houses had selling prices around $100,000. This illustrates that the mean can be very sensitive to a few extreme values.
Describing the Center of a Data Set with the median The sample median is obtained by first ordering the n observations from smallest to largest (with any repeated values included, so that every sample observation appears in the ordered list). Then the sample median is the single middle value if n is odd, and the mean of the two middle values if n is even.
Example of Median Calculation Consider the Fancytown data. First, we put the data in increasing numerical order to get 231,000  285,000  287,000  294,000  297,000  299,000  312,000  313,000  315,000  317,000. Since there are 10 (an even number of) data values, the median is the mean of the two values in the middle: (297,000 + 299,000)/2 = 298,000.
Example of Median Calculation Consider the Lowtown data. We put the data in increasing numerical order to get 93,000  95,000  97,000  99,000  100,000  110,000  113,000  121,000  122,000  2,000,000. Since there are 10 (an even number of) data values, the median is the mean of the two values in the middle: (100,000 + 110,000)/2 = 105,000.
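The two house-price examples can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module; the variable names `fancytown` and `lowtown` are chosen here for illustration, and the prices are copied from the ordered lists above.

```python
from statistics import mean, median

# Selling prices in dollars, copied from the two ordered lists above.
fancytown = [231_000, 285_000, 287_000, 294_000, 297_000,
             299_000, 312_000, 313_000, 315_000, 317_000]
lowtown = [93_000, 95_000, 97_000, 99_000, 100_000,
           110_000, 113_000, 121_000, 122_000, 2_000_000]

for name, prices in [("Fancytown", fancytown), ("Lowtown", lowtown)]:
    # Both samples share the same mean, but their medians differ sharply.
    print(f"{name}: mean = {mean(prices):,.0f}, median = {median(prices):,.0f}")
```

Both samples have the same mean, while the Lowtown median stays near the nine typical prices, which is the sense in which the median resists a single extreme value.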
Comparing the Sample Mean & Sample Median [Figures: distribution curves marking the positions of the mean and the median.]
Comparing the Sample Mean & Sample Median Typically, when a distribution is positively skewed, the mean is larger than the median; when a distribution is negatively skewed, the mean is smaller than the median; and when a distribution is symmetric, the mean and the median are equal. Notice from the preceding pictures that the median splits the area under the distribution in half and the mean is the point of balance.
The Trimmed Mean A  trimmed mean  is computed by first ordering the data values from smallest to largest, deleting a selected number of values from each end of the ordered list, and finally computing the mean of the remaining values.  The  trimming percentage  is the percentage of values deleted from each end of the ordered list.
Example of Trimmed Mean
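As a small sketch of the definition, the function below trims the same number of values from each end of the ordered list and averages what remains. The name `trimmed_mean` and the choice to round the number of deleted values down are assumptions made for this illustration; the Lowtown house prices from the earlier example are reused as data.

```python
def trimmed_mean(values, trim_percent):
    """Mean after deleting trim_percent percent of the values from EACH end."""
    ordered = sorted(values)
    # Assumption: when the percentage is not a whole number of values,
    # round the number deleted from each end down.
    k = int(len(ordered) * trim_percent / 100)
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("trimming percentage removes every observation")
    return sum(kept) / len(kept)

# Lowtown selling prices (dollars) from the earlier example.
lowtown = [93_000, 95_000, 97_000, 99_000, 100_000,
           110_000, 113_000, 121_000, 122_000, 2_000_000]

# A 10% trim deletes one value from each end of the 10 ordered prices.
print(trimmed_mean(lowtown, 10))
```

With a 10% trim the $2,000,000 sale is deleted, so the result lands close to the median rather than at the untrimmed mean of $295,000.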
Another Example Here is what happens when you compute the mean, the median, and the 5% and 10% trimmed means of the ages of the 79 students taking Data Analysis.
Categorical Data - Sample Proportion The sample proportion of successes is the number of successes in the sample divided by the sample size n.
Categorical Data - Sample Proportion If we look at the student data sample, consider the variable gender, and treat being female as a success, then 25 of the 79 students in the sample are female, so the sample proportion (of females) is 25/79 ≈ 0.316.
Describing Variability The simplest numerical measure of the variability of a numerical data set is the  range , which is defined to be the difference between the largest and smallest data values. range = maximum - minimum
Describing Variability The n deviations from the sample mean are the differences $x_1 - \bar{x},\ x_2 - \bar{x},\ \ldots,\ x_n - \bar{x}$. Note: The sum of all of the deviations from the sample mean will be equal to 0, except possibly for the effects of rounding the numbers. This means that the average deviation from the mean is always 0 and cannot be used as a measure of variability.
Sample Variance The sample variance, denoted $s^2$, is the sum of the squared deviations from the mean divided by n - 1: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$.
Sample Standard Deviation The sample standard deviation, denoted s, is the positive square root of the sample variance. The population standard deviation is denoted by σ.
Example calculations 10 Macintosh Apples were randomly selected and weighed (in ounces).
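Calculator Formula for $s^2$ and s A little algebra can establish that the sum of the squared deviations satisfies $\sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$. A computational formula for the sample variance is therefore given by $s^2 = \frac{\sum x^2 - (\sum x)^2 / n}{n - 1}$.
Calculations Revisited The values for $s^2$ and s are exactly the same as were obtained earlier.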
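The agreement between the defining formula and the computational formula can be illustrated numerically. Because the apple weights are not reproduced in this transcript, the sketch below reuses the Fancytown house prices (in thousands of dollars) purely as stand-in data; the variable names are illustrative.

```python
import math
from statistics import variance

# Stand-in data: Fancytown selling prices in thousands of dollars.
x = [231, 285, 287, 294, 297, 299, 312, 313, 315, 317]
n = len(x)
x_bar = sum(x) / n

# Deviations from the sample mean; they sum to 0.
deviations = [xi - x_bar for xi in x]

# Defining formula: sum of squared deviations divided by n - 1.
s2_definition = sum(d ** 2 for d in deviations) / (n - 1)

# Computational formula: (sum of x^2 - (sum of x)^2 / n) divided by n - 1.
s2_shortcut = (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)

s = math.sqrt(s2_definition)
print(sum(deviations), s2_definition, s2_shortcut, round(s, 2))
print(variance(x))  # library cross-check, also uses the n - 1 divisor
```

With real measurements the two formulas can differ slightly in the last decimal places because of rounding, which is why the comparison is usually made with a small tolerance.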
Quartiles and the Interquartile Range Lower quartile (Q1) = median of the lower half of the data set. Upper quartile (Q3) = median of the upper half of the data set. Note: If n is odd, the median is excluded from both the lower and upper halves of the data. The interquartile range (iqr), a resistant measure of variability, is given by iqr = upper quartile - lower quartile = Q3 - Q1.
Quartiles and IQR Example 15 students with part-time jobs were randomly selected and the number of hours worked last week was recorded:
19, 12, 14, 10, 12, 10, 25, 9, 8, 4, 2, 10, 7, 11, 15
The data are put in increasing order to get
2, 4, 7, 8, 9, 10, 10, 10, 11, 12, 12, 14, 15, 19, 25
Quartiles and IQR Example With 15 data values, the median is the 8th value. Specifically, the median is 10.
Lower half: 2, 4, 7, 8, 9, 10, 10, so the lower quartile is Q1 = 8.
Upper half: 11, 12, 12, 14, 15, 19, 25, so the upper quartile is Q3 = 14.
iqr = 14 - 8 = 6
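The median-of-halves rule used in this example can be sketched in Python as follows. The helper name `quartiles` is an assumption for this illustration. Library routines such as `statistics.quantiles` use interpolation rules that do not always agree with this convention, so the halves are built by hand here.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # When n is odd, the median itself is excluded from both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

# Hours worked last week by the 15 sampled students.
hours = [19, 12, 14, 10, 12, 10, 25, 9, 8, 4, 2, 10, 7, 11, 15]

q1, med, q3 = quartiles(hours)
print(q1, med, q3, q3 - q1)
```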
Boxplots Constructing a Skeletal Boxplot
1. Draw a horizontal (or vertical) scale.
2. Construct a rectangular box whose left (or lower) edge is at the lower quartile and whose right (or upper) edge is at the upper quartile (the box width = iqr).
3. Draw a vertical (or horizontal) line segment inside the box at the location of the median.
4. Extend horizontal (or vertical) line segments from each end of the box to the smallest and largest observations in the data set. (These lines are called whiskers.)
Skeletal Boxplot Example Using the student work hours data we have: [Figure: skeletal boxplot of hours worked on a scale from 0 to 25.]
Outliers An observation is an outlier if it is more than 1.5 iqr away from the closest end of the box (less than the lower quartile minus 1.5 iqr or more than the upper quartile plus 1.5 iqr). An outlier is extreme if it is more than 3 iqr from the closest end of the box, and it is mild otherwise.
Modified Boxplots A  modified boxplot  represents mild outliers by shaded circles and extreme outliers by open circles. Whiskers extend on each end to the most extreme observations that are  not  outliers.
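Modified Boxplot Example Using the student work hours data (lower quartile = 8, upper quartile = 14, iqr = 6) we have
Lower quartile - 1.5 iqr = 8 - 1.5(6) = -1
Upper quartile + 1.5 iqr = 14 + 1.5(6) = 23
Upper quartile + 3 iqr = 14 + 3(6) = 32
The observation 25 lies between 23 and 32, so it is a mild outlier. The smallest data value that isn't an outlier is 2 and the largest data value that isn't an outlier is 19, so the whiskers extend to 2 and 19.
[Figure: modified boxplot of hours worked on a scale from 0 to 25.]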
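The outlier rule and the whisker ends for a modified boxplot can be sketched in code. The function name `classify_outliers` and its return layout are assumptions for this illustration; quartiles are computed with the median-of-halves rule described earlier.

```python
from statistics import median

def classify_outliers(values):
    """Return (mild, extreme, whiskers) for a modified boxplot."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    # The median is excluded from the upper half when n is odd.
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    iqr = q3 - q1

    mild, extreme, inside = [], [], []
    for v in ordered:
        # Distance beyond the nearest end of the box (negative inside the box).
        beyond = max(q1 - v, v - q3)
        if beyond > 3 * iqr:
            extreme.append(v)
        elif beyond > 1.5 * iqr:
            mild.append(v)
        else:
            inside.append(v)
    # Whiskers reach the most extreme observations that are not outliers.
    return mild, extreme, (inside[0], inside[-1])

hours = [19, 12, 14, 10, 12, 10, 25, 9, 8, 4, 2, 10, 7, 11, 15]
print(classify_outliers(hours))
```

For the work hours data this flags 25 as the only (mild) outlier and places the whisker ends at 2 and 19.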
Modified Boxplot Example Consider the ages of the 79 students from the classroom data set from the Chapter 3 slideshow.
17  18  18  18  18  18  19  19  19  19
19  19  19  19  19  19  19  19  19  19
19  19  19  19  19  19  20  20  20  20
20  20  20  20  20  20  21  21  21  21
21  21  21  21  21  21  21  21  21  21
22  22  22  22  22  22  22  22  22  22
22  23  23  23  23  23  23  24  24  24
25  26  28  28  30  37  38  44  47
Median = 21, lower quartile = 19, upper quartile = 22, iqr = 22 - 19 = 3
Lower quartile - 1.5 iqr = 14.5    Upper quartile + 1.5 iqr = 26.5
Lower quartile - 3 iqr = 10        Upper quartile + 3 iqr = 31
Mild outliers: 28, 28, 30
Extreme outliers: 37, 38, 44, 47
Modified Boxplot Example Here is the modified boxplot for the student age data. [Figure: horizontal modified boxplot on a scale from 15 to 50, with labels marking the smallest and largest data values that aren't outliers, the mild outliers, and the extreme outliers.]
Modified Boxplot Example Here is the same boxplot reproduced with a vertical orientation. [Figure: vertical modified boxplot on a scale from 15 to 50.]
Comparative Boxplot Example By putting boxplots of two separate groups or subgroups side by side, we can compare their distributional behaviors. Notice that the distributions of female and male student weights have similar shapes, although the females are roughly 20 lbs lighter (as a group). [Figure: comparative boxplots of student weight by gender (Females, Males) on a scale from 100 to 240.]
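Interpreting Variability Chebyshev's Rule Consider any number k, where k ≥ 1. Then the percentage of observations that are within k standard deviations of the mean is at least $100\left(1 - \frac{1}{k^2}\right)\%$.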
Interpreting Variability Chebyshev's Rule For specific values of k, Chebyshev's Rule reads:
At least 75% of the observations are within 2 standard deviations of the mean.
At least 89% of the observations are within 3 standard deviations of the mean.
At least 90% of the observations are within 3.16 standard deviations of the mean.
At least 94% of the observations are within 4 standard deviations of the mean.
At least 96% of the observations are within 5 standard deviations of the mean.
At least 99% of the observations are within 10 standard deviations of the mean.
Example - Chebyshev's Rule Consider the student age data.
17  18  18  18  18  18  19  19  19  19
19  19  19  19  19  19  19  19  19  19
19  19  19  19  19  19  20  20  20  20
20  20  20  20  20  20  21  21  21  21
21  21  21  21  21  21  21  21  21  21
22  22  22  22  22  22  22  22  22  22
22  23  23  23  23  23  23  24  24  24
25  26  28  28  30  37  38  44  47
[On the slide, each value is color coded according to whether it lies within 1, 2, 3, 4, or 5 standard deviations of the mean.]
Example - Chebyshev's Rule Summarizing the student age data
Interval                                    Chebyshev's    Actual
within 1 standard deviation of the mean     ≥ 0%           72/79 = 91.1%
within 2 standard deviations of the mean    ≥ 75%          75/79 = 94.9%
within 3 standard deviations of the mean    ≥ 88.8%        76/79 = 96.2%
within 4 standard deviations of the mean    ≥ 93.8%        77/79 = 97.5%
within 5 standard deviations of the mean    ≥ 96.0%        79/79 = 100%
Notice that Chebyshev's Rule gives very conservative lower bounds, and the values aren't very close to the actual percentages.
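The comparison between Chebyshev's lower bounds and the actual percentages can be reproduced with a short script. This is a sketch using Python's `statistics` module; the `ages` list is a condensed tally of the 79 ages listed in the example (one 17, five 18s, twenty 19s, and so on), and the other names are illustrative.

```python
from statistics import mean, stdev

# The 79 student ages, written as value * count for brevity.
ages = ([17] + [18] * 5 + [19] * 20 + [20] * 10 + [21] * 14 + [22] * 11
        + [23] * 6 + [24] * 3 + [25, 26, 28, 28, 30, 37, 38, 44, 47])

x_bar = mean(ages)
s = stdev(ages)  # sample standard deviation (n - 1 divisor)

counts = []
for k in range(1, 6):
    bound = 1 - 1 / k ** 2  # Chebyshev: at least this fraction within k sd
    within = sum(abs(a - x_bar) <= k * s for a in ages)
    counts.append(within)
    print(f"k = {k}: Chebyshev >= {bound:.1%}, "
          f"actual {within}/{len(ages)} = {within / len(ages):.1%}")
```

The bound has to hold for every possible data set, which is why it sits so far below the observed percentages for these ages.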
Empirical Rule If the histogram of values in a data set is reasonably symmetric and unimodal (specifically, is reasonably approximated by a normal curve), then
Approximately 68% of the observations are within 1 standard deviation of the mean.
Approximately 95% of the observations are within 2 standard deviations of the mean.
Approximately 99.7% of the observations are within 3 standard deviations of the mean.
Z Scores The z score corresponding to a particular observation in a data set is $z = \frac{\text{observation} - \text{mean}}{\text{standard deviation}}$. The z score tells how many standard deviations the observation is from the mean. A positive z score indicates the observation is above the mean and a negative z score indicates the observation is below the mean.
Z Scores Computing the z score is often referred to as standardization and the z score is called a standardized score.
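A minimal sketch of standardization follows. The function name `z_score` and the numbers in the usage lines (a mean of 70 and a standard deviation of 10) are hypothetical values chosen only to show the sign convention.

```python
def z_score(observation, mean, standard_deviation):
    """Number of standard deviations the observation lies from the mean."""
    if standard_deviation <= 0:
        raise ValueError("standard deviation must be positive")
    return (observation - mean) / standard_deviation

# Hypothetical values: mean 70, standard deviation 10.
print(z_score(85, 70, 10))  # positive: the observation is above the mean
print(z_score(55, 70, 10))  # negative: the observation is below the mean
```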
Example A sample of GPAs of 38 statistics students appears below (sorted in increasing order).
2.00  2.25  2.36  2.37  2.50  2.50  2.60
2.67  2.70  2.70  2.75  2.78  2.80  2.80
2.82  2.90  2.90  3.00  3.02  3.07  3.15
3.20  3.20  3.20  3.23  3.29  3.30  3.30
3.42  3.46  3.48  3.50  3.50  3.58  3.75
3.80  3.83  3.97
Example The following stem-and-leaf display indicates that the GPA data is reasonably symmetric and unimodal.
2 | 0
2 | 233
2 | 55
2 | 667777
2 | 88899
3 | 0001
3 | 2222233
3 | 444555
3 | 7
3 | 889
Stem: Units digit
Leaf: Tenths digit
Example
Interval                                    Empirical Rule    Actual
within 1 standard deviation of the mean     ≈ 68%             27/38 = 71%
within 2 standard deviations of the mean    ≈ 95%             37/38 = 97%
within 3 standard deviations of the mean    ≈ 99.7%           38/38 = 100%
Notice that the empirical rule gives reasonably good estimates for this example.
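The actual counts for the GPA data can be recomputed from the 38 listed values. This is a sketch with the standard `statistics` module; it assumes the sample standard deviation (n - 1 divisor) is the one intended, and the names are illustrative.

```python
from statistics import mean, stdev

# The 38 GPAs from the example, in increasing order.
gpas = [2.00, 2.25, 2.36, 2.37, 2.50, 2.50, 2.60,
        2.67, 2.70, 2.70, 2.75, 2.78, 2.80, 2.80,
        2.82, 2.90, 2.90, 3.00, 3.02, 3.07, 3.15,
        3.20, 3.20, 3.20, 3.23, 3.29, 3.30, 3.30,
        3.42, 3.46, 3.48, 3.50, 3.50, 3.58, 3.75,
        3.80, 3.83, 3.97]

x_bar = mean(gpas)
s = stdev(gpas)  # assumption: sample standard deviation

empirical = {1: 0.68, 2: 0.95, 3: 0.997}
counts = []
for k, expected in empirical.items():
    within = sum(abs(g - x_bar) <= k * s for g in gpas)
    counts.append(within)
    print(f"within {k} sd: empirical rule ~{expected:.1%}, "
          f"actual {within}/{len(gpas)} = {within / len(gpas):.0%}")
```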
Comparison of Chebyshev's Rule and the Empirical Rule The following refers to the weights in the sample of 79 students. Notice that the stem-and-leaf display suggests the data distribution is unimodal but positively skewed because of the outliers on the high side. Nevertheless, the results for the Empirical Rule are good.
10 | 3
11 | 37
12 | 011444555
13 | 000000455589
14 | 000000000555
15 | 000000555567
16 | 000005558
17 | 0000005555
18 | 0358
19 | 5
20 | 00
21 | 0
22 | 55
23 | 79
Stem: Hundreds & tens digits
Leaf: Units digit
Comparison of Chebyshev's Rule and the Empirical Rule
Interval                                    Chebyshev's Rule    Empirical Rule    Actual
within 1 standard deviation of the mean     ≥ 0%                ≈ 68%             56/79 = 70.9%
within 2 standard deviations of the mean    ≥ 75%               ≈ 95%             75/79 = 94.9%
within 3 standard deviations of the mean    ≥ 88.8%             ≈ 99.7%           79/79 = 100%
Notice that even with moderate positive skewing of the data, the Empirical Rule gave a much more usable and meaningful result.
