Methods for Selecting Random Samples
RANDSAMP.XLS This file contains data about the annual incomes of 40 families. We want to choose a simple random sample of size 10 from this frame. How can this be done? And how do summary statistics of the chosen families compare to the corresponding summary statistics of the population?
Data
Sampling Terminology In any sampling problem there is a relevant population, the set of all members about which the study intends to make inferences. Before we select a sample from a given population, we typically need a list of all members of the population. This list is called the frame, and the potential sample members are called sampling units. There are two types of samples: probability samples and judgmental samples.
Sampling Terminology -- continued A probability sample is a sample in which the sampling units are chosen from the population by means of a random mechanism such as a random number table. No formal random mechanism is used to select a judgmental sample; in this case the sampling units are chosen according to the sampler’s judgment. The simplest type of sampling scheme is appropriately called simple random sampling.
Solution The idea is very simple. We first generate a column of random numbers in column C. Then we sort the rows according to the random numbers and choose the first 10 families in the sorted rows. The following procedure produces the results. Random numbers. Enter the formula =RAND() in cell C10 and copy it down column C. Replace with values. To enable sorting we must “freeze” the random numbers - that is, replace their formulas with values. To do this, select the range C10:C49, use Edit/Copy, and then use Edit/Paste Special with the Values option.
Solution -- continued Copy to a new range.  Copy the range A10:C49 to the range E10:G49. Sort.  Select the range E10:G49 and use the Data/Sort menu item. Sort according to the Random # column in ascending order. Then the 10 families with the 10 smallest random numbers are the ones in the sample. Means.  Use the AVERAGE, MEDIAN and STDEV functions in row 6 to calculate summary statistics of the first 10 incomes in column F.
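The sort-by-random-number procedure above can be sketched outside the spreadsheet as well. The following is a minimal Python sketch, not the workbook's own method: the function name simple_random_sample, the seed, and the 40 made-up incomes are all assumptions for illustration and are not the RANDSAMP.XLS data.

```python
import random
import statistics

def simple_random_sample(frame, n, rng):
    """Attach a random key to every row, sort by key, keep the first n rows."""
    keyed = [(rng.random(), row) for row in frame]   # like =RAND() in column C
    keyed.sort(key=lambda pair: pair[0])             # like Data/Sort, ascending
    return [row for _, row in keyed[:n]]

rng = random.Random(42)  # a fixed seed plays the role of "freezing" the values

# Hypothetical frame of (family id, income) pairs; NOT the RANDSAMP.XLS data.
frame = [(i, 30000 + 1500 * i) for i in range(1, 41)]

sample = simple_random_sample(frame, 10, rng)
incomes = [income for _, income in sample]

# Counterparts of the AVERAGE, MEDIAN and STDEV cells
print(statistics.mean(incomes), statistics.median(incomes), statistics.stdev(incomes))
```

Because every ordering of the random keys is equally likely, the first 10 rows after sorting form a simple random sample without replacement.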
Results
More Random Samples Automatically If we would like more random samples of size 10, we would need to repeat the whole process each time. To save you the trouble, we have set up a macro to automate the process. See the Automated sheet of the RANDSAMP.XLS file. By clicking on the button we get a different random sample.
Example 8.2 Methods for Selecting Random Samples
RECEIVE.XLS This file contains 280 accounts receivable for the Spring Mills Company. There are three variables: Size, the customer size (small, medium, large), depending on its volume of business with Spring Mills; Days, the number of days since the customer was billed; and Amount, the amount of the bill. Generate 50 random samples of size 15 each from the small customers only, calculate the average amount owed in each random sample, and construct a histogram of these 50 averages.
Generated Random Sample
Solution To select small accounts only, insert a blank row after account 150 (the last small account). Then, with the cursor anywhere in the small account data set, use the StatPro/Statistical Inference/Generate Random Samples menu item, enter 50 and 15 as the number of samples and the sample size, and put the results in a new sheet. To find the amounts owed for the sampled accounts, enter the formula =VLOOKUP(B3,Data!Data,4) in cell B21 and copy it to the range B21:AY35.
Solution -- continued Then calculate the averages in row 37 with the AVERAGE function and transpose this row of averages to a column of averages in BA4:BA53 by entering the formula =TRANSPOSE(B37:AY37) and pressing Ctrl-Shift-Enter. Use StatPro’s histogram procedure to create a histogram. Each run will look different because of the random numbers selected.
Solution -- continued The histogram indicates the variability of sample means we might obtain by selecting many different random samples of size 15 from this population of small customer accounts.
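The same experiment can be sketched in a few lines of Python. This is an illustrative sketch only: the 150 amounts are randomly generated stand-ins (assumed, not the RECEIVE.XLS data), the function name sample_means is hypothetical, and the five-bin text histogram is a crude substitute for StatPro's chart.

```python
import random
import statistics

def sample_means(amounts, num_samples, sample_size, rng):
    """Mean of each of num_samples simple random samples (without replacement)."""
    return [statistics.mean(rng.sample(amounts, sample_size))
            for _ in range(num_samples)]

rng = random.Random(0)

# Hypothetical amounts owed by 150 small accounts; NOT the RECEIVE.XLS data.
amounts = [round(rng.uniform(100, 400), 2) for _ in range(150)]

means = sample_means(amounts, 50, 15, rng)

# Crude text histogram of the 50 averages, using 5 equal-width bins
lo, hi = min(means), max(means)
width = (hi - lo) / 5
counts = [0] * 5
for m in means:
    idx = min(int((m - lo) / width), 4)   # put the maximum in the last bin
    counts[idx] += 1
for k, c in enumerate(counts):
    print(f"{lo + k * width:7.2f} to {lo + (k + 1) * width:7.2f}: {'*' * c}")
```

Changing the seed gives a different set of 50 averages, which mirrors the remark that each histogram looks different.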
Example 8.3 Methods for Selecting Random Samples
STRATIFIED.XLS This file contains a frame of all 1000 people in the city of Smalltown who have Sears credit cards. Sears is interested in estimating the average number of  other  credit cards these people own, as well as other information about their use of credit. The company decides to stratify these customers by age, select a stratified sample of size 100 with proportional sample sizes, and then contact these 100 people by phone. How might Sears proceed?
Systematic Sampling A systematic sample provides a convenient way to choose the sample. It works as follows: First, we calculate the sampling interval as the population size divided by the sample size (220 in this example). Next, we use a random mechanism to choose a number between 1 and 220 (say, number 131). Then we choose the 131st name, the 351st name, the 571st name, and so on, always stepping forward by the sampling interval. The result is a systematic sample of size n=250.
Stratified Sampling Suppose we can identify various subpopulations within the total population. We call these subpopulations strata. It then makes sense to select a simple random sample from each stratum instead of from the entire population. This is called stratified sampling. This method is particularly useful when there is considerable variation between the various strata but relatively little variation within a given stratum.
Stratified Sampling -- continued To obtain a stratified random sample we must choose a total sample size n, and we must choose a sample size n_i for each stratum i. There are many ways to choose these numbers, but the most popular method is proportional sample sizes, where each n_i is proportional to the size of stratum i. The advantage of proportional sample sizes is that they are very easy to determine. The disadvantage is that they ignore differences in variability among the strata.
Solution First Sears must decide exactly how to stratify by age. Their reasoning is that different age groups probably have different attitudes and behavior regarding credit. After preliminary investigation they decide to have three age categories: 18-30, 31-62, and 63-80. The spreadsheet is laid out as follows: the total sample size is in cell C3, the definitions of the strata are in rows 6-8, and the customer data are in the range A11:B1010.
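A small Python sketch of the systematic rule follows. The population size of 55,000 is an assumption inferred from the slide's numbers (an interval of 220 times a sample size of 250); the function name systematic_positions and its parameters are hypothetical.

```python
import random

def systematic_positions(population_size, sample_size, start=None, rng=None):
    """Return the 1-based list positions of a systematic sample."""
    interval = population_size // sample_size      # sampling interval
    if start is None:
        rng = rng or random.Random()
        start = rng.randint(1, interval)           # random start in 1..interval
    return [start + k * interval for k in range(sample_size)]

# Assumed population of 55,000 names, so the interval is 55,000 / 250 = 220.
positions = systematic_positions(55_000, 250, start=131)
print(positions[:3])   # [131, 351, 571]
```

Only the starting position is random; every later position is determined by the start and the interval.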
Stratified Sample
Solution -- continued To see what age category each customer falls in, we enter the formula =IF(B11<=$D$6,1,IF(B11<=$D$7,2,3)) in cell C11 and then copy it down column C. Next, it is useful to “unstack” the data into three groups, one for each age category. It is easy to unstack the data in columns A-C. With the cursor anywhere in A10:C1010, select StatPro/Data Utilities/Unstack Variables. Select Category as the Code variable, select Cust and Age as the variables to unstack, and accept the default location for the unstacked variables.
Solution -- continued Once the variables are unstacked we can calculate the counts and sample sizes in F6:G8 with the formulas =COUNT(E11:E142) and =ROUND(TotSampSize*F6/1000,0). Finally, we proceed by copying the data in columns E and F into L and M, appending a column of random numbers, sorting on the random number column, and choosing the first 13 (or however many) customers. The file shows the calculations for the other categories.
Cluster Sampling Suppose a company is interested in various characteristics of households in a particular city. The sampling units are households. We could proceed with the sampling methods discussed so far, but another approach is more convenient: divide the city into city blocks, treat the blocks as sampling units, and then sample all the households in the chosen blocks. In this case the city blocks are called clusters and the sampling is called cluster sampling.
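The categorize, count, and sample steps can be sketched in Python as follows. This is a hedged illustration, not the workbook's procedure: the customer ages are randomly generated (assumed, not the STRATIFIED.XLS data), the function names are hypothetical, and the breakpoints 30 and 62 are taken from the age categories stated earlier.

```python
import random

def age_category(age):
    """Same breakpoints as the slides: 18-30 -> 1, 31-62 -> 2, 63-80 -> 3."""
    if age <= 30:
        return 1
    if age <= 62:
        return 2
    return 3

def stratified_sample(customers, total_n, rng):
    """Proportional sample sizes, then a simple random sample in each stratum."""
    strata = {}
    for cust_id, age in customers:                 # "unstack" by category
        strata.setdefault(age_category(age), []).append((cust_id, age))
    sample = {}
    for cat, members in sorted(strata.items()):
        # Proportional size. Python's round() resolves exact .5 ties to the
        # nearest even integer, unlike Excel's ROUND, and the rounded sizes
        # may not add up to total_n exactly.
        n_i = round(total_n * len(members) / len(customers))
        sample[cat] = rng.sample(members, n_i)
    return sample

rng = random.Random(2024)

# Hypothetical frame of 1000 (customer id, age) pairs; NOT the STRATIFIED.XLS data.
customers = [(cust_id, rng.randint(18, 80)) for cust_id in range(1, 1001)]

sample = stratified_sample(customers, 100, rng)
for cat, members in sample.items():
    print(cat, len(members))
```

Here rng.sample replaces the copy, sort, and take-the-first-rows steps; both give a simple random sample within each stratum.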
Cluster Sampling -- continued The advantage of cluster sampling is sampling convenience (and possibly less cost). It is straightforward to select a cluster sample. The key is to define the sampling units as the clusters, then select a simple random sample of clusters. Then sample all the population members in each selected cluster. When all sampling units within each cluster are taken it is called a  single stage  sampling scheme. Real applications are often more complex and result in  multistage sampling schemes .
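A single-stage cluster sample can be sketched as below. The city of 20 blocks with 5 households each is invented for illustration, and the function name cluster_sample is hypothetical.

```python
import random

def cluster_sample(clusters, num_clusters, rng):
    """Pick a simple random sample of clusters, then take every member of each."""
    chosen = rng.sample(sorted(clusters), num_clusters)
    households = [h for block in chosen for h in clusters[block]]
    return chosen, households

# Hypothetical city: 20 blocks with 5 households each.
clusters = {
    f"block{b:02d}": [f"block{b:02d}-house{h}" for h in range(1, 6)]
    for b in range(1, 21)
}

chosen, households = cluster_sample(clusters, 4, random.Random(7))
print(chosen)
print(len(households))
```

The random mechanism is applied to blocks only; once a block is chosen, all of its households enter the sample, which is what makes the scheme single stage.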
Example 8.4 An Introduction to Estimation
AUDIT.XLS An internal auditor for a furniture retailer wants to estimate the average of all accounts receivable taken over the population of all customer accounts. The company has approximately 10,000 accounts. An exhaustive enumeration is impossible. Therefore, the auditor randomly samples 100 of the accounts. This file contains the observed data. What can the auditor conclude from this sample?
Random Sample
Sources of Estimation Error There are two basic sources of error that can occur when we sample randomly from a population. Sampling error results from “unlucky” samples. Nonsampling errors, which are quite different, can occur for a variety of reasons. Nonresponse bias occurs when a portion of the sample fails to respond to the survey. Nontruthful responses are particularly a problem when respondents are asked sensitive questions. One solution is to use a randomized response technique that presents two questions: one sensitive, one innocuous. Measurement error occurs when the responses to the questions do not reflect what the investigator had in mind.
Sampling Distribution of the Sample Mean We typically estimate the population mean by the sample mean of the randomly chosen sample. The sample mean is called a  point estimate  of the population mean. In general a point estimate of any population parameter is a single-value estimate of that parameter, based on observed sample data. The  sampling error  is the difference between the observed sample mean and the true population mean.
Sampling Distribution of the Sample Mean A negative sampling error means an underestimate of the population mean. The standard deviation of the sampling distribution of the sample mean is called the standard error of the mean. The sample mean is an unbiased estimate of the population mean.
Solution The receivables for the 100 sampled accounts appear in column E. We calculate the sample mean and the sample standard deviation. Then we calculate the (approximate) standard error of the mean with the formula =Sstdev/SQRT(SampSize) in cell B9.
Interpretation The auditor should interpret these values as follows: The sample mean can be used to estimate the unknown population mean. It provides a best guess for the average of the receivables for the 10,000 accounts. The standard error provides a measure of accuracy. The auditor can be about 95% certain that the mean of all 10,000 accounts is within the interval $279 plus or minus $84, that is, between $195 and $363.
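The calculation behind this interpretation can be sketched in Python. The factor of 2 standard errors for the roughly 95% interval is an assumption (the slide shows only the resulting plus or minus $84), the function name mean_with_interval is hypothetical, and the five values are toy numbers, not the AUDIT.XLS data.

```python
import math
import statistics

def mean_with_interval(values):
    """Sample mean, approximate standard error, and mean +/- 2 standard errors."""
    n = len(values)
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)   # like =Sstdev/SQRT(SampSize)
    return mean, std_err, (mean - 2 * std_err, mean + 2 * std_err)

# Toy data for illustration only; NOT the AUDIT.XLS receivables.
values = [10, 12, 14, 16, 18]
mean, std_err, interval = mean_with_interval(values)
print(mean, std_err, interval)
```

With the auditor's figures, a half-width of $84 would correspond to a standard error of about $42 under this assumed rule.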
An Introduction to Estimation
Background Information Suppose you have the opportunity to play a game with a “wheel of fortune”. When you spin a large wheel, it is equally likely to stop in any position. Depending on where it stops, you win anywhere from $0 to $1000. Let’s suppose your winnings are actually based on not one, but n spins of the wheel. If n=2, your winnings are based on the average of two spins. How does the distribution of your winnings depend on n?
Random Sampling? What does this experiment have to do with random sampling? Here, the population is the set of all outcomes we could obtain from a single spin of the wheel; that is, all dollar values from $0 to $1000. Each spin results in one randomly sampled dollar value from the population. Furthermore, because we have assumed that the wheel is equally likely to land in any position, all possible values in the continuum from $0 to $1000 have the same chance of occurring.
Random Sampling? The resulting population distribution is called the  uniform distribution  on the interval from $0 to $1000. It can be shown that the mean and standard deviation are $500 and $289, respectively.
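These two values follow from the standard formulas for a continuous uniform distribution on an interval from a to b, here with a = 0 and b = 1000:

```latex
\mu = \frac{a + b}{2} = \frac{0 + 1000}{2} = 500,
\qquad
\sigma = \frac{b - a}{\sqrt{12}} = \frac{1000}{\sqrt{12}} \approx 288.7
```

Rounded to the nearest dollar, this gives the $500 and $289 quoted above.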
SPIN1.XLS In order to analyze the distribution of winnings based on the average of  n  spins we need to do a sequence of simulations for  n =1,  n =2, n=3,  n =6 and  n =10. This spreadsheet contains the simulation for  n =1. The other simulations can be found in the following spreadsheets,  SPIN2.XLS ,  SPIN3.XLS ,  SPIN6.XLS , and  SPIN10.XLS . For each simulation we consider 1000 replications of an experiment.
Simulations The experiment simulates  n  spins of the wheel and calculates the average - that is, the winnings - from the  n  spins. Based on these 1000 replications, we can then calculate the average winnings, the standard deviation of winnings, and a histogram of winnings for each  n . These will show clearly how the distribution of winnings depends on  n . The following slide shows the results for  n =1. Here, there is no averaging.
 
Simulations -- continued To replicate the experiment 1000 times and collect statistics, we proceed as follows. Random outcomes. To generate outcomes uniformly distributed between $0 and $1000 we enter the formula =$B$3+RAND()*($B$4-$B$3) in cell B11 and copy it down column B. The effect of this formula is to generate a random number between 0 and 1 and multiply it by $1000. Summary measures. Calculate the average and standard deviation of the 1000 winnings in column B with the AVERAGE and STDEV functions. These values appear in cells E4 and E5.
Simulations -- continued Frequency table and histogram. Use the StatPro histogram procedure to create a histogram of the values in column B. Note the following from the chart and graph for one spin: The sample mean of the winnings (cell E4) is very close to the population mean: $500. The standard deviation of the winnings (cell E5) is very close to the population standard deviation: $289. The histogram is nearly flat. These results should come as no surprise: with no averaging taking place, the simulated winnings simply reproduce the flat population distribution.
Simulations -- continued But what happens when  n  > 1? The following slide contains the chart and graph of the  n =2 simulation. To do this we formed a second column of outcomes in column C corresponding to a second spin in each experiment. We average the values in column B and C to obtain each of the winnings in column D. The average winnings is very close to $500, but the standard deviation is much lower and the histogram is no longer flat.
 
Simulations -- continued The histogram is now triangular shaped - symmetric, but not yet bell shaped. To develop similar simulations for n=3, n=6, n=10, or any other n, we insert additional outcome columns and make sure that the AVERAGE formula in the Winnings column averages all n outcomes to its left. The resulting histograms clearly show two effects of increasing n: the histogram becomes more bell shaped, and there is less variability.
Histogram for Three Spins
Histogram for Six Spins
Histogram for Ten Spins
Central Limit Theorem The mean stays right at $500. This behavior is exactly as the central limit theorem predicts. For any population distribution with mean mu and standard deviation sigma, the sampling distribution of the sample mean is approximately normal with mean mu and standard deviation sigma/sqrt(n), and the approximation improves as n increases. In fact, because the population distribution is symmetric in this example - it’s flat - we see the effect of the theorem for n much less than 30; it is already evident for n as low as 6.
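The spin simulations can be sketched in Python to compare the simulated standard deviation of winnings with the value sigma/sqrt(n) that the theorem predicts. This is a minimal sketch under stated assumptions: the function name simulate_winnings and the seed are hypothetical, and Python's random number generator stands in for RAND().

```python
import random
import statistics

def simulate_winnings(n_spins, replications, rng, low=0.0, high=1000.0):
    """Average of n_spins uniform outcomes, repeated `replications` times."""
    return [
        statistics.mean(low + rng.random() * (high - low) for _ in range(n_spins))
        for _ in range(replications)
    ]

rng = random.Random(123)
sigma = 1000 / 12 ** 0.5          # population standard deviation, about 288.7

for n in (1, 2, 3, 6, 10):
    winnings = simulate_winnings(n, 1000, rng)
    print(
        n,
        round(statistics.mean(winnings), 1),     # stays near 500
        round(statistics.stdev(winnings), 1),    # simulated spread
        round(sigma / n ** 0.5, 1),              # predicted sigma / sqrt(n)
    )
```

For n = 10 the predicted standard deviation is about 91.3, roughly a third of the single-spin value, which matches the narrowing seen in the histograms.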
An Introduction to Estimation
Background Information A marketing researcher has been hired by a videocassette rental company to estimate the average number of videocassettes rented annually by households in a particular metropolitan area. The researcher decides to determine the sample size that makes the maximum probable absolute error approximately equal to 10. Discuss how she should proceed.
Sample Size Determination The determination of sample size is usually driven by sampling error considerations. The usual procedure is to select an acceptable sampling error B, called the maximum probable absolute error, and then find the sample size n from the equation B = 2 sigma/sqrt(n), that is, n = 4 sigma^2/B^2. The implication is that if we randomly sample n members from the population, then there is about a 95% chance that the resulting sampling error will be no greater than B in magnitude.
SAMPSIZE.XLS This file contains the data needed to solve the problem. The researcher has chosen the maximum probable absolute error criterion, with B=10 as the value she is willing to tolerate. Therefore, she should use the maximum probable absolute error equation.
Solution -- continued To use this equation she must estimate a value of sigma. Based on her knowledge of the industry and available historical data, she uses a best guess of sigma=50. She then uses the values from C7 and C8 to find the required sample size in C10 with the formula =4*PopStDev^2/MaxAbsErr^2. Finally, she takes a sample of size 100 and observes the sample values shown in column F. Based on this sample, we calculate summary measures in the usual way in the range C13:C16.
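The sample size formula is easy to check in Python. In this sketch the function name required_sample_size is hypothetical, and rounding up to a whole number is an added assumption; the spreadsheet formula as given does not round.

```python
import math

def required_sample_size(sigma_guess, max_abs_error):
    """n = 4 * sigma^2 / B^2, rounded up to a whole number of sampling units."""
    return math.ceil(4 * sigma_guess ** 2 / max_abs_error ** 2)

print(required_sample_size(50, 10))   # 100
```

With sigma=50 and B=10 this gives 4 * 2500 / 100 = 100, the sample size used in the example. Because n grows with the square of sigma, an underestimate of sigma leads to a sample that is too small for the target accuracy.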
Sample Size Determination
Results The absolute error in cell C16 is 2 times as great as the standard error in cell C15. It is slightly higher than the maximum absolute error she specified in cell C8 because she observed a larger standard deviation than she had guessed. In other words, the fact that there is evidently more variation in the population than she thought makes her sample mean based on 100 households slightly less accurate than she intended.

Lecture 4 Applied Econometrics and Economic Modeling

  • 1. Methods for Selecting Random Samples
  • 2. RANDSAMP.XLS This file contains data about the annual incomes of 40 families. We want to choose a simple random sample of size 10 from this frame. How can this be done? And how do summary statistics of the chosen families compare to the corresponding summary statistics of the population?
  • 4. Sampling Terminology In any sampling problem there is a relevant population , the set of all members about which the study intends to make inferences. Before we select a sample from a given population, we typically need a list of all members of the population. This list is called the frame , and the potential sample members are called sampling units . There are two type of samples, probability samples and judgmental samples.
  • 5. Sampling Terminology -- continued A probability sample is a sample in which the sampling units are chosen from the population by means of a random mechanism such as a random number table. No formal random mechanism is used to select a judgmental sample , in this case the sampling units are chosen according to the sampler’s judgment. The simplest type of sampling scheme is appropriately called simple random sampling.
  • 6. Solution The idea is very simple. We first generate a column of random numbers in column C. Then we sort the rows according to the random numbers and choose the first 10 families in the sorted rows. The following procedure produces the results. Random numbers. Enter the formula =RAND() in cell C10 and copy it down column C. Replace with values. To enable sorting we must “freeze” the random numbers - that is, replace their formulas with values. To do third, select the range C10:C49 use Edit/Copy and then use Edit/Paste Special with the Values option.
  • 7. Solution -- continued Copy to a new range. Copy the range A10:C49 to the range E10:G49. Sort. Select the range E10:G49 and use the Data/Sort menu item. Sort according to the Random # column in ascending order. Then the 10 families with the 10 smallest random numbers are the ones in the sample. Means. Use the AVERAGE, MEDIAN and STDEV functions in row 6 to calculate summary statistics of the first 10 incomes in column F.
  • 9. More Random Samples Automatically If we would like more random samples of size 10 we would need to repeat the process repeatedly. To save you the trouble, we have setup a macro to automate the process. See the Automated sheet of the RANDSAMP.XLS file. By clicking on the button we get a different random sample.
  • 10. Example 8.2 Methods for Selecting Random Samples
  • 11. RECEIVE.XLS This file contains 280 accounts receivable for the Spring Mills Company. There are three variables: Size: customer size (small, medium, large), depending on its volume of business with Spring Mills Days: number of days since the customer was billed Amount: amount of the bill Generate 50 random samples of size 15 each from the small customers only, calculate the average amount owed in each random sample, and construct a histogram of these 50 averages.
  • 13. Solution To select small accounts only, insert blank row after account 150 (the last small account). Then, with the cursor anywhere in the small account data set, use the StatPro/Statistical Inference/Generate Random Samples enter 50 and 15 as the number of samples and the sample size, and put the results in a new sheet. To find the amounts owed for the sampled accounts, enter the formula =VLOOKUP(B3,Data!Data,4) in cell B21 and copy it to the range B21:AY35.
  • 14. Solution -- continued Then calculate the average in row 37 with the AVERAGE function and transpose this row of average to a column of averages in BA4:BA53 with the formula =TRANSPOSE(B37:AY37) and pressing Ctrl-Shift-Enter. Use StatPro’s histogram procedure to create a histogram - each will look different because of the random numbers selected.
  • 15. Solution -- continued The histogram indicates the variability of sample means we might obtain by selecting many different random samples of size 15 from this population of small customer accounts.
  • 16. Example 8.3 Methods for Selecting Random Samples
  • 17. STRATIFIED.XLS This file contains a frame of all 1000 people in the city of Smalltown who have Sears credit cards. Sears is interested in estimating the average number of other credit cards these people own, as well as other information about their use of credit. The company decides to stratify these customers by age, select a stratified sample of size 100 with proportional sample sizes, and then contact these 100 people by phone. How might Sears proceed?
  • 18. Systematic Sampling A systematic sample provides a convenient way to choose the sample. It works as follows: First, we calculate the sampling interval as the population size divided by the sample size. Next, we use a random mechanism to choose a number between 1 and 220 (Say number 131). Then we choose the 131st name, the 351st name, the 571 and so on. The result is a systematic sample of size n=250.
  • 19. Stratified Sampling Suppose we can identify various subpopulations within the total population. We call these subpopulations strata . It makes sense to select a simple random sample from the stratum instead of from the entire population. This is called stratified sampling . This method is particularly useful when there is considerable variation between the various strata but relatively little variation within a given stratum.
  • 20. Stratified Sampling -- continued To obtain a stratified random sample we must choose a total sample size n , and we must choose a sample size n i for each stratum i . There are many ways to choose these numbers but the most popular method is proportional sample sizes . The advantage of proportional sample sizes is that they are very easy to determine. The disadvantage is that they ignore differences in variability among the strata.
  • 21. Solution First Sears must decide exactly how to stratify by age. There reasoning is that different age groups probably have different attitudes and behavior regarding credit. After preliminary investigation they decide to have three age categories: 18-30, 31-62, and 63-80. The calculation goes as follows: the total sample size is cell C3 the definitions of the strata in rows 6-8 the customer data in range A11:B1010
  • 23. Solution -- continued To see what age category each customer falls in we enter the formula =IF(B11<=$D$6,1,IF(B11<=$D$7,2,3)) in cell C11 and then copy it down column C. Next, it is useful to “unstack” the data into three groups, one for each age category. It is easy to unstack the data in columns A-C. With the cursor anywhere in A10:C1010 select StatPro/Data Utilities/Unstack Variables. Select Category as the Code variable, select Cust and Age as the variables to unstack, and accept the default location for the unstacked variables.
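For readers without the spreadsheet, the nested IF formula and the unstacking step can be mimicked as follows. This is a sketch under assumptions: the cutoffs 30 and 62 are taken from the age categories stated earlier and are assumed to be the values in $D$6 and $D$7, and the six ages are made-up test values.

```python
from collections import defaultdict

def age_category(age, cutoff1=30, cutoff2=62):
    """Mirror =IF(B11<=$D$6,1,IF(B11<=$D$7,2,3)); cutoffs are assumed values."""
    if age <= cutoff1:
        return 1
    if age <= cutoff2:
        return 2
    return 3

# Made-up ages that sit on the category boundaries.
ages = [18, 30, 31, 62, 63, 80]
categories = [age_category(a) for a in ages]
print(categories)  # [1, 1, 2, 2, 3, 3]

# "Unstack": one list of (customer number, age) pairs per category.
groups = defaultdict(list)
for cust, age in enumerate(ages, start=1):
    groups[age_category(age)].append((cust, age))
```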
  • 24. Solution -- continued Once the variables are unstacked we can calculate the counts and sample sizes in F6:G8 with the formulas =COUNT(E11:E142) and =ROUND(TotSampSize*F6/1000,0). Finally, we proceed by copying the data in columns E and F into L and M, appending a column of random numbers, sorting on the random number column, and choosing the first 13 (or however many) customers. The file shows the calculations for the other categories.
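The same count, allocate, sort, and select steps can be sketched in Python. The count of 132 for the first stratum follows from the range E11:E142; the other two counts and all customer IDs are hypothetical, and the helper names are illustrative only.

```python
import random

def proportional_sizes(stratum_counts, total_sample_size):
    """Allocate the total sample across strata in proportion to stratum size."""
    population = sum(stratum_counts.values())
    # int(x + 0.5) rounds halves up, like Excel's ROUND for positive numbers.
    return {s: int(total_sample_size * count / population + 0.5)
            for s, count in stratum_counts.items()}

def sample_stratum(members, size, rng):
    """Attach a random number to each member, sort on it, keep the first `size`."""
    keyed = sorted((rng.random(), m) for m in members)
    return [m for _, m in keyed[:size]]

# Stratum 1 has 132 customers (range E11:E142); the other two counts are made up.
counts = {1: 132, 2: 600, 3: 268}
sizes = proportional_sizes(counts, 100)
print(sizes)  # {1: 13, 2: 60, 3: 27}

# Hypothetical customer IDs for each stratum.
rng = random.Random(42)
strata = {1: list(range(1, 133)), 2: list(range(133, 733)), 3: list(range(733, 1001))}
sample = {s: sample_stratum(strata[s], sizes[s], rng) for s in strata}
```

With other stratum counts the rounded sizes may not add up exactly to the total sample size, so a final adjustment by hand may be needed.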
  • 25. Cluster Sampling Suppose a company is interested in various characteristics of households in a particular city. The sampling units are households. We could proceed with the sampling methods discussed so far, but there is a more convenient way. We could divide the city into city blocks, use the blocks as sampling units, and then sample all the households in the chosen blocks. In this case the city blocks are called clusters and the sampling is called cluster sampling.
  • 26. Cluster Sampling -- continued The advantage of cluster sampling is sampling convenience (and possibly less cost). It is straightforward to select a cluster sample. The key is to define the sampling units as the clusters, then select a simple random sample of clusters. Then sample all the population members in each selected cluster. When all sampling units within each cluster are taken, it is called a single-stage sampling scheme. Real applications are often more complex and result in multistage sampling schemes.
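A single-stage cluster sample could be sketched as below. The block names and household labels are invented for illustration, and `cluster_sample` is a hypothetical helper, not something from the original files.

```python
import random

def cluster_sample(clusters, num_clusters, rng=None):
    """Single-stage cluster sample: pick clusters at random, keep every member."""
    rng = rng or random.Random()
    chosen = rng.sample(sorted(clusters), num_clusters)  # simple random sample of clusters
    households = [h for block in chosen for h in clusters[block]]
    return chosen, households

# Invented city blocks, each listing its households.
blocks = {
    "block_A": ["A1", "A2", "A3"],
    "block_B": ["B1", "B2"],
    "block_C": ["C1", "C2", "C3", "C4"],
    "block_D": ["D1"],
}
chosen, households = cluster_sample(blocks, 2, random.Random(7))
print(chosen, households)
```

Note that the number of households in the sample is not fixed in advance; it depends on which blocks happen to be chosen.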
  • 27. Example 8.4 An Introduction to Estimation
  • 28. AUDIT.XLS An internal auditor for a furniture retailer wants to estimate the average of all accounts receivable taken over the population of all customer accounts. The company has approximately 10,000 accounts. An exhaustive enumeration is impossible. Therefore, the auditor randomly samples 100 of the accounts. This file contains the observed data. What can the auditor conclude from this sample?
  • 30. Sources of Estimation Error There are two basic sources of error that can occur when we sample randomly from a population. Sampling error results from “unlucky” samples. Nonsampling errors, which are quite different, can occur for a variety of reasons. Nonresponse bias occurs when a portion of the sample fails to respond to the survey. Nontruthful responses are particularly a problem when respondents are asked sensitive questions. One solution is to use a randomized response technique by giving two questions: one sensitive, one innocuous. Measurement error occurs when the responses to the questions do not reflect what the investigator had in mind.
  • 31. Sampling Distribution of the Sample Mean We typically estimate the population mean by the sample mean of the randomly chosen sample. The sample mean is called a point estimate of the population mean. In general a point estimate of any population parameter is a single-value estimate of that parameter, based on observed sample data. The sampling error is the difference between the observed sample mean and the true population mean.
  • 32. Sampling Distribution of the Sample Mean A negative sampling error means an underestimate of the population mean. The standard deviation of the observed sample mean is called the standard error of the mean. The sample mean is an unbiased estimate of the population mean.
  • 33. Solution The receivables for the 100 sampled accounts appear in column E. We calculate the sample mean and the sample standard deviation. Then we calculate the (approximate) standard error of the mean with the formula =Sstdev/SQRT(SampSize) in cell B9.
  • 34. Interpretation The auditor should interpret these values as follows: The sample mean can be used to estimate the unknown population mean. It provides a best guess for the average of the receivables for the 10,000 accounts. The standard error provides a measure of accuracy. The auditor can be about 95% certain that the mean of all 10,000 accounts is within the interval $279 plus or minus $84, that is, between $195 and $363.
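The standard error and the interval can be reproduced outside Excel along these lines. Two assumptions are made here: the interval is taken to be the sample mean plus or minus two standard errors (the usual rule of thumb), which implies a standard error of about $42 for the auditor's data, and the short list of receivables is invented because the contents of AUDIT.XLS are not shown.

```python
import math
import statistics

def standard_error(values):
    """Approximate standard error of the mean: sample stdev / sqrt(sample size)."""
    return statistics.stdev(values) / math.sqrt(len(values))

def approx_95_interval(mean, std_err):
    """Mean plus or minus two standard errors (assumed rule of thumb)."""
    return mean - 2 * std_err, mean + 2 * std_err

# Invented receivables, standing in for column E of the spreadsheet.
receivables = [120.0, 310.5, 95.25, 640.0, 275.0, 410.75, 180.0, 330.0]
sample_mean = statistics.mean(receivables)
se = standard_error(receivables)
low, high = approx_95_interval(sample_mean, se)

# The slide's numbers: mean 279 and an inferred standard error of 42.
print(approx_95_interval(279, 42))  # (195, 363)
```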
  • 35. An Introduction to Estimation
  • 36. Background Information Suppose you have the opportunity to play a game with a “wheel of fortune”. When you spin a large wheel, it is equally likely to stop in any position. Depending on where it stops, you win anywhere from $0 to $1000. Let’s suppose your winnings are actually based on not one, but n spins of the wheel. If n = 2, your winnings are based on the average of two spins. How does the distribution of your winnings depend on n?
  • 37. Random Sampling? What does this experiment have to do with random sampling? Here, the population is the set of all outcomes we could obtain from a single spin of the wheel; that is, all dollar values from $0 to $1000. Each spin results in one randomly sampled dollar value from the population. Furthermore, because we have assumed that the wheel is equally likely to land in any position, all possible values in the continuum from $0 to $1000 have the same chance of occurring.
  • 38. Random Sampling? The resulting population distribution is called the uniform distribution on the interval from $0 to $1000. It can be shown that the mean and standard deviation are $500 and $289, respectively.
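The two values quoted above follow from the standard formulas for a uniform distribution on an interval from a to b, here with a = 0 and b = 1000:

```latex
\[
\mu = \frac{a+b}{2} = \frac{0+1000}{2} = 500, \qquad
\sigma = \frac{b-a}{\sqrt{12}} = \frac{1000}{\sqrt{12}} \approx 288.7
\]
```

Rounded to the nearest dollar, this gives the $289 stated on the slide.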
  • 39. SPIN1.XLS In order to analyze the distribution of winnings based on the average of n spins we need to do a sequence of simulations for n = 1, n = 2, n = 3, n = 6 and n = 10. This spreadsheet contains the simulation for n = 1. The other simulations can be found in the spreadsheets SPIN2.XLS, SPIN3.XLS, SPIN6.XLS, and SPIN10.XLS. For each simulation we consider 1000 replications of an experiment.
  • 40. Simulations The experiment simulates n spins of the wheel and calculates the average - that is, the winnings - from the n spins. Based on these 1000 replications, we can then calculate the average winnings, the standard deviation of winnings, and a histogram of winnings for each n . These will show clearly how the distribution of winnings depends on n . The following slide shows the results for n =1. Here, there is no averaging.
  • 41. [Figure: simulation results for n = 1]
  • 42. Simulations -- continued To replicate the experiment 1000 times and collect statistics, we proceed as follows. Random outcomes. To generate outcomes uniformly distributed between $0 and $1000 we enter the formula =$B$3+RAND()*($B$4-$B$3) in cell B11 and copy it down column B. The effect of this formula is to generate a random number between 0 and 1 and multiply it by $1000. Summary measures. Calculate the average and standard deviation of the 1000 winnings in column B with the AVERAGE and STDEV functions. These values appear in cells E4 and E5.
  • 43. Simulations -- continued Frequency table and histogram. Use the StatPro histogram procedure to create a histogram of the values in column B. Note the following from the chart and graph for n = 1: The sample mean of the winnings (cell E4) is very close to the population mean, $500. The standard deviation of the winnings (cell E5) is very close to the population standard deviation, $289. The histogram is nearly flat. These results should come as no surprise: with n = 1 no averaging takes place, so the distribution of winnings is equivalent to the flat population distribution.
  • 44. Simulations -- continued But what happens when n > 1? The following slide contains the chart and graph of the n = 2 simulation. To do this we formed a second column of outcomes in column C corresponding to a second spin in each experiment. We average the values in columns B and C to obtain each of the winnings in column D. The average of the winnings is very close to $500, but the standard deviation is much lower and the histogram is no longer flat.
  • 45. [Figure: simulation results for n = 2]
  • 46. Simulations -- continued The histogram is now triangular shaped - symmetric, but not yet bell shaped. To develop similar simulations for n = 3, n = 6, n = 10, or any other n, we insert additional outcome columns and make sure that the AVERAGE formula in the Winnings column averages all n outcomes to its left. The results clearly show two effects of increasing n: the histogram becomes more bell shaped, and there is less variability.
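The whole sequence of simulations can be approximated with a short Python script instead of separate spreadsheets. This is a sketch, not a reproduction of the SPIN files: the function name and the seed are arbitrary, and the printed numbers will differ from the spreadsheet values because the random draws differ.

```python
import random
import statistics

def simulate_winnings(n_spins, replications=1000, low=0.0, high=1000.0, rng=None):
    """Return one 'winnings' value (the average of n_spins spins) per replication."""
    rng = rng or random.Random()
    winnings = []
    for _ in range(replications):
        # Same idea as =$B$3+RAND()*($B$4-$B$3): uniform between low and high.
        spins = [low + rng.random() * (high - low) for _ in range(n_spins)]
        winnings.append(sum(spins) / n_spins)
    return winnings

rng = random.Random(2024)  # arbitrary seed, only for repeatability
results = {}
for n in (1, 2, 3, 6, 10):
    w = simulate_winnings(n, rng=rng)
    results[n] = (statistics.mean(w), statistics.stdev(w))
    print(f"n={n:2d}  mean={results[n][0]:7.1f}  stdev={results[n][1]:6.1f}")
```

The mean should stay near $500 for every n while the standard deviation shrinks as n grows; a histogram of each list would show the change in shape.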
  • 50. Central Limit Theorem The mean stays right at $500. This behavior is exactly as the central limit theorem predicts. For any population distribution with mean mu and standard deviation sigma, the sampling distribution of the sample mean is approximately normal with mean mu and standard deviation sigma/sqrt(n), and the approximation improves as n increases. In fact, because the population distribution is symmetric in this example - it’s flat - we see the effect of the theorem for n much less than 30; it is already evident for n as low as 6.
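As a rough check on the simulations, the central limit theorem gives the standard deviation of the average of n spins as the population standard deviation divided by the square root of n. Using the population value of about $289:

```latex
\[
\mathrm{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}}: \qquad
\frac{289}{\sqrt{2}} \approx 204, \quad
\frac{289}{\sqrt{3}} \approx 167, \quad
\frac{289}{\sqrt{6}} \approx 118, \quad
\frac{289}{\sqrt{10}} \approx 91
\]
```

The simulated standard deviations should land close to these values, though not exactly on them, since each is based on only 1000 replications.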
  • 51. An Introduction to Estimation
  • 52. Background Information A marketing researcher has been hired by a videocassette rental company to estimate the average number of videocassettes rented annually by households in a particular metropolitan area. The researcher decides to determine the sample size that makes the maximum probable absolute error approximately equal to 10. Discuss how she should proceed.
  • 53. Sample Size Determination The determination of sample size is usually driven by sampling error considerations. The usual procedure is to specify an acceptable sampling error B, called the maximum probable absolute error, and then find the sample size from the equation n = 4 sigma^2 / B^2. The implication is that if we randomly sample this many members from the population, then there is about a 95% chance that the resulting sampling error will be no greater than B in magnitude.
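The sample size equation can be derived by setting B equal to two standard errors of the sample mean and solving for n. The multiplier 2 is the usual rounding of the normal value 1.96; that it is the multiplier used here is inferred from the factor 4 in the spreadsheet formula =4*PopStDev^2/MaxAbsErr^2.

```latex
\[
B = \frac{2\sigma}{\sqrt{n}}
\quad\Longrightarrow\quad
n = \frac{4\sigma^2}{B^2}
\]
```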
  • 54. SAMPSIZE.XLS This file contains the data needed to solve the problem. The researcher has chosen the maximum probable absolute error criterion, with B = 10 as the value she is willing to tolerate. Therefore, she should use the maximum probable absolute error equation.
  • 55. Solution -- continued To use this equation she must estimate a value of sigma. Based on her knowledge of the industry and available historical data, she uses a best guess of sigma = 50. She then uses the values from C7 and C8 to find the required sample size in C10 with the formula =4*PopStDev^2/MaxAbsErr^2. Finally, she takes a sample of size 100 and observes the sample values shown in column F. Based on this sample, we calculate summary measures in the usual way in the range C13:C16.
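The calculation in cell C10 is easy to repeat in Python. In this sketch the function names are illustrative, rounding up to a whole household is an added choice, and the observed standard deviation of 55 in the second part is a made-up value used only to show what happens when the guess for sigma is too low.

```python
import math

def required_sample_size(sigma_guess, max_abs_error):
    """n = 4 * sigma^2 / B^2, rounded up to a whole number of households."""
    return math.ceil(4 * sigma_guess ** 2 / max_abs_error ** 2)

def realized_abs_error(sample_stdev, n):
    """Two standard errors, computed from the standard deviation actually observed."""
    return 2 * sample_stdev / math.sqrt(n)

n = required_sample_size(50, 10)
print(n)  # 100

# Made-up observed standard deviation, larger than the guess of 50.
print(realized_abs_error(55, n))  # 11.0
```

Because the required n grows with the square of sigma, an underestimate of sigma leaves the realized error above the target B.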
  • 57. Results The absolute error in cell C16 is 2 times as great as the standard error in cell C15. It is slightly higher than the maximum absolute error she specified in cell C8 because she observed a larger standard deviation than she had guessed. In other words, the fact that there is evidently more variation in the population than she thought makes her sample mean based on 100 households slightly less accurate than she intended.