SlideShare a Scribd company logo
Statistical Review Measures of Central Location
SALARY.XLS Lists starting salaries for 190 graduates from an undergraduate school of business. The data is in the range named  Salary  on a sheet called  Data.
The Mean We calculate the  mean  salary by entering the formula “ =AVERAGE(Salary) ” in cell B6 of the  Excel Functions  worksheet. The mean salary is $29,762. The mean in this example is a “representative” measure because the distribution of salaries is nearly symmetric. The mean can be misleading due to skewness.
The Median The  median  is the “middle” observation when the data are listed from smallest to largest. If there is an odd number of observations, the median is the middle observation. If there is an even number of observations, we take the median to be the average of the two middle observations.
The Median -- continued We calculate the median salary in Example 3.1 by entering the formula “ =MEDIAN(Salary)”  in cell B7 of the  Excel Functions  worksheet. The median in this example is $29,850. In this case, the mean and the median values are nearly the same because the distribution is approximately symmetric.
The Median -- continued If the salary distribution were skewed (for example, a few graduates received abnormally large salaries), the mean would be biased upward while the median would not be affected by the unusual values. Thus, it is better to use the median in characterizing the center of a distribution when that distribution is skewed.
The Mode The  mode  is the most frequently occurring value. If the values are essentially continuous, as with the salaries in Example 3.1, then the mode is essentially irrelevant.  There is typically no single value that occurs more than once. Thus, the mode is not likely to provide much information.
The Mode
SHOES.XLS This file lists shoe sizes purchased at a shoe store. We seek to find the best-selling shoe size at this store.
The Mode The  mode  is the most frequently occurring value. If the values are essentially continuous, then the mode is usually not relevant.  There is typically no  single  value that occurs more than once.
Why is the mode relevant here? Shoe sizes come in discrete increments, rather than a continuum; so it makes sense to find the mode, the size that is requested most often, in this example.
Computing the Mode in EXCEL The mode can be found in Excel by entering the formula “ =MODE(Range of Data)”. Applying this formula in the present example, we find that size 11 is the most frequently purchased shoe size. This is also apparent form the histogram on the next slide.
Distribution of Shoe Sizes
Measures of Variability: Variance and Standard Deviation
Suppose that Otis Elevator is going to stop manufacturing elevator rails.  Instead, it is going to buy them from an outside supplier. Otis would like each rail to have a diameter of 1 inch. The company has obtained samples of ten elevator rails from each supplier.  They are listed in columns A and B of this Excel file. OTIS4.XLS
Which should Otis prefer? Observe that the mean, median, and mode are all exactly 1 inch for each of the two suppliers. Based on these measures, the two suppliers are equally good and right on the mark. However, we when we consider measures of variability, supplier 1 is somewhat better than supplier 2.  Why?
Explanation The reason is that supplier 2’s rails exhibit  more variability  about the mean than do supplier 1’s rails. If we want rails to have a diameter of 1 inch, then more variability around the mean is very undesirable!
Variance The most commonly used measures of variability are the  variance  and  standard deviation . The variance is essentially the average of the squared deviations from the mean. We say “essentially” because there are two versions of the variance: the  population variance  and the  sample variance.
More on the Variance The variance tends to increase when there is more variability around the mean.  Indeed, large deviations from the mean contribute heavily to the variance because they are  squared. One consequence of this is that the variance is expressed in squared units (squared dollars, for example) rather than original units.
Standard Deviation A more intuitive measure of variability is the  standard deviation.  The standard deviation is defined to be the square root of the variance. Thus, the standard deviation is measured in  original  units, such as dollars, and it is much easier to interpret.
Computing Variance and Standard Deviation in Excel Excel has built-in functions for computing these measures of variability. The sample variances and standard deviations of the rail diameters from the suppliers in the present example can be found by entering the following formulas: “ =VAR(Supplier1) ” in cell E8 and “ =STDEV(Supplier1) ” in cell E9.
Computing Variances & Standard Deviations -- continued Of course, enter similar formulas for supplier 2 in cells F8 and F9. As we mentioned earlier, it is difficult to interpret the variances numerically because they are expressed in squared inches, not inches.  All we can say is that the variance from supplier 2 is considerably larger than the variance from supplier 1.
Interpretation of the Standard Deviation The standard deviations, on the other hand, are expressed in inches.  The standard deviation for supplier 1 is approximately 0.012 inch, and supplier 2’s standard deviation is approximately three times this large. This is quite a disparity.  Hence, Otis will prefer to buy rails from supplier 1.
Interpretation of the Standard Deviation: Rules of Thumb
DOW.XLS This file contains monthly closing prices for the Dow Jones Index from January 1947 through January 1993.  The monthly returns from the index are also shown starting with February 1947. Each return is the monthly percentage change (expressed) as a decimal) in the index. How well do the rules of thumb work for these data?
Rules of Thumb Many data sets follow “rules of thumb”. Approximately 68% of the observations are within one standard deviation of the mean. Approximately 95% of the observations are within two standard deviations of the mean. Approximately 99.7% - almost all - of the observations are within three standard deviations of the mean.
Index Time Series Plot A time series plot of the index show that the index has been increasingly fairly steadily over the period. Whenever a series indicates a clear trend such as the index does, most of the measures we have been discussing are less relevant. For example, the mean of the index for this period has at most historical interest. We are probably more interested in predicting the future of the Dow, and the historical mean has little relevance for predicting the future.
Time Series Plot of Dow Closing Index
Time Series Plot of Dow Returns
Return Time Series Plot A time series plot of the returns show no obvious trend over the period. The measures we have been discussing are relevant in discussing the series of returns, which fluctuate around a stable mean. We first calculate the mean and standard deviation of the returns by using the Excel functions AVERAGE and STDEV in cells B4 and B5. See the table on the next slide.
Rules of Thumb for Dow Jones Data
Returns -- continued The average return is 0.59% and the standard deviation of about 3.37%. Therefore, the rules of thumb (if they apply) imply, for example, that about 2/3 of all returns are within the interval 0.59%  +  3.37%, that is from -2.78% to 3.95%. In order to determine if the rules of thumb apply to these returns, we can use a frequency table.
Creating the Frequency Table We first enter the upper limits of the suitable categories in the range A8:A15. Any categories can be chosen but it is convenient to choose categories in which each breakpoint is one standard deviation higher than the previous one with the open-ended categories on either end are “more than 3 standard deviations from the mean”. Next we use the FREQUENCY function to fill in column C. “ =FREQUENCY(Returns,Bins)”
Frequency Table continued Finally, we use the frequencies in column C to calculate the actual percentage of return within  k  standard deviations of the mean for  k=1, k=2 and k=3  and we compare these with the percentages from the rules of thumb. The agreement between these percentages is not perfect - there are a few more observations within one standard deviation of the mean than the rule of thumb predicts - but in general the rules of thumb work quite well.
Obtaining Summary Measures with Add-Ins
SALARY.XLS Lists starting salaries for 190 graduates from an undergraduate school of business. The data is in the range named  Salary  on a sheet called  Data. We need to find a set of  useful summary measures for the salaries.
Summary Statistics To find the summary statistics of a set of data we can use the Stat-Pro Add-In or Excel’s Analysis ToolPak. In this example we use the Stat-Pro Add-In . Begin by placing the cursor anywhere within the data range. Then select StatPro/Summary Stats/One-Variable Summary Stats menu item. Select all variables you want to summarize, and select the summary measures you want to find from the Available Summary Measures dialog box shown on the next slide.
Available Summary Measures
About the Measures Four measures are selected by default. These are mean, median, Standard Deviation and Count. You can override these. A typical output appears here. It includes many of the  measures we have discussed  plus a few more.
About the Measures -- continued The mean absolute deviation is similar to the variance except that it is an average of the  absolute  (note squared) deviations from the mean. The kurtosis and skewness indicate the relative peakedness of the distribution and its skewness.  By clicking on any of the cells containing the measures (Column B), you’ll see that StatPro provides the formulas for the outputs. (Analysis ToolPak does not do so.)
About the Measures -- continued The effect of this is that if any of the data changes the summary measures we produced change automatically. All output is formatted as “numerical” to three decimal places by default. You can reformat them in a more appropriate manner if you would like.
Measures of Association: Covariance and Correlation
EXPENSES.XLS A survey questions members of 100 households about their spending habits. The data in this file represent the salary, expense for cultural activities, expense for sports-related activities, and the expense for dining-out for each household over the past year. Do these variables appear to be related linearly?
Covariance and Correlation When we need to summarize the relationship between two variables we can use the measures  covariance  and  correlation . We summarize the type of behavior observed in a scatterplot. Each measures the strength (and direction) of a  linear  relationship between two numerical variables. The relationship is “strong” if the points in a scatterplot cluster tightly around some straight line. If this line rises form left to right then the relationship is “positive”. If it falls from left to right then the relationship is “negative”.
Determining Linear Relationships Scatterplots of each variable versus each other would provide the answer to the question but six scatterplots would be required, one for each pair. To get a quick indication of possible linear relationships we can use Stat-Proto obtain a table of correlations and/or covariances.
Table of Correlations and Covariances To get the table, place the cursor anywhere in the data set and use the StatPro/Summary Stats/Correlations, Covariances menu item and proceed in the obvious way.
Relationships The only relationships that stand out are the positive relationships between salary and cultural expenses and between salary and dining expenses.  The negative relationships are between cultural and sports-related expenses. To confirm these graphically we show scatterplots of Salary versus Culture and Culture versus Sports
Scatterplot Indicating Positive Relationship
Scatterplot Indicating Negative Relationship
Correlation and Covariance Properties In general, the following properties are evident from the Table of correlations and covariances. The correlation between a variable and itself is 1. The correlation between X and Y is the same as the correlation between Y and X. Therefore, it is sufficient to list the correlations below (or above) the diagonal in the table. (The same is true for the covariances).  The covariance between a variable and itself is the variance of the variable. We indicate this in the heading of the covariance table.
Correlation and Covariance Properties -- continued It is difficult to interpret the magnitudes of covariances. These depend on the fact that the data are measured in dollars rather than, say, thousands of dollars. It is such easier to interpret the magnitudes of the correlations because they are scaled to be between -1 and +1.
Describing Data Sets with Boxplots
DOW.XLS This file lists the monthly returns on the Dow from February 1947 through January 1993. Use a boxplot to summarize the distribution of these returns.
Boxplots A boxplot is a very useful graphical method for summarizing data. Boxplots can be used in two ways: either to describe a single variable in a data set or to compare two (or more) variables. Excel has no boxplot option, but we included this option in the StatPro add-in.
Creating Boxplots Place the cursor anywhere within the data set, use the StatPro/Charts/BoxPlots(s) menu item and proceed in the obvious way. Eventually two sheets will be added to your workbook. One has a the boxplot chart, while the other contains summary measures used to form the boxplot. The following slides show the chart and the summary measure information.
Boxplot Chart
Boxplot Summary Measures
Keys to Understanding Boxplots The right and left of the box are at the third and first quartiles. Therefore, the length of the box equals the interquartile range (IQR), and the box itself represents the middle 50% of the observations. The height of the box has no significance. The vertical line inside the box indicates the location of the median. The point inside the box indicates the location of the mean.
Keys to Understanding Boxplots -- continued Horizontal lines are drawn from each side of the box. They extend to the most extreme observations tat are no farther than 1.5 IQRs from the box. They are useful for indicating variability and skewness. Observations farther than 1.5 IQRs from the box are shown as individual points. If they are between 1.5 IRQs and 3 IQRs from the box, they are call  mild outliers  and are hollow. Otherwise, they are called  extreme outliers  and are solid.
Dow Returns Boxplot The boxplot for this example summarizes the distribution of the returns. It implies that the Dow returns are approximately symmetric on each side of the median, although the mean is a bit below the median. Also there are a few mild outliers but no extreme outliers.
Describing Data Sets with Boxplots
ACTORS.XLS Recall that the salaries of famous actors and actresses are listed in this file. Use side-by-side boxplots to compare the salaries of male and female actors and actresses.
Side-by-Side Boxplots Boxplots are probably most useful for comparing two populations graphically. This is done using side-by-side boxplots. The data setup for this type of “comparison” can be in one of two forms: stacked or unstacked. Data are stacked if there is a “code” variable that designates which category each observation is in, and there is a single “measurement” variable that contains the data for both categories.
Side-by-Side Boxplots -- continued The data are unstacked if there is a separate data column for each category. In this example the data are stacked because Gender designates the gender associated with each observation and Salary is the single measurement variable. If the data were unstacked, for example, actors’ salaries would be in one column and actresses’ salaries would be in another.
Creating a Side-by-Side Boxplot Since the data are stacked the following must be steps must be followed to create the boxplots. Place cursor within the data set and select the StatPro/Charts/Boxplot(s) menu item. In the dialog box that opens check the “stacked” option. Then choose Gender as the code variable and Salary as the measurement variable. The resulting data in stacked form and the side-by-side boxplot appears.
Actor Data in Stacked Form
Side-by-Side Boxplot Chart
Reading the Boxplot It is clear the the female salary box is considerably to the left of the male salary box, although both have about the same IQR. Each boxplot has three indications that the salary distributions are skewed to the right: the means are larger than the medians the medians are closer to the left sides of the boxes than to the right sides the horizontal lines extend farther to the right than to the left of the boxes. However, there are no outliers.
Applying the Tools
Background Information The Spring Mills Company produces and distributes a wide variety of manufactured goods. Due to its variety, it has a a large number of customers. Spring Mills classifies these customers as small, medium and large, depending on the volume of business each does with them. Recently they have noticed a problem with accounts receivable. They are not getting paid by their customers in as timely a manner as they would like. This obviously costs them money.
RECEIVE.XLS Spring Mills has gathered data on 280 customer accounts.  For each of these accounts the data set lists three variables: Size, the size of the customer (coded 1 for small, 2 for medium, 3 for large) Days, the number of days since the customer was billed Amount, the amount the customer owes What information can we obtain from this data?
Analysis It is best to start by getting a good sense of the data. To do this we: Calculate several summary measures for the Days and Amount Create a histogram of Amount Create a scatterplot of Amount versus Days From these we determine: a positive skewness in the Amount variable (The mean is considerably larger than the median and the standard deviation of Amount is quite large). The scatterplot suggests some suspicious behavior, with two distinct groups of points.
Analysis -- continued The next step is to see whether the different customer sizes have any effect on either Days, Amount, or the relationship between Days and Amount. To do this, it is useful to unstack the Days and Amount variables - that is to create a new Days and Amounts variable for each group of customer sizes. For example, the Days and Amount variables for customers of size 1 are named Days1 and Amount1. This can be accomplished by using StatPro’s Unstack procedure but copying and pasting also work.
Analysis -- continued Once unstacked we need to calculate summary measures and a variety of charts on these unstacked variables. The charts include: Histograms of Amount for each size  customer Boxplots of days owed by different size customers Boxplots of amounts owed by different size customers Scatterplots of Amount versus Days for each size customer
Summary Measures for Combined Data
Histogram of All Amounts Owed
Scatterplot of Amount versus Days for All Customers
Summary Measures Broken Down by Size
Histogram for Small Customers
Histogram of Amount for Medium Customers
Histogram of Amount for Large Customers
Boxplots of Days Owed by Different Size Customers
Boxplots of Amounts Owed by Different Size Customers
Scatterplot of Amount versus Days for Small Companies
Scatterplot of Amount versus Days for Medium Companies
Scatterplot of Amount versus Days for Large Companies
Analysis -- continued There is obviously a lot going on here and it is evident form the charts. We point out the following: there are considerably fewer large customers than small or medium customers. the large customers tend to owe considerably more than small or medium customers. the small customers do not tend to be as long overdue as the small and medium customers. there is no relationship between Days and Amount for the small customers, but there is a definite positive relationship between these variables for the medium and large customers.
More Analysis We have done the obvious but here is still much more we can do.  For example, suppose Spring Mills wants a breakdown of customers who owe at least $500. We first create a new variable called “Large?” next to the original variables that equals 1 for all amounts greater than $500 and equals 0 otherwise. We do this by entering a formula in D6 and copying down.The next slide shows a sample of data and the formula to be used.
More Analysis - -continued We enter the formula =IF(C6>$B$3,1,0) in cell D6 and copy it down. We can then use a pivot table to create a count of the number of 1’s in this new variable for each value of the Size variable.
Pivot Tables for Counts of Customers Who Owe More than $500
More Analysis -- continued We created the pivot table twice, once showing the counts as percentages of each column, and once showing them as percentages of each row. One table shows that 73% of all customers with amounts less than $500 are small customers.The other table shows that 45% of all medium-sized customers owe more than $500. This type of analysis is often referred to as “slicing and dicing the data”. They are based on the same counts but portray them in different ways. Neither way is better; they both provide useful information.
More Analysis -- continued Finally, we investigate the amount of interest Spring Mills is losing by the delays in its customers’ payments. We assume that the company can make 12% annual interest on excess cash. Then we can create a Lost variable for each customer size that indicates the amount of interest Spring Mills loses on each customer group.
More Analysis The formula entered in cell C10 to calculate Lost1 is  =B10*A10*$C$7/365 . This is the amount owed by the number of days owed multiplied by the interest rate, divided by the number of days in a year. Then we copy this formula down and to the columns to calculate Lost2 and Lost3. Then we calculate the sums of these amounts in row 5. The next slide shows a sample of data at this point.
More Analysis -- continued Although Spring Mills is losing more per customer from the large customers, it is losing more total from the medium-sized customers - because there are more of them.
More Analysis -- continued This is shown graphically by a pie chart of the sums in row 5.
Findings If Spring Mills really wants to decrease receivables, it might want to target the medium-sized customer group, from which it is losing the most interest.  Or it could target the large customers because they owe the most on average.  The most appropriate action depends on the cost and effectiveness of targeting any particular customer group. However, the analysis presented here gives the company a much better picture of what’s currently going on.
Applying the Tools
Background Information The R&P Supermarket is open 24 hours a day, 7 days a week. Lately it has been receiving a lot of complaints from customers about excessive waiting in line for checking out.  R&P has decided to investigate this situation by gathering data on arrivals, departures, and line lengths at the checkout stations. It has collected data in half-hour increments for an entire week - 336 observations - starting at 8 am on Monday and ending at 8 am on the following Monday.
CHECKOUT.XLS This file includes the data collected on the following variables: InitialWaiting, the number waiting or being checked out at the beginning of a half-hour period Arrivals, the number of arrivals to the checkout stations during a period Departures, the number of checkout stations open during a period Checkers, the number of checkout stations open during a period
The data set also includes time variables: Day, day of week StartTime, clock time at the beginning of each half-hour period TimeInterval, a descriptive term for the time of day such as Lunch rush for 11:30 a.m. to 1:30 p.m. Finally the data set includes a calculated variable: EndWaiting, the number waiting or being checked out at the end of a half-hour period This variable for any time period equals InitialWaiting plus Arrivals minus Departures; it also equals InitialWaiting for the next period. CHECKOUT.XLS  -- continued
The Data
Analysis Information The manager of R&P wants to analyze the data to discover any trends, particularly in the pattern of arrivals throughout the day or across the entire week. Also the store currently uses a “seat-of-the-pants” approach to opening and closing checkout stations each half hour. The manager would like to see how well the current approach is working.  Of course, she would love to know the “best” strategy for opening and closing checkout stations - but this is beyond her (and our) capabilities at this point
Analysis Obviously time plays a crucial role in this data, so a good place to start is to create one or more time series plots. The time series plot of InitialWaiting and Arrivals Variables shown on the next slide shows that: Fridays and Saturdays are the busiest days the time pattern of arrivals is somewhat different - more spread out - during the weekends than during the weekdays there are fairly regular peak arrival periods during the weekdays the number waiting is sometimes as large as 10 or 20, and the largest of these tend to be around peak arrival times
Time Series Plot of Initial Waiting and Arrivals Variables
Analysis -- continued A similar time series plot shown below shows Arrivals and Departures.
Analysis -- continued On this plot it is difficult to separate the two time series; they are practically on top of each other. Perhaps this is not so bad because this tells us that the store is checking out customers approximately as quickly as they are arriving. A somewhat more efficient way to understand the time series behavior is to use pivot tables.
Analysis - continued A pivot table for Average InitialWaiting by Hour of Day is generated. This table can be seen on the next slide. To create this table, we: drag the InitialWaiting variable to the Data area, express it as an average drag StartTime variable to the Row area drag the Day variable to the Page area Finally, we use the data in the pivot table to create a time series plot.
Average Initial Waiting by Hour of Day
Analysis -- continued Note how the variable in the Page area works. Depending on which day we select in the page area the corresponding table and time series plot changes. Similarly, a pivot table can be created, accompanied by a column chart. Average Arrivals by TimeInterval of Day can be analyzed with a pivot table by dragging Arrival to the Data area, expressing it as an average, dragging the TimeInterval variable to the Row  area, and dragging the day variable to the Page area. This table is is on the next slide. You can check for the patterns on each day.
Average Arrivals by Time Interval of Days
More Analysis The manager of R&P is ultimately interested in whether the “right” number of checkout stations are available throughout the day. We can create two scatterplots to provide some evidence. The plots follow this slide. The first of these would be a scatterplot of Checkers versus TotalCustomers. The TotalCustomers variable is calculated as the sum of InitialWaiting and the Arrivals to measure the total amount of work presented to the checkout stations in any half-hour period.
Scatterplot of Checkers versus Total Customers
Scatterplot of End Waiting versus Checkers
More Analysis -- continued From the scatterplot we can see an obvious positive relationship between these two variables. Evidently management is reacting as they should - it is opening more checkout stations when there is more traffic. The second scatterplot shows EndWaiting versus Checkers. There is again a definite upward trend. Periods when more checkout stations are open tend to be associated with periods where more customers still remain in the checkout process.
More Analysis - continued Presumably, management is reacting with more open checkout stations in busy periods, but it is not reacting strongly enough. Just fiddling with the numbers in the Checker column will not solve the manager's problems.
Conclusions Two problems exist. First, there is a trade-off between the “cost” of having customers wait in line and the cost of paying extra checkout people. This is a difficult tradeoff for any supermarket manager. Second, the number of departures is clearly related to the number of checkout stations open. Therefore it doesn’t make sense to change the numbers in the Checkers column without changing the numbers in the Departures (and hence InitialWaiting and EndWaiting) column in an appropriate way. This is not an easy problem.
Applying the Tools
Background Information The HyTex Company is a direct marketer of stereophonic equipment, personal computers and other electronic products. HyTex advertises entirely by mailing catalogs to its customers, and all of its orders are taken over the telephone. The company spends a great deal of money on its catalog mailings, and it wants to be sure that this is paying off in sales.
CATALOGS.XLS This file contains the data that has been collected on 1000 customers at the end of the current year. For each customer it has data on the following variables: Age: coded as 1 for 30 or younger, 2 for 31 to 55, 3 for 56 and older. Gender: coded as 1 for males, 2 for females OwnHome: coded as 1 if customer owns a home, 2 otherwise Married: coded as 1 if customer is currently married, 2 otherwise
CATALOGS.XLS  -- continued Close: coded as 1 if customers lives reasonably close to a shopping area that sells similar merchandise, 2 otherwise Salary: combined annual salary of customer and spouse (if any) Children: number of children living with customer History: coded as “NA” if customer had no dealings with the company before this year, 1 if customer was a low spending customer last year, 2 if medium-spending, 3 if high-spending Catalogs: Number of catalogs sent to the customer this year AmountSpent: Total amount of purchases made by the customer this year
Analysis Desired HyTex wants to analyze these data carefully to understand its customers better. They want to see whether they are sending the catalogs to the right customers. Each customer either receives 6, 12, 18, or 24 catalogs through the mail. Currently which customer receives which amount is not thought out carefully. They want to know if the current distribution of catalogs is effective. Is there room for improvement?
Analysis HyTex is obviously interested in the AmountSpent variable. Therefore, it makes sense to create scatterplots of AmountSpent versus selected “explanatory” variables. First we create the scatterplot shown on the next slide showing AmountSpent versus Salary.  It is clear that customers with higher salaries tend to spend more, although the variability in amounts spent increases significantly as salary increases.
Scatterplot of Amount Spent versus Salary
Analysis -- continued Second, we create a scatterplot of AmountSpent versus Catalogs This plot shows that there is some tendency toward higher spending among customers who receive more catalogs.  But do the catalogs cause more spending, or are more catalogs sent to customers who would tend to spend more anyway? There is no way to answer this with this data. Next is the scatterplot of AmountSpent versus Children This plot shows the interesting tendency of customers with more children to spend less.
Scatterplot of Amount Spent versus Catalogs
Scatterplot of Amount Spent versus Children
More Analysis Pivot tables and accompanying charts are very useful in this type of situation. We can begin by using this technique to get a better understanding of the demographics of the customers. The first pivot table and chart shows the percentage of an age group who own homes. Using the pivot table we can see how these percentages change for women, unmarried men and so on. Specifically small percentages of the younger people own their own hoe, regardless of marital status or gender.
Percent of Home Owners versus Age, Married and Gender
More Analysis -- continued The second pivot table shows the percentage of each age group who are married, for any combination of the Gender and OwnHome variables. For example, we can check that the married/unmarried split is quite different for women who don’t own a home than for male home owners. The third pivot table shows the average Salary broken down by Age and Gender, with page variables for OwnHome and Married.
Percent of Married versus Age, OwnHome and Gender
Average Salary versus Age, Gender, Married and OwnHome
More Analysis --continued The shape of the resulting charts is practically the same for any combination of the page variables. However, the heights of the bars change appreciably. For example,  salaries are higher for the married home owners than for unmarried customers who are not home owners. Another pivot table and chart break the data down in another way. Each column in the pivot table shows the percentages in the various History categories for a particular number of children. Each of these columns corresponds to one of the bars in the “stacked” bar chart. We have also used Close as a page variable.
Percentages in History Categories versus Children and Close
More Analysis -- continued Two interesting points emerge from this plot. First, customers with more children tend to be more heavily represented in the low-spending History category. Second, you can check by changing the setting of the Close variable from 1 to 2, the percentage of high spenders among customers who live far from electronic stores is much higher than for those who live close to such stores. The next pivot table provide insight into how HyTex determined its catalog mailing distributions. Each row shows a percentage of a particular History category that were sent 6,12,18 or 24 catalogs.
Catalog Distribution versus History
More Analysis -- continued The company’s distribution policy is still unclear but we can see that it definitely sends more catalogs to high spending customers and fewer to low spending customers. Finally the last pivot table shows AmountSpent versus History and Catalogs, with a variety of demographic variables in the page area. There are so many possible combinations so it can be difficult to discover all the existing patterns
Average Amount Spent versus History, Catalogs and Demographic Variables
More Analysis -- continued One things stands out loud and clear from the graph: the more customers receive, the more they tend to spend. Also, if they were large spenders last year, they tend to be large spenders this year. In a pivot table with this many combinations there will almost certainly be some combinations with no observations.

More Related Content

PDF
Fibonacci retracement and expansion patterns
PPT
Lecture7a Applied Econometrics and Economic Modeling
PDF
MS Excel 2010 tutorial 5
PDF
Manual for Troubleshooting Formulas & Functions in Excel
PDF
Forecasting Stock Market using Multiple Linear Regression
PDF
Introduction to charting techniques
PDF
Computer Applications in Business
PPT
stats
Fibonacci retracement and expansion patterns
Lecture7a Applied Econometrics and Economic Modeling
MS Excel 2010 tutorial 5
Manual for Troubleshooting Formulas & Functions in Excel
Forecasting Stock Market using Multiple Linear Regression
Introduction to charting techniques
Computer Applications in Business
stats

What's hot (19)

PPT
Percentiles and Deciles
PDF
Biology statistics made_simple_using_excel
PPT
Linear regression
PDF
Ch4 notes for students
PDF
Commonly used excel formulas
PPT
Linear regression
DOC
Turn left or right: How the economy affects political preferences and media c...
PPSX
Advances in ms excel
PDF
Excel.useful fns
PPT
Chapter 03
PPTX
Chapter 2
PDF
Describing and exploring data
PPT
Exp2003 exl ppt_04
PDF
Econometrics notes (Introduction, Simple Linear regression, Multiple linear r...
PDF
Statistical-Process-Control-Analysis-Unraveled_updated210
PDF
3. frequency distribution
PDF
Advanced Microsoft Excel
PPT
Graphical presentation of data
PPTX
Correlation analysis
Percentiles and Deciles
Biology statistics made_simple_using_excel
Linear regression
Ch4 notes for students
Commonly used excel formulas
Linear regression
Turn left or right: How the economy affects political preferences and media c...
Advances in ms excel
Excel.useful fns
Chapter 03
Chapter 2
Describing and exploring data
Exp2003 exl ppt_04
Econometrics notes (Introduction, Simple Linear regression, Multiple linear r...
Statistical-Process-Control-Analysis-Unraveled_updated210
3. frequency distribution
Advanced Microsoft Excel
Graphical presentation of data
Correlation analysis
Ad

Viewers also liked (20)

PPTX
Ias 18 and ifrs 5
PDF
IAS 38 Intangible Assets made easy
PPTX
IAS 40 Investment Property
PPTX
Chapter 8 - Accounting for Receivables
PPT
Non current assets held for sale and discontinued
PDF
IAS 7 statement of cash flows
PPT
Ias 7 Cash Flow Statements
PPT
My regression lecture mk3 (uploaded to web ct)
PDF
IAS 38 Intangible Assets
PPT
Intangible assets ias 38
PPTX
Property, Plant & Equipment IAS 16
PPT
PPT Earnings Per Share IAS 33
PPTX
Ias 32
PPT
multiple regression
PPT
Chapter 4 - multiple regression
PPT
Inventories – ias 2
PPSX
Ias 20
PPT
Mobilink Presentation
PPTX
Financial accounting concepts
PPTX
Topic 7 brs
Ias 18 and ifrs 5
IAS 38 Intangible Assets made easy
IAS 40 Investment Property
Chapter 8 - Accounting for Receivables
Non current assets held for sale and discontinued
IAS 7 statement of cash flows
Ias 7 Cash Flow Statements
My regression lecture mk3 (uploaded to web ct)
IAS 38 Intangible Assets
Intangible assets ias 38
Property, Plant & Equipment IAS 16
PPT Earnings Per Share IAS 33
Ias 32
multiple regression
Chapter 4 - multiple regression
Inventories – ias 2
Ias 20
Mobilink Presentation
Financial accounting concepts
Topic 7 brs
Ad

Similar to Lecture2 Applied Econometrics and Economic Modeling (20)

PDF
Evans_Analyticsjjjjjjjjjjjjjj2e_ppt_04 (1).pdf
PPT
Stat11t chapter3
PPTX
Lecture 3 Dispersion(1).pptx
PPTX
measures of dispersion in mba chapeter two
PDF
PG STAT 531 Lecture 2 Descriptive statistics
PDF
Module 1_Theory.pdf
PPTX
Measures of dispersion qt pgdm 1st trisemester
PPT
Statistics-Measures of dispersions
PDF
Lesson 5.pdf ....probability and statistics
PPTX
3. BIOSTATISTICS III measures of central tendency and dispersion by SM - Cop...
PPTX
measures of central tendency.pptx
PPT
Measures of dispersion by Prof Najeeb Memon BMC lumhs jamshoro
PDF
Excel and research
PPTX
Statistics-1 : The Basics of Statistics
PPT
2. measures of dis[persion
PPT
Measure of Dispersion - Grade 8 Statistics.ppt
PDF
Research Method for Business chapter 12
PPT
5.DATA SUMMERISATION.ppt
PPTX
Biosttistics for ayurveda students and yoga students
Evans_Analyticsjjjjjjjjjjjjjj2e_ppt_04 (1).pdf
Stat11t chapter3
Lecture 3 Dispersion(1).pptx
measures of dispersion in mba chapeter two
PG STAT 531 Lecture 2 Descriptive statistics
Module 1_Theory.pdf
Measures of dispersion qt pgdm 1st trisemester
Statistics-Measures of dispersions
Lesson 5.pdf ....probability and statistics
3. BIOSTATISTICS III measures of central tendency and dispersion by SM - Cop...
measures of central tendency.pptx
Measures of dispersion by Prof Najeeb Memon BMC lumhs jamshoro
Excel and research
Statistics-1 : The Basics of Statistics
2. measures of dis[persion
Measure of Dispersion - Grade 8 Statistics.ppt
Research Method for Business chapter 12
5.DATA SUMMERISATION.ppt
Biosttistics for ayurveda students and yoga students

More from stone55 (11)

PPT
excel master series-Anova in-excel-to-improve-marketing
PPT
Lecture6 Applied Econometrics and Economic Modeling
PPT
Lecture 4 Applied Econometrics and Economic Modeling
PPT
Lecture7b Applied Econometrics and Economic Modeling
PPT
Lecture5 Applied Econometrics and Economic Modeling
PPT
Lecture3 Applied Econometrics and Economic Modeling
PPT
Lecture8 Applied Econometrics and Economic Modeling
PPT
lecture 1 applied econometrics and economic modeling
PPT
Consumer credit-risk3440
PPT
Summer 07-mfin7011-tang1922
PDF
smoothwall networkguide
excel master series-Anova in-excel-to-improve-marketing
Lecture6 Applied Econometrics and Economic Modeling
Lecture 4 Applied Econometrics and Economic Modeling
Lecture7b Applied Econometrics and Economic Modeling
Lecture5 Applied Econometrics and Economic Modeling
Lecture3 Applied Econometrics and Economic Modeling
Lecture8 Applied Econometrics and Economic Modeling
lecture 1 applied econometrics and economic modeling
Consumer credit-risk3440
Summer 07-mfin7011-tang1922
smoothwall networkguide

Recently uploaded (20)

PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PDF
01-Introduction-to-Information-Management.pdf
PDF
RMMM.pdf make it easy to upload and study
PDF
Sports Quiz easy sports quiz sports quiz
PDF
Insiders guide to clinical Medicine.pdf
PDF
Computing-Curriculum for Schools in Ghana
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PDF
Complications of Minimal Access Surgery at WLH
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PPTX
Microbial diseases, their pathogenesis and prophylaxis
PDF
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
PPTX
Cell Structure & Organelles in detailed.
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PPTX
PPH.pptx obstetrics and gynecology in nursing
PPTX
Final Presentation General Medicine 03-08-2024.pptx
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
Abdominal Access Techniques with Prof. Dr. R K Mishra
01-Introduction-to-Information-Management.pdf
RMMM.pdf make it easy to upload and study
Sports Quiz easy sports quiz sports quiz
Insiders guide to clinical Medicine.pdf
Computing-Curriculum for Schools in Ghana
O5-L3 Freight Transport Ops (International) V1.pdf
Complications of Minimal Access Surgery at WLH
2.FourierTransform-ShortQuestionswithAnswers.pdf
Microbial diseases, their pathogenesis and prophylaxis
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
Cell Structure & Organelles in detailed.
STATICS OF THE RIGID BODIES Hibbelers.pdf
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PPH.pptx obstetrics and gynecology in nursing
Final Presentation General Medicine 03-08-2024.pptx

Lecture2 Applied Econometrics and Economic Modeling

  • 1. Statistical Review Measures of Central Location
  • 2. SALARY.XLS Lists starting salaries for 190 graduates from an undergraduate school of business. The data is in the range named Salary on a sheet called Data.
  • 3. The Mean We calculate the mean salary by entering the formula “ =AVERAGE(Salary) ” in cell B6 of the Excel Functions worksheet. The mean salary is $29,762. The mean in this example is a “representative” measure because the distribution of salaries is nearly symmetric. The mean can be misleading due to skewness.
  • 4. The Median The median is the “middle” observation when the data are listed from smallest to largest. If there is an odd number of observations, the median is the middle observation. If there is an even number of observations, we take the median to be the average of the two middle observations.
  • 5. The Median -- continued We calculate the median salary in Example 3.1 by entering the formula “ =MEDIAN(Salary)” in cell B7 of the Excel Functions worksheet. The median in this example is $29,850. In this case, the mean and the median values are nearly the same because the distribution is approximately symmetric.
  • 6. The Median -- continued If the salary distribution were skewed (for example, a few graduates received abnormally large salaries), the mean would be biased upward while the median would not be affected by the unusual values. Thus, it is better to use the median in characterizing the center of a distribution when that distribution is skewed.
  • 7. The Mode The mode is the most frequently occurring value. If the values are essentially continuous, as with the salaries in Example 3.1, then the mode is essentially irrelevant. There is typically no single value that occurs more than once. Thus, the mode is not likely to provide much information.
  • 9. SHOES.XLS This file lists shoe sizes purchased at a shoe store. We seek to find the best-selling shoe size at this store.
  • 10. The Mode The mode is the most frequently occurring value. If the values are essentially continuous, then the mode is usually not relevant. There is typically no single value that occurs more than once.
  • 11. Why is the mode relevant here? Shoe sizes come in discrete increments, rather than a continuum; so it makes sense to find the mode, the size that is requested most often, in this example.
  • 12. Computing the Mode in EXCEL The mode can be found in Excel by entering the formula “ =MODE(Range of Data)”. Applying this formula in the present example, we find that size 11 is the most frequently purchased shoe size. This is also apparent form the histogram on the next slide.
  • 14. Measures of Variability: Variance and Standard Deviation
  • 15. Suppose that Otis Elevator is going to stop manufacturing elevator rails. Instead, it is going to buy them from an outside supplier. Otis would like each rail to have a diameter of 1 inch. The company has obtained samples of ten elevator rails from each supplier. They are listed in columns A and B of this Excel file. OTIS4.XLS
  • 16. Which should Otis prefer? Observe that the mean, median, and mode are all exactly 1 inch for each of the two suppliers. Based on these measures, the two suppliers are equally good and right on the mark. However, we when we consider measures of variability, supplier 1 is somewhat better than supplier 2. Why?
  • 17. Explanation The reason is that supplier 2’s rails exhibit more variability about the mean than do supplier 1’s rails. If we want rails to have a diameter of 1 inch, then more variability around the mean is very undesirable!
  • 18. Variance The most commonly used measures of variability are the variance and standard deviation . The variance is essentially the average of the squared deviations from the mean. We say “essentially” because there are two versions of the variance: the population variance and the sample variance.
  • 19. More on the Variance The variance tends to increase when there is more variability around the mean. Indeed, large deviations from the mean contribute heavily to the variance because they are squared. One consequence of this is that the variance is expressed in squared units (squared dollars, for example) rather than original units.
  • 20. Standard Deviation A more intuitive measure of variability is the standard deviation. The standard deviation is defined to be the square root of the variance. Thus, the standard deviation is measured in original units, such as dollars, and it is much easier to interpret.
  • 21. Computing Variance and Standard Deviation in Excel Excel has built-in functions for computing these measures of variability. The sample variances and standard deviations of the rail diameters from the suppliers in the present example can be found by entering the following formulas: “ =VAR(Supplier1) ” in cell E8 and “ =STDEV(Supplier1) ” in cell E9.
  • 22. Computing Variances & Standard Deviations -- continued Of course, enter similar formulas for supplier 2 in cells F8 and F9. As we mentioned earlier, it is difficult to interpret the variances numerically because they are expressed in squared inches, not inches. All we can say is that the variance from supplier 2 is considerably larger than the variance from supplier 1.
  • 23. Interpretation of the Standard Deviation The standard deviations, on the other hand, are expressed in inches. The standard deviation for supplier 1 is approximately 0.012 inch, and supplier 2’s standard deviation is approximately three times this large. This is quite a disparity. Hence, Otis will prefer to buy rails from supplier 1.
  • 24. Interpretation of the Standard Deviation: Rules of Thumb
  • 25. DOW.XLS This file contains monthly closing prices for the Dow Jones Index from January 1947 through January 1993. The monthly returns from the index are also shown starting with February 1947. Each return is the monthly percentage change (expressed) as a decimal) in the index. How well do the rules of thumb work for these data?
  • 26. Rules of Thumb Many data sets follow “rules of thumb”. Approximately 68% of the observations are within one standard deviation of the mean. Approximately 95% of the observations are within two standard deviations of the mean. Approximately 99.7% - almost all - of the observations are within three standard deviations of the mean.
  • 27. Index Time Series Plot A time series plot of the index show that the index has been increasingly fairly steadily over the period. Whenever a series indicates a clear trend such as the index does, most of the measures we have been discussing are less relevant. For example, the mean of the index for this period has at most historical interest. We are probably more interested in predicting the future of the Dow, and the historical mean has little relevance for predicting the future.
  • 28. Time Series Plot of Dow Closing Index
  • 29. Time Series Plot of Dow Returns
  • 30. Return Time Series Plot A time series plot of the returns show no obvious trend over the period. The measures we have been discussing are relevant in discussing the series of returns, which fluctuate around a stable mean. We first calculate the mean and standard deviation of the returns by using the Excel functions AVERAGE and STDEV in cells B4 and B5. See the table on the next slide.
  • 31. Rules of Thumb for Dow Jones Data
  • 32. Returns -- continued The average return is 0.59% and the standard deviation of about 3.37%. Therefore, the rules of thumb (if they apply) imply, for example, that about 2/3 of all returns are within the interval 0.59% + 3.37%, that is from -2.78% to 3.95%. In order to determine if the rules of thumb apply to these returns, we can use a frequency table.
  • 33. Creating the Frequency Table We first enter the upper limits of the suitable categories in the range A8:A15. Any categories can be chosen but it is convenient to choose categories in which each breakpoint is one standard deviation higher than the previous one with the open-ended categories on either end are “more than 3 standard deviations from the mean”. Next we use the FREQUENCY function to fill in column C. “ =FREQUENCY(Returns,Bins)”
  • 34. Frequency Table continued Finally, we use the frequencies in column C to calculate the actual percentage of return within k standard deviations of the mean for k=1, k=2 and k=3 and we compare these with the percentages from the rules of thumb. The agreement between these percentages is not perfect - there are a few more observations within one standard deviation of the mean than the rule of thumb predicts - but in general the rules of thumb work quite well.
  • 36. SALARY.XLS Lists starting salaries for 190 graduates from an undergraduate school of business. The data is in the range named Salary on a sheet called Data. We need to find a set of useful summary measures for the salaries.
  • 37. Summary Statistics To find the summary statistics of a set of data we can use the Stat-Pro Add-In or Excel’s Analysis ToolPak. In this example we use the Stat-Pro Add-In . Begin by placing the cursor anywhere within the data range. Then select StatPro/Summary Stats/One-Variable Summary Stats menu item. Select all variables you want to summarize, and select the summary measures you want to find from the Available Summary Measures dialog box shown on the next slide.
  • 39. About the Measures Four measures are selected by default. These are mean, median, Standard Deviation and Count. You can override these. A typical output appears here. It includes many of the measures we have discussed plus a few more.
  • 40. About the Measures -- continued The mean absolute deviation is similar to the variance except that it is an average of the absolute (note squared) deviations from the mean. The kurtosis and skewness indicate the relative peakedness of the distribution and its skewness. By clicking on any of the cells containing the measures (Column B), you’ll see that StatPro provides the formulas for the outputs. (Analysis ToolPak does not do so.)
  • 41. About the Measures -- continued The effect of this is that if any of the data changes the summary measures we produced change automatically. All output is formatted as “numerical” to three decimal places by default. You can reformat them in a more appropriate manner if you would like.
  • 42. Measures of Association: Covariance and Correlation
  • 43. EXPENSES.XLS A survey questions members of 100 households about their spending habits. The data in this file represent the salary, expense for cultural activities, expense for sports-related activities, and the expense for dining-out for each household over the past year. Do these variables appear to be related linearly?
  • 44. Covariance and Correlation When we need to summarize the relationship between two variables we can use the measures covariance and correlation . We summarize the type of behavior observed in a scatterplot. Each measures the strength (and direction) of a linear relationship between two numerical variables. The relationship is “strong” if the points in a scatterplot cluster tightly around some straight line. If this line rises form left to right then the relationship is “positive”. If it falls from left to right then the relationship is “negative”.
  • 45. Determining Linear Relationships Scatterplots of each variable versus each other would provide the answer to the question but six scatterplots would be required, one for each pair. To get a quick indication of possible linear relationships we can use Stat-Proto obtain a table of correlations and/or covariances.
  • 46. Table of Correlations and Covariances To get the table, place the cursor anywhere in the data set and use the StatPro/Summary Stats/Correlations, Covariances menu item and proceed in the obvious way.
  • 47. Relationships The only relationships that stand out are the positive relationships between salary and cultural expenses and between salary and dining expenses. The negative relationships are between cultural and sports-related expenses. To confirm these graphically we show scatterplots of Salary versus Culture and Culture versus Sports
  • 50. Correlation and Covariance Properties In general, the following properties are evident from the Table of correlations and covariances. The correlation between a variable and itself is 1. The correlation between X and Y is the same as the correlation between Y and X. Therefore, it is sufficient to list the correlations below (or above) the diagonal in the table. (The same is true for the covariances). The covariance between a variable and itself is the variance of the variable. We indicate this in the heading of the covariance table.
  • 51. Correlation and Covariance Properties -- continued It is difficult to interpret the magnitudes of covariances. These depend on the fact that the data are measured in dollars rather than, say, thousands of dollars. It is such easier to interpret the magnitudes of the correlations because they are scaled to be between -1 and +1.
  • 52. Describing Data Sets with Boxplots
  • 53. DOW.XLS This file lists the monthly returns on the Dow from February 1947 through January 1993. Use a boxplot to summarize the distribution of these returns.
  • 54. Boxplots A boxplot is a very useful graphical method for summarizing data. Boxplots can be used in two ways: either to describe a single variable in a data set or to compare two (or more) variables. Excel has no boxplot option, but we included this option in the StatPro add-in.
  • 55. Creating Boxplots Place the cursor anywhere within the data set, use the StatPro/Charts/BoxPlots(s) menu item and proceed in the obvious way. Eventually two sheets will be added to your workbook. One has a the boxplot chart, while the other contains summary measures used to form the boxplot. The following slides show the chart and the summary measure information.
  • 58. Keys to Understanding Boxplots The right and left of the box are at the third and first quartiles. Therefore, the length of the box equals the interquartile range (IQR), and the box itself represents the middle 50% of the observations. The height of the box has no significance. The vertical line inside the box indicates the location of the median. The point inside the box indicates the location of the mean.
  • 59. Keys to Understanding Boxplots -- continued Horizontal lines are drawn from each side of the box. They extend to the most extreme observations tat are no farther than 1.5 IQRs from the box. They are useful for indicating variability and skewness. Observations farther than 1.5 IQRs from the box are shown as individual points. If they are between 1.5 IRQs and 3 IQRs from the box, they are call mild outliers and are hollow. Otherwise, they are called extreme outliers and are solid.
  • 60. Dow Returns Boxplot The boxplot for this example summarizes the distribution of the returns. It implies that the Dow returns are approximately symmetric on each side of the median, although the mean is a bit below the median. Also there are a few mild outliers but no extreme outliers.
  • 61. Describing Data Sets with Boxplots
  • 62. ACTORS.XLS Recall that the salaries of famous actors and actresses are listed in this file. Use side-by-side boxplots to compare the salaries of male and female actors and actresses.
  • 63. Side-by-Side Boxplots Boxplots are probably most useful for comparing two populations graphically. This is done using side-by-side boxplots. The data setup for this type of “comparison” can be in one of two forms: stacked or unstacked. Data are stacked if there is a “code” variable that designates which category each observation is in, and there is a single “measurement” variable that contains the data for both categories.
  • 64. Side-by-Side Boxplots -- continued The data are unstacked if there is a separate data column for each category. In this example the data are stacked because Gender designates the gender associated with each observation and Salary is the single measurement variable. If the data were unstacked, for example, actors’ salaries would be in one column and actresses’ salaries would be in another.
  • 65. Creating a Side-by-Side Boxplot Since the data are stacked the following must be steps must be followed to create the boxplots. Place cursor within the data set and select the StatPro/Charts/Boxplot(s) menu item. In the dialog box that opens check the “stacked” option. Then choose Gender as the code variable and Salary as the measurement variable. The resulting data in stacked form and the side-by-side boxplot appears.
  • 66. Actor Data in Stacked Form
  • 68. Reading the Boxplot It is clear the the female salary box is considerably to the left of the male salary box, although both have about the same IQR. Each boxplot has three indications that the salary distributions are skewed to the right: the means are larger than the medians the medians are closer to the left sides of the boxes than to the right sides the horizontal lines extend farther to the right than to the left of the boxes. However, there are no outliers.
  • 70. Background Information The Spring Mills Company produces and distributes a wide variety of manufactured goods. Due to its variety, it has a a large number of customers. Spring Mills classifies these customers as small, medium and large, depending on the volume of business each does with them. Recently they have noticed a problem with accounts receivable. They are not getting paid by their customers in as timely a manner as they would like. This obviously costs them money.
  • 71. RECEIVE.XLS Spring Mills has gathered data on 280 customer accounts. For each of these accounts the data set lists three variables: Size, the size of the customer (coded 1 for small, 2 for medium, 3 for large) Days, the number of days since the customer was billed Amount, the amount the customer owes What information can we obtain from this data?
  • 72. Analysis It is best to start by getting a good sense of the data. To do this we: Calculate several summary measures for the Days and Amount Create a histogram of Amount Create a scatterplot of Amount versus Days From these we determine: a positive skewness in the Amount variable (The mean is considerably larger than the median and the standard deviation of Amount is quite large). The scatterplot suggests some suspicious behavior, with two distinct groups of points.
  • 73. Analysis -- continued The next step is to see whether the different customer sizes have any effect on either Days, Amount, or the relationship between Days and Amount. To do this, it is useful to unstack the Days and Amount variables - that is to create a new Days and Amounts variable for each group of customer sizes. For example, the Days and Amount variables for customers of size 1 are named Days1 and Amount1. This can be accomplished by using StatPro’s Unstack procedure but copying and pasting also work.
  • 74. Analysis -- continued Once unstacked we need to calculate summary measures and a variety of charts on these unstacked variables. The charts include: Histograms of Amount for each size customer Boxplots of days owed by different size customers Boxplots of amounts owed by different size customers Scatterplots of Amount versus Days for each size customer
  • 75. Summary Measures for Combined Data
  • 76. Histogram of All Amounts Owed
  • 77. Scatterplot of Amount versus Days for All Customers
  • 78. Summary Measures Broken Down by Size
  • 79. Histogram for Small Customers
  • 80. Histogram of Amount for Medium Customers
  • 81. Histogram of Amount for Large Customers
  • 82. Boxplots of Days Owed by Different Size Customers
  • 83. Boxplots of Amounts Owed by Different Size Customers
  • 84. Scatterplot of Amount versus Days for Small Companies
  • 85. Scatterplot of Amount versus Days for Medium Companies
  • 86. Scatterplot of Amount versus Days for Large Companies
  • 87. Analysis -- continued There is obviously a lot going on here and it is evident form the charts. We point out the following: there are considerably fewer large customers than small or medium customers. the large customers tend to owe considerably more than small or medium customers. the small customers do not tend to be as long overdue as the small and medium customers. there is no relationship between Days and Amount for the small customers, but there is a definite positive relationship between these variables for the medium and large customers.
  • 88. More Analysis We have done the obvious but here is still much more we can do. For example, suppose Spring Mills wants a breakdown of customers who owe at least $500. We first create a new variable called “Large?” next to the original variables that equals 1 for all amounts greater than $500 and equals 0 otherwise. We do this by entering a formula in D6 and copying down.The next slide shows a sample of data and the formula to be used.
  • 89. More Analysis - -continued We enter the formula =IF(C6>$B$3,1,0) in cell D6 and copy it down. We can then use a pivot table to create a count of the number of 1’s in this new variable for each value of the Size variable.
  • 90. Pivot Tables for Counts of Customers Who Owe More than $500
  • 91. More Analysis -- continued We created the pivot table twice, once showing the counts as percentages of each column, and once showing them as percentages of each row. One table shows that 73% of all customers with amounts less than $500 are small customers.The other table shows that 45% of all medium-sized customers owe more than $500. This type of analysis is often referred to as “slicing and dicing the data”. They are based on the same counts but portray them in different ways. Neither way is better; they both provide useful information.
  • 92. More Analysis -- continued Finally, we investigate the amount of interest Spring Mills is losing by the delays in its customers’ payments. We assume that the company can make 12% annual interest on excess cash. Then we can create a Lost variable for each customer size that indicates the amount of interest Spring Mills loses on each customer group.
  • 93. More Analysis The formula entered in cell C10 to calculate Lost1 is =B10*A10*$C$7/365 . This is the amount owed by the number of days owed multiplied by the interest rate, divided by the number of days in a year. Then we copy this formula down and to the columns to calculate Lost2 and Lost3. Then we calculate the sums of these amounts in row 5. The next slide shows a sample of data at this point.
  • 94. More Analysis -- continued Although Spring Mills is losing more per customer from the large customers, it is losing more total from the medium-sized customers - because there are more of them.
  • 95. More Analysis -- continued This is shown graphically by a pie chart of the sums in row 5.
  • 96. Findings If Spring Mills really wants to decrease receivables, it might want to target the medium-sized customer group, from which it is losing the most interest. Or it could target the large customers because they owe the most on average. The most appropriate action depends on the cost and effectiveness of targeting any particular customer group. However, the analysis presented here gives the company a much better picture of what’s currently going on.
  • 98. Background Information The R&P Supermarket is open 24 hours a day, 7 days a week. Lately it has been receiving a lot of complaints from customers about excessive waiting in line for checking out. R&P has decided to investigate this situation by gathering data on arrivals, departures, and line lengths at the checkout stations. It has collected data in half-hour increments for an entire week - 336 observations - starting at 8 am on Monday and ending at 8 am on the following Monday.
  • 99. CHECKOUT.XLS This file includes the data collected on the following variables: InitialWaiting, the number waiting or being checked out at the beginning of a half-hour period Arrivals, the number of arrivals to the checkout stations during a period Departures, the number of checkout stations open during a period Checkers, the number of checkout stations open during a period
  • 100. The data set also includes time variables: Day, day of week StartTime, clock time at the beginning of each half-hour period TimeInterval, a descriptive term for the time of day such as Lunch rush for 11:30 a.m. to 1:30 p.m. Finally the data set includes a calculated variable: EndWaiting, the number waiting or being checked out at the end of a half-hour period This variable for any time period equals InitialWaiting plus Arrivals minus Departures; it also equals InitialWaiting for the next period. CHECKOUT.XLS -- continued
  • 102. Analysis Information The manager of R&P wants to analyze the data to discover any trends, particularly in the pattern of arrivals throughout the day or across the entire week. Also the store currently uses a “seat-of-the-pants” approach to opening and closing checkout stations each half hour. The manager would like to see how well the current approach is working. Of course, she would love to know the “best” strategy for opening and closing checkout stations - but this is beyond her (and our) capabilities at this point
  • 103. Analysis Obviously time plays a crucial role in this data, so a good place to start is to create one or more time series plots. The time series plot of InitialWaiting and Arrivals Variables shown on the next slide shows that: Fridays and Saturdays are the busiest days the time pattern of arrivals is somewhat different - more spread out - during the weekends than during the weekdays there are fairly regular peak arrival periods during the weekdays the number waiting is sometimes as large as 10 or 20, and the largest of these tend to be around peak arrival times
  • 104. Time Series Plot of Initial Waiting and Arrivals Variables
  • 105. Analysis -- continued A similar time series plot shown below shows Arrivals and Departures.
  • 106. Analysis -- continued On this plot it is difficult to separate the two time series; they are practically on top of each other. Perhaps this is not so bad because this tells us that the store is checking out customers approximately as quickly as they are arriving. A somewhat more efficient way to understand the time series behavior is to use pivot tables.
  • 107. Analysis - continued A pivot table for Average InitialWaiting by Hour of Day is generated. This table can be seen on the next slide. To create this table, we: drag the InitialWaiting variable to the Data area, express it as an average drag StartTime variable to the Row area drag the Day variable to the Page area Finally, we use the data in the pivot table to create a time series plot.
  • 108. Average Initial Waiting by Hour of Day
  • 109. Analysis -- continued Note how the variable in the Page area works. Depending on which day we select in the page area the corresponding table and time series plot changes. Similarly, a pivot table can be created, accompanied by a column chart. Average Arrivals by TimeInterval of Day can be analyzed with a pivot table by dragging Arrival to the Data area, expressing it as an average, dragging the TimeInterval variable to the Row area, and dragging the day variable to the Page area. This table is is on the next slide. You can check for the patterns on each day.
  • 110. Average Arrivals by Time Interval of Days
  • 111. More Analysis The manager of R&P is ultimately interested in whether the “right” number of checkout stations are available throughout the day. We can create two scatterplots to provide some evidence. The plots follow this slide. The first of these would be a scatterplot of Checkers versus TotalCustomers. The TotalCustomers variable is calculated as the sum of InitialWaiting and the Arrivals to measure the total amount of work presented to the checkout stations in any half-hour period.
  • 112. Scatterplot of Checkers versus Total Customers
  • 113. Scatterplot of End Waiting versus Checkers
  • 114. More Analysis -- continued From the scatterplot we can see an obvious positive relationship between these two variables. Evidently management is reacting as they should - it is opening more checkout stations when there is more traffic. The second scatterplot shows EndWaiting versus Checkers. There is again a definite upward trend. Periods when more checkout stations are open tend to be associated with periods where more customers still remain in the checkout process.
  • 115. More Analysis - continued Presumably, management is reacting with more open checkout stations in busy periods, but it is not reacting strongly enough. Just fiddling with the numbers in the Checker column will not solve the manager's problems.
  • 116. Conclusions Two problems exist. First, there is a trade-off between the “cost” of having customers wait in line and the cost of paying extra checkout people. This is a difficult tradeoff for any supermarket manager. Second, the number of departures is clearly related to the number of checkout stations open. Therefore it doesn’t make sense to change the numbers in the Checkers column without changing the numbers in the Departures (and hence InitialWaiting and EndWaiting) column in an appropriate way. This is not an easy problem.
  • 118. Background Information The HyTex Company is a direct marketer of stereophonic equipment, personal computers and other electronic products. HyTex advertises entirely by mailing catalogs to its customers, and all of its orders are taken over the telephone. The company spends a great deal of money on its catalog mailings, and it wants to be sure that this is paying off in sales.
  • 119. CATALOGS.XLS This file contains the data that has been collected on 1000 customers at the end of the current year. For each customer it has data on the following variables: Age: coded as 1 for 30 or younger, 2 for 31 to 55, 3 for 56 and older. Gender: coded as 1 for males, 2 for females OwnHome: coded as 1 if customer owns a home, 2 otherwise Married: coded as 1 if customer is currently married, 2 otherwise
  • 120. CATALOGS.XLS -- continued Close: coded as 1 if customers lives reasonably close to a shopping area that sells similar merchandise, 2 otherwise Salary: combined annual salary of customer and spouse (if any) Children: number of children living with customer History: coded as “NA” if customer had no dealings with the company before this year, 1 if customer was a low spending customer last year, 2 if medium-spending, 3 if high-spending Catalogs: Number of catalogs sent to the customer this year AmountSpent: Total amount of purchases made by the customer this year
  • 121. Analysis Desired HyTex wants to analyze these data carefully to understand its customers better. They want to see whether they are sending the catalogs to the right customers. Each customer either receives 6, 12, 18, or 24 catalogs through the mail. Currently which customer receives which amount is not thought out carefully. They want to know if the current distribution of catalogs is effective. Is there room for improvement?
  • 122. Analysis HyTex is obviously interested in the AmountSpent variable. Therefore, it makes sense to create scatterplots of AmountSpent versus selected “explanatory” variables. First we create the scatterplot shown on the next slide showing AmountSpent versus Salary. It is clear that customers with higher salaries tend to spend more, although the variability in amounts spent increases significantly as salary increases.
  • 123. Scatterplot of Amount Spent versus Salary
  • 124. Analysis -- continued Second, we create a scatterplot of AmountSpent versus Catalogs This plot shows that there is some tendency toward higher spending among customers who receive more catalogs. But do the catalogs cause more spending, or are more catalogs sent to customers who would tend to spend more anyway? There is no way to answer this with this data. Next is the scatterplot of AmountSpent versus Children This plot shows the interesting tendency of customers with more children to spend less.
  • 125. Scatterplot of Amount Spent versus Catalogs
  • 126. Scatterplot of Amount Spent versus Children
  • 127. More Analysis Pivot tables and accompanying charts are very useful in this type of situation. We can begin by using this technique to get a better understanding of the demographics of the customers. The first pivot table and chart shows the percentage of an age group who own homes. Using the pivot table we can see how these percentages change for women, unmarried men and so on. Specifically small percentages of the younger people own their own hoe, regardless of marital status or gender.
  • 128. Percent of Home Owners versus Age, Married and Gender
  • 129. More Analysis -- continued The second pivot table shows the percentage of each age group who are married, for any combination of the Gender and OwnHome variables. For example, we can check that the married/unmarried split is quite different for women who don’t own a home than for male home owners. The third pivot table shows the average Salary broken down by Age and Gender, with page variables for OwnHome and Married.
  • 130. Percent of Married versus Age, OwnHome and Gender
  • 131. Average Salary versus Age, Gender, Married and OwnHome
  • 132. More Analysis --continued The shape of the resulting charts is practically the same for any combination of the page variables. However, the heights of the bars change appreciably. For example, salaries are higher for the married home owners than for unmarried customers who are not home owners. Another pivot table and chart break the data down in another way. Each column in the pivot table shows the percentages in the various History categories for a particular number of children. Each of these columns corresponds to one of the bars in the “stacked” bar chart. We have also used Close as a page variable.
  • 133. Percentages in History Categories versus Children and Close
  • 134. More Analysis -- continued Two interesting points emerge from this plot. First, customers with more children tend to be more heavily represented in the low-spending History category. Second, you can check by changing the setting of the Close variable from 1 to 2, the percentage of high spenders among customers who live far from electronic stores is much higher than for those who live close to such stores. The next pivot table provide insight into how HyTex determined its catalog mailing distributions. Each row shows a percentage of a particular History category that were sent 6,12,18 or 24 catalogs.
  • 136. More Analysis -- continued The company’s distribution policy is still unclear but we can see that it definitely sends more catalogs to high spending customers and fewer to low spending customers. Finally the last pivot table shows AmountSpent versus History and Catalogs, with a variety of demographic variables in the page area. There are so many possible combinations so it can be difficult to discover all the existing patterns
  • 137. Average Amount Spent versus History, Catalogs and Demographic Variables
  • 138. More Analysis -- continued One things stands out loud and clear from the graph: the more customers receive, the more they tend to spend. Also, if they were large spenders last year, they tend to be large spenders this year. In a pivot table with this many combinations there will almost certainly be some combinations with no observations.