The five stages of questionnaire design and testing
Conceptualization of well-being
“Do you have a driver’s license?”
“How many employees does this business have?”
“Who is the Prime Minister?”
“Are you aware of these industry support groups?”
“How many times did you go to the theatre in the last 12 months?”
A) Factual Questions
B) Behavioural Questions
“What is your occupation?”
“What is your age in years?”
“What was your main occupation?”
A) Open Questions
B) Closed Questions
“Are you in favour of …?”
C) Opinion Questions
D) Hypothetical Questions
“What would you do if ... ?”
1. Can you do research without data?
2. How can you resolve the problem without supporting
data?
3. How do you convince others that your data are
sufficient to support the solution?
4. Where do you go to find data?
5. Can you have imaginary data in research?
6. Can you use data simulation for research?
Box, G. E. P., Hunter, J. S., and Hunter, W. G., Statistics for Experimenters:
Design, Innovation, and Discovery, 2nd ed. (Wiley-Interscience, 2005).
What is regression?
Regression analysis is a class of statistical models used to
describe, estimate, or predict the relationship between a
dependent variable (the outcome) and one (simple/bivariate
regression) or several (multiple regression) independent
variables (the predictors).
Regression: Introduction
Basic idea:
Use data to identify relationships
among variables and use these
relationships to make predictions.
1. Prediction: how does the outcome change if one or
several predictors vary?
2. Cause analysis: how strong is the influence of the
predictors on the outcome?
3. Time series analysis: how does the outcome change
over time (other predictors held fixed, ceteris paribus)?
Linear regression
•Linear dependence: constant rate of increase of one variable
with respect to another (as opposed to, e.g., diminishing
returns).
•Regression analysis describes the relationship between two
(or more) variables.
•Examples:
– Income and educational level
– Demand for electricity and the weather
– Home sales and interest rates
•Our focus:
–Gain some understanding of the mechanics.
• the regression line
• regression error
– Learn how to interpret and use the results.
– Learn how to set up a regression analysis.
Two main questions:
•Prediction and Forecasting
– Predict home sales for December given the interest rate for this month.
– Use time series data (e.g., sales vs. year) to forecast future performance
(next year sales).
– Predict the selling price of houses in some area.
• Collect data on several houses (# of BR, #BA, sq.ft, lot size, property tax)
and their selling price.
• Can we use this data to predict the selling price of a specific house?
•Quantifying causality
– Determine factors that relate to the variable to be predicted; e.g., predict
growth for the economy in the next quarter: use past history on
quarterly growth, index of leading economic indicators, and others.
– Want to determine advertising expenditure and promotion for the 1999
Ford Explorer.
•Sales over a quarter might be influenced by: ads in print, ads in radio, ads in
TV, and other promotions.
Motivating Example
• Predict the selling prices of houses in the region.
–Intuitively, we should compare the house for which we need a predicted selling
price with houses that have sold recently in the same area, of roughly the same
size, same style etc.
•Idea: Treat it as a multiple sample problem.
•Unfortunately, the list of houses meeting these criteria may be quite small, or there
may not be a house of exactly the same characteristics.
•Alternative approach: Consider the factors that determine the selling price of a house
in this region.
• Collect recent historical data on selling prices, and a number of
characteristics about each house sold (size, age, style, etc.).
–Idea: one sample problem
•To predict the selling price of a house without any particular knowledge of the house,
we use the average selling price of all of the houses in the data set.
–Better idea:
•One of the factors that cause houses in the data set to sell for different amounts of
money is the fact that houses come in various sizes.
•A preliminary model might posit that the average value per square foot of a new
house is $40 and that the average lot sells for $20,000. The predicted selling price of
a house of size X (in square feet) would be: 20,000 + 40X.
•A house of 2,000 square feet would be estimated to sell for 20,000 + 40(2,000) =
$100,000.
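The arithmetic of this preliminary deterministic model is easy to check with a one-line R function (a sketch; the name predict_price is ours, for illustration):

predict_price <- function(sqft) 20000 + 40 * sqft   # $20,000 lot value plus $40 per square foot
predict_price(2000)   # 100000, i.e., $100,000 for a 2,000 sq ft house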
Motivating Example
•Probability Model:
– We know, however, that this is just an approximation, and the selling
price of this particular house of 2,000 square feet is not likely to be
exactly $100,000.
– Prices for houses of this size may actually range from $50,000 to $150,000.
– In other words, the deterministic model is not really suitable. We should
therefore consider a probabilistic model.
•Let Y be the actual selling price of the house. Then
Y = 20,000 + 40x + ,
where  (Greek letter epsilon) represents a random error
term (which might be positive or negative).
– If the error term  is usually small, then we can say the model is a good
one.
– The random term, in theory, accounts for all the variables that are not
part of the model (for instance, lot size, neighborhood, etc.).
– The value of  will vary from sale to sale, even if the house size remains
constant. That is, houses of the exact same size may sell for different
prices.
Regression Model
•The variable we are trying to predict (Y) is called the
dependent (or response) variable.
•The variable x is called the independent (or predictor, or
explanatory) variable.
•Our model assumes that
E(Y | X = x) = β0 + β1x (the “population line”) (1)
The interpretation is as follows:
–When X (house size) is fixed at a level x, we assume the mean of Y
(selling price) to be linear in x, where β0 is the (unknown)
intercept and β1 is the (unknown) slope, the incremental change in Y per
unit change in X.
–β0 and β1 are not known exactly, but are estimated from sample data.
Their estimates are denoted b0 and b1.
•A simple regression model: a model with only one
independent variable.
•A multiple regression model: a model with multiple
independent variables.
Sample: 15 houses from the region.

House Number   Y: Actual Selling Price ($1,000s)   X: House Size (100s of sq ft)
1                89.5                                20.0
2                79.9                                14.8
3                83.1                                20.5
4                56.9                                12.5
5                66.6                                18.0
6                82.5                                14.3
7               126.3                                27.5
8                79.3                                16.5
9               119.9                                24.3
10               87.6                                20.2
11              112.6                                22.0
12              120.8                                19.0
13               78.5                                12.3
14               74.3                                14.0
15               74.8                                16.7
Averages         88.84                               18.17
Least Squares Estimation
•price <- c(89.5,79.9,83.1,56.9,66.6,82.5,126.3,79.3,119.9,87.6,112.6,120.8,78.5,74.3,74.8)
•size <- c(20.0,14.8,20.5,12.5,18.0,14.3,27.5,16.5,24.3,20.2,22.0,19.0,12.3,14.0,16.7)
•plot(size, price, xlab = "House size (100 sq ft)", ylab = "Selling price ($1,000)",
main = "House Size (X) vs Selling Price (Y)")
Assumptions
•These data do not form a perfect line. This is not surprising,
considering that our data are random. In other words, if we
assume equation (1) then our line predicts the mean for any
given level x. However, when we actually take a
measurement (i.e., observe the data), we observe:
Yi = β0 + β1Xi + εi, for i = 1, 2, …, n = 15,
where εi is the random error associated with the ith
observation.
–Since we don't know the true values of β0 and β1, it is clear that we do
not observe the actual errors (εi) precisely either.
•Assumptions about the Error
–E(εi) = 0 for i = 1, 2, …, n.
–SD(εi) = σ, where σ is unknown.
–The errors are independent; that is, the error in the ith observation is
independent of the error observed in the jth observation.
–The εi are normally distributed (with mean 0 and standard deviation σ).
Least Squares Estimation
•Recall β0 and β1 are (unknown) population parameters.
– From the sample data, we will calculate numbers β̂0 and β̂1 that are
estimates of the population parameters.
– How should these numbers be chosen? For any choice of β̂0 and β̂1,
we can write the following prediction equation:
ŷ = β̂0 + β̂1x.
– The “hat” is used to denote a value estimated from the model, as
opposed to one that is actually observed.
– For each house in our sample of 15 we could check to see how well
this equation works at predicting the actual selling prices. Define ei to
be the error associated with the ith observation. That is:
ei = yi − ŷi (actual minus estimated selling price).
These are sometimes called the residuals or simply errors.
•We will pick the values of β̂0 and β̂1 that minimize Σi ei², the
sum of the squares of the residuals. This method is often
called Least Squares Regression.
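A minimal R sketch of this fit, reusing the price and size vectors defined above (the object name fit is ours; lm() performs the least squares minimization):

fit <- lm(price ~ size)   # least squares fit of selling price on house size
coef(fit)                 # estimated intercept and slope (cf. 18.354 and 3.879 on the next slide)
head(residuals(fit))      # the residuals e_i
abline(fit)               # adds the fitted line to the earlier scatter plot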
Using the Equation
•The method of least squares gives an intercept of 18.354
and a slope of 3.879.
–How do we predict the selling price of a house of 1,650 square feet?
• Plug the value 16.50 (1,650 translated to 100s of square feet) into the
regression equation and get predicted selling price = 18.354 + 3.879 × 16.50
= 82.357.
• Translate to a dollar amount, i.e., $82,357. This is the best estimate you have
of the selling price of this house, that is, without any further information
about the house (e.g., neighborhood, number of rooms, lot size, age, etc.).
•Analyzing a Regression
•Estimating the Standard Error
–From the assumptions about the error, the magnitude of σ should be a
good guide to the accuracy of a prediction.
–The number σ is a population parameter, so we cannot know for
certain what its value is.
–We therefore use an estimate s that is provided in the regression
output under the name “standard error of the estimate” or just
“standard error.”
Making Predictions
•The estimate s is calculated by (SSE/(n-2))1/2.
–The reason why we divide by n - 2 and not n - 1 has to do with the
degrees of freedom issue.
–The value of s gives us some idea of the standard deviation of the
errors if the model is used to estimate selling prices. In addition, we
will make use of the normality assumption to help us make
assessments of a prediction.
•Suppose a house occupies 2,000 square feet. How do we
predict the selling price?
–prediction interval: This is used if our goal is to determine a 95%
confidence interval on the actual selling price of the house. A 95%
prediction interval for the actual selling price is given by
(18.354 + 3.879× 20 )  t(n - 2, 0.025)s = 95.94  28.07.
–confidence interval: This is used if our goal is to determine a 95%
confidence interval on the mean selling price of all houses of this size
(2,000 square feet). (E[Y|X = x])
It is 95,940  t(n - 2, 0.025)s/√n = 95.94  7.25 .
–In the above examples use the t distribution with n - 2 degrees of
freedom. If n - 2  30 then the standard normal distribution can be used
instead.
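Assuming the fit object from the earlier sketch, both intervals are available from predict(); note that predict() uses the exact interval formulas, so its confidence interval can differ slightly from the s/√n approximation above:

s <- summary(fit)$sigma                       # standard error of the estimate, (SSE/(n-2))^(1/2)
new <- data.frame(size = 20)                  # a 2,000 sq ft house, in 100s of sq ft
predict(fit, new, interval = "prediction")    # 95% interval for the actual selling price
predict(fit, new, interval = "confidence")    # 95% interval for the mean E[Y | X = 20]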
Making Inferences about Coefficients
•Assessing the accuracy of the model involves determining
whether a particular variable, like house size, has any effect
on the selling price.
–Suppose that when a regression line is drawn it produces a horizontal
line. This means the selling price of the house is unaffected by the size
of the house.
–A horizontal line has a slope of 0, so when no linear relationship exists
between an independent variable and the dependent variable we
should expect β1 = 0.
–But of course, we only observe an estimate of β1, which might merely be
“close” to zero. To systematically determine whether β1 might in fact be
zero, we make inferences about it using our estimate β̂1; specifically,
we do hypothesis tests and build confidence intervals.
•Testing β1, we can test any of the following:
–H0: β1 = 0 versus HA: β1 ≠ 0
–H0: β1 ≥ 0 versus HA: β1 < 0
–H0: β1 ≤ 0 versus HA: β1 > 0
• In each case, the null hypothesis can be reduced to H0: β1 = 0.
The test statistic in each case is T = (β̂1 − 0) / s(β̂1),
where s(β̂1) is the estimated standard error of the slope.
Example
•Can we conclude at the 1% level of significance that the size
of a house is linearly related to its selling price? Test H0: β1 =
0 versus HA: β1 ≠ 0.
–Note this is a two-sided test; we are interested in whether there is any
relationship at all between price and size.
–Calculate T = (3.879 − 0) / 0.794 = 4.88.
–That is, we are 4.88 standard deviations from 0. So at the 1% level
(corresponding to thresholds ±t(13, 0.005) = ±3.012), we reject H0.
–There is sufficient evidence to conclude that house size does linearly
affect selling price.
•To get a p-value on this we would need to look up 4.88
in the t-table.
–It is 0.00024, or 0.024%; very small indeed.
•A 95% confidence interval for β1 is given by
β̂1 ± t(n − 2, 0.025)·s(β̂1).
–For this example: 3.879 ± (2.160)(0.794) = 3.879 ± 1.715.
–Using the 15 data points, we are 95% confident that every extra square
foot increases the price of the house by anywhere from $21.64 to
$55.94.
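Assuming the same fit object, these calculations are a few lines of R (a sketch):

summary(fit)$coefficients                  # slope row: estimate 3.879, std. error 0.794, t-stat, p-value
2 * pt(4.88, df = 13, lower.tail = FALSE)  # two-sided p-value for T = 4.88 with 13 df
confint(fit, level = 0.95)                 # 95% confidence intervals for the intercept and slope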
Method III: Measuring the Strength of the
Linear Relationship
•Consider the following equation:
Yi − Ȳ = (Ŷi − Ȳ) + ei.
–Squaring both sides and summing over all data points, and after a little
algebra, we get:
Σi (Yi − Ȳ)² = Σi (Ŷi − Ȳ)² + Σi ei², which we usually rewrite as:
SST = SSR + SSE, (2)
where SST = Σi (Yi − Ȳ)², SSR = Σi (Ŷi − Ȳ)², and SSE = Σi ei².
–Interpretation:
•SST stands for the “total sum of squares” - this is essentially the total
variation in the data set, i.e., the total variation of selling prices.
•SSR stands for “sum of squares due to regression” - this is the squared
variation around the mean of the estimated selling prices. This is sometimes
called the total variation explained by the regression.
•SSE stands for “sum of squares due to error” - this is simply the sum of the
squared residuals, and it is the variation in the Y variable that remains
unexplained after taking into account the variable X.
–The interpretation of equation (2) is that the total variation in Y (SST) is
made up of two parts: the total variation explained by the regression
(SSR) and the remaining unexplained variation (SSE).
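The decomposition (2) can be verified numerically (a sketch assuming the earlier fit object; the variable names are ours):

SST <- sum((price - mean(price))^2)          # total variation in selling prices
SSR <- sum((fitted(fit) - mean(price))^2)    # variation explained by the regression
SSE <- sum(residuals(fit)^2)                 # unexplained variation (squared residuals)
all.equal(SST, SSR + SSE)                    # TRUE, up to floating-point error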
Regression Statistics
•Define R² = SSR/SST = 1 − SSE/SST
–The fraction of the total variation explained by the regression.
–R² is a measure of the explanatory power of the model.
–Multiple-R = (R²)^(1/2) (in the one-variable case it equals |rXY|).
•By the definition of R², adding extraneous
explanatory variables will artificially inflate the R².
–We must be careful in interpreting this number.
–Introducing extra variables can lead to spurious results and can
interfere with the proper estimation of slopes for the important
variables.
•In order to penalize an excess of variables, we consider the
adjusted R², which is
adjusted R² = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)],
where n is the number of data points and k is the number of
explanatory variables.
–The adjusted R² thus divides the numerator and denominator by their
degrees of freedom.
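Both statistics are reported by summary(); computing them by hand from the sums of squares above makes the degrees-of-freedom penalty explicit (a sketch):

n <- length(price); k <- 1                            # 15 observations, one explanatory variable
R2 <- SSR / SST                                       # = 1 - SSE/SST
adjR2 <- 1 - (SSE / (n - k - 1)) / (SST / (n - 1))
c(R2, summary(fit)$r.squared)                         # should agree
c(adjR2, summary(fit)$adj.r.squared)                  # should agree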
How to determine the value of used cars that
customers trade in when purchasing new cars?
• Car dealers across North America use the “Red Book” to help them
determine the value of used cars that their customers trade in when
purchasing new cars.
–The book, which is published monthly, lists average trade-in values for all
basic models of North American, Japanese and European cars.
–These averages are determined on the basis of the amounts paid at recent
used-car auctions.
–The book indicates alternative values of each car model according to its
condition and optional features, but it does not tell dealers how the
odometer reading affects the trade-in value.
• Question: In an experiment to determine whether the odometer
reading should be included in the Red Book, an interested buyer of
used cars randomly selects ten 3-year-old cars of the same make,
condition, and optional features.
–The trade-in value and mileage for each car are shown in the following table.
Data
Odometer Reading (1,000s of miles):  59  92  61  72  52  67  88  62  95  83
Trade-in Value ($100s):              37  41  43  39  41  39  35  40  29  33
• Run the regression, with Trade-in Value as the dependent variable
(Y) and Odometer Reading as the independent variable (X). The
output appears on the following page.
•Regression Statistics
– Multiple R = 0.893, R2 = 0.798, Adjusted R2 = 0.773 Standard Error =
2.178
– Analysis of Variance
               df      SS      MS      F    Significance F
  Regression    1   150.14  150.14  31.64        0.000
  Residual      8    37.96    4.74
  Total         9   188.10
– Testing
               Coeff.     Stnd Error   t-Stat   P-value
  Intercept    56.205       3.535       15.90    0.000
  x            -0.26682     0.04743     -5.63    0.000
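A minimal R sketch of this regression from the data above (the vector and object names are ours):

odometer <- c(59, 92, 61, 72, 52, 67, 88, 62, 95, 83)   # 1,000s of miles
value    <- c(37, 41, 43, 39, 41, 39, 35, 40, 29, 33)   # trade-in value, $100s
fit2 <- lm(value ~ odometer)   # least squares fit
summary(fit2)                  # coefficients, standard errors, t-stats, R-squared
anova(fit2)                    # regression and residual sums of squares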
F and F-significance
•F is a test statistic for whether the estimated model as a
whole is meaningful, i.e., statistically significant.
–F = MSR/MSE.
–A large F, or a small p-value (the F-significance), implies that the model
is significant.
–It is unusual not to reject this null hypothesis.
Questions
• What does the regression line tell us about the relationship between
the two variables?
• Can we conclude at the 5% significance level that, for all cars of the
type described in the experiment, higher mileage results in a lower
trade-in value?
• Predict with 95% confidence the trade-in value of such a car that
has been driven 60,000 miles.
• A large national courier company has a policy of selling its cars
when the odometer reading reaches 75,000 miles. The company is
about to sell a large number of 3-year-old cars, each equipped with
the same optional features and in the same condition as the 10 cars
described in the experiment. The company president would like to
know the cars' mean trade-in price. Determine the 95% confidence
interval estimate of the expected value of all cars that have been
driven 75,000 miles.
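The last two questions can be answered with predict(), assuming the fit2 object from the sketch above:

predict(fit2, data.frame(odometer = 60), interval = "prediction")   # 95% PI for one car at 60,000 miles
predict(fit2, data.frame(odometer = 75), interval = "confidence")   # 95% CI for the mean at 75,000 miles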
Salary-budget Example
•A large corporation is concerned about maintaining parity in
salary levels across different divisions.
–As a rough guide, it determines that managers responsible for
comparable budgets in different divisions should have comparable
compensation.
•Data Analysis: The following is a list of salary levels for 20
managers and the sizes of the budgets they manage.
–salary<- c(59.0,67.4,50.4,83.2,105.6,86.0,74.4,52.2,82.6,59.0,44.8,111.4,122.4,
82.6,57.0,70.8,54.6,111.0,86.2,79.0)
–budget<- c(3.5,5.0,2.5,6.0,7.5,4.5,6.0,4.0,4.5,5.0,2.5,12.5,9.0,7.5,6.0, 5.0,3.0,
8.5, 7.5, 6.5)
–Salary Y ($1000s)
–Budget X ($100,000s)
Salary-budget Example
• Want to fit a straight line to this data.
– The slope of this line gives the marginal increase in salary with respect to
increase in budget responsibility.
– The regression equation is SALARY = 31.9 + 7.73 BUDGET
– Each additional $100,000 of budget responsibility translates to an expected
additional salary of $7,730.
– If we wanted to know the average salary corresponding to a budget of 6.0, we
get a salary of 31.9 + 7.73(6.0) = 78.28.
• Why is the least squares criterion the correct principle to follow?
• Assumptions Underlying Least Squares
– The errors ε1, …, εn are independent of the values of X1, …, Xn.
– The errors have expected value zero; i.e., E[εi] = 0.
– All the errors have the same variance: Var[εi] = σ², for all i = 1, …, n.
– The errors are uncorrelated; i.e., Corr[εi, εj] = 0 if i ≠ j.
• The first two assumptions imply that E[Y | X = x] = β0 + β1x.
– Do we necessarily believe that the variability in salary levels among
managers with large budgets is the same as the variability among managers
with small budgets?
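A minimal sketch reusing the salary and budget vectors above (the object name sfit is ours); the coefficients should match the quoted equation:

sfit <- lm(salary ~ budget)
coef(sfit)                                 # approx. 31.9 (intercept) and 7.73 (slope)
predict(sfit, data.frame(budget = 6.0))    # approx. 78.28, the mean salary at a budget of 6.0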
How do we evaluate and use the regression line?
• Evaluate the explanatory power of a model.
– Without using X, how do we predict Y?
– Determine how much of the variability in Y values is explained by X.
• Measure variability using sums of squared quantities.
• The ANOVA table.
– ANOVA is short for analysis of variance.
– This table breaks down the total variability into the explained and
unexplained parts.
– Total SS (9535.8) measures the total variability in the salary levels.
• Without using x, we will use sample mean to do prediction.
– The Regression SS (6884.7) is the explained variation.
• It measures how much variability is explained by differences in budgets.
– Error SS (2651.1) is the unexplained variation.
• This reflects differences in salary levels that cannot be attributed to differences in
budget responsibilities.
– The explained and unexplained variation sum to the Total SS.
• R-squared: R² = SSR/SST = 6884.7/9535.8 = 72.2%
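Assuming the sfit object above, the ANOVA breakdown and R-squared come straight from R (anova() labels the explained row by the predictor, here budget, and omits the Total row):

anova(sfit)               # Sum Sq: budget approx. 6884.7, Residuals approx. 2651.1
summary(sfit)$r.squared   # approx. 0.722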