Missing data and non response pdf

Anuj Vijay Bhatia
FPRM 14
Institute of Rural Management Anand
NON-RESPONSE ERROR: HOW TO HANDLE IT?
Research Methodology

• The respondent did not reply to the mail, did not find time for the interview, or could not be contacted. There can be many such reasons for non-response.
• A high rate of non-response is serious.
• The research may lose:
  - Credibility
  - Acceptability
  - Accuracy and professional soundness
• The methodology used should be described completely.
• It is the researcher's responsibility to establish external validity.
• An appropriate sample size and an acceptable response rate must be achieved.
NON RESPONSE ERROR

• Non-response error exists to the extent that subjects included in the sample fail to provide usable responses.
• Research marked by high non-response loses validity and reliability.
• Many research articles:
  - Do not mention non-response as a threat to external validity.
  - Do not attempt to control for non-response error.
  - Do not refer to the literature on handling non-response.
• Non-response limits the researcher's ability to generalize.
NON RESPONSE ERROR

• In survey research, the ability to generalize is critical.
• There is a risk that non-respondents will be systematically different from respondents.
• The response rate is higher (often 100%) when purposive or convenience sampling is used.
• When probability sampling is used, however, response rates tend to be low.
• The ability to generalize is limited when purposive or convenience sampling is used.
• In that case the threat to validity comes not from the response rate but from non-representative sampling procedures.
• To check external validity, ask: would your results be the same if a 100% response rate had been achieved?
SAMPLING PROCEDURES AND NON-RESPONSE

• Suppose the population is divided into two strata: the respondents (r) and the non-respondents, whose data are missing (m). Suppose we want to determine $\bar{Y}$, the overall population mean.
• $\bar{Y} = W_r \bar{Y}_r + W_m \bar{Y}_m$
• $\bar{Y}_r$ and $\bar{Y}_m$ are the means of the respondents and the non-respondents respectively. $W_r$ and $W_m$ are the weights (the proportion of the population in each stratum, so $W_r + W_m = 1$).
• If the survey fails to collect data from non-respondents, its estimate will equal $\bar{Y}_r$.
• The bias is the difference between $\bar{Y}_r$ and $\bar{Y}$:
  $\bar{Y}_r - \bar{Y} = \bar{Y}_r - (W_r \bar{Y}_r + W_m \bar{Y}_m)$
  $= \bar{Y}_r (1 - W_r) - W_m \bar{Y}_m$
  $= W_m (\bar{Y}_r - \bar{Y}_m)$
A SIMPLE LOGIC
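The bias formula on this slide can be checked with a small worked example. The sketch below uses made-up figures (a 60% response rate and stratum means of 50 and 40); the function name `nonresponse_bias` is hypothetical and only illustrates the arithmetic.

```python
def nonresponse_bias(w_r: float, mean_r: float, mean_m: float) -> float:
    """Bias of the respondent mean: W_m * (Ybar_r - Ybar_m)."""
    w_m = 1.0 - w_r  # the two stratum weights sum to one
    return w_m * (mean_r - mean_m)


# Hypothetical figures, chosen only for illustration.
w_r = 0.6        # share of the population that responds
mean_r = 50.0    # mean among respondents
mean_m = 40.0    # mean among non-respondents (unknown in practice)

population_mean = w_r * mean_r + (1.0 - w_r) * mean_m
bias = nonresponse_bias(w_r, mean_r, mean_m)

print(f"population mean = {population_mean:.1f}")
print(f"respondent mean = {mean_r:.1f}")
print(f"bias            = {bias:.1f}")
```

The bias shrinks either when the non-response share falls or when respondents and non-respondents resemble each other, which is why a low response rate alone does not establish bias.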

• Controlling non-response begins with the design and implementation of the study.
• Appropriate sampling protocols and procedures should be used to maximize participation.
• Ensure that the response rate is high enough to conclude that non-response is not a threat to external validity.
• If required, use additional procedures to establish that non-response is not a threat to external validity.
CONTROLLING NON-RESPONSE ERROR

Methods for Handling Non-Response
1. Comparison of Early to Late Respondents
2. Using “Days to Respond” as a Regression Variable
3. Compare Respondents to Non-Respondents
4. Compare Respondents on Characteristics known a
priori
5. Ignore Non-Response as a Threat to External
Validity
RECOMMENDATIONS FOR HANDLING
NON-RESPONSE

Method 1: Comparison of Early to Late Respondents
• Extrapolation based on statistical inference.
• Operationally define ‘late respondents’, for example as the last wave of respondents.
• Compare early and late respondents on the key variables of interest.
• If there is no difference, the results can be generalized to the larger population.
METHODS FOR HANDLING
NON-RESPONSE
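One way Method 1 might be carried out is a two-sample t-test on a key variable. The sketch below is an assumption about implementation, not the procedure prescribed by the slides: the data are simulated, the 0.05 threshold is a conventional choice, and Welch's version of the test is used so that equal variances need not be assumed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated scores on one key variable (hypothetical data).
early = rng.normal(loc=3.8, scale=0.6, size=80)  # earlier waves
late = rng.normal(loc=3.8, scale=0.6, size=30)   # last wave = "late respondents"

# Welch's t-test: compares the two means without assuming equal variances.
t_stat, p_value = stats.ttest_ind(early, late, equal_var=False)

# A non-significant difference is read as support for generalizing the results.
no_detectable_difference = p_value >= 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```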

Method 2: Using “Days to Respond” as a Regression
Variable
• “Days to respond” is coded as a continuous variable and used as the independent variable (IV) in a regression equation.
• The primary variables of interest are regressed on “Days to Respond”.
• If the relationship is not statistically significant, assume that respondents are not different from non-respondents.
NON-RESPONSE
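Method 2 can be sketched as a simple linear regression of a variable of interest on the number of days each respondent took to reply. The data below are simulated and the variable names are hypothetical; `scipy.stats.linregress` reports the p-value for the slope.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated data (hypothetical): response delay in days and one outcome score.
days_to_respond = rng.integers(1, 31, size=120).astype(float)
outcome = rng.normal(loc=4.0, scale=0.7, size=120)

# Regress the variable of interest on "days to respond".
result = stats.linregress(days_to_respond, outcome)
slope, p_value = result.slope, result.pvalue

# A non-significant slope is taken as a sign that slower respondents
# (a stand-in for non-respondents) do not differ on this variable.
print(f"slope = {slope:.4f}, p = {p_value:.3f}")
```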

Method 3: Compare Respondents to Non-Respondents
• Compute differences by sampling non-respondents and working extra diligently to get their responses.
• Responses should be obtained from at least 20% of the non-respondents.
• If fewer than 20% respond, Method 1 or 2 should be used, combining the results.
NON-RESPONSE

Method 4: Compare Respondents on Characteristics
known a priori
• Compare respondents to the population on characteristics known in advance.
• Describe the similarities and differences.
Method 5: Ignore Non-Response as a Threat to External
Validity
• If the above methods are not feasible, you may choose to ignore non-response.
NON-RESPONSE
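For Method 4, one possible check is a chi-square goodness-of-fit test of the respondents' profile against population shares known in advance. The categories and counts below are invented for illustration, and the choice of test is an assumption rather than something the slides specify.

```python
from scipy import stats

# Hypothetical population shares known a priori (e.g. from a sampling frame).
population_share = {"small": 0.5, "medium": 0.3, "large": 0.2}
# Hypothetical counts among the respondents.
respondent_counts = {"small": 95, "medium": 65, "large": 40}

n = sum(respondent_counts.values())
observed = [respondent_counts[k] for k in population_share]
expected = [population_share[k] * n for k in population_share]

# Goodness-of-fit: do respondents mirror the known population distribution?
chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```

A large p-value here only says that respondents resemble the population on this one characteristic; the similarities and differences should still be described.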

MISSING DATA
IN QUANTITATIVE RESEARCH
Research Methodology

• What is certain in life?
  - Death
  - Taxes
• What is certain in research?
  - Measurement error
  - Missing data
• Missing data can be:
  - Due to preventable errors, mistakes, or lack of foresight by the researcher
  - Due to problems outside the control of the researcher
  - Deliberate, intended, or planned by the researcher to reduce cost or respondent burden
  - Due to differential applicability of some items to subsets of respondents, etc.
FOOD FOR THOUGHT

• Non-response vs. missing data
• Missing data: valid values on one or more variables are not available for analysis.
• The researcher's primary concern is to identify the patterns and relationships underlying the missing data.
• We need to understand the process leading to missing data in order to take the appropriate course of action.
• Common in social research
• More acute in experiments and surveys
• The best approach is to avoid it through planning and conscientious data collection.
• It is not uncommon to have some level of missing data.
MISSING DATA

• Lost data
  - Reduces statistical power
  - Meaningfully diminishes sample size
• Biased parameter estimates
  - Correlations biased downwards
  - Predictor scores affected
  - Restricted variance
  - Central tendency biased
PRIMARY PROBLEMS

• Simple Techniques
  - Listwise Deletion
  - Pairwise Deletion
  - Mean Substitution
  - Regression Imputation
  - Hot-Deck Imputation
• Maximum Likelihood and Related Methods
  - Maximum Likelihood
  - Expectation Maximization
• Repeated Measures and Time Series Designs
TECHNIQUES TO DEAL WITH
MISSING DATA

• Eliminates all cases with missing data on any predictor or criterion.
• Sacrifices a large amount of data.
• Decreases statistical power.
• May introduce bias in parameter estimates.
• Default option in many statistical packages.
LISTWISE DELETION
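A minimal sketch of listwise deletion with pandas, using a small made-up data set. Each variable is missing only one value out of six, yet half of the cases are lost, which illustrates how much data the method can sacrifice.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two predictors and one criterion, each with one gap.
df = pd.DataFrame({
    "x1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "x2": [2.0, np.nan, 3.0, 5.0, 4.0, 7.0],
    "y":  [1.5, 2.5, 3.0, np.nan, 5.0, 6.5],
})

# Listwise deletion: keep only rows with no missing value on any variable.
complete = df.dropna()
rows_lost = len(df) - len(complete)

print(complete)
print(f"cases lost: {rows_lost} of {len(df)}")
```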

• Excludes a case only from those statistics that “need” the missing value.
• Preserves a great deal more information than listwise deletion.
• Interpretation becomes difficult.
• May lead to mathematically inconsistent correlations.
PAIRWISE DELETION
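Pairwise deletion can be sketched with pandas, whose `DataFrame.corr` computes each coefficient from the rows where both variables are observed. The data are made up; the count matrix shows that every coefficient may rest on a different subset of cases, which is the source of the interpretation problems noted above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with scattered gaps.
df = pd.DataFrame({
    "x1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "x2": [2.0, np.nan, 3.0, 5.0, 4.0, 7.0],
    "y":  [1.5, 2.5, 3.0, np.nan, 5.0, 6.5],
})

# Each correlation uses all rows where that particular pair is observed.
pairwise_corr = df.corr()

# Number of cases behind each coefficient (diagonal = observed count per variable).
observed = df.notna().astype(int)
pairwise_n = observed.T @ observed

print(pairwise_corr.round(3))
print(pairwise_n)
```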

• Uses the variable's mean in place of each missing value.
• Allows the rest of the individual's data to be used.
• Preserves data.
• Easy to use.
• Attenuates variance and covariance estimates.
• Useful when correlations between variables are low and less than 10% of the data are missing.
MEAN SUBSTITUTION
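The attenuation of variance under mean substitution is easy to see on a toy series. The values below are made up; the imputed points sit exactly at the mean, so they add cases without adding spread.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with two missing values.
scores = pd.Series([2.0, 4.0, np.nan, 6.0, 8.0, np.nan])

# Mean substitution: fill every gap with the mean of the observed values.
filled = scores.fillna(scores.mean())

var_before = scores.var()  # sample variance of the 4 observed values
var_after = filled.var()   # sample variance after imputation (6 values)

print(f"variance before: {var_before:.3f}")
print(f"variance after:  {var_after:.3f}")
```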

• Estimates missing data from the other variables in the data set.
• Advantages:
  - Preserves data
  - Better than listwise and pairwise deletion
  - Preserves the deviation from the mean
  - Does not attenuate correlations the way mean substitution does
• Variants:
  - Simple regression strategy
    - Only one iteration
    - Estimate the relationships among variables, then estimate the missing data
  - Stepwise/iterative regression
    - Isolate a few key variables and prepare a correlation matrix.
    - Estimate the regression equation and predict the missing values.
REGRESSION IMPUTATION
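The simple (single-iteration) regression strategy can be sketched as follows, assuming one fully observed predictor `x` and one variable `y` with gaps. The data are invented, and `np.polyfit` stands in for whatever regression routine is actually used.

```python
import numpy as np

# Hypothetical data: x is complete, y has two missing values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, np.nan, 8.2, np.nan, 12.1])

observed = ~np.isnan(y)

# Step 1: estimate the relationship from the complete pairs.
slope, intercept = np.polyfit(x[observed], y[observed], deg=1)

# Step 2: predict the missing values from the fitted line.
y_imputed = y.copy()
y_imputed[~observed] = intercept + slope * x[~observed]

print(y_imputed.round(2))
```

Because every imputed value lies exactly on the regression line, this version tends to understate residual variability; that is a known limitation of deterministic regression imputation.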

• Replace a missing value with the actual score from a similar case in the current data set.
• Hot-deck? What is so hot about it?
• What is cold-deck, then?
• Missing values are replaced with a reasonable estimate from a similar individual.
• Accurate: real values are imputed.
• May not distort distributions.
• Helpful when data are missing in patterns.
• Little literature backs the accuracy claim.
• Problematic when there is a large number of classification variables.
• Categorizing variables sacrifices information.
• Estimating standard errors is difficult.
HOT-DECK IMPUTATION
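A minimal hot-deck sketch, assuming that "similar case" means a case in the same class of a single classification variable (here a hypothetical `region`), with the donor drawn at random within that class. Other donor-selection rules exist; this is only one simple variant.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical data: income is missing for one case in each region.
df = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south", "south"],
    "income": [30.0, np.nan, 34.0, 52.0, 55.0, np.nan],
})


def hot_deck(group: pd.Series) -> pd.Series:
    """Fill gaps with values drawn at random from donors in the same class."""
    donors = group.dropna().to_numpy()
    filled = group.copy()
    missing = filled.isna()
    if len(donors) > 0 and missing.any():
        filled[missing] = rng.choice(donors, size=int(missing.sum()))
    return filled


# Impute within each class so that donors resemble the recipient.
df["income_imputed"] = df.groupby("region")["income"].transform(hot_deck)
print(df)
```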

• Assumption: the observed data are a sample drawn from a multivariate normal distribution.
• Parameters are estimated from the available data, and the missing scores are then estimated from those parameters.
• The missing values are predicted using the conditional distribution of the variables on which data are available.
• ML provides explicit modeling of the imputation process that is open to scientific analysis and critique.
• More accurate than listwise deletion and better than ad hoc approaches like mean substitution.
• However, the differences may be small, and the distributional assumptions of this method are relatively strict.
MAXIMUM LIKELIHOOD

• Uses the Expectation Maximization (EM) algorithm.
• Iterates through the process of estimating the missing data and then the parameters.
• The first iteration estimates the missing data and then estimates the parameters by the ML method.
• The second iteration re-estimates the missing data from the new parameter estimates and then recalculates the parameter estimates.
• The process continues until the parameter estimates converge.
• Produces less biased, more accurate estimates.
• Open to scientific analysis and critique.
• Lengthy and complex.
EXPECTATION MAXIMIZATION
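The iteration described on this slide can be sketched for multivariate normal data. The function `em_mvn` below is a hypothetical, simplified implementation written for illustration (it is not a library routine): the E-step fills each gap with its conditional expectation given the observed values, and the M-step re-estimates the mean and covariance, adding the conditional covariance of the imputed parts.

```python
import numpy as np


def em_mvn(X, n_iter=200, tol=1e-8):
    """EM estimates of mean and covariance for normal data containing NaNs."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)                # starting values from observed data
    sigma = np.diag(np.nanvar(X, axis=0))
    filled = np.where(miss, mu, X)
    for _ in range(n_iter):
        filled = np.where(miss, 0.0, X)
        extra = np.zeros((p, p))              # uncertainty of the imputed parts
        for i in range(n):
            m, o = miss[i], ~miss[i]
            if not m.any():
                continue
            if not o.any():                   # nothing observed for this case
                filled[i] = mu
                extra += sigma
                continue
            # E-step: conditional mean of the missing part given the observed part.
            coef = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
            filled[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
            extra[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ sigma[np.ix_(o, m)]
        # M-step: update the parameters from the completed data.
        new_mu = filled.mean(axis=0)
        centered = filled - new_mu
        new_sigma = (centered.T @ centered + extra) / n
        change = max(np.abs(new_mu - mu).max(), np.abs(new_sigma - sigma).max())
        mu, sigma = new_mu, new_sigma
        if change < tol:                      # parameter estimates have converged
            break
    return mu, sigma, filled


# Simulated example (hypothetical): two correlated variables, gaps in the second.
rng = np.random.default_rng(0)
data = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=500)
data[rng.random(500) < 0.3, 1] = np.nan
mu_hat, sigma_hat, completed = em_mvn(data)
print(mu_hat.round(3))
print(sigma_hat.round(3))
```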

• The problem of missing data is more severe.
• Listwise deletion loses more data because of the repeated measures.
• Additional data are collected on the same measures at different times.
• This gives an opportunity to use strongly correlated variables to impute missing data.
• Linear regression and the subject mean can be used to predict missing values, but the result may be biased.
• Interpolation and extrapolation can produce relatively unbiased estimates.
REPEATED MEASURES AND TIME SERIES
DESIGN
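For one subject measured at several waves, interpolation can be sketched with pandas. The scores are made up, and linear interpolation between equally spaced waves is an assumption; other schemes may suit a given design better.

```python
import numpy as np
import pandas as pd

# Hypothetical scores for one subject over six equally spaced waves.
waves = pd.Series(
    [10.0, np.nan, 14.0, np.nan, np.nan, 20.0],
    index=[1, 2, 3, 4, 5, 6],
    name="score",
)

# Linear interpolation fills each gap from the neighbouring observed waves.
interpolated = waves.interpolate(method="linear")
print(interpolated)
```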

• The data can be missing at three levels:
1. Item-level missingness
2. Construct-level missingness
3. Person-level missingness
LEVELS OF MISSINGNESS
(Adopted from: Newman, D. A. (2014). Missing Data: Five Practical Guidelines. Sage Publications.)

Data can be missing randomly or systematically.
• Random missingness:
  - Missing Completely at Random (MCAR)
• Systematic missingness:
  - Missing at Random (MAR)
  - Missing Not at Random (MNAR)
MECHANISMS OF MISSING DATA

• MCAR (Missing Completely at Random)
  - The probability that a variable value is missing depends neither on the observed data values nor on the missing data values.
  - P(missing | complete data) = P(missing)
• MAR (Missing at Random)
  - The probability that a variable value is missing depends partly on other data that are observed in the dataset, but not on any of the values that are missing.
  - P(missing | complete data) = P(missing | observed data)
• MNAR (Missing Not at Random)
  - The probability that a variable value is missing depends on the missing data values themselves.
  - P(missing | complete data) ≠ P(missing | observed data)
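The three mechanisms can be illustrated by simulation. In the sketch below (simulated data; the logistic form and its coefficients are arbitrary assumptions) `x` is always observed and `y` is subject to missingness. Only the way the missingness probability is built differs between the three cases.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

x = rng.normal(size=n)                        # always observed
y = 0.6 * x + rng.normal(scale=0.8, size=n)   # may be missing


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# MCAR: the same probability for everyone, unrelated to x or y.
miss_mcar = rng.random(n) < 0.3
# MAR: probability depends only on the observed variable x.
miss_mar = rng.random(n) < sigmoid(2.0 * x - 1.0)
# MNAR: probability depends on the value of y that goes missing.
miss_mnar = rng.random(n) < sigmoid(2.0 * y - 1.0)

# Mean of y among the cases that remain observed under each mechanism.
means = {
    "full data": y.mean(),
    "MCAR": y[~miss_mcar].mean(),
    "MAR": y[~miss_mar].mean(),
    "MNAR": y[~miss_mnar].mean(),
}
for name, value in means.items():
    print(f"{name:9s} mean of observed y = {value:.3f}")
```

Under MCAR the observed mean stays close to the full-data mean, while under MAR and MNAR the unadjusted observed mean drifts away from it. The difference is that under MAR the drift can in principle be corrected using `x`, whereas under MNAR it cannot be corrected from the observed data alone.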

BIAS AND INACCURATE STANDARD
ERRORS

CHOOSING MISSING DATA TREATMENTS

STEP 1: DETERMINE THE TYPE OF MISSING DATA
• Is it under the control of the researcher?
• Is it ignorable?
• Ignorable missing data
  - Expected
  - Remedies not needed
  - Allowances for missing data are inherent in the technique
  - Missing data are operating at random
• Non-ignorable missing data
  - Known to the researcher: some remedies apply if the process is random
  - Unknown missing data processes: less easy to identify, but remedies are available
  - Whether known or unknown: proceed to the next step
A FOUR STEP PROCESS FOR IDENTIFYING
MISSING DATA AND APPLYING REMEDIES

STEP 2: DETERMINE THE EXTENT OF MISSING DATA
• Determine the extent of missing data:
  - Examine patterns for individual variables, individual cases, and overall.
  - Is it low enough not to affect the results?
  - Is it random?
• If sufficiently low: apply any remedy.
• If not low: determine the randomness before applying a remedy.
• Assessing the extent and pattern of missing data:
  - Tabulate:
    - The number of cases with missing data for each variable
    - The percentage of variables with missing data in each case
  - Look for non-random patterns.
  - Also determine the number of cases with no missing data (100% complete).
• Is the amount of missing data high enough to create a bias? (Rule of Thumb 1)
• Can deletion be used? (Rule of Thumb 2)
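The tabulations listed in this step can be produced with a few lines of pandas. The data frame is made up; the three summaries correspond to missing data per variable, missing data per case, and the number of fully complete cases.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with scattered gaps.
df = pd.DataFrame({
    "age":    [25, 31, np.nan, 40, 52],
    "income": [np.nan, 48.0, 51.0, np.nan, 75.0],
    "score":  [3.2, 4.1, 3.8, np.nan, 4.5],
})

# Per variable: number and percentage of cases with missing data.
per_variable = df.isna().sum().to_frame("n_missing")
per_variable["pct_missing"] = 100 * df.isna().mean()

# Per case: percentage of variables with missing data.
per_case_pct = 100 * df.isna().mean(axis=1)

# Cases with no missing data at all (100% complete).
n_complete_cases = int(df.notna().all(axis=1).sum())

print(per_variable)
print(per_case_pct.round(1))
print(f"complete cases: {n_complete_cases} of {len(df)}")
```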

• Missing data under 10% can generally be ignored when they occur in a random fashion.
• The number of cases with no missing data should be sufficient for the selected analysis technique if replacement values will not be substituted (imputed) for the missing data.
RULE OF THUMB 1
HOW MUCH MISSING DATA IS TOO MUCH?

• Variables with as little as 15% missing data are candidates for deletion.
• Higher levels of missingness, such as 20-30%, can often be remedied.
• Deletion of a large amount of data should be justifiable.
• Cases with missing data on dependent variables are typically deleted to avoid any artificial increase in their relationship with the independent variables.
• When deleting a variable, ensure that a highly correlated variable is available to represent the intent of the original variable.
• Always perform the analysis both with and without the deleted cases or variables to identify any marked differences.
RULE OF THUMB 2
DELETION BASED ON MISSING DATA

STEP 3: DIAGNOSE THE RANDOMNESS OF THE
MISSING DATA PROCESSES.
• The degree of randomness determines the appropriate remedy.
Level of Randomness
• Random: MCAR
  - Observed values of Y are truly a random sample of all Y values.
  - There is no underlying process that tends to bias the observed data.
  - Cases with missing data are indistinguishable from cases with complete data.
• Non-random: MAR
  - Whether Y is missing depends on X but not on Y.
  - Observed values of Y represent a random sample of Y for each value of X.
  - The observed data cannot be generalized to the population.
Diagnostic Tests for Level of Randomness
• Form two groups, cases with and without missing data, and compare them with a t-test.
• Overall test of randomness for MCAR.
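The two-group diagnostic can be sketched as follows: split the cases by whether `y` is missing and compare the groups on another variable `x`. The data are simulated so that missingness in `y` depends on `x`; the variable names and the missingness rule are assumptions made for illustration, and the overall MCAR test is not shown.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(3)
n = 300

# Simulated data (hypothetical): y is more likely to be missing when x is high.
df = pd.DataFrame({
    "x": rng.normal(50, 10, n),
    "y": rng.normal(100, 15, n),
})
prob_missing = np.where(df["x"] > 55, 0.6, 0.1)
df.loc[rng.random(n) < prob_missing, "y"] = np.nan

# Group 1: cases missing y. Group 2: cases with y observed. Compare them on x.
y_missing = df["y"].isna()
t_stat, p_value = stats.ttest_ind(
    df.loc[y_missing, "x"], df.loc[~y_missing, "x"], equal_var=False
)

# A significant difference suggests the data are not MCAR.
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```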

STEP 4: SELECT THE IMPUTATION METHOD

Under 10%
• Any imputation method can be applied.
10% to 20%
• For MCAR: hot-deck case substitution and regression imputation
• For MAR: model-based methods
Over 20%
• For MCAR: regression method
• For MAR: model-based methods
RULE OF THUMB 3
IMPUTATION OF MISSING DATA

1. Dooley, L. M., & Lindner, J. R. (2003). The handling of nonresponse error. Human Resource Development Quarterly, 14(1), 99-110.
2. Roth, P. L. (1994). Missing data: A conceptual review for applied psychologists. Personnel Psychology, 47(3), 537-560.
3. Blair, E., & Zinkhan, G. M. (2006). Nonresponse and generalizability in academic research. Journal of the Academy of Marketing Science, 34(1), 4-7.
4. Newman, D. A. (2014). Missing data: Five practical guidelines. Organizational Research Methods, 17(4), 372-411.
5. Hair, J. F., Black, W. C., Babin, B. J., Anderson, R. E., & Tatham, R. L. (2006). Multivariate data analysis (6th ed.). New Jersey: Pearson Education.
REFERENCES
