3rd Socio-Cultural Data Summit


  Statistical Approaches to Missing Data:
Imputation, Interpolation, and Data Fusion


            Brian Efird, Ph.D.
       National Defense University
What Do We Mean By “Missing Data”
• In a structured, quantitative dataset, we simply mean that some of the
  “observations” have null values. That is, there is no observation for
  some part(s) of the dataset.
   − E.g. in a survey, an answer(s) was not provided to a question (or
       multiple questions) by a respondent (or multiple respondents).
   − We intended to have these observations but they are not present in
       the dataset.
• Missing responses can also be “strategic” (e.g. deception or
  self-preservation).
• However, we would still like to say something or make an inference
  about the phenomenon that is supposedly measured by the dataset as if
  we had no missing values.
• One approach just ignores the missing data. Another approach applies
  one of various statistical techniques to “fill” the holes in the dataset.
• Either approach has consequences and requires one to understand a bit
  more about why the data are missing.
Typical Assumptions About Missing Data for Statistics

• Values can be missing for dependent (response) variables or on
  independent (explanatory) variables.
• Missing data can affect properties of estimators (for
  example, means, percentages, percentiles, variances, ratios,
  regression parameters and so on).
• Missing data can also affect inferences, i.e. the properties of tests
  and confidence intervals, and Bayesian posterior distributions.
• A critical determinant of these effects is the way in which the
  probability of an observation being missing (the missingness
  mechanism) depends on other variables (measured or not) and on
  its own value.
• If one ignores missing data, it may bias the sample. E.g., if you only
  include observations in behavioral data where every question is
  answered, you typically end up with a very odd sample.
More Assumptions About Missing Data for Statistics

• In contrast with the sampling process, which is usually known, the
  missingness mechanism is usually unknown.
• The additional assumptions needed to allow the observed data to
  be the basis of inferences that would have been available from
  the complete data can usually be expressed in terms of either:
   − The relationship between selection of missing observations
      and the values they would have taken, or
   − The statistical behavior of the unseen data.
• These additional assumptions are not subject to assessment from
  the data under analysis; their plausibility cannot be definitively
  determined from the data.

What Type of Missing Data Do You Have – MCAR?

• Missing data are said to be missing completely at random (MCAR)
  if the probability that data are missing does not depend on
  observed or unobserved data.
• Under MCAR, the missing-data values are a simple random
  sample of all data values, and so any analysis that discards the
  missing values remains consistent (although maybe inefficient).
• An example of an MCAR mechanism would be that a laboratory
  sample is dropped, so the resulting observation is missing. Or
  data may be missing because equipment malfunctioned, the
  weather was terrible, people got sick, or the data were not
  entered correctly.
• This is the best case. It means there is no underlying mechanism
  or pattern (observed or unobserved) which explains the missing
  data. Proceed….
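
A minimal simulation sketch of the MCAR case, using only the Python standard library. The distribution, sample size, and 30% missingness rate are illustrative assumptions, not values from the talk. It shows the complete-case mean staying close to the full-data mean when values are dropped independently of everything else.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 20_000

# Hypothetical measurements (assumed normal, mean 50, sd 10)
values = [rng.gauss(50.0, 10.0) for _ in range(n)]

# MCAR: every value has the same 30% chance of being lost,
# regardless of its own size or of any other variable
observed = [v for v in values if rng.random() >= 0.3]

full_mean = statistics.fmean(values)
complete_case_mean = statistics.fmean(observed)

print(f"kept {len(observed)} of {n} values")
print(f"full-data mean:     {full_mean:.2f}")
print(f"complete-case mean: {complete_case_mean:.2f}")
```

Discarding the missing rows costs sample size (efficiency) but does not shift the estimate in any systematic direction.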
What Type of Missing Data Do You Have – MAR?

• Missing data are said to be missing at random (MAR) if the
  probability that data are missing does not depend on unobserved
  data but may depend on observed data.

• That is, the data are not missing completely at random.

• In other words, under MAR, the probability of a value being
  missing will generally depend on observed values, so it does not
  correspond to the intuitive notion of random.

What Type of Missing Data Do You Have – MAR? (cont’d)

• For example:
   − People who are depressed might be less inclined to report their
     income, and thus reported income will be related to depression.
   − Depressed people might also have a lower income in
     general, and thus when we have a high rate of missing data
     among depressed individuals, the observed mean income
     might be higher than it would be without missing data.
   − However, if, among depressed patients, the probability of
     reporting income is unrelated to income level, then the data
     would be considered MAR, though not MCAR.
   − Another way of saying this is that, to the extent that
     missingness can be explained by other variables that are
     included in the analysis, the data are MAR.
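
The depression and income example can be sketched as a small simulation. All numbers here (group shares, income levels, missingness rates) are made-up assumptions for illustration. Missingness depends only on depression status, which is observed for everyone, so the data are MAR: the raw mean of reported incomes is off, but averaging within groups and reweighting by the observed group shares recovers the full-data mean.

```python
import random
import statistics

rng = random.Random(1)
n = 40_000
people = []  # tuples of (depressed, income, reported)
for _ in range(n):
    depressed = rng.random() < 0.3
    # Assumption: depressed respondents earn less on average
    income = rng.gauss(40.0 if depressed else 60.0, 10.0)
    # MAR: the chance of not reporting depends only on depression status,
    # not on the income value itself
    p_missing = 0.6 if depressed else 0.1
    reported = rng.random() >= p_missing
    people.append((depressed, income, reported))

true_mean = statistics.fmean([inc for _, inc, _ in people])
observed_mean = statistics.fmean([inc for _, inc, rep in people if rep])

def reported_mean(flag):
    """Mean reported income within one depression group."""
    return statistics.fmean(
        [inc for dep, inc, rep in people if rep and dep == flag]
    )

# Depression status is known for everyone, so group shares use all rows
share_depressed = sum(dep for dep, _, _ in people) / n
adjusted_mean = (share_depressed * reported_mean(True)
                 + (1 - share_depressed) * reported_mean(False))

print(f"full-data mean:        {true_mean:.2f}")
print(f"reported-only mean:    {observed_mean:.2f}")
print(f"group-reweighted mean: {adjusted_mean:.2f}")
```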
What Type of Missing Data Do You Have – MNAR?

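• Missing data are said to be missing not at random (MNAR) if the
  probability that data are missing depends on unobserved data, that
  is, if values are missing for a specific and systematic, but
  unobserved, reason.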
• We cannot ignore data that are MNAR.
• For example:
   − If we are studying mental health and people who have been
      diagnosed as depressed are less likely than others to report their
      mental status, the data are not missing at random.
   − Clearly the mean mental status score for the available data will
      not be an unbiased estimate of the mean that we would have
      obtained with complete data.
   − The same thing happens when people with low income are less
      likely to report their income on a data collection form.
   − Or, if you ask opinions on a large number of
      instruments, typically only highly educated people answer all of
      them. If you drop non-responses, you bias the sample badly.
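
A minimal sketch of the low-income example, again with assumed, illustrative numbers. Here the chance of reporting depends on the income value itself, and no other variable is recorded, so nothing in the observed data reveals or corrects the bias.

```python
import random
import statistics

rng = random.Random(2)
n = 40_000

# Hypothetical incomes (assumed normal, mean 50, sd 15)
incomes = [rng.gauss(50.0, 15.0) for _ in range(n)]

# MNAR: people with income below 50 report only 40% of the time,
# everyone else reports 90% of the time
reported = [x for x in incomes if rng.random() < (0.4 if x < 50.0 else 0.9)]

true_mean = statistics.fmean(incomes)
reported_mean = statistics.fmean(reported)

print(f"full-data mean:     {true_mean:.2f}")
print(f"reported-only mean: {reported_mean:.2f}")
```

Unlike the MAR case, there is no observed variable to condition on here, so the reported-only mean stays too high.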
Introduction to Imputation

• Missing data arise frequently.
• The technique of multiple imputation, which originated in the early
  1970s in application to survey nonresponse, has gained popularity
  over the years.
• An imputation represents one set of plausible values for missing
  data. Multiple imputations represent multiple sets of plausible
  values.
• Multiple imputation is a simulation-based exercise where a
  number of plausible values for each missing observation are
  generated.
• This raises a secondary but still important question: if multiple
  imputations are to be generated, how many should one simulate?
  More is better, to some extent….
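
One way to picture "multiple sets of plausible values" is the sketch below. It is not the method from the talk: it assumes a simple linear model for y given x, uses a bootstrap of the complete cases to vary the fitted model between imputations, and pools the m estimates with Rubin's rules (a standard combining rule that the slides do not name). The data, the helper `fit_line`, and m = 20 are illustrative assumptions.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return a, b, sd

rng = random.Random(3)
n, m = 2_000, 20  # sample size and number of imputations (illustrative)
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [2.0 + 1.5 * x + rng.gauss(0.0, 1.0) for x in xs]
# y is missing more often when x is large (MAR given the observed x)
missing = [rng.random() < (0.6 if x > 0 else 0.1) for x in xs]
complete = [(x, y) for x, y, mis in zip(xs, ys, missing) if not mis]
cc_mean = statistics.fmean([y for _, y in complete])

estimates, variances = [], []
for _ in range(m):
    # Refit on a bootstrap sample so imputations reflect model uncertainty
    boot = [rng.choice(complete) for _ in complete]
    a, b, sd = fit_line([p[0] for p in boot], [p[1] for p in boot])
    # Fill each hole with a prediction plus random residual noise
    filled = [a + b * x + rng.gauss(0.0, sd) if mis else y
              for x, y, mis in zip(xs, ys, missing)]
    estimates.append(statistics.fmean(filled))
    variances.append(statistics.variance(filled) / n)

# Rubin's rules: average the estimates, then add within- and
# between-imputation variance
pooled = statistics.fmean(estimates)
within = statistics.fmean(variances)
between = statistics.variance(estimates)
total_var = within + (1 + 1 / m) * between

print(f"complete-case mean of y: {cc_mean:.3f}")
print(f"pooled MI mean of y:     {pooled:.3f} (se {total_var ** 0.5:.3f})")
```

The between-imputation term is what a single imputation cannot provide: it carries the extra uncertainty that comes from not knowing the missing values.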

Interpolation – A Simple Example of Imputation

We have data points on y and x, although sometimes the
observations on y are missing. We believe that y is a function of
x, justifying filling in the missing values by linear interpolation.



               Interpolation uses the values
               of x to approximate missing
               values of y in y1 and y2




Inference is using the data that we do have (i.e. in a survey those
questions that were answered) to fill in values for what we don’t
have (i.e. what they didn't answer or were unwilling to answer).
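
A minimal sketch of linear interpolation for missing y values, assuming x is sorted in ascending order and that missing entries are marked with `None`. The function name `interpolate_missing` and the sample data are illustrative. Gaps at either end are left unfilled, because filling them would be extrapolation rather than interpolation.

```python
def interpolate_missing(xs, ys):
    """Fill None entries of ys by linear interpolation over xs (ascending)."""
    known = [(x, y) for x, y in zip(xs, ys) if y is not None]
    filled = []
    for x, y in zip(xs, ys):
        if y is not None:
            filled.append(y)
            continue
        left = [(kx, ky) for kx, ky in known if kx < x]
        right = [(kx, ky) for kx, ky in known if kx > x]
        if not left or not right:
            filled.append(None)  # no neighbour on one side: leave the gap
            continue
        (x0, y0), (x1, y1) = left[-1], right[0]
        # Point on the straight line between the two nearest known points
        filled.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return filled

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, None, 6.0, None, 10.0]
result = interpolate_missing(xs, ys)
print(result)  # [2.0, 4.0, 6.0, 8.0, 10.0]
```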
A Bit More on Imputation

• Univariate imputation is used to impute a single variable. It can
  be used repeatedly to impute multiple variables only when the
  variables are independent and will be used in separate analyses.
   − Well-established techniques are available for a variety of types
       of variables, e.g. continuous variables, censored
       variables, binary variables, categorical variables, and count
       variables.
• If variables follow a “monotone-missing” pattern, they can be
  imputed sequentially using univariate conditional distributions.
• When a pattern of missing values is arbitrary, iterative or
  multivariate methods should be used to fill in missing values.
• As with any statistical procedure, choosing an appropriate
  imputation approach is an art, and the choice should ultimately
  be determined by your data and research objectives. It is good
  practice to check that your imputations are sensible and to
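
Whether a dataset has a “monotone-missing” pattern can be checked mechanically. The sketch below assumes the usual definition: the variables can be ordered so that whenever one is missing for a record, every later variable is missing for that record too. The function name `is_monotone` and the toy records are illustrative.

```python
def is_monotone(rows):
    """Return (monotone?, variable order) for records where None = missing."""
    n_vars = len(rows[0])
    # For each variable, the set of record indices where it is missing
    miss = [{i for i, r in enumerate(rows) if r[j] is None}
            for j in range(n_vars)]
    # Order variables from least to most missing
    order = sorted(range(n_vars), key=lambda j: len(miss[j]))
    # Monotone if each missing set is contained in the next one
    nested = all(miss[a] <= miss[b] for a, b in zip(order, order[1:]))
    return nested, order

monotone_rows = [(1, 2, 3), (1, 2, None), (1, None, None)]
arbitrary_rows = [(1, None, 3), (None, 2, 3)]

print(is_monotone(monotone_rows))   # (True, [0, 1, 2])
print(is_monotone(arbitrary_rows))  # (False, [2, 0, 1])
```

When the check passes, the returned order is the sequence in which the variables could be imputed one at a time, each conditional on the ones before it.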
More Concretely

• Essentially, imputation is using responses we do have to construct
  a model to fill in responses we do NOT have.
• Other, naive techniques (e.g., filling in non-responses with the
  mean of the respondents) are not as good as using a model (i.e.
  treating the variable with missing data as a dependent variable
  and using logical independent variables to help fill in the values).
• For example:
   − If a person skips a policy instrument (e.g., abortion) but
     answered the questions on gay marriage and religion in
     politics, plus demographics, it is easy to impute the abortion
     response, and doing so is a lot more logically satisfying than
     filling in the mean.
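
The gap between mean filling and model-based filling can be sketched with simulated data. Everything here is an assumption for illustration: a single numeric predictor x stands in for the answered questions, y stands in for the skipped one, and the model is a plain least-squares line. Because the data are simulated, the true values of the "missing" answers are known and the two fills can be scored against them.

```python
import random
import statistics

rng = random.Random(4)
n = 5_000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]               # answered item
y = [3.0 + 2.0 * xi + rng.gauss(0.0, 0.5) for xi in x]     # item to impute
missing = [rng.random() < 0.3 for _ in range(n)]           # MCAR for simplicity

obs_x = [xi for xi, mis in zip(x, missing) if not mis]
obs_y = [yi for yi, mis in zip(y, missing) if not mis]

# Naive fill: every hole gets the mean of the observed responses
mean_fill = statistics.fmean(obs_y)

# Model fill: treat y as the dependent variable and regress it on x
mx, my = statistics.fmean(obs_x), statistics.fmean(obs_y)
slope = (sum((a - mx) * (b - my) for a, b in zip(obs_x, obs_y))
         / sum((a - mx) ** 2 for a in obs_x))
intercept = my - slope * mx

def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return (sum(e * e for e in errors) / len(errors)) ** 0.5

held_out = [(xi, yi) for xi, yi, mis in zip(x, y, missing) if mis]
rmse_mean = rmse([yi - mean_fill for _, yi in held_out])
rmse_model = rmse([yi - (intercept + slope * xi) for xi, yi in held_out])

print(f"RMSE, mean fill:  {rmse_mean:.2f}")
print(f"RMSE, model fill: {rmse_model:.2f}")
```

The model fill wins here only because x genuinely predicts y; with an uninformative predictor the two fills would perform about the same.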