Statistical Approaches to Missing Data

3rd Socio-Cultural Data Summit

Statistical Approaches to Missing Data:
Imputation, Interpolation, and Data Fusion

Brian Efird, Ph.D.
National Defense University

What Do We Mean By “Missing Data”?
• In a structured, quantitative dataset, we simply mean that some of the
“observations” have null values. That is, there is no observation for
some part(s) of the dataset.
− E.g., in a survey, one or more respondents did not answer one or
more questions.
− We intended to have these observations but they are not present in
the dataset.
• Missing responses can also be “strategic” (e.g. deception or
self-preservation).
• However, we would still like to say something or make an inference
about the phenomenon that is supposedly measured by the dataset, as
if we had no missing values.
• One approach just ignores the missing data. Another approach applies
one of various statistical techniques to “fill” the holes in the dataset.
• Either approach has consequences and requires one to understand a bit
more about why the data are missing.

Typical Assumptions About Missing Data for Statistics

• Values can be missing for dependent (response) variables or on
independent (explanatory) variables.
• Missing data can affect properties of estimators (for example,
means, percentages, percentiles, variances, ratios, regression
parameters and so on).
• Missing data can also affect inferences, i.e. the properties of tests
and confidence intervals, and Bayesian posterior distributions.
• A critical determinant of these effects is the way in which the
probability of an observation being missing (the missingness
mechanism) depends on other variables (measured or not) and on
its own value.
• If one ignores missing data, it may bias the sample. E.g., if you only
include observations in behavioral data where every question is
answered, you typically end up with a very odd sample.

More Assumptions About Missing Data for Statistics

• In contrast with the sampling process, which is usually known, the
missingness mechanism is usually unknown.
• The additional assumptions needed to allow the observed data to
be the basis of inferences that would have been available from
the complete data can usually be expressed in terms of either:
− The relationship between selection of missing observations
and the values they would have taken, or
− The statistical behavior of the unseen data.
• These additional assumptions are not subject to assessment from
the data under analysis; their plausibility cannot be definitively
determined from the data.


What Type of Missing Data Do You Have – MCAR?

• Missing data are said to be missing completely at random (MCAR)
if the probability that data are missing does not depend on
observed or unobserved data.
• Under MCAR, the missing-data values are a simple random
sample of all data values, and so any analysis that discards the
missing values remains consistent (although maybe inefficient).
• An example of an MCAR mechanism would be that a laboratory
sample is dropped, so the resulting observation is missing. Or
data may be missing because equipment malfunctioned, the
weather was terrible, people got sick, or the data were not
entered correctly.
• This is the best case. It means there is no underlying mechanism
or pattern (observed or unobserved) which explains the missing
data. Proceed….
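
The claim that discarding MCAR values leaves an analysis consistent can be checked with a small simulation. This is only a sketch: the data, the 30% missingness rate, and all variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical complete data: 100,000 measurements with true mean 50.
y = rng.normal(loc=50.0, scale=10.0, size=100_000)

# MCAR: every value has the same 30% chance of being lost,
# regardless of its own value or of any other variable.
missing = rng.random(y.size) < 0.30
y_observed = y[~missing]

full_mean = y.mean()
complete_case_mean = y_observed.mean()
print(f"mean of full data:      {full_mean:.2f}")
print(f"mean of observed cases: {complete_case_mean:.2f}")
print(f"cases kept: {y_observed.size} of {y.size}")
```

The two means should agree closely; what is lost is sample size (efficiency), not correctness.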

What Type of Missing Data Do You Have – MAR?

• Missing data are said to be missing at random (MAR) if the
probability that data are missing does not depend on unobserved
data but may depend on observed data.

• That is, the data are not missing completely at random.

• In other words, under MAR, the probability of a value being
missing will generally depend on observed values, so it does not
correspond to the intuitive notion of random.


What Type of Missing Data Do You Have – MAR? (cont’d)

• For example:
− People who are depressed might be less inclined to report their
income, and thus reported income will be related to depression.
− Depressed people might also have a lower income in general, and
thus when we have a high rate of missing data among depressed
individuals, the mean of the reported incomes might be higher
than it would be without missing data.
− However, if, within depressed patients the probability of
reported income was unrelated to income level, then the data
would be considered MAR, though not MCAR.
− Another way of saying this is that, to the extent that
missingness can be explained by other variables that are
included in the analysis, the data are MAR.
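
The depression and income example can be simulated to see what MAR implies in practice. The sketch below assumes invented group shares, income distributions, and non-response rates; the point is only that conditioning on the observed variable (depression status) can undo the distortion in the complete-case mean.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical population: 30% depressed; depressed people earn less on average.
depressed = rng.random(n) < 0.30
income = np.where(depressed,
                  rng.normal(30_000, 8_000, n),
                  rng.normal(50_000, 8_000, n))

# MAR: the chance of not reporting income depends only on the observed
# depression status (60% vs 10%), not on income itself within each group.
p_missing = np.where(depressed, 0.60, 0.10)
reported = rng.random(n) >= p_missing

true_mean = income.mean()
complete_case_mean = income[reported].mean()

# Adjust using the observed variable: average the reported incomes within
# each group, then weight each group by its share of the whole sample.
adjusted_mean = sum(
    (depressed == g).mean() * income[reported & (depressed == g)].mean()
    for g in (True, False)
)

print(f"true mean:          {true_mean:,.0f}")
print(f"complete-case mean: {complete_case_mean:,.0f}")
print(f"adjusted mean:      {adjusted_mean:,.0f}")
```

Under these assumptions the complete-case mean is off by a few thousand, while the group-weighted mean lands close to the true value.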

What Type of Missing Data Do You Have – MNAR?

• Missing data are said to be missing not at random (MNAR) if the
probability that data are missing depends on unobserved data, i.e. the
data are missing for a specific and systematic, but unobserved, reason.
• We cannot ignore data that are MNAR.
• For example:
− If we are studying mental health and people who have been
diagnosed as depressed are less likely than others to report their
mental status, the data are not missing at random.
− Clearly the mean mental status score for the available data will
not be an unbiased estimate of the mean that we would have
obtained with complete data.
− The same thing happens when people with low income are less
likely to report their income on a data collection form.
− Or, if you ask opinions on a large number of
instruments, typically only highly educated people answer all of
them. If you drop non-responses, you bias the sample badly.
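
The low-income example can be illustrated the same way. In this sketch (the income distribution and non-response rates are made up), missingness depends on the unreported value itself, so nothing else in the dataset is available to correct for it.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical incomes (log-normal, so a long right tail).
income = rng.lognormal(mean=10.5, sigma=0.5, size=n)

# MNAR: people with low income are less likely to report it.
# Here those below the median skip the question 50% of the time,
# everyone else 10% of the time (invented rates).
below_median = income < np.median(income)
p_missing = np.where(below_median, 0.50, 0.10)
reported = rng.random(n) >= p_missing

true_mean = income.mean()
observed_mean = income[reported].mean()
print(f"true mean income:     {true_mean:,.0f}")
print(f"mean of reported:     {observed_mean:,.0f}")
print(f"relative overstatement: {observed_mean / true_mean - 1:.1%}")
```

Dropping the non-responses here overstates mean income, and a larger sample does not shrink that gap.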

Introduction to Imputation

• Missing data arise frequently.
• The technique of multiple imputation, which originated in the early
1970s in application to survey nonresponse, has gained popularity
over the years.
• An imputation represents one set of plausible values for missing
data. Multiple imputations represent multiple sets of plausible
values.
• Multiple imputation is a simulation-based exercise where a
number of plausible values for each missing observation are
generated.
• This raises the secondary but still important question, if multiple
imputations are to be generated, how many should one simulate?
More is better to some extent….
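
The slides do not say how results from the multiple imputed datasets are combined. A common choice is Rubin's rules, sketched below together with Rubin's relative-efficiency approximation, which gives one way to think about how many imputations are enough. The function names and the numbers are hypothetical.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine m per-imputation estimates and their squared standard errors."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                 # pooled point estimate
    within = u.mean()                # average within-imputation variance
    between = q.var(ddof=1)          # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, total

def relative_efficiency(m, frac_missing_info):
    """Efficiency of m imputations relative to infinitely many."""
    return 1.0 / (1.0 + frac_missing_info / m)

# Made-up results from m = 5 imputed datasets.
est = [10.2, 9.8, 10.5, 10.1, 9.9]
var = [0.40, 0.42, 0.38, 0.41, 0.39]
pooled, total_var = pool_rubin(est, var)
print(f"pooled estimate {pooled:.2f}, total variance {total_var:.2f}")

# Assuming 30% missing information, compare a few choices of m.
for m in (3, 5, 20):
    print(m, round(relative_efficiency(m, 0.30), 3))
```

The efficiency gain flattens quickly as m grows, which is one reading of "more is better to some extent".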


Interpolation – A Simple Example of Imputation

We have data points on y and x, although sometimes the
observations on y are missing. We believe that y is a function of
x, justifying filling in the missing values by linear interpolation.

Interpolation uses the values of x to approximate missing values of y in y1 and y2.

Inference is using the data that we do have (i.e. in a survey those
questions that were answered) to fill in values for what we don’t
have (i.e. what they didn't answer or were unwilling to answer).
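
A minimal sketch of linear interpolation, assuming y is roughly linear in x between observed points. The data are invented, and `np.interp` is used here as one convenient tool, not necessarily the one behind the slide.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, np.nan, 5.0, np.nan, np.nan, 11.0])  # nan marks a missing y

observed = ~np.isnan(y)

# Fill each missing y from the straight line joining its observed neighbors.
# np.interp expects the observed x values in increasing order.
y_filled = y.copy()
y_filled[~observed] = np.interp(x[~observed], x[observed], y[observed])

print(y_filled)  # y_filled is now 1, 3, 5, 7, 9, 11
```

Note that `np.interp` does not extrapolate: a missing y outside the range of observed x values is set to the nearest end value.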

A Bit More on Imputation

• Univariate imputation is used to impute a single variable. It can
be used repeatedly to impute multiple variables only when the
variables are independent and will be used in separate analyses.
− Well established techniques are available for a variety of types
of variables, e.g. continuous variables, censored
variables, binary variables, categorical variables, count
variables.
• If variables follow a “monotone-missing” pattern, they can be
imputed sequentially using univariate conditional distributions.
• When a pattern of missing values is arbitrary, iterative or
multivariate methods should be used to fill in missing values.
• As with any statistical procedure, choosing an appropriate
imputation approach is an art, and the choice should ultimately
be determined by your data and research objectives. It is good
practice to check that your imputations are sensible and to
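
Whether a dataset follows a “monotone-missing” pattern can be checked mechanically. The sketch below assumes the usual definition: the variables can be ordered so that once a value is missing for a case, every later variable is missing for that case too. The helper name is hypothetical.

```python
import numpy as np

def is_monotone_missing(data):
    """Return (is_monotone, column_order) for a 2-D array with nan as missing."""
    miss = np.isnan(np.asarray(data, dtype=float))
    # Order columns from fewest to most missing values.
    order = np.argsort(miss.sum(axis=0), kind="stable")
    ordered = miss[:, order]
    # Monotone: within each row the missing indicator never switches
    # from True back to False as we move across the ordered columns.
    monotone = bool(np.all(ordered[:, :-1] <= ordered[:, 1:]))
    return monotone, order

nan = np.nan
dropout = [[1.0, 2.0, 3.0],
           [1.0, 2.0, nan],
           [1.0, nan, nan]]      # e.g. panel attrition
scattered = [[1.0, nan, 3.0],
             [nan, 2.0, 3.0],
             [1.0, 2.0, nan]]    # arbitrary pattern

print(is_monotone_missing(dropout)[0])    # True
print(is_monotone_missing(scattered)[0])  # False
```

If the check passes, the returned column order is a candidate sequence for imputing the variables one at a time; if it fails, the iterative or multivariate methods mentioned above apply.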

More Concretely

• Essentially, imputation is using responses we do have to construct
a model to fill in responses we do NOT have.
• Other, naive techniques (e.g., filling in non-responses with the
mean of the respondents) are not as good as using a model (i.e.
treating the variable with missing data as a dependent variable
and using logical independent variables to help fill in the values).
• For example:
− If a person skips one policy item (e.g., abortion) but
answered the items on gay marriage and religion in politics, plus
demographics, it's easy to impute the abortion response, and
that is a lot more logically satisfying than filling in the mean.
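
The contrast between mean filling and model-based imputation can be sketched on simulated data. Everything here is an assumption for illustration: the items are treated as continuous scores, the relationship is linear, and ordinary least squares stands in for whatever model suits the real variable (a survey item would more likely need a logistic or ordinal model).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

# Hypothetical survey: two answered items (x1, x2) that predict a third (y).
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

# 30% of respondents skip the third item (chosen at random here).
missing = rng.random(n) < 0.30
obs = ~missing

# Naive approach: fill every gap with the mean of the observed answers.
mean_fill = np.full(missing.sum(), y[obs].mean())

# Model-based approach: regress y on x1 and x2 using those who answered,
# then predict y for those who did not.
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
model_fill = X[missing] @ coef

def rmse(pred):
    """Root mean squared error against the true (normally unknown) values."""
    return float(np.sqrt(np.mean((pred - y[missing]) ** 2)))

rmse_mean = rmse(mean_fill)
rmse_model = rmse(model_fill)
print(f"RMSE, mean fill:  {rmse_mean:.2f}")
print(f"RMSE, model fill: {rmse_model:.2f}")
```

The model-based fill tracks the skipped answers far more closely. Plain regression predictions still understate the natural spread of the data, which is one reason multiple imputation adds random draws.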

