Managing missing values in routinely reported data: One approach from the Democratic Republic of the Congo

Matt Worges
Data for Impact Webinar Series
December 2, 2020

• Framing the Webinar through the D4I lens
• DHIS2 data: advantages and issues
• Exploring a DHIS2 data set
• What to do with blanks?
• Interpolation
• Recreate the “Truth”
• Interpolation diagnostics
Overview

• The D4I team was tasked with conducting an impact evaluation of
the USAID Integrated Health Project (IHP) implemented in 9
provinces of the DRC
• IHP goal: Reduce maternal, newborn, and child deaths through delivery of
integrated health services
• IHP objectives: Increase access to and use of quality health services in
the targeted health zones
IHP Impact Evaluation

• D4I research question: What was the impact of IHP on the
utilization of health services (e.g., treatment for childhood illnesses)
over the course of the study period?
• Measuring impact: D4I is assessing impact through a difference-in-differences
(DID) model with propensity score matching (PSM)
• Data source: We are using DHIS2 data for this impact evaluation
IHP Impact Evaluation – Approach

• PSM is widely used to mitigate confounding in observational
studies
• Complications arise when the covariates used to estimate the propensity
scores are only partially observed
• Interpolation/imputation approaches provide a potential solution for
handling missing data in the estimation of the propensity scores
• It is recommended to derive the propensity score after applying interpolation or
imputation
IHP Impact Evaluation – Propensity Score Matching

• Addition/removal of health facilities at different time points
• Long runs of missing values
• Zero counts are typically not entered – they are left blank
• Cannot distinguish between truly missing and zero
• Data entry errors manifesting as outliers/anomalous points
• Reporting has improved over time, so older time points are less
complete
Some DHIS2 Issues

• Missing data can result in:
• Reduced statistical power
• Biased estimators
• Reduced representativeness of the sample
• Generally incorrect inference and conclusions
Why do we care about missingness?
Overview of Approaches for Missing Data – Susan Buchman

• Time Series Characteristics
• Restricted to Haut-Katanga Province, DRC
• Uncomplicated + severe malaria cases (all ages)
• 24-month period from October 2018 to September 2020
• Health facility count = 1,362
• The monthly-aggregated time series appears to include both a seasonal
component and a positive trend
Data Set

Unprocessed Data – Missingness Visualized
HF Oct-18 Nov-18 Dec-18 Jan-19 Feb-19 Mar-19 Apr-19 May-19 Jun-19 Jul-19 Aug-19 Sep-19 Oct-19 Nov-19 Dec-19 Jan-20
hk Panda Hôpital Général de Référence 514 637 637 910 563 1375 678 483 839 773 929 792 694 1355 1219
hk Serge Amie Centre de Santé 300 306 274 300 320 440 522 582
hk AENAF Centre de Santé de Référence 91 60 212 154 65 279 114 59 213 55 131 38 399 227 222
hk Asvie Centre Médical 439 556 475 379 370 335 279 280 256 381 627 639
hk Mupanda Centre de Santé 610 479 363 610 641 408 573 248 237 279 455 319 203
hk Boma Publique Centre de Santé 294 293 304 293 308 318 178 225 326 325 240
hk Kawama Centre de Santé 174 176 2 283 280 304 286 288 4 275 379 319 264 313
hk Kabambakuku Centre de Santé 317 396 372 434 368 298 255 314 303 251 287 283
hk Kaboka Centre de Santé 419 314 201 240 350 199 151 197 274 257
hk Kasomeno Centre de Santé de Référence 282 307 306 265
hk Kikula Centre de Santé de Référence 221 241 246 275 167 318 393
hk Belle Vue Centre de Santé 135 157 555 350 124 102 92

Unprocessed Data – Missingness Visualized
Missing (28.6%), Present (71.4%)
‘visdat’ package
Malaria Cases – Haut-Katanga Province

Unprocessed Data – Histogram of Missingness
Bar annotations: no missing values (complete case analysis); one missing value; two missing values; completely blank records (remove from data set)
Bar labels: 284, 193, 137, 27
‘ggplot2’ package
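The counts behind a histogram like this can be sketched in a few lines of base R. The matrix below is made-up illustrative data (one row per health facility, one column per month), and the object names are assumptions rather than anything from the D4I code.

```r
# Made-up example: 4 facilities x 4 months, NA marks a blank in DHIS2.
cases <- matrix(c(514, 637, 910, 563,
                  300,  NA, 274,  NA,
                   NA,  NA,  NA,  NA,
                   91,  60,  NA, 154),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("HF", 1:4),
                                c("Oct-18", "Nov-18", "Dec-18", "Jan-19")))

# Overall share of missing cells (the figure reported by a missingness plot).
pct_missing <- 100 * mean(is.na(cases))

# Number of missing months per facility; table() gives the histogram counts.
n_missing <- rowSums(is.na(cases))
missing_hist <- table(n_missing)

# Facilities usable for a complete case analysis, and fully blank records.
complete_hf <- names(n_missing)[n_missing == 0]
blank_hf    <- names(n_missing)[n_missing == ncol(cases)]
```

Here `complete_hf` holds the facilities with no gaps and `blank_hf` the records that carry no information and could be dropped.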

Unprocessed Data – Outliers?
What are these doing here?
Are they malaria outbreaks?
Are they data entry errors?

Unprocessed Data – Outliers.
‘anomalize’ package
Something looks off here
This point didn’t show up as anomalous

• One method to remove outliers is to delete those values that lie more than
X standard deviations above or below the median
• The median is insensitive to extreme values in your time series
• Experiment with different thresholds (e.g., ± 4 SDs from the median
or ± 6 SDs from the median) to examine what happens to your data
Removing Egregious Outliers – One Approach
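A minimal base-R sketch of this rule follows, assuming the standard deviation is computed on the same series that is being screened. The function name `flag_outliers` and the example values are hypothetical.

```r
# Flag values lying more than k standard deviations from the series median.
flag_outliers <- function(x, k = 4.5) {
  centre <- median(x, na.rm = TRUE)
  spread <- sd(x, na.rm = TRUE)
  !is.na(x) & abs(x - centre) > k * spread
}

# Made-up facility series: 24 ordinary months plus one suspicious entry.
cases <- c(rep(c(290, 300, 310), 8), 3000)

# Try several thresholds to see how many points each would remove.
n_flagged <- sapply(c(4, 4.5, 6), function(k) sum(flag_outliers(cases, k)))

# Remove the flagged values at the chosen threshold by setting them to NA.
cleaned <- cases
cleaned[flag_outliers(cases, k = 4.5)] <- NA
```

In this example the 3000 entry is flagged at 4 and 4.5 SDs but not at 6, because a single extreme value inflates the standard deviation of a short series. That is one reason to experiment with the threshold rather than fix it in advance.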

Figure labels: malaria cases; median; standard deviation; + 4.5 SDs from the median
Annotation: This value would be removed from the data set

Anomalous Data Points
This is what I’m targeting for removal
Less concerned with these

Removing Egregious Outliers - Effects
Average Malaria Cases – Haut-Katanga Province
+4.5 SDs from the median
Removed 8 values or 0.025%
Unprocessed data set

Are missing values actually
zeros in the DRC DHIS2?

Link between Missingness & Median Case Counts
1-15 16-30 31-45 46-60 61-75 76-90 91-105 106-120 121-135 136-150 >150
Median Health Facility Malaria Cases (binned)
Generalization: the lower the median case count, the
higher the average number of missing values

• Assume no item nonresponse?
• Examine this notion with two extreme examples
• One HF time series with large monthly values and 1 missing
• One HF time series with low monthly values and 1 missing
• Replace missing with zero and run anomaly detection
Assumption: Missing Values are Zeros
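The two extreme examples can be mimicked with made-up series. This sketch substitutes the simpler median ± SD rule for the anomaly detection used on the slides (it does not call the ‘anomalize’ package), and all values and names are illustrative assumptions.

```r
# Does a value sit more than k SDs from the median of its own series?
is_anomalous <- function(x, k = 4.5) abs(x - median(x)) > k * sd(x)

# High case load facility with one missing month.
high <- rep(c(290, 300, 310), 8)
high[5] <- NA
# Low case load facility with one missing month.
low <- rep(c(1, 2, 3), 8)
low[5] <- NA

# Treat the blank as a true zero in both series.
high_zero <- replace(high, is.na(high), 0)
low_zero  <- replace(low,  is.na(low),  0)

# Is the substituted zero flagged as anomalous?
high_flag <- is_anomalous(high_zero)[5]
low_flag  <- is_anomalous(low_zero)[5]
```

In this example the zero stands out in the high case load series but is unremarkable in the low case load series, so treating blanks as zeros is far more plausible for small facilities than for large ones.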

Initial missing value was replaced with 0
Initial missing value was replaced with 0

Interpolation on
Univariate Time Series

• A univariate time series is a sequence of single observations at
regular and successive points in time
• Possible to decompose the time series into its trend, seasonal, and
irregular components
• We can use these time series characteristics in the interpolation process
Univariate Time Series

Panels: data, seasonal, trend, remainder (x-axis: 2017 to 2020)
Loess Seasonal Decomposition of Average Malaria Cases
‘stats’ package
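A decomposition of this kind can be sketched with `stl()` from the ‘stats’ package. The series below is simulated, and the choice of `s.window = "periodic"` is an assumption rather than the setting used for the slide.

```r
# Simulated monthly series with a positive trend and a yearly cycle.
set.seed(1)
months <- 1:48
cases <- ts(200 + 2 * months + 40 * sin(2 * pi * months / 12) + rnorm(48, sd = 10),
            start = c(2017, 1), frequency = 12)

# Loess-based seasonal decomposition; needs a complete series (no NAs).
fit <- stl(cases, s.window = "periodic")

# Matrix with one column each for seasonal, trend, and remainder.
components <- fit$time.series

# plot(fit) would draw the four stacked panels: data, seasonal, trend, remainder.
```

The three components add back up to the original series, which is what later allows a seasonal component to be removed before interpolation and restored afterwards.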

Axes: autocorrelation function (y) against lag (x)
Autocorrelation Function Plot (ACF plot)
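As a rough illustration of what an ACF plot reveals, the sketch below computes the autocorrelations of a made-up, purely seasonal monthly series with `acf()` from the ‘stats’ package.

```r
# Made-up monthly series with a 12-month cycle and no trend.
months <- 1:48
cases <- ts(200 + 40 * sin(2 * pi * months / 12), frequency = 12)

# plot = FALSE returns the values instead of drawing the plot.
ac <- acf(cases, lag.max = 24, plot = FALSE)

# Element 1 is lag 0, so lag k sits at position k + 1.
# For a ts with frequency 12, ac$lag is expressed in years (lag 12 months = 1).
acf_lag6  <- ac$acf[7]
acf_lag12 <- ac$acf[13]
```

A seasonal series shows a strong positive autocorrelation at the seasonal lag (12 months here) and a negative one at half that lag, which is the pattern to look for before choosing an interpolation method that accounts for seasonality.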


• Values in a series do not have violent, unexplained fluctuations
• Change between neighboring points (increases/decreases) occurs at
a uniform rate
Assumptions of Interpolation

• Easy to code (one line in R for long form data frame)
• df$int_cases <- na_interpolation(df$cases, option = "linear", maxgap = 2)
• Intuitive understanding of linearly interpolating across very short
gaps of missing values
• Probably a good approach for high case load facilities
• May not grossly deviate from the ‘truth’ when applied to low case load
facilities
A Role for Linear Interpolation?
‘imputeTS’ package
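One caution with a long-form data frame: a single call over the whole `cases` column can interpolate across the boundary between two facilities, so the interpolation is best applied within each facility. The sketch below shows the idea in base R with a hypothetical helper, `interp_short_gaps`, standing in for the ‘imputeTS’ call. It fills only interior gaps of at most `maxgap` values, and its handling of gaps at the start or end of a series may differ from the package. The data frame is made up and is assumed to be sorted by month within facility.

```r
# Hypothetical helper: linear interpolation across gaps of <= maxgap values.
interp_short_gaps <- function(x, maxgap = 2) {
  obs <- which(!is.na(x))
  if (length(obs) < 2) return(x)                 # nothing to join
  # approx() leaves NA outside the range of observed points.
  filled <- approx(obs, x[obs], xout = seq_along(x))$y
  # Put NA back wherever the original gap was longer than maxgap.
  runs <- rle(is.na(x))
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  for (i in which(runs$values & runs$lengths > maxgap)) {
    filled[starts[i]:ends[i]] <- NA
  }
  filled
}

# Made-up long-form data: two facilities, six months each.
df <- data.frame(
  facility = rep(c("A", "B"), each = 6),
  month    = rep(1:6, times = 2),
  cases    = c(10, NA, 30, NA, NA, NA,   NA, 5, NA, NA, 20, 8)
)

# ave() applies the helper separately within each facility.
df$int_cases <- ave(df$cases, df$facility, FUN = interp_short_gaps)
```

Facility A keeps its run of three missing months, while the two-month gap in facility B is filled with 10 and 15.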

Linear Interpolation
Joining known values with linear segments

Linearly interpolated

Seasonality in Interpolation
Panels: un-imputed data; linearly interpolated data w/o seasonality; linearly interpolated data w/ seasonality

• Take seasonality into account
• na.interp from the ‘forecast’ package in R
• By default, uses linear interpolation for non-seasonal series. For seasonal series, a
robust STL decomposition is first computed. Then a linear interpolation is applied to
the seasonally adjusted data, and the seasonal component is added back.
• na.StructTS from the ‘zoo’ package in R
• Interpolate with seasonal Kalman filter
• These two functions use similar mechanisms to interpolate missing
data in that they both can ‘handle’ seasonality in the time series
Univariate Time Series Interpolation
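The mechanism described for seasonal series (decompose, interpolate the seasonally adjusted data, add the seasonal component back) can be sketched in base R. This is an illustration of the idea under stated assumptions, not the implementation inside ‘forecast’ or ‘zoo’: the provisional linear fill in step 1 and the `s.window = "periodic"` setting are choices made for this sketch, and the series is simulated. With the packages installed, the corresponding calls on a `ts` object would be `na.interp(cases)` and `na.StructTS(cases)`.

```r
# Simulated monthly series with trend and seasonality, then three blanks.
set.seed(42)
months <- 1:48
truth <- ts(200 + 2 * months + 40 * sin(2 * pi * months / 12) + rnorm(48, sd = 5),
            start = c(2017, 1), frequency = 12)
cases <- truth
cases[c(15, 16, 30)] <- NA
miss <- is.na(cases)
idx <- seq_along(cases)

# Step 1: provisional linear fill, because stl() does not accept NAs.
provisional <- cases
provisional[miss] <- approx(idx[!miss], cases[!miss], xout = idx[miss], rule = 2)$y

# Step 2: robust STL decomposition; keep the seasonal component.
fit <- stl(provisional, s.window = "periodic", robust = TRUE)
seasonal <- as.numeric(fit$time.series[, "seasonal"])

# Step 3: linear interpolation on the seasonally adjusted series.
adjusted <- as.numeric(cases) - seasonal
adjusted[miss] <- approx(idx[!miss], adjusted[!miss], xout = idx[miss], rule = 2)$y

# Step 4: add the seasonal component back.
filled <- ts(adjusted + seasonal, start = start(cases), frequency = 12)
```

Observed months are returned unchanged; only the blanks receive values, and those values follow the seasonal shape instead of a straight line between neighbors.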

Seasonally Adjusted Time Series

Let’s reset and apply some
of these steps

Missingness Visualized – Unprocessed Data
284 HFs with no missing data

Missingness Visualized – Removed New/Defunct HFs

Missingness Visualized – Linear Interpolation (gaps ≤ 2)
807 HFs with no missing data

Time Series Trends
New/defunct HFs and outliers have been removed from all time series

• Use a data set containing only complete time series records
• 2.5% of data are zero values (primarily limited to smaller facilities)
• Introduce random missingness
• Randomly delete 15% of data points
• Delete 90% of remaining zero values
• Include runs of more than 2 missing values
• Apply various imputation methods and compare against the “truth”
• Replace all blanks with zeros
• Linear interpolation on gaps ≤ 2
• Use the two identified interpolation strategies that consider seasonality
A Quick Example
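The deletion steps above can be sketched as follows. The complete "truth" matrix is simulated here (Poisson counts, one row per facility), and the number of facilities given a long run and the run length of 3 are assumptions made for illustration.

```r
set.seed(2020)
n_hf <- 50
n_months <- 24

# Simulated complete data; small facilities (lambda = 1) produce zeros.
lambda <- rep(c(1, 20, 150), length.out = n_hf)
truth <- matrix(rpois(n_hf * n_months, lambda = rep(lambda, times = n_months)),
                nrow = n_hf)
masked <- truth

# 1. Randomly delete 15% of all data points.
n_cells <- length(truth)
masked[sample.int(n_cells, round(0.15 * n_cells))] <- NA

# 2. Delete 90% of the zero values that remain.
zero_idx <- which(masked == 0)
drop_n <- round(0.90 * length(zero_idx))
masked[zero_idx[sample.int(length(zero_idx), drop_n)]] <- NA

# 3. Add a run of 3 consecutive missing months to 5 random facilities.
run_rows <- sample.int(n_hf, 5)
for (i in run_rows) {
  first <- sample.int(n_months - 2, 1)
  masked[i, first:(first + 2)] <- NA
}

# Cells whose imputed values can later be compared against `truth`.
deleted <- is.na(masked)
```

Each imputation method is then run on `masked`, and its output is compared with `truth` at the `deleted` positions only.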

Time Series Trends
Anomalous data points have been removed

na.StructTS
Average raw bias = -1.18
na.interp
Average raw bias = -0.03

na.StructTS
MAPE = 119.03
na.interp
MAPE = 117.41

The RMSE difference is positive for 1,847
HFs, indicating that the ‘na.StructTS’
approach had a lower RMSE for 68% of HFs
‘na.StructTS’ approach has lower RMSE
‘na.interp’ approach has lower RMSE
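The three diagnostics can be computed as in the sketch below. The slides do not spell out the formulas, so the definitions here are assumptions: raw bias is the mean of imputed minus true values, MAPE excludes true values of zero (where a percentage error is undefined), and the RMSE difference is taken as ‘na.interp’ minus ‘na.StructTS’ so that a positive value favors ‘na.StructTS’, in line with the slide. All numbers are made up.

```r
# Diagnostics for one facility, evaluated at the deleted positions only.
imputation_metrics <- function(imputed, truth) {
  err <- imputed - truth
  nonzero <- truth != 0
  c(raw_bias = mean(err),
    mape     = 100 * mean(abs(err[nonzero]) / abs(truth[nonzero])),
    rmse     = sqrt(mean(err^2)))
}

# Made-up true values and the values two methods put in their place.
truth      <- c(10, 20, 30, 40)
imp_interp <- c(10, 25, 30, 44)   # stand-in for na.interp output
imp_struct <- c(12, 18, 33, 40)   # stand-in for na.StructTS output

m_interp <- imputation_metrics(imp_interp, truth)
m_struct <- imputation_metrics(imp_struct, truth)

# Positive difference: the na.StructTS stand-in has the lower RMSE here.
rmse_diff <- m_interp[["rmse"]] - m_struct[["rmse"]]
```

Repeating this per facility and counting how often `rmse_diff` is positive gives the kind of summary reported above.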

• Assess missingness
• Address egregious outliers
• Manage new/defunct facility records
• Decompose the time series
• Try a few different interpolation techniques and plot results
• Isolate a subset of records with no missing data
• Introduce missing data and then recreate the “truth”
Recap

This presentation was produced with the support of the United States Agency for International
Development (USAID) under the terms of the Data for Impact (D4I) associate award
7200AA18LA00008, which is implemented by the Carolina Population Center at the University of
North Carolina at Chapel Hill, in partnership with Palladium International, LLC; ICF Macro, Inc.;
John Snow, Inc.; and Tulane University. The views expressed in this publication do not
necessarily reflect the views of USAID or the United States government.
www.data4impactproject.org

• DHIS2 time series do not always lend themselves well to multiple
imputation
• Multiple imputation is a preferable choice when there are variables
predictive of missingness that could be included in the imputation model
• With DHIS2 data, it can be difficult to locate other time-dependent variables to aid in
the imputation process
• DHIS2 time series may exhibit a missing-not-at-random (MNAR) missingness structure
• Earlier time points have more missing data
• Zero values are more likely to be missing
Imputation

• Advantages of using DHIS2 data
• Access to a wide breadth of data elements/services
• Analyze at various levels of the health system
• National, regional, district, health facility
• Data are generally collected via standardized reporting tools
• Data tend to be reported at regular intervals allowing for frequent updates
to analyses
• However, not all data elements are well-reported, and it is typically
necessary to process/clean DHIS2 data
Why Use DHIS2 Data?
