Managing missing values in routinely reported data:
One approach from the DRC
Matt Worges
Data for Impact Webinar Series
December 2, 2020
Overview
• Framing the Webinar through the D4I lens
• DHIS2 data: advantages and issues
• Exploring a DHIS2 data set
• What to do with blanks?
• Interpolation
• Recreate the “Truth”
• Interpolation diagnostics
IHP Impact Evaluation
• The D4I team was tasked with conducting an impact evaluation of the USAID Integrated Health Project (IHP), implemented in 9 provinces of the DRC
  • IHP goal: Reduce maternal, newborn, and child deaths through delivery of integrated health services
  • IHP objectives: Increase access to and use of quality health services in the targeted health zones
IHP Impact Evaluation – Approach
• D4I research question: What was the impact of IHP on the utilization of health services (e.g., treatment for childhood illnesses) over the course of the study period?
• Measuring impact: D4I is assessing impact through a difference-in-differences (DID) model with propensity score matching (PSM)
• Data source: We are using DHIS2 data for this impact evaluation
IHP Impact Evaluation – Propensity Score Matching
• PSM is widely used to mitigate confounding in observational studies
  • Complications arise when the covariates used to estimate the propensity scores are only partially observed
  • Interpolation/imputation approaches provide a potential solution for handling missing data in the estimation of the propensity scores
  • It is recommended to derive the propensity scores after applying interpolation or imputation
Some DHIS2 Issues
• Addition/removal of health facilities at different time points
• Long runs of missing values
• Zero counts are typically not entered; they are left blank
  • Cannot distinguish between truly missing and zero
• Data entry errors manifesting as outliers/anomalous points
• Reporting has improved over time, making older time points less complete
Why do we care about missingness?
• Missing data can result in:
  • Reduced statistical power
  • Biased estimators
  • Reduced representativeness of the sample
  • Generally incorrect inference and conclusions
Overview of Approaches for Missing Data – Susan Buchman
Data Set
• Time series characteristics
  • Restricted to Haut-Katanga Province, DRC
  • Uncomplicated + severe malaria cases (all ages)
  • 24-month period from October 2018 to September 2020
  • Health facility count = 1,362
  • The monthly-aggregated time series appears to include both a seasonal component and a positive trend component
Unprocessed Data – Missingness Visualized
HF Oct-18 Nov-18 Dec-18 Jan-19 Feb-19 Mar-19 Apr-19 May-19 Jun-19 Jul-19 Aug-19 Sep-19 Oct-19 Nov-19 Dec-19 Jan-20
hk Panda Hôpital Général de Référence 514 637 637 910 563 1375 678 483 839 773 929 792 694 1355 1219
hk Serge Amie Centre de Santé 300 306 274 300 320 440 522 582
hk AENAF Centre de Santé de Référence 91 60 212 154 65 279 114 59 213 55 131 38 399 227 222
hk Asvie Centre Médical 439 556 475 379 370 335 279 280 256 381 627 639
hk Mupanda Centre de Santé 610 479 363 610 641 408 573 248 237 279 455 319 203
hk Boma Publique Centre de Santé 294 293 304 293 308 318 178 225 326 325 240
hk Kawama Centre de Santé 174 176 2 283 280 304 286 288 4 275 379 319 264 313
hk Kabambakuku Centre de Santé 317 396 372 434 368 298 255 314 303 251 287 283
hk Kaboka Centre de Santé 419 314 201 240 350 199 151 197 274 257
hk Kasomeno Centre de Santé de Référence 282 307 306 265
hk Kikula Centre de Santé de Référence 221 241 246 275 167 318 393
hk Belle Vue Centre de Santé 135 157 555 350 124 102 92
Unprocessed Data – Missingness Visualized
[Figure: Malaria Cases – Haut-Katanga Province; Missing (28.6%), Present (71.4%); ‘visdat’ package]
Unprocessed Data – Histogram of Missingness
[Figure: histogram of the number of missing values per HF (‘ggplot2’ package). Annotations: No missing values (complete case analysis); Completely blank records (remove from data set); One missing value; Two missing values. Bar labels: 284, 193, 137, 27]
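The counts behind a histogram like this can be tabulated in a few lines of base R. This is a minimal sketch, not the code used for the slides: the data frame `cases`, its column names, and the four example facilities are hypothetical, and blank cells are assumed to have been read in as `NA`.

```r
# Hypothetical wide-format data: one row per health facility (HF),
# one column per month; NA marks a blank cell in the export.
cases <- data.frame(
  hf     = c("HF A", "HF B", "HF C", "HF D"),
  oct_18 = c(514, NA, 91, NA),
  nov_18 = c(637, 300, NA, NA),
  dec_18 = c(637, 306, 212, NA)
)

month_cols <- setdiff(names(cases), "hf")

# Number of missing months for each facility
cases$n_missing <- rowSums(is.na(cases[month_cols]))

# Overall share of missing cells (the kind of percentage a missingness plot reports)
pct_missing <- 100 * mean(is.na(as.matrix(cases[month_cols])))

# Frequency table behind a histogram of missingness
missing_table <- table(cases$n_missing)
print(missing_table)

# Facilities with no missing values, and completely blank records
complete_hfs <- cases$hf[cases$n_missing == 0]
blank_hfs    <- cases$hf[cases$n_missing == length(month_cols)]
```

Facilities in `complete_hfs` are candidates for a complete case analysis, while those in `blank_hfs` carry no information and can be dropped before any interpolation.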
Unprocessed Data – Outliers?
What are these doing here?
Are they malaria outbreaks?
Are they data entry errors?
Unprocessed Data – Outliers.
[Figure: anomaly detection on the unprocessed series (‘anomalize’ package). Annotations: “Something looks off here”; “This point didn’t show up as anomalous”]
Removing Egregious Outliers – One Approach
• One method to remove outliers is to delete values that lie more than X standard deviations above or below the median
• The median is insensitive to extreme values in your time series
• Experiment with different thresholds (e.g., ± 4 SDs or ± 6 SDs from the median) to examine what happens to your data
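The threshold rule can be sketched in base R as follows. The function name `flag_outliers` and the example counts are hypothetical, and the sketch assumes the median and SD are computed separately for each facility's series, which the slides do not state.

```r
# Flag values more than k standard deviations from the series median.
# k is a tuning choice; the slides suggest experimenting with it.
flag_outliers <- function(x, k = 4.5) {
  centre <- median(x, na.rm = TRUE)
  spread <- sd(x, na.rm = TRUE)
  !is.na(x) & abs(x - centre) > k * spread
}

# Hypothetical monthly malaria counts for one facility, with one blank
# month and one suspicious spike
cases <- c(120, 135, 128, NA, 140, 132, 125, 138, 130, 127, 133, 129,
           131, 126, 136, 134, 128, 139, 2400, 130, 125, 137, 132, 129)

is_outlier <- flag_outliers(cases, k = 4.5)

# Removed values become missing and can be interpolated later
cleaned <- replace(cases, is_outlier, NA)
```

Unlike the median, the SD is inflated by the spike itself, so in a short series a very high threshold can fail to flag even an obvious data entry error. That is one practical reason to try several values of `k` and inspect what gets removed.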
[Figure: malaria cases for one time series, with the median, the standard deviation, and the + 4.5 SDs from the median threshold marked. Annotation: “This value would be removed from the data set”]
Anomalous Data Points
‘anomalize’ package
This is what I’m targeting for removal
Less concerned with these
Removing Egregious Outliers - Effects
Average Malaria Cases – Haut-Katanga Province
+4.5 SDs from the median
Removed 8 values or 0.025%
Unprocessed data set
Are missing values actually zeros in the DRC DHIS2?
Link between Missingness & Median Case Counts
[Figure: average number of missing values by Median Health Facility Malaria Cases (binned): 1-15, 16-30, 31-45, 46-60, 61-75, 76-90, 91-105, 106-120, 121-135, 136-150, >150]
Generalization: the lower the median case count, the higher the average number of missing values
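A summary of this kind can be produced with `cut()` and `tapply()` in base R. The sketch below uses a hypothetical per-facility table `hf` with made-up values; only the bin boundaries are taken from the figure.

```r
# Hypothetical per-facility summaries: median monthly cases and
# number of missing months in the study period
hf <- data.frame(
  median_cases = c(8, 12, 40, 55, 95, 160, 210),
  n_missing    = c(9, 7, 5, 4, 2, 1, 0)
)

# Bins of width 15 up to 150, then an open-ended top bin
# (a median of exactly 0 would fall outside the first bin)
breaks <- c(seq(0, 150, by = 15), Inf)
labels <- c(paste(seq(1, 136, by = 15), seq(15, 150, by = 15), sep = "-"),
            ">150")

hf$bin <- cut(hf$median_cases, breaks = breaks, labels = labels)

# Average number of missing months per bin (NA for bins with no facilities)
avg_missing <- tapply(hf$n_missing, hf$bin, mean)
print(avg_missing)
```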
Assumption: Missing Values are Zeros
• Assume no item nonresponse?
• Examine this notion with two extreme examples
  • One HF time series with large monthly values and 1 missing value
  • One HF time series with low monthly values and 1 missing value
• Replace missing with zero and run anomaly detection
Initial missing value was replaced with 0
Initial missing value was replaced with 0
‘anomalize’ package
Interpolation on Univariate Time Series
Univariate Time Series
• A univariate time series is a sequence of single observations at regular and successive points in time
• Possible to decompose the time series into its trend, seasonal, and irregular components
  • We can use these time series characteristics in the interpolation process
[Figure: Loess Seasonal Decomposition of Average Malaria Cases, 2017 to 2020; panels: data, seasonal, trend, remainder; ‘stats’ package]
[Figure: Autocorrelation Function Plot (ACF plot); x-axis: Lag; y-axis: Autocorrelation Function]
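A decomposition and ACF plot of this kind can be produced with `stl()` and `acf()` from the ‘stats’ package. The sketch below runs on a simulated series, not the DRC data; the trend, seasonal amplitude, and noise level are arbitrary assumptions.

```r
set.seed(1)

# Hypothetical 48-month series with a positive trend and a 12-month cycle
months    <- 1:48
avg_cases <- 200 + 2 * months + 40 * sin(2 * pi * months / 12) +
  rnorm(48, sd = 10)
avg_ts <- ts(avg_cases, start = c(2017, 1), frequency = 12)

# Loess-based seasonal decomposition; stl() needs more than two full
# periods of data and no missing values
fit <- stl(avg_ts, s.window = "periodic", robust = TRUE)

# Matrix with columns: seasonal, trend, remainder
components <- fit$time.series

plot(fit)                  # panels: data, seasonal, trend, remainder
acf(avg_ts, lag.max = 24)  # peaks near lag 12 months suggest seasonality
```

Because `stl()` requires more than two full periods, a 24-month monthly series on its own is too short to decompose this way; a longer history is needed.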
Assumptions of Interpolation
• Values in a series do not have violent, unexplained fluctuations
• The rate of change (increases/decreases) between points is uniform
A Role for Linear Interpolation?
• Easy to code (one line in R for a long-form data frame)
  • df$int_cases <- na_interpolation(df$cases, option = "linear", maxgap = 2)
• Intuitive understanding of linearly interpolating across very short gaps of missing values
• Probably a good approach for high case load facilities
• May not grossly deviate from the ‘truth’ when applied to low case load facilities
‘imputeTS’ package
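When a long-form data frame stacks several facilities, interpolation has to be applied within each facility so that values are not joined across facility boundaries. The sketch below is a base R stand-in for the idea, not the ‘imputeTS’ implementation: `interp_short_gaps` and the example data frame are hypothetical, rows are assumed to be sorted by month within facility, and leading or trailing blanks are left missing.

```r
# Linearly interpolate runs of at most `maxgap` missing values
interp_short_gaps <- function(x, maxgap = 2) {
  observed <- !is.na(x)
  if (sum(observed) < 2) return(x)  # nothing to draw a line between

  # Join known values with straight segments; positions outside the
  # observed range stay NA (rule = 1)
  filled <- approx(which(observed), x[observed],
                   xout = seq_along(x), rule = 1)$y

  # Restore NA for runs of missing values longer than maxgap
  runs     <- rle(is.na(x))
  too_long <- rep(runs$values & runs$lengths > maxgap, runs$lengths)
  filled[too_long] <- NA
  filled
}

# Hypothetical long-form data: two facilities, six months each
df <- data.frame(
  hf    = rep(c("HF A", "HF B"), each = 6),
  month = rep(1:6, times = 2),
  cases = c(100, NA, 120, NA, NA, 150,
            10, NA, NA, NA, 18, NA)
)

# Apply the interpolation separately within each facility
df$int_cases <- ave(df$cases, df$hf, FUN = interp_short_gaps)
print(df)
```

In this example the gaps of one and two months in HF A are filled, while the three-month gap and the trailing blank in HF B remain missing.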
[Figure: Linear Interpolation; joining known values with linear segments]
Initial missing value was replaced with 0
Initial missing value was replaced with 0
‘anomalize’ package
Linearly interpolated
‘anomalize’ package
Seasonality in Interpolation
[Figure panels: Un-imputed data; Linearly interpolated data w/o seasonality; Linearly interpolated data w/ seasonality]
Univariate Time Series Interpolation
• Take seasonality into account
• na.interp from the ‘forecast’ package in R
  • By default, uses linear interpolation for non-seasonal series. For seasonal series, a robust STL decomposition is first computed. Then a linear interpolation is applied to the seasonally adjusted data, and the seasonal component is added back.
• na.StructTS from the ‘zoo’ package in R
  • Interpolate with seasonal Kalman filter
• These two functions are similar in that both can ‘handle’ seasonality in the time series
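Both functions can be called directly on a monthly `ts` object. The sketch below applies them to a simulated series with a few values blanked out; the series, the gap positions, and the variable names are assumptions for illustration, and the ‘forecast’ and ‘zoo’ packages must be installed.

```r
library(forecast)  # provides na.interp()
library(zoo)       # provides na.StructTS()

set.seed(42)

# Hypothetical 48-month series with trend and 12-month seasonality
months   <- 1:48
cases    <- 200 + 2 * months + 40 * sin(2 * pi * months / 12) +
  rnorm(48, sd = 10)
cases_ts <- ts(cases, start = c(2017, 1), frequency = 12)

# Blank out a few months to mimic unreported values
gaps     <- c(7, 8, 20, 33)
cases_na <- cases_ts
cases_na[gaps] <- NA

filled_stl    <- na.interp(cases_na)    # STL-based for seasonal series
filled_kalman <- na.StructTS(cases_na)  # structural model + Kalman smoothing

# Compare the filled values with the values that were removed
comparison <- cbind(truth       = cases_ts[gaps],
                    na.interp   = filled_stl[gaps],
                    na.StructTS = filled_kalman[gaps])
print(comparison)
```

The series here is deliberately longer than 24 months: by its documented default, `na.interp` falls back to plain linear interpolation when a series has no more than two full seasonal periods of non-missing observations.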
Seasonality Adjusted Time Series
Let’s reset and apply some of these steps
Missingness Visualized – Unprocessed Data
[Figure: Missing (28.6%), Present (71.4%); ‘visdat’ package]
284 HFs with no missing data
Missingness Visualized – Removed New/Defunct HFs
[Figure: Missing (13.8%), Present (86.2%); ‘visdat’ package]
Missingness Visualized – Linear Interpolation (gaps ≤ 2)
[Figure: Missing (6.7%), Present (93.3%); ‘visdat’ package]
807 HFs with no missing data
Time Series Trends
New/defunct HFs and outliers have been removed from all time series
Recreate the “Truth”
A Quick Example
• Use a data set containing only complete time series records
  • 2.5% of data are zero values (primarily limited to smaller facilities)
• Introduce random missingness
  • Randomly delete 15% of data points
  • Delete 90% of remaining zero values
  • Include runs of more than 2 missing values
• Apply various imputation methods and compare against the “truth”
  • Replace all blanks with zeros
  • Linear interpolation on gaps ≤ 2
  • Use the two identified interpolation strategies that consider seasonality
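The deletion steps can be sketched in base R on simulated counts. Everything in this sketch is an assumption made for illustration: the 50 facilities, the Poisson rates, and the variable names are hypothetical, and runs of more than 2 missing values arise only by chance here, not by design.

```r
set.seed(2020)

# Hypothetical complete data: 50 facilities (rows) x 24 months (columns);
# half are small facilities where zero counts occur, half are large
n_hf     <- 50
n_months <- 24
lambda   <- rep(c(1.5, 150), each = n_hf / 2)
truth    <- matrix(rpois(n_hf * n_months, lambda), nrow = n_hf)

# Step 1: randomly delete 15% of all data points
masked      <- truth
drop_random <- sample(length(masked), size = round(0.15 * length(masked)))
masked[drop_random] <- NA

# Step 2: delete 90% of the zero values that remain
zero_idx  <- which(masked == 0)  # which() skips cells that are already NA
drop_zero <- zero_idx[sample.int(length(zero_idx),
                                 size = round(0.90 * length(zero_idx)))]
masked[drop_zero] <- NA

# One candidate strategy: replace all blanks with zeros
is_deleted  <- is.na(masked)
zero_filled <- replace(masked, is_deleted, 0)

# Average error on the deleted cells, compared against the "truth"
bias_zero_fill <- mean(zero_filled[is_deleted] - truth[is_deleted])
print(bias_zero_fill)
```

The same comparison can then be repeated for each interpolation method by swapping the zero-fill step for that method applied row by row.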
Time Series Trends
Anomalous data points have been removed
na.StructTS
na.interp
na.StructTS: average raw bias = -1.18
na.interp: average raw bias = -0.03
na.StructTS: MAPE = 119.03
na.interp: MAPE = 117.41
The RMSE difference is positive for 1,847 HFs, indicating that the ‘na.StructTS’ approach had a lower RMSE for 68% of HFs
[Figure legend: ‘na.StructTS’ approach has lower RMSE; ‘na.interp’ approach has lower RMSE]
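The three diagnostics can be written as short R functions. The slides do not give their exact formulas, so the definitions below are assumptions: raw bias is taken as the mean of (imputed − true), MAPE skips cells whose true value is zero (where it is undefined), and the RMSE difference is taken as the second method minus the first, so that a positive value favours the first method.

```r
# Mean signed error: negative values mean the method under-estimates
raw_bias <- function(truth, est) mean(est - truth, na.rm = TRUE)

# Root mean squared error
rmse <- function(truth, est) sqrt(mean((est - truth)^2, na.rm = TRUE))

# Mean absolute percentage error, skipping cells where truth is zero
mape <- function(truth, est) {
  ok <- !is.na(truth) & !is.na(est) & truth != 0
  100 * mean(abs((est[ok] - truth[ok]) / truth[ok]))
}

# Hypothetical true values and two sets of imputed values
truth <- c(10, 20, 40, 50)
est_a <- c(12, 18, 40, 45)
est_b <- c(10, 25, 30, 50)

bias_a <- raw_bias(truth, est_a)
mape_a <- mape(truth, est_a)

# Positive difference: method A has the lower RMSE for this facility
rmse_diff <- rmse(truth, est_b) - rmse(truth, est_a)
```

Percentage errors become very large when true counts are small, which is worth keeping in mind when reading MAPE values for low case load facilities.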
Recap
• Assess missingness
• Address egregious outliers
• Manage new/defunct facility records
• Decompose the time series
• Try a few different interpolation techniques and plot results
• Isolate a subset of records with no missing data
• Introduce missing data and then recreate the “truth”
This presentation was produced with the support of the United States Agency for International
Development (USAID) under the terms of the Data for Impact (D4I) associate award
7200AA18LA00008, which is implemented by the Carolina Population Center at the University of
North Carolina at Chapel Hill, in partnership with Palladium International, LLC; ICF Macro, Inc.;
John Snow, Inc.; and Tulane University. The views expressed in this publication do not
necessarily reflect the views of USAID or the United States government.
www.data4impactproject.org
Imputation
• DHIS2 time series do not always lend themselves well to multiple imputation
  • Multiple imputation is a preferable choice when there are variables predictive of missingness that could be included in the imputation model
  • With DHIS2 data, it can be difficult to locate other time-dependent variables to aid in the imputation process
• DHIS2 time series may exhibit a missing not at random (MNAR) missingness structure
  • Earlier time points have more missing data
  • Zero values are more likely to be missing
Why Use DHIS2 Data?
• Advantages of using DHIS2 data
  • Access to a wide breadth of data elements/services
  • Analyze at various levels of the health system
    • National, regional, district, health facility
  • Data are generally collected via standardized reporting tools
  • Data tend to be reported at regular intervals, allowing for frequent updates to analyses
• However, not all data elements are well-reported, and it is typically necessary to process/clean DHIS2 data
