Working With Large-Scale 
Clinical Datasets 
Craig Smail, MA, MSc ( @craigsmail) 
KU Medical Center 
9th October 2014 
Background: http://jsgamingtv.com/wp-content/uploads/2014/07/server-room-hd-free-23325111.jpg
Disclosures 
• Industry grant funding: 
– Merck 
– Mallinckrodt 
– Sanofi
Overview 
• Target audience: anyone involved (directly 
or indirectly) in clinical data extraction, 
validation, and standardization 
• Sections: 
1. Data extraction: planning 
2. Data extraction 
3. Data standardization 
4. Data transfer
Data Extraction: Planning 
• Dataset type 
– Most common: limited and de-identified 
– Difference: limited can contain some personal 
information (DOB, DOD, city, state, age) 
• Legal agreements 
– Data Use Agreement (DUA) 
– Business Associate Agreement (BAA) 
– Institutional Review Board (IRB) 
• Usually only if IRB considers activity Human Subjects 
Research
Data Extraction: Planning 
• Important to finalize list of data elements 
before pull 
– Time-consuming to repull 
– Reallocation of resources (e.g. programmer time) 
• Summary statistics are helpful in planning 
stage 
• e.g. death status is requested often, but is very rarely 
available in the EHR
Data Extraction: Planning 
• Use of data proxy correlated with data 
element of interest 
– sometimes need to develop proxies for data 
points of interest (e.g. severity of pain; 
hypoglycemic events) 
– Example use case: aspirin as a proxy for 
antiphospholipid antibodies lab [1] 
• Proxy data elements should be supported by 
data 
[1] Frankovich, J., Longhurst, C., Sutherland, S. Evidence-Based Medicine in the EMR Era, N Engl J Med 
2011; 365:1758-1759
Example: Proxy for Death Status 
• Data extracted from large multi-specialty 
clinic on the east coast 
• 300,000 patients in EHR 
• ~10,000 with date-of-death (we’ll take this as 
gold-standard) 
• Is days since last encounter a good proxy?
Example: Proxy for Death Status 
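# glm() ships with base R (stats package), so no extra library is needed 
# import data: no header row; column 1 = date of last encounter (mm/dd/yyyy), 
# column 2 = death status, presumably coded 1 = dead, 0 = otherwise 
# (layout inferred from how the columns are used below) 
setwd("[dir here]")  # placeholder: replace with the data directory 
Encs = read.csv("lastenc.csv", header = FALSE) 
# find days since last encounter (vectorized, so no loop is needed) 
Encs[, 3] = as.numeric(as.Date("2014-09-02") - as.Date(Encs[, 1], "%m/%d/%Y")) 
# binarize (no encounter in last 1000 days = 1, <= 1000 = 0 - also tried 180, 365, 750) 
Encs[, 4] = ifelse(Encs[, 3] > 1000, 1, 0) 
# clean up table: keep death status and the binarized gap 
Encs = Encs[, c(2, 4)] 
names(Encs) = c("dead", "longGap") 
# fit model (logistic regression - but could use something else) 
fit = glm(dead ~ longGap, data = Encs, family = "binomial") 
# fix both factor levels so the table stays 2 x 2 even if one class is never predicted 
predicted = factor(round(fit$fitted.values), levels = c(0, 1)) 
confusionMatrix = table(predicted, factor(Encs$dead, levels = c(0, 1))) 
misclassRate = (confusionMatrix[1, 2] + confusionMatrix[2, 1]) / sum(confusionMatrix)  # 0.34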
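The slide reports a single error rate at one cut-off and mentions that other cut-offs were tried. A minimal sketch of how such a sweep could look is given below. It runs on synthetic data (the real extract is not available), skips the logistic regression and classifies directly on the binarized gap, and the names `evaluateCutoff`, `daysSinceEnc`, and `dead` are hypothetical. It also reports sensitivity and specificity, since an overall error rate can hide poor performance on a rare outcome such as death.

```r
# Synthetic stand-in for the encounter table; every value here is made up.
set.seed(42)
n = 1000
dead = rbinom(n, 1, 0.1)
# Illustrative assumption only: deceased patients tend to have longer gaps
daysSinceEnc = ifelse(dead == 1, rexp(n, 1 / 900), rexp(n, 1 / 400))

# Error rate, sensitivity, and specificity of "gap > cutoff" used as a death flag
evaluateCutoff = function(cutoff, days, dead) {
  flagged = as.integer(days > cutoff)
  c(cutoff = cutoff,
    misclass = mean(flagged != dead),
    sensitivity = sum(flagged == 1 & dead == 1) / sum(dead == 1),
    specificity = sum(flagged == 0 & dead == 0) / sum(dead == 0))
}

# One row per cut-off, one column per metric
cutoffs = c(180, 365, 750, 1000)
results = t(sapply(cutoffs, evaluateCutoff, days = daysSinceEnc, dead = dead))
print(round(results, 3))
```

Raising the cut-off can only shrink the set of flagged patients, so sensitivity never increases and specificity never decreases as the cut-off grows; the table makes that trade-off visible.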
Example: Proxy for Death Status 
• Is days since last encounter a good proxy? 
No (error rate = 34%) 
• Consequences:
Data Extraction: Planning 
• Cohort definition 
– Spell out cohort definitions explicitly, including all 
assumptions 
– Real-world example: 
• ‘Two consecutive eGFRs >= 15 and < 60 occurring at least 90 
days apart’ 
• Further restriction specified: ‘if any value > 60 in between 90 
days, then throw out’ 
• But the word ‘consecutive’ means no values within the 90-day window 
will be considered at all 
– If any other eGFR value occurs within the 90 days, then the two values 
are not consecutive and the patient does not meet the first restriction
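The ambiguity in this cohort definition is easier to see in code. The sketch below implements both readings for a single patient: strictly consecutive measurements, and any two in-range values with an exclusion for intermediate values above 60. The function names and the example values are hypothetical, and this is only one plausible way to encode each reading.

```r
# Reading 1: two measurements adjacent in time, both in [15, 60), >= 90 days apart
meetsConsecutive = function(dates, egfr) {
  ord = order(dates)
  dates = dates[ord]
  egfr = egfr[ord]
  n = length(egfr)
  if (n < 2) return(FALSE)
  inRange = egfr >= 15 & egfr < 60
  gapDays = as.numeric(diff(dates))
  any(inRange[-n] & inRange[-1] & gapDays >= 90)
}

# Reading 2: any two in-range values >= 90 days apart,
# rejected only if some value > 60 lies between them
meetsWithExclusion = function(dates, egfr) {
  ord = order(dates)
  dates = dates[ord]
  egfr = egfr[ord]
  n = length(egfr)
  if (n < 2) return(FALSE)
  inRange = egfr >= 15 & egfr < 60
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      farEnough = as.numeric(dates[j] - dates[i]) >= 90
      between = if (j - i > 1) egfr[(i + 1):(j - 1)] else numeric(0)
      if (inRange[i] && inRange[j] && farEnough && !any(between > 60)) return(TRUE)
    }
  }
  FALSE
}

# Made-up patient: three in-range values, adjacent gaps of 45 and 75 days
dates = as.Date(c("2014-01-01", "2014-02-15", "2014-05-01"))
egfr = c(45, 50, 52)
meetsConsecutive(dates, egfr)    # FALSE: no adjacent pair is 90+ days apart
meetsWithExclusion(dates, egfr)  # TRUE: first and last are 120 days apart
```

The same patient is in the cohort under one reading and out under the other, which is why the definition needs to be spelled out before the pull.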
Data Extraction: Planning 
• Final thought on planning: 
“Not everything that counts can be counted and 
not everything that can be counted counts.” 
—Albert Einstein (or William Bruce Cameron, 
depends who you believe) 
• some data elements are well populated, but 
reflect things like coding bias (e.g. ‘up-coding’ 
to a code with larger reimbursement)
Data Extraction 
• What are data extractions being used for in the 
NRN? 
– Pharmaceutical companies: data on 143,057 
patients from 8 health-care organizations/health 
care systems 
– Federally-funded research (NIH, AHRQ): data on 
~100,000 patients 
– Health IT vendors: work with Cerner to produce 
performance reports for use by participating 
providers 
• Clinicians like performance feedback; if your EHR cannot 
provide it, they will go elsewhere (i.e. switch to another 
vendor)
Data Extraction 
• Longitudinal data is important 
– look at trends over time in the same 
patient 
– during EHR transitions, some EHR vendors will 
import all data, but restrict full access to only 
the last 18/24/26 months. Clinicians don’t like this; 
they want to be able to access all data
Data Validation 
• Date parameters: check the min and max encounter dates in the 
dataset; with 1000s of patients in the dataset, would expect these 
to match the requested date range 
– Percentage of distinct patients in extraction vs. overall practice count: 
cohort percentages are quite stable across practices 
» e.g. ‘all patients over age 18 with a diagnosis of type-2 diabetes 
defined by ICD-9 code xx.xxx’ 
– Caveat: doesn’t work well with small practices (< 2,000 distinct 
patients)
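Both validation checks above can be scripted. The following is a minimal sketch on a made-up extract; the data frame layout, column names, extraction window, and practice totals are all assumptions for illustration.

```r
# Hypothetical extract: one row per encounter (column names are illustrative)
encounters = data.frame(
  practice = c("A", "A", "A", "B", "B", "C"),
  patientId = c(1, 2, 2, 3, 4, 5),
  encDate = as.Date(c("2012-01-10", "2012-06-30", "2013-12-20",
                      "2012-02-01", "2015-03-01", "2013-07-04"))
)
# Requested extraction window and distinct-patient totals per practice (made up)
windowStart = as.Date("2012-01-01")
windowEnd = as.Date("2013-12-31")
practiceTotals = c(A = 20, B = 25, C = 10)

# Check 1: do the min and max encounter dates fall inside the requested window?
dateRange = range(encounters$encDate)
outOfWindow = encounters[encounters$encDate < windowStart |
                         encounters$encDate > windowEnd, ]

# Check 2: cohort patients as a percentage of each practice's distinct patients
cohortCounts = tapply(encounters$patientId, encounters$practice,
                      function(x) length(unique(x)))
cohortPct = 100 * cohortCounts / practiceTotals[names(cohortCounts)]

print(dateRange)
print(outOfWindow)
print(cohortPct)
```

A practice whose percentage sits far from the others is a candidate for a closer look, keeping in mind the caveat that small practices give unstable percentages.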
Data Standardization 
• Open-source models (Observational Medical Outcomes 
Partnership) 
• Script data out of database (e.g. SQL view) 
• Map labs/procedures to standardized concept list 
– Why? different string labels referring to creatinine blood test from 
three data feeds, with frequency of occurrence…
Note: source values with counts < 100 were censored
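A minimal sketch of the mapping step follows. The source labels, the mapping table, and the concept name `creatinine_blood` are invented for illustration and are not the values from the talk; in practice the mapping would come from whichever standard vocabulary is chosen.

```r
# Hypothetical source labels from three feeds (invented for illustration)
labs = data.frame(
  feed = c("feed1", "feed1", "feed2", "feed3", "feed3"),
  sourceLabel = c("Creatinine", "CREAT", "Creatinine, Serum",
                  "creatinine", "Hemoglobin A1c"),
  stringsAsFactors = FALSE
)
# Hand-built mapping from normalized label to a single standardized concept
conceptMap = data.frame(
  sourceLabel = c("creatinine", "creat", "creatinine, serum"),
  concept = "creatinine_blood",
  stringsAsFactors = FALSE
)

# Normalize case and whitespace before joining; all.x keeps unmapped rows
labs$key = tolower(trimws(labs$sourceLabel))
mapped = merge(labs, conceptMap, by.x = "key", by.y = "sourceLabel", all.x = TRUE)

# Rows without a concept go to manual review rather than being dropped silently
unmapped = mapped[is.na(mapped$concept), ]
print(table(mapped$concept, useNA = "ifany"))
print(unmapped$sourceLabel)
```

Keeping the unmapped rows visible matters because a label that fails to map simply disappears from any analysis that filters on the standardized concept.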
Data Transfer 
• HIPAA requirements 
• Usually FTP to secure site (e.g. Egnyte) 
Ref: http://www.hhs.gov/ocr/privacy/hipaa/enforcement/examples/
Concluding Thoughts 
• Extracted data is treated as gold-standard, since it is assumed to be 
pulled directly from the data source (i.e. EHR). In practice, data often 
comes from an intermediate product (such as a registry product, like the 
one DARTNet provides), and we usually don’t have control 
over the data mapping from EHR to registry 
• The EHR of the future (?): 
– Genetic data (WGS or WES) 
» WGS = ~100 GB 
» WES = ~8 GB 
– Integration with consumer wearable devices (e.g. FitBit; iPhone ECG) 
– Further down the road: human microbiome; home microbiome
Always question 
the data 
Pic ref: http://www.yoyowall.com/wp-content/uploads/2013/07/Gandalf-The-Grey-The-Lord-Of-The-Rings.jpg
Questions? 
• Slides available from slideshare 
(URL @craigsmail) 
• Email: csmail@aafp.org



Editor's Notes

  • #6: Repulls waste everyone’s time
  • #7: Used aspirin as a proxy for antiphospholipid antibodies lab (due to practice of prescribing aspirin in these patients at site) in treating a 13-year-old girl with systemic lupus erythematosus (SLE)
  • #8: Audience participation: ask what other factors might explain a gap in encounters (e.g. moved out-of-town, changed provider)
  • #9: Only about a dozen lines of code. Binarized time since last encounter (tried 180, 365, 750)
  • #10: CKD study: ~100,000 in dataset; if the same ratio holds (3% of individuals in EHR are dead), that gives 3,000 names for NDI. Cost: $350 + ($0.15 * 3,000 * 10) = $4,850. So you want to make sure the cohort you send to NDI is right!
  • #13: What are data extractions being used for in the NRN? Pharmaceutical companies: type-2 diabetes study looking at drug prescribing habits of primary-care physicians for patients with type-2 diabetes (data on 143,057 patients from 8 health-care organizations/health care systems). Federally-funded research (NIH, AHRQ): decision support for chronic kidney disease, working with National Kidney Foundation (data on ~100,000 patients). Health IT vendors: we work with Cerner to produce performance reports for use by participating providers, used to compare performance on several metrics (e.g. blood pressure targets; accuracy of ICD-9 coding). Clinicians like performance feedback; if your EHR cannot provide it they will go elsewhere (i.e. switch).
  • #20: Overall take-away
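The cost estimate in note #10 can be written out as a small calculation. This is a sketch that uses only the figures quoted in the slides and notes. The function name `ndiCost` is hypothetical, and the note does not say what the factor of 10 stands for; treating it as the number of years searched is an assumption.

```r
# Fee structure as quoted in note #10: flat fee plus a per-record charge.
# Assumption: the factor 10 is the number of years searched per record.
ndiCost = function(nRecords, nYears, baseFee = 350, perRecordYear = 0.15) {
  baseFee + perRecordYear * nRecords * nYears
}

# Figures from the death-status example slide
ehrPatients = 300000
knownDead = 10000
deathRatio = knownDead / ehrPatients   # about 0.033, rounded to 3% in the note

# CKD study: ~100,000 patients at the rounded 3% ratio
expectedNames = 0.03 * 100000          # 3,000 names
print(ndiCost(expectedNames, 10))      # 350 + 4,500 = 4,850
```

Because the per-record term scales with cohort size, sending a cohort that is larger than necessary raises the bill proportionally.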