Nationally Representative and
Other Large Datasets
Jessica Moore
Analysis and Data Management Activity (ADMA)
03/22/2012
Background
• One of our goals is to estimate the burden of
viral diseases among the US population by
describing their association with
– Hospitalizations
– ED visits
– Ambulatory/Outpatient visits
– Deaths
• Consequently, we need datasets for deriving
robust, nationally representative estimates
Background
Focus will be on datasets produced by two organizations
within the Department of Health and Human Services
• National Center for Health Statistics (NCHS)
– Mission is to provide statistical information that
will guide actions and policies to improve the
health of the American people.
• Agency for Healthcare Research and Quality
(AHRQ), pronounced “ark”
– Mission is to improve the quality, safety,
efficiency, and effectiveness of health care for all
Americans.
NCHS - Surveys
NCHS – Surveys
National Hospital Discharge Survey (NHDS)
http://www.cdc.gov/nchs/nhds/nhds_questionnaires.htm
• Designed to meet the need for nationally
representative information on inpatients discharged
from nonfederal, short-stay hospitals in the United
States
– Federal, military, and VA Hospitals, as well as hospital units
of institutions (such as prison hospitals), and hospitals with
fewer than six beds staffed for patient use, are excluded.
• Due to funding limitations, the sample of hospitals was reduced
by half starting in 2008.
• In 2009, approximately 160,000 inpatient records were
obtained from 205 responding hospitals.
NCHS – Surveys
National Hospital Discharge Survey (NHDS)
• Variables include discharge diagnoses, procedures,
length of stay, payer, outcome, gender, race and
ethnicity (race missing on 19% of records)
• A discharge weight (not the patient’s weight!) is
provided for each record; summing the weights of the
records for a particular condition yields a national estimate
of the number of hospitalizations associated with it
• Public use data are available for download via ftp
(1996-2009)
– Epi share drive (cdcprojectNCIRD_DVD_EB_DATA_1nchspublic) has
SAS datasets from 1996-2009
– Data collection for NHDS began in 1965 (sampling design changed in
1988). CDs available from NCHS for data 1970-1978 and 1979-2007
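A minimal sketch of how the discharge weights turn into a national estimate. The records, code strings, and weights below are made up for illustration and are not NHDS data; the function name is hypothetical.

```python
# Hypothetical records: (first-listed diagnosis code, discharge weight).
# Codes and weights are illustrative placeholders, not NHDS values.
records = [
    ("00861", 110.0),
    ("4800", 95.5),
    ("00861", 130.0),
    ("486", 87.0),
]


def national_estimate(records, code_prefix):
    """Sum the discharge weights of records whose code starts with code_prefix."""
    return sum(weight for code, weight in records if code.startswith(code_prefix))


# Estimated national number of discharges for the condition of interest.
estimate = national_estimate(records, "0086")
print(estimate)
```

Each sampled record stands in for "weight" discharges nationally, so the estimate is the sum of the weights, not the count of records.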
NCHS – Surveys
National Hospital Discharge Survey (NHDS)
• To conduct statistical tests, one must use the
survey design variables (weight) to generate
standard errors (SEs) for estimates
– Calculating SEs requires SUDAAN
– For access to the survey design variables, one has
to prepare a project proposal for NCHS review
and, w/approval, utilize the Research Data Center
– Summary output only; no record-level data
NCHS – Surveys
National Ambulatory Medical Care Survey (NAMCS) & National
Hospital Ambulatory Medical Care Survey (NHAMCS)
http://www.cdc.gov/nchs/ahcd.htm
• Together – AKA “NAMCeeS/HAMCeeS”
• The NAMCS is based on a sample of visits to non-
federally employed office-based physicians who are
primarily engaged in direct patient care. Physicians in
the specialties of anesthesiology, pathology, and
radiology are excluded
– National estimates of doctor’s office visits
• The NHAMCS is based on a national sample of visits
to the emergency departments and outpatient
departments of non-institutional general and short-
stay hospitals.
– National estimates of hospital outpatient department and
ED visits
NCHS – Surveys
NAMCS & NHAMCS
• Demographic information, reason for visit, procedure
codes, prescription information, pre-existing chronic
conditions, payer, length of visit.
• By using the NAMCS and NHAMCS one can generate
national estimates of outpatient and ED visits.
• Public use files available for download
– The public use files for 1993-2009 contain sample design
variables in masked form so you don’t need any special
files/agreements
• Still need SUDAAN for calculating SEs.
• A combined SAS dataset from 1995-2009 is available in the
Epi share drive for all 3 datasets
cdcprojectNCIRD_DVD_EB_DATA_1nchspublic
• Additional data, documentation and import code found on the NCHS site.
NCHS – Surveys
National Immunization Survey (NIS)
http://www.cdc.gov/nchs/nis.htm
• Conducted jointly by NCIRD and NCHS
• List-assisted random-digit-dialing telephone survey,
with mailed survey to children’s immunization
providers
• Target population – children 19-35 months
• Rates of being up-to-date with respect to ACIP
recommended number of doses
• DTaP/DTP/DT, polio, MMR, Hib, HepB, varicella,
pneumococcal, HepA, influenza, rotavirus
• SAS datasets in EB share drive available from 2010
• 2010 dataset has records for approximately 24,000
children, with 17,000 having adequate provider data
NCHS – Surveys
National Immunization Survey – Teen (NIS-Teen)
• Same sampling method as the NIS
• Public datasets and SAS datasets in EB share drive available
from 2008-2010
• Assesses vaccination coverage of teens aged 13-17 years
• 2010 dataset includes records for approximately 33,000
teens, with 20,000 having adequate provider data
• Tables of vaccination coverage for NIS and NIS-Teen
available at http://www.cdc.gov/vaccines/stats-surv/imz-coverage.htm#nisteen
• Public use file does not include any provider-level weights.
Providers are included in the survey through the children
they vaccinate. Do not attempt provider level analysis!
NCHS – Surveys
General comments from past and present experiences
• NHDS is known as the ‘gold standard’ for national hospitalization
estimates. Public use files are likely sufficient for some activities,
negating the need for design variables.
– Hospitalization estimates of less common conditions can be
problematic: estimates with a relative SE >30% or based on <30
unweighted records are considered unreliable, and estimates based
on 30-59 records are termed ‘low reliability’.
– Stringent control of design variables requires an RDC proposal
• NAMCS and NHAMCS have the same limitations for less common
conditions
– Bonus is that design variables are included in public use files
• NIS/NIS-Teen – do not attempt to calculate provider-level statistics!
• For all surveys, the size of the annual files is relatively small
compared to other datasets
– Requires less disk space and processing time
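The reliability rules above can be sketched as a small check. The thresholds come from the slide; the function name and the label "reliable" for everything else are assumptions made for illustration.

```python
def reliability(unweighted_n, estimate, se):
    """Classify an estimate using the thresholds quoted on the slide.

    unweighted_n: number of sampled records behind the estimate
    estimate, se: weighted estimate and its standard error
    """
    # Relative standard error, as a percentage of the estimate.
    rse = se / estimate * 100 if estimate else float("inf")
    if unweighted_n < 30 or rse > 30:
        return "unreliable"
    if unweighted_n < 60:  # 30-59 unweighted records
        return "low reliability"
    return "reliable"  # assumed label; the slide does not name this case


print(reliability(45, 1000.0, 100.0))
```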
NCHS – Vital Statistics
NCHS – Vital Statistics
Mortality Multiple Cause of Death
• Annual files capture approximately 100% of US resident
deaths (i.e. it is not a sample so no need to worry about
record weights).
– Underlying cause of death provided
– Up to 20 ICD-9 codes (ICD-10 beginning in 1999) recorded on the death certificate
– Includes demographic information for the deceased
• Public use files available for download
– Geographical information suppressed for counties with populations <100,000 for 1998
through 2004
– ALL geographical information suppressed beginning in 2005
• DVD has agreement with NCHS to obtain county-level
geographic information.
NCHS – Vital Statistics
Linked Birth and Infant Death Datasets
• Information from the death certificate is linked to the information from
the birth certificate for each infant under 1 year of age who dies in the
United States, Puerto Rico, the Virgin Islands, and Guam
• Variables include age, race, and Hispanic origin of the parents, birth
weight, period of gestation, plurality, prenatal care usage, maternal
education, live birth order, marital status, and maternal smoking, linked to
information from the death certificate such as age at death and underlying
and multiple cause of death.
• Useful for identifying risk factors for infant death.
– One can compare characteristics of infant deaths from one cause to infant deaths from
other causes
– Or compare the characteristics of infant deaths to a sample of surviving infants.
NCHS – Vital Statistics
Linked Birth and Infant Death Datasets
• Two designs for the linked data - period data and birth cohort data.
– The numerator for the period linked file consists of all infant deaths occurring
in a given data year linked to their corresponding birth certificates, whether
the birth occurred in that year or the previous year.
– The numerator for the birth cohort linked file consists of the deaths (current
or subsequent year) linked to infants born in a given year.
– In both designs, the denominator is all births occurring in the year.
– Due to the different designs, the period linked files are available sooner than
the cohort linked files.
• Public use files available for download
– Geographical information suppressed for counties with populations<250,000
for 1998 through 2004
– ALL geographical information suppressed beginning in 2005
• DVD has agreement with NCHS to obtain county-level geographical
information.
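A minimal sketch of how the two linked-file designs pick their numerators. The (birth year, death year) pairs and birth counts are made up, and all names are hypothetical.

```python
# Hypothetical linked records, one per infant death: (birth_year, death_year).
linked_deaths = [
    (2003, 2004), (2004, 2004), (2004, 2005),
    (2004, 2005), (2005, 2005), (2005, 2006),
]
# Hypothetical denominators: births occurring in each year.
births_by_year = {2004: 4000, 2005: 4100}


def period_numerator(deaths, year):
    """Period design: deaths occurring in `year`, whatever the birth year."""
    return sum(1 for _, death_year in deaths if death_year == year)


def cohort_numerator(deaths, year):
    """Birth cohort design: deaths among infants born in `year`."""
    return sum(1 for birth_year, _ in deaths if birth_year == year)


def rate_per_1000(numerator, births):
    """Infant mortality rate per 1,000 live births."""
    return numerator / births * 1000


period_2005 = period_numerator(linked_deaths, 2005)
cohort_2005 = cohort_numerator(linked_deaths, 2005)
print(period_2005, cohort_2005, rate_per_1000(period_2005, births_by_year[2005]))
```

The cohort numerator for a birth year needs deaths from the following year as well, which is why the cohort files are released later.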
NCHS – Vital Statistics
Birth Datasets
http://www.cdc.gov/nchs/births.htm
• Captures all registered live births occurring in the US. Birth certificate
information included (like in the linked denominator file).
• Public use files available for download
– Geographical information is suppressed for counties with
populations<100,000 for 1998 through 2004
– ALL geographical information suppressed beginning in 2005
• DVD has agreement with NCHS to obtain county-level geographical
information.
• SAS datasets from 1969-2009 available in EB share drive
• 2 SAS summary datasets from 1990-2009
– Summary births by year, region, state, race, ethnicity and sex
– Summary births by year, region, race, ethnicity and sex
NCHS – Vital Statistics
General comments
• Need to closely follow the documentation as variables were
added/removed/modified over time
– Particularly important for the birth and linked data due to states
adopting the 2003 revision of the birth certificate.
• For all mortality data, need to pay attention to the conversion
from ICD-9 to ICD-10 beginning in 1999.
– Comparability studies found differences in how conditions were coded
between the two coding versions
• Birth data and linked denominator data take a bit of
processing time since each calendar year has approximately 4
million live births.
• Can do basic queries online using wonder.cdc.gov or by using
VitalStats (http://www.cdc.gov/nchs/VitalStats.htm)
AHRQ
AHRQ
Healthcare Cost and Utilization Project “H-CUP”
• HCUP obtains data on inpatient and ED visits
from participating states
– Standardizes the format of data across states
– Creates samples from the states’ 100%
hospitalization and ED visit data.
• Extensive documentation on HCUP and
associated datasets available online
(http://www.ahrq.gov/data/hcup/)
AHRQ
HCUP Datasets
Samples
• Nationwide Inpatient Sample (NIS)
• Kids’ Inpatient Database (KID)
• Nationwide Emergency Department Sample (NEDS)
100% data from source
• State Inpatient Databases (SID)
• State Emergency Department Databases (SEDD)
AHRQ
HCUP - NIS
• The NIS is a very large sample of hospital discharges from
HCUP participating states (44 during 2009)
– Began in 1988 with 8 participating states. We have SAS datasets
available from 1988 to 2009.
– Database design has evolved over time but has remained
relatively constant since 1998.
– The NIS is approximately a 20% sample of US hospitalizations
occurring in community hospitals
– Similar variables to the NHDS
– Provides hospital charges and cost-to-charge ratio files so one
can estimate direct charges/costs
– Participating states account for approximately 95% of all
hospital discharges in the U.S.
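A minimal sketch of estimating costs from charges with a cost-to-charge ratio (cost = charge × ratio), weighted up to a national total. The hospital identifiers, ratios, charges, and weights are made up, and the data layout is an assumption rather than the HCUP file format.

```python
# Hypothetical hospital-level cost-to-charge ratios, keyed by hospital identifier.
ccr_by_hospital = {"H001": 0.25, "H002": 0.5}

# Hypothetical discharges: (hospital identifier, total charge, discharge weight).
discharges = [
    ("H001", 20000.0, 5.0),
    ("H002", 10000.0, 5.0),
    ("H003", 8000.0, 5.0),  # no ratio available for this hospital
]


def estimated_national_cost(discharges, ccr_by_hospital):
    """Return (weighted cost total, number of records skipped for lack of a ratio)."""
    total = 0.0
    skipped = 0
    for hospital, charge, weight in discharges:
        ccr = ccr_by_hospital.get(hospital)
        if ccr is None:
            skipped += 1
            continue
        # Convert the charge to an estimated cost, then weight it.
        total += charge * ccr * weight
    return total, skipped


total_cost, n_skipped = estimated_national_cost(discharges, ccr_by_hospital)
print(total_cost, n_skipped)
```

Records without a ratio are counted rather than silently dropped, so the analyst can decide how to handle them.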
AHRQ
HCUP - NIS
• Since the NIS is such a large sample, analyses of less common
conditions can be conducted.
– Also…since the NIS is such a large sample, processing time to generate
subsets can be burdensome (creating subsets from 5-8 million records
each year).
• Like the NHDS
– One must use the record weight to generate nationally representative
estimates
– For statistical tests, one must generate SEs for the estimates
• Can use SUDAAN, SAS Survey procs, or Stata
• A bonus…design variables are provided with the datasets
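A rough sketch of what the survey software does with the design variables when it produces an SE for a weighted total: a Taylor-series (between-cluster) variance estimate that treats sampled hospitals as clusters drawn with replacement within strata. The tuple layout and names are hypothetical, and this is not a substitute for SUDAAN, the SAS Survey procs, or Stata.

```python
import math
from collections import defaultdict


def weighted_total_and_se(rows):
    """rows: iterable of (stratum, cluster, weight, y), where y is 1 if the
    record has the condition of interest and 0 otherwise.

    Keep ALL records, including those with y == 0, so that every sampled
    cluster contributes to the variance calculation.
    """
    # Weighted total of y within each cluster, grouped by stratum.
    cluster_totals = defaultdict(lambda: defaultdict(float))
    for stratum, cluster, weight, y in rows:
        cluster_totals[stratum][cluster] += weight * y

    total = 0.0
    variance = 0.0
    for clusters in cluster_totals.values():
        totals = list(clusters.values())
        n = len(totals)
        total += sum(totals)
        if n < 2:
            continue  # a single-cluster stratum adds no variance in this sketch
        mean = sum(totals) / n
        variance += n / (n - 1) * sum((t - mean) ** 2 for t in totals)
    return total, math.sqrt(variance)


# Made-up example: two strata, two hospitals each.
rows = [
    ("A", 1, 10.0, 1), ("A", 1, 10.0, 0), ("A", 2, 20.0, 1),
    ("B", 3, 5.0, 1), ("B", 3, 5.0, 1), ("B", 4, 5.0, 0),
]
total, se = weighted_total_and_se(rows)
print(total, se)
```

Ignoring the strata and clusters and treating the records as a simple random sample would generally understate the SE.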
AHRQ
HCUP - KID
• The KID follows the same format as the NIS but focuses
exclusively on pediatric discharges (patients aged ≤ 20 years).
– Very powerful for looking at less common conditions occurring among
children that lead to hospitalization.
• Produced every 3 years (1997, 2000, 2003, 2006, 2009)
• Same weighting and SE calculation procedures as with the NIS
• Cost-to-charge files available for 2003, 2006, 2009
AHRQ
HCUP - NEDS
• The NEDS yields national estimates of emergency
department (ED) visits.
• The NEDS is a brand new dataset with the 1st release
being for calendar year 2006
– ~26 million and ~27 million unweighted records for 2006 &
2007, respectively.
• Similar variables to the NIS and also includes ED charges
• SAS datasets available in EB share drive for 2006-2009
• Race is suppressed
AHRQ
HCUP – State Datasets (100% Data)
• State Inpatient Databases 1988-2010* (SID)
– 100% hospitalization data from participating states in HCUP. Currently, we do not have
direct access to these data but have current/past collaborations with HCUP investigators
to conduct analyses using these data.
– The gold standard for assessing rare conditions leading to hospitalization. Can calculate
state-specific rates and indirectly estimate nationally representative rates.
– Increasing number of states available over time.
– These are very large datasets
• State Emergency Department Databases 1999-2010* (SEDD)
– Currently, we do not have direct access to these data but have current/past
collaborations with HCUP investigators to conduct analyses using these data.
– Captures the discharge information on all emergency department visits that do not
result in a hospital admission
– These are very, very, very large datasets
*2010 already available for some states.
AHRQ
HCUP – General Comments
• Extensive online documentation available for the datasets.
http://hcup-us.ahrq.gov/
• Access to sample data requires a brief online course and a signed
DUA.
– It’s easy.
• Processing time can be a bit daunting.
– Workarounds include remote submitting to branch server, CSP (pending),
or using Citgo SAS (recommended)
• Online tool (HCUPnet) is a good start for assessing the numbers
that might be available for a study of interest.
– http://hcupnet.ahrq.gov/
• Specifically for the NIS, changes in dataset format over time can
make assessing trends challenging. HCUP has released trend files
to facilitate these types of analyses.
• AHRQ datasets include 15 possible diagnosis codes per record
MEDSTAT
• MarketScan data warehouse
• Contains individual-level healthcare claims, health risk assessments,
absence, short-term disability, workers’ compensation, and hospital
discharge info from large employers, managed care organizations,
hospitals, Medicare and Medicaid programs
• We have access to data from 1993 through 2010. Pre-2004 is in
Citgo. Some of post-2004 data is located in CDC network land and
some is located in CSP (consolidated statistical platform).
• Very large datasets with extensive admission and discharge data,
including cost and charge information, i.e. deductibles, coinsurance,
other payments, etc.
• Once you pull data from the various sources and locations, it’s a
powerful dataset for estimating the burden of various diseases.
What about rates?
Numerators are nice but what about denominators?
• Bridged-race census data available for 1990-2009
– cdcprojectNCIRD_DVD_EB_DATA_1census
– Goal of bridged race data is to make population estimates by
race/ethnicity comparable between 1990-1999 and 2000 onward.
• Longer-term project is to also create intercensal population
estimates datasets.
• Remember that for the sample datasets, you have to factor in
the SEs (i.e. compute relative standard errors) for your rates to
do comparisons or to calculate 95% confidence intervals
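A minimal sketch of turning a weighted estimate and its SE into a rate with a 95% confidence interval. It assumes the census denominator is a fixed number with no sampling error, and all input values are made up.

```python
def rate_with_ci(estimate, se, population, per=100_000, z=1.96):
    """Return (rate, lower, upper, relative SE in percent).

    estimate, se: weighted numerator and its standard error
    population:   census denominator, treated here as a known constant
    """
    rate = estimate / population * per
    rate_se = se / population * per  # scaling by a constant scales the SE too
    rse = se / estimate * 100
    return rate, rate - z * rate_se, rate + z * rate_se, rse


# Made-up inputs: 50,000 estimated visits (SE 5,000) in a population of 10 million.
rate, lower, upper, rse = rate_with_ci(50_000, 5_000, 10_000_000)
print(round(rate, 1), round(lower, 1), round(upper, 1), round(rse, 1))
```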
Data Processing
• Many of the national datasets are very large, e.g. NEDS has approx. 27
million unweighted records per year!
• While these are powerful datasets for estimating burden of disease,
especially for rare conditions, the size of the datasets makes them
cumbersome to analyze.
• We suggest, if working with these data in SAS, to pull the data from
the share drive and run on SAS in Citgo. We tested various methods
and this had the shortest run time, as well as least cpu usage.
• Since the national sample datasets come from complex survey designs, you
need design-aware software such as SUDAAN to calculate weighted estimates
and their SEs. SAS-callable SUDAAN is available in Citgo.
• Other suggestions for larger datasets:
– SGIO=yes
– Compress=binary
Limitations
• Hospitalizations, ED, outpatient visits
– Have to rely on ICD-9-CM codes to identify cause
– Primary versus secondary diagnosis versus any-listed
• Was the condition of interest the ‘cause’ of the healthcare visit?
• Mortality
– Rely on ICD-9/ICD-10 codes
• Conversion to ICD-10 does result in changes in national estimates for a lot
of conditions
• Can be difficult/impossible to ascertain comorbid conditions
• Survey datasets are encounter based rather than case based
– Multiple visits by the same individual cannot be determined
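A minimal sketch of the primary versus any-listed distinction. It assumes each record carries an ordered list of diagnosis code strings with the first-listed diagnosis at position 0; the codes, prefixes, and names are illustrative only.

```python
def matches(code, prefixes):
    """True if the code starts with any of the prefixes of interest."""
    return any(code.startswith(prefix) for prefix in prefixes)


def classify(dx_codes, prefixes):
    """Label a record by where the condition of interest appears."""
    if not dx_codes:
        return "none"
    if matches(dx_codes[0], prefixes):
        return "first-listed"
    if any(matches(code, prefixes) for code in dx_codes[1:]):
        return "secondary only"
    return "none"


# Made-up records: ordered diagnosis codes per encounter.
prefixes = ("0086",)
encounters = [
    ["00861", "2765"],
    ["2765", "00861"],
    ["486"],
]
labels = [classify(codes, prefixes) for codes in encounters]
print(labels)
```

Counting "first-listed" only gives a conservative estimate; adding "secondary only" gives the any-listed estimate, which may include visits where the condition was not the reason for care.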
Limitations
• Public-use data good for general picture
– Suppressed information (especially since 2005) is problematic.
• Access to suppressed elements or design information requires
annual agreements with NCHS.
– Requires RDC proposal for access to design information.
• Working with the data can be challenging, especially the birth,
linked, and HCUP sample datasets
• Caveats on producing SEs for the sample datasets
– Requires knowledge of SUDAAN or the SAS Survey procs, and how does one
know they got it right?
• Check published reports for NHDS, or HCUPnet for the HCUP samples.
Conclusions
• Between the NCHS and AHRQ, one can obtain a very
complete picture of the healthcare burden associated with
viral diseases
• There is extensive online documentation and queries to assist
one’s initial efforts in assessing the viability of a new project
• All datasets (except MarketScan) are located in the
Epidemiology Branch Data1 share drive:
cdcprojectNCIRD_DVD_EB_DATA_1
• The ADMA team can help pull subsets of data as needed for
various projects.
– If interested in the data it is important to do the initial legwork first to
determine if the project is feasible. Once you determine feasibility, the
ADMA team is available to answer questions as needed.
Questions?

  • 1. Nationally Representative and Other Large Datasets Jessica Moore Analysis and Data Management Activity (ADMA) 03/22/2012
  • 2. Background • One of our goals is to estimate the burden of viral diseases among the US population by describing their association with – Hospitalizations – ED visits – Ambulatory/Outpatient visits – Deaths • Consequently, we need datasets for deriving robust, nationally representative estimates
  • 3. Background Focus will be on datasets produced by two organizations within the Department of Health and Human Services • National Center for Health Statistics (NCHS) – Mission is to provide statistical information that will guide actions and policies to improve the health of the American people. • Agency for Healthcare Research and Quality (AHRQ) “ark” – Mission is to improve the quality, safety, efficiency, and effectiveness of health care for all Americans.
  • 5. NCHS – Surveys National Hospital Discharge Survey (NHDS) http://www.cdc.gov/nchs/nhds/nhds_questionnaires.htm • Designed to meet the need for nationally representative information on inpatients discharged from nonfederal , short –stay hospitals in the United States – Federal, military, and VA Hospitals, as well as hospital units of institutions (such as prison hospitals), and hospitals with fewer than six beds staffed for patient use, are excluded. • Due to funding limitations, sample of hospitals reduced by half starting in 2008. • In 2009, approximately 160,000 inpatient records were obtained from 205 responding hospitals.
  • 6. NCHS – Surveys National Hospital Discharge Survey (NHDS) • Variables include discharge diagnoses, procedures, length of stay, payer, outcome, gender, race and ethnicity (race missing on 19% of records) • A discharge weight (not the patient’s weight!) is provided for each record and the sum of the weights results in national estimates of the # of hospitalization associated with a particular condition • Public use data are available for download via ftp (1996-2009) – Epi share drive (cdcprojectNCIRD_DVD_EB_DATA_1nchspublic) has SAS datasets from 1996-2009 – Data collection for NHDS began in 1965 (sampling design changed in 1988). CDs available from NCHS for data 1970-1978 and 1979-2007
  • 7. NCHS –Surveys National Hospital Discharge Survey (NHDS) • To conduct statistical tests, one must use the survey design variables (weight) to generate standard errors (SEs) for estimates – Calculating SEs requires SUDAAN – For access to the survey design variables, one has to prepare a project proposal for NCHS review and, w/approval, utilize the Research Data Center – Summary output only; no record-level data
  • 8. NCHS – Surveys National Ambulatory Medical Care Survey (NAMCS) & National Hospital Ambulatory Medical Care Survey (NHAMCS) http://www.cdc.gov/nchs/ahcd.htm • Together – AKA “NAMCeeS/HAMCeeS” • The NAMCS is based on a sample of visits to non- federally employed office-based physicians who are primarily engaged in direct patient care. Physicians in the specialties of anesthesiology, pathology, and radiology are excluded – National estimates of doctor’s office visits • The NHAMCS is based on a national sample of visits to the emergency departments and outpatient departments of non-institutional general and short- stay hospitals. – National estimates of hospital outpatient department and ED visits
  • 9. NCHS – Surveys NAMCS & NHAMCS • Demographic information, reason for visit, procedure codes, prescription information, pre-existing chronic conditions, payer, length of visit. • By using the NAMCS and NHAMCS one can generate national estimates of outpatient and ED visits. • Public use files available for download – The public use files for 1993-2009 contain sample design variables in masked form so you don’t need any special files/agreements • Still need SUDAAN for calculating SEs. • A combined SAS dataset from 1995-2009 is available in the Epi share drive for all 3 datasets cdcprojectNCIRD_DVD_EB_DATA_1nchspublic • Additional data, documentation and import code found on the NCHS site.
  • 10. NCHS – Surveys National Immunization Sample (NIS) http://www.cdc.gov/nchs/nis.htm • Conducted jointly by NCIRD and NCHS • List-assisted random-digit-dialing telephone survey, with mailed survey to children’s immunization providers • Target population – children 19-35 months • Rates of being up-to-date with respect to ACIP recommended number of doses • DTaP/DTP/DT, polio, MMR, Hib, HepB, varicella, pneumococcal, HepA, influenza, rotavirus • SAS datasets in EB share drive available from 2010 • 2010 dataset has records for approximately 24,000 children, with 17,000 having adequate provider data
  • 11. NCHS – Surveys National Immunization Sample – Teen (NIS-Teen) • Same sampling method of NIS • Public datasets and SAS datasets in EB share drive available from 2008-2010 • Assesses vaccination coverage of teens less than 17 years old • 2010 dataset includes records for approximately 33,000 teens, with 20,000 having adequate provider data • Tables of vaccination coverage for NIS and NIS-Teen available at http://www.cdc.gov/vaccines/stats-surv/imz- coverage.htm#nisteen • Public use file does not include any provider-level weights. Providers are included in the survey through the children they vaccinate. Do not attempt provider level analysis!
  • 12. NCHS- Surveys General comments from past and present experiences • NHDS is known as the ‘gold standard’ for national hospitalization estimates. Public use files are likely sufficient for some activities negating the need for design variables. – Hospitalization estimates of less common conditions can be problematic (i.e. Relative SEs>30% or < 30 unweighted records are considered unreliable. Estimates based on 30-59 records are termed ‘low reliability’). – Stringent control of design variables requires a RDC proposal • NAMCS and NHAMCS has same limitations for less common conditions – Bonus is that design variables are included in public use files • NIS/NIS-Teen – do not attempt to calculate provider-level statistics! • For all surveys, the size of the annual files are relatively small compared to other datasets – Requires less disk space and processing time
  • 14. NCHS – Vital Statistics Mortality Multiple Cause of Death • Annual files captures approximately 100% of US resident deaths (i.e. it is not a sample so no need to worry about record weights). – Underlying cause of death provided – Up to 20 ICD-9 (ICD-10 beginning in 1999) recorded on death certificate – Includes demographic information for the deceased • Public use files available for download – Geographical information suppressed for counties with pops<100,00 for 1998 through 2004 – ALL geographical information suppressed beginning in 2005 • DVD has agreement with NCHS to obtain county-level geographic information.
  • 15. NCHS – Vital Statistics Linked Birth and Infant Death Datasets • Information from the death certificate is linked to the information from the birth certificate for each infant under 1 year of age who dies in the United States, Puerto Rico, The Virgin Islands, and Guam • Variables include age, race, and Hispanic origin of the parents, birth weight, period of gestation, plurality, prenatal care usage, maternal education, live birth order, marital status, and maternal smoking, linked to information from the death certificate such as age at death and underlying and multiple cause of death. • Useful for identifying risk factors for infant death. – One can compare characteristics of infant deaths from one cause to infant deaths due to another cause(s) – Or compare the characteristics of infant deaths to a sample of surviving infants.
  • 16. NCHS – Vital Statistics Linked Birth and Infant Death Datasets • Two designs for the linked data - period data and birth cohort data. – The numerator for the period linked file consists of all infant deaths occurring in a given data year linked to their corresponding birth certificates, whether the birth occurred in that year or the previous year. – The numerator for the birth cohort linked file consists of the deaths (current or subsequent year) linked to infants born in a given year. – In both designs, the denominator is all births occurring in the year. – Due to the different designs, the period linked files are available sooner than the cohort linked files. • Public use files available for download – Geographical information suppressed for counties with populations<250,000 for 1998 through 2004 – ALL geographical information suppressed beginning in 2005 • DVD has agreement with NCHS to obtain county-level geographical information.
  • 17. NCHS – Vital Statistics Birth Datasets http://www.cdc.gov/nchs/births.htm • Captures all registered live births occurring in the US. Birth certificate information included (like in the linked denominator file). • Public use files available for download – Geographical information is suppressed for counties with populations<100,000 for 1998 through 2004 – ALL geographical information suppressed beginning in 2005 • DVD has agreement with NCHS to obtain county-level geographical information. • SAS datasets from 1969-2009 available in EB share drive • 2 SAS summary datasets from 1990-2009 – Summary births by year, region, state, race, ethnicity and sex – Summary births by year, region, race, ethnicity and sex
  • 18. NCHS – Vital Statistics General comments • Need to closely follow the documentation as variables were added/removed/modified over time – Particularly important for the birth and linked data due to states adopting the 2003 revision of the birth certificate. • For all mortality data, need to pay attention to the conversion from ICD-9 to ICD-10 beginning in 1999. – Comparability studies found differences in how conditions were coded between the two coding versions • Birth data and linked denominator data take a bit of processing time since each calendar year has approximately 4 million live births. • Can do basic queries online using wonder.cdc.gov or by using VitalStats (http://www.cdc.gov/nchs/VitalStats.htm)
  • 19. AHRQ
  • 20. AHRQ Healthcare Cost and Utilization Project “H-CUP” • HCUP obtains data on inpatient and ED visits from participating states – Standardizes the format of data across states – Creates samples from the states’ 100% hospitalization and ED visit data. • Extensive documentation on HCUP and associated datasets available online (http://www.ahrq.gov/data/hcup/)
  • 21. AHRQ HCUP Datasets Samples • Nationwide Inpatient Sample (NIS) • Kids’ Inpatient Database (KID) • Nationwide Emergency Department Sample (NEDS) 100% data from source • State Inpatient Databases (SID) • State Emergency Department Databases (SEDD)
  • 22. AHRQ HCUP - NIS • The NIS is a very large sample of hospital discharges from HCUP participating states (44 during 2009) – Began in 1988 with 8 participating states. We have SAS datasets available from 1988 to 2009. – Database design has evolved over time but has remained relatively constant since 1998. – The NIS is approximately a 20% sample of US hospitalizations occurring in community hospitals – Similar variables to the NHDS – Provides hospital charges and cost-to-charge ratio files so one can estimate direct charges/costs – Participating states account for approximately 95% of all hospital discharges in the U.S.
  • 23. AHRQ HCUP - NIS • Since the NIS is such a large sample, analyses of less common conditions can be conducted. – Also…since the NIS is such a large sample, processing time to generate subsets can be burdensome (creating subsets from 5-8 million records each year). • Like the NHDS – One must use the record weight to generate nationally representative estimates – For statistical tests, one must generate SEs for the estimates • Can use SUDAAN, SAS Survey procs, or Stata • A bonus…design variables are provided with the datasets
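As a rough sketch of what "using the record weight" means for a point estimate: the national estimate is the sum of the weights of the sampled records that meet the case definition. The record layout, weights, and function name below are hypothetical, and the sketch deliberately omits standard errors, which require the design variables and survey software such as SUDAAN, the SAS survey procedures, or Stata.

```python
def national_estimate(records, has_condition):
    """Sum the record weights of all sampled records meeting the case definition."""
    return sum(r["weight"] for r in records if has_condition(r))


# Hypothetical sample: each record carries a flag and a discharge weight
sample = [
    {"flu": True, "weight": 5.0},
    {"flu": False, "weight": 4.0},
    {"flu": True, "weight": 6.0},
]

# Three sampled records, two with the condition, representing 11 discharges
flu_estimate = national_estimate(sample, lambda r: r["flu"])
```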
  • 24. AHRQ HCUP - KID • The KID follows the same format as the NIS but focuses exclusively on pediatric discharges (patients aged ≤ 20 years). – Very powerful for looking at less common conditions occurring among children that lead to hospitalization. • Produced every 3 years (1997, 2000, 2003, 2006, 2009) • Same weighting and SE calculation procedures as with the NIS • Cost-to-charge files available for 2003, 2006, 2009
  • 25. AHRQ HCUP - NEDS • The NEDS yields national estimates of emergency department (ED) visits. • The NEDS is a brand new dataset with the 1st release being for calendar year 2006 – ~26 million and ~27 million unweighted records for 2006 & 2007, respectively. • Similar variables to the NIS and also includes ED charges • SAS datasets available in EB share drive for 2006-2009 • Race is suppressed
  • 26. AHRQ HCUP – State Datasets (100% Data) • State Inpatient Databases 1988-2010* (SID) – 100% hospitalization data from participating states in HCUP. Currently, we do not have direct access to these data but have current/past collaborations with HCUP investigators to conduct analyses using these data. – The gold standard for assessing rare conditions leading to hospitalization. Can calculate state-specific rates and indirectly estimate nationally representative rates. – Increasing number of states available over time. – These are very large datasets • State Emergency Department Databases 1999-2010* (SEDD) – Currently, we do not have direct access to these data but have current/past collaborations with HCUP investigators to conduct analyses using these data. – Captures the discharge information on all emergency department visits that do not result in a hospital admission – These are very, very, very large datasets *2010 already available for some states.
  • 27. AHRQ HCUP – General Comments • Extensive online documentation available for the datasets. http://hcup-us.ahrq.gov/ • Access to sample data requires a brief online course and a signed DUA. – It’s easy. • Processing time can be a bit daunting. – Workarounds include remote submitting to branch server, CSP (pending), or using Citgo SAS (recommended) • Online tool (HCUPnet) is a good start for assessing the numbers that might be available for a study of interest. – http://hcupnet.ahrq.gov/ • Specifically for the NIS, changes in dataset format over time can make assessing trends challenging. HCUP has released trend files to facilitate these types of analyses. • AHRQ datasets include 15 possible diagnosis codes per record
  • 28. MEDSTAT • MarketScan data warehouse • Contains individual-level healthcare claims, health risk assessments, absence, short-term disability, workers’ compensation, and hospital discharge info from large employers, managed care organizations, hospitals, Medicare and Medicaid programs • We have access to data from 1993 through 2010. Pre-2004 is in Citgo. Some of the post-2004 data is located in CDC network land and some is located in CSP (consolidated statistical platform). • Very large datasets with extensive admission and discharge data, including cost and charge information, e.g. deductibles, coinsurance, other payments, etc. • Once you pull data from the various sources and locations, it’s a powerful dataset for estimating the burden of various diseases.
  • 29. What about rates? Numerators are nice but what about denominators? • Bridged-race census data available for 1990-2009 – cdcprojectNCIRD_DVD_EB_DATA_1census – Goal of bridged-race data is to make population estimates by race/ethnicity comparable between 1990-1999 and 2000 onward. • Longer-term project is to also create intercensal population estimate datasets. • Remember that for the sample datasets, you have to factor in the SEs (i.e. compute relative standard errors) for your rates to do comparisons or to calculate 95% confidence intervals
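The rate calculation described on this slide can be sketched as below. This is a simplified illustration under stated assumptions: the weighted estimate and its SE are taken as already produced by survey software, the census denominator is treated as a fixed number with no sampling error, and the interval uses a normal approximation (z = 1.96). The function name and the input numbers are hypothetical.

```python
def rate_with_ci(estimate, se, population, per=100_000, z=1.96):
    """Return the rate per `per` population, its RSE, and a 95% CI.

    Assumes the denominator is a fixed census count, so the SE of the
    rate is just the SE of the numerator scaled the same way as the rate.
    """
    rate = estimate / population * per
    se_rate = se / population * per
    rse = se / estimate  # relative standard error of the estimate
    return {
        "rate": rate,
        "rse": rse,
        "ci": (rate - z * se_rate, rate + z * se_rate),
    }


# Hypothetical inputs: 50,000 weighted hospitalizations (SE 5,000)
# in a population of 10 million
result = rate_with_ci(estimate=50_000, se=5_000, population=10_000_000)
```

A large relative standard error is a common signal that an estimate is too unstable to report; the threshold to apply should come from the dataset's own documentation.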
  • 30. Data Processing • Many of the national datasets are very large, e.g. NEDS has approx. 27 million unweighted records per year! • While these are powerful datasets for estimating burden of disease, especially for rare conditions, the size of the datasets makes them cumbersome to analyze. • We suggest, if working with these data in SAS, pulling the data from the share drive and running on SAS in Citgo. We tested various methods and this had the shortest run time, as well as the least CPU usage. • Since the national datasets we use have complex survey designs, you need to analyze them in SUDAAN to calculate weighted estimates. SAS-callable SUDAAN is available in Citgo. • Other suggestions for larger datasets: – SGIO=yes – Compress=binary
  • 31. Limitations • Hospitalizations, ED, outpatient visits – Have to rely on ICD-9-CM codes to identify cause – Primary versus secondary diagnosis versus any-listed • Was the condition of interest the ‘cause’ of the healthcare visit? • Mortality – Rely on ICD-9/ICD-10 codes • Conversion to ICD-10 does result in changes in national estimates for a lot of conditions • Can be difficult/impossible to ascertain comorbid conditions • Survey datasets are encounter based rather than case based – Multiple visits by the same individual cannot be determined
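To illustrate how much the primary versus any-listed choice can change an estimate, here is a small sketch. It assumes each record holds its diagnosis codes in listed order, with the first code as the primary diagnosis and codes stored without decimal points; the records, weights, and function name are hypothetical, and the code prefix shown (00861, i.e. 008.61) is used only as an example case definition.

```python
def matches(dx_codes, prefixes, any_listed=True):
    """True if a diagnosis code starts with one of the given prefixes.

    With any_listed=False only the first (primary) diagnosis is checked.
    """
    codes = dx_codes if any_listed else dx_codes[:1]
    return any(code.startswith(p) for code in codes for p in prefixes)


# Hypothetical records: ordered diagnosis codes plus a record weight
records = [
    {"dx": ["00861", "27651"], "weight": 5.0},  # condition listed first
    {"dx": ["27651", "00861"], "weight": 4.0},  # condition listed second
    {"dx": ["486"], "weight": 6.0},             # condition not listed
]
case_definition = ["00861"]

# Weighted estimates under the two case definitions
primary_only = sum(
    r["weight"] for r in records
    if matches(r["dx"], case_definition, any_listed=False)
)
any_position = sum(
    r["weight"] for r in records
    if matches(r["dx"], case_definition, any_listed=True)
)
```

Here the any-listed definition nearly doubles the estimate, which is why the choice should be stated alongside any reported burden figure.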
  • 32. Limitations • Public-use data good for general picture – Suppressed information (especially since 2005) is problematic. • Access to suppressed elements or design information requires annual agreements with NCHS. – Requires RDC proposal for access to design information. • Working with the data can be challenging, especially the birth, linked, and HCUP sample datasets • Caveats on producing SEs for the sample datasets – Requires knowledge of SUDAAN, SAS Survey procs and how does one know they got it right? • Check published reports for NHDS, or HCUPnet for the HCUP samples.
  • 33. Conclusions • Between the NCHS and AHRQ, one can obtain a very complete picture of the healthcare burden associated with viral diseases • There is extensive online documentation, along with online query tools, to assist one’s initial efforts in assessing the viability of a new project • All datasets (except MarketScan) are located in the Epidemiology Branch Data1 share drive: cdcprojectNCIRD_DVD_EB_DATA_1 • The ADMA team can help pull subsets of data as needed for various projects. – If interested in the data, it is important to do the initial legwork first to determine if the project is feasible. Once you determine feasibility, the ADMA team is available to answer questions as needed.

Editor's Notes

  • #7: 6 diagnosis codes included. Want an estimate of length of stay? Multiply each record weight by the number of days admitted, then sum the products to get total days of care; divide by the sum of the weights for the mean length of stay.
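The length-of-stay arithmetic in note #7 can be sketched as follows. The records and the function name are hypothetical; the sketch gives point estimates only and does not produce standard errors.

```python
def weighted_length_of_stay(records):
    """Return (total days of care, mean length of stay) from weighted records."""
    total_days = sum(weight * days for weight, days in records)
    total_discharges = sum(weight for weight, _ in records)
    return total_days, total_days / total_discharges


# Hypothetical records: (discharge weight, days admitted)
sample = [(5.0, 2), (4.0, 3), (6.0, 1)]

# 5*2 + 4*3 + 6*1 = 28 days of care across 15 weighted discharges
total_days, mean_los = weighted_length_of_stay(sample)
```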
  • #12: -Statistical analyses require only data from children with adequate provider data (PDAT=1) along with their final provider sampling weights (PROVWT/PROVWTVI)
  • #26: 15 diagnosis codes available
  • #31: SGIO = scatter-read/gather-write input/output