An Approach to Combining Disparate Clinical Study Data across Multiple Sponsors' Studies participating in Project Data Sphere®
Presented by Gene Lightfoot
© CDISC 2016
INTRODUCTION
• Project Data Sphere®
• The Challenge
• The Approach
• Simplified Process Flow
• Identify the data
• Reviewing the Raw Data
• Programming the Process
• Reviewing and Data Quality
• Basic Program Flow
• Documentation
• General Issues and Things to Ponder
• The Final Data Sets
• Conclusion
PROJECT DATA SPHERE®
• Project Data Sphere® is an independent, not-for-profit initiative of the CEO Roundtable on Cancer's Life Sciences Consortium (LSC). It operates the Project Data Sphere platform, a free digital library-laboratory that provides one place where the research community can broadly share, integrate, and analyze historical, patient-level, comparator-arm data from academic and industry phase III cancer clinical trials.
• The Project Data Sphere platform is available to researchers affiliated with life
science companies, hospitals and institutions, as well as independent
researchers. Anyone interested in cancer research can apply to become an
authorized user.
• A goal of the Project Data Sphere initiative is to spark innovation.
PROJECT DATA SPHERE®
Some Project Data Sphere® metrics (December 2016):
• 1,437 total users
• 51 countries
• 5,861 total downloads to date
• 40,500+ subjects
• Growing monthly
Tools are available to registered users, and the data can be downloaded and accessed locally.
THE CHALLENGE
• Use available data provided for the prostate cancer studies to develop and
implement a process to combine the data.
• The data comprised 12 separate studies spanning 20+ years from 7 different
sponsors. Standards represented were:
• 1 ADaM
• 5 SDTM
• 6 Other
• Three data sets were identified for analysis: labs, adverse events, and demography.
• The task involved aggregating the data for each domain at the study level and
then harmonizing the data for analysis across all 12 of the sponsor studies.
THE APPROACH: SIMPLIFIED PROCESS FLOW
After completing several studies across multiple sponsors, it became evident that a process had evolved that served this project well.
THE APPROACH: IDENTIFY THE DATA
• Since SDTM is considered a global industry standard and recently conducted studies uploaded to Project Data Sphere® usually conformed to this model, it was decided to use SDTM as the standard.
• Disease expertise at this level would have made column selection and analysis much easier, but we did not have access to this resource.
Before the team started looking at the data, certain endpoints and populations were identified for the analysis. Of particular interest was the Prostate-Specific Antigen (PSA) value, used as a predictor for prostate cancer. This project covered a single-gender (male) population. It was decided to include all available labs, adverse events (AE), and demography data.
THE APPROACH: IDENTIFY THE DATA
Reviewing the Raw Data
• Undoubtedly the hardest aspect of this project.
• Data was supplied as SAS data sets.
• Clinical data knowledge is invaluable here – it is not always obvious where the data is "hiding", and multiple data sets may be required to build one domain.
• The data had been de-identified.
• Some of the data was 20+ years old, presenting some interesting aspects of data collection – long, skinny (normalized) versus short, fat (non-normalized) data sets.
• Unusual data set names made identifying contents less intuitive (a quick inventory pass, sketched below, helps here).
• All sponsors provided some combination of data dictionary documents, annotated CRFs, a study protocol document, and SAS formats.
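A minimal sketch of such an inventory pass, assuming the raw SAS data sets for one study have been downloaded to a local folder; the library path and data set names are hypothetical:

libname raw "C:\pds\sponsorA\study01";   /* hypothetical local folder for one study */

/* List every data set and column in the library to find where the
   labs, adverse events, and demography are "hiding" */
proc contents data=raw._all_ out=work.raw_contents noprint;
run;

proc print data=work.raw_contents;
   var memname name type length label;   /* data set name, column, and metadata */
run;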
THE APPROACH: PROGRAMMING THE PROCESS – (MAP THE DATA)
Programming approach
• Although data mapping solutions are available, it was decided to stick with traditional SAS programs to mimic how a solitary researcher might work.
• A global attribute program for each domain was created to manage the column metadata as the project progressed – column name, label, type, length, etc. This metadata was %included in each domain program (sketched below).
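As an illustration only – the actual attribute files are not shown in the slides – a domain program might apply the shared metadata like this; the file name, data set names, and columns are hypothetical:

data work.dm_study01;
   /* In the project this ATTRIB block lived in its own file and was pulled
      into every study's DM program with %include "dm_attrib.sas", so all
      studies shared identical column metadata. Columns are hypothetical. */
   attrib
      STUDYID  length=$20  label="Study Identifier"
      USUBJID  length=$40  label="Unique Subject Identifier"
      SEX      length=$1   label="Sex"
      AGE      length=8    label="Age"
   ;
   set raw.demog;                           /* raw demography, one study    */
   STUDYID = "STUDY01";
   USUBJID = catx("-", STUDYID, subjid);    /* subjid: hypothetical column  */
   SEX     = upcase(substr(gender, 1, 1));  /* gender: hypothetical column  */
   AGE     = age_yrs;                       /* age_yrs: hypothetical column */
   keep STUDYID USUBJID SEX AGE;
run;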
THE APPROACH: PROGRAMMING THE PROCESS – (MAP THE DATA)
Map the Data
Mapping programs were written for each domain (DM, AE, etc.) within each study for each sponsor. Don't be alarmed – code reuse within a sponsor, and even within SDTM standards across sponsors, resulted in program efficiencies.
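For the legacy "short fat" studies, a mapping program typically also had to normalize the data. A minimal sketch under assumed names (raw.labs, subjid, psa, hgb, and the unit columns are all hypothetical):

data work.lb_study01;
   length STUDYID $20 USUBJID $40 LBTESTCD $8 LBORRES $40 LBORRESU $20;
   set raw.labs;                            /* one row per visit, one column per test */
   STUDYID = "STUDY01";
   USUBJID = catx("-", STUDYID, subjid);

   /* Reshape "short fat" lab columns into "long skinny" SDTM LB-style rows */
   LBTESTCD = "PSA"; LBORRES = put(psa, best12. -l); LBORRESU = psa_unit; output;
   LBTESTCD = "HGB"; LBORRES = put(hgb, best12. -l); LBORRESU = hgb_unit; output;

   keep STUDYID USUBJID LBTESTCD LBORRES LBORRESU;
run;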
THE APPROACH: COMBINING THE DATA SETS – (COMBINE THE DATA)
Code to Remove Data Formats and Informats
• To reduce notes and any warnings in the SAS log, all SAS informat/format associations were removed from the raw input data sets.
• A %include was used to bring this code into each program.
Programs to Combine the Data Sets
• A simple DATA step with multiple data sets on the SET statement.
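A minimal sketch of both steps; the library, data set, and snippet names are hypothetical:

/* Strip format/informat associations from a working copy of one raw data
   set (in the project these two statements lived in a small %include
   snippet). The original source data is never modified. */
data work.demog01;
   set raw.demog;
   format _all_;       /* drop all format associations   */
   informat _all_;     /* drop all informat associations */
run;

/* Combine a mapped domain across studies with multiple data sets on SET */
data work.dm_all;
   set work.dm_study01
       work.dm_study02
       work.dm_study03;
run;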
THE APPROACH: REVIEWING AND DATA QUALITY – (DATA QUALITY)
Data Quality
• Our most important concern was the quality of the mapped data: did we assign the proper column during the mapping process?
• An additional programmer was tasked with reviewing the data and confirming correct observation counts and correct patient populations.
• We constantly ran frequencies against the raw data and the harmonized data to verify output, paying particular attention to the remapped columns.
• Any outliers or questionable data flagged by this programmer were reviewed and, if found to be incorrect, the appropriate changes were made to the mapping code.
• No original source data was ever modified.
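A hedged example of such a frequency check, comparing a raw column with its remapped counterpart; the data set and column names are hypothetical:

/* Category counts in the raw data ... */
proc freq data=raw.demog;
   tables gender / missing;        /* gender: hypothetical raw column */
run;

/* ... should match the remapped column in the harmonized data */
proc freq data=work.dm_all;
   tables SEX / missing;
   where STUDYID = "STUDY01";      /* restrict to the study under review */
run;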
THE APPROACH: REVIEWING AND DATA QUALITY – (DATA QUALITY)
In the upper right corner are four blocks with missing values. Their values from high to low are: missing,
MCG/L, UG/l, and NG/DL.
THE APPROACH: BASIC PROGRAM FLOW
Programming Flow
1. Review the data and identify needed tables and columns.
2. Create a "global" metadata file for each domain. For this project this was a SAS ATTRIB statement used for each domain and across each study.
3. Create mapping programs for each study – code should be reusable within a sponsor.
4. Create data quality process flow to check the data for correct metadata,
patient counts, and any “outliers”.
5. Create code to combine data across studies – simple SET statement.
6. [Optional] Create one process that submits all the code created in items 2-5.
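Step 6 can be as simple as a driver program of %include statements. The file names below are hypothetical, and the attribute file from step 2 is %included inside the mapping programs themselves:

/* Hypothetical driver program: submit the whole pipeline in order */
%include "dm_map_study01.sas" / source2;   /* step 3: study-level mapping    */
%include "dm_map_study02.sas" / source2;
%include "dm_qc_checks.sas"   / source2;   /* step 4: data quality checks    */
%include "dm_combine.sas"     / source2;   /* step 5: combine across studies */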
DOCUMENTATION
Data Matrix Document
The data matrix document was dynamic during the development process. The end result is a document that can be provided to the researcher, tracing the harmonized data back to the original source columns and source data sets and providing a quick overview of the data.
DOCUMENTATION
Data Traceability Document
This document was also dynamic. It recorded observations and notes about the data, along with any decisions made during mapping that might affect the harmonized data.
GENERAL ISSUES AND THINGS TO PONDER
Not All Data is Created Equal
• Mixture of character and numeric
• Normalized versus non-normalized
• Some studies were more robust (contained more data)
Some Studies May Not Fit the Analysis
• You may not find what you are looking for in the data – a key column may be missing (e.g., AEREL)
To Compute or Not to Compute?
• A decision may be needed on whether to compute relative day, age, or gender
Age and Age Groups
• If age was not available it was usually reported as an age group – across sponsors these groups were not consistent (e.g., 40-55, 45-55, 50-65; a hedged harmonization sketch follows this list)
Race
• A variety of race categories was seen here, mostly in the legacy data
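One possible way to handle the inconsistent age groups, shown purely as an illustration – the group strings, cutoffs, and column names are all hypothetical, and ranges that straddle the common cut are left unclassified:

data work.dm_all2;
   set work.dm_all;
   length AGEGRP_RAW $20 AGEGRP_STD $10;
   AGEGRP_RAW = agegrp;                         /* agegrp: hypothetical column */
   select (AGEGRP_RAW);
      when ("40-55", "45-55") AGEGRP_STD = "<=55";
      when ("56-65", "56-70") AGEGRP_STD = ">55";
      when ("50-65")          AGEGRP_STD = " ";  /* straddles the cut: leave missing */
      otherwise               AGEGRP_STD = " ";
   end;
run;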
GENERAL ISSUES AND THINGS TO PONDER
Categorical Data
• Use of provided data dictionaries and SAS formats
• Cannot always make assumptions
External Terminology/Dictionary
• Found a combination of COSTART and MedDRA dictionaries
• No effort was made to upgrade COSTART terms to MedDRA
Dates versus Date Intervals
• Dates were rare in the data, no doubt due to de-identification
• Relied on duration – but how is it calculated: (event - start) or (event - start) + 1? (illustrated after this list)
• Duration unit – days versus weeks
Unique Subject Identifiers
• Some studies simply numbered subjects from 1 to N
Can the Data be too De-identified?
• In some cases yes – for example, lack of dates and age
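A small illustration of the two duration conventions, assuming numeric SAS date columns; the data set and column names (work.ae_all, aestdt, trtstdt) are hypothetical:

data work.ae_chk;
   set work.ae_all;
   if nmiss(aestdt, trtstdt) = 0 then do;
      aedur_a = aestdt - trtstdt;        /* convention A: (event - start)       */
      aedur_b = aestdt - trtstdt + 1;    /* convention B: (event - start) + 1,
                                            so the start day counts as day 1    */
   end;
run;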
THE FINAL DATA SETS
DM domain: 8,116 subjects
LB domain: 1,170,346 observations
AE domain: 127,067 observations
CONCLUSION
• This was a great project since it covered various aspects of data that a user
would expect from 20+ years of research.
• Data conforming to the SDTM model was obviously the easiest to combine. The legacy data, as expected, required more work but in the end conformed nicely.
• Disease experts/researchers and clinical data programmers clearly benefit any project of this nature.
• Effective analysis tools provide excellent data quality review.
CONCLUSION
• Data harmonization requires careful analysis and understanding of the
underlying clinical data especially when legacy data exists without any
associated clinical data standard. Document, document, document.
• Choose a target standard such as SDTM when working with legacy data.
• Regard data harmonization as a continuous and valuable learning experience
as processes for data harmonization will surely evolve with time.
As a result of this work, a more robust process to harmonize incoming data for Project Data Sphere® is currently being developed. A questionnaire/checklist was created so that sponsors can provide information felt necessary to help get researchers started.
FURTHER INFORMATION
Project Data Sphere®
https://www.projectdatasphere.org/projectdatasphere/html/about
Author Contact information
Your comments and questions are valued and encouraged. Please contact the author at:
Gene Lightfoot
SAS Institute Inc.
SAS Campus Drive Q2372
Cary, North Carolina 27513 USA
+1 (919) 677-8000
gene.lightfoot@sas.com
• www.sas.com
Editor's Notes
• #17 (the missing lab units slide): The sponsor was contacted and the SAS format catalog was provided.