June 2020
These slides: bit.ly/ccdh-prototype-june2020
CRDC-H Draft Model Presentation to Nodes
Community
Development
(lead: Volchenboum;
co-lead Vasilevsky)
Data Model
harmonization
(lead: Chute;
co-lead Furner)
Ontology & Terminology
Ecosystem
(lead: Solbrig)
Tools & Data Quality
(lead: Balhoff)
Program Management and operations:
(lead: Haendel, co-lead Munoz-Torres)
Programmatic oversight:
CBIIT: Sherri De Coronado, Allen Dearry
FNL: Todd Pihl, Resham Kulkarni
From Practice-based Evidence
to Evidence-based Practice
Clinical
Databases
Registries
et al.
Clinical
Guidelines
Expert
Systems
Data Inference
Knowledge
Management
Decision
support
Terminologies and data models provide the consistency and comparability
essential for a Learning Health System
Patient
Encounters
Medical
Knowledge
Terminologies
Data models
Role of CCDH in the CRDC ecosystem
Facilitate retrospective and
prospective semantic harmonization of
data across nodes of the CRDC
Coordinate the community to ensure
quality “fit for purpose” design and
implementation of standards that will
facilitate interoperability of
heterogeneous data types and CRDC
resources
Find agreement across the
communities built around CRDC
- match and extend data models
- annotation, harmonization
- quality assurance
Data Model
harmonization
(lead: Chute
co-lead: Furner)
Ontology & Terminology
Ecosystem
(lead: Solbrig)
Tools & Data Quality
(lead: Balhoff)
Schema to
schema
OMOP to
FHIR
Term to
Term
Oncotree to
NCIt
Data records to
data records
“Smoking status
>7 packs per day”
to NCIT:C154510
[Heavy Smoker]
Data model harmonization
Structure:
Syntactic
Concept:
Representation
Ontology:
Meaning in
context
Relationships:
Connections
● Goal is to support harmonization of equivalent data elements in disparate models to
enable cross-node querying and data aggregation
● Node models have developed somewhat independently to fit specific use cases
○ Overall modeling space is broad: there is overlap, but each model covers unique semantic space
○ Divergence in modeling approach: equivalent entities and properties are not always captured in
syntactically equivalent ways
○ Heterogeneity of source data model artifacts
● The CCDH Data Model Harmonization group is defining a shared data model for use
across the CRDC, leveraging existing standards (e.g. FHIR, BRIDG) where possible.
● This harmonized model (CRDC-H) and terminological infrastructure are being designed to
meet the needs of systems like the Cancer Data Aggregator (CDA) that support integrated
search and metadata-based analyses across datasets in the CRDC ecosystem.
Data Model Harmonization: Overview
● Phase 1 has focused on foundational effort necessary to support more nuanced work in
additional phases
○ Phase 1 work was exploratory and the modeling abstract
● Phase 2 will provide a more concrete model useful for implementation
○ Converge on a modeling and implementation approach that will work for CRDC
Five steps
in the
CRDC-H
Model
Development
Workflow
An iterative process through which content of source models is evaluated, aggregated,
mapped, and refactored into a standards-aligned and harmonized data model.
CRDC-H Model Development Workflow
Abstract specification
Low harmonization
Not standards-aligned
Concrete specification
Deep harmonization
Standards-aligned
(1) Standardized concept map and spreadsheet representations of source node models
provide a consistent, comparable, and computable substrate for harmonization efforts
Step 1:
Standardize
Source
Data Model
Documentation
Links:
- Source model cmaps
- Standardized data dictionaries
(2) Equivalent elements are merged across sources to produce a single aggregated model,
providing a unified view of all information that the final CRDC-H model must represent.
Step 2.
Generate an
Aggregated
Data Model
(ADM)
Links:
- ADM cmap
- ADM data dictionary
- February Progress Report
and Slide Deck
(3) Mappings of ADM elements to standard models like BRIDG and FHIR facilitate
understanding of source models, and development of a standards-aligned model.
Step 3.
Map the ADM
to Community
Standard Data
Models
(4) Deeper harmonization is achieved as ADM elements are refactored into a more
normalized and standards-aligned conceptual domain model (CDM)
Step 4.
Refactor the
ADM into a
Conceptual
Domain Model
(CDM)
(5) Mature elements of the CDM are refined into a concrete logical model, the CRDC-H,
which will support implementation by CRDC nodes and the CDA
Step 5.
Refactor the
CDM into a
Logical Data
Model
(CRDC-H)
Lessons learned from this initial deep dive will inform subsequent iterations that
incorporate new data sources and domains.
CRDC-H Model Development Workflow
First Iteration:
Biospecimen
and
Administrative
entities from
GDC, PDC,
ICDC, and
HTAN
1. The Aggregated Data Model (ADM) (Step 2)
2. Mapping the ADM to Standard Data Models (Step 3)
3. Refactoring the ADM into a CDM Prototype (Step 4)
4. Next Steps and Future Directions
Outline
The Aggregated Data Model
(ADM)
The substrate for mapping and refactoring efforts
The Aggregated Data Model (ADM)
The ADM represents the union of all elements
across our set of source data models, where
‘equivalent’ entities and attributes are merged.
● Provides a unified view of all information that
the final CRDC-H model must represent.
● Captures an initial set of entity and property
mappings across sources.
● Serves as a base data model that can be
evolved incrementally into a final CRDC-H
An excerpt from the ADM Data Dictionary showing the ADM.Program.name property, which
aggregates and deprecates equivalent properties from GDC, PDC, and ICDC models.
Content from standardized source dictionaries is merged and reorganized in a single sheet
1. Equivalent entities are collapsed into a single record, with source definitions retained (rows 1-5)
2. Within an aggregated entity (EA), properties are ordered to group those that are equivalent (rows 8-10)
3. A new ADM row is created for each unique property in the aggregated entity (‘PA’, green row 7)
4. Rows for source properties it aggregates are marked as deprecated (‘PD’, yellow rows 8-10)
The Aggregated Data Model (ADM): Data Dictionary
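The aggregation rules for the data dictionary can be sketched in a few lines of code. This is a minimal illustration, not the CCDH tooling: the `Row` layout, the `aggregate_properties` helper, and the source property names are assumptions made for this example. Only the 'PA' and 'PD' row markers and the ADM.Program.name property come from the slides.

```python
from dataclasses import dataclass

@dataclass
class Row:
    kind: str    # 'PA' = aggregated ADM property, 'PD' = deprecated source property
    model: str   # 'ADM' or the name of a source node model
    entity: str
    prop: str

def aggregate_properties(entity, adm_name, source_props):
    """Create one 'PA' row for the new ADM property, followed by one
    'PD' row for each equivalent source property it aggregates."""
    rows = [Row("PA", "ADM", entity, adm_name)]
    rows += [Row("PD", model, entity, prop) for model, prop in source_props]
    return rows

# Hypothetical source property names, for illustration only.
rows = aggregate_properties(
    "Program",
    "name",
    [("GDC", "name"), ("PDC", "name"), ("ICDC", "program_name")],
)
for r in rows:
    print(f"{r.kind}  {r.model}.{r.entity}.{r.prop}")
```

Keeping the deprecated source rows next to the aggregated row is what preserves the mapping back to each node model.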
The Aggregated Data Model (ADM)
Node models are not very well aligned at the outset
● e.g. ICDC and GDC: ~30% entity equivalence, <5% attribute equivalence
Source Model
Alignment
Property aggregation in the ADM was based on superficial analysis and strict
aggregation criteria
● Only strictly equivalent elements within strictly equivalent entities are merged
Deeper aggregation and harmonization of elements will be achieved as the
ADM is refactored into the CDM.
The ADM as a whole is large and flat (55 entities, 984 attributes)
Harmonization in the ADM is Minimal
Examples:
(1) ICDC.birthdate vs GDC.birthyear: capture the same concept at different levels of precision - not aggregated in ADM
(2) GDC.gender vs ICDC.sex: capture related but distinct concepts, using the same values (M, F) - not aggregated in ADM
(3) ADM.freezing_method and ADM.preservation_method: separate properties for different types of
specimen processing methods - not aggregated in ADM
While harmonization achieved in the ADM is minimal, it will serve as a substrate for
mapping and refactoring toward a much more deeply harmonized CDM prototype,
and maintain mappings back to elements in source node models.
Mapping the ADM to
Domain Standard Models
BRIDG and FHIR
● The BRIDG (Biomedical Research Integrated Domain Group) Model is a UML-based
Conceptual Model covering the domains of clinical and translational research
● A collaborative effort engaging stakeholders from CDISC, HL7, ISO, NCI, and the FDA
● Not an implementation model, but can be refined into a logical data model to support
application in data systems.
● One common use is as a 'hub' supporting cross-model mapping between any two models
that have individually been mapped to BRIDG
● Supporting infrastructure maintains computable mappings of BRIDG to community
models, and links to common data elements in semantic standards like the caDSR.
The BRIDG Conceptual Model
http://bridgmodel.nci.nih.gov
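The 'hub' use of BRIDG can be illustrated with a small sketch: two models that are each mapped to BRIDG can be related to one another by joining on the shared BRIDG element. The `cross_map` helper and the second model are hypothetical. Real BRIDG mappings are expressed as paths rather than one-to-one element pairs, so this shows only the basic idea.

```python
def cross_map(a_to_hub, b_to_hub):
    """Pair each element of model A with the elements of model B
    that map to the same hub element."""
    hub_to_b = {}
    for b_elem, hub_elem in b_to_hub.items():
        hub_to_b.setdefault(hub_elem, []).append(b_elem)
    return {a_elem: hub_to_b.get(hub_elem, []) for a_elem, hub_elem in a_to_hub.items()}

# ADM.Sample -> BRIDG.BiologicSpecimen is taken from the slides.
# 'OtherModel.Specimen' and its mapping are hypothetical.
adm_to_bridg = {"ADM.Sample": "BRIDG.BiologicSpecimen"}
other_to_bridg = {"OtherModel.Specimen": "BRIDG.BiologicSpecimen"}

derived = cross_map(adm_to_bridg, other_to_bridg)
print(derived)
```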
The BRIDG
Concept Map
shows scope
of the model
and high-level
concepts it
covers
https://cbiit.github.io/bridg-model/HTML/BRIDG5.3.1/EARoot/EA1.htm
The BRIDG Conceptual Model: Coverage and Scope
The
Comprehensive
BRIDG UML
Diagram shows
attributes and
relationships of
all classes in
the model
https://cbiit.github.io/bridg-model/HTML/BRIDG5.3.1/EARoot/EA3.htm
The BRIDG Conceptual Model: Full Model
BRIDG
Biospecimen
View shows
only modeling
related to this
subdomain
https://cbiit.github.io/bridg-model/HTML/BRIDG5.3.1/EARoot/EA2/EA51.htm
The BRIDG Conceptual Model: Biospecimen Subdomain
● Analysts from Samvit Solutions on loan from NCI CBIIT assisted in the mapping
process (Smita Hastak, Wendy Ver Hoef, Charles Yaghmour)
● Utilized a standard spreadsheet-based mapping template, widely used for other
BRIDG mapping efforts (e.g. OMOP, Sentinel, i2b2, mCODE)
● Mappings are defined as ‘paths’, rooted at the BRIDG equivalent of the mapped
ADM class (e.g. BRIDG.BiologicSpecimen for ADM.Sample)
● Mapping path for ADM.Sample.freezing_method:
BiologicSpecimen <--beAFunctionPerformedBy-- Subject <--beParticipatedInBy--
PerformedMaterialProcessStep.methodCode
WHERE PerformedMaterialProcessStep--instantiate-->DefinedMaterialProcessStep.nameCode="freeze"
● Full mapping spreadsheet located here (‘Mappings’ sheet, column K)
ADM -> BRIDG Mapping: Process and Tools
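Because the mapping paths follow a regular syntax, they can be processed mechanically. The sketch below splits the main path for ADM.Sample.freezing_method into its classes and named associations. The `parse_path` function and the regular expression are assumptions made for illustration, not part of the BRIDG mapping template, and the WHERE clause is left out for brevity.

```python
import re

PATH = (
    "BiologicSpecimen <--beAFunctionPerformedBy-- Subject "
    "<--beParticipatedInBy-- PerformedMaterialProcessStep.methodCode"
)

# An association is written either as `<--name--` or as `--name-->`,
# depending on the direction in which it is traversed.
ARROW = re.compile(r"\s*(<--\w+--|--\w+-->)\s*")

def parse_path(path):
    """Split a mapping path into (classes, associations)."""
    parts = ARROW.split(path.strip())
    classes = parts[0::2]                                  # text between arrows
    associations = [p.strip("<->") for p in parts[1::2]]   # arrow names only
    return classes, associations

classes, associations = parse_path(PATH)
print(classes)
print(associations)
```

The first class is the root of the mapping (BiologicSpecimen) and the last segment names the field that holds the data.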
ADM -> BRIDG Mapping: Covering Model Diagrams
‘Covering’ views show all the classes and patterns in the BRIDG model needed to represent the content
of a single ADM entity (shown here for ADM.Sample)
ADM -> BRIDG Mapping: Covering Model Diagrams
The yellow path traces the BRIDG mapping for ADM.Sample.preinvasive_morphology, from the
PerformedDiagnosis.value field holding the data, to the BiologicSpecimen class rooting the mapping.
Start
End
ADM -> BRIDG Mapping: Applications and Benefits
1. Provides Semantic Clarity to Source Models
a. Forces us to deeply understand the meaning and utility of each ADM element
b. Highlights areas where node models or documentation are unclear or duplicative
2. Enables Cross-Model Mappings
a. Facilitates mappings to other models mapped to BRIDG
(e.g. OMOP, Sentinel, ACT/i2b2, PCORNet, HL7 FHIR mCODE IG, ...)
b. Provides a connection to the NCI semantic infrastructure and standards
(e.g. caDSR, EVS)
3. Informs ADM -> CDM Refactoring
a. Represents a hyper-normalized counterpoint to the flat node models in the ADM,
ensuring our harmonized model is grounded in reality.
● FHIR is a data exchange model and API framework
● Primary domain is patient-level healthcare data from EHRs
● Provides a set of core resources, and a profiling mechanism that allows
implementations to add custom constraints and extensions to core resources
● Implementation Guides instruct implementers on how to assemble profiles into
exchange schema tailored for a specific community, application, or use case.
● Widely used in healthcare settings, with developing coverage of research
concepts, making it an attractive candidate for re-use or alignment in our work.
Fast Healthcare Interoperability Resources (FHIR) Model
https://www.hl7.org/fhir/index.html
Catalog and
Example
Specification
of FHIR
Resources
https://www.hl7.org/fhir/index.html, https://www.hl7.org/fhir/specimen.html
Fast Healthcare Interoperability Resources (FHIR) Model
Data model harmonization
Structure:
Syntactic
Code/Value Set:
Representation
Ontology:
Meaning in
context
Relationships:
Connections
● Adapted the BRIDG-Mapping template to accommodate FHIR mappings
● Applied the BRIDG mapping path syntax to FHIR Resource model
(so mappings expressed with same language and level of granularity)
● FHIR mapping paths are typically shorter/simpler than those for the more
highly normalized BRIDG model
● Mapping path for ADM.Sample.freezing_method:
Specimen --processing--> Processing.procedure(CodeableConcept)
● Full mapping spreadsheet located here (‘Mappings’ sheet, column S)
ADM -> FHIR Mapping: Process and Tools
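To make the FHIR mapping path concrete, the sketch below follows it through a pared-down Specimen resource written as a Python dictionary. It assumes the FHIR R4 JSON shape, in which `processing` is a repeating element and a CodeableConcept may carry a `text` value. The `processing_procedures` helper and the sample value are hypothetical.

```python
# A pared-down Specimen resource, as the Python equivalent of FHIR JSON.
specimen = {
    "resourceType": "Specimen",
    "processing": [
        {"procedure": {"text": "freeze"}},  # hypothetical value
    ],
}

def processing_procedures(resource):
    """Follow Specimen --processing--> Processing.procedure and return
    the text of each procedure CodeableConcept that is present."""
    return [
        step["procedure"].get("text")
        for step in resource.get("processing", [])
        if "procedure" in step
    ]

procedures = processing_procedures(specimen)
print(procedures)
```

The path is a single hop from the resource to the field holding the data, which is why FHIR paths tend to be shorter than their BRIDG counterparts.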
ADM -> FHIR Mapping: Covering Model Diagrams
‘Covering’ views show all the classes and patterns in the FHIR models needed to
represent the content of a single ADM entity (shown here for ADM.Sample)
ADM -> FHIR Mapping: Applications and Benefits
1. Target for Model Alignment and Re-Use
a. FHIR provided a pragmatic target to guide CDM modeling - a middle ground
between the ADM and BRIDG
2. Interoperability with Clinical Data Systems
a. Alignment may facilitate broader interoperability with clinical systems that have
adopted FHIR
3. Potential to Leverage FHIR Infrastructure and Tooling
a. Use of the FHIR metamodel and/or Resource models can let us leverage tools
supporting API implementation, data validation, and automated documentation
ADM Models
Represented
using FHIR
Metamodel,
and generated
documentation
https://fhir.hotecosystem.org/ccdh/fhir/, https://fhir.hotecosystem.org/ccdh/fhir/aliquot.html
FHIR as a Modeling Framework
FHIR Resources Models For CCDH Data Harmonization
Model in Google Sheets FHIR Resource Model (Spreadsheet)
FHIR Resource
https://fhir.hotecosystem.org/ccdh/fhir/
FHIR Publish Process
caDSR identifiers
https://github.com/HOT-Ecosystem/cadsr-from-gdrive
Data model harmonization
Structure:
Syntactic
Code/Value Set:
Representation
Ontology:
Meaning in
context
Relationships:
Connections
ISO 11179-3
CTS2
The CCDH Conceptual Domain
Model (CDM) Prototype
A Standards-Informed Refactoring of the ADM
Scope of
Phase 1
Effort
The CCDH Conceptual Domain Model
Subdomains:
● Biospecimen: Sample, Portion, Analyte, Aliquot, Slide
● Administrative: Case, Project, Program, Tissue Source Site, Center
Sources:
● CRDCs: GDC, PDC, ICDC, HTAN
● Standards: BRIDG, FHIR
Model Components Harmonized:
● Yes: Entities, Relationships, Properties
● No: Data Types, Value Sets and Terminologies
Level of Formalization:
● An abstract conceptual model exploring different modeling approaches.
● Formalization into a concrete implementation model to follow in Phase 2.
Entity-
Level View
of Model
Refactoring
The CCDH Conceptual Domain Model
Model structure before and after refactoring of the ADM into the more normalized CDM
(Administrative (blue) and Biospecimen (orange) subdomains only)
ADM CDM
refactoring
144 specimen
properties in
total
74 specimen
properties in
total
Property-
Level View
of Model
Refactoring
The CCDH Conceptual Domain Model
Harmonization of properties capturing
specimen processing methods, as
source models are aggregated and
refactored into the CDM.
● During aggregation, five separate
properties found across source node
models are merged into two
properties in the ADM.
● During refactoring of the ADM into
the CDM, these two properties get
merged into a single ‘method’
property.
● The CDM ‘method’ element provides
a more flexible and generic structure
that will accommodate any type of
method, where some semantics get
pushed into the terminology.
refactoring
ADM
(2 properties)
aggregation
Node Models
(5 properties)
aggregation
CDM
(1 property)
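The last refactoring step, in which two ADM properties become one generic CDM 'method' property, might look like the following. This is a sketch under stated assumptions: the slides do not specify the structure of the CDM 'method' element, so the `method_type` field, its values, and the `to_cdm_methods` helper are hypothetical.

```python
# Hypothetical mapping from ADM property name to a method-type term.
ADM_METHOD_PROPERTIES = {
    "freezing_method": "freezing",
    "preservation_method": "preservation",
}

def to_cdm_methods(adm_specimen):
    """Collect the ADM method properties into a list of generic entries.

    The kind of method is no longer encoded in a property name. It is
    carried as a value that a terminology can constrain.
    """
    return [
        {"method_type": method_type, "method": adm_specimen[prop]}
        for prop, method_type in ADM_METHOD_PROPERTIES.items()
        if adm_specimen.get(prop) is not None
    ]

# Placeholder values, not drawn from any node's data.
adm_specimen = {"freezing_method": "method A", "preservation_method": None}
cdm_methods = to_cdm_methods(adm_specimen)
print(cdm_methods)
```

Adding a new kind of method then requires a new term rather than a new property.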
Detailed
View of
CDM
Entities
and
Attributes
The CCDH Conceptual Domain Model
Entities in the CDM
prototype, and the
attributes held by each
Attribute count shown in
parentheses.
CDM Data
Dictionary
(link)
The CCDH Conceptual Domain Model
● The CDM prototype is presently specified as a spreadsheet-based data dictionary
● Entities and their Attributes are each described in a separate sheet
● Cardinality of attributes is specified to be as permissive as possible initially
● Data Types are minimally specified
○ Simple: declared only at a high level (limited to literal, boolean)
○ Complex: proposals for Identifier, Coding, DateTime, Quantity, . . .
● A ‘Referenced Entities’ sheet lists entities that are referenced in CDM relationships,
but are not in scope to model in this phase of work.
○ e.g. Organization, Visit, ConditionDiagnosis
● A ‘Data Containers’ sheet holds placeholders for objects that will be defined to group
sets of related properties (specific structures for these t.b.d.)
● Mappings of several types are also provided in the main Entity sheets:
○ ADM attributes that map to each CDM attribute (column L)
○ Source node attributes aggregated by these ADM attributes (column M)
○ CDM to FHIR mappings (column N)
1. Use of
Complex
Data Types
Key Features and Design Decisions
We explore the use of several complex data types to represent certain kinds of
related information
1. Identifier: groups an external identifier value with info about its source
a. avoids need for multiple source-specific identifier properties
2. Coding: formal structure for enumerated values that groups a code with its label and info
about its source
a. avoids need for separate properties for label and id
3. DateTime: supports different ways to represent a date or time (precise vs offset)
a. avoids need for different properties to capture dates in different representations or formats
Pros: concise way to represent specific types of information using fewer properties
Cons: may add level of nesting that needs to be traversed to find data
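The three complex data types could be sketched as simple record types. The field names below are assumptions made for illustration, since the CDM prototype only proposes these types at a high level. The Coding example reuses the NCIt code for 'Heavy Smoker' cited earlier in the deck.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Identifier:
    value: str
    system: str   # where the identifier comes from

@dataclass
class Coding:
    code: str
    label: str
    system: str   # source terminology

@dataclass
class DateTime:
    # One of the two representations is expected to be populated.
    precise: Optional[str] = None      # e.g. an ISO 8601 date string
    offset_days: Optional[int] = None  # days relative to an index event

# One Coding value replaces separate 'label' and 'id' properties.
smoking = Coding(code="C154510", label="Heavy Smoker", system="NCIt")

# A hypothetical date recorded as an offset rather than a precise date.
collected = DateTime(offset_days=30)
print(smoking, collected)
```

The trade-off noted in the pros and cons shows up directly: reading the label now means traversing `smoking.label` rather than a top-level property.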
2. Collapsing
Specimen
Entity
Subtypes
Key Features and Design Decisions
● A single CDM.Specimen entity covers entities distinguished at the class level in
some node models (Sample, Portion, Aliquot, Analyte, Slide)
● The Specimen.specimen_type property is used to indicate which of these more
specific types a particular instance represents.
● The goal here is to keep the initial prototype simpler, and reduce the redundancy of
properties that appear across specimen subtypes in the ADM
● This decision can be reversed if challenges are encountered, or we conclude that the
differences between these warrant an explicit entity-level distinction
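A minimal sketch of the collapsed design follows. The five type names come from the slide, while the `id` field and the `of_type` helper are hypothetical additions for the example.

```python
from dataclasses import dataclass
from enum import Enum

class SpecimenType(Enum):
    SAMPLE = "Sample"
    PORTION = "Portion"
    ALIQUOT = "Aliquot"
    ANALYTE = "Analyte"
    SLIDE = "Slide"

@dataclass
class Specimen:
    id: str                      # hypothetical identifier field
    specimen_type: SpecimenType  # replaces a class-level distinction

def of_type(specimens, wanted):
    """Select specimens of one subtype by value rather than by class."""
    return [s for s in specimens if s.specimen_type is wanted]

specimens = [
    Specimen("s1", SpecimenType.SAMPLE),
    Specimen("s2", SpecimenType.ALIQUOT),
]
aliquots = of_type(specimens, SpecimenType.ALIQUOT)
print(aliquots)
```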
3. Location of
Domain
Semantics:
‘In the Model’
vs
‘In the Data’
Key Features and Design Decisions
Where node models in the ADM lean heavily toward hard coding domain semantics in the model
itself, the CDM explores several approaches to capturing more of the semantics in the data.
Consider how Specimen composition measurements are represented:
The CompositionMeasurement object is an example of what we call 'Data Containers' in the CDM
● placeholders that will be formalized once we accrue the requirements needed to commit to
a specific type of structure.
Approaches like this let us achieve a deeper level of aggregation and harmonization, and better
accommodate future data and use cases.
Value Set = ‘non tumor tissue area’,
‘tumor tissue area’, ‘percentage tumor’,
‘percentage stroma’, ‘analysis area’, . . .
Value
Set
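The shift from semantics 'in the model' to semantics 'in the data' can be sketched as a conversion from flat, specifically named properties to a list of typed measurement objects. The value set below is the excerpt shown on the slide. The flat column names, the `type` and `value` keys, and the helper function are hypothetical, because the CompositionMeasurement structure is still a placeholder.

```python
# Excerpt of the value set shown on the slide.
MEASUREMENT_TYPES = {
    "non tumor tissue area",
    "tumor tissue area",
    "percentage tumor",
    "percentage stroma",
    "analysis area",
}

def to_composition_measurements(flat_record, column_to_type):
    """Convert specifically named columns into generic measurement objects
    whose meaning is carried by a 'type' drawn from the value set."""
    measurements = []
    for column, mtype in column_to_type.items():
        if mtype not in MEASUREMENT_TYPES:
            raise ValueError(f"unknown measurement type: {mtype}")
        if column in flat_record:
            measurements.append({"type": mtype, "value": flat_record[column]})
    return measurements

# Hypothetical flat columns and values.
record = {"percent_tumor": 80, "percent_stroma": 15}
column_to_type = {
    "percent_tumor": "percentage tumor",
    "percent_stroma": "percentage stroma",
}
measurements = to_composition_measurements(record, column_to_type)
print(measurements)
```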
Future Directions and Next
Steps
Continued Evolution Toward the CRDC-H
● Phase 1 has focused on foundational effort necessary to support more nuanced work in additional
phases
○ Phase 1 work was exploratory and the modeling abstract
● Phase 2 will provide a concrete model useful for implementation
○ Converge on a modeling and implementation approach that will work for CRDC
Continued Evolution of the CDM
Multiple streams of activity in Phase II
● Stream One: Incorporate additional CRDC
source nodes/models into the ADM (Steps 1
and 2)
○ HTAN
○ IDC
● Stream Two: Incorporate additional ADM
entities into the CDM (Steps 3 and 4)
○ Clinical subdomain entities
○ Input from stakeholders critical in
guiding this evolution
Continued Evolution of the CDM
Multiple streams of activity in Phase II
● Stream Three: Evolve the existing CDM into
an implementable logical model (Step 5)
○ Further exploration of FHIR meta-
modeling language and biolinkml as
candidate languages for representing
CRDC-H with input from nodes and
CDA
● Test / validate the current CDM prototype
○ against feedback from nodes
○ against source node data
○ against competency queries
○ against requirements from other stakeholders
● Terminology / value set harmonization
● Melissa Haendel
● Christopher Chute
● Sam Volchenboum
● Jim Balhoff
● Nicole Vasilevsky
● Harold Solbrig
● Brian Furner
● Monica Munoz-Torres
● Anne Thessen
● Bill Duncan
● Davera Gabriel
● Dazhi Jiao
Acknowledgements
Center for Cancer Data Harmonization Center for Biomedical Informatics
& Information Technology
● Allen Dearry
● Sherri de Coronado
● Melissa Cook
Samvit Solutions
● Smita Hastak
● Wendy Ver Hoef
● Charles Yaghmour
● Todd Pihl
● Resham Kulkarni
Frederick National Laboratory
for Cancer Research
● Gaurav Vaidya
● Julie McMurry
● Kat Blumhardt
● Maura Kush
● Matt Brush
● Monica Palese
● Richard Zhu
● Steven Cox
● Shahim Essaid
● Shalki Shrivastava
● Tricia Francis
EXTRA SLIDES
Source Node (aka 'Source', 'Node'): sources of data that our data models are being built to support/accommodate. Most are proper Data Commons,
some are Data Coordinating Centers, and some are related data collection efforts like HTAN.
CRDC-H = Cancer Research Data Commons Harmonized Model. This is the final, fully harmonized, implementable specification.
● Status: Not yet being developed, but the CDM will evolve into the CRDC-H as its modeling matures and we commit to a formal modeling
language/framework to specify the model.
CDM = Conceptual Domain Model. A prototype that will evolve into the final CRDC-H model. Created by refactoring the ADM into a more deeply
harmonized model, aligned with standards like BRIDG and FHIR where possible.
● Sources: Currently covers models from GDC, PDC, ICDC, HTAN.
● Scope: Currently covers only the Biospecimen and Administrative subdomains
● Status: Actively evolving. Parts are incomplete, and defined at a more abstract/conceptual level - so not suitable for implementation at this time.
ADM = Aggregated Data Model. Simple aggregation of content from source node models into a single artifact. Strictly equivalent entities and
properties are collapsed, but overall harmonization provided by the ADM is minimal.
● Sources: Currently incorporates GDC, PDC, ICDC models, and the Level 1 Biospecimen model from HTAN
● Scope: Currently covers all subdomains
● Status: Not actively evolving, but will grow as we tackle new sources and elements of their model are incorporated
Key Terms and Definitions
The Aggregated Data Model (ADM): Concept Map
GDC
PDC
ICDC
Aggregated Data
Model (ADM)
ADM -> BRIDG Mapping: Covering Model Diagrams
The yellow path traces the BRIDG mapping for ADM.Sample.freezing_method, from the
PerformedMaterialProcessStep.method field holding the data, to the BiologicSpecimen root of the mapping.
Patient vs
Research
Subject Roles
Key Features and Design Decisions
● ADM.Case entity refactored into CDM.Patient and CDM.ResearchSubject
● Provides support for the use case
of a single individual being a
research subject on more than one
study
○ Assumes there are
mechanisms in place to
de-duplicate patients who
may exist in multiple different
repositories (e.g. USI in
pediatric cancer)
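The split of ADM.Case into two entities can be sketched as follows. The field names and the `studies_by_patient` helper are assumptions for illustration. The point is only that several ResearchSubject records can refer to one de-duplicated Patient.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Patient:
    id: str

@dataclass(frozen=True)
class ResearchSubject:
    id: str
    patient_id: str  # link back to the de-duplicated Patient
    study: str

def studies_by_patient(subjects):
    """Group study names by the patient each research subject refers to."""
    grouped = defaultdict(list)
    for subject in subjects:
        grouped[subject.patient_id].append(subject.study)
    return dict(grouped)

# Hypothetical identifiers and study names.
patient = Patient("p1")
subjects = [
    ResearchSubject("rs1", patient.id, "Study A"),
    ResearchSubject("rs2", patient.id, "Study B"),
]
by_patient = studies_by_patient(subjects)
print(by_patient)
```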
ADM
attributes
Mapping
to CDM
Entities
The CCDH Conceptual Domain Model
Entities in the CDM
prototype, holding
attributes from the ADM
that map into each.
Counts of mapped ADM
attributes in parentheses.
● Concept maps support
high-level understanding and
comparison of scope and
structure
● Entities in each cmap are
annotated with a count of
properties and relationships
they contain.
● Entities are color-coded
according to the subdomain
they cover.
● Diagrams for all node models
can be found here.
I. Standardized Data Model Documentation: Concept Maps
GDC Concept Map
An excerpt of the GDC.Case entry in the Google Sheets format used to standardize documentation
across all source nodes. Complete dictionaries for GDC, PDC, and ICDC models are here.
I. Standardized Data Model Documentation: Data Dictionaries
I. Standardized Data Model Documentation: Metrics
Analysis of standardized documentation quantifies size and coverage of each model
Element Density (average P + R per E)
Model   Density
GDC     21.6
PDC     23.8
ICDC    9.8

Element Counts in Source Data Models
Model   Entity   Relationship   Property
GDC     26       34             527
PDC     21       27             473
ICDC    27       34             231
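The density figures follow directly from the element counts: density is the number of properties plus relationships, divided by the number of entities. A short check, with the counts copied from the table and the `density` function name chosen for this example:

```python
# (entities, relationships, properties) per source model, from the counts table.
counts = {
    "GDC": (26, 34, 527),
    "PDC": (21, 27, 473),
    "ICDC": (27, 34, 231),
}

def density(entities, relationships, properties):
    """Average number of properties and relationships per entity."""
    return (properties + relationships) / entities

densities = {model: round(density(*c), 1) for model, c in counts.items()}
for model, value in densities.items():
    print(model, value)
```

The ICDC model spreads a smaller number of properties across slightly more entities, which is consistent with its use of a larger set of more specialized entities.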
AD-Administrative, BP-Biospecimen Processing, BA-Biospecimen Analysis, CC-Cross-sectional Clinical, LC-Longitudinal Clinical, ST-Study, FI-File, BI-Biological.
II. Aggregated Data Model: Initial Mapping Metrics
● Metrics reflect mappings based on very strict criteria (full equivalence within an aggregated entity)
● GDC-PDC models show significant similarity (~50% E mapping and 35% P+R mapping)
● The ICDC model is very different from GDC/PDC (~30% E mapping and <5% P+R mapping).
● Many differences are related to the distinct biology and privacy considerations of the species the nodes
cover (dog vs human), and the differences in scope of the models (e.g. ICDC focus on clinical studies).
AD-Administrative, BP-Biospecimen Processing, BA-Biospecimen Analysis, CC-Cross-sectional Clinical, LC-Longitudinal Clinical, ST-Study, FI-File, BI-Biological.
II. Aggregated Data Model: Early Outcomes and Insights
● Differences in Scope - e.g. the ICDC model covers aspects of clinical trial design and execution not in
scope for GDC and PDC, but lacks a rich representation of biospecimen processing found in other models.
● Differences in Granularity - e.g. GDC model goes into much finer detail about specific tumor staging
systems and evidence than does the ICDC.
● Differences in Structure - e.g. the ICDC defines a larger set of more specialized entities to capture
clinical metadata than do GDC and PDC.
● Differences in Semantics: e.g. different elements or values are used for representing the same type of
information (gender vs sex, birth_date vs birth_year)
● Differences in Terminology - e.g. use of same term in different ways ('Study' in PDC vs ICDC), and use
of different terms for same concept (‘Treatment’ vs ‘Agent Administration’)
We have identified, and will continue to identify, many categories of differences to address in
harmonization efforts:

More Related Content

PDF
An Introduction to CCDH
PDF
A REVIEW ON RDB TO RDF MAPPING FOR SEMANTIC WEB
PDF
A unified approach for spatial data query
PDF
Refactoring Metadata:
PDF
Query Optimization Techniques in Graph Databases
PDF
A SEMANTIC RESOURCE BASED APPROACH FOR STAR SCHEMAS MATCHING
PDF
C1803041317
PDF
Converting UML Class Diagrams into Temporal Object Relational DataBase
An Introduction to CCDH
A REVIEW ON RDB TO RDF MAPPING FOR SEMANTIC WEB
A unified approach for spatial data query
Refactoring Metadata:
Query Optimization Techniques in Graph Databases
A SEMANTIC RESOURCE BASED APPROACH FOR STAR SCHEMAS MATCHING
C1803041317
Converting UML Class Diagrams into Temporal Object Relational DataBase

What's hot (17)

PDF
Practical Parallel Hypergraph Algorithms | PPoPP ’20
PPT
Modeling Search Computing Applications
PDF
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
PDF
Paper id 25201463
PDF
STIC-D: algorithmic techniques for efficient parallel pagerank computation on...
PDF
Evaluation of graph databases
PDF
Big dataintegration rahm-part3Scalable and privacy-preserving data integratio...
PDF
Business process management
PDF
data Fusion and log correlation
PPTX
Crowdsourcing tasks in Linked Data management
PPTX
A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind...
PDF
SURVEY ON SCHEDULING AND ALLOCATION IN HIGH LEVEL SYNTHESIS
PPTX
Metadata Mapping & Crosswalks
PDF
A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP...
PDF
IRJET- Data Retrieval using Master Resource Description Framework
PDF
TOPOLOGY AWARE LOAD BALANCING FOR GRIDS
PPTX
Jarrar: Architectural solutions in Data Integration
Practical Parallel Hypergraph Algorithms | PPoPP ’20
Modeling Search Computing Applications
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
Paper id 25201463
STIC-D: algorithmic techniques for efficient parallel pagerank computation on...
Evaluation of graph databases
Big dataintegration rahm-part3Scalable and privacy-preserving data integratio...
Business process management
data Fusion and log correlation
Crowdsourcing tasks in Linked Data management
A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind...
SURVEY ON SCHEDULING AND ALLOCATION IN HIGH LEVEL SYNTHESIS
Metadata Mapping & Crosswalks
A SEMANTIC BASED APPROACH FOR KNOWLEDGE DISCOVERY AND ACQUISITION FROM MULTIP...
IRJET- Data Retrieval using Master Resource Description Framework
TOPOLOGY AWARE LOAD BALANCING FOR GRIDS
Jarrar: Architectural solutions in Data Integration
Ad




CRDC-H Draft Model Presentation to Nodes

  • 1. June 2020 These slides: bit.ly/ccdh-prototype-june2020 CRDC-H Draft Model Presentation to Nodes
  • 2. Community Development (lead: Volchenboum; co-lead Vasilevsky) Data Model harmonization (lead: Chute; co-lead Furner) Ontology & Terminology Ecosystem (lead: Solbrig) Tools & Data Quality (lead: Balhoff) Program Management and operations: (lead: Haendel, co-lead Munoz-Torres) Programmatic oversight: CBIIT: Sherri De Coronado, Allen Dearry FNL: Todd Pihl, Resham Kulkarni
  • 3. From Practice-based Evidence to Evidence-based Practice Clinical Databases Registries et al. Clinical Guidelines Expert Systems Data Inference Knowledge Management Decision support Terminologies and data models provide the consistency and comparability essential for a Learning Health System Patient Encounters Medical Knowledge Terminologies Data models
  • 4. Role of CCDH in the CRDC ecosystem Facilitate retrospective and prospective semantic harmonization of data across nodes of the CRDC Coordinate the community to ensure quality “fit for purpose” design and implementation of standards that will facilitate interoperability of heterogeneous data types and CRDC resources Find agreement across the communities built around CRDC - match and extend data models - annotation, harmonization - quality assurance
  • 5. Data Model harmonization (lead: Chute co-lead: Furner) Ontology & Terminology Ecosystem (lead: Solbrig) Tools & Data Quality (lead: Balhoff) Schema to schema OMOP to FHIR Term to Term Oncotree to NCIt Data records to data records “Smoking status >7 packs per day” to NCIT:C154510 [Heavy Smoker]
  • 7. ● Goal is to support harmonization of equivalent data elements in disparate models to enable cross-node querying and data aggregation ● Node models have developed somewhat independently to fit specific use cases ○ Overall modeling space is broad: there is overlap, but each model covers unique semantic space ○ Divergence in modeling approach: equivalent entities and properties are not always captured in syntactically equivalent ways ○ Heterogeneity of source data model artifacts ● The CCDH Data Model Harmonization group is defining a shared data model for use across the CRDC, leveraging existing standards (e.g. FHIR, BRIDG) where possible. ● This harmonized model (CRDC-H) and terminological infrastructure are being designed to meet the needs of systems like the Cancer Data Aggregator (CDA) that support integrated search and metadata-based analyses across datasets in the CRDC ecosystem. Data Model Harmonization: Overview
  • 8. Data Model Harmonization: Overview ● Phase 1 has focused on the foundational effort necessary to support more nuanced work in additional phases ○ Phase 1 work was exploratory and the modeling abstract ● Phase 2 will provide a more concrete model useful for implementation ○ Converge on a modeling and implementation approach that will work for CRDC
  • 9. Five steps in the CRDC-H Model Development Workflow An iterative process through which content of source models is evaluated, aggregated, mapped, and refactored into a standards-aligned and harmonized data model. CRDC-H Model Development Workflow Abstract specification Low harmonization Not standards-aligned Concrete specification Deep harmonization Standards-aligned
  • 10. (1) Standardized concept map and spreadsheet representations of source node models provide a consistent, comparable, and computable substrate for harmonization efforts Step 1: Standardize Source Data Model Documentation CRDC-H Model Development Workflow Abstract specification Low harmonization Not standards-aligned Concrete specification Deep harmonization Standards-aligned Links: - Source model cmaps - Standardized data dictionaries
  • 11. (2) Equivalent elements are merged across sources to produce a single aggregated model, providing a unified view of all information that the final CRDC-H model must represent. Step 2. Generate an Aggregated Data Model (ADM) CRDC-H Model Development Workflow Abstract specification Low harmonization Not standards-aligned Concrete specification Deep harmonization Standards-aligned Links: - ADM cmap - ADM data dictionary - February Progress Report and Slide Deck
  • 12. (3) Mappings of ADM elements to standard models like BRIDG and FHIR facilitate understanding of source models, and development of a standards-aligned model. Step 3. Map the ADM to Community Standard Data Models CRDC-H Model Development Workflow Abstract specification Low harmonization Not standards-aligned Concrete specification Deep harmonization Standards-aligned
  • 13. (4) Deeper harmonization is achieved as ADM elements are refactored into a more normalized and standards-aligned conceptual domain model (CDM) Step 4. Refactor the ADM into a Conceptual Domain Model (CDM) CRDC-H Model Development Workflow Abstract specification Low harmonization Not standards-aligned Concrete specification Deep harmonization Standards-aligned
  • 14. (5) Mature elements of the CDM are refined into a concrete logical model, the CRDC-H, which will support implementation by CRDC nodes and the CDA Step 5. Refactor the CDM into a Logical Data Model (CRDC-H) CRDC-H Model Development Workflow Abstract specification Low harmonization Not standards-aligned Concrete specification Deep harmonization Standards-aligned
  • 15. Lessons learned from this initial deep dive will inform subsequent iterations that incorporate new data sources and domains. CRDC-H Model Development Workflow First Iteration: Biospecimen and Administrative entities from GDC, PDC, ICDC, and HTAN Abstract specification Low harmonization Not standards-aligned Concrete specification Deep harmonization Standards-aligned
  • 16. 1. The Aggregated Data Model (ADM) (Step 2) 2. Mapping the ADM to Standard Data Models (Step 3) 3. Refactoring the ADM into a CDM Prototype (Step 4) 4. Next Steps and Future Directions Outline
  • 17. The Aggregated Data Model (ADM) The substrate for mapping and refactoring efforts
  • 18. The Aggregated Data Model (ADM) The ADM represents the union of all elements across our set of source data models, where ‘equivalent’ entities and attributes are merged. ● Provides a unified view of all information that the final CRDC-H model must represent. ● Captures an initial set of entity and property mappings across sources. ● Serves as a base data model that can be evolved incrementally into a final CRDC-H
  • 19. An excerpt from the ADM Data Dictionary showing the ADM.Program.name property, which aggregates and deprecates equivalent properties from GDC, PDC, and ICDC models. Content from standardized source dictionaries is merged and reorganized in a single sheet 1. Equivalent entities are collapsed into single record, with source definitions retained (rows 1-5) 2. Within an aggregated entity (EA), properties are ordered to group those that are equivalent (rows 8-10) 3. A new ADM row is created for each unique property in the aggregated entity (‘PA’, green row 7) 4. Rows for source properties it aggregates are marked as deprecated (‘PD’, yellow rows 8-10) The Aggregated Data Model (ADM): Data Dictionary
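The row-level bookkeeping described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CCDH tooling: the `Row` layout, the `aggregate` helper, and the three source property names and definitions are hypothetical, and the decision that the source properties are equivalent is assumed to have been made by a curator beforehand.

```python
from dataclasses import dataclass


@dataclass
class Row:
    kind: str             # 'PA' = aggregated ADM property, 'PD' = deprecated source property
    name: str             # qualified property name
    definition: str = ""  # source definition, retained for reference


def aggregate(entity, adm_name, sources):
    """Return one new 'PA' row followed by a 'PD' row per aggregated source property."""
    rows = [Row("PA", f"ADM.{entity}.{adm_name}")]
    rows.extend(Row("PD", name, definition) for name, definition in sources.items())
    return rows


# Hypothetical source property names and definitions, for illustration only.
rows = aggregate("Program", "name", {
    "GDC.program.name": "Full name of the program.",
    "PDC.program.name": "Name of the program.",
    "ICDC.program.program_name": "The name of the program.",
})

for row in rows:
    print(row.kind, row.name)
```

Keeping the deprecated rows next to the aggregated row is what preserves the mapping back to each source model.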
  • 20. The Aggregated Data Model (ADM): Source Model Alignment Node models are not very well aligned at the outset ● e.g. ICDC and GDC: ~30% entity equivalence, <5% attribute equivalence Property aggregation in the ADM was based on superficial analysis and strict aggregation criteria ● Only strictly equivalent elements within strictly equivalent entities are merged Deeper aggregation and harmonization of elements will be achieved as the ADM is refactored into the CDM. The ADM as a whole is large and flat (55 entities, 984 attributes)
  • 21. Harmonization in the ADM is Minimal Examples: (1) ICDC.birthdate vs GDC.birthyear: capture the same concept at different levels of precision - not aggregated in ADM (2) GDC.gender vs ICDC.sex: capture related but distinct concepts, using the same values (M, F) - not aggregated in ADM
  • 22. Harmonization in the ADM is Minimal Examples: (3) ADM.freezing_method and ADM.preservation_method: separate properties for different types of specimen processing methods - not aggregated in ADM While harmonization achieved in the ADM is minimal, it will serve as a substrate for mapping and refactoring toward a much more deeply harmonized CDM prototype, and will maintain mappings back to elements in source node models.
  • 23. Mapping the ADM to Domain Standard Models BRIDG and FHIR
  • 24. ● The BRIDG (Biomedical Research Integrated Domain Group) Model is a UML-based Conceptual Model covering the domains of clinical and translational research ● A collaborative effort engaging stakeholders from CDISC, HL7, ISO, NCI, and the FDA ● Not an implementation model, but can be refined into a logical data model to support application in data systems. ● One common use is as a 'hub' supporting cross-model mapping between any two models that have individually been mapped to BRIDG ● Supporting infrastructure maintains computable mappings of BRIDG to community models, and links to common data elements in semantic standards like the caDSR. The BRIDG Conceptual Model http://bridgmodel.nci.nih.gov
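The 'hub' use of BRIDG mentioned on this slide amounts to composing two independent mappings. A minimal sketch, assuming each model-to-hub mapping can be reduced to a simple element-to-element dictionary (real BRIDG mappings are paths, not single elements); the function name and the FHIR-side entry are hypothetical:

```python
def compose_via_hub(a_to_hub, b_to_hub):
    """Derive A -> B candidate mappings from two mappings into a shared hub model."""
    hub_to_b = {}
    for b_elem, hub_elem in b_to_hub.items():
        hub_to_b.setdefault(hub_elem, []).append(b_elem)
    # Elements of A whose hub target has no counterpart in B map to an empty list.
    return {a_elem: hub_to_b.get(hub_elem, []) for a_elem, hub_elem in a_to_hub.items()}


adm_to_bridg = {
    "ADM.Sample": "BRIDG.BiologicSpecimen",
    "ADM.Unmapped": "BRIDG.SomethingElse",       # hypothetical entry
}
fhir_to_bridg = {
    "FHIR.Specimen": "BRIDG.BiologicSpecimen",   # assumed for illustration
}

print(compose_via_hub(adm_to_bridg, fhir_to_bridg))
```

With a hub, adding a new model requires one mapping to the hub rather than one mapping per existing model.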
  • 25. The BRIDG Concept Map shows scope of the model and high-level concepts it covers https://cbiit.github.io/bridg-model/HTML/BRIDG5.3.1/EARoot/EA1.htm The BRIDG Conceptual Model: Coverage and Scope
  • 26. The Comprehensive BRIDG UML Diagram shows attributes and relationships of all classes in the model https://cbiit.github.io/bridg-model/HTML/BRIDG5.3.1/EARoot/EA3.htm The BRIDG Conceptual Model: Full Model
  • 28. BRIDG Biospecimen View shows only modeling related to this subdomain https://cbiit.github.io/bridg-model/HTML/BRIDG5.3.1/EARoot/EA2/EA51.htm The BRIDG Conceptual Model: Biospecimen Subdomain
  • 29. ADM -> BRIDG Mapping: Process and Tools ● Analysts from Samvit Solutions on loan from NCI CBIIT assisted in the mapping process (Smita Hastak, Wendy Ver Hoef, Charles Yaghmour) ● Utilized a standard spreadsheet-based mapping template, widely used for other BRIDG mapping efforts (e.g. OMOP, Sentinel, i2b2, mCODE) ● Mappings are defined as ‘paths’, rooted at the BRIDG equivalent of the mapped ADM class (e.g. BRIDG.BiologicSpecimen for ADM.Sample) ● Mapping path for ADM.Sample.freezing_method: BiologicSpecimen <--beAFunctionPerformedBy-- Subject <--beParticipatedInBy-- PerformedMaterialProcessStep.methodCode WHERE PerformedMaterialProcessStep--instantiate-->DefinedMaterialProcessStep.nameCode="freeze" ● Full mapping spreadsheet located here (‘Mappings’ sheet, column K)
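To make the path notation on this slide easier to read, here is a small Python sketch that splits the freezing_method path into its root class, association hops, terminal attribute, and WHERE constraint. The parser and its reading of the arrows (`<--name--` as a hop traversed against the association's direction, `--name-->` as a forward hop) are assumptions made for illustration, not a documented BRIDG grammar.

```python
import re

PATH = ('BiologicSpecimen <--beAFunctionPerformedBy-- Subject '
        '<--beParticipatedInBy-- PerformedMaterialProcessStep.methodCode '
        'WHERE PerformedMaterialProcessStep--instantiate-->'
        'DefinedMaterialProcessStep.nameCode="freeze"')

# One association hop: opening marker, association name, closing marker.
ARROW = re.compile(r'\s*(<--|--)(\w+)(-->|--)\s*')


def parse_path(text):
    """Split a mapping path into root, hops, terminal attribute, and constraint."""
    path, _, where = text.partition(" WHERE ")
    parts = ARROW.split(path)        # [node, open, name, close, node, ...]
    nodes = parts[0::4]
    hops = [
        {"association": parts[i + 1],
         "direction": "reverse" if parts[i] == "<--" else "forward"}
        for i in range(1, len(parts), 4)
    ]
    terminal_class, _, attribute = nodes[-1].rpartition(".")
    return {
        "root": nodes[0],
        "hops": hops,
        "terminal_class": terminal_class,
        "attribute": attribute,
        "where": where or None,
    }


parsed = parse_path(PATH)
print(parsed["root"], [h["association"] for h in parsed["hops"]], parsed["attribute"])
```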
  • 30. ADM -> BRIDG Mapping: Covering Model Diagrams ‘Covering’ views show all the classes and patterns in the BRIDG model needed to represent the content of a single ADM entity (shown here for ADM.Sample)
  • 31. ADM -> BRIDG Mapping: Covering Model Diagrams The yellow path traces the BRIDG mapping for ADM.Sample.preinvasive_morphology, from the PerformedDiagnosis.value field holding the data, to the BiologicSpecimen class rooting the mapping. Start End
  • 32. ADM -> BRIDG Mapping: Applications and Benefits 1. Provides Semantic Clarity to Source Models a. Forces us to deeply understand the meaning and utility of each ADM element b. Highlights areas where node models or documentation are unclear or duplicative 2. Enables Cross-Model Mappings a. Facilitates mappings to other models mapped to BRIDG (e.g. OMOP, Sentinel, ACT/i2b2, PCORNet, HL7 FHIR mCODE IG, ...) b. Provides a connection to the NCI semantic infrastructure and standards (e.g. caDSR, EVS) 3. Informs ADM -> CDM Refactoring a. Represents a hyper-normalized counterpoint to the flat node models in the ADM, ensuring our harmonized model is grounded in reality.
  • 33. ● FHIR is a data exchange model and API framework ● Primary domain is patient-level healthcare data from EHRs ● Provides a set of core resources, and a profiling mechanism that allows implementations to add custom constraints and extensions to core resources ● Implementation Guides instruct implementers on how to assemble profiles into exchange schema tailored for a specific community, application, or use case. ● Widely used in healthcare settings, with developing coverage of research concepts, making it an attractive candidate for re-use or alignment in our work. Fast Healthcare Interoperability Resources (FHIR) Model https://www.hl7.org/fhir/index.html
  • 34. Catalog and Example Specification of FHIR Resources https://www.hl7.org/fhir/index.html, https://www.hl7.org/fhir/specimen.html Fast Healthcare Interoperability Resources (FHIR) Model
  • 35. Data model harmonization Structure: Syntactic Code/Value Set: Representation Ontology: Meaning in context Relationships: Connections
  • 36. ADM -> FHIR Mapping: Process and Tools ● Adapted the BRIDG-Mapping template to accommodate FHIR mappings ● Applied the BRIDG mapping path syntax to the FHIR Resource model (so mappings are expressed with the same language and level of granularity) ● FHIR mapping paths are typically shorter/simpler than those for the more highly normalized BRIDG model ● Mapping path for ADM.Sample.freezing_method: Specimen --processing--> Processing.procedure(CodeableConcept) ● Full mapping spreadsheet located here (‘Mappings’ sheet, column S)
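As a concrete reading of the FHIR path for freezing_method, the sketch below builds a minimal Specimen resource as a plain Python dict and follows `processing` to `procedure`, which FHIR types as a CodeableConcept. The code system URL, code, and display text are hypothetical placeholders, and the helper function is not part of any FHIR library.

```python
# A minimal FHIR-style Specimen resource; coded values are hypothetical.
specimen = {
    "resourceType": "Specimen",
    "processing": [
        {
            "procedure": {                       # CodeableConcept
                "coding": [{
                    "system": "http://example.org/hypothetical-codes",
                    "code": "freeze",
                    "display": "Freezing",
                }],
                "text": "Freezing",
            }
        }
    ],
}


def processing_procedures(resource):
    """Follow Specimen --processing--> procedure and return each CodeableConcept."""
    return [step["procedure"]
            for step in resource.get("processing", [])
            if "procedure" in step]


for concept in processing_procedures(specimen):
    print(concept["text"])
```

The path is a single hop here, which illustrates why FHIR paths tend to be shorter than their BRIDG counterparts.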
  • 37. ADM -> FHIR Mapping: Covering Model Diagrams ‘Covering’ views show all the classes and patterns in the FHIR models needed to represent the content of a single ADM entity (shown here for ADM.Sample)
  • 38. ADM -> FHIR Mapping: Applications and Benefits 1. Target for Model Alignment and Re-Use a. FHIR provided a pragmatic target to guide CDM modeling - a middle ground between the ADM and BRIDG 2. Interoperability with Clinical Data Systems a. Alignment may facilitate broader interoperability with clinical systems that have adopted FHIR 3. Potential to Leverage FHIR Infrastructure and Tooling a. Use of the FHIR metamodel and/or Resource models can let us leverage tools supporting API implementation, data validation, and automated documentation
  • 39. ADM Models Represented using FHIR Metamodel, and generated documentation https://fhir.hotecosystem.org/ccdh/fhir/, https://fhir.hotecosystem.org/ccdh/fhir/aliquot.html FHIR as a Modeling Framework
  • 40. FHIR Resources Models For CCDH Data Harmonization Model in Google Sheets FHIR Resource Model (Spreadsheet) FHIR Resource https://fhir.hotecosystem.org/ccdh/fhir/ FHIR Publish Process caDSR identifiers https://github.com/HOT-Ecosystem/cadsr-from-gdrive
  • 41. Data model harmonization Structure: Syntactic Code/Value Set: Representation Ontology: Meaning in context Relationships: Connections ISO 11179-3
  • 42. Data model harmonization Structure: Syntactic Code/Value Set: Representation Ontology: Meaning in context Relationships: Connections ISO 11179-3 CTS2
  • 43. The CCDH Conceptual Domain Model (CDM) Prototype A Standards-Informed Refactoring of the ADM
  • 44. Scope of Phase 1 Effort The CCDH Conceptual Domain Model Subdomains: ● Biospecimen: Sample, Portion, Analyte, Aliquot, Slide ● Administrative: Case, Project, Program, Tissue Source Site, Center Sources: ● CRDCs: GDC, PDC, ICDC, HTAN ● Standards: BRIDG, FHIR Model Components Harmonized: ● Yes: Entities, Relationships, Properties ● No: Data Types, Value Set and Terminologies Level of Formalization: ● An abstract conceptual model exploring different modeling approaches. ● Formalization into a concrete implementation model to follow in Phase 2.
  • 45. Entity-Level View of Model Refactoring The CCDH Conceptual Domain Model Model structure before and after refactoring of the ADM into the more normalized CDM (Administrative (blue) and Biospecimen (orange) subdomains only) ADM: 144 specimen properties in total. CDM: 74 specimen properties in total.
  • 46. Property-Level View of Model Refactoring The CCDH Conceptual Domain Model Harmonization of properties capturing specimen processing methods, as source models are aggregated and refactored into the CDM. ● During aggregation, five separate properties found across source node models are merged into two properties in the ADM. ● During refactoring of the ADM into the CDM, these two properties get merged into a single ‘method’ property. ● The CDM ‘method’ element provides a more flexible and generic structure that will accommodate any type of method, where some semantics get pushed into the terminology. Node Models (5 properties) -> aggregation -> ADM (2 properties) -> refactoring -> CDM (1 property)
  • 47. Detailed View of CDM Entities and Attributes The CCDH Conceptual Domain Model Entities in the CDM prototype, and the attributes held by each. Attribute count shown in parentheses.
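The last refactoring step on this slide can be sketched as a small transformation. The two ADM property names come from the earlier slides (ADM.freezing_method, ADM.preservation_method); everything else, including the output shape, the `category` key, and the example values, is a hypothetical illustration of moving the distinction from the property name into the data.

```python
# ADM property name -> category label carried in the data (labels are assumptions).
ADM_METHOD_PROPERTIES = {
    "freezing_method": "freezing",
    "preservation_method": "preservation",
}


def refactor_methods(adm_record):
    """Collapse the two ADM method properties into one generic 'method' list."""
    methods = []
    for prop, category in ADM_METHOD_PROPERTIES.items():
        value = adm_record.get(prop)
        if value is not None:
            methods.append({"category": category, "value": value})
    return {"method": methods}


# Hypothetical ADM record.
adm_sample = {"freezing_method": "snap frozen", "preservation_method": None}
print(refactor_methods(adm_sample))
```

A new kind of processing method then needs a new term rather than a new property, which is the flexibility the slide describes.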
  • 48. CDM Data Dictionary (link) The CCDH Conceptual Domain Model ● The CDM prototype is presently specified as a spreadsheet-based data dictionary ● Entities and their Attributes are each described in a separate sheet ● Cardinality of attributes is specified to be as permissive as possible initially ● Data Types are minimally specified ○ Simple: declared only at a high level (limited to literal, boolean) ○ Complex: proposals for Identifier, Coding, DateTime, Quantity, . . . ● A ‘Referenced Entities’ sheet lists entities that are referenced in CDM relationships, but are not in scope to model in this phase of work. ○ e.g. Organization, Visit, ConditionDiagnosis ● A ‘Data Containers’ sheet holds placeholders for objects that will be defined to group sets of related properties (specific structures for these t.b.d.) ● Mappings of several types are also provided in the main Entity sheets: ○ ADM attributes that map to each CDM attribute (column L) ○ Source node attributes aggregated by these ADM attributes (column M) ○ CDM to FHIR mappings (column N)
  • 49. 1. Use of Complex Data Types Key Features and Design Decisions We explore the use of several complex data types to represent certain kinds of related information 1. Identifier: groups an external identifier value with info about its source a. avoids need for multiple source-specific identifier properties 2. Coding: formal structure for enumerated values that groups a code with its label and info about its source a. avoids need for separate properties for label and id 3. DateTime: supports different ways to represent a date or time (precise vs offset) a. avoids need for different properties to capture dates in different representations or formats Pros: concise way to represent specific types of information using fewer properties Cons: may add level of nesting that needs to be traversed to find data
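The three complex data types could look roughly like the Python dataclasses below. The field names and the validation rule are assumptions made for illustration; the CDM prototype only proposes these types and does not fix their structure. The Coding example reuses the NCIt term quoted earlier in the deck.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Identifier:
    value: str      # the external identifier itself
    system: str     # where the identifier comes from


@dataclass(frozen=True)
class Coding:
    code: str
    label: str
    system: str     # terminology the code belongs to


@dataclass(frozen=True)
class DateTime:
    value: Optional[str] = None        # precise date/time, e.g. an ISO 8601 string
    offset_days: Optional[int] = None  # or: days relative to an index event
    offset_from: Optional[str] = None  # name of the index event (hypothetical field)

    def __post_init__(self):
        # Exactly one representation must be supplied.
        if (self.value is None) == (self.offset_days is None):
            raise ValueError("provide exactly one of value or offset_days")


smoker = Coding(code="C154510", label="Heavy Smoker", system="NCIT")
collected = DateTime(offset_days=30, offset_from="diagnosis")
print(smoker, collected)
```

The nesting cost noted under "Cons" is visible here: a consumer reads `smoker.code` rather than a flat `smoking_status_code` property.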
  • 50. 2. Collapsing Specimen Entity Subtypes Key Features and Design Decisions ● A single CDM.Specimen entity covers entities distinguished at the class level in some node models (Sample, Portion, Aliquot, Analyte, Slide) ● The Specimen.specimen_type property is used to indicate which of these more specific types a particular instance represents. ● The goal here is to keep the initial prototype simpler, and reduce the redundancy of properties that appear across specimen subtypes in the ADM ● This decision can be reversed if challenges are encountered, or if we conclude that the differences between these warrant an explicit entity-level distinction
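A minimal sketch of the collapsed design, assuming a single `Specimen` class whose `specimen_type` carries the former class-level distinction. The enum values mirror the subtypes listed on the slide; the `id` and `derived_from` fields and the helper are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SpecimenType(Enum):
    SAMPLE = "sample"
    PORTION = "portion"
    ALIQUOT = "aliquot"
    ANALYTE = "analyte"
    SLIDE = "slide"


@dataclass
class Specimen:
    id: str
    specimen_type: SpecimenType
    derived_from: Optional[str] = None   # hypothetical link to a parent specimen


def of_type(specimens, specimen_type):
    """Select specimens of one subtype by value instead of by class."""
    return [s for s in specimens if s.specimen_type is specimen_type]


specimens = [
    Specimen("S1", SpecimenType.SAMPLE),
    Specimen("P1", SpecimenType.PORTION, derived_from="S1"),
    Specimen("A1", SpecimenType.ALIQUOT, derived_from="P1"),
]
print([s.id for s in of_type(specimens, SpecimenType.ALIQUOT)])
```

Shared properties are declared once on `Specimen`; the trade-off is that subtype-specific constraints are no longer enforced by the class structure.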
  • 51. 3. Location of Domain Semantics: ‘In the Model’ vs ‘In the Data’ Key Features and Design Decisions Where node models in the ADM lean heavily toward hard coding domain semantics in the model itself, the CDM explores several approaches to capturing more of the semantics in the data. Consider how Specimen composition measurements are represented: [Figure: CompositionMeasurement, with Value Set = ‘non tumor tissue area’, ‘tumor tissue area’, ‘percentage tumor’, ‘percentage stroma’, ‘analysis area’, . . .] The CompositionMeasurement object is an example of what we call 'Data Containers' in the CDM ● placeholders that will be formalized once we accrue the requirements needed to commit to a specific type of structure. Approaches like this let us achieve a deeper level of aggregation and harmonization, and better accommodate future data and use cases.
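The contrast between semantics 'in the model' and 'in the data' can be sketched as follows. The value set entries are the ones listed on the slide; the `CompositionMeasurement` fields, the flat property names, and the conversion helper are hypothetical, since the slide states that the container's structure is still to be decided.

```python
from dataclasses import dataclass

# Value set entries quoted from the slide (the slide's list is open-ended).
MEASUREMENT_TYPES = {
    "non tumor tissue area",
    "tumor tissue area",
    "percentage tumor",
    "percentage stroma",
    "analysis area",
}


@dataclass
class CompositionMeasurement:
    measurement_type: str   # semantics live in this value, not in a property name
    value: float

    def __post_init__(self):
        if self.measurement_type not in MEASUREMENT_TYPES:
            raise ValueError(f"unknown measurement type: {self.measurement_type}")


# 'In the model': one hard-coded property per measurement (hypothetical names).
flat_record = {"percent_tumor": 80.0, "percent_stroma": 15.0}

# Hypothetical mapping from flat property names to value set terms.
FLAT_TO_TERM = {"percent_tumor": "percentage tumor", "percent_stroma": "percentage stroma"}


def to_measurements(record):
    """'In the data': the same content as a list of typed measurements."""
    return [CompositionMeasurement(FLAT_TO_TERM[k], v) for k, v in record.items()]


measurements = to_measurements(flat_record)
print(measurements)
```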
  • 52. Future Directions and Next Steps Continued Evolution Toward the CRDC-H
  • 53. ● Phase 1 has focused on the foundational effort necessary to support more nuanced work in additional phases ○ Phase 1 work was exploratory and the modeling abstract ● Phase 2 will provide a concrete model useful for implementation ○ Converge on a modeling and implementation approach that will work for CRDC Continued Evolution of the CDM
  • 54. Continued Evolution of the CDM Multiple streams of activity in Phase II ● Stream One: Incorporate additional CRDC source nodes/models into the ADM (Steps 1 and 2) ○ HTAN ○ IDC ● Stream Two: Incorporate additional ADM entities into the CDM (Steps 3 and 4) ○ Clinical subdomain entities ○ Input from stakeholders critical in guiding this evolution
  • 55. Continued Evolution of the CDM Multiple streams of activity in Phase II ● Stream Three: Evolve the existing CDM into an implementable logical model (Step 5) ○ Further exploration of FHIR meta-modeling language and biolinkml as candidate languages for representing CRDC-H with input from nodes and CDA ● Test / validate the current CDM prototype ○ against feedback from nodes ○ against source node data ○ against competency queries ○ against requirements from other stakeholders ● Terminology / value set harmonization
  • 56. ● Melissa Haendel ● Christopher Chute ● Sam Volchenboum ● Jim Balhoff ● Nicole Vasilevsky ● Harold Solbrig ● Brian Furner ● Monica Munoz-Torres ● Anne Thessen ● Bill Duncan ● Davera Gabriel ● Dazhi Jiao Acknowledgements Center for Cancer Data Harmonization Center for Biomedical Informatics & Information Technology ● Allen Dearry ● Sherri de Coronado ● Melissa Cook Samvit Solutions ● Smita Hastak ● Wendy Ver Hoef ● Charles Yaghmour ● Todd Pihl ● Resham Kulkarni Frederick National Laboratory for Cancer Research ● Gaurav Vaidya ● Julie McMurry ● Kat Blumhardt ● Maura Kush ● Matt Brush ● Monica Palese ● Richard Zhu ● Steven Cox ● Shahim Essaid ● Shalki Shrivastava ● Tricia Francis
  • 59. Source Node (aka 'Source', 'Node'): sources of data that our data models are being built to support/accommodate. Most are proper Data Commons, some are Data Coordinating Centers, and some are related data collection efforts like HTAN. CRDC-H = Cancer Research Data Commons Harmonized Model. This is the final, fully harmonized, implementable specification. ● Status: Not yet being developed, but the CDM will evolve into the CRDC-H as its modeling matures and we commit to a formal modeling language/framework to specify the model. CDM = Conceptual Domain Model. A prototype that will evolve into the final CRDC-H model. Created by refactoring the ADM into a more deeply harmonized model, aligned with standards like BRIDG and FHIR where possible. ● Sources: Currently covers models from GDC, PDC, ICDC, HTAN. ● Scope: Currently covers only the Biospecimen and Administrative subdomains ● Status: Actively evolving. Parts are incomplete, and defined at a more abstract/conceptual level - so not suitable for implementation at this time. ADM = Aggregated Data Model. Simple aggregation of content from source node models into a single artifact. Strictly equivalent entities and properties are collapsed, but overall harmonization provided by the ADM is minimal. ● Sources: Currently incorporates GDC, PDC, ICDC models, and the Level 1 Biospecimen model from HTAN ● Scope: Currently covers all subdomains ● Status: Not actively evolving, but will grow as we tackle new sources and elements of their models are incorporated Key Terms and Definitions
  • 60. The Aggregated Data Model (ADM): Concept Map GDC PDC ICDC Aggregated Data Model (ADM)
  • 61. ADM -> BRIDG Mapping: Covering Model Diagrams The yellow path traces the BRIDG mapping for ADM.Sample.freezing_method, from the PerformedMaterialProcessStep.method field holding the data, to the BiologicSpecimen root of the mapping.
  • 62. Patient vs Research Subject Roles Key Features and Design Decisions ● ADM.Case entity refactored into CDM.Patient and CDM.ResearchSubject ● Provides support for the use case of a single individual being a research subject on more than one study ○ Assumes there are mechanisms in place to de-duplicate patients who may exist in multiple different repositories (e.g. USI in pediatric cancer)
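The split of ADM.Case into a Patient and one or more ResearchSubject roles can be sketched like this. The field names and the grouping helper are hypothetical, and the sketch assumes, as the slide does, that patients have already been de-duplicated so that a shared `patient_id` is trustworthy.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    id: str


@dataclass(frozen=True)
class ResearchSubject:
    id: str
    patient_id: str   # the individual playing this role
    study: str        # the study in which the role is played


def studies_by_patient(subjects):
    """Group study participation by the underlying individual."""
    grouped = defaultdict(list)
    for subject in subjects:
        grouped[subject.patient_id].append(subject.study)
    return dict(grouped)


patient = Patient("PT-1")
subjects = [
    ResearchSubject("RS-1", patient.id, "Study A"),   # hypothetical studies
    ResearchSubject("RS-2", patient.id, "Study B"),
]
print(studies_by_patient(subjects))
```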
  • 63. ADM Attributes Mapping to CDM Entities The CCDH Conceptual Domain Model Entities in the CDM prototype, holding attributes from the ADM that map into each. Counts of mapped ADM attributes in parentheses.
  • 64. ● Concept maps support high-level understanding and comparison of scope and structure ● Entities in each cmap are annotated with a count of properties and relationships they contain. ● Entities are color-coded according to the subdomain they cover. ● Diagrams for all node models can be found here. I. Standardized Data Model Documentation: Concept Maps GDC Concept Map
  • 65. An excerpt of the GDC.Case entry in the Google Sheets format used to standardize documentation across all source nodes. Complete dictionaries for GDC, PDC, and ICDC models are here. I. Standardized Data Model Documentation: Data Dictionaries
  • 66. I. Standardized Data Model Documentation: Metrics Analysis of standardized documentation quantifies size and coverage of each model. Element Counts in Source Data Models (Entity / Relationship / Property): GDC 26 / 34 / 527; PDC 21 / 27 / 473; ICDC 27 / 34 / 231. Element Density (average P + R per E): GDC 21.6; PDC 23.8; ICDC 9.8. Subdomain key: AD-Administrative, BP-Biospecimen Processing, BA-Biospecimen Analysis, CC-Cross-sectional Clinical, LC-Longitudinal Clinical, ST-Study, FI-File, BI-Biological.
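The element density figures on this slide follow directly from the element counts: density is the number of properties plus relationships divided by the number of entities. A short Python check, using only the counts reported on the slide (the dictionary layout and function name are illustrative):

```python
# Element counts as reported on the slide.
COUNTS = {
    "GDC":  {"entities": 26, "relationships": 34, "properties": 527},
    "PDC":  {"entities": 21, "relationships": 27, "properties": 473},
    "ICDC": {"entities": 27, "relationships": 34, "properties": 231},
}


def element_density(counts):
    """Average number of properties and relationships per entity: (P + R) / E."""
    return (counts["properties"] + counts["relationships"]) / counts["entities"]


densities = {model: round(element_density(c), 1) for model, c in COUNTS.items()}
print(densities)   # {'GDC': 21.6, 'PDC': 23.8, 'ICDC': 9.8}
```

For GDC, for example, (527 + 34) / 26 ≈ 21.6, matching the reported value.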
  • 67. II. Aggregated Data Model: Initial Mapping Metrics ● Metrics reflect mappings based on very strict criteria (full equivalence within an aggregated entity) ● GDC-PDC models show significant similarity (~50% E mapping and 35% P+R mapping) ● The ICDC model is very different from GDC/PDC (~30% E mapping and <5% P+R mapping). ● Many differences are related to the distinct biology and privacy considerations of the species the nodes cover (dog vs human), and the differences in scope of the models (e.g. ICDC focus on clinical studies). AD-Administrative, BP-Biospecimen Processing, BA-Biospecimen Analysis, CC-Cross-sectional Clinical, LC-Longitudinal Clinical, ST-Study, FI-File, BI-Biological.
  • 68. II. Aggregated Data Model: Early Outcomes and Insights We have identified, and will continue to identify, many categories of differences to address in harmonization efforts: ● Differences in Scope - e.g. the ICDC model covers aspects of clinical trial design and execution not in scope for GDC and PDC, but lacks a rich representation of biospecimen processing found in other models. ● Differences in Granularity - e.g. the GDC model goes into much finer detail about specific tumor staging systems and evidence than does the ICDC. ● Differences in Structure - e.g. the ICDC defines a larger set of more specialized entities to capture clinical metadata than do GDC and PDC. ● Differences in Semantics - e.g. different elements or values are used for representing the same type of information (gender vs sex, birth_date vs birth_year) ● Differences in Terminology - e.g. use of the same term in different ways ('Study' in PDC vs ICDC), and use of different terms for the same concept (‘Treatment’ vs ‘Agent Administration’)