Precision Medicine in Health Agency Submissions: Biomarker Data Management

Disclaimer: The views and opinions expressed in this presentation are solely those of the author and do not represent those of the author's employer or any affiliated organizations.

Clinical biomarker management presents significant challenges due to the limitations of the CDISC system in effectively handling biomarker data, particularly in the context of high-throughput genetic and genomic information. This inadequacy is compounded by the fact that clinical data managers often possess limited expertise in biomarkers, while biomarker specialists typically have a restricted understanding of clinical data management processes. As a clinical statistician, I am positioned to offer a unique perspective on this matter.

CDISC standards

CDISC, which stands for Clinical Data Interchange Standards Consortium, is a global, open, multidisciplinary, non-profit organization that has established standards to support the acquisition, exchange, submission, and archiving of clinical research data and metadata.

The main components of CDISC include:

SDTM (Study Data Tabulation Model): A standard for organizing and formatting data to submit to regulatory authorities. It includes standards for data structure and content.
CDASH (Clinical Data Acquisition Standards Harmonization): Provides standards for data collection forms and processes in clinical trials.
ADaM (Analysis Data Model): Used for the preparation of statistical analysis datasets.
ODM (Operational Data Model): Facilitates data exchange between different systems used in clinical research.
SEND (Standard for Exchange of Nonclinical Data): Similar to SDTM, but for nonclinical data.

Understanding and implementing CDISC standards is a key component of modern clinical data management, contributing to the efficiency, reliability, and regulatory compliance of clinical trials.

SDTM and relational database

The Study Data Tabulation Model (SDTM), developed by CDISC for clinical trial data, doesn't strictly adhere to the first three normal forms (1NF, 2NF, and 3NF) used in traditional relational database design. SDTM has its own set of guidelines and structures that are specifically tailored for clinical trial data submission to regulatory agencies like the FDA. Here's how SDTM relates to the normal forms:

First Normal Form (1NF):

SDTM datasets generally conform to 1NF as they store data in tables with rows and columns. Each column represents a specific attribute, and each row represents a unique instance (record).

However, the concept of atomicity in SDTM might differ from traditional databases. For example, SDTM often uses composite fields in which a single field carries multiple pieces of information, such as an ISO 8601 date/time value that holds both the date and the time, although these are standardized.
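
To make the atomicity point concrete, here is a minimal sketch, assuming the composite field is an ISO 8601 date/time value of the kind SDTM stores in its --DTC variables. The function name split_dtc and the sample value are illustrative and not part of any CDISC tool.

```python
from datetime import datetime

def split_dtc(dtc):
    """Split a complete ISO 8601 date/time string into atomic parts.

    Assumes a complete value such as "2023-12-03T14:30"; partial dates,
    which SDTM also permits, are not handled in this sketch.
    """
    parsed = datetime.fromisoformat(dtc)
    return {
        "year": parsed.year,
        "month": parsed.month,
        "day": parsed.day,
        "hour": parsed.hour,
        "minute": parsed.minute,
    }

# Hypothetical collection date/time stored as one composite field
parts = split_dtc("2023-12-03T14:30")
print(parts)
```

A strictly atomic design would store each of these parts in its own column; SDTM keeps the single standardized string instead.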

Second Normal Form (2NF):

SDTM's relationship to 2NF is more complex. While SDTM datasets use keys (like subject identifiers), these datasets are often denormalized for the sake of clarity and ease of use in the context of clinical trials.

SDTM focuses more on standardization and clear documentation rather than eliminating all partial dependencies, as seen in traditional 2NF design.
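
A partial dependency can be illustrated with a small sketch. It assumes a simplified laboratory-style findings table keyed by (USUBJID, LBTESTCD, VISITNUM); real SDTM keys are defined per study, and the subject identifiers and results below are made up.

```python
# Hypothetical, simplified findings rows keyed by (USUBJID, LBTESTCD, VISITNUM).
# LBTEST depends only on LBTESTCD, which is a partial dependency on the key.
rows = [
    {"USUBJID": "S1-001", "LBTESTCD": "GLUC", "LBTEST": "Glucose", "VISITNUM": 1, "LBORRES": "5.4"},
    {"USUBJID": "S1-001", "LBTESTCD": "GLUC", "LBTEST": "Glucose", "VISITNUM": 2, "LBORRES": "5.9"},
    {"USUBJID": "S1-002", "LBTESTCD": "GLUC", "LBTEST": "Glucose", "VISITNUM": 1, "LBORRES": "6.1"},
]

# A 2NF design would move the code-to-name mapping into its own table ...
test_lookup = {row["LBTESTCD"]: row["LBTEST"] for row in rows}

# ... and keep only attributes that depend on the whole key in the results table.
results = [{k: v for k, v in row.items() if k != "LBTEST"} for row in rows]

print(test_lookup)
print(len(results), "result rows without the repeated test name")
```

SDTM keeps the test name on every row because a reviewer reading a single dataset should not need a lookup table to interpret it.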

Third Normal Form (3NF):

SDTM datasets are not strictly designed to eliminate transitive dependencies, as is the goal in 3NF. The structure of SDTM is more driven by the need to clearly and comprehensively represent clinical trial data in a format that is understandable to regulatory bodies.

The focus of SDTM is on standardization of data representation and ensuring completeness and clarity of the data for submission, rather than on optimizing database normalization principles like in 3NF.
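
The transitive dependency can be sketched in the same way. The example assumes a simplified demographics table in which the arm description (ARM) is determined by the arm code (ARMCD), which in turn is determined by the subject; the values are made up.

```python
# Hypothetical, simplified demographics rows keyed by USUBJID.
# ARM is determined by ARMCD, and ARMCD by USUBJID: a transitive dependency.
dm_rows = [
    {"USUBJID": "S1-001", "ARMCD": "TRT", "ARM": "Active Treatment"},
    {"USUBJID": "S1-002", "ARMCD": "PBO", "ARM": "Placebo"},
    {"USUBJID": "S1-003", "ARMCD": "TRT", "ARM": "Active Treatment"},
]

# 3NF would store each arm description once, keyed by its code ...
arms = {row["ARMCD"]: row["ARM"] for row in dm_rows}
# ... and leave only the code on the subject-level table.
subjects = [{"USUBJID": r["USUBJID"], "ARMCD": r["ARMCD"]} for r in dm_rows]

# The SDTM-style flat row can be rebuilt with a join when needed.
rebuilt = [dict(s, ARM=arms[s["ARMCD"]]) for s in subjects]
print(rebuilt == dm_rows)
```

The normalized and flat forms carry the same information; SDTM simply chooses the flat form for readability.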

Dataset-JSON initiatives

The "Dataset-JSON" project from CDISC represents an effort to establish a standard format for exchanging and submitting clinical trial data in JSON (JavaScript Object Notation) format. This initiative is part of CDISC's ongoing efforts to modernize data standards to improve the efficiency, effectiveness, and accessibility of clinical trial data.
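
As a rough illustration of the idea, the sketch below serializes a tiny tabulation dataset as JSON, keeping column metadata and row arrays separate. The key names ("name", "columns", "rows") and the helper to_dataset_like_json are simplifications chosen for this example. They are not copied from the Dataset-JSON specification, which should be consulted for the actual schema.

```python
import json

def to_dataset_like_json(name, columns, records):
    """Serialize records as column metadata plus row arrays (illustrative layout)."""
    return json.dumps(
        {
            "name": name,
            "columns": [{"name": c} for c in columns],
            # Each row is a list ordered like the columns; missing values become null
            "rows": [[rec.get(c) for c in columns] for rec in records],
        },
        indent=2,
    )

# Hypothetical demographics records
records = [
    {"USUBJID": "S1-001", "AGE": 54, "SEX": "F"},
    {"USUBJID": "S1-002", "AGE": 61, "SEX": "M"},
]
dataset_json = to_dataset_like_json("DM", ["USUBJID", "AGE", "SEX"], records)
print(dataset_json)
```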

Dataset-JSON is crucial for biomarker data management. To understand its significance, let's begin with an overview of biomarker basics.

VCF format for genetic data

The VCF format, which stands for Variant Call Format, is a widely used text file format in bioinformatics for storing gene sequence variations. It's particularly crucial for analyses in genomics, such as genome-wide association studies (GWAS), population genetics, and personal genomics. The format is designed to store data generated by genome sequencing projects, such as single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variants.

Here's a breakdown of the key features of the VCF format:

Header Lines: The VCF file begins with header lines, indicated by a "##" prefix. These lines contain meta-information, such as the VCF format version, reference genome, file creation date, and descriptions of the fields in the data section.
Column Header Line: Following the header lines, there's a line starting with a single "#", which names the columns in the data section. The standard columns are:
#CHROM: The chromosome (or contig) on which the variant lies, for example 1 or X.
POS: The position of the variant on the chromosome.
ID: An identifier for the variant, if available.
REF: The reference base(s) at this position in the reference genome.
ALT: The alternate base(s) at this position (the variant).
QUAL: A quality score for the variant call.
FILTER: Indicates if the variant passed quality filters.
INFO: Additional information about the variant. This field can contain multiple semicolon-separated subfields.
Sample Columns (optional): If the VCF file includes genotype information for individuals, a FORMAT column follows INFO, and each individual's data appear in an additional column after it.
Data Lines: Each line after the column header represents a genetic variant, with fields separated by tabs. The data corresponds to the columns described above.
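
The three kinds of lines described above can be told apart by their prefix. The following is a minimal sketch, assuming tab-delimited input; the function name split_vcf_sections and the example text are made up for illustration.

```python
def split_vcf_sections(text):
    """Separate VCF text into meta lines, column names, and data rows."""
    meta, columns, rows = [], [], []
    for line in text.strip().splitlines():
        if line.startswith("##"):
            meta.append(line[2:])            # meta-information, e.g. "fileformat=VCFv4.2"
        elif line.startswith("#"):
            columns = line[1:].split("\t")   # column header line: CHROM, POS, ...
        elif line:
            rows.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, rows

# Small made-up example; real VCF files are tab-delimited
example = (
    "##fileformat=VCFv4.2\n"
    "##reference=GRCh38\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t123456\t.\tG\tA\t50\tPASS\t.\n"
)
meta, columns, rows = split_vcf_sections(example)
print(meta)
print(columns)
print(rows[0]["REF"], rows[0]["ALT"])
```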

Below is a sample of a VCF file containing a single record.

##fileformat=VCFv4.2
##source=SimulatedData
##reference=GRCh38
#CHROM POS ID REF ALT QUAL FILTER INFO
1 123456 . G A 50 PASS .

The JSON standard

The VCF file can be transformed into the JSON format. Here, we provide a straightforward demonstration. A more practical approach to converting VCF to JSON will be discussed subsequently.

import json

# VCF content (you can replace this with file reading in a real scenario)
vcf_content = """
##fileformat=VCFv4.2
##source=SimulatedData
##reference=GRCh38
#CHROM POS ID REF ALT QUAL FILTER INFO
1 1234567 . A G 29.6 PASS DP=14;AF=0.5
""".strip().split('\n')

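# Function to parse the INFO field into a dictionary
def parse_info(info_str):
    info_dict = {}
    # "." means the INFO field is empty
    if info_str == '.':
        return info_dict
    for item in info_str.split(';'):
        if '=' in item:
            # Split on the first '=' only, so values containing '=' survive
            key, value = item.split('=', 1)
            info_dict[key] = value
        else:
            # Flag entries carry no value; record them as True
            info_dict[item] = True
    return info_dict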

# Parsing the VCF content
vcf_records = []
for line in vcf_content:
    if not line.startswith('#'):
        parts = line.split()
        record = {
            "chrom": parts[0],
            "pos": int(parts[1]),
            "id": parts[2],
            "ref": parts[3],
            "alt": parts[4],
            "qual": None if parts[5] == '.' else float(parts[5]),  # "." means missing
            "filter": parts[6],
            "info": parse_info(parts[7])
        }
        vcf_records.append(record)

# Convert to JSON
json_output = json.dumps(vcf_records, indent=2)
print(json_output)

The script prints the following JSON:

[
  {
    "chrom": "1",
    "pos": 1234567,
    "id": ".",
    "ref": "A",
    "alt": "G",
    "qual": 29.6,
    "filter": "PASS",
    "info": {
      "DP": "14",
      "AF": "0.5"
    }
  }
]

With the variant expressed as structured JSON, it becomes evident that JSON can serve as a bridging step between genetic biomarker data and the CDISC GF (Genomics Findings) domain.
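
To sketch what that bridge might look like, the function below maps one parsed VCF record to a GF-style row. It relies only on the generic SDTM Findings pattern (STUDYID, DOMAIN, USUBJID, --SEQ, --TESTCD, --TEST, --ORRES). The test code and name values and the LOCATION and REFALLELE keys are placeholders invented for this example; they are not CDISC controlled terminology or official GF variables, and the function name is hypothetical.

```python
def vcf_record_to_gf_like(record, studyid, usubjid, seq):
    """Map one parsed VCF record (a dict) to a GF-style findings row.

    Placeholder values and keys are marked below; consult the SDTM
    Implementation Guide for the actual GF variables and terminology.
    """
    return {
        "STUDYID": studyid,
        "DOMAIN": "GF",
        "USUBJID": usubjid,
        "GFSEQ": seq,
        "GFTESTCD": "VARIANT",              # placeholder test code
        "GFTEST": "Variant Allele",         # placeholder test name
        "GFORRES": record["alt"],           # observed (alternate) allele
        "LOCATION": f'{record["chrom"]}:{record["pos"]}',  # placeholder key
        "REFALLELE": record["ref"],         # placeholder key
    }

# Record in the shape produced by the parsing script above
record = {"chrom": "1", "pos": 1234567, "id": ".", "ref": "A", "alt": "G",
          "qual": 29.6, "filter": "PASS", "info": {"DP": "14", "AF": "0.5"}}
gf_row = vcf_record_to_gf_like(record, studyid="STUDY01", usubjid="STUDY01-001", seq=1)
print(gf_row)
```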

In practice, a VCF file can be converted into JSON format using the tool 'vcf2fhir' by HL7. Detailed instructions and information are available at: https://vcf2fhir.readthedocs.io/en/latest/

HL7 FHIR

Clinical data managers are normally familiar with the CDISC system but not with HL7 (Health Level Seven International, https://www.hl7.org/). However, for biomarker data management, we could learn a lot from HL7. Here is the link for the FAIR Cookbook (hosted by ELIXIR Europe): https://faircookbook.elixir-europe.org/content/home.html. I have reproduced HL7's diagram here, illustrating the conversion of a VCF file to both HL7 and JSON formats.
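
To give a feel for the target format, the sketch below wraps one variant in a dictionary shaped like a FHIR Observation resource (resourceType, status, code, subject, component). It is deliberately simplified: codes are given as free text rather than the coded terminology a conformant genomics resource would require, the function name and patient identifier are hypothetical, and the result is not the output of vcf2fhir.

```python
import json

def variant_to_observation_like(record, patient_id):
    """Wrap one parsed VCF record in a FHIR-Observation-shaped dict.

    Simplified for illustration: free-text codes only, no profile conformance.
    """
    def component(label, value):
        return {"code": {"text": label}, "valueString": str(value)}

    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"text": "Genetic variant"},
        "subject": {"reference": f"Patient/{patient_id}"},
        "component": [
            component("Chromosome", record["chrom"]),
            component("Position", record["pos"]),
            component("Reference allele", record["ref"]),
            component("Alternate allele", record["alt"]),
        ],
    }

observation = variant_to_observation_like(
    {"chrom": "1", "pos": 1234567, "ref": "A", "alt": "G"},
    patient_id="example-patient",  # hypothetical identifier
)
print(json.dumps(observation, indent=2))
```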

Visualization using D3.js

JSON pairs naturally with D3.js for data visualization because JSON is native to JavaScript: a parsed JSON object can be bound to D3.js selections directly, with no conversion step. Below is a straightforward demonstration of how D3.js can be used to display gene mutations held in a JSON object.

<!DOCTYPE html>
<html>
<head>
    <title>Gene-ABC Gene Sequence with Mutation Highlights</title>
    <script src="https://d3js.org/d3.v6.min.js"></script>
    <style>
        .nucleotide {
            stroke: #fff;
            stroke-width: 1px;
        }
        .mutation {
            fill: red;
        }
        text {
            font-size: 10px;
            text-anchor: middle;
            /* Let hover events reach the rectangles underneath the letters */
            pointer-events: none;
        }
        .tooltip {
            position: absolute;
            text-align: center;
            width: 120px;
            height: auto;
            padding: 2px;
            font: 12px sans-serif;
            background: lightsteelblue;
            border: 0px;
            border-radius: 8px;
            pointer-events: none;
            opacity: 0;
        }
        h1 {
            text-align: center;
        }
        .patient-label {
            font-size: 16px;
            font-weight: bold;
        }
    </style>
</head>
<body>
    <h1>Gene-ABC Mutations</h1>
    <div style="display: flex; align-items: center;">
        <div class="patient-label">Patient 1</div>
        <svg width="1000" height="40"></svg>
    </div>
    <div class="tooltip"></div>
    <script>
        document.addEventListener("DOMContentLoaded", function() {
            var data = {
                "gene": "Gene-ABC",
                "gene_sequence": "AGCTTGCCGATGGCGTAGGCA...",
                "mutations": [
                    {
                        "mutation_id": "G12D",
                        "position": 12,
                        "nucleotide_change": "G>T",
                        "amino_acid_change": "Gly>Asp"
                    },
                    // ... more mutations ...
                ]
            };

            var geneSequence = data.gene_sequence;
            var mutations = data.mutations;
            var mutationPositions = mutations.map(m => m.position);

            var svg = d3.select("svg"),
                width = +svg.attr("width"),
                height = +svg.attr("height");

            var nucleotideWidth = width / geneSequence.length;

            var tooltip = d3.select(".tooltip");

            var nucleotides = svg.selectAll("rect")
                .data(geneSequence.split(''))
                .enter()
                .append("rect")
                .attr("class", "nucleotide")
                .attr("x", (d, i) => i * nucleotideWidth)
                .attr("y", 0)
                .attr("width", nucleotideWidth)
                .attr("height", height)
                .style("fill", (d, i) => mutationPositions.includes(i + 1) ? 'red' : 'lightgray');

            nucleotides.on("mouseover", function(event, d) {
                var index = nucleotides.nodes().indexOf(this);
                var mutation = mutations.find(m => m.position === index + 1);
                if (mutation) {
                    tooltip.transition()
                        .duration(200)
                        .style("opacity", .9);
                    tooltip.html("Position: " + (index + 1) + "<br>Mutation: " + mutation.mutation_id + "<br>Nucleotide Change: " + mutation.nucleotide_change + "<br>Amino Acid Change: " + mutation.amino_acid_change)
                        .style("left", (event.pageX + 5) + "px")
                        .style("top", (event.pageY - 28) + "px");
                }
            })
            .on("mouseout", function() {
                tooltip.transition()
                    .duration(500)
                    .style("opacity", 0);
            });

            svg.selectAll("text")
                .data(geneSequence.split(''))
                .enter()
                .append("text")
                .attr("x", (d, i) => i * nucleotideWidth + nucleotideWidth / 2)
                .attr("y", height / 2)
                .text(d => d);
        });
    </script>
</body>
</html>
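
The data object is hard-coded in the page above. As a sketch of how it might be produced upstream, the Python snippet below builds the same JSON shape and checks that each mutation position falls inside the sequence. The function name build_gene_payload is hypothetical, and the sequence is only the visible prefix from the example, used here for illustration.

```python
import json

def build_gene_payload(gene, sequence, mutations):
    """Assemble the JSON structure that the D3.js page reads."""
    for m in mutations:
        # Positions are 1-based, matching the (i + 1) comparison in the page
        if not 1 <= m["position"] <= len(sequence):
            raise ValueError(f"position {m['position']} is outside the sequence")
    return {"gene": gene, "gene_sequence": sequence, "mutations": mutations}

payload = build_gene_payload(
    "Gene-ABC",
    "AGCTTGCCGATGGCGTAGGCA",  # illustrative sequence without the elided tail
    [{"mutation_id": "G12D", "position": 12,
      "nucleotide_change": "G>T", "amino_acid_change": "Gly>Asp"}],
)
highlighted = sorted(m["position"] for m in payload["mutations"])
print(json.dumps(payload, indent=2))
print(highlighted)
```

The printed JSON could be saved to a file and loaded by the page in place of the inline object.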

Concluding Remarks

In conclusion, managing clinical biomarker data, particularly genetic data sourced from central labs, is a critical component in supporting clinical submissions for targeted therapy drug development. In this discussion, we proposed a practical approach for managing clinical genetic data, drawing inspiration from the Dataset-JSON initiative and HL7's data standards. Pharmaceutical clinical biomarker data managers can leverage insights from the HL7 system to develop cutting-edge data management systems.

Disclaimer: The development of this article was assisted by ChatGPT.

Kui Shen

Senior Director of Clinical Statistics at Bayer | Bayer Science Fellow
