Precision Medicine in Health Agency Submissions: Biomarker Data Management
Disclaimer: The views and opinions expressed in this presentation are solely those of the author and do not represent their employer or any affiliated organizations.
Clinical biomarker management is challenging because the CDISC standards have limited support for biomarker data, particularly high-throughput genetic and genomic data. The problem is compounded by a skills gap: clinical data managers often have limited expertise in biomarkers, while biomarker specialists typically have limited understanding of clinical data management processes. As a clinical statistician working between the two groups, I offer my perspective on this matter.
CDISC standards
CDISC, which stands for Clinical Data Interchange Standards Consortium, is a global, open, multidisciplinary, non-profit organization that has established standards to support the acquisition, exchange, submission, and archiving of clinical research data and metadata.
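The main CDISC standards include CDASH (data collection), SDTM (data tabulation), ADaM (analysis datasets), SEND (nonclinical data), and the Define-XML and ODM exchange formats.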
Understanding and implementing CDISC standards is a key component of modern clinical data management, contributing to the efficiency, reliability, and regulatory compliance of clinical trials.
SDTM and relational database
The Study Data Tabulation Model (SDTM), developed by CDISC for clinical trial data, does not strictly adhere to the first three normal forms (1NF, 2NF, and 3NF) used in traditional relational database design. SDTM has its own guidelines and structures, tailored to submitting clinical trial data to regulatory agencies such as the FDA. Here is how SDTM relates to each normal form:
First normal form (1NF): SDTM datasets generally conform to 1NF, as they store data in tables with rows and columns. Each column represents a specific attribute, and each row represents a unique instance (record).
However, the concept of atomicity in SDTM can differ from that of traditional databases. SDTM often uses composite fields in which a single standardized field carries several pieces of information; for example, a --DTC variable holds both date and time in one ISO 8601 string.
Second normal form (2NF): SDTM's relationship to 2NF is more complex. SDTM datasets use keys (such as subject identifiers), but they are often deliberately denormalized for clarity and ease of use in clinical trials.
SDTM prioritizes standardization and clear documentation over eliminating all partial dependencies, which is the goal of traditional 2NF design.
Third normal form (3NF): SDTM datasets are not designed to eliminate transitive dependencies, which is the goal of 3NF. The structure of SDTM is driven by the need to represent clinical trial data clearly and comprehensively in a format that regulatory bodies can understand.
In short, SDTM focuses on standardized data representation and on the completeness and clarity of the submitted data, rather than on database normalization.
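The transitive-dependency point can be made concrete with a small sketch. It assumes a handful of made-up rows shaped like an SDTM laboratory (LB) dataset; the helper names `holds_dependency` and `normalize` are hypothetical and exist only for this illustration. Each row repeats both the test code and the test name, so the name depends on the code rather than directly on the record key.

```python
from collections import defaultdict

# Hypothetical LB-style records; the values are illustrative, not from a real study.
lb_rows = [
    {"USUBJID": "STUDY1-001", "LBSEQ": 1, "LBTESTCD": "ALT",
     "LBTEST": "Alanine Aminotransferase", "LBORRES": "32"},
    {"USUBJID": "STUDY1-001", "LBSEQ": 2, "LBTESTCD": "AST",
     "LBTEST": "Aspartate Aminotransferase", "LBORRES": "28"},
    {"USUBJID": "STUDY1-002", "LBSEQ": 1, "LBTESTCD": "ALT",
     "LBTEST": "Alanine Aminotransferase", "LBORRES": "41"},
]


def holds_dependency(rows, determinant, dependent):
    """Return True if each determinant value maps to exactly one dependent value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    return all(len(values) == 1 for values in seen.values())


def normalize(rows):
    """Split the rows into a test lookup table and a results table without LBTEST."""
    tests = {row["LBTESTCD"]: row["LBTEST"] for row in rows}
    results = [{k: v for k, v in row.items() if k != "LBTEST"} for row in rows]
    return tests, results


# LBTEST is determined by LBTESTCD, a non-key column: a transitive dependency.
print(holds_dependency(lb_rows, "LBTESTCD", "LBTEST"))

# A 3NF design would store the test names once, in a separate lookup table.
tests, results = normalize(lb_rows)
print(tests)
```

SDTM keeps the redundant column on purpose: a reviewer can read each record without joining to a lookup table, which is the trade-off described above.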
Dataset-JSON initiatives
The "Dataset-JSON" project from CDISC represents an effort to establish a standard format for exchanging and submitting clinical trial data in JSON (JavaScript Object Notation) format. This initiative is part of CDISC's ongoing efforts to modernize data standards to improve the efficiency, effectiveness, and accessibility of clinical trial data.
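To give a feel for the idea, here is a minimal sketch of a tabular dataset expressed as JSON, with column metadata stored once and the data stored as row arrays. The field names (`name`, `label`, `records`, `columns`, `rows`, `dataType`) are modeled loosely on the Dataset-JSON layout and should be treated as assumptions; a real file carries additional attributes, so consult the CDISC specification before relying on this shape. The function `to_dataset_json_like` and the demographics values are hypothetical.

```python
import json


def to_dataset_json_like(name, label, columns, rows):
    """Bundle column metadata and row arrays into one JSON-ready dictionary."""
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("each row needs one value per column")
    return {
        "name": name,
        "label": label,
        "records": len(rows),
        "columns": columns,
        "rows": rows,
    }


# Hypothetical demographics data; subject IDs and values are made up.
dm_columns = [
    {"name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string"},
    {"name": "AGE", "label": "Age", "dataType": "integer"},
    {"name": "SEX", "label": "Sex", "dataType": "string"},
]
dm_rows = [
    ["STUDY1-001", 54, "F"],
    ["STUDY1-002", 61, "M"],
]

dataset = to_dataset_json_like("DM", "Demographics", dm_columns, dm_rows)
print(json.dumps(dataset, indent=2))
```

Keeping the metadata separate from the rows avoids repeating variable names on every record, which matters for large datasets.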
Dataset-JSON is crucial for biomarker data management. To understand its significance, let's begin with an overview of biomarker basics.
VCF format for genetic data
The VCF format, which stands for Variant Call Format, is a widely used text file format in bioinformatics for storing gene sequence variations. It's particularly crucial for analyses in genomics, such as genome-wide association studies (GWAS), population genetics, and personal genomics. The format is designed to store data generated by genome sequencing projects, such as single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variants.
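The key features of the VCF format are as follows: meta-information lines begin with "##"; a single header line begins with "#CHROM" and names the columns; and each subsequent data line describes one variant using eight fixed, tab-separated fields (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO).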
Below is a sample of a VCF file containing a single record.
##fileformat=VCFv4.2
##source=SimulatedData
##reference=GRCh38
#CHROM POS ID REF ALT QUAL FILTER INFO
1 123456 . G A 50 PASS .
The JSON standard
The VCF file can be transformed into the JSON format. Here, we provide a straightforward demonstration. A more practical approach to converting VCF to JSON will be discussed subsequently.
import json

# VCF content (replace this with file reading in a real scenario).
# Real VCF files separate fields with tabs; split() below accepts any whitespace.
vcf_content = """
##fileformat=VCFv4.2
##source=SimulatedData
##reference=GRCh38
#CHROM POS ID REF ALT QUAL FILTER INFO
1 1234567 . A G 29.6 PASS DP=14;AF=0.5
""".strip().split('\n')


def parse_info(info_str):
    """Parse an INFO field such as 'DP=14;AF=0.5' into a dictionary."""
    info_dict = {}
    if info_str == '.':  # "." marks an empty INFO field
        return info_dict
    for item in info_str.split(';'):
        # Flag entries carry no "=", so they are stored as True
        key, sep, value = item.partition('=')
        info_dict[key] = value if sep else True
    return info_dict


# Parse the data lines, skipping meta-information and header lines
vcf_records = []
for line in vcf_content:
    if not line.startswith('#'):
        parts = line.split()
        record = {
            "chrom": parts[0],
            "pos": int(parts[1]),
            "id": parts[2],
            "ref": parts[3],
            "alt": parts[4],
            # "." means the quality score is missing
            "qual": None if parts[5] == '.' else float(parts[5]),
            "filter": parts[6],
            "info": parse_info(parts[7])
        }
        vcf_records.append(record)

# Convert to JSON
json_output = json.dumps(vcf_records, indent=2)
print(json_output)
[
  {
    "chrom": "1",
    "pos": 1234567,
    "id": ".",
    "ref": "A",
    "alt": "G",
    "qual": 29.6,
    "filter": "PASS",
    "info": {
      "DP": "14",
      "AF": "0.5"
    }
  }
]
This shows how JSON can serve as an intermediate step in bridging genetic biomarker data with the CDISC Genomics Findings (GF) domain.
In practice, a VCF file can be converted into HL7 FHIR format, serialized as JSON, using the 'vcf2fhir' tool. Detailed instructions and information are available at: https://vcf2fhir.readthedocs.io/en/latest/
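One way to picture that bridge is to reshape a parsed VCF record into long-format, findings-style rows, one row per test. The sketch below is only an assumption about how such a mapping could look: the test codes `REFALLEL` and `ALTALLEL`, the `LOCATION` column, the study and subject identifiers, and the function `vcf_record_to_gf_rows` are all hypothetical placeholders, not values taken from the SDTM Implementation Guide or CDISC controlled terminology.

```python
def vcf_record_to_gf_rows(record, studyid, usubjid, start_seq=1):
    """Expand one parsed VCF record into long-format, findings-style rows."""
    # Placeholder location label; not an official SDTM variable or format.
    location = f"chr{record['chrom']}:{record['pos']}"
    # Hypothetical test codes and names, used for illustration only.
    tests = [
        ("REFALLEL", "Reference Allele", record["ref"]),
        ("ALTALLEL", "Alternate Allele", record["alt"]),
    ]
    rows = []
    for offset, (testcd, test, result) in enumerate(tests):
        rows.append({
            "STUDYID": studyid,
            "DOMAIN": "GF",
            "USUBJID": usubjid,
            "GFSEQ": start_seq + offset,
            "GFTESTCD": testcd,
            "GFTEST": test,
            "GFORRES": result,
            "LOCATION": location,
        })
    return rows


# The record mirrors the JSON produced by the VCF parsing example.
record = {"chrom": "1", "pos": 1234567, "id": ".", "ref": "A", "alt": "G",
          "qual": 29.6, "filter": "PASS", "info": {"DP": "14", "AF": "0.5"}}

gf_rows = vcf_record_to_gf_rows(record, studyid="STUDY1", usubjid="STUDY1-001")
for row in gf_rows:
    print(row)
```

A production mapping would take its variable names and test codes from the current SDTM Implementation Guide rather than from this sketch.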
HL7 FHIR
Clinical data managers are normally familiar with the CDISC standards but not with HL7 (Health Level Seven International, https://www.hl7.org/). However, for biomarker data management, we can learn a lot from HL7. A related resource is the FAIR Cookbook, hosted by ELIXIR: https://faircookbook.elixir-europe.org/content/home.html. I have reproduced HL7's diagram here, illustrating the conversion of a VCF file to both HL7 FHIR and JSON formats.
Visualization using D3.js
JSON pairs naturally with D3.js for data visualization: because JSON is native to JavaScript, JSON-formatted data can be passed to D3.js without a conversion step. Below is a simple demonstration of how D3.js can display gene mutations from a JSON object.
<!DOCTYPE html>
<html>
<head>
  <title>Gene-ABC Gene Sequence with Mutation Highlights</title>
  <script src="https://d3js.org/d3.v6.min.js"></script>
  <style>
    .nucleotide {
      stroke: #fff;
      stroke-width: 1px;
    }
    .mutation {
      fill: red;
    }
    text {
      font-size: 10px;
      text-anchor: middle;
    }
    .tooltip {
      position: absolute;
      text-align: center;
      width: 120px;
      height: auto;
      padding: 2px;
      font: 12px sans-serif;
      background: lightsteelblue;
      border: 0px;
      border-radius: 8px;
      pointer-events: none;
      opacity: 0;
    }
    h1 {
      text-align: center;
    }
    .patient-label {
      font-size: 16px;
      font-weight: bold;
    }
  </style>
</head>
<body>
  <h1>Gene-ABC Mutations</h1>
  <div style="display: flex; align-items: center;">
    <div class="patient-label">Patient 1</div>
    <svg width="1000" height="40"></svg>
  </div>
  <div class="tooltip"></div>
  <script>
    document.addEventListener("DOMContentLoaded", function() {
      // JSON data: one gene sequence plus its mutations (positions are 1-based)
      var data = {
        "gene": "Gene-ABC",
        "gene_sequence": "AGCTTGCCGATGGCGTAGGCA...",
        "mutations": [
          {
            "mutation_id": "G12D",
            "position": 12,
            "nucleotide_change": "G>A",
            "amino_acid_change": "Gly>Asp"
          },
          // ... more mutations ...
        ]
      };
      var geneSequence = data.gene_sequence;
      var mutations = data.mutations;
      var mutationPositions = mutations.map(m => m.position);
      var svg = d3.select("svg"),
          width = +svg.attr("width"),
          height = +svg.attr("height");
      var nucleotideWidth = width / geneSequence.length;
      var tooltip = d3.select(".tooltip");

      // Draw one rectangle per nucleotide; mutated positions are filled red
      var nucleotides = svg.selectAll("rect")
        .data(geneSequence.split(''))
        .enter()
        .append("rect")
        .attr("class", "nucleotide")
        .attr("x", (d, i) => i * nucleotideWidth)
        .attr("y", 0)
        .attr("width", nucleotideWidth)
        .attr("height", height)
        .style("fill", (d, i) => mutationPositions.includes(i + 1) ? 'red' : 'lightgray');

      // Show a tooltip with the mutation details when hovering over a mutated position
      nucleotides.on("mouseover", function(event, d) {
          var index = nucleotides.nodes().indexOf(this);
          var mutation = mutations.find(m => m.position === index + 1);
          if (mutation) {
            tooltip.transition()
              .duration(200)
              .style("opacity", .9);
            tooltip.html("Position: " + (index + 1) + "<br>Mutation: " + mutation.mutation_id + "<br>Nucleotide Change: " + mutation.nucleotide_change + "<br>Amino Acid Change: " + mutation.amino_acid_change)
              .style("left", (event.pageX + 5) + "px")
              .style("top", (event.pageY - 28) + "px");
          }
        })
        .on("mouseout", function() {
          tooltip.transition()
            .duration(500)
            .style("opacity", 0);
        });

      // Label each rectangle with its nucleotide letter
      svg.selectAll("text")
        .data(geneSequence.split(''))
        .enter()
        .append("text")
        .attr("x", (d, i) => i * nucleotideWidth + nucleotideWidth / 2)
        .attr("y", height / 2)
        .text(d => d);
    });
  </script>
</body>
</html>
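The JSON object embedded in the page above would normally be generated upstream rather than typed by hand. As a rough sketch of that step, the Python function below assembles the same structure and checks each mutation against the sequence before serializing it. The function name `build_mutation_payload` is hypothetical, and the sequence is only the visible prefix of the elided sequence shown in the page, so the example is illustrative rather than biologically meaningful.

```python
import json


def build_mutation_payload(gene, sequence, mutations):
    """Validate mutations against the sequence and return the payload as JSON text."""
    for mutation in mutations:
        position = mutation["position"]  # 1-based, as in the D3.js page
        if not 1 <= position <= len(sequence):
            raise ValueError(f"position {position} is outside the sequence")
        # The base before ">" must match the sequence at that position.
        reference_base = mutation["nucleotide_change"].split(">")[0]
        if sequence[position - 1] != reference_base:
            raise ValueError(f"reference base mismatch at position {position}")
    payload = {"gene": gene, "gene_sequence": sequence, "mutations": mutations}
    return json.dumps(payload, indent=2)


# Visible prefix of the example sequence; the rest is elided in the page above.
sequence = "AGCTTGCCGATGGCGTAGGCA"
mutations = [
    {
        "mutation_id": "G12D",
        "position": 12,
        "nucleotide_change": "G>A",
        "amino_acid_change": "Gly>Asp",
    }
]

payload_text = build_mutation_payload("Gene-ABC", sequence, mutations)
print(payload_text)
```

Validating positions before the data reaches the browser keeps the visualization code simple, since the page can assume every mutation falls inside the plotted sequence.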
Concluding Remarks
In conclusion, managing clinical biomarker data, particularly genetic data sourced from central labs, is a critical component in supporting clinical submissions for targeted therapy drug development. This article proposed a practical approach to managing clinical genetic data, drawing on the Dataset-JSON initiative and HL7's data standards. Clinical biomarker data managers in the pharmaceutical industry can apply ideas from HL7 when designing their own data management systems.
Disclaimer: The development of this article was assisted by ChatGPT.