What does Prospective Data F.A.I.R.ification mean, and why should you care?
A Structured Approach to Ensure Maximum Value Extraction from your Data
Introduction
In 2016, the ‘F.A.I.R. Guiding Principles for scientific data management and stewardship’ were published in Scientific Data.
The authors (Barend Mons et al., Leiden University) provided clear guidelines to improve the
- Findability
- Accessibility
- Interoperability and
- Reuse of digital assets
“The principles emphasise machine-actionability (i.e. the capacity of computational systems to find, access, interoperate, and reuse data with none or minimal human intervention) as humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity, and creation speed of data.”
While the F.A.I.R. Data Principles hold their value across various industries, in this article I will focus on healthcare, specifically on clinical trial data.
In my past positions at F. Hoffmann-La Roche, we developed a “Prospective Data F.A.I.R.ification Engine” that ensures clinical data meets the required quality metrics right as it is being generated, captured and processed.
This is very much the opposite of typical F.A.I.R.ification efforts, which take place in a reactive, retrospective and ad hoc manner in cases of immediate urgency (e.g. safety breakthroughs). In such cases, data first needs to be extracted from disjointed systems, and data models, identifiers, taxonomies and standards need to be harmonized before pooling of the data can even begin.
The consequences are massive resource needs, delays to urgently required analyses or, at worst, data assets that are incompatible for pooling, so that no further value can be derived from the data.
Prospective F.A.I.R.ification, in contrast, aims to implement a working environment that “makes the right way the easy way”, enabling not only secondary use of data at scale but also significantly improving study execution and analysis as such.
Studies that piloted the Prospective Data F.A.I.R.ification Engine showed a 50% reduction in study setup time, faster execution and ease of data sharing with authorities at a speed previously unseen in the industry.
It’s important to note that the Prospective Data F.A.I.R.ification Engine is in fact a systemic approach that involves
- Systems and Tools
- Decision Making Principles
- Processes and Metrics
… in order to make it work. The importance of people, culture and a spirit of collaboration, built on a shared sense of purpose and trust, must not be underestimated in order to deliver successfully!
A basic schematic overview of the engine:
The Engine is segmented into four distinct parts, each representing specific activities executed on patient data during the planning, conduct and closure of clinical trials.
The Engine Parts
Protocol, Analysis Plan, Data Plan
This first step arguably drives 80% of the requirements on data F.A.I.R.ness, based on my experience and industry benchmarks.
Appropriate planning ahead of clinical study conduct positively impacts:
- The validity of the analysis and the study endpoints
- The usefulness of the data for re-purposing
- The timely, high-quality delivery of the analysis
- Compliance with regulatory requirements
It is essential to involve the Data Management and Biostatistics teams right from the start and to give them enough time at this stage. This will eliminate the majority of negative data surprises during the study conduct and analysis stages.
The Data Plan
Creating a data plan is essential in order to:
- Identify all data sources (e.g. eCRF, external labs, imaging providers, etc.)
- Define the timing of analysis events and which data types are required at which stage (e.g. interim and futility analyses, safety reviews, database locks)
- Set expectations on data quality (data standards and model conformance, completeness, etc.)
- Assign roles and responsibilities for data quality checks between internal and external team members (e.g. CROs)
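A data plan kept as a structured, machine-readable artifact rather than a static document can later drive automated checks. Below is a minimal sketch in Python; the class and field names (DataSource, AnalysisMilestone, etc.) are illustrative assumptions, not an established standard.

```python
from dataclasses import dataclass, field

@dataclass
class DataSource:
    """One source feeding the study database (illustrative fields)."""
    name: str      # e.g. "eCRF", "central_lab", "imaging_vendor"
    standard: str  # expected data standard, e.g. "CDISC SDTM"
    owner: str     # who is accountable for the quality checks

@dataclass
class AnalysisMilestone:
    """An analysis event and the data domains it requires."""
    name: str  # e.g. "interim_analysis", "database_lock"
    required_domains: list[str] = field(default_factory=list)

@dataclass
class DataPlan:
    study_id: str
    sources: list[DataSource] = field(default_factory=list)
    milestones: list[AnalysisMilestone] = field(default_factory=list)

# Illustrative plan; a real one would be agreed with Data Management,
# Biostatistics and the CRO partners named in the responsibilities matrix.
plan = DataPlan(
    study_id="STUDY-001",
    sources=[
        DataSource("eCRF", "CDISC SDTM", "Data Management"),
        DataSource("central_lab", "CDISC SDTM", "CRO"),
    ],
    milestones=[AnalysisMilestone("interim_analysis", ["DM", "AE", "LB"])],
)
```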
Data Acquisition
As patients are recruited into the study and go through treatments and assessments, data is captured via eCRFs, LIMS systems and other devices, and eventually transferred to the sponsor’s database. At this acquisition stage, essential activities must take place to ensure further compliance with F.A.I.R.:
- Data integrity scripts screen for data completeness, accuracy, consistency and compliance with models (e.g. CDISC) and standards (see the sketch after this list)
- The higher the standardization level, the higher the automation level will be
- The data engineer can ensure reliable data pipelines from the data sources (e.g. the EDC tool) to the data ingestion application
- Relevant and contextual metadata will ensure reliable data findability and interpretation downstream, for both primary and secondary use
- Automated data consistency checks need to be balanced against the data producer's willingness and time to comply with the required standards (e.g. a study nurse entering patient data vs. a product customer providing feedback)
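As a minimal sketch of what such a data integrity script could look like, the following assumes pandas and a small, illustrative set of expected SDTM-style columns and controlled terminology; the specific column names and allowed values are assumptions for demonstration, not a complete CDISC check suite.

```python
import pandas as pd

# Illustrative expectations; a real study would derive these from the
# data plan and the applicable CDISC controlled terminology.
REQUIRED_COLUMNS = ["USUBJID", "VISIT", "AETERM", "AESEV"]
ALLOWED_SEVERITIES = {"MILD", "MODERATE", "SEVERE"}

def check_integrity(df: pd.DataFrame) -> list[str]:
    """Return human-readable findings; an empty list means all checks pass."""
    findings = []

    # Completeness: every required column must exist and contain no gaps.
    for col in REQUIRED_COLUMNS:
        if col not in df.columns:
            findings.append(f"missing column: {col}")
        elif df[col].isna().any():
            findings.append(f"{int(df[col].isna().sum())} missing values in {col}")

    # Conformance: severity must come from the controlled terminology.
    if "AESEV" in df.columns:
        invalid = set(df["AESEV"].dropna()) - ALLOWED_SEVERITIES
        if invalid:
            findings.append(f"non-conformant AESEV values: {sorted(invalid)}")

    # Consistency: exact duplicate rows are a common double-entry symptom.
    if df.duplicated().any():
        findings.append(f"{int(df.duplicated().sum())} duplicate rows")

    return findings
```

Run against every incoming transfer, a script like this turns the expectations defined in the data plan into an automated gate rather than a manual review.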
Data Processing
F.A.I.R. data also means F.A.I.R. data processing and analysis code. F.A.I.R. syntax requirements include, among others:
- Clear and understandable naming conventions for code and versions
- Code transparency, with clear indication of macro usage
- Well-documented and structured syntax with consistent formatting (a clear header with author, input, output and referenced documents, and line breaks for readability), as illustrated after this list
- Explicit documentation of all assumptions made
- Maximum re-usability (even for the coders’ own sake when they pick the code up again in a couple of years)
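To make these points concrete, here is a hedged example of such a header and structure in Python; the program name, header fields and dataset are illustrative, not a prescribed template.

```python
# -----------------------------------------------------------------------------
# Program    : adae_summary_v1_2.py   (descriptive, versioned file name)
# Author     : J. Doe
# Purpose    : Summarise treatment-emergent adverse events by severity.
# Input      : adae.parquet (ADaM adverse event analysis dataset)
# Output     : adae_summary.csv
# References : Statistical Analysis Plan v3.0, section 9.2
# Assumption : Records with missing severity are excluded from the summary.
# -----------------------------------------------------------------------------
import pandas as pd

def summarise_adverse_events(adae: pd.DataFrame) -> pd.DataFrame:
    """Count adverse events per treatment arm and severity grade."""
    # The exclusion assumption is stated in the header and applied explicitly
    # here, rather than being buried in the logic.
    adae = adae.dropna(subset=["AESEV"])
    return (
        adae.groupby(["TRTA", "AESEV"])
        .size()
        .reset_index(name="event_count")
    )
```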
In general, F.A.I.R. code should be regarded as F.A.I.R. metadata and therefore part of the entire engine.
Data Storage and Sharing
With the industry moving increasingly towards Data Mesh and Data Fabric system architectures, the emphasis accordingly moves from data as storage towards data as a product. Data being a product essentially requires:
- Transparent accessibility through APIs
- Clear metadata for contextual interpretation
- Data re-usability (as granted by the patient’s informed consent form)
- Alignment to standards, project endpoints and IDs, making data interoperable with other data domains
- Compliance with GDPR requirements (e.g. pseudonymization and anonymization), ensuring safe data sharing in-house and with external partners
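As one concrete illustration of the GDPR point, subject identifiers can be pseudonymized before data leaves the sponsor’s controlled environment. Below is a minimal sketch using a keyed hash (HMAC-SHA256), so that the same subject always maps to the same token while the original ID cannot be recovered without the secret key; key management itself is out of scope here.

```python
import hashlib
import hmac

def pseudonymize(subject_id: str, secret_key: bytes) -> str:
    """Replace a subject identifier with a keyed, irreversible token.

    Using HMAC rather than a plain hash prevents dictionary attacks on the
    small, structured space of subject identifiers, while preserving
    linkability of the same subject across datasets.
    """
    return hmac.new(secret_key, subject_id.encode(), hashlib.sha256).hexdigest()

# Illustrative usage; in practice the key lives in a managed secrets store.
key = b"replace-with-a-managed-secret"
print(pseudonymize("STUDY-001-0042", key))
```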
Building a Prospective Data F.A.I.R.ification Engine takes a systemic, long-term approach. Having said that, I recommend starting with the very first engine segment, the planning stage, as roughly 80% of the value creation takes place there and implementation requires very little technology.