Redefining ETL Pipelines with Apache Technologies to Accelerate Decision Making for Clinical Trials
Eran Withana
www.comprehend.com
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
Open Source
  Member, PMC member and committer of ASF
  Apache Axis2, Web Services, Synapse, Airavata
Education
  PhD in Computer Science from Indiana University
Software engineer at Comprehend Systems
About me …
Clinical Trials – Lay of the land
Number of Drugs in Development Worldwide (Source: CenterWatch Drugs in Clinical Trial Database 2014)
Source: http://www.phrma.org/innovation/clinical-trials
Clinical Trials – Lay of the Land
Multiple Stakeholders
• Study Managers
• Program Managers
• Monitors
• Data Managers
• Bio-statisticians
• Executives
• Medical Affairs
• Regulatory
• Vendors
• CROs
• CRAs
[Diagram labels: Sites, Labs, Patients; Safety, EDC, PV Data, Excel; Reports; Data. Qualities listed: Latent, Fragmented. Layers: Sponsor; Contract Research Organization (CRO); Sites and Investigators.]
For decades, clinical development was primarily paper-based.
Various Software and Practices Used in Each Layer
medidata
CROs and SIs
Technologies
Clinical Trials with Centralized Monitoring
[Diagram labels: Clinical Operations; Sites, Labs, Patients; Safety, EDC, PV Data, Excel; Data; Clinical Analytics & Collaboration. Qualities listed: Consolidated, Real-time, Self-Service, Mobile.]
Providing up-to-date answers
[Diagram labels: Executives, Medical Review, CRAs, Data Management, Clinical Operations; EDC, CTMS, Safety, ePro, Other; Web, Ad-Hoc, Mobile, Collab.]
FDA, HIPAA Compliance
Metadata/Database structure synchronization
  Less frequent (once a day)
Data synchronization
  More frequent (multiple times a day)
Ability to plug in various data sources
  RAVE, MERGE, BioClinica, file imports, DB-to-DB syncs
Real-time event propagation
  Adverse events (AEs): the need for early identification
Business Requirements
Hardware agnostic for resiliency and better utilization
Repeatable deployments
Real-time processing and real-time events
Fault tolerance
In-flight and end-state metrics for alerting and monitoring
Flexible and pluggable adapter architecture
Time travel
Audit trails
Report generation
Technical Requirements
Events all the way
Shared event bus for multiple consumers
Use of language-agnostic data representations (via protobuf)
Automatic datacenter resource management (Mesos/Marathon/Docker)
Core Design Principles
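The "shared event bus for multiple consumers" principle can be sketched in a few lines. This is a toy in-process model, not the Kafka and protobuf setup the slides name: `EventBus`, `subscribe`, `publish`, and the `adverse_events` topic are hypothetical names, and JSON stands in here for a language-agnostic encoding.

```python
import json
from collections import defaultdict


class EventBus:
    """Toy in-process event bus: every subscriber of a topic sees every event."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Serialize once to a language-neutral format (JSON here, as an assumption).
        payload = json.dumps(event, sort_keys=True)
        for callback in self._subscribers[topic]:
            callback(payload)


bus = EventBus()
audit_log = []
alerts = []

# Two independent consumers of the same event stream.
bus.subscribe("adverse_events", audit_log.append)
bus.subscribe(
    "adverse_events",
    lambda payload: alerts.append(json.loads(payload)["study"]),
)

bus.publish("adverse_events", {"study": "S-001", "severity": "high"})
```

Because consumers only see the serialized payload, a new consumer (say, a report generator) could be added without touching the producer.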
• Data processing
  - Apache Storm and Trident, Apache Spark and Spark Streaming, Samza, Summingbird, Scalding, Apache Falcon, Azkaban
• Coordination and Configuration Management
  - Apache ZooKeeper, Redis, Apache Curator
• Event Queue
  - Apache Kafka
• Scheduling
  - Chronos, Apache Mesos, Marathon, Apache Aurora
• Database Synchronization
  - Liquibase, Flyway DB
• Data Representations
  - Apache Thrift, protobuf, Avro
• Deployments
  - Ansible
• File Management
  - Apache HDFS
• Monitoring and alerting
  - Graphite, StatsD
• Database
  - PostgreSQL, Apache Spark
• Resource Isolation
  - LXC, Docker
Technologies Evaluated
Data Processing Technology Evaluation
Criteria                       | Storm + Trident | Spark + Streaming | Samza | Summingbird | Scalding | Falcon | Chronos | Aurora | Azkaban
DAG Support                    | Y               | DAGScheduler      | Y     | Y           | Y        | Y      | Y       | N      | Y
DAG Nodes Resiliency           | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | N      | Y
Event Driven                   | Y               | Y                 | Y     | Y           | N        | N      | N       | N      | N
Timed Execution                | Y Y Y Y Y Y Y Y [one of the nine cells is missing in the source]
DAG Extension                  | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | Y      | Y
Inflight and end state metrics | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | Y      | Y
Hardware Agnostic              | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | Y      | Y
High Level Architecture
Bare metal boxes
  Partitioned using LXC containers
Use of Mesos to allocate resources as needed for jobs
Managing Hardware
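To make "allocate resources as needed for jobs" concrete, here is a minimal first-fit placement sketch. It is a toy model of matching jobs to free capacity and does not use the Mesos API; `Node`, `Job`, `allocate`, and the box and job names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpus: float
    mem_mb: int


@dataclass
class Job:
    name: str
    cpus: float
    mem_mb: int


def allocate(jobs, nodes):
    """First-fit placement: put each job on the first node with enough free CPU and memory."""
    placement = {}
    for job in jobs:
        for node in nodes:
            if node.cpus >= job.cpus and node.mem_mb >= job.mem_mb:
                # Reserve the resources so later jobs see the reduced capacity.
                node.cpus -= job.cpus
                node.mem_mb -= job.mem_mb
                placement[job.name] = node.name
                break
        else:
            placement[job.name] = None  # no node can host the job right now
    return placement


nodes = [Node("box-1", cpus=4, mem_mb=8192), Node("box-2", cpus=8, mem_mb=16384)]
jobs = [
    Job("seeder-a", cpus=3, mem_mb=4096),
    Job("seeder-b", cpus=2, mem_mb=2048),
    Job("syncher", cpus=16, mem_mb=1024),
]
placement = allocate(jobs, nodes)
```

The job that asks for more CPU than any box has stays unplaced, which is the kind of signal a real scheduler would queue or alert on.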
Ansible
  Repeatable deployments
  Password management
  Inventory management (nodes, dev/staging/production)
Deployments
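Inventory management per environment can be illustrated with an Ansible dynamic inventory, which is a program that prints groups and hosts as JSON. The sketch below only builds that JSON shape; the host names and the `deploy_env` variable are hypothetical, and the slides do not say whether static or dynamic inventories were used.

```python
import json


def build_inventory(environments):
    """Return an inventory in the JSON shape Ansible expects from a dynamic inventory."""
    inventory = {"_meta": {"hostvars": {}}}
    for env, hosts in environments.items():
        inventory[env] = {"hosts": sorted(hosts)}
        for host in hosts:
            # Per-host variables live under _meta.hostvars.
            inventory["_meta"]["hostvars"][host] = {"deploy_env": env}
    return inventory


# Hypothetical host names, grouped by environment.
ENVIRONMENTS = {
    "dev": ["dev-node-1"],
    "staging": ["stg-node-1", "stg-node-2"],
    "production": ["prod-node-1", "prod-node-2", "prod-node-3"],
}

inventory = build_inventory(ENVIRONMENTS)
print(json.dumps(inventory, indent=2))
```

A real inventory script would be executable and respond to the `--list` argument that Ansible passes; playbooks can then target `dev`, `staging`, or `production` as groups.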
Adapters – High Level
• Syncher is for DB structural changes
  - Syncher creates a database schema from the source information
  - Runs a generic database diff and applies the differences to the target database
• Seeder is for data synchronization
  - Uses the database schema created by Syncher
• Seeders get jobs from
  - Syncher or
  - Timed scheduler
Data Adapters
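The Syncher's "generic database diff" step can be sketched as a comparison of two schema descriptions that emits DDL for the target. This is a minimal illustration under stated assumptions, not the talk's implementation: schemas are plain dicts, `diff_schema` and the table names are hypothetical, and column type changes are ignored.

```python
def diff_schema(source, target):
    """Return DDL statements that make `target` match `source`.

    Both arguments map table name -> {column name: SQL type}.
    """
    statements = []
    for table, columns in sorted(source.items()):
        if table not in target:
            cols = ", ".join(f"{name} {sqltype}" for name, sqltype in columns.items())
            statements.append(f"CREATE TABLE {table} ({cols})")
            continue
        # Table exists on both sides: add missing columns, drop stale ones.
        for name, sqltype in columns.items():
            if name not in target[table]:
                statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {sqltype}")
        for name in target[table]:
            if name not in columns:
                statements.append(f"ALTER TABLE {table} DROP COLUMN {name}")
    for table in sorted(set(target) - set(source)):
        statements.append(f"DROP TABLE {table}")
    return statements


source_schema = {
    "patients": {"id": "INTEGER", "site_id": "INTEGER", "enrolled_on": "DATE"},
    "adverse_events": {"id": "INTEGER", "patient_id": "INTEGER"},
}
target_schema = {
    "patients": {"id": "INTEGER", "site_id": "INTEGER", "legacy_code": "TEXT"},
}

ddl = diff_schema(source_schema, target_schema)
```

Applying the resulting statements inside one transaction is what would allow a rollback if a schema change fails midway.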
• Coordination and configuration through ZooKeeper
  - Job configuration
  - Connection information
  - Distributed locking and counters
  - Metric maintenance
  - Last successful run
Data Adapters – Coordination and Configuration
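Distributed locking on ZooKeeper is commonly built from ephemeral sequential znodes: each client creates a numbered child under the lock node, and the lowest number holds the lock. The sketch below simulates that idea in memory so it runs without a ZooKeeper ensemble; `InMemoryLockDir` and its methods are hypothetical stand-ins, not ZooKeeper or Curator calls.

```python
import itertools


class InMemoryLockDir:
    """Stands in for a ZooKeeper lock znode and its ephemeral sequential children."""

    def __init__(self):
        self._sequence = itertools.count()
        self._children = {}  # sequence number -> client id

    def enqueue(self, client_id):
        """Like creating an ephemeral sequential child; returns its sequence number."""
        seq = next(self._sequence)
        self._children[seq] = client_id
        return seq

    def holder(self):
        """The client owning the lowest sequence number holds the lock."""
        if not self._children:
            return None
        return self._children[min(self._children)]

    def release(self, seq):
        """Like deleting the child (or the session expiring); the next waiter takes over."""
        self._children.pop(seq, None)


lock = InMemoryLockDir()
seeder_1 = lock.enqueue("seeder-1")
seeder_2 = lock.enqueue("seeder-2")
first_holder = lock.holder()
lock.release(seeder_1)
second_holder = lock.holder()
```

In a real deployment the ephemeral node disappears when its owner's session dies, so a crashed Seeder cannot hold the lock forever.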
Data Adapters - Implementation
• Syncher
  - Connectivity to source/sink systems fails
    • Retry job N times and alert, if needed
  - Schema changes to the database fail midway
    • Transaction rollback
• Seeder
  - Connectivity to source/sink systems fails
    • Retry job N times and alert, if needed
  - If seeding fails midway
    • Storm retries tuples
    • Failing tuples are moved to an error queue
  - Table and row level failures
    • Option to skip the tables/rows but send a report at the end
  - Effect on “live” tables during data synchronizations
    • Option to use transactions or
    • Use temporary tables and swap with original upon completion
Failure Modes
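Two of the failure-handling patterns above, "retry job N times and alert" and "move failing items to an error queue", can be sketched in plain Python. This does not use Storm; `run_with_retries`, `seed_rows`, and the demo callbacks are hypothetical names chosen for illustration.

```python
def run_with_retries(job, max_attempts, alert):
    """Run `job` up to `max_attempts` times; call `alert` if every attempt fails."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return job()
        except Exception as exc:  # a sketch: real code would catch narrower errors
            last_error = exc
    alert(f"job failed after {max_attempts} attempts: {last_error}")
    return None


def seed_rows(rows, write_row, error_queue):
    """Write rows one by one; failed rows go to `error_queue` instead of aborting the run."""
    written = 0
    for row in rows:
        try:
            write_row(row)
            written += 1
        except ValueError as exc:
            error_queue.append((row, str(exc)))
    return written


# Demo 1: a connection that succeeds on the third attempt.
attempts = []


def flaky_connect():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("source unavailable")
    return "connected"


alerts = []
result = run_with_retries(flaky_connect, max_attempts=5, alert=alerts.append)


# Demo 2: one bad row is skipped and reported at the end.
def write_row(row):
    if row["age"] < 0:
        raise ValueError("negative age")


error_queue = []
written = seed_rows([{"age": 41}, {"age": -1}, {"age": 67}], write_row, error_queue)
```

The error queue doubles as the end-of-run report: it records which rows were skipped and why.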
Can bring in data from more data sources and more studies effectively
Run real-time reports on studies and configure alerts (future)
Can configure refreshes as needed by each use case
Can throttle input and output sources at study/customer level
Ability to onboard new customers and deploy new studies with minimal human intervention
What Have We Gained
A generic framework which
  eases integration with new data sources
    • For each new source, implement a method to create a virtual schema and to get data for a given table
  scales and is fault tolerant
  has generic monitoring and alerting
  eases maintenance since it is mostly generic code
  sends notifications of important events through messages
  runs on any hardware
What Have We Gained
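The two methods a new source has to provide, one for the virtual schema and one for table data, map naturally onto an abstract base class. The sketch below is one possible shape for that contract; `SourceAdapter`, `virtual_schema`, `fetch_rows`, and the in-memory example source are all hypothetical names, not the framework's actual interface.

```python
from abc import ABC, abstractmethod


class SourceAdapter(ABC):
    """Contract a new data source implements (hypothetical names)."""

    @abstractmethod
    def virtual_schema(self):
        """Return {table name: [column names]} describing the source."""

    @abstractmethod
    def fetch_rows(self, table):
        """Yield rows of `table` as dicts keyed by column name."""


class InMemoryFileAdapter(SourceAdapter):
    """Example source backed by in-memory 'files': {table: (header, rows)}."""

    def __init__(self, files):
        self._files = files

    def virtual_schema(self):
        return {table: list(header) for table, (header, _) in self._files.items()}

    def fetch_rows(self, table):
        header, rows = self._files[table]
        for row in rows:
            yield dict(zip(header, row))


adapter = InMemoryFileAdapter(
    {"visits": (("patient_id", "visit_date"), [(1, "2015-01-05"), (2, "2015-01-06")])}
)
schema = adapter.virtual_schema()
rows = list(adapter.fetch_rows("visits"))
```

Everything else (scheduling, retries, metrics) can then stay in generic code that only talks to the base class.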
Accessibility
  Customers must be able to drop files securely (SFTP-like functionality)
  Ability to access resources through URLs
Data storage
Scalability and redundancy
  Scale out by adding nodes
  Resilience against loss of nodes and data centers; replication
Miscellaneous
  Access control over read/write
  Performance/usage/resource utilization monitoring
Distributed File System - Requirements
Two name nodes running in HA mode, co-located with two journal nodes
Third journal node on a separate node
Data nodes on all bare metal nodes
Mounting HDFS with FUSE and enabling SFTP through OS-level features
Automatic failover through DNS and HAProxy
HDFS with High Availability Mode
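The reason for three journal nodes is quorum arithmetic: in HDFS HA with the Quorum Journal Manager, an edit must reach a majority of journal nodes, so N journal nodes tolerate (N - 1) / 2 failures. A short worked example (the function names are illustrative):

```python
def quorum_size(journal_nodes):
    """Smallest majority of `journal_nodes`."""
    return journal_nodes // 2 + 1


def tolerated_failures(journal_nodes):
    """How many journal nodes can be lost while a majority remains."""
    return (journal_nodes - 1) // 2


for n in (1, 2, 3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

With the layout above (three journal nodes on three machines), losing any one machine still leaves two journal nodes, which is a majority; two journal nodes alone would tolerate no failure at all.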
Regulatory requirements
  Data encryption requirements for clinical data
  Audit trails
Data quality
  Source system constraints
Coordination between Synchers and Seeders
  Distributed locks and counters
Automatic failover when a name node fails in HDFS
  HDFS HA mode stores the active name node in ZK as a Java serialized object, yikes !!
Challenges
Time travel
  Ability to go back in time and run reports at any given point in time
  Trail of data
Containerization
In-memory query execution with Apache Spark
Future Work
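Time travel is listed as future work, so the slides do not describe a design. One common way to support "run reports at any given point in time" is an append-only history with as-of lookups; the sketch below assumes that approach, and `VersionedTable`, `put`, `as_of`, and the integer timestamps are hypothetical.

```python
import bisect


class VersionedTable:
    """Append-only store: each key keeps every (timestamp, value) it ever had."""

    def __init__(self):
        self._history = {}  # key -> ([timestamps], [values]), timestamps ascending

    def put(self, key, timestamp, value):
        timestamps, values = self._history.setdefault(key, ([], []))
        if timestamps and timestamp < timestamps[-1]:
            raise ValueError("writes must arrive in timestamp order")
        timestamps.append(timestamp)
        values.append(value)

    def as_of(self, key, timestamp):
        """Return the value `key` had at `timestamp`, or None if it did not exist yet."""
        timestamps, values = self._history.get(key, ([], []))
        # Index of the first entry strictly after `timestamp`.
        index = bisect.bisect_right(timestamps, timestamp)
        return values[index - 1] if index else None


table = VersionedTable()
table.put("patient-7/status", 100, "screened")
table.put("patient-7/status", 200, "enrolled")
table.put("patient-7/status", 300, "withdrawn")

status_at_250 = table.as_of("patient-7/status", 250)
```

Because nothing is overwritten, the same history also serves as the "trail of data" needed for audits.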
Team
Thank You !!
Questions …


Editor's Notes

  • #6: Dose 20-100 Efficacy and safety 100-300 > 1000