National Center for Emerging and Zoonotic Infectious Diseases
Flattening the Curve with
Covid-19 Electronic Lab Reporting
Rishi Tarar, Northrop Grumman
Jason Hall, CDC
Kafka Summit, 2020
Background
§ This architecture grew out of the needs of CDC’s
Emerging Infections Program (EIP), with an eye on
ongoing agency efforts (CDC Data and IT Modernization)
§ Multiple national-level use case implementations proved out
the architecture and exposed commonality that can extend
enterprise-wide
§ And meet hard challenges like a pandemic head-on
COVID-19 Electronic Lab Reporting (CELR) - Scope
§ Agency initiative to collect COVID-19 line-level lab testing data
from all jurisdictions in the United States
§ Goal: to have the most comprehensive testing data
§ Improve the quality and fidelity of line level data on an
ongoing basis
§ Could be used for other conditions
Very High Level Data Flows
[Diagram: data flows among PHD, PHL, private labs, lab device manufacturers, PHLIP, the AIMS Hub, HHS, and CDC CELR (Alice). Feed formats shown: CSV v1, CSV v2, and HL7 2.5.1, 2.5, 2.3.1, 2.3z. Some flows are marked Live, others Not Active.]
Future State is a Mirage – Transition State is reality
[Chart: 0% to 100% axis over 2016 to 2020, with series labelled Current Stream, New Stream, and Another, across Current State, Transition State, and Future State.]
Primary Citizen -> TESTING EVENT
§ Data: Each record is a TESTING EVENT
§ Producers organized adjacent to feed formats
§ Streaming data and shaping it record-by-record through the pipelines
§ Each record is a primary citizen
– Each record flows through the set of stream processors
– Metadata is added to each record
– Makes “things” happening to “a” record rapidly observable
– Each record conforms to an evolving schema capability
§ Data can be aggregated and streamed to any destination on the fly
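The "record as primary citizen" idea above can be sketched as an envelope that carries its own processing history. This is a minimal illustration, not the CELR implementation: the `LabTestEvent` class, the `run_stage` helper, and the field names (`patient_name`, `result`) are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable


@dataclass
class LabTestEvent:
    """One lab test record travelling through the pipeline (hypothetical shape)."""
    payload: dict[str, Any]
    metadata: list[dict[str, str]] = field(default_factory=list)


def run_stage(event: LabTestEvent, name: str,
              fn: Callable[[dict[str, Any]], dict[str, Any]]) -> LabTestEvent:
    """Apply one stream processor and append a metadata entry describing it."""
    new_payload = fn(dict(event.payload))  # work on a copy of the payload
    entry = {
        "stage": name,
        "at": datetime.now(timezone.utc).isoformat(),
        "status": "ok",
    }
    return LabTestEvent(new_payload, event.metadata + [entry])


def redact(payload: dict[str, Any]) -> dict[str, Any]:
    """Drop a direct identifier (illustrative field name)."""
    payload.pop("patient_name", None)
    return payload


def translate(payload: dict[str, Any]) -> dict[str, Any]:
    """Map a local result code to a common term (illustrative mapping)."""
    mapping = {"POS": "Positive", "NEG": "Negative"}
    payload["result"] = mapping.get(payload["result"], payload["result"])
    return payload


event = LabTestEvent({"patient_name": "Jane Doe", "result": "POS"})
for stage_name, stage_fn in [("redact", redact), ("translate", translate)]:
    event = run_stage(event, stage_name, stage_fn)
```

Because every stage appends to `metadata`, what happened to a single record can be read straight off the record itself, which is the observability property the slide describes.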
Event Pipelines
[Diagram: event sourcing. Data sources for Program X and Program Y arrive as feeds (Feed1, Feed2) through current and new producers into Kafka data streams. Each event passes through configuration-driven framework stages: Validate, Redact, Transform, Translate, Biz Rules, and Case Classification. Data events are sunk to storage (blob, relational, Elasticsearch) through the S3, JDBC, and Elastic sink connectors, with the data lake as a destination. Pipelines are organized by program and pathogen.]
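As a rough sketch of what sinking events to blob storage can look like, the snippet below builds a request body for the Kafka Connect REST API (`POST /connectors`) using the Confluent S3 sink connector. The deck does not show the actual configuration; the connector name, topic, bucket, region, and flush size here are assumptions chosen for illustration.

```python
import json


def s3_sink_config(name: str, topics: list[str], bucket: str, region: str) -> dict:
    """Build a Kafka Connect request body for an S3 sink (values are illustrative)."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "tasks.max": "1",
            "topics": ",".join(topics),
            "s3.bucket.name": bucket,
            "s3.region": region,
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            # Number of records written per S3 object; an assumed value.
            "flush.size": "1000",
        },
    }


# Hypothetical names: none of these come from the presentation.
connector = s3_sink_config(
    name="lab-events-s3-sink",
    topics=["lab-events-validated"],
    bucket="example-data-lake",
    region="us-east-1",
)
request_body = json.dumps(connector, indent=2)
```

Keeping the sink as configuration rather than code fits the "configuration driven workloads" label in the diagram: adding a destination means registering another connector, not changing a pipeline stage.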
The Platform High Level Architecture
[Diagram: sources (labs, hospitals, SPHL, registries) send HL7, CDA, FHIR, and flat files (CSV/JSON, XLSX) into Kafka through HL7, CDA, FHIR, and FLAT pipelines. Data lands in the data lake (S3, Athena, Redshift, schema dictionary) and is served as dashboards, data sets, data science tools, a real-time data stream, custom data sets, bulk exports, machine learning, and partner collaboration tools to business users (non-technical), data managers, and data science users. Use cases implemented: case notifications, lab reporting, healthcare interoperability.]
Data Storefronts
[Diagram: analytical pipelines (Glue crawler, Glue jobs, data catalogue, and schedule- and trigger-driven Glue ETL steps for translation, exclusion, tagging, race and age calculation, and filtering) update the data lake hourly and produce merged lab data. A self-service data storefront (Athena tables, Redshift tables, QuickSight, data science tools, curated views, all data) serves business users, data managers, and data science users. An automated data storefront provides line-level and aggregated lab data. DCIPHER, HHS, the CELR Portal, and the VAR Team also appear in the diagram. Data products: provenance, validation reports, dead letter reports, and audit reports.]
Features in place TODAY
§ Ingest
• Real time Staged Event Pipeline Processing or
Manual Upload
• HL7 Pipelines - Support HL7 (2.5.1, 2.5, 2.3.z,
2.3.1)
• FLAT Pipelines - CSV/FLAT/JSON (Any Size)
• FHIR Pipeline*
§ Validation
• FLAT File and Record level Validations via
Configurations (no code)
• HL7 2.5.1, 2.5, 2.3.z, 2.3.1 Validations via
Configurations
§ Transformations
• FLAT to HL7 Hierarchy
• HL7 to FHIR (per build.fhir.org) *
• HL7 to FLAT via Configurations
§ Translations
• Terminology transformations via Configurations
§ Data Lake Management Services
• At scale ETL Workflows
• SQL Style Querying on all Data
• Data Replay and Data de-Duplication
• Biz rules for calculating fields
• Machine Learning for feature extraction
from raw data and ETL for Data cleaning *
• Configuration Management
• Data Case Classifications
• Data Catalogue (Schemas and Dictionaries)
• Auditor Services for proactive issue
detection
§ Data Policy and Governance
• Data Use Agreement Filter
• Data Enrichment
• Auto Data Catalogue
• Data Security
• Data Redaction
• Pseudonymization for linking*
• De-Identification
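The "validations via configurations (no code)" item above can be illustrated with rules expressed as data and a single generic checker. This is only a sketch under assumed names: the rule format, the field names, and the allowed values are hypothetical and are not taken from CELR.

```python
import re

# Hypothetical rule set: in a real deployment this would live in a config file.
RULES = [
    {"field": "test_result", "required": True,
     "allowed": ["Positive", "Negative", "Inconclusive"]},
    {"field": "specimen_collection_date", "required": True,
     "pattern": r"\d{4}-\d{2}-\d{2}"},
    {"field": "patient_zip", "required": False, "pattern": r"\d{5}"},
]


def validate(record: dict, rules: list[dict]) -> list[str]:
    """Return a list of human-readable errors; an empty list means the record passed."""
    errors = []
    for rule in rules:
        name = rule["field"]
        value = record.get(name)
        if value is None or value == "":
            if rule.get("required"):
                errors.append(f"{name}: missing required field")
            continue  # nothing further to check on an absent value
        if "allowed" in rule and value not in rule["allowed"]:
            errors.append(f"{name}: value {value!r} not in allowed set")
        if "pattern" in rule and not re.fullmatch(rule["pattern"], str(value)):
            errors.append(f"{name}: value {value!r} does not match pattern")
    return errors


good = {"test_result": "Negative", "specimen_collection_date": "2020-06-01"}
bad = {"test_result": "Maybe", "patient_zip": "30"}
good_errors = validate(good, RULES)
bad_errors = validate(bad, RULES)
```

New feeds or fields are then onboarded by editing `RULES` rather than the checker, and the error list is the kind of output a validation report could be built from.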
More Features in place TODAY
§ Data Products (Reporting/Provision)
• Merged Line level Data from all sources
in single schema
• On demand canned Data Products
(extracts)
• Bulk Data Exports - time stamped data
sets at scale
• Self Service Custom queries
• De-duplications for resubmissions at the
record level
§ Data Integration Products
• Data Routing
• Clinical Decision Support for Guidance
Delivery *
• Exposing Data as FHIR API *
• SMART on FHIR App for integration with
EHR *
§ Analytics
• Real Time Dashboards for Lake Operations
• Real Time Dashboards for Lake Data Quality
and Provenance
• Jupyter Notebooks with all tooling (R, Python,
Scala) for Data Science
• Spark Jobs for high volume batch processing
• Canned ML algorithms
§ DEVSECOPS
• DEV to PROD in hours not days
• Full scans and deployment as part of CI/CD
• HOSTED on FISMA Moderate Cloud
Environment
• CDC ATO Environment
• HIPAA Compliant Environment
§ Data Apps
• Portal Access for Partner Agencies based on
Business needs
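Record-level de-duplication of resubmissions, listed under Data Products, might look like the sketch below: keep only the most recent version of each record, identified by a business key. The key fields (`sender`, `specimen_id`) and the `received_at` timestamp are assumptions for illustration; the deck does not say which fields CELR uses.

```python
def deduplicate(records: list[dict], key_fields: list[str],
                version_field: str = "received_at") -> list[dict]:
    """Keep the latest record per key; later submissions replace earlier ones."""
    latest: dict[tuple, dict] = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        current = latest.get(key)
        # ISO 8601 timestamps in the same zone compare correctly as strings.
        if current is None or rec[version_field] >= current[version_field]:
            latest[key] = rec
    return list(latest.values())


submissions = [
    {"sender": "LabA", "specimen_id": "S1", "result": "Inconclusive",
     "received_at": "2020-06-01T10:00:00Z"},
    {"sender": "LabA", "specimen_id": "S2", "result": "Negative",
     "received_at": "2020-06-01T11:00:00Z"},
    {"sender": "LabA", "specimen_id": "S1", "result": "Positive",
     "received_at": "2020-06-02T09:00:00Z"},  # resubmission of S1
]
merged = deduplicate(submissions, ["sender", "specimen_id"])
```

Here the resubmitted `S1` record replaces the earlier one, so `merged` holds one row per specimen.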
Tech Stack
• AWS EKS – Kubernetes <- Microservices
• Rancher
• AWS Lambda <- Serverless
• AWS Glue <- Serverless
• AWS Athena <- Serverless
• AWS Redshift
• AWS SageMaker / JupyterLab
• AWS QuickSight <- Serverless
• AWS S3, SQS, SNS, DynamoDB, RDS Postgres
• Confluent Kafka
• Elasticsearch
• Kibana
• GitLab
All Features are AT SCALE
§ Parallelism in Data Pipelines for Large-Scale Processing
– 30 Kafka Partitions, 5 Broker Kafka Cluster
§ Horizontal scaling for storage (S3, Redshift -> Petabytes)
§ Delivering Data to Consumers at Scale
– Bulk Exports -> Gigabyte Slices of Data
§ Cloud managed Serverless services for analytics
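To make the partition count concrete: within one Kafka consumer group, each partition is read by at most one consumer, so 30 partitions cap useful parallelism at 30 consumers per group. The sketch below spreads partitions across consumers round-robin; it is a simplification for illustration and not Kafka's actual assignor logic.

```python
def assign_partitions(num_partitions: int, num_consumers: int) -> dict[int, list[int]]:
    """Spread partition ids across consumers round-robin (simplified model)."""
    assignment: dict[int, list[int]] = {c: [] for c in range(num_consumers)}
    for partition in range(num_partitions):
        assignment[partition % num_consumers].append(partition)
    return assignment


# 30 partitions shared by 5 consumers: 6 partitions each.
five = assign_partitions(30, 5)

# 40 consumers for 30 partitions: the extra 10 consumers receive nothing.
forty = assign_partitions(30, 40)
idle = [c for c, parts in forty.items() if not parts]
```

The same arithmetic applies on the broker side: if partition leaders are evenly balanced, 30 partitions over 5 brokers gives 6 leaders per broker.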
Current Status
§ Status: In Production
§ Infrastructure Build out completed in ~5 days
§ Initial production deployment in ~10 days
§ Data Streams Logistical stabilization: ~15 days
§ O&M Started 30 days from Start Date
§ Full stack release cycle every 3 days (twice a week) -> now down to once per week
§ Data Products and Analytic Products are in “Real Time”
§ Data Consumers
– HHS Protect
– CDC
CELR
For more information, contact CDC
1-800-CDC-INFO (232-4636)
TTY: 1-888-232-6348 www.cdc.gov
The findings and conclusions in this report are those of the authors and do not necessarily represent the
official position of the Centers for Disease Control and Prevention.
Jason Hall, NCEZID, (zfr9@cdc.gov)
Rishi Tarar, Enterprise Architect and Fellow, Northrop Grumman (rrt8@cdc.gov)
Terminology
§ Disease surveillance is an epidemiological practice by which the spread of
disease is monitored in order to establish patterns of progression.
– The main role of disease surveillance is to
• predict, observe, and
• minimize the harm caused by outbreak, epidemic, and pandemic
situations, as well as
• increase knowledge about which factors contribute to such
circumstances.
“Surveillance data is a series of natural and spontaneous
raw data streams.
Don't resist them; that only creates sorrow and silos.
Let reality be the reality.
Let data streams flow naturally forward in whatever way it likes.”
-- Adapted from Lao Tzu
