A Journey To Modernization
Shawn Benjamin & Prabha Rajendran
Problems Faced
• Informatica ETL pipeline was brittle
• Lengthy Informatica ETL development cycle
• Lengthy load-time workflows for data ingestion
• No ability to deliver real-time or near-real-time data
• No data science platform
Legacy Architecture
[Architecture diagram. Labels: Source Systems; Data Warehouse; Extracted Data; R / Python Code; Business Insight. Users shown: Data Scientists, Statisticians, Business Analysts.]
[Diagram: January 2016 snapshot of data sources, ODSes, data marts, BI tools and users.
Systems shown: C3 CON, C3 LANS, C4, CIS, eCIMS, ELIS, ELIS2, NFTS, NASS, CPMS, Pay.gov, AR-11, RAPS, MFAS, VSS, ATS, PCTS, SNAP, RNACS, APSS, AAO, CSC, MSC, NSC, TSC, VSC, CHAMPS, ECHO, iCLAIMS, IDCMS, QUEUE, NPWR, CAMINO VIBE, SODA, SCCLAIMS, FDNS-DS, Treasury, CIR, VIS, SRMT, FACCON, ePULSE.
Data marts: Benefits, Scheduler, Payment, Validation.
Other labels: Files, Verification, Cust Svc, Admin, Fraud, FOIA, SMART Subject Areas, SAS Libraries, eCISCOR, Direct Connects, BI Tools, Users.
Legend: Active ODSes, Decom. ODSes, Future, Data Mart, Direct Connect.
Summary figures as extracted: 4, 36, 2, 66, 2,354, and 28 ETL processes.]
Successful Proof of Concept
• Implemented Databricks in a private cloud VPC (26 nodes) on AWS
• Connected the Databricks cluster to the Oracle database
• Created all relevant database tables in the Hive metastore, pointing to the Oracle database
• Copied relevant tables from the Oracle database to S3 using Scala code. Data is stored in Apache Parquet columnar format. For context, a table of 120 million rows and 83 columns can be dumped to S3 in just 10 minutes.
• Identified an appropriate partition scheme: large tables were partitioned to optimize Spark query performance
• Created multiple notebooks to perform data analysis and visualize the results, e.g. a histogram of case life cycle duration
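The copy step from Oracle to partitioned Parquet on S3 can be sketched roughly as follows. The slides mention Scala code; this sketch uses PySpark-style calls instead, and it is an illustration under stated assumptions, not the presenters' code. The JDBC URL, table name, key column, bucket path and partition column are all hypothetical, and credentials are deliberately left out.

```python
from typing import Dict, List, Optional


def jdbc_read_options(url: str, table: str, partition_column: str,
                      lower_bound: int, upper_bound: int,
                      num_partitions: int) -> Dict[str, str]:
    """Build Spark JDBC reader options for a parallel, range-partitioned read."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower_bound >= upper_bound:
        raise ValueError("lower_bound must be smaller than upper_bound")
    return {
        "url": url,
        "dbtable": table,
        "driver": "oracle.jdbc.OracleDriver",
        # Spark splits [lowerBound, upperBound] into numPartitions ranges
        # on partitionColumn and reads the ranges in parallel.
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
        "fetchsize": "10000",
        # "user" and "password" would be added here, ideally from a secret store.
    }


def copy_table_to_parquet(spark, options: Dict[str, str], target_path: str,
                          partition_by: Optional[List[str]] = None) -> None:
    """Read one table over JDBC and write it as (optionally partitioned) Parquet.

    `spark` is an existing SparkSession, e.g. the one a notebook provides.
    """
    df = spark.read.format("jdbc").options(**options).load()
    writer = df.write.mode("overwrite")
    if partition_by:
        writer = writer.partitionBy(*partition_by)
    writer.parquet(target_path)


# Hypothetical values, for illustration only.
example_options = jdbc_read_options(
    url="jdbc:oracle:thin:@//db.example.internal:1521/ORCL",
    table="CASES",
    partition_column="CASE_ID",
    lower_bound=1,
    upper_bound=120_000_000,
    num_partitions=26,
)
print(example_options["numPartitions"])
```

In a notebook, `copy_table_to_parquet(spark, example_options, "s3a://example-bucket/cases/", ["RECEIPT_YEAR"])` would perform the copy; the choice of partition column depends on how the table is usually filtered.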
Current Databricks Implementation
[Architecture diagram. Labels: S3 Data Lake; LakeHouse; Business Insight. Users shown: Data Scientists, Statisticians, Business Analysts.]
• 75 Data Sources
• 35 Application Interfaces
• 7 Data Marts
• 4 BI Tools
• 6,086 Tableau Dashboards
• 118 SMART Subject Areas
• 56 SAS Libraries
• 6,233 Users
Databricks Accomplishments
• Implementation of Delta Lake
• Easy integration with OBIEE, SAS and Tableau through native connectors
• Integration with GitHub for continuous integration and deployment
• Automated account provisioning
• Machine learning (MLflow) integration
Change Data Capture using Delta Lake
Databricks Delta: Success Factors
• Faster ingestion of CDC changes
• Resiliency
• Improved data quality, reporting availability and runtime performance
• Schema evolution: adding columns without rewriting the entire table
Databricks Delta: Lessons Learned
• Storage requirements increased
• Vacuum and optimization are mandatory to improve performance
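As a rough illustration of how CDC changes are commonly applied to a Delta table, the sketch below builds a `MERGE INTO` statement plus the `OPTIMIZE` and `VACUUM` maintenance statements that the lessons learned refer to. The table names, the key column and the `op` change-type column (with `'D'` marking deletes) are assumptions made for the example; the slides do not describe the actual CDC layout. `UPDATE SET *` and `INSERT *` also assume that every target column exists in the change feed.

```python
from typing import List


def cdc_merge_sql(target: str, source: str, key_columns: List[str],
                  op_column: str = "op") -> str:
    """Build a Delta Lake MERGE that applies inserts, updates and deletes."""
    if not key_columns:
        raise ValueError("at least one key column is required")
    on_clause = " AND ".join(f"t.{col} = s.{col}" for col in key_columns)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {on_clause}\n"
        # Rows flagged as deletes remove the matching target row.
        f"WHEN MATCHED AND s.{op_column} = 'D' THEN DELETE\n"
        # Any other matching change overwrites the target row.
        f"WHEN MATCHED THEN UPDATE SET *\n"
        # Unmatched, non-delete changes are new rows.
        f"WHEN NOT MATCHED AND s.{op_column} <> 'D' THEN INSERT *"
    )


def maintenance_sql(table: str, retain_hours: int = 168) -> List[str]:
    """Compact small files, then remove files older than the retention window."""
    if retain_hours < 0:
        raise ValueError("retain_hours must not be negative")
    return [
        f"OPTIMIZE {table}",
        f"VACUUM {table} RETAIN {retain_hours} HOURS",
    ]


# Hypothetical table names, for illustration only.
print(cdc_merge_sql("cases", "cases_changes", ["case_id"]))
for statement in maintenance_sql("cases"):
    print(statement)
```

Each generated statement could be passed to `spark.sql(...)` in a notebook or scheduled job. Running the maintenance statements on a schedule is one way to address the storage growth noted above, since every merge leaves older file versions behind until they are vacuumed.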
Unified Data Analytical Platform: Tableau
Unified Data Analytical Platform: OBIEE/SAS
Data Science Experiments using ML
[Figures: ML graphs produced after running the models; prediction model samples.]
Text and Log Mining
[Chart: Sentiment Analysis. Axis from 0 to 1; series: Negative Sentiment, Positive Sentiment.]
Time Series Models and H2O Integration
Integrated H2O with Databricks and built a model that predicts the count of 'No show' cases on N-400, using traditional time series forecasting to anticipate inefficiencies in normal day-to-day planning and operations.
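To make "traditional time series forecasting" concrete, here is a minimal sketch of simple exponential smoothing, one of the classic techniques in that family. It is not the H2O model described on the slide, and the weekly counts are made up for illustration.

```python
from typing import Sequence


def ses_forecast(series: Sequence[float], alpha: float) -> float:
    """One-step-ahead forecast using simple exponential smoothing.

    alpha close to 1 follows recent observations closely;
    alpha close to 0 changes the forecast slowly.
    """
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = float(series[0])
    for value in series[1:]:
        # Blend the newest observation with the running level.
        level = alpha * value + (1 - alpha) * level
    return level


# Made-up weekly no-show counts, for illustration only.
weekly_no_shows = [10, 20, 30]
print(ses_forecast(weekly_no_shows, alpha=0.5))  # prints 22.5
```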
Enabling Security & Governance
Access Control (ACL)
Credentials Passthrough
Secrets Management
• Controlled user access to data using the Databricks view-based access control model (table- and schema-level ACLs)
• Controlled user access to clusters that are not enabled for table access control
• Enforced data object privileges at the onboarding phase
• Used the Databricks secrets manager to store credentials and reference them in notebooks
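Enforcing data object privileges at onboarding could look something like the sketch below, which generates `GRANT` statements for a new user. The principal, database and table names are hypothetical, and the privilege set (`USAGE` on the database plus `SELECT` on each table) is an assumption for the example rather than the presenters' policy.

```python
from typing import List


def onboarding_grants(principal: str, database: str,
                      tables: List[str]) -> List[str]:
    """Build read-only GRANT statements for a newly onboarded user or group."""
    if "`" in principal:
        raise ValueError("principal must not contain backticks")
    quoted = f"`{principal}`"
    # USAGE on the database is needed before table-level privileges take effect.
    statements = [f"GRANT USAGE ON DATABASE {database} TO {quoted}"]
    statements += [
        f"GRANT SELECT ON TABLE {database}.{table} TO {quoted}"
        for table in tables
    ]
    return statements


# Hypothetical names, for illustration only.
for statement in onboarding_grants("analyst@example.gov", "benefits_mart",
                                   ["cases", "payments"]):
    print(statement)

# Inside a Databricks notebook, stored credentials are read with, for example:
#   password = dbutils.secrets.get(scope="example-scope", key="oracle-password")
# (the scope and key names here are made up).
```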
Databricks Management API Usage
• Cluster/Jobs management: create, delete and manage clusters, and get the execution status of daily scheduled jobs, which helped automate monitoring.
• Library/Secret management: easy upload of third-party libraries, and management of encrypted scopes and credentials for connecting to source and target endpoints.
• Integration and deployments: API integration with Git and Jenkins for continuous integration and continuous deployment.
• Enabled the MLflow Tracking API for our data science experiments.
Integrated the Databricks Management API with Jenkins and other scripting tools to automate all our administration and management tasks.
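The job-status monitoring described above might be scripted along these lines. This is a sketch under assumptions: the workspace URL and token are placeholders, the request is only constructed (not sent), and the sample response is hand-written to mirror the shape of a Jobs API `runs/list` reply.

```python
import urllib.parse
import urllib.request
from typing import Any, Dict, List, Optional


def api_request(host: str, token: str, path: str,
                params: Optional[Dict[str, Any]] = None) -> urllib.request.Request:
    """Build an authenticated GET request for a Databricks REST endpoint."""
    url = host.rstrip("/") + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}, method="GET")


def failed_runs(runs_response: Dict[str, Any]) -> List[int]:
    """Return run IDs that finished with any result other than SUCCESS."""
    failed = []
    for run in runs_response.get("runs", []):
        state = run.get("state", {})
        finished = state.get("life_cycle_state") == "TERMINATED"
        if finished and state.get("result_state") != "SUCCESS":
            failed.append(run["run_id"])
    return failed


# Placeholder workspace URL and token, for illustration only.
request = api_request("https://example.cloud.databricks.com", "dapi-EXAMPLE",
                      "/api/2.1/jobs/runs/list",
                      {"completed_only": "true", "limit": 25})
print(request.full_url)

# Hand-written sample shaped like a runs/list response.
sample_response = {"runs": [
    {"run_id": 1, "state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"}},
    {"run_id": 2, "state": {"life_cycle_state": "TERMINATED", "result_state": "FAILED"}},
    {"run_id": 3, "state": {"life_cycle_state": "RUNNING"}},
]}
print(failed_runs(sample_response))  # prints [2]
```

A Jenkins job or cron script could send the request with `urllib.request.urlopen(request)`, parse the JSON body, and alert on any run IDs that `failed_runs` returns.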
Lessons learned through this Journey
• Training plan
• Cloud-based experience
• Subject matter expertise
• Automation
Success Strategy
Success criteria and their benefits:
Performance
• Auto-scalability leveraging on-demand and spot instances
• Efficient processing of larger datasets, comparable to RDBMS systems
• Scalable read/write performance on S3
Support for a variety of statistical programming languages
• Data science platform (R, Python, Scala and SQL)
• Supports MLlib: machine learning and deep learning
Integration with existing tools
• Allows connections to industry-standard technologies via ODBC/JDBC and built-in connectors
Easily integrate new data sources
• Supports seamless integration with data streaming technologies such as Kafka and Kinesis using Spark Streaming, for both structured and unstructured data
• Leverages S3 extensively
Secure
• Supports integration with multiple single sign-on platforms
• Supports native encryption and decryption features (AES-256 and KMS)
• Supports access control lists (ACLs)
• Implemented in the USCIS private cloud
Questions!
Lessons Learned from Modernizing USCIS Data Analytics Platform
