Lessons Learned from Modernizing USCIS Data Analytics Platform

A Journey To Modernization
Shawn Benjamin & Prabha Rajendran

Problems Faced
• Informatica ETL pipeline was brittle
• Lengthy Informatica ETL development cycle
• Lengthy load times for data ingestion workflows
• No ability to deliver real-time or near real-time data
• No data science platform

Legacy Architecture
[Diagram: source systems feed the data warehouse; data scientists, statisticians, and business analysts run R / Python code on extracted data to produce business insight.]

January 2016 Snapshot
[Diagram: eCISCOR landscape as of January 2016, with a legend for ODS, Data Mart, Direct Connect, and Future items.]
• Source systems: C3 CON, eCIMS, C3 LANS, C4, NFTS, ELIS, NASS, CPMS, Pay.gov, AR-11, RAPS, MFAS, VSS, ATS, PCTS, SNAP, RNACS
• Subject groupings: Benefits, Files, Payment, Verification, Validation, Cust Svc, Scheduler, Admin, Fraud, FOIA
• Data marts: Benefits Mart, Scheduler Mart, Payment Mart, Validation Mart
• ODSes (active and decommissioned): APSS, C3 Con, C4, CIS, ELIS2, MFAS, NFTS, PCTS, RNACS, VSS, C3 LAN, AAO, CSC, MSC, NSC, TSC, VSC, CHAMPS, ECHO, iCLAIMS, IDCMS, QUEUE, NPWR, CAMINO, VIBE, SODA, SCCLAIMS, FDNS-DS, VIS, CPMS, SRMT, FACCON, ePULSE
• Other sources: Treasury, CIR, CIS
• Consumers: SMART Subject Areas, SAS Libraries, Direct Connects, BI Tools, Users
• Counts shown on the slide: 36, 2, 66, 2,354, and 28 ETL processes (labels present: Data Sources, SMART Subject Areas, eCISCOR ODS)

Successful Proof of Concept
• Implemented a Databricks private cloud VPC (26 nodes) in AWS
• Connected the Databricks cluster to the Oracle database
• Created all relevant DB tables in Hive metadata pointing to the Oracle database
• Copied relevant tables from the Oracle database to S3 using Scala code. Data is stored in Apache Parquet columnar format. For context, a 120-million-row, 83-column table can be dumped to S3 in just 10 minutes.
• Identified an appropriate partition scheme. Large tables were partitioned to optimize Spark query performance.
• Created multiple notebooks to perform data analysis and visualize the results, e.g. a histogram of case life cycle duration
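The proof of concept did the Oracle-to-S3 copy in Scala on the cluster. As a rough illustration only, the Python sketch below assembles the options for a partitioned Spark JDBC read and a target S3 path. The host, service, table, column, and bucket names are hypothetical, as is the path layout; the option keys (`url`, `driver`, `dbtable`, `partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`, `fetchsize`) are standard Spark JDBC data source options.

```python
from typing import Dict


def oracle_jdbc_options(host: str, port: int, service: str, table: str,
                        partition_column: str, lower: int, upper: int,
                        num_partitions: int) -> Dict[str, str]:
    """Build Spark JDBC options for a parallel read of one Oracle table."""
    if num_partitions < 1 or lower >= upper:
        raise ValueError("need num_partitions >= 1 and lower < upper")
    return {
        "url": f"jdbc:oracle:thin:@//{host}:{port}/{service}",
        "driver": "oracle.jdbc.OracleDriver",
        "dbtable": table,
        # Spark issues num_partitions range queries on this numeric column.
        # The bounds only set the partition stride; they do not filter rows.
        "partitionColumn": partition_column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
        "fetchsize": "10000",
    }


def parquet_target(bucket: str, table: str) -> str:
    """S3 location for the Parquet copy of a table (layout is an assumption)."""
    return f"s3a://{bucket}/raw/{table.lower().replace('.', '/')}"


# Hypothetical names throughout.
opts = oracle_jdbc_options("db.example.internal", 1521, "ORCLPDB",
                           "CASES.CASE_HISTORY", "CASE_ID", 1, 120_000_000, 64)
target = parquet_target("example-data-lake", "CASES.CASE_HISTORY")

# On a cluster with the Oracle JDBC driver installed, the copy would then be:
#   df = spark.read.format("jdbc").options(**opts).load()
#   df.write.mode("overwrite").partitionBy("RECEIPT_YEAR").parquet(target)
# RECEIPT_YEAR is a hypothetical partition column.
```

Keeping the options in a plain dictionary makes the partition settings easy to review per table before any data is moved.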

Current Databricks Implementation
[Diagram: S3 data lake and Lakehouse serving data scientists, statisticians, and business analysts, who produce business insight.]

• 75 Data Sources
• 35 Application Interfaces
• 7 Data Marts
• 4 BI Tools
• 6,086 Tableau Dashboards
• 118 SMART Subject Areas
• 56 SAS Libraries
• 6,233 Users

Databricks Accomplishments
• Implementation of Delta Lake
• Easy integration with OBIEE, SAS, and Tableau through native connectors
• Integration with GitHub for continuous integration and deployment
• Automated account provisioning
• Machine learning (MLflow) integration

Change Data Capture using Delta Lake
Databricks Delta: Success Factors
• Faster ingestion of CDC changes
• Resiliency
• Improved data quality, reporting availability, and runtime performance
• Schema evolution: adding columns without rewriting the entire table
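To make the CDC step concrete, here is a minimal pure-Python sketch of what applying a change batch to a keyed table means. It is not the authors' pipeline: in Delta Lake this logic is typically expressed as a single `MERGE INTO` statement. The change tuple layout, the operation codes `I`/`U`/`D`, and the sample rows are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

Row = Dict[str, object]
# (sequence number, operation, primary key, full row image)
Change = Tuple[int, str, int, Row]


def apply_cdc(target: Dict[int, Row], changes: Iterable[Change]) -> Dict[int, Row]:
    """Apply a CDC batch to a keyed table and return the new table."""
    # Keep only the latest change per key: a merge source should carry
    # at most one row for each target key.
    latest: Dict[int, Change] = {}
    for change in changes:
        seq, _op, key, _row = change
        if key not in latest or seq > latest[key][0]:
            latest[key] = change

    result = dict(target)
    for _seq, op, key, row in latest.values():
        if op == "D":
            result.pop(key, None)      # matched and deleted at the source
        elif op in ("I", "U"):
            result[key] = dict(row)    # insert, or overwrite with the new image
        else:
            raise ValueError(f"unknown CDC operation: {op!r}")
    return result


# Made-up sample data.
table = {1: {"status": "PENDING"}, 2: {"status": "PENDING"}}
batch = [
    (1, "U", 1, {"status": "APPROVED"}),
    (2, "I", 3, {"status": "PENDING"}),
    (3, "D", 2, {}),
    (4, "U", 3, {"status": "DENIED"}),
]
new_table = apply_cdc(table, batch)
```

Deduplicating the batch first matters because two changes to the same key in one batch would otherwise be applied in an undefined order.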
Databricks Delta: Lessons Learned
• Storage requirements increased
• Running VACUUM and OPTIMIZE regularly is mandatory to improve performance
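The vacuum and optimization lesson can be turned into a scheduled maintenance step. The sketch below only builds the SQL text, using the Delta Lake commands `OPTIMIZE` and `VACUUM ... RETAIN n HOURS`; the table name is hypothetical and the identifier check is a simplification.

```python
import re
from typing import List

# Accepts "table" or "schema.table" made of plain identifiers only.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def maintenance_sql(table: str, retain_hours: int = 168) -> List[str]:
    """Return the statements for one maintenance pass over a Delta table."""
    if not _IDENT.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    # Delta's default retention threshold is 7 days (168 hours); shorter
    # values fail a safety check unless that check is explicitly disabled.
    if retain_hours < 168:
        raise ValueError("retain_hours must be at least 168")
    return [
        f"OPTIMIZE {table}",                           # compact small files
        f"VACUUM {table} RETAIN {retain_hours} HOURS", # drop unreferenced files
    ]


statements = maintenance_sql("benefits.case_history")  # hypothetical table

# In a notebook or scheduled job:
#   for stmt in statements:
#       spark.sql(stmt)
```

Running compaction before the vacuum means the small files it replaces become eligible for removal once the retention window has passed, which also addresses the storage growth noted above.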

Unified Data Analytical Platform: Tableau

Unified Data Analytical Platform: OBIEE/SAS

Data Science Experiments using ML
[Figures: prediction model samples, and ML graphs produced after running the models.]

Text and Log Mining
[Figure: Sentiment Analysis, scored on a scale from 0 (negative sentiment) to 1 (positive sentiment).]

Time Series Models and H2O Integration
Integrated H2O with Databricks and built a model that predicts the count of 'No show' cases on N400, using traditional time series forecasting to anticipate inefficiencies in normal day-to-day planning and operations.
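As an illustration of traditional time series forecasting, the sketch below implements simple exponential smoothing in plain Python. This is a stand-in technique chosen for brevity: it does not use H2O and is not the model described in the talk. The weekly counts are made-up numbers.

```python
from typing import Sequence


def exp_smooth_forecast(history: Sequence[float], alpha: float = 0.5) -> float:
    """One-step-ahead forecast using simple exponential smoothing."""
    if not history:
        raise ValueError("history must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = float(history[0])
    for value in history[1:]:
        # New level = weighted mix of the latest observation and the old level.
        level = alpha * value + (1 - alpha) * level
    return level


weekly_no_shows = [12, 15, 11, 14]      # hypothetical counts
next_week = exp_smooth_forecast(weekly_no_shows, alpha=0.5)
```

A higher `alpha` makes the forecast react faster to recent weeks, while a lower value smooths out one-off spikes.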

Enabling Security & Governance
Access Control (ACL), Credentials Passthrough, Secrets Management
• Control users' access to data using the Databricks view-based access control model (table and schema level ACLs)
• Control users' access to clusters that are not enabled for table access control
• Enforced data object privileges at the onboarding phase
• Used the Databricks secrets manager to store credentials and reference them in notebooks
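Enforcing data object privileges at onboarding could look like the sketch below, which generates read-only grants for a new user. It assumes Databricks table access control with `GRANT ... ON SCHEMA` and `GRANT ... ON TABLE` statements; the schema, table, and user names are hypothetical, and the exact privilege set a team needs may differ.

```python
from typing import List


def read_only_grants(schema: str, tables: List[str], principal: str) -> List[str]:
    """SQL statements giving one principal read access to selected tables."""
    if "`" in principal:
        raise ValueError("principal must not contain backticks")
    quoted = f"`{principal}`"
    # USAGE on the schema is a prerequisite for acting on objects inside it.
    statements = [f"GRANT USAGE ON SCHEMA {schema} TO {quoted}"]
    statements += [
        f"GRANT SELECT ON TABLE {schema}.{table} TO {quoted}" for table in tables
    ]
    return statements


grants = read_only_grants("benefits", ["case_history", "case_status"],
                          "analyst@example.gov")

# Credentials are read from a secret scope rather than hard-coded, e.g.
#   password = dbutils.secrets.get(scope="etl", key="oracle-password")
# where the scope and key names are hypothetical.
```

Generating the statements from a list keeps onboarding repeatable and leaves an auditable record of what each user was granted.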

Databricks Management API Usage
• Cluster/Jobs management: create, delete, and manage clusters, and get the execution status of daily scheduled jobs, which helped automate monitoring.
• Library/Secret management: easy upload of any third-party libraries, and management of encrypted scopes/credentials to connect to source and target endpoints.
• Integration and deployments: API with Git and Jenkins for continuous integration and continuous deployment.
• Enabled the MLflow Tracking API for our data science experiments.
Integrated the Databricks Management API with Jenkins and other scripting tools to automate all our administration and management tasks.
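The job monitoring described above can be sketched with the Jobs API `GET /api/2.0/jobs/runs/list` endpoint and a personal access token sent as a bearer header. The workspace host, token, and sample response below are placeholders; the code builds the request and filters a response, and only the helper `fetch_failed_runs` would contact a real workspace.

```python
import json
import urllib.request
from typing import Dict, List


def runs_list_request(host: str, token: str, limit: int = 25) -> urllib.request.Request:
    """Build a request for the most recent completed job runs."""
    url = f"https://{host}/api/2.0/jobs/runs/list?completed_only=true&limit={limit}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def failed_run_ids(payload: Dict) -> List[int]:
    """Return IDs of runs whose result state is anything other than SUCCESS."""
    return [
        run["run_id"]
        for run in payload.get("runs", [])
        if run.get("state", {}).get("result_state") != "SUCCESS"
    ]


def fetch_failed_runs(request: urllib.request.Request) -> List[int]:
    """Send the request and report failed runs (needs a real workspace)."""
    with urllib.request.urlopen(request) as response:
        return failed_run_ids(json.load(response))


# Placeholder host and token; the sample mimics the shape of the API response.
req = runs_list_request("example.cloud.databricks.com", "dapi-EXAMPLE")
sample = {"runs": [
    {"run_id": 101, "state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"}},
    {"run_id": 102, "state": {"life_cycle_state": "TERMINATED", "result_state": "FAILED"}},
]}
failed = failed_run_ids(sample)
```

A script like this can run from Jenkins on a schedule and raise an alert whenever the returned list is not empty.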

Lessons learned through this Journey
• Training plan
• Cloud-based experience
• Subject matter expertise
• Automation

Success Strategy
Success Criteria and Benefits
Performance:
• Auto-scalability leveraging on-demand and spot instances
• Efficient processing of larger datasets, comparable to RDBMS systems
• Scalable read/write performance on S3
Support for a variety of statistical programming languages:
• Data science platform (R, Python, Scala, and SQL)
• Supports MLlib: machine learning and deep learning
Integration with existing tools:
• Allows connections to industry-standard technologies via ODBC/JDBC connections and built-in connectors
Easily integrate new data sources:
• Supports seamless integration with data streaming technologies such as Kafka and Kinesis using Spark Streaming, for both structured and unstructured data
• Leverages S3 extensively
Secure:
• Supports integration with multiple single sign-on platforms
• Supports native encryption and decryption features (AES-256 and KMS)
• Supports access control lists (ACLs)
• Implemented in the USCIS private cloud
