ChakraView
A 360° approach to data quality
Shankar Manian
Keerthika Thiyagarajan
Background
● ~15 years in Big Data...
● ...as Data Janitors
● Can we do better?
Data Quality - Missing Focus
● Afterthought
● Needle in a haystack
● Huge cost
Detection - Missing Dimensions
● Completeness
● Consistency
● Auditability
Cleansing - The Hidden Cost
● Trace the issue to source
● No SOP on how to fix
● Hard to Automate
Visibility - Or the lack of it
● Impact - Cost of bad data
● Breakdown and Prioritization
● Push quality upstream
State before
● Stakeholder driven
● Reactive process
● Business metrics
● Huge monetary impact
● Iterative Discovery
Validations Framework
● Granular Validations -> Business metrics
● Self serve onboarding
● Trigger on data refresh
● System health dashboard
TransactionId  OrderId  Amount  B.Amount  InvoiceId  L.Amount
TX1            OD1      100     100       I1         10
TX2            OD2      50      50        I2         50
TX3            OD3      75      75        I3         75
TX4            OD4      200     200       -          -
TX5            OD5      50      -         I5         50
Bad Records (PaymentGateway * BankStatement * Ledger)
● Amount Mismatch
● Entry missing in Ledger
● Entry missing in Bank statement
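The three bad-record labels above can be illustrated with a small sketch in plain Python. It is not the ChakraView implementation: the function name `classify`, the tuple layout, and the reading of the blank cells (TX4 has no ledger entry, TX5 has no bank statement entry) are assumptions made for illustration.

```python
from typing import Dict, Optional

# Rows from the slide: (transaction_id, pg_amount, bank_amount, ledger_amount).
# None marks a source with no entry; which cells are blank is an assumption.
ROWS = [
    ("TX1", 100, 100, 10),
    ("TX2", 50, 50, 50),
    ("TX3", 75, 75, 75),
    ("TX4", 200, 200, None),
    ("TX5", 50, None, 50),
]


def classify(pg: Optional[int], bank: Optional[int],
             ledger: Optional[int]) -> Optional[str]:
    """Return a bad-record label, or None when all three sources agree."""
    if ledger is None:
        return "Entry missing in Ledger"
    if bank is None:
        return "Entry missing in Bank statement"
    if not (pg == bank == ledger):
        return "Amount Mismatch"
    return None


# Keep only the transactions that received a label.
bad_records: Dict[str, str] = {}
for tx, pg, bank, ledger in ROWS:
    label = classify(pg, bank, ledger)
    if label is not None:
        bad_records[tx] = label

print(bad_records)
```

With the sample rows, TX1 is flagged as an amount mismatch, TX4 as missing in the ledger, and TX5 as missing in the bank statement; TX2 and TX3 pass.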
Salient features
● Abstract templates
○ Null check
○ Datatype compliance
○ Aggregated check
○ Range check
○ Cross comparison check
● Filter and transformation support
○ Exclude few records
○ Case-insensitive conversion
● Construct target dataframe
● Row level results
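One way to picture abstract templates with row-level results is as small check factories applied to every row. The sketch below is a plain-Python illustration, not the framework's API: `null_check`, `range_check`, `datatype_check`, and `run_checks` are hypothetical names, and the real framework runs on Spark dataframes rather than lists of dicts.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Check = Callable[[Row], bool]


def null_check(column: str) -> Check:
    """Template: the column must be present and not None."""
    return lambda row: row.get(column) is not None


def range_check(column: str, low: float, high: float) -> Check:
    """Template: the column must be a number within [low, high]."""
    def check(row: Row) -> bool:
        value = row.get(column)
        return isinstance(value, (int, float)) and low <= value <= high
    return check


def datatype_check(column: str, expected_type: type) -> Check:
    """Template: the column must be an instance of expected_type."""
    return lambda row: isinstance(row.get(column), expected_type)


def run_checks(rows: List[Row], checks: Dict[str, Check]) -> List[Dict[str, Any]]:
    """Row-level results: one entry per failing (row index, check name) pair."""
    failures = []
    for index, row in enumerate(rows):
        for name, check in checks.items():
            if not check(row):
                failures.append({"row": index, "check": name})
    return failures


rows = [
    {"transaction_id": "TX1", "amount": 100},
    {"transaction_id": None, "amount": -5},
    {"transaction_id": "TX3", "amount": "75"},
]
checks = {
    "id_not_null": null_check("transaction_id"),
    "amount_in_range": range_check("amount", 0, 1000),
    "amount_is_int": datatype_check("amount", int),
}
failures = run_checks(rows, checks)
print(failures)
```

Each template is configured once and reused across datasets, which is roughly what makes self-serve onboarding possible.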
Validations UI
Sample Validation
{
  "fact": [{
    "fact_1": "payment_gateway",
    "fact_2": "ledger",
    "join_type": "full_outer_join",
    "join_columns": [{
      "fact_1_column": "transaction_id",
      "fact_2_column": "transaction_id",
      "operator": "equal"
    }]
  }],
  "group_by_columns": ["transaction_id"],
  "idempotency_columns": ["transaction_id"],
  "validation_configurations": [{
    "name": "amount_recon",
    "operator": "equal",
    "expression_list": [
      {
        "expression": {
          "operator": "amount",
          "terminal": "pg_amount"
        }
      },
      {
        "expression": {
          "operator": "l.amount",
          "terminal": "ledger_amount"
        }
      }
    ]
  }]
}
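To make the sample validation concrete, here is a rough sketch of how such a configuration might be evaluated. The slides do not define the semantics of the fields, so this assumes that each expression's "terminal" names a column in the joined row and that the validation-level "operator": "equal" requires all terminals to match. The helpers `full_outer_join` and `failed_rows` are hypothetical, and plain lists of dicts stand in for Spark dataframes.

```python
import json

# A trimmed copy of the sample validation from the slide.
CONFIG = json.loads("""
{
  "fact": [{
    "fact_1": "payment_gateway",
    "fact_2": "ledger",
    "join_type": "full_outer_join",
    "join_columns": [{
      "fact_1_column": "transaction_id",
      "fact_2_column": "transaction_id",
      "operator": "equal"
    }]
  }],
  "validation_configurations": [{
    "name": "amount_recon",
    "operator": "equal",
    "expression_list": [
      {"expression": {"operator": "amount", "terminal": "pg_amount"}},
      {"expression": {"operator": "l.amount", "terminal": "ledger_amount"}}
    ]
  }]
}
""")


def full_outer_join(left, right, left_key, right_key):
    """Join two lists of dicts on one key, keeping unmatched rows from both sides."""
    right_by_key = {row[right_key]: row for row in right}
    seen = set()
    joined = []
    for row in left:
        key = row[left_key]
        seen.add(key)
        joined.append({**row, **right_by_key.get(key, {})})
    for key, row in right_by_key.items():
        if key not in seen:
            # Right-only row: expose its key under the left key name as well.
            joined.append({left_key: key, **row})
    return joined


def failed_rows(config, facts):
    """Return (validation name, join key) for every row that fails a validation."""
    join = config["fact"][0]
    cols = join["join_columns"][0]
    rows = full_outer_join(facts[join["fact_1"]], facts[join["fact_2"]],
                           cols["fact_1_column"], cols["fact_2_column"])
    failures = []
    for validation in config["validation_configurations"]:
        terminals = [e["expression"]["terminal"]
                     for e in validation["expression_list"]]
        for row in rows:
            values = [row.get(name) for name in terminals]
            # Assumed meaning of "equal": every terminal has the same value.
            if validation["operator"] == "equal" and len(set(values)) != 1:
                failures.append((validation["name"], row[cols["fact_1_column"]]))
    return failures


facts = {
    "payment_gateway": [
        {"transaction_id": "TX1", "pg_amount": 100},
        {"transaction_id": "TX2", "pg_amount": 50},
        {"transaction_id": "TX4", "pg_amount": 200},
    ],
    "ledger": [
        {"transaction_id": "TX1", "ledger_amount": 10},
        {"transaction_id": "TX2", "ledger_amount": 50},
    ],
}
result = failed_rows(CONFIG, facts)
print(result)
```

The full outer join matters here: TX4 has no ledger row, so an inner join would silently drop it instead of reporting it as a failure.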
Data Flow
Components shown in the diagram:
● Fact refresh
● Trigger from Azkaban
● Run Spark job
● Publish validation failures
● Dashboard
● Datastore
● Template Library
● Validation Configuration
Until now we were blissfully ignorant. Now we spend multiple man-hours
categorising the bad records.
Root Cause Analysis (RCA)
Bank Statement * Ledger
TransactionId  OrderId  B.Amount  InvoiceId  L.Amount  Category
TX1            OD1      100       I1         10        Amount wrong in Ledger entry
TX5            OD4      200       -          -         Upstream Failure - Payments
TX6            OD6      -         I6         50        File upload issue
Combinatorial explosion
● The cycle is longer for big data due to the complexity of the system
● Time consuming
● Error prone
● Humanly impossible
● Real-time systems have ELK-style tools
● No comparable tools are available for RCA on big data
How do we make this operation cheap?
Auto-RCA
● Enrich logs and data from main pipeline
Enrichments
{
  "commerce_activity": {
    "activityType": "create_ledger",
    "activityId": "TX12345",
    "payload": "{\"event\":\"create_ledger\",\"entity_id\":\"TX12345\"}",
    "eventStatus": "ERRORED",
    "retryCount": 0
  },
  "error_details": {
    "activityType": "create_ledger",
    "activityId": "TX12345",
    "errorCode": "503",
    "errorDescription": "Error: EnricherException{statusCode=503}",
    "sourceSystem": "IRN",
    "upstreamUriSignature": "/payment/<transaction>",
    "upstreamUrl": "/payment/TX12345",
    "upstreamHttpMethod": "GET",
    "upstreamHeader": null,
    "upstreamPayload": null,
    "errorStatus": "OPEN",
    "failureCount": null
  }
}
Auto-RCA
● Perform 5 Whys RCA
● Hierarchical categorisation
● Leaf category -> Unique issues
Category tree (node labels from the diagram):
● Unclassified
● Amount mismatch
● Missing entries
● Missing entries in Bank statement
● Missing entries in ledger
● Issue in invoice creation
● Issue in Bank statement
● Event processing failure
● Event not arrived
● Wrong value in file
● File upload issue
● Data not pushed to analytical store
● Unclassified
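A minimal sketch of how an enrichment record could drive the categorisation down to a leaf is given below. The rules are hypothetical: the slides do not state which fields decide which category, and the parent-to-leaf pairing is an assumption because the tree's edges are not recoverable from the slide text. The function name `categorise` is also made up for illustration.

```python
from typing import Optional, Tuple

# Trimmed enrichment record, following the shape shown on the Enrichments slide.
ENRICHMENT = {
    "commerce_activity": {
        "activityType": "create_ledger",
        "activityId": "TX12345",
        "eventStatus": "ERRORED",
        "retryCount": 0,
    },
    "error_details": {"errorCode": "503", "sourceSystem": "IRN"},
}


def categorise(validation_failure: str,
               enrichment: Optional[dict]) -> Tuple[str, str]:
    """Return (parent category, leaf category) using assumed rules."""
    if enrichment is None:
        # Assumed rule: no pipeline activity recorded for this entity.
        return (validation_failure, "Event not arrived")
    status = enrichment.get("commerce_activity", {}).get("eventStatus")
    if status == "ERRORED":
        # Assumed rule: the event reached the pipeline but processing failed.
        return (validation_failure, "Event processing failure")
    # Nothing matched: leave the record for manual RCA.
    return (validation_failure, "Unclassified")


print(categorise("Missing entries in ledger", ENRICHMENT))
print(categorise("Missing entries in ledger", None))
```

Records that end up as "Unclassified" are the ones that still need manual analysis; adding a rule for them grows the tree by one leaf.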
Fixture
● Can we automate cleaning the data?
Fixture
Leaf categories:
● Event processing failure
● Event not arrived
● Wrong value in file
● File upload issue
● Data not pushed to analytical store
Recipes:
● reprocess_event
● replay_event
● reprocess_file
● republish_ledger_entry
Fixture
{
"flowName": "debtor_flow",
"categoryName": "Event processing failure",
"recipeName": "reprocess_event"
}
Fixture
● Recipes - Library of functions that automate the cleansing
● Leaf Category -> Recipe
● Sample Recipes
○ Reverse
○ Retry
○ Restore
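The leaf-category-to-recipe mapping can be sketched as a small registry lookup. This is an illustration under assumptions, not the actual Fixture code: the `recipe` decorator, the `fix` function, and the body of `reprocess_event` are hypothetical, and only the fixture entry itself is taken from the slide.

```python
from typing import Callable, Dict, Optional

# Library of recipes, keyed by recipe name.
RECIPES: Dict[str, Callable[[str], str]] = {}


def recipe(name: str):
    """Register a function in the recipe library under the given name."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        RECIPES[name] = func
        return func
    return register


@recipe("reprocess_event")
def reprocess_event(entity_id: str) -> str:
    # Placeholder body: a real recipe would call the owning system.
    return f"reprocess requested for {entity_id}"


# Fixture configuration as shown on the slide.
FIXTURES = [
    {
        "flowName": "debtor_flow",
        "categoryName": "Event processing failure",
        "recipeName": "reprocess_event",
    },
]


def fix(flow_name: str, category_name: str, entity_id: str) -> Optional[str]:
    """Run the recipe configured for this flow and leaf category, if any."""
    for fixture in FIXTURES:
        if (fixture["flowName"] == flow_name
                and fixture["categoryName"] == category_name):
            return RECIPES[fixture["recipeName"]](entity_id)
    return None


print(fix("debtor_flow", "Event processing failure", "TX12345"))
print(fix("debtor_flow", "Unclassified", "TX1"))
```

Keeping the mapping in configuration means a new leaf category can be wired to an existing recipe without a code change.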
Architecture
● Man-days reduced to a few hours
● Reactive to proactive
● Dev-friendly
● People independent
● Complete visibility
Next Steps
● Open source
● Data observability
● Performance optimisation
Questions?
