Enriching the data vs Filtering in Spark
Gokul Prabagaren
Master Software Engineer
Capital One
About Me
● Master Software Engineer @ Capital One
● Building Spark applications since Spark 1.2
● Contributor to the @CapitalOneTech Medium
blog on big data processing
● @gocoolp on Twitter
● @gocool_p on LinkedIn
Agenda
● Rewards use case at Capital One
● Filtering approach
● Issues with the Filtering approach
● How the Enriching approach solves these issues
● Conclusion & Questions
Rewards Use Case at Capital One
▪ Capital One builds its software open-source
first and in the cloud.
▪ We will soon be operating fully in the cloud!
▪ We use Apache Spark extensively for a variety
of batch, streaming, and machine-learning
workloads.
Rewards Use Case at Capital One
▪ Use case
▪ One of the core credit card Rewards Spark
applications.
▪ Consumes daily credit card transactions
and computes the rewards.
Filtering the Data Approach
● This approach uses a Spark inner join
at each stage, so rows without a match are dropped at that stage
Filtering Approach Example
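The slide's original example is not available in this transcript. As a stand-in, here is a minimal plain-Python sketch of what inner-join filtering does to a dataset. It models the join semantics only and is not Spark code; in PySpark the equivalent step would be `DataFrame.join(..., how="inner")`. The datasets, field names, and the `inner_join` helper are hypothetical and chosen only for illustration.

```python
# Plain-Python model of a filtering pipeline built from inner joins.
# All records, field names, and helpers here are hypothetical.

def inner_join(left, right, key):
    """Keep only left rows whose key matches a right row; merge the columns."""
    lookup = {row[key]: row for row in right}
    return [{**l, **lookup[l[key]]} for l in left if l[key] in lookup]

transactions = [
    {"txn_id": 1, "account_id": "A", "merchant_id": "M1", "amount": 100.0},
    {"txn_id": 2, "account_id": "B", "merchant_id": "M1", "amount": 50.0},
    {"txn_id": 3, "account_id": "A", "merchant_id": "M2", "amount": 20.0},
]
eligible_accounts = [{"account_id": "A", "rewards_program": "miles"}]
eligible_merchants = [{"merchant_id": "M1", "category": "travel"}]

# Each stage silently drops the rows that have no match on the right side.
stage1 = inner_join(transactions, eligible_accounts, "account_id")
stage2 = inner_join(stage1, eligible_merchants, "merchant_id")

# Counts show how many rows survived each stage, but not why rows were lost.
print(len(transactions), len(stage1), len(stage2))  # 3 2 1
print([row["txn_id"] for row in stage2])            # [1]
```

After the second stage only transaction 1 remains, and nothing in the output records which stage removed transactions 2 and 3.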
Issues with the Filtering Approach
▪ Hard to debug the application after deployment
▪ Back-tracing data is not possible because the computation happens in memory
▪ Counts at each stage only show how many records were processed, not why the remaining records
were dropped in that stage.
How did we overcome these issues?
Enriching the Data Approach
● This approach uses a Spark left outer join
● Instead of filtering data out of the dataset at
each stage, the Enriching approach keeps every row
and adds columns from the right-side dataset
Enriching Approach Example
Enriching Approach Example (continued)
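The example slides themselves are not part of this transcript. The following plain-Python sketch shows, under assumed data, how a left outer join keeps every input row and records a missing match as an empty value. It models the join semantics only; in PySpark the corresponding step would be `DataFrame.join(..., how="left_outer")`. The datasets, field names, and the `left_outer_join` helper are hypothetical.

```python
# Plain-Python model of an enriching pipeline built from left outer joins.
# All records, field names, and helpers here are hypothetical.

def left_outer_join(left, right, key, right_columns):
    """Keep every left row; fill right_columns with None when there is no match."""
    lookup = {row[key]: row for row in right}
    joined = []
    for l in left:
        match = lookup.get(l[key], {})
        joined.append({**l, **{c: match.get(c) for c in right_columns}})
    return joined

transactions = [
    {"txn_id": 1, "account_id": "A", "merchant_id": "M1", "amount": 100.0},
    {"txn_id": 2, "account_id": "B", "merchant_id": "M1", "amount": 50.0},
    {"txn_id": 3, "account_id": "A", "merchant_id": "M2", "amount": 20.0},
]
eligible_accounts = [{"account_id": "A", "rewards_program": "miles"}]
eligible_merchants = [{"merchant_id": "M1", "category": "travel"}]

# Each stage adds a column instead of removing rows.
enriched = left_outer_join(transactions, eligible_accounts,
                           "account_id", ["rewards_program"])
enriched = left_outer_join(enriched, eligible_merchants,
                           "merchant_id", ["category"])

for row in enriched:
    print(row["txn_id"], row["rewards_program"], row["category"])
# 1 miles travel
# 2 None travel
# 3 miles None
```

All three transactions are still present, and the `None` values show which lookup each one failed.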
Advantages of Enriching over Filtering
● Data from each stage is enriched into the original dataset. It captures state information, which makes it
easy to debug and analyse later
● The data columns/flags captured at each stage give more granular detail on why
particular data was dropped at that stage
● No need for an additional, costly count action at each stage.
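One way to read the second point is that the enriched columns can be turned into an explicit reason per record. The sketch below is plain Python under assumed data, not the production job: the `enriched` rows, the column names, the reason labels, and the `drop_reason` helper are all hypothetical.

```python
# Deriving a per-record drop reason from enrichment columns.
# The rows, column names, and reason labels are hypothetical.

enriched = [
    {"txn_id": 1, "amount": 100.0, "rewards_program": "miles", "category": "travel"},
    {"txn_id": 2, "amount": 50.0, "rewards_program": None, "category": "travel"},
    {"txn_id": 3, "amount": 20.0, "rewards_program": "miles", "category": None},
]

def drop_reason(row):
    """Return the first failed check for a row, or None if it qualifies."""
    if row["rewards_program"] is None:
        return "account_not_eligible"
    if row["category"] is None:
        return "merchant_not_eligible"
    return None

annotated = [{**row, "drop_reason": drop_reason(row)} for row in enriched]

# Filtering happens once, at the end, on the captured state.
rewardable = [row for row in annotated if row["drop_reason"] is None]
reasons = {row["txn_id"]: row["drop_reason"]
           for row in annotated if row["drop_reason"] is not None}

print([row["txn_id"] for row in rewardable])  # [1]
print(reasons)  # {2: 'account_not_eligible', 3: 'merchant_not_eligible'}
```

Because the reasons live in the dataset itself, they can be persisted alongside the results and inspected later without rerunning the job.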
Conclusion
● We switched to the Enriching approach in our production Spark job.
● It successfully processes millions of credit card transactions daily.
● It awards millions of miles, cash, and points as rewards to Capital One customers.