Filtering vs Enriching Data in Apache Spark

Enriching the data vs Filtering in Spark
Gokul Prabagaren
Master Software Engineer
Capital One

About Me
● Master Software Engineer @ Capital One
● Building Spark applications since Spark 1.2
● Contributor to the @CapitalOneTech Medium
blogs on big data processing
● @gocoolp on Twitter
● @gocool_p on LinkedIn

Agenda
● Rewards use case at Capital One
● Filtering Approach
● Issues with the Filtering Approach
● How the Enriching Approach solves these issues
● Conclusion & Questions

Rewards Use Case at Capital One
▪ Capital One develops its software open
source first, in the cloud.
▪ We will soon be operating fully in the cloud!
▪ We use Apache Spark extensively for a variety
of batch, streaming, and machine-learning
workloads.

Rewards Use Case at Capital One
▪ Use case
▪ One of our core credit card rewards Spark
applications.
▪ Consumes daily credit card transactions
and computes the rewards.

Filtering the Data Approach
● This approach uses a Spark inner join
at each stage, so rows without a match on the
right side are dropped at that stage
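To make the filtering behaviour concrete, here is a minimal sketch in plain Python that stands in for a chain of Spark inner joins, so it runs without a cluster. The datasets, column names, and the `inner_join` helper are hypothetical and are not taken from the Rewards application. In PySpark each stage would correspond to a call such as `transactions.join(eligible_accounts, on="account_id", how="inner")`.

```python
# Plain-Python stand-in for two chained Spark inner joins.
# All data and names here are hypothetical, for illustration only.

def inner_join(left, right, key):
    """Keep only left rows whose key matches a right row, merging the columns."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

transactions = [
    {"txn_id": 1, "account_id": "A", "merchant_id": "M1", "amount": 50.0},
    {"txn_id": 2, "account_id": "B", "merchant_id": "M2", "amount": 20.0},
    {"txn_id": 3, "account_id": "C", "merchant_id": "M1", "amount": 75.0},
]
eligible_accounts = [
    {"account_id": "A", "program": "miles"},
    {"account_id": "B", "program": "cash"},
]
eligible_merchants = [{"merchant_id": "M1", "category": "travel"}]

# Stage 1 silently drops transaction 3 (account C is not eligible).
stage1 = inner_join(transactions, eligible_accounts, "account_id")
# Stage 2 silently drops transaction 2 (merchant M2 is not eligible).
stage2 = inner_join(stage1, eligible_merchants, "merchant_id")

# Only the row counts survive; the dropped rows and the reasons are gone.
print(len(transactions), len(stage1), len(stage2))
```

After the second stage only transaction 1 remains, and nothing in the output records which stage removed the other two.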

Issues with the Filtering Approach
▪ Hard to debug the application after deployment
▪ Tracing data back through the stages is not possible because the computation happens in memory
▪ Counts at each stage only show how many rows were processed, not why the remaining rows were
dropped in that stage.
How did we overcome these issues?

Enriching the Data Approach
● This approach uses a Spark left outer join
● Instead of filtering rows out of the dataset at
each stage, the enriching approach keeps every row
and adds columns from the right-side dataset
Enriching Approach Example
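As an illustrative sketch only, and not the example shown in the talk, the following plain-Python code mimics two chained left outer joins. The data, column names, and the `left_outer_join` helper are hypothetical. In PySpark each stage would correspond to a call such as `transactions.join(eligible_accounts, on="account_id", how="left_outer")`, where unmatched rows get null in the right-side columns.

```python
# Plain-Python stand-in for two chained Spark left outer joins.
# All data and names here are hypothetical, for illustration only.

def left_outer_join(left, right, key, right_columns):
    """Keep every left row; right_columns are None when there is no match."""
    index = {row[key]: row for row in right}
    enriched = []
    for l in left:
        match = index.get(l[key])
        extra = {c: (match[c] if match else None) for c in right_columns}
        enriched.append({**l, **extra})
    return enriched

transactions = [
    {"txn_id": 1, "account_id": "A", "merchant_id": "M1", "amount": 50.0},
    {"txn_id": 2, "account_id": "B", "merchant_id": "M2", "amount": 20.0},
    {"txn_id": 3, "account_id": "C", "merchant_id": "M1", "amount": 75.0},
]
eligible_accounts = [
    {"account_id": "A", "program": "miles"},
    {"account_id": "B", "program": "cash"},
]
eligible_merchants = [{"merchant_id": "M1", "category": "travel"}]

# Every stage keeps all three transactions and only adds columns.
stage1 = left_outer_join(transactions, eligible_accounts, "account_id", ["program"])
stage2 = left_outer_join(stage1, eligible_merchants, "merchant_id", ["category"])

# The final selection happens once, at the end, on the enriched columns.
rewardable = [
    r for r in stage2
    if r["program"] is not None and r["category"] is not None
]

for row in stage2:
    print(row["txn_id"], row["program"], row["category"])
```

The non-rewardable transactions are still present in `stage2`, with a `None` in the column of the stage that did not match, which is what makes later debugging possible.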

Advantages of Enriching over Filtering
● Data from each stage is enriched into the original dataset. It captures the state information, which makes it
easy to debug and analyse later
● The data columns/flags captured at each stage give more granular detail about why
a particular row was dropped at that stage
● No need for an additional, costly count action at each stage.
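The points above can be sketched in plain Python: once every row carries the columns added at each stage, a single pass can label why a row was not rewarded and summarise the outcome, without a separate count per stage. The rows, column names, and reason labels below are hypothetical.

```python
from collections import Counter

# Hypothetical output of an enriching pipeline: every input row is still
# present, and None marks a stage where the row found no match.
enriched = [
    {"txn_id": 1, "program": "miles", "category": "travel"},
    {"txn_id": 2, "program": "cash", "category": None},
    {"txn_id": 3, "program": None, "category": "travel"},
    {"txn_id": 4, "program": None, "category": None},
]

def drop_reason(row):
    """Return the first stage at which the row failed, or 'rewarded'."""
    if row["program"] is None:
        return "account_not_eligible"
    if row["category"] is None:
        return "merchant_not_eligible"
    return "rewarded"

# Attach the reason to each row so it can be inspected later.
for row in enriched:
    row["status"] = drop_reason(row)

# One pass over the final dataset replaces the per-stage counts.
summary = Counter(row["status"] for row in enriched)
print(dict(summary))
```

Because the reason is stored on the row itself, an individual transaction can be looked up after the run to see exactly which stage excluded it.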

Conclusion
● We switched our production Spark job to the enriching approach.
● It is successfully processing millions of credit card transactions daily.
● It awards millions of miles, cash, and points as rewards to Capital One customers.
