Caryl Yuhas, Databricks
Real-Time Attribution with
Structured Streaming and
Databricks Delta
#ExpSAIS13
Introduction
• Goal:
Provide tools and information
that can help you build more
real-time / lower latency
attribution pipelines
• Crawl, Walk, Run: Pull Model
Caryl: previously MediaMath (SE / PM for Attribution); SA for Databricks
Getting Started
• What is Attribution?
Image Source: www.mediamath.com
Introduction
What is Databricks Delta?
Delta is a data management capability that
brings data reliability and performance
optimizations to the cloud data lake.
Stream-to-Sink BEFORE
[Diagram: pipeline from Events through Streaming to a Data Lake serving Analytics and Reporting, with numbered callouts (1-4) for λ-arch, Validation, Reprocessing, and Compaction. Other labels: Partitioned, Compact Small Files, Scheduled to Avoid Compaction.]
Stream-to-Sink AFTER
[Diagram: the same pipeline from Events through Streaming to Analytics and Reporting, with DELTA added. Numbered callouts (1-4) label λ-arch, Validation, Reprocessing, and Compaction. New labels: Optimize, ZOrder. Other labels: Partitioned, Compact Small Files.]
Attribution in Practice
[Diagram: impressions JOIN conversions → attributed impressions]
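The join on this slide can be sketched in plain Python. This is a minimal illustration of last-touch attribution, not the pipeline from the talk: the record fields (`user_id`, `campaign`, `ts`), the `last_touch` helper, and the one-hour window are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impression:
    user_id: str
    campaign: str
    ts: int  # event time in seconds (hypothetical field)


@dataclass(frozen=True)
class Conversion:
    user_id: str
    order_id: str
    ts: int


def last_touch(impressions, conversions, window_s):
    """Join each conversion to the latest impression for the same user
    that falls inside the attribution window before the conversion."""
    by_user = {}
    for imp in impressions:
        by_user.setdefault(imp.user_id, []).append(imp)

    attributed = []
    for conv in conversions:
        candidates = [
            imp for imp in by_user.get(conv.user_id, [])
            if conv.ts - window_s <= imp.ts <= conv.ts
        ]
        if candidates:
            winner = max(candidates, key=lambda imp: imp.ts)
            attributed.append((conv.order_id, winner.campaign))
    return attributed


impressions = [
    Impression("u1", "spring_sale", 100),
    Impression("u1", "retargeting", 250),
    Impression("u2", "spring_sale", 50),
]
conversions = [
    Conversion("u1", "o-1", 300),   # two impressions in the window
    Conversion("u2", "o-2", 5000),  # only impression is outside the window
]

result = last_touch(impressions, conversions, window_s=3600)
print(result)
```

Order `o-1` is credited to the most recent impression, and `o-2` stays unattributed because its only impression is older than the window.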
Attribution Challenges
Scale
• Often dealing with millions to billions of data
points per attribution window
Complexity
• Simple, last-click model is still common
• MTA (multi-touch attribution) and more sophisticated attribution models are on the rise
High Level Attribution Pipeline
Data Architecture
[Diagram labels:]
• impression stream → impressions table
• conversion stream → conversions table
• attributed table (last touch)
• attributed table (weighted)
• attribution views (filters, logic, etc.)
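The two attributed tables can be derived from the same joined rows. The sketch below assumes a hypothetical input of `(conversion_id, campaign)` pairs, one per impression in a conversion path and ordered by event time, and uses an even (linear) split as one possible weighting; the talk does not say which weighting it uses.

```python
from collections import defaultdict


def group_paths(touches):
    """Group (conversion_id, campaign) rows into one ordered path per conversion."""
    paths = defaultdict(list)
    for conv_id, campaign in touches:
        paths[conv_id].append(campaign)
    return paths


def last_touch_credit(touches):
    """Give the full credit for each conversion to the final impression in its path."""
    credit = defaultdict(float)
    for campaigns in group_paths(touches).values():
        credit[campaigns[-1]] += 1.0
    return dict(credit)


def linear_credit(touches):
    """Split the credit for each conversion evenly across its impressions."""
    credit = defaultdict(float)
    for campaigns in group_paths(touches).values():
        share = 1.0 / len(campaigns)
        for campaign in campaigns:
            credit[campaign] += share
    return dict(credit)


# Hypothetical joined rows, already ordered by impression time.
touches = [
    ("o-1", "display"),
    ("o-1", "search"),
    ("o-2", "search"),
]

last = last_touch_credit(touches)
linear = linear_credit(touches)
print(last)
print(linear)
```

Under last touch, `search` receives both conversions. Under the linear split, `display` keeps half of `o-1`. Filters and other logic can then sit in views on top of either table.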
System Architecture
[Diagram: Amazon Kinesis and Structured Streaming]
Unification of Streaming + Batch
DEMO
Managing Performance
• How can we optimize performance?
• Levers:
– Delta Tools
• Optimize
• ZOrder
• Caching
• Data Skipping
– Join on Stream
– Cluster Size
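To see why ZOrder and data skipping work together, the following sketch models files as chunks of rows with per-column min/max statistics. It illustrates the general idea only and is not Delta's implementation: the helpers `interleave_bits`, `write_files`, and `files_to_scan`, the 8x8 grid of values, and the file size of four rows are all assumptions.

```python
def interleave_bits(x, y, bits=8):
    """Morton (Z-order) value: interleave the bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def write_files(rows, sort_key, rows_per_file):
    """Sort rows, cut them into fixed-size 'files', and record min/max per column."""
    ordered = sorted(rows, key=sort_key)
    files = []
    for start in range(0, len(ordered), rows_per_file):
        chunk = ordered[start:start + rows_per_file]
        files.append({
            "rows": chunk,
            "min": tuple(min(r[c] for r in chunk) for c in (0, 1)),
            "max": tuple(max(r[c] for r in chunk) for c in (0, 1)),
        })
    return files


def files_to_scan(files, col, value):
    """Data skipping: keep only files whose min/max range can contain the value."""
    return [f for f in files if f["min"][col] <= value <= f["max"][col]]


rows = [(x, y) for x in range(8) for y in range(8)]

# Layout A: sorted by the first column only.
linear_files = write_files(rows, lambda r: r[0], rows_per_file=4)
# Layout B: sorted by the Z-order value of both columns.
z_files = write_files(rows, lambda r: interleave_bits(r[0], r[1]), rows_per_file=4)

for name, files in (("linear", linear_files), ("zorder", z_files)):
    print(name,
          "x == 5 scans", len(files_to_scan(files, 0, 5)),
          "| y == 5 scans", len(files_to_scan(files, 1, 5)),
          "of", len(files), "files")
```

Sorting on one column makes filters on that column cheap and filters on the other column expensive. The Z-order layout trades a little of the first for a large gain on the second, which helps when queries filter on more than one column.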
Handling Complexity
• Flexibility with Complex Logic
– Forking streams
– Logic on query vs. in-stream
• Late or Corrected Data
– Upserts
– Views automatically update when raw data changes
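The interaction between upserts and views can be shown with a small in-memory sketch. Here the table is a plain dict keyed by a hypothetical `order_id`, `upsert` stands in for a merge operation, and the "view" is a function evaluated on every read; none of these names come from the talk.

```python
def upsert(table, updates, key):
    """Merge updates into the table: overwrite fields on matching keys, insert new keys."""
    for row in updates:
        existing = table.get(row[key], {})
        table[row[key]] = {**existing, **row}
    return table


def revenue_by_campaign(table):
    """A 'view': recomputed from the current table contents on every read."""
    totals = {}
    for row in table.values():
        totals[row["campaign"]] = totals.get(row["campaign"], 0) + row["revenue"]
    return totals


attributed = {}
upsert(attributed, [
    {"order_id": "o-1", "campaign": "search", "revenue": 100},
    {"order_id": "o-2", "campaign": "display", "revenue": 40},
], key="order_id")
before = revenue_by_campaign(attributed)

# A correction for o-1 and a late-arriving o-3 land in one batch.
upsert(attributed, [
    {"order_id": "o-1", "revenue": 80},
    {"order_id": "o-3", "campaign": "display", "revenue": 10},
], key="order_id")
after = revenue_by_campaign(attributed)

print(before)
print(after)
```

Because the view holds logic rather than data, the corrected and late rows show up in the next read without any reprocessing step.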
Conclusion
• Unification of Batch & Streaming
• Easy APIs for Managing Performance
• Flexible and Scalable Analytics on Near
Real-Time Data

Real-Time Attribution with Structured Streaming and Databricks Delta with Caryl Yuhas