Building Event Collection
SDKs and Data Models
OSA Con ‘22
Paul Boocock
@paul_boocock
2022 www.snowplow.io
Table of contents
_01 What is Snowplow?
    A Quick Introduction
_02 Tracking SDKs
    How we build them and our decisions
_03 Data Models
    Considerations for modeling the raw data
_01 What is Snowplow?
A quick intro
Snowplow: we build tech to enable companies to Create Data
_02 Trackers
But first, a small detour…
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "JSON schema for a button click event",
  "self": {
    "vendor": "com.acme",
    "name": "click",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "button": {
      "type": ["string", "null"],
      "maxLength": 255
    }
  },
  "additionalProperties": false
}
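As a rough sketch of how a tracked payload could point back at the schema above, the snippet below wraps the data in a self-describing JSON envelope whose `schema` field is an Iglu URI built from the schema's `self` block. The helper names `iglu_uri` and `self_describing` are illustrative only and are not taken from any Snowplow SDK.

```python
import json

# The "self" block copied from the click schema above.
SCHEMA_SELF = {
    "vendor": "com.acme",
    "name": "click",
    "format": "jsonschema",
    "version": "1-0-0",
}


def iglu_uri(self_block: dict) -> str:
    """Build a URI of the form iglu:vendor/name/format/version."""
    return "iglu:{vendor}/{name}/{format}/{version}".format(**self_block)


def self_describing(self_block: dict, data: dict) -> dict:
    """Wrap a payload together with a reference to the schema describing it."""
    return {"schema": iglu_uri(self_block), "data": data}


event = self_describing(SCHEMA_SELF, {"button": "open_article"})
print(json.dumps(event))
```

Because the envelope names the exact schema version, anything downstream can look the schema up and validate the `data` part without guessing its shape.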
The data created looks a bit like this

Default:       user_id, page_url, timestamp, device, city, mkt_campaign
Custom Entity: content_id, content_title, author, date_created
Custom Event:  button_click
But can crucially be evolved over time

Version 1-0-0:

{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "JSON schema for a button click event",
  "self": {
    "vendor": "com.acme",
    "name": "click",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "button": {
      "type": ["string", "null"],
      "maxLength": 255
    }
  },
  "additionalProperties": false
}

Version 1-0-1:

{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "JSON schema for a button click event",
  "self": {
    "vendor": "com.acme",
    "name": "click",
    "format": "jsonschema",
    "version": "1-0-1"
  },
  "type": "object",
  "properties": {
    "button": {
      "type": ["string", "null"],
      "maxLength": 255
    },
    "index": {
      "type": ["integer", "null"]
    }
  },
  "additionalProperties": false
}
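To make the evolution concrete, here is a minimal sketch that checks payloads against both versions of the click schema. The validator is hand-rolled for illustration and only understands the keywords that appear in these two schemas (`type` given as a list, `maxLength`, `properties`, `additionalProperties`); a real pipeline would use a full JSON Schema validator.

```python
# Only the parts of each schema the toy validator needs.
V_1_0_0 = {
    "properties": {"button": {"type": ["string", "null"], "maxLength": 255}},
    "additionalProperties": False,
}
V_1_0_1 = {
    "properties": {
        "button": {"type": ["string", "null"], "maxLength": 255},
        "index": {"type": ["integer", "null"]},
    },
    "additionalProperties": False,
}

JSON_TYPES = {"string": str, "integer": int, "null": type(None)}


def matches_type(value, allowed) -> bool:
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        return False
    return any(isinstance(value, JSON_TYPES[t]) for t in allowed)


def is_valid(payload: dict, schema: dict) -> bool:
    props = schema["properties"]
    extra = set(payload) - set(props)
    if extra and schema.get("additionalProperties") is False:
        return False
    for key, value in payload.items():
        rule = props.get(key)
        if rule is None:
            continue  # unknown key, allowed because additionalProperties is not false
        if not matches_type(value, rule["type"]):
            return False
        if isinstance(value, str) and "maxLength" in rule and len(value) > rule["maxLength"]:
            return False
    return True


old_event = {"button": "open_article"}
new_event = {"button": "open_article", "index": 2}

print(is_valid(old_event, V_1_0_0), is_valid(old_event, V_1_0_1))
print(is_valid(new_event, V_1_0_0), is_valid(new_event, V_1_0_1))
```

Events written for 1-0-0 still pass under 1-0-1 because the new `index` field is optional and nullable, which is what lets both versions share one home in the warehouse.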
The real benefit of that approach is how the data then looks

With schema 1-0-0:

Default fields (1/130) | Custom Event data
event_name             | unstruct_event_com_acme_click_1
click                  | { "button": "open_article" }

After adding schema 1-0-1 (same column):

Default fields (1/130) | Custom Event data
event_name             | unstruct_event_com_acme_click_1
click                  | { "button": "open_article" }
click                  | { "button": "open_article", "index": 2 }
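The column name in the table above appears to be derived from the schema's vendor, name and major version. The function below is a simplified sketch of that idea, inferred only from the `unstruct_event_com_acme_click_1` example; the actual loaders may normalise names differently, so treat `unstruct_event_column` as a hypothetical helper.

```python
import re


def unstruct_event_column(vendor: str, name: str, version: str) -> str:
    """Derive a column name such as unstruct_event_com_acme_click_1."""
    model = version.split("-")[0]  # keep only the first (major) version part

    def clean(part: str) -> str:
        # Assumption: anything that is not a lowercase letter or digit becomes "_".
        return re.sub(r"[^a-z0-9]+", "_", part.lower())

    return f"unstruct_event_{clean(vendor)}_{clean(name)}_{model}"


print(unstruct_event_column("com.acme", "click", "1-0-0"))
print(unstruct_event_column("com.acme", "click", "1-0-1"))
```

Both versions map to the same column because only the major version takes part in the name, which matches the table where rows from 1-0-0 and 1-0-1 sit side by side.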
_02.1 Tracking SDKs
Snowplow Tracking SDKs
An Introduction
Multi-language Support
1. Quickly developed beyond a web-only platform
   The Web Tracker is unique, as the web poses different challenges from other platforms
   So, what do we do?
2. Tempting to autogenerate
   - Works well for some trackers (server)
   - Less well for others (client)
   https://quicktype.io/
Building SDKs
- Hand craft SDKs
- Autogenerate SDKs
- Can we do both?
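One way to picture the "autogenerate" option is to derive a typed object straight from an event schema. The sketch below builds a Python dataclass at runtime from the click schema shown earlier; `dataclass_from_schema` is a hypothetical helper and it only handles the `string`, `integer` and `null` types used there. A real generator would more likely emit source code so that editors and type checkers can see the fields.

```python
from dataclasses import asdict, field, make_dataclass
from typing import Optional

CLICK_SCHEMA = {
    "self": {"vendor": "com.acme", "name": "click", "version": "1-0-1"},
    "properties": {
        "button": {"type": ["string", "null"], "maxLength": 255},
        "index": {"type": ["integer", "null"]},
    },
}

JSON_TO_PY = {"string": str, "integer": int}


def dataclass_from_schema(schema: dict):
    """Create a dataclass with one field per schema property."""
    required, optional = [], []
    for prop, rule in schema["properties"].items():
        py_type = JSON_TO_PY[next(t for t in rule["type"] if t != "null")]
        if "null" in rule["type"]:
            # Nullable properties become optional fields defaulting to None.
            optional.append((prop, Optional[py_type], field(default=None)))
        else:
            required.append((prop, py_type))
    # Fields without defaults must come first in a dataclass.
    class_name = schema["self"]["name"].capitalize()
    return make_dataclass(class_name, required + optional)


Click = dataclass_from_schema(CLICK_SCHEMA)
event = Click(button="open_article", index=2)
print(asdict(event))
```

This covers the data shape, which is where generation tends to work well; platform-specific behaviour in client trackers would still need hand-written code around it.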
Server vs Client SDKs
How do they differ?
Configuration
Challenges
Sensible defaults
Whilst the SDKs are incredibly configurable, opting for sensible defaults keeps users happy
Where possible we now try to be more opinionated and don't offer configuration
Works well until someone asks us on GitHub if they can configure it!
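A small sketch of what "sensible defaults" can look like in code: one required value and everything else pre-filled. Every field name and default here is hypothetical and chosen for illustration; none of them is taken from a Snowplow tracker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackerConfig:
    # The only value a caller has to supply (hypothetical field).
    collector_url: str
    # Everything else has an opinionated default (all hypothetical).
    app_id: str = "default"
    batch_size: int = 10
    heartbeat_seconds: int = 10


# Most users only ever write this:
config = TrackerConfig(collector_url="https://collector.example.com")

# The escape hatch still exists for those who ask for it:
tuned = TrackerConfig(collector_url="https://collector.example.com", batch_size=50)
print(config)
print(tuned)
```

Each extra field widens the range of data shapes the downstream models have to cope with, which is one reason to keep the list short.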
_03 Data Models
Data modeling is the process of using business logic to aggregate or otherwise transform raw data.
Modeling is not an afterthought
1. Process clickstream data from raw events to create aggregate tables of views, sessions and users, reducing the volume of data massively while adding quality
2. Deal with user stitching across sessions and devices
3. Compatible with schema evolution, so when you add new fields to your contexts they immediately show up in your data models

Rows relative to raw events:
raw events: baseline
views:      20%
sessions:   6%
users:      3%
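The user stitching point can be sketched as follows, assuming each event carries a device-level identifier (`domain_userid`) and, once the user has logged in, a `user_id`. The rule used here, "map each device to the most recent non-null `user_id` seen on it", is an assumption made for illustration and not necessarily the exact logic of the Snowplow models.

```python
def build_user_mapping(events):
    """Map each domain_userid to the latest non-null user_id seen for it."""
    mapping = {}
    for e in sorted(events, key=lambda e: e["tstamp"]):
        if e["user_id"] is not None:
            mapping[e["domain_userid"]] = e["user_id"]
    return mapping


def stitch(events, mapping):
    """Add stitched_user_id, falling back to the device id when unknown."""
    return [
        {**e, "stitched_user_id": mapping.get(e["domain_userid"], e["domain_userid"])}
        for e in events
    ]


events = [
    {"domain_userid": "A", "user_id": None, "tstamp": 1},  # anonymous visit on device A
    {"domain_userid": "A", "user_id": "u1", "tstamp": 2},  # same device, now logged in
    {"domain_userid": "B", "user_id": "u1", "tstamp": 3},  # second device, same user
    {"domain_userid": "C", "user_id": None, "tstamp": 4},  # never identified
]

mapping = build_user_mapping(events)
stitched_ids = [e["stitched_user_id"] for e in stitch(events, mapping)]
print(mapping)
print(stitched_ids)
```

The anonymous event on device A is attributed to `u1` retroactively, and devices A and B collapse into one user, which is the cross-session and cross-device effect described in point 2.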
This end-to-end approach results in
● Clean, structured data always available in your warehouse
● Consistent business logic defined once means no qualms about what metrics mean
● More time spent building out ML/AI models instead of toiling with data problems
Raw Data
What does the data look like
Structure data to match your product

Events: search, video_play, view, video_pause, heartbeat, click

Entity attached to events: content
    id: ‘abc123’
    type: ‘video’
    date updated: 2019-06...
    title: ‘Birthday…’
    creator: ‘Nicholas Witchell’
    theme: ‘royals’
    native: FALSE
    personalities: [‘Kate’, ‘Wills’, ‘George’]
_03.2 Going Incremental
Consolidation
Similarities between queries:
● Common levels of aggregation
● Repeated logic, like joins
Consolidate ad-hoc queries into a set of
derived tables.
These generalised tables can be used to
solve a variety of use cases.
Rows relative to raw events:
raw events: baseline
page views: 20%
sessions:   6%
users:      3%
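The consolidation into derived tables can be illustrated with a toy example. The sketch below uses Python's built-in `sqlite3` as a stand-in for a warehouse and a made-up eight-row events table; the table and column names are simplified assumptions, and the resulting row counts are specific to this toy data rather than the percentages quoted above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "create table events ("
    "event_id integer, event_name text, page_view_id text, "
    "session_id text, user_id text)"
)
con.executemany(
    "insert into events values (?, ?, ?, ?, ?)",
    [
        (1, "page_view", "pv1", "s1", "u1"),
        (2, "page_ping", "pv1", "s1", "u1"),
        (3, "page_ping", "pv1", "s1", "u1"),
        (4, "page_view", "pv2", "s1", "u1"),
        (5, "page_ping", "pv2", "s1", "u1"),
        (6, "page_view", "pv3", "s2", "u1"),
        (7, "page_view", "pv4", "s3", "u2"),
        (8, "page_ping", "pv4", "s3", "u2"),
    ],
)

# Each derived table is built from the one below it, not from raw events again.
con.execute(
    """create table page_views as
       select page_view_id, session_id, user_id,
              sum(event_name = 'page_ping') as heartbeats
       from events
       group by page_view_id, session_id, user_id"""
)
con.execute(
    """create table sessions as
       select session_id, user_id, count(*) as page_views
       from page_views
       group by session_id, user_id"""
)
con.execute(
    """create table users as
       select user_id, count(*) as sessions
       from sessions
       group by user_id"""
)

counts = {
    table: con.execute(f"select count(*) from {table}").fetchone()[0]
    for table in ("events", "page_views", "sessions", "users")
}
print(counts)
```

Repeated logic such as the heartbeat count lives in one place (`page_views`), and every table further up is smaller than the one it reads from.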
Value of a page or screen view
PAGE VIEW
HEARTBEAT
ENGAGEMENT
Example engagement metrics for three views:

time_engaged | scroll_depth(x) | scroll_depth(y) | clicks | shares
25s          | 100%            | 37%             | 1      | 0
15s          | 100%            | 20%             | 2      | 0
35s          | 100%            | 86%             | 0      | 1
One Step Further - Incremental Models
As the business continues to grow:
● The size of the events table grows.
● Behavioural data becomes more business critical and needs to be processed more regularly.
Meaning:
● Running the derived tables in a drop-and-recompute manner becomes increasingly costly and takes considerable time.
● Processing events in an incremental manner becomes a necessity.
Page views - Incremental

Drop & Recompute

-- page_views.sql
{{ config(
    materialized='table'
) }}

select
    page_view_id,
    ...
    row_number() over (
        partition by session_id
        order by derived_tstamp
    ) AS page_view_in_session_index
from {{ ref('events') }}
where event_name = 'page_view'

Incremental

-- page_views.sql
{{ config(
    materialized='incremental',
    unique_key='page_view_id'
) }}

-- sessions that received at least one page view since the last run
with sessions_with_new_events as (
    select distinct
        session_id
    from {{ ref('events') }}
    where event_name = 'page_view'
    {% if is_incremental() %}
        and derived_tstamp > (
            select max(derived_tstamp) from {{ this }}
        )
    {% endif %}
)

-- recompute every page view of those sessions so the index stays correct
select
    e.page_view_id,
    ...
    row_number() over (
        partition by e.session_id
        order by e.derived_tstamp
    ) AS page_view_in_session_index
from {{ ref('events') }} e
inner join sessions_with_new_events s
    on e.session_id = s.session_id
where e.event_name = 'page_view'
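The incremental pattern can be tried end to end with a small simulation. The sketch below uses Python's built-in `sqlite3` in place of dbt and a warehouse: `insert or replace` on a primary key stands in for dbt's `unique_key` handling, and the tables are reduced to the columns the logic needs. These substitutions are assumptions made so the example runs on its own (it needs a SQLite build with window function support, 3.25 or later).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    create table events (
        page_view_id text, session_id text,
        event_name text, derived_tstamp integer
    );
    create table page_views (
        page_view_id text primary key, session_id text,
        derived_tstamp integer, page_view_in_session_index integer
    );
    """
)

INCREMENTAL_SQL = """
with sessions_with_new_events as (
    select distinct session_id
    from events
    where event_name = 'page_view'
      and derived_tstamp > (
          select coalesce(max(derived_tstamp), -1) from page_views
      )
)
insert or replace into page_views
select
    e.page_view_id,
    e.session_id,
    e.derived_tstamp,
    row_number() over (
        partition by e.session_id order by e.derived_tstamp
    )
from events e
inner join sessions_with_new_events s on e.session_id = s.session_id
where e.event_name = 'page_view'
"""


def load(rows):
    con.executemany("insert into events values (?, ?, 'page_view', ?)", rows)


def run_model():
    con.execute(INCREMENTAL_SQL)
    return con.execute(
        "select page_view_id, page_view_in_session_index "
        "from page_views order by page_view_id"
    ).fetchall()


load([("pv1", "s1", 10), ("pv2", "s1", 20)])
first_run = run_model()

# New events arrive: one more view in s1 and a brand new session s2.
load([("pv3", "s1", 30), ("pv4", "s2", 35)])
second_run = run_model()
print(first_run)
print(second_run)
```

On the second run only sessions `s1` and `s2` are touched, yet all of `s1` is recomputed so that `pv3` receives index 3 rather than restarting at 1.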
Process the least data possible

-- page_views.sql
{{ config(
    materialized='incremental',
    unique_key='page_view_id'
) }}

with sessions_with_new_events as (
    select distinct
        session_id
    from {{ ref('events') }}
    where event_name = 'page_view'
    {% if is_incremental() %}
        and derived_tstamp > (
            select max(derived_tstamp) from {{ this }}
        )
    {% endif %}
)

Reduce the amount of data by:
● Filtering on the partition/sort keys of the source
Process the least data possible

-- page_views.sql
{{ config(
    materialized='incremental',
    unique_key='page_view_id'
) }}

with sessions_with_new_events as (
    select distinct
        session_id
    from {{ ref('events') }}
    where event_name = 'page_view'
    {% if is_incremental() %}
        and collector_tstamp > (
            select max(collector_tstamp)
            from {{ this }}
        )
    {% endif %}
)

Reduce the amount of data by:
● Filtering on the partition/sort keys of the source
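The switch from `derived_tstamp` to `collector_tstamp` matters when the source table is organised by `collector_tstamp`. Warehouse partitioning cannot be reproduced locally, so the sketch below uses a SQLite index as a loose analogy: the same range filter is planned very differently depending on whether the filtered column is the one the table is organised around. That `collector_tstamp` is the organising column is an assumption taken from the slide.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "create table events ("
    "session_id text, event_name text, "
    "derived_tstamp integer, collector_tstamp integer)"
)
# Stand-in for the warehouse partition/sort key: index collector_tstamp only.
con.execute("create index idx_collector on events (collector_tstamp)")


def plan(column: str) -> str:
    """Return SQLite's query plan for a range filter on the given column."""
    sql = f"explain query plan select session_id from events where {column} > ?"
    # The human-readable description is the fourth column of each plan row.
    return " | ".join(row[3] for row in con.execute(sql, (0,)))


collector_plan = plan("collector_tstamp")
derived_plan = plan("derived_tstamp")
print("collector_tstamp:", collector_plan)
print("derived_tstamp:  ", derived_plan)
```

The filter on `collector_tstamp` can use the index, while the filter on `derived_tstamp` has to read the whole table; in a partitioned warehouse the corresponding difference is between pruning partitions and scanning all of them.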
Process the least data possible

-- page_views.sql
{{ config(
    materialized='incremental',
    unique_key='page_view_id'
) }}

with sessions_with_new_events as (
    select ...
)

select
    e.page_view_id,
    ...
    row_number() over (
        partition by e.session_id
        order by e.derived_tstamp
    ) AS page_view_in_session_index
from {{ ref('events') }} e
inner join sessions_with_new_events s
    on e.session_id = s.session_id
where e.event_name = 'page_view'

Reduce the amount of data by:
● Filtering on the partition/sort keys of the source
● Restricting table scans on all source tables if possible
Process the least data possible

-- page_views.sql
{{ config(
    materialized='incremental',
    unique_key='page_view_id'
) }}

with sessions_with_new_events as (
    select ...
)

select
    e.page_view_id,
    ...
from {{ ref('events') }} e
inner join sessions_with_new_events s
    on e.session_id = s.session_id
where e.event_name = 'page_view'
-- limit table scans
{% if is_incremental() %}
    and e.collector_tstamp > (
        select dateadd(day, -3, max(collector_tstamp))
        from {{ this }}
    )
{% endif %}

Reduce the amount of data by:
● Filtering on the partition/sort keys of the source
● Restricting table scans on all source tables if possible
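The `dateadd(day, -3, ...)` bound is plain date arithmetic: scan only events collected within three days before the newest row already in the model. A small sketch with made-up timestamps follows; `scan_lower_bound` is a hypothetical helper and the three-day window is simply the value used in the query above.

```python
from datetime import datetime, timedelta


def scan_lower_bound(max_collector_tstamp: datetime, lookback_days: int = 3) -> datetime:
    """Earliest collector_tstamp that the incremental run still scans."""
    return max_collector_tstamp - timedelta(days=lookback_days)


# Newest collector_tstamp already present in the derived table (made up).
last_processed = datetime(2022, 11, 15, 12, 0)
bound = scan_lower_bound(last_processed)

events = [
    {"page_view_id": "pv1", "collector_tstamp": datetime(2022, 11, 1, 9, 0)},
    {"page_view_id": "pv2", "collector_tstamp": datetime(2022, 11, 13, 9, 0)},
    {"page_view_id": "pv3", "collector_tstamp": datetime(2022, 11, 15, 13, 0)},
]

scanned = [e["page_view_id"] for e in events if e["collector_tstamp"] > bound]
print(bound, scanned)
```

The window has to be at least as long as the longest session you expect to recompute; page views older than the bound are no longer read, so a session stretching further back would only be partially rebuilt.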
Process the least data possible

Incremental

-- page_views.sql
{{ config(
    materialized='incremental',
    unique_key='page_view_id'
) }}

with sessions_with_new_events as (
    select distinct
        session_id
    from {{ ref('events') }}
    where event_name = 'page_view'
    {% if is_incremental() %}
        and derived_tstamp > (
            select max(derived_tstamp) from {{ this }}
        )
    {% endif %}
)

Reduce the amount of data by:
● Filtering on the partition/sort keys of the source
● Restricting table scans on all source tables if possible
● Understanding your warehouse
Wrap up
Considered the Snowplow Tracking SDKs
- Why we hand craft them
- Differences in Server vs Client SDKs
How do we model billions of rows of atomic data?
- Incrementally!
- Aggregation brings many benefits for analysis
Configuration is painful for everyone
- Easy to get carried away with configurable trackers, but then Data Models need to support it too
Schema'd events make it easy for us to build type-safe objects in the Tracking SDKs
- Great for tracking engineers
Understand your full pipeline to extract the most
- How the data is tracked and processed impacts how you can model the data
