Snowplow Analytics – From NoSQL to SQL and back
London NoSQL, 17th November 2014
Introducing myself
• Alex Dean
• Co-founder and technical lead at Snowplow, the open-source event analytics platform based here in London [1]
• Weekend writer of Unified Log Processing, available on the Manning Early Access Program [2]

[1] https://github.com/snowplow/snowplow
[2] http://manning.com/dean
So what’s Snowplow?
Snowplow is an event analytics platform:
• Collect event data
• Warehouse event data in a data warehouse
• Publish event data to a unified log
• Perform the high-value analyses that drive the bottom line
• Act on your data in real time
Snowplow was created as a response to the limitations of traditional web analytics programs:

Data collection
• Sample-based (e.g. Google Analytics)
• Limited set of events, e.g. page views, goals, transactions
• Limited set of ways of describing events (custom dim 1, custom dim 2…)

Data processing
• Data is processed ‘once’
• No validation
• No opportunity to reprocess, e.g. following an update to business rules
• Data is aggregated prematurely

Data access
• Only particular combinations of metrics / dimensions can be pivoted together (Google Analytics)
• Only particular types of analysis are possible on different types of dimension (e.g. sProps, eVars, conversion goals in SiteCatalyst)
• Data is either aggregated (e.g. Google Analytics), or available as a complete log file for a fee (e.g. Adobe SiteCatalyst)
• As a result, data is siloed: hard to join with other data sets
We took a fresh approach to digital analytics
• Other vendors tell you what to do with your data
• We give you your data so you can do whatever you want
How do users leverage their Snowplow event warehouse?
The event warehouse enables agile (aka ad hoc) analytics, which in turn enables:
• Marketing attribution modelling
• Customer lifetime value calculations
• Customer churn detection
• RTB fraud detection
• Product recommendations
Early on, we made a crucial decision: Snowplow should be composed of a set of loosely coupled subsystems

1. Trackers – generate event data from any environment
2. Collectors – log raw events from trackers
3. Enrich – validate and enrich raw events
4. Storage – store enriched events ready for analysis
5. Analytics – analyze enriched events

The subsystems communicate via standardised data protocols (the A–D interfaces above). These turned out to be critical to allowing us to evolve the above stack.
Our data storage journey: starting with NoSQL
Our initial skunkworks version of Snowplow used Amazon S3 to store events, and then Hive to query them

Snowplow data pipeline v1: website / webapp → JavaScript event tracker → CloudFront-based pixel collector → HiveQL + Java UDF “ETL” → Amazon S3

• Batch-based
• Normally run overnight; sometimes every 4–6 hours
We used a sparsely populated, de-normalized “fat table” approach for our events stored in Amazon S3 (a sketch of the idea follows)
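Below is a minimal HiveQL sketch of that idea, not the real Snowplow table definition: the column names and S3 bucket are illustrative, and the real events table is far wider, with most columns NULL for any given event.

-- Illustrative HiveQL: a sparsely populated, de-normalized "fat" events table
-- (hypothetical column subset; the real Snowplow table has many more columns)
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  event_id          STRING,
  collector_tstamp  STRING,
  event             STRING,   -- e.g. 'page_view', 'transaction'
  user_id           STRING,
  page_url          STRING,   -- populated for page views, NULL otherwise
  tr_orderid        STRING,   -- populated for transactions, NULL otherwise
  tr_total          DOUBLE    -- populated for transactions, NULL otherwise
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-snowplow-events/';   -- hypothetical bucket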
This got us started, but “time to report” was frustratingly slow for business analysts

Typical questions from the business:
• How many unique visitors did we have in October?
• What’s our average order value this year?
• What royalty payments should we invoice for this month?

Answering any of them against the events in Amazon S3 meant:
• Spin up a transient EMR cluster
• Log in to the master node via SSH
• Write a HiveQL query, or adapt one from our cookbook of recipes (an example follows this list)
• Hive kicks off a MapReduce job
• The MapReduce job reads the events stored in S3 (slower than direct HDFS access)
• The result is printed out in the SSH terminal
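As an illustration, the first question above might be answered with a HiveQL query along these lines, using the hypothetical columns from the fat-table sketch earlier:

-- Hypothetical HiveQL: unique visitors in October 2014
SELECT COUNT(DISTINCT user_id) AS unique_visitors
FROM events
WHERE collector_tstamp >= '2014-10-01'
  AND collector_tstamp <  '2014-11-01';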
From NoSQL to high-performance SQL
So we extended Snowplow to support columnar databases – after a first fling with Infobright, we integrated Amazon Redshift*

Website, server, application or mobile app → Snowplow event tracking SDK → HTTP-based event collector → Hadoop-based enrichment → Amazon S3 → Amazon Redshift (previously Infobright)

* For small users we also added PostgreSQL support, because Redshift and PostgreSQL have extremely similar APIs
Our existing sparsely populated, de-normalized “fat tables” turned out to be a great fit for columnar storage
• In columnar databases, compression is done on individual columns across many different rows, so the wide rows don’t have a negative impact on storage/compression
• Having all the potential events de-normalized in a single fat row meant we didn’t need to worry about JOIN performance in Redshift
• The main downside was the brittleness of the events table:
  1. We found ourselves regularly ALTERing the table to add new event types (see the sketch after this list)
  2. Snowplow users and customers ended up with customized versions of the events table to meet their own requirements
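As a sketch of that brittleness: supporting a new event type meant DDL changes along these lines, with each new column left mostly NULL for everyone else (the column names here are hypothetical):

-- Hypothetical Redshift DDL: every new event type meant more ALTERs
ALTER TABLE atomic.events ADD COLUMN wishlist_item_sku VARCHAR(255);
ALTER TABLE atomic.events ADD COLUMN wishlist_item_qty INTEGER;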
We experimented with Redshift JOINs and found they could be performant
• As long as two tables in Redshift have the same DISTKEY (for sharding data around the cluster) and SORTKEY (for sorting the rows on disk), Redshift JOINs can be performant – sketched below
• Yes, even mega-to-huge joins!
• This led us to a new relational architecture:
  • A parent table, atomic.events, containing our old legacy “full-fat” definition
  • Child tables containing individual JSONs representing new event types, or bundles of context describing the event
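A minimal Redshift sketch of the pattern: the parent and a child table share DISTKEY and SORTKEY, so the JOIN is co-located on each node. Only atomic.events is a real Snowplow name; the child table and its columns are illustrative.

-- Parent: the legacy "full-fat" events table (column subset shown)
CREATE TABLE atomic.events (
  event_id          CHAR(36)     NOT NULL,
  collector_tstamp  TIMESTAMP    NOT NULL,
  event             VARCHAR(128)
)
DISTKEY (event_id)
SORTKEY (collector_tstamp);

-- Child: one shredded JSON type gets its own narrow table,
-- distributed and sorted the same way as the parent (hypothetical)
CREATE TABLE atomic.com_acme_add_to_basket_1 (
  root_id      CHAR(36)   NOT NULL,   -- joins back to events.event_id
  root_tstamp  TIMESTAMP  NOT NULL,
  sku          VARCHAR(255),
  quantity     INTEGER
)
DISTKEY (root_id)
SORTKEY (root_tstamp);

-- The JOIN can then stay local to each slice
SELECT e.event_id, b.sku, b.quantity
FROM atomic.events e
JOIN atomic.com_acme_add_to_basket_1 b
  ON  b.root_id     = e.event_id
  AND b.root_tstamp = e.collector_tstamp;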
Our new relational approach for Redshift
• A typical Snowplow deployment in Redshift now consists of the atomic.events parent table plus a set of child tables
• In fact, the first thing a Snowplow analyst often does is “re-build” a company-specific “full-fat” table as a SQL view, by JOINing all their child tables back in (see the sketch below)
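A hedged sketch of such a view, reusing the illustrative child table from the previous example; the acme schema and column aliases are hypothetical:

-- Hypothetical company-specific "full-fat" view
CREATE VIEW acme.events_full_fat AS
SELECT
  e.*,
  b.sku      AS add_to_basket_sku,
  b.quantity AS add_to_basket_quantity
FROM atomic.events e
LEFT JOIN atomic.com_acme_add_to_basket_1 b
  ON  b.root_id     = e.event_id
  AND b.root_tstamp = e.collector_tstamp;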
We built a custom process to perform safe shredding of JSONs into dedicated Redshift tables
This is working well – but there is a lot of room for improvement
• Our shredding process is closely tied to Redshift’s innovative COPY FROM JSON functionality (example below):
  • This is Redshift-specific – so we can’t extend our shredding process to other columnar databases, e.g. Vertica or Netezza
  • The syntax doesn’t support nested shredding – which would allow us to e.g. intelligently shred an order into line items, products, customer etc.
  • We have to maintain copies of the JSONPaths files required by COPY FROM JSON in all AWS regions
• So, we plan to port the Redshift-specific aspects of our shredding process out of COPY FROM JSON into Snowplow and Iglu
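For reference, a Redshift load of shredded JSONs looks roughly like this; the S3 paths and credentials are placeholders, not real Snowplow locations:

-- Illustrative Redshift COPY of shredded JSONs into a child table
COPY atomic.com_acme_add_to_basket_1
FROM 's3://my-shredded-events/run=2014-11-17/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
FORMAT AS JSON 's3://my-jsonpaths/com.acme/add_to_basket_1.json'
REGION 'eu-west-1';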
Our data storage journey: to a mixed SQL / NoSQL model
Snowplow is re-architecting around the unified log

[Architecture diagram: narrow data silos – CMS, CRM, e-comm, ERP and search silos in the cloud vendor / own data center, plus SaaS vendors such as email marketing – feed a unified log via streaming APIs / web hooks. The unified log gives low-latency, wide-data coverage but only a few days’ data history; Hadoop archiving is high-latency but offers wide data coverage and full data history. Some low-latency local loops serve systems monitoring, product recommendations, ad hoc analytics, management reporting, fraud detection, churn prevention and APIs off the eventstream.]
The unified log is Amazon Kinesis, or Apache Kafka

[Same architecture diagram as the previous slide, with the unified log implemented by Kinesis or Kafka]

• Amazon Kinesis: a hosted AWS service with extremely similar semantics to Kafka
• Apache Kafka: an append-only, distributed, ordered commit log, developed at LinkedIn to serve as their organization’s unified log
“Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization” [1]

[1] http://kafka.apache.org/
“if you squint a bit, you can see the whole of your organization's systems and data flows as a single distributed database. You can view all the individual query-oriented systems (Redis, SOLR, Hive tables, and so on) as just particular indexes on your data.” [1]

[1] http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
In a unified log world, Snowplow will be feeding a mix of different SQL, NoSQL and stream databases

[Pipeline diagram: Snowplow Trackers → Scala Stream Collector → raw event stream → Enrich Kinesis app (with a separate bad raw events stream) → enriched event stream → sink Kinesis apps for S3, Redshift and Elasticsearch, plus an event aggregator Kinesis app feeding DynamoDB; some of these sinks are not yet released]

• Analytics on Read: for agile exploration of the event stream, ML, auditing, applying alternate models, reprocessing etc.
• Analytics on Write: for dashboarding, audience segmentation, RTB, etc.
We have already experimented with Neo4J for customer flow/path analysis [1]

[1] http://snowplowanalytics.com/blog/2014/07/31/using-graph-databases-to-perform-pathing-analysis-initial-experimentation-with-neo4j/
During our current work integrating Elasticsearch we discovered that common “NoSQL” databases need schemas too
• A simple example of schemas in Elasticsearch:

$ curl -XPUT 'http://localhost:9200/blog/contra/4' -d '{"t": ["u", 999]}'
{"_index":"blog","_type":"contra","_id":"4","_version":1,"created":true}

$ curl -XPUT 'http://localhost:9200/blog/contra/4' -d '{"p": [11, "q"]}'
{"error":"MapperParsingException[failed to parse [p]]; nested: NumberFormatException[For input string: "q"]; ","status":400}

• Elasticsearch is doing automated “shredding” of incoming JSONs to index that data in Lucene
We are now working on our second shredder
• Our Elasticsearch loader contains code to shred our events’ heterogeneous JSON arrays and dictionaries into a format that is compatible with Elasticsearch
• This is conceptually a much simpler shredder than the one we had to build for Redshift
• When we add Google BigQuery support, we will need to write yet another shredder to handle the specifics of that data store
• Hopefully we can unify and generalize our shredding technology so it works across columnar, relational, document and graph databases – a big undertaking but super powerful!
Questions?

Discount code: ulogprugcf (43% off the Unified Log Processing eBook)

http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata

To meet up or chat: @alexcrdean on Twitter or alex@snowplowanalytics.com
