1
The fashion shopping future
Metail's Data Pipeline and
AWS
OCTOBER 2015
2
Introduction
• Introduction to Metail (from BD shiny)
• Architecture Overview
• Event Tracking and Collection
• Extract Transform and Load (ETL)
• Getting Insights
• Managing The Pipeline
3
The Metail Experience allows customers to…
Discover clothes on
your body shape
Create, save outfits
and share
Shop with
confidence of size
and fit
4
1.6m MeModels created
Size & scale
5
88 Countries
Size & scale
6
Architecture Overview
• Our architecture is modelled on Nathan Marz’s Lambda Architecture: http://lambda-architecture.net
7
Architecture Overview
• Our architecture is modelled on Nathan Marz’s Lambda Architecture: http://lambda-architecture.net
8
New Data and Collection
9
New Data and Collection
Batch Layer
10
New Data and Collection
Batch Layer
Serving Layer
11
Data Collection
• A significant part of our pipeline is powered by Snowplow: http://snowplowanalytics.com
• We use their technology for tracking and their setup for collection
– They have specified a tracking protocol and implemented it in many languages
– We’re using the JavaScript tracker
– The implementation is very similar to Google Analytics (GA): http://www.google.co.uk/analytics/
– But you have all the raw data
12
Data Collection
• Where does AWS come in?
– The Snowplow CloudFront collector: https://github.com/snowplow/snowplow/wiki/Setting-up-the-Cloudfront-collector
– Snowplow’s tracking pixel, a GIF called i, is uploaded to an S3 bucket
– CloudFront serves the content of the bucket
– To collect an event, the tracker performs a GET request for the pixel
– The query parameters of the GET request carry the event payload
– E.g. GET http://d2sgzneryst63x.cloudfront.net/i?e=pv&url=...&page=...&... (sketched below)
– CloudFront is configured for HTTP and HTTPS, allows only GET and HEAD, and has logging enabled
– The CloudFront requests, i.e. the events, are therefore logged to our S3 bucket
– In Lambda Architecture terms these CloudFront logs are our master record: the raw data
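As a rough illustration (not the Snowplow tracker’s own code), the sketch below hand-rolls the kind of pixel request the JavaScript tracker sends to the CloudFront collector. The collector domain is the one from the slide; e=pv is the protocol’s page-view event type, and the url/page values are hypothetical.

# Hedged sketch: build the kind of GET request the tracker fires at the
# CloudFront-hosted pixel. Page URL and title are hypothetical.
from urllib.parse import urlencode
from urllib.request import urlopen

COLLECTOR = "http://d2sgzneryst63x.cloudfront.net/i"   # pixel served by CloudFront

payload = {
    "e": "pv",                           # event type: page view
    "url": "https://example.com/dress",  # hypothetical page URL
    "page": "Summer dress",              # hypothetical page title
}

# The payload travels entirely in the query string; CloudFront logs the full
# request, and that log line later becomes the raw data for the batch layer.
urlopen(COLLECTOR + "?" + urlencode(payload)).read()   # response is just the 1x1 GIF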
13
Extract Transform and Load (ETL)
• This is the batch layer of our architecture
• Runs over the raw (and enriched) data producing (further) enriched data sets
• Implemented using MapReduce technologies:
– The Snowplow ETL is written in Scalding
– Scalding is Cascading (a higher-level Java MapReduce library) wrapped in Scala:
https://github.com/twitter/scalding + http://www.cascading.org/
– The resulting code looks like Scala plus Cascading
– The Metail ETL is written in Cascalog: http://cascalog.org
– Cascalog has been described as logic programming over Hadoop
– Cascading + Datalog = Cascalog
– Ridiculously compact and expressive – one of the steepest learning curves I’ve encountered in software engineering, but no hidden traps
– Both run on AWS’s Elastic MapReduce (EMR): https://aws.amazon.com/elasticmapreduce/
– AWS has done the hard/tedious work of deploying Hadoop to EC2
14
Extract Transform and Load (ETL)
• Snowplow’s ETL: https://github.com/snowplow/snowplow/wiki/setting-up-EmrEtlRunner
– The initial step is executed outside of EMR: data in the CloudFront log bucket is copied to another S3 bucket for processing
– Next, an EMR cluster is created
– Steps are then added to that cluster (sketched below)
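Snowplow’s EmrEtlRunner is a Ruby tool, so the following is only a minimal boto3 sketch of the same shape of workflow: stage the raw logs, then spin up a transient cluster with a step. The bucket names, JAR location and arguments are placeholders, not Snowplow’s real configuration.

# Hedged boto3 sketch of an EmrEtlRunner-style run; all names are hypothetical.
import boto3

s3 = boto3.client("s3")
emr = boto3.client("emr", region_name="eu-west-1")

# 1. Stage a raw CloudFront log into the processing bucket (one object shown).
s3.copy_object(
    Bucket="acme-snowplow-processing",
    Key="raw/2015-10-01.log.gz",
    CopySource={"Bucket": "acme-cloudfront-logs", "Key": "E123ABC.2015-10-01.log.gz"},
)

# 2. Create a transient EMR cluster with a single ETL step; the cluster
#    terminates itself once the steps have finished.
emr.run_job_flow(
    Name="snowplow-etl",
    ReleaseLabel="emr-4.1.0",
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "enrich",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "s3://acme-jars/snowplow-enrich.jar",  # placeholder JAR
            "Args": ["--input", "s3://acme-snowplow-processing/raw/",
                     "--output", "s3://acme-snowplow-enriched/"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)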
15
Extract Transform and Load (ETL)
• Metail’s ETL
– We run directly on the data in S3
– We store our JARs in S3 and have a process to deploy them
– We have several enrichment steps (one is sketched below)
– Our enrichment runs on Snowplow’s enriched events
– And enriches those events further
– This is what builds our batch views for the serving layer
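A hedged sketch of what submitting one such enrichment step to an already-running cluster could look like with boto3; the cluster id, the JAR name in S3 and the arguments are placeholders rather than Metail’s actual deployment process.

# Hedged sketch: submit an enrichment JAR stored in S3 as an extra step on a
# running EMR cluster; identifiers are hypothetical.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

emr.add_job_flow_steps(
    JobFlowId="j-1ABCDEF2GHIJ3",                                # placeholder cluster id
    Steps=[{
        "Name": "metail-enrichment",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "s3://acme-jars/metail-etl-standalone.jar",  # JAR deployed to S3
            # Read Snowplow's enriched events from S3 and write further-enriched
            # batch views straight back to S3 for the serving layer.
            "Args": ["--input", "s3://acme-snowplow-enriched/",
                     "--output", "s3://acme-batch-views/"],
        },
    }],
)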
16
Extract Transform and Load (ETL)
• EMR and S3 get on very well
– AWS have engineered S3 so that EMR can treat it much like a native HDFS file system with very little loss of performance
– They recommend using S3 as the permanent data store
– The EMR cluster’s HDFS file system is, in my mind, a giant /tmp
– This encourages immutable infrastructure
– You don’t need your compute cluster running just to hold your data
– Snowplow and Metail both output directly to S3
– The only reason Snowplow copies to local HDFS is that they’re aggregating the CloudFront logs
– That’s transitory data
– You can archive S3 data to Glacier (see the lifecycle sketch below)
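Archiving to Glacier is usually done with an S3 lifecycle rule rather than an explicit copy; here is a hedged boto3 sketch in which the bucket, prefix and retention period are made up for illustration.

# Hedged sketch: transition year-old raw logs to Glacier via a lifecycle rule.
# Bucket, prefix and timing are hypothetical.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="acme-cloudfront-logs",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-raw-logs",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 365, "StorageClass": "GLACIER"}],
        }],
    },
)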
17
Getting Insights
• The workhorse of Metail’s insights is Redshift: https://aws.amazon.com/redshift/
– I’d like it to be Cascalog but even I’d hate that :P
• Redshift is a “petabyte-scale data warehouse”
– Offers a Postgres-like SQL dialect to query the data
– Uses a columnar, distributed data store
– It’s very quick
– Currently we have a nine-node compute cluster (9 × 160 GB = 1.44 TB)
– Thinking of switching to dense storage nodes or re-architecting
– Growing at 10 GB a day
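Putting those two figures together: at 10 GB a day, a 1.44 TB (1,440 GB) cluster would fill from empty in roughly 144 days, which is why dense storage nodes or a re-architecture are on the table.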
18
Getting Insights
• Example: counting events per month, straight from the Snowplow events table
SELECT DATE_TRUNC('mon', collector_tstamp),
       COUNT(event_id)
FROM events
GROUP BY DATE_TRUNC('mon', collector_tstamp)
ORDER BY DATE_TRUNC('mon', collector_tstamp);
19
Getting Insights
• The Snowplow pipeline is set up with Redshift as an endpoint: https://github.com/snowplow/snowplow/wiki/setting-up-redshift
• The Snowplow events table is loaded into Redshift directly from S3
• The events we enrich in EMR are also loaded into Redshift, again directly from S3 (a COPY sketch follows)
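Loading from S3 into Redshift happens via the COPY command; Snowplow ships its own tooling for this step, so the snippet below is only a hedged sketch issued through Python’s psycopg2 driver, with placeholder table, bucket, IAM role and connection details.

# Hedged sketch: bulk-load enriched events from S3 into Redshift with COPY,
# using a PostgreSQL driver. Table, bucket and role names are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="acme-dw.abc123.eu-west-1.redshift.amazonaws.com",  # placeholder cluster
    port=5439, dbname="snowplow", user="loader", password="...",
)

copy_sql = """
    COPY atomic.events
    FROM 's3://acme-snowplow-enriched/run=2015-10-01/'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopy'
    DELIMITER '\\t' GZIP MAXERROR 10;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Redshift pulls the files from S3 in parallel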
20
Getting Insights
• The analysis of this data is done using a combination of…
• … a technology called Looker …
– This provides a powerful Excel-like interface to the data
– While also providing software-engineering tools to manage the SQL used to explore the data
• … and R for the heavier stats
– Starting to interface directly with Redshift through a PostgreSQL driver (sketched below)
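The slide refers to R’s PostgreSQL driver; purely to keep the sketches in this write-up in one language, the same idea (connect to Redshift over the Postgres wire protocol and pull back a result set) is shown with Python’s psycopg2 instead, reusing the monthly event count query from the earlier slide. Connection details are placeholders.

# Hedged sketch: Redshift speaks the Postgres wire protocol, so any PostgreSQL
# driver can run ad-hoc analysis queries against it. Details are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="acme-dw.abc123.eu-west-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="snowplow", user="analyst", password="...",
)

with conn.cursor() as cur:
    cur.execute("""
        SELECT DATE_TRUNC('mon', collector_tstamp) AS month,
               COUNT(event_id)                     AS events
        FROM events
        GROUP BY 1
        ORDER BY 1;
    """)
    for month, events in cur.fetchall():
        print(month, events)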
21
Managing the Pipeline
• I’ve almost certainly run out of time and not reached this slide
• Lemur is used to submit ad-hoc Cascalog jobs
– The initial, manual pipeline
– Clojure-based
• Snowplow have written their configuration tools in Ruby and bash
• We use AWS’s Data Pipeline: https://aws.amazon.com/datapipeline/
– More flaws than advantages