Data Engineering at Udemy
Keeyong Han
Principal Data Architect @Udemy
Ankara Data Meetup - Bilkent Cyberpark,
August 5, 2015
About Me
• 20+ years of experience across 9 different companies
• Currently managing the Data team at Udemy
• Prior to joining Udemy
– Manager of the data/search team at Polyvore
– Director of Engineering at Yahoo Search
– Started my career at Samsung Electronics in Korea
Agenda
• Typical Evolution of Data Processing
• Data Engineering at Udemy
• Lessons Learned
TYPICAL EVOLUTION OF
DATA PROCESSING
From a small start-up
In the beginning
• You don’t have any data
• So there is no data infrastructure and no data science
– The most important thing is to survive and keep iterating
After a struggle you have some data
• You survived, and now you have some data to work with
– Data analysts are hired
– They want to analyze the data
Then …
• You don’t know where the data is, exactly
• You find your data, but
– It is not clean and is missing key information
– It is likely not in the format you want
• You store it in suboptimal storage
– MySQL is likely used to store all kinds of data
• But MySQL doesn’t scale
– You ask analysts to query MySQL
• They will take down the web site a few times
Now what to do? (I)
• You have to find a scalable, separate store for data analysis
– This is called a Data Warehouse (or Data Analytics backend)
– This will be the central storage for your important data
– Udemy uses AWS Redshift
• Migrate some data out of MySQL
– Key/Value data to a NoSQL solution (Cassandra/HBase, MongoDB, …)
– Log-type data (the Nginx access log, for example; a parsing sketch follows)
– MySQL should only hold the core data the Web service needs
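As a concrete illustration of what "log-type data" means here, this is a minimal sketch of parsing one line of an Nginx access log in its common combined-style format. The regex and field names are illustrative assumptions, not Udemy's actual parser.

```python
import re

# Matches the common Nginx/Apache access-log layout; adjust to your log_format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample = '1.2.3.4 - - [05/Aug/2015:10:00:00 +0300] "GET /course/python HTTP/1.1" 200 5120'
print(parse_line(sample))
```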
Now what to do? (II)
• The goal is to put all data into a single store
– This is the most important and the very first step toward becoming a “true” data organization
– This store should be separate from the runtime storage (MySQL, for example)
– This store should be scalable
– Being consistent is more important than being correct in the beginning
Now You Add More Data
• Different Ways of Collecting Data
– This is called ETL (Extract, Transform and Load)
– Different aspects to consider
• Size: 1KB to 20GB
• Frequency: hourly, daily, weekly, monthly
• How to collect the data:
– FTP, API, Webhook, S3, HTTP, the MySQL command line
• You will have multiple data collection workflows
– Use a cronjob (or some scheduler) to manage them; see the sketch below
– Udemy uses Pinball (open-sourced by Pinterest)
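A sketch of what one such workflow can look like before you outgrow cron: a self-contained Python job that cron runs daily. The script name, source, and schedule are all hypothetical.

```python
#!/usr/bin/env python
# etl_ads_report.py -- hypothetical daily ETL job; scheduled with cron, e.g.:
#   0 3 * * * /usr/bin/python /opt/etl/etl_ads_report.py >> /var/log/etl.log 2>&1
import datetime

def extract():
    """Fetch yesterday's report from an external source (stubbed out here)."""
    day = datetime.date.today() - datetime.timedelta(days=1)
    return [{"date": str(day), "campaign": "demo", "clicks": "42"}]

def transform(rows):
    """Normalize records into the schema the warehouse expects."""
    return [(r["date"], r["campaign"], int(r["clicks"])) for r in rows]

def load(rows):
    """Append into the data warehouse (stubbed; e.g. an INSERT or an S3 COPY)."""
    for row in rows:
        print(row)

if __name__ == "__main__":
    load(transform(extract()))
```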
What It Will Look Like
[Architecture diagram: your web service produces Log Files, MySQL data, and Key/Value data; ETL pipelines move these, together with External Data Sources, into the Data Warehouse]
Simple Data Import
• Just use a scripting language
– Many data sources are small and simple enough to handle with a scripting language
• Udemy uses Python for this purpose
– Implemented a set of Python classes to handle different types of data import (a sketch of the idea follows)
– Plan to open-source this in the 1st half of 2016
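Udemy's library was not public at the time of this talk, so the following is only a guess at the shape such import classes take: a base class fixing the extract/transform/load steps, with one subclass per source type. All names here are invented.

```python
# Invented class and method names; a sketch, not Udemy's (unreleased) library.
class BaseImporter(object):
    """Fix the extract -> transform -> load skeleton for one data source."""

    def fetch(self):
        raise NotImplementedError

    def transform(self, record):
        return record  # identity by default; subclasses override as needed

    def load(self, records):
        raise NotImplementedError

    def run(self):
        self.load([self.transform(r) for r in self.fetch()])


class CsvOverHttpImporter(BaseImporter):
    """One importer per source type, e.g. a CSV report fetched over HTTP."""

    def __init__(self, url):
        self.url = url

    def fetch(self):
        import csv
        import urllib2  # Python 2, contemporary with this talk
        return list(csv.DictReader(urllib2.urlopen(self.url)))

    def load(self, records):
        print("loaded %d records" % len(records))  # stub for a warehouse write
```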
Large Data Batch Import
• Large data import and processing will require a more scalable solution
• Hadoop can be used for this purpose
– SQL on Hadoop: Hive, Tajo, Presto and so on (see the sketch below)
– Pig, Java MapReduce
• Spark is getting a lot of attention and we plan to evaluate it
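One common pattern for "SQL on Hadoop" in a Python-driven pipeline is simply shelling out to the Hive CLI. The table and column names below are made up; `hive -e` is the standard way to run an inline query.

```python
import subprocess

# Hypothetical daily aggregation; `hive -e` runs the inline query on the cluster.
HIVE_QUERY = """
INSERT OVERWRITE DIRECTORY '/tmp/daily_course_views'
SELECT course_id, COUNT(*) AS views
FROM raw_access_log
WHERE dt = '2015-08-04'
GROUP BY course_id;
"""

subprocess.check_call(["hive", "-e", HIVE_QUERY])
```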
Realtime Data Import
• Some data is better imported as it happens
• This requires a different type of technology
– Realtime data message queue: Kafka, Kinesis
– Realtime data consumer: Storm, Samza, Spark Streaming
What’s Next? (I)
• Build summary tables
– Having raw data tables is good, but they can be too detailed and too voluminous to query directly
– Build these tables in your Data Warehouse (a sketch follows)
• Track the performance of key metrics
– These should come from the summary tables above
– You need a dashboard tool (build one or use a 3rd-party solution: Birst, ChartIO, Tableau and so on)
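A sketch of building one such summary table in Redshift from Python via psycopg2 (Redshift speaks the PostgreSQL wire protocol). The table, columns, host, and credentials are all placeholders.

```python
import psycopg2  # Redshift is reachable with a standard PostgreSQL driver

# Hypothetical daily summary; rebuild it from the raw enrollments table.
SUMMARY_SQL = """
DROP TABLE IF EXISTS daily_enrollments_summary;
CREATE TABLE daily_enrollments_summary AS
SELECT enrollment_date::date AS day,
       course_id,
       COUNT(*) AS enrollments
FROM enrollments
GROUP BY 1, 2;
"""

conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com",
                        port=5439, dbname="analytics",
                        user="etl_user", password="...")
cur = conn.cursor()
cur.execute(SUMMARY_SQL)
conn.commit()
```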
What’s Next? (II)
• Provide this data to the Data Science team
– Draw insights and create a feedback loop
– Build machine-learned models for recommendation, search ranking and so on
– The topic of the next session (Thanks Larry!)
• Supporting Data Science from the infrastructure side
– This will require scalable infrastructure
– Example: scoring every user/course pair at Udemy
• 7M users × 30K courses = 210B pairs to compute
– You need a scalable serving layer (Cassandra, HBase, …); see the sketch below
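The usual trick is not to materialize all 210B scores: score the courses for one user at a time, keep only the top N, and write just those into the serving layer. A sketch, with a stand-in scoring function and a hypothetical Cassandra schema:

```python
import heapq

TOP_N = 10  # keep only the best N courses per user

def score(user_id, course_id):
    """Stand-in for a real model; replace with actual feature-based scoring."""
    return hash((user_id, course_id)) % 100

def top_courses(user_id, course_ids):
    return heapq.nlargest(TOP_N, course_ids,
                          key=lambda c: score(user_id, c))

# The results would then go into the serving layer, e.g. with cassandra-driver:
# from cassandra.cluster import Cluster
# session = Cluster(["cassandra-host"]).connect("recs")
# session.execute("INSERT INTO user_recs (user_id, courses) VALUES (%s, %s)",
#                 (user_id, top_courses(user_id, all_course_ids)))
```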
DATA ENGINEERING AT UDEMY
Data Warehouse/Analytics
• We use AWS Redshift as our data warehouse (or data analytics backend)
• What is AWS Redshift?
– A scalable PostgreSQL-compatible engine, up to 1.6PB of data
– Roughly 600 USD per TB per month
– Mainly for offline batch processing
– Supports bulk loading (through AWS S3; see the sketch below)
– Two types of node options: compute-optimized vs. storage-optimized
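The S3 bulk-load path is Redshift's `COPY` command. A sketch from Python, with placeholder bucket, table, and credentials:

```python
import psycopg2

# COPY is Redshift's efficient bulk-load path from S3; all names are placeholders.
COPY_SQL = r"""
COPY web_access_log
FROM 's3://example-bucket/logs/2015-08-04/'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
GZIP
DELIMITER '\t';
"""

conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com",
                        port=5439, dbname="analytics",
                        user="etl_user", password="...")
cur = conn.cursor()
cur.execute(COPY_SQL)
conn.commit()
```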
Kinds of Data Stored in Redshift
• 800+ tables with 2.4TB of data
• Key tables from MySQL
• Email Marketing Data
• Ads Campaign Performance Data
• SEO data from Google
• Data from Web access log
• Support Ticket Data
• A/B Test Data (Mobile, Web)
• Human-curated data from Google Spreadsheets
Details on ETL Pipelines
• All data pipelines are scheduled through Pinball
– Every 5 minutes, hourly, daily, weekly and monthly
• Most pipelines are purely in Python
• Some use Hadoop/Hive and Hadoop/Pig for batch processing
• Starting to use Kinesis for realtime processing
Pinball Screenshot
Batch Processing Infrastructure
• We use Hadoop 2.6 with Hive and Pig
– CDH 5.4 (community version)
• We use our own Hadoop cluster and AWS EMR (Elastic MapReduce) at the same time
– This is used to do ETL on massive data
– It is also used to run massive model/scoring pipelines for the Data Science team
• Plan to evaluate Spark as a potential alternative
Realtime Processing
• Applications
– The first application is processing the web access log
– Eventually we plan to use this to generate personalized recommendations on the fly
• Plan to use AWS Kinesis
– Evaluated Apache Kafka and AWS Kinesis
• They are very similar, but Kafka is open source while Kinesis is a managed service from AWS
• Decided to use AWS Kinesis
• Plan to evaluate realtime consumers
– Samza, Storm, Spark Streaming
What is Kinesis (Kafka)?
• A realtime data processing service in AWS
– A publisher-subscriber message broker
– Very similar to Kafka
• It has two components
– One is the message queue, where the stream of data is stored
• 24-hour retention period
• You pay hourly, based on the read/write rate of the queue
– The other is the KCL (Kinesis Client Library)
• Used to build Data Producer and Data Consumer applications (a producer sketch follows)
• Can be combined with Storm, Spark Streaming, …
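A minimal producer sketch using boto3, the AWS SDK for Python; the stream name and payload are placeholders.

```python
import json
import boto3  # AWS SDK for Python

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Push one event onto a stream; stream name and payload are placeholders.
kinesis.put_record(
    StreamName="web-access-log",
    Data=json.dumps({"path": "/course/python", "status": 200}),
    PartitionKey="user-12345",  # determines which shard receives the record
)
```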
Data Serving Layer
• Redshift isn’t a good fit for reading data back in a realtime fashion, so you need something else
• We are using (or plan to use) the following; see the Redis sketch below:
– Cassandra
– Redis
– ElasticSearch
– MySQL
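As one example of the serving-layer pattern, here is a sketch with Redis: the offline pipeline precomputes per-user results and the web service reads them back with a cheap key lookup. Key names and values are illustrative.

```python
import json
import redis  # redis-py

r = redis.StrictRedis(host="redis-host", port=6379)

# The offline pipeline precomputes per-user recommendations and writes them in:
r.set("recs:user:12345", json.dumps([101, 202, 303]), ex=86400)  # 1-day TTL

# The web service serves them with a cheap key lookup at request time:
recs = json.loads(r.get("recs:user:12345") or "[]")
print(recs)
```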
What It Looks Like
[Architecture diagram: Udemy's web service produces Log Files (Nginx), MySQL data, and Key/Value data (Cassandra); ETL pipelines feed these, together with External Data Sources, into the Data Warehouse (Redshift), which connects to the Data Science Pipelines]
LESSONS LEARNED
• As a small start-up, survive first and then work on data
• The starting point is to store all data in a single location (a data warehouse)
• Start with batch processing, then move to realtime
• Consider the type of data you store
– Log vs. Key/Value vs. Transactional Record
• Store data in the form of a log (change history)
– So that you can always go back and debug/replay
• Cloud is good unless you have really massive data
Q & A
Udemy is Hiring
Editor's Notes
  • #8: Logging format. Don't try to take a snapshot and do aggregation.
  • #10, #11, #13: Add diagram. MySQL was likely used to store all data and queried directly by data analysts. Without a central warehouse, everyone does their own analysis and derives their own conclusions: a waste of resources on many one-off efforts.
  • #16: Realtime recommendation.