Data Engineering at Udemy
Keeyong Han
Principal Data Architect @Udemy
Ankara Data Meetup - Bilkent Cyberpark,
August 5, 2015
About Me
• 20+ years of experience across 9 different companies
• Currently managing the Data team at Udemy
• Prior to joining Udemy
– Manager of the data/search team at Polyvore
– Director of Engineering at Yahoo Search
– Started my career at Samsung Electronics in Korea
Agenda
• Typical Evolution of Data Processing
• Data Engineering at Udemy
• Lessons Learned
TYPICAL EVOLUTION OF
DATA PROCESSING
From a small start-up
In the beginning
• You don’t have any data
• So there is no data infrastructure and no data science
– The most important thing is to survive and keep iterating
After a struggle you have some data
• You survived, and now you have some data to work with
– Data analysts are hired
– They want to analyze the data
Then …
• You don’t know where the data is, exactly
• You find your data, but
– It is not clean and is missing key information
– It is likely not in the format you want
• You store it in suboptimal storage
– MySQL is likely used to store all kinds of data
• But MySQL doesn’t scale
– You ask analysts to query MySQL
• They will take down the web site a few times
Now what to do? (I)
• You have to find a scalable, separate store for data analysis
– This is called a Data Warehouse (or Data Analytics backend)
– This will be the central storage for your important data
– Udemy uses AWS Redshift
• Migrate some data out of MySQL
– Key/Value data to a NoSQL solution (Cassandra/HBase, MongoDB, …)
– Log-type data (the Nginx access log, for example; a parsing sketch follows)
– MySQL should only hold the core data the Web service needs
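As a concrete illustration of what "log-type data" means here, this is a minimal sketch of parsing one line of an Nginx access log in its common combined-style format. The regex and field names are illustrative assumptions, not Udemy's actual parser.

```python
import re

# Matches the common Nginx/Apache access-log layout; adjust to your log_format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample = '1.2.3.4 - - [05/Aug/2015:10:00:00 +0300] "GET /course/python HTTP/1.1" 200 5120'
print(parse_line(sample))
```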
Now what to do? (II)
• The goal is to put all data into a single store
– This is the most important and the very first step toward becoming a “true” data organization
– This store should be separate from the runtime storage (MySQL, for example)
– This store should be scalable
– Being consistent is more important than being correct in the beginning
Now You Add More Data
• Different Ways of Collecting Data
– This is called ETL (Extract, Transform and Load)
– Different aspects to consider
• Size: 1KB to 20GB
• Frequency: hourly, daily, weekly, monthly
• How to collect the data:
– FTP, API, Webhook, S3, HTTP, the MySQL command line
• You will have multiple data collection workflows
– Use a cronjob (or some scheduler) to manage them; see the sketch below
– Udemy uses Pinball (open-sourced by Pinterest)
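A sketch of what one such workflow can look like before you outgrow cron: a self-contained Python job that cron runs daily. The script name, source, and schedule are all hypothetical.

```python
#!/usr/bin/env python
# etl_ads_report.py -- hypothetical daily ETL job; scheduled with cron, e.g.:
#   0 3 * * * /usr/bin/python /opt/etl/etl_ads_report.py >> /var/log/etl.log 2>&1
import datetime

def extract():
    """Fetch yesterday's report from an external source (stubbed out here)."""
    day = datetime.date.today() - datetime.timedelta(days=1)
    return [{"date": str(day), "campaign": "demo", "clicks": "42"}]

def transform(rows):
    """Normalize records into the schema the warehouse expects."""
    return [(r["date"], r["campaign"], int(r["clicks"])) for r in rows]

def load(rows):
    """Append into the data warehouse (stubbed; e.g. an INSERT or an S3 COPY)."""
    for row in rows:
        print(row)

if __name__ == "__main__":
    load(transform(extract()))
```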
What It Will Look Like
[Architecture diagram: your web service produces Log Files, MySQL data, and Key/Value data; ETL pipelines move these, together with External Data Sources, into the Data Warehouse]
Simple Data Import
• Just use a scripting language
– Many data sources are small and simple enough to handle with a scripting language
• Udemy uses Python for this purpose
– Implemented a set of Python classes to handle different types of data import (a sketch of the idea follows)
– Plan to open-source this in the 1st half of 2016
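Udemy's library was not public at the time of this talk, so the following is only a guess at the shape such import classes take: a base class fixing the extract/transform/load steps, with one subclass per source type. All names here are invented.

```python
# Invented class and method names; a sketch, not Udemy's (unreleased) library.
class BaseImporter(object):
    """Fix the extract -> transform -> load skeleton for one data source."""

    def fetch(self):
        raise NotImplementedError

    def transform(self, record):
        return record  # identity by default; subclasses override as needed

    def load(self, records):
        raise NotImplementedError

    def run(self):
        self.load([self.transform(r) for r in self.fetch()])


class CsvOverHttpImporter(BaseImporter):
    """One importer per source type, e.g. a CSV report fetched over HTTP."""

    def __init__(self, url):
        self.url = url

    def fetch(self):
        import csv
        import urllib2  # Python 2, contemporary with this talk
        return list(csv.DictReader(urllib2.urlopen(self.url)))

    def load(self, records):
        print("loaded %d records" % len(records))  # stub for a warehouse write
```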
Large Data Batch Import
• Large data import and processing will require a more scalable solution
• Hadoop can be used for this purpose
– SQL on Hadoop: Hive, Tajo, Presto and so on (see the sketch below)
– Pig, Java MapReduce
• Spark is getting a lot of attention and we plan to evaluate it
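One common pattern for "SQL on Hadoop" in a Python-driven pipeline is simply shelling out to the Hive CLI. The table and column names below are made up; `hive -e` is the standard way to run an inline query.

```python
import subprocess

# Hypothetical daily aggregation; `hive -e` runs the inline query on the cluster.
HIVE_QUERY = """
INSERT OVERWRITE DIRECTORY '/tmp/daily_course_views'
SELECT course_id, COUNT(*) AS views
FROM raw_access_log
WHERE dt = '2015-08-04'
GROUP BY course_id;
"""

subprocess.check_call(["hive", "-e", HIVE_QUERY])
```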
Realtime Data Import
• Some data is better imported as it happens
• This requires a different type of technology
– Realtime data message queue: Kafka, Kinesis
– Realtime data consumer: Storm, Samza, Spark Streaming
What’s Next? (I)
• Build summary tables
– Having raw data tables is good, but they can be too detailed and too voluminous to query directly
– Build these tables in your Data Warehouse (a sketch follows)
• Track the performance of key metrics
– These should come from the summary tables above
– You need a dashboard tool (build one or use a 3rd-party solution: Birst, ChartIO, Tableau and so on)
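A sketch of building one such summary table in Redshift from Python via psycopg2 (Redshift speaks the PostgreSQL wire protocol). The table, columns, host, and credentials are all placeholders.

```python
import psycopg2  # Redshift is reachable with a standard PostgreSQL driver

# Hypothetical daily summary; rebuild it from the raw enrollments table.
SUMMARY_SQL = """
DROP TABLE IF EXISTS daily_enrollments_summary;
CREATE TABLE daily_enrollments_summary AS
SELECT enrollment_date::date AS day,
       course_id,
       COUNT(*) AS enrollments
FROM enrollments
GROUP BY 1, 2;
"""

conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com",
                        port=5439, dbname="analytics",
                        user="etl_user", password="...")
cur = conn.cursor()
cur.execute(SUMMARY_SQL)
conn.commit()
```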
What’s Next? (II)
• Provide this data to the Data Science team
– Draw insights and create a feedback loop
– Build machine-learned models for recommendation, search ranking and so on
– The topic of the next session (Thanks Larry!)
• Supporting Data Science from the infrastructure side
– This will require scalable infrastructure
– Example: scoring every user/course pair at Udemy
• 7M users × 30K courses = 210B pairs to compute
– You need a scalable serving layer (Cassandra, HBase, …); see the sketch below
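The usual trick is not to materialize all 210B scores: score the courses for one user at a time, keep only the top N, and write just those into the serving layer. A sketch, with a stand-in scoring function and a hypothetical Cassandra schema:

```python
import heapq

TOP_N = 10  # keep only the best N courses per user

def score(user_id, course_id):
    """Stand-in for a real model; replace with actual feature-based scoring."""
    return hash((user_id, course_id)) % 100

def top_courses(user_id, course_ids):
    return heapq.nlargest(TOP_N, course_ids,
                          key=lambda c: score(user_id, c))

# The results would then go into the serving layer, e.g. with cassandra-driver:
# from cassandra.cluster import Cluster
# session = Cluster(["cassandra-host"]).connect("recs")
# session.execute("INSERT INTO user_recs (user_id, courses) VALUES (%s, %s)",
#                 (user_id, top_courses(user_id, all_course_ids)))
```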
DATA ENGINEERING AT UDEMY
Data Warehouse/Analytics
• We use AWS Redshift as our data warehouse (or data analytics backend)
• What is AWS Redshift?
– A scalable PostgreSQL-compatible engine, up to 1.6PB of data
– Roughly 600 USD per TB per month
– Mainly for offline batch processing
– Supports bulk loading (through AWS S3; see the sketch below)
– Two types of node options: compute-optimized vs. storage-optimized
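The S3 bulk-load path is Redshift's `COPY` command. A sketch from Python, with placeholder bucket, table, and credentials:

```python
import psycopg2

# COPY is Redshift's efficient bulk-load path from S3; all names are placeholders.
COPY_SQL = r"""
COPY web_access_log
FROM 's3://example-bucket/logs/2015-08-04/'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
GZIP
DELIMITER '\t';
"""

conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com",
                        port=5439, dbname="analytics",
                        user="etl_user", password="...")
cur = conn.cursor()
cur.execute(COPY_SQL)
conn.commit()
```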
Kinds of Data Stored in Redshift
• 800+ tables with 2.4TB of data
• Key tables from MySQL
• Email Marketing Data
• Ads Campaign Performance Data
• SEO data from Google
• Data from Web access log
• Support Ticket Data
• A/B Test Data (Mobile, Web)
• Human-curated data from Google Spreadsheets
Details on ETL Pipelines
• All data pipelines are scheduled through Pinball
– Every 5 minutes, hourly, daily, weekly and monthly
• Most pipelines are purely in Python
• Some use Hadoop/Hive and Hadoop/Pig for batch processing
• Starting to use Kinesis for realtime processing
Pinball Screenshot
Batch Processing Infrastructure
• We use Hadoop 2.6 with Hive and Pig
– CDH 5.4 (community version)
• We use our own Hadoop cluster and AWS EMR (Elastic MapReduce) at the same time
– This is used to do ETL on massive data
– It is also used to run massive model/scoring pipelines for the Data Science team
• Plan to evaluate Spark as a potential alternative
Realtime Processing
• Applications
– The first application is processing the web access log
– Eventually we plan to use this to generate personalized recommendations on the fly
• Plan to use AWS Kinesis
– Evaluated Apache Kafka and AWS Kinesis
• They are very similar, but Kafka is open source while Kinesis is a managed service from AWS
• Decided to use AWS Kinesis
• Plan to evaluate realtime consumers
– Samza, Storm, Spark Streaming
What is Kinesis (Kafka)?
• A realtime data processing service in AWS
– A publisher-subscriber message broker
– Very similar to Kafka
• It has two components
– One is the message queue, where the stream of data is stored
• 24-hour retention period
• You pay hourly, based on the read/write rate of the queue
– The other is the KCL (Kinesis Client Library)
• Used to build Data Producer and Data Consumer applications (a producer sketch follows)
• Can be combined with Storm, Spark Streaming, …
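A minimal producer sketch using boto3, the AWS SDK for Python; the stream name and payload are placeholders.

```python
import json
import boto3  # AWS SDK for Python

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Push one event onto a stream; stream name and payload are placeholders.
kinesis.put_record(
    StreamName="web-access-log",
    Data=json.dumps({"path": "/course/python", "status": 200}),
    PartitionKey="user-12345",  # determines which shard receives the record
)
```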
Data Serving Layer
• Redshift isn’t a good fit for reading data back in a realtime fashion, so you need something else
• We are using (or plan to use) the following; see the Redis sketch below:
– Cassandra
– Redis
– ElasticSearch
– MySQL
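As one example of the serving-layer pattern, here is a sketch with Redis: the offline pipeline precomputes per-user results and the web service reads them back with a cheap key lookup. Key names and values are illustrative.

```python
import json
import redis  # redis-py

r = redis.StrictRedis(host="redis-host", port=6379)

# The offline pipeline precomputes per-user recommendations and writes them in:
r.set("recs:user:12345", json.dumps([101, 202, 303]), ex=86400)  # 1-day TTL

# The web service serves them with a cheap key lookup at request time:
recs = json.loads(r.get("recs:user:12345") or "[]")
print(recs)
```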
What It Looks Like
[Architecture diagram: Udemy's web service produces Log Files (Nginx), MySQL data, and Key/Value data (Cassandra); ETL pipelines feed these, together with External Data Sources, into the Data Warehouse (Redshift), which connects to the Data Science Pipelines]
LESSONS LEARNED
• As a small start-up, survive first and then work on data
• The starting point is to store all data in a single location (a data warehouse)
• Start with batch processing, then move to realtime
• Consider the type of data you store
– Log vs. Key/Value vs. Transactional Record
• Store data in the form of a log (change history)
– So that you can always go back and debug/replay
• Cloud is good unless you have really massive data
Q & A
Udemy is Hiring
Editor's Notes
  • #8: Logging format. Don't try to take a snapshot and do aggregation.
  • #10, #11, #13: Add diagram. MySQL was likely used to store all data and queried directly by data analysts. Without a central warehouse, everyone does their own analysis and derives their own conclusions: a waste of resources on many one-off efforts.
  • #16: Realtime recommendation.