Hadoop at Lookout
Aug 13, 2014
Yash Ranadive
@yashranadive
Thursday, August 14, 14
BIO
• Data Engineer
• From Mumbai, India
• Lived in 7 different cities in the US
• @yashranadive
• etl.svbtle.com
AGENDA
• What we do @Lookout
• Data warehouse
• Evolution from monolithic to micro-services
• Protocol Buffers
• Areas we are exploring
WHAT WE DO
@LOOKOUT
Over 50 million registered users
DATA TEAM
• 3 Data Engineers
• 6 data analysts
• Hadoop
• 64 hosts
• 300 TB capacity
DATA WAREHOUSE
[Diagram labels: internal and external data sources; Chunker; Mudskipper; warehouse made up of a MySQL star schema warehouse and HDFS (Hive, HBase, Impala); R, Hue, Shiny, Tableau, custom apps]
FROM MONOLITHIC TO
MICROSERVICES
MONOLITHIC APPLICATION
[Diagram: mobile/web clients connect over HTTP to a single Rails application (routing, controller, views, ORM) backed by a database (tables)]
DATA INGESTION - MONOLITHIC
[Diagram: Application -> master_db -> slave_db (MySQL replication) -> ETL / ELT -> data warehouse (MySQL, Hive) -> reporting; external sources also feed the data warehouse]
Ingestion is batch-oriented
PROBLEM
• Rails offers fast time to market (TTM) but is challenging to scale
• One code base
• Slower Deployments
• Too complex and large to manage
• Solution
• Microservices / service oriented architecture
• Break out the app into smaller services
MICROSERVICES ARCHITECTURE
[Diagram: the Rails application (routing, controller, views, ORM, database tables) still serves mobile/web clients over HTTP, now alongside separate services such as Settings Service and Photo Backup]
We frequently add new services
DATA INGESTION - MICROSERVICES
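[Diagram: the batch path is unchanged (Application -> master_db -> slave_db via MySQL replication -> ETL / ELT -> data warehouse in MySQL and Hive -> reporting, plus external sources); in addition, the Settings Service, Backup Service, and Locate Service publish to a messaging layer, and a consumer reads from it]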
DATA INGESTION - MONOLITHIC VS MICROSERVICES
Monolithic - Snapshot of a point in time
select * from user_settings;
id | setting_id  | user_id | modified_at
==========================================
1  | backup      | 2629    | 20140709T0400Z
3  | locate      | 2682    | 20140709T0402Z
8  | wipe        | 2629    | 20140709T0403Z
9  | theft_alert | 2629    | 20140709T0407Z
Microservices - Events
{guid: 1, event_type: "modify_setting", setting_id: "backup", setting_status: "ON", user_id: "2629", timestamp: "20140709T0400Z"}
{guid: 3, event_type: "start_backup", user_id: "2629", timestamp: "20140709T0400Z"}
...
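The difference between a snapshot and an event stream can be sketched in a few lines of Python. This is a minimal illustration, not Lookout's code: it assumes that replaying `modify_setting` events in timestamp order yields the current setting per user. The event with `guid` 4 and the function name `snapshot_settings` are hypothetical additions for the example.

```python
# Events modeled on the slide; the guid 4 event is a hypothetical addition.
events = [
    {"guid": 1, "event_type": "modify_setting", "setting_id": "backup",
     "setting_status": "ON", "user_id": "2629", "timestamp": "20140709T0400Z"},
    {"guid": 3, "event_type": "start_backup", "user_id": "2629",
     "timestamp": "20140709T0400Z"},
    {"guid": 4, "event_type": "modify_setting", "setting_id": "backup",
     "setting_status": "OFF", "user_id": "2629", "timestamp": "20140709T0405Z"},
]


def snapshot_settings(events):
    """Fold modify_setting events into the latest status per (user, setting)."""
    state = {}
    # Timestamps share one fixed-width format, so string order is time order.
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if event["event_type"] != "modify_setting":
            continue  # other event types do not change settings
        state[(event["user_id"], event["setting_id"])] = event["setting_status"]
    return state


snapshot = snapshot_settings(events)
print(snapshot)
```

The event stream keeps the full history (the setting was ON, then OFF), while the snapshot only holds the final value.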
DESIGN
• We wanted to create an always-on event
ingestion framework that:
• Would scale workers on demand
• Would be easy to monitor
FIRST STAB - WORKER
Service -> ActiveMQ -> Ruby Worker -> Hive
• Upstart script that daemonized the Ruby process
• Monitoring using Zenoss
• Very easy to set up
• Mapping Files for JSON -> CSV
• Ruby is terse and clean
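A mapping-driven JSON -> CSV conversion might look roughly like the sketch below. The original worker was written in Ruby and its mapping file format is not shown, so everything here is an assumption: the `MAPPING` column list, the function name `json_to_csv_row`, and the choice to leave missing fields empty.

```python
import csv
import io
import json

# Hypothetical mapping: the ordered list of JSON keys that become CSV columns.
MAPPING = ["guid", "event_type", "user_id", "setting_id",
           "setting_status", "timestamp"]


def json_to_csv_row(message, mapping=MAPPING):
    """Parse one JSON message and emit one CSV line in mapping order."""
    event = json.loads(message)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # Keys absent from the event become empty cells.
    writer.writerow([event.get(key, "") for key in mapping])
    return buf.getvalue()


row = json_to_csv_row(
    '{"guid": 3, "event_type": "start_backup", '
    '"user_id": "2629", "timestamp": "20140709T0400Z"}'
)
print(row, end="")
```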
PROBLEMS
• ActiveMQ
• ActiveMQ did not scale well - even with
multiple machines in the AMQ cluster
• ActiveMQ creates a separate queue for every
consumer of the topic
• Monitoring using Zenoss is not ideal, especially for
multi-process consumers
• The worker ran on a single machine - not fault
tolerant
CURRENT ARCHITECTURE - WORKER
Service -> Kafka -> Storm -> Hive
• Monitoring using Storm’s Thrift API
• Scaling number of workers is easy
• Kafka has better scalability than ActiveMQ
Service ActiveMQ
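One reason Kafka handles many consumers more gracefully is that a topic partition is a single append-only log and each consumer group only tracks its own read offset, instead of the broker keeping a separate queue per consumer. The toy model below illustrates that idea only; `TopicLog` and the group names are hypothetical and this is not the Kafka client API.

```python
class TopicLog:
    """Toy model of one topic partition: messages are stored exactly once."""

    def __init__(self):
        self.messages = []
        self.offsets = {}  # consumer group name -> next offset to read

    def append(self, message):
        self.messages.append(message)

    def poll(self, group, max_messages=10):
        """Return the next unread messages for a group and advance its offset."""
        start = self.offsets.get(group, 0)
        batch = self.messages[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch


log = TopicLog()
for message in ["a", "b", "c"]:
    log.append(message)

first = log.poll("hive_loader", max_messages=2)  # reads a, b
monitor = log.poll("monitor")                    # independently reads a, b, c
rest = log.poll("hive_loader")                   # resumes at c
print(first, monitor, rest)
```

Adding a consumer group costs one integer of bookkeeping rather than another copy of every message.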
STORM TOPOLOGY
[Diagram: Service -> Kafka -> Storm (Kafka spout and ActiveMQ spout -> processing bolt -> storm-hdfs bolt) -> HDFS landing directory -> Hive directory]
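The shape of this topology (spout, processing bolt, HDFS bolt) can be mimicked with plain Python generators. This is only a structural sketch under loud assumptions: the function names are hypothetical stand-ins, a local temporary directory plays the role of the HDFS landing directory, and none of this uses the Storm API.

```python
import json
import os
import tempfile


def kafka_spout(raw_messages):
    """Stand-in for a spout: emits raw messages one at a time."""
    for message in raw_messages:
        yield message


def processing_bolt(stream):
    """Stand-in for the processing bolt: parse JSON and emit a TSV line."""
    for raw in stream:
        event = json.loads(raw)
        yield "\t".join([str(event["guid"]), event["event_type"],
                         event["user_id"], event["timestamp"]])


def hdfs_bolt(stream, landing_dir):
    """Stand-in for the storm-hdfs bolt: write lines into a landing directory."""
    path = os.path.join(landing_dir, "events.tsv")
    with open(path, "w") as handle:
        for line in stream:
            handle.write(line + "\n")
    return path


raw = ['{"guid": 3, "event_type": "start_backup", '
       '"user_id": "2629", "timestamp": "20140709T0400Z"}']
with tempfile.TemporaryDirectory() as landing_dir:
    path = hdfs_bolt(processing_bolt(kafka_spout(raw)), landing_dir)
    with open(path) as handle:
        written = handle.read()
print(written, end="")
```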
JSON PROBLEMS
• Problems with JSON
• No predefined schema
• No enforcement of backward compatibility
• Solution
• Protocol Buffers (also Avro/Thrift)
PROTOBUFS
• What?
• Way of encoding structured data
• Binary
• Why?
• Schema
• Backward compatibility
• Smaller in size than JSON
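To see why the binary encoding is smaller than JSON, the sketch below hand-encodes one event in the Protocol Buffers wire format (varint tags and length-delimited strings) and compares byte counts. The field numbers in `FIELD_NUMBERS` are an assumed schema, not Lookout's, and real code would use the generated protobuf classes rather than this hand-rolled encoder.

```python
import json


def encode_varint(n):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_field(field_number, value):
    """Encode an int as wire type 0 (varint) or a str as wire type 2."""
    if isinstance(value, int):
        return encode_varint((field_number << 3) | 0) + encode_varint(value)
    data = value.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(data)) + data


# Hypothetical field numbering for the event shown earlier in the deck.
FIELD_NUMBERS = {"guid": 1, "event_type": 2, "user_id": 3, "timestamp": 4}
event = {"guid": 3, "event_type": "start_backup",
         "user_id": "2629", "timestamp": "20140709T0400Z"}

encoded = b"".join(encode_field(FIELD_NUMBERS[k], v) for k, v in event.items())
as_json = json.dumps(event).encode("utf-8")
print(len(encoded), "bytes as protobuf vs", len(as_json), "bytes as JSON")
```

The saving comes mostly from replacing field names with one-byte tags, since the schema already tells both sides what each field number means.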
VERSIONING
• Backward-compatible changes only
.proto .proto
Version 1.4 Version 1.1
Producer Consumer Queue
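Backward compatibility works because every field on the wire carries its number and wire type, so a reader on an older schema can skip fields it does not know. The sketch below assumes the producer is on the newer schema and the consumer on the older one; the field numbers and the extra `device_os` field are hypothetical, and the decoder handles only varint and length-delimited fields.

```python
def read_varint(buf, pos):
    """Read one base-128 varint starting at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode(buf, known_fields):
    """Decode wire types 0 and 2, silently skipping unknown field numbers."""
    out = {}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("wire type %d not handled in this sketch" % wire_type)
        if number in known_fields:
            out[known_fields[number]] = value
    return out


# Message written with a newer schema: field 5 (device_os) is hypothetical.
message = b"\x08\x03" + b"\x12\x0cstart_backup" + b"\x2a\x07android"

old_view = decode(message, {1: "guid", 2: "event_type"})
new_view = decode(message, {1: "guid", 2: "event_type", 5: "device_os"})
print(old_view)
print(new_view)
```

The old consumer keeps working unchanged, which is why only additive, backward-compatible schema changes are allowed.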
SHARING PROTOBUF SCHEMAS
[Diagram: producers push compiled schemas (Java jars, Ruby gems) to Artifactory, which acts as the schema repo; the data team's Storm project pulls the Java jars]
BUT HOW DO YOU STORE
PROTOBUFS IN HDFS?
HOW WE STORE PROTOBUFS
• Store raw version
• Raw dump of the Kafka topic into HDFS
• Convert them to a tuple using Storm
• Inflate then convert to TSV
• Can query raw protobufs directly from HIVE but we
don’t yet
• elephant-bird (difficult to get it working)
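The "inflate then convert to TSV" step could look something like the sketch below once a protobuf has been decoded into a dictionary. The column list and function name are hypothetical. The sketch writes missing fields as `\N`, which Hive's default text format reads as NULL, and replaces embedded tabs and newlines with spaces so each record stays on one line.

```python
# Hypothetical column order for the target Hive table.
COLUMNS = ["guid", "event_type", "user_id", "setting_id",
           "setting_status", "timestamp"]


def to_tsv_line(record, columns=COLUMNS):
    """Flatten a decoded record into one tab-separated line."""
    cells = []
    for name in columns:
        value = record.get(name)
        if value is None:
            cells.append("\\N")  # Hive's default NULL marker in text files
        else:
            # Keep one record per line and one value per column.
            cells.append(str(value).replace("\t", " ").replace("\n", " "))
    return "\t".join(cells)


line = to_tsv_line({"guid": 3, "event_type": "start_backup",
                    "user_id": "2629", "timestamp": "20140709T0400Z"})
print(line)
```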
STORM TOPOLOGY
[Diagram: Service -> Kafka -> Storm (Kafka spout and ActiveMQ spout -> deserialize-protobuf bolt -> storm-hdfs bolt) -> HDFS landing directory -> Hive directory]
AREAS WE ARE
EXPLORING
SPARK
• ETL
• Word count: ~5 lines of Scala code vs. 58 lines of
Java MapReduce code
• Spark Streaming can achieve results similar to
Storm's through micro-batching
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
• Machine Learning
• Online learning using MLlib
• Logistic Regression and SVM
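Micro-batching means cutting a continuous stream into small batches and running an ordinary batch computation on each one. The plain-Python sketch below shows the idea with a running count of event types. It is a simplification and does not use Spark: Spark Streaming batches by time interval, while this example batches by a fixed number of items, and `micro_batches` is a hypothetical helper.

```python
from collections import Counter
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield consecutive lists of up to batch_size items from a stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


event_types = ["start_backup", "modify_setting", "start_backup",
               "locate", "start_backup"]

totals = Counter()
for batch in micro_batches(event_types, batch_size=2):
    # Each batch is processed as a small, ordinary batch job.
    totals.update(Counter(batch))
print(totals)
```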
H2O
• In-memory machine learning
• Tight integration with R
• Preferred by Data Scientists
OPEN SOURCE PROJECTS
• Currently open sourced
• Pipefish - write from MySQL to HDFS
github.com/lookout/pipefish
• Future
• Mudskipper - capture change-data
events from MySQL binlogs.
• Chunker - download MySQL table data
in chunks
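Chunker's implementation is not shown in the deck, so the following is only one plausible way to download a table in chunks: keyset pagination on the primary key. The sketch uses Python's built-in `sqlite3` as a stand-in for MySQL, reuses the `user_settings` rows from the earlier slide, and assumes positive integer ids and a trusted table name. The function `read_in_chunks` is hypothetical.

```python
import sqlite3


def read_in_chunks(conn, table, chunk_size):
    """Yield lists of rows ordered by id, fetching chunk_size rows per query."""
    last_id = 0  # assumes ids are positive integers
    while True:
        # The table name is interpolated, so it must come from trusted code.
        rows = conn.execute(
            f"SELECT id, setting_id, user_id FROM {table} "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # resume after the last id seen


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_settings "
             "(id INTEGER PRIMARY KEY, setting_id TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO user_settings VALUES (?, ?, ?)",
    [(1, "backup", 2629), (3, "locate", 2682),
     (8, "wipe", 2629), (9, "theft_alert", 2629)],
)

chunks = list(read_in_chunks(conn, "user_settings", chunk_size=3))
print([len(chunk) for chunk in chunks])
```

Filtering on `id > last_id` instead of using `OFFSET` keeps each query cheap even deep into a large table, because the database can seek directly on the primary key.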
Questions
SF Hadoop Users Group August 2014 Meetup Slides
