SF Hadoop Users Group August 2014 Meetup Slides

Hadoop at Lookout
Aug 13, 2014
Yash Ranadive
@yashranadive
Thursday, August 14, 14

BIO
• Data Engineer
• From Mumbai, India
• Lived in 7 different cities in the US
• @yashranadive
• etl.svbtle.com

AGENDA
• What we do @Lookout
• Data warehouse
• Evolution from monolithic to micro-services
• Protocol Buffers
• Areas we are exploring

WHAT WE DO
@LOOKOUT

Over 50 million registered users

DATA TEAM
• 3 Data Engineers
• 6 data analysts
• Hadoop
• 64 hosts
• 300 TB capacity

DATA WAREHOUSE
[Diagram: internal and external data sources feed the warehouse, which consists of a MySQL star-schema warehouse and HDFS (Hive, HBase, Impala). Ingestion tools: Chunker, Mudskipper. Consumers: R, Hue, Shiny, Tableau, custom apps.]

FROM MONOLITHIC TO
MICROSERVICES

MONOLITHIC APPLICATION
[Diagram: mobile/web clients talk over HTTP to a single Rails application (routing, controller, views, ORM), which is backed by database tables.]

DATA INGESTION - MONOLITHIC
[Diagram components: application, master_db, slave_db (MySQL replication), external sources, ETL / ELT, data warehouse (MySQL, Hive), reporting.]
Ingestion is batch-oriented

PROBLEM
• Rails offers fast time to market (TTM) but is challenging to scale
• One code base
• Slower Deployments
• Too complex and large to manage
• Solution
• Microservices / service oriented architecture
• Break out the app into smaller services

MICROSERVICES ARCHITECTURE
[Diagram: mobile/web clients talk over HTTP to the Rails application (routing, controller, views, ORM, database tables), which now sits alongside separate services: Settings Service, Photo Backup.]
We frequently add new services

DATA INGESTION - MICROSERVICES
[Diagram components: application, master_db, slave_db (MySQL replication), external sources, ETL / ELT, data warehouse (MySQL, Hive), reporting; plus Settings Service, Backup Service, Locate Service, a messaging layer, and a consumer.]

DATA INGESTION -
MONOLITHIC VS MICROSERVICES
Monolithic - Snapshot of a point in time
select * from user_settings;
id | setting_id  | user_id | modified_at
===+=============+=========+===============
1  | backup      | 2629    | 20140709T0400Z
3  | locate      | 2682    | 20140709T0402Z
8  | wipe        | 2629    | 20140709T0403Z
9  | theft_alert | 2629    | 20140709T0407Z

Microservices - Events
{guid: 1, event_type: "modify_setting", setting_id: "backup", setting_status: "ON", user_id: "2629", timestamp: "20140709T0400Z"}
{guid: 3, event_type: "start_backup", user_id: "2629", timestamp: "20140709T0400Z"}
...
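The contrast between a point-in-time snapshot and a stream of events can be sketched in a few lines of Python. This is only an illustration, not Lookout's code: the function name `snapshot_from_events` and the third event (guid 4) are hypothetical, and the event fields follow the examples on this slide. It shows that a snapshot can be rebuilt by folding events in timestamp order, while the reverse is not possible.

```python
from typing import Dict, Iterable


def snapshot_from_events(events: Iterable[dict]) -> Dict[str, Dict[str, str]]:
    """Fold modify_setting events into {user_id: {setting_id: status}}."""
    state: Dict[str, Dict[str, str]] = {}
    # Timestamps in one fixed compact ISO 8601 format sort correctly as strings.
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if event["event_type"] != "modify_setting":
            continue  # e.g. start_backup does not change a setting
        user_settings = state.setdefault(event["user_id"], {})
        user_settings[event["setting_id"]] = event["setting_status"]
    return state


events = [
    {"guid": 1, "event_type": "modify_setting", "setting_id": "backup",
     "setting_status": "ON", "user_id": "2629", "timestamp": "20140709T0400Z"},
    {"guid": 3, "event_type": "start_backup",
     "user_id": "2629", "timestamp": "20140709T0400Z"},
    # Hypothetical later event, added only to show that the last write wins.
    {"guid": 4, "event_type": "modify_setting", "setting_id": "backup",
     "setting_status": "OFF", "user_id": "2629", "timestamp": "20140709T0405Z"},
]

snapshot = snapshot_from_events(events)
print(snapshot)
```

The event log keeps the full history (the setting was ON, then OFF), whereas the snapshot only keeps the latest value per user and setting.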

DESIGN
• We wanted to create an always-on event
ingestion framework that:
• Would scale workers on demand
• Would be easy to monitor

FIRST STAB - WORKER
Service -> ActiveMQ -> Ruby Worker -> HIVE
• Upstart script that daemonized Ruby process
• Monitoring using Zenoss
• Very easy to set up
• Mapping Files for JSON -> CSV
• Ruby is terse and clean
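The "mapping files for JSON -> CSV" idea can be sketched as below. The original worker was written in Ruby and its mapping file format is not shown in the slides, so everything here is an assumption made for illustration: the `MAPPING` structure, the column names, and the function `json_to_csv_row` are hypothetical.

```python
import csv
import io
import json

# Hypothetical mapping: ordered (CSV column name, JSON key) pairs.
MAPPING = [
    ("guid", "guid"),
    ("event_type", "event_type"),
    ("user_id", "user_id"),
    ("event_ts", "timestamp"),
]


def json_to_csv_row(message: str, mapping=MAPPING) -> str:
    """Convert one JSON event into one CSV line, in mapping order."""
    event = json.loads(message)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # Keys missing from the event become empty columns.
    writer.writerow([event.get(key, "") for _, key in mapping])
    return buf.getvalue()


row = json_to_csv_row(
    '{"guid": 3, "event_type": "start_backup", '
    '"user_id": "2629", "timestamp": "20140709T0400Z"}'
)
print(row, end="")
```

Keeping the column order in a mapping rather than in code means a new event type needs only a new mapping, which fits a setup where services are added frequently.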

PROBLEMS
• ActiveMQ
• ActiveMQ did not scale well - even with
multiple machines in the AMQ cluster
• ActiveMQ creates a separate queue for every
consumer of the topic
• Monitoring using Zenoss is not ideal especially for
multi-process consumers
• The worker ran on a single machine, so it was not fault tolerant

CURRENT ARCHITECTURE - WORKER
Service -> Kafka -> Storm -> HIVE
• Monitoring using Storm’s thrift API
• Scaling number of workers is easy
• Kafka has better scalability than ActiveMQ

STORM TOPOLOGY
[Diagram: Service -> Kafka -> Storm (Kafka spout and ActiveMQ spout -> processing bolt -> storm-hdfs bolt) -> HDFS (landing directory -> Hive directory)]

JSON PROBLEMS
• Problems with JSON
• No predefined schema
• No enforcement of backward compatibility
• Solution
• Protocol Buffers (also Avro/Thrift)

PROTOBUFS
• What?
• Way of encoding structured data
• Binary
• Why?
• Schema
• Backward compatibility
• Smaller in size than JSON

VERSIONING
• backward compatible changes only
Producer (.proto Version 1.4) -> Queue -> Consumer (.proto Version 1.1)
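Why a consumer on an older schema can still read messages from a newer producer can be sketched with a hand-rolled subset of the Protocol Buffers wire format (varint and length-delimited fields only). This is not the protobuf library and not Lookout's schema: the field numbers, the field names, and the helpers `encode_field` and `decode` are assumptions for illustration. The point is that each field carries its number and wire type, so a reader can skip fields it does not know.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int):
    """Return (value, next position) for the varint starting at pos."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_field(number: int, value) -> bytes:
    """Encode an int (wire type 0) or a str (wire type 2) field."""
    if isinstance(value, int):
        return encode_varint((number << 3) | 0) + encode_varint(value)
    data = value.encode("utf-8")
    return encode_varint((number << 3) | 2) + encode_varint(len(data)) + data


def decode(buf: bytes, known: dict) -> dict:
    """Decode fields listed in known ({field number: name}); skip the rest."""
    message, pos = {}, 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:
            length, pos = decode_varint(buf, pos)
            value, pos = buf[pos:pos + length].decode("utf-8"), pos + length
        else:
            raise ValueError("wire type %d not handled in this sketch" % wire_type)
        if number in known:  # unknown field numbers are silently dropped
            message[known[number]] = value
    return message


# Hypothetical schemas: version 1.4 adds field 3 to version 1.1.
V1_1 = {1: "user_id", 2: "event_type"}
V1_4 = {1: "user_id", 2: "event_type", 3: "setting_status"}

payload = (encode_field(1, 2629)
           + encode_field(2, "modify_setting")
           + encode_field(3, "ON"))

old_view = decode(payload, V1_1)
new_view = decode(payload, V1_4)
print(old_view)
print(new_view)
```

This only works if changes stay backward compatible, for example adding new field numbers without reusing or retyping existing ones, which is the constraint the slide states.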

SHARING PROTOBUF SCHEMAS
[Diagram: producers push Java jars and Ruby gems to Artifactory (schema repo); the data team's Storm project pulls Java jars from it.]

BUT HOW DO YOU STORE
PROTOBUFS IN HDFS?

HOW WE STORE PROTOBUFS
• Store raw version
• Raw dump of Kafka topic into HDFS
• Convert them to a tuple using Storm
• Inflate then convert to TSV
• Can query raw protobufs directly from HIVE but we don’t yet
• elephant-bird (difficult to get working)

STORM TOPOLOGY
[Diagram: Service -> Kafka -> Storm (Kafka spout and ActiveMQ spout -> deserialize protobuf bolt -> storm-hdfs bolt) -> HDFS (landing directory -> Hive directory)]

AREAS WE ARE
EXPLORING

SPARK
• ETL
• Wordcount: ~5 lines of Scala code vs. 58 lines of Java MapReduce code
• Spark Streaming can achieve results similar to Storm's through micro-batching
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
• Machine Learning
• Online learning using MLlib
• Logistic Regression and SVM
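The micro-batching idea mentioned for Spark Streaming above can be sketched in plain Python, without Spark. This is a simplification under stated assumptions: Spark Streaming forms batches by time interval, whereas this sketch batches by record count, and the names `micro_batches` and `running_word_count` are hypothetical.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Cut a (possibly unbounded) stream into lists of up to batch_size items."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def running_word_count(stream: Iterable[str], batch_size: int) -> Counter:
    """Process the stream one small batch at a time, keeping running totals."""
    totals: Counter = Counter()
    for batch in micro_batches(stream, batch_size):
        for line in batch:
            totals.update(line.split())
    return totals


counts = running_word_count(["a b", "b c", "c c"], batch_size=2)
print(counts)
```

Each batch is handled with ordinary batch logic, which is why the same word-count code can serve both batch ETL and streaming; a record-at-a-time system such as Storm instead processes each tuple as it arrives.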

H2O
• In-memory machine learning
• Tight integration with R
• Preferred by Data Scientists

OPEN SOURCE PROJECTS
• Currently open sourced
• Pipefish - write from MySQL to HDFS
github.com/lookout/pipefish
• Future
• Mudskipper - capture change-data
events from MySQL binlogs.
• Chunker - download MySQL table data
in chunks

Questions
