Big Data Analytics Infrastructure

Data Infrastructure Training
Aug 2015
A Big Data Startup
Infra Advisor: Min Zhou

Batch Track
[Diagram. Components: User's end, Tracking SDK, Message Server, Kafka, AWS Redshift, MySQL Cluster, Reports. Labels: Send, Streaming, Load (Format Conversion + COPY), binlog replication, Batch Query, Query. Tracking schema: JSON format, Avro format.]

Real-time Track
[Diagram. Components: User's end, Tracking SDK, Message Server, Kafka, Storm, MySQL, Cassandra, MySQL Cluster, Reports. Labels: Send, Streaming, binlog replication. Tracking schema: JSON format, Avro format.]

Lambda Architecture
[Diagram. Components: User's end, Tracking SDK, Message Server, Kafka, Storm, MySQL, Cassandra, AWS Redshift, MySQL Cluster, Reports. Labels: Send, Streaming, Load (Format Conversion + COPY), binlog replication, Batch Query, Query. Tracking schema: JSON format, Avro format.]
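
The Lambda Architecture slide combines the batch path (Redshift) with the real-time path (Storm into MySQL/Cassandra). One common way to serve a report from both is to add the real-time counts on top of the batch counts at query time. The sketch below is only an illustration of that idea: the function name `merge_views`, the sample metrics, and the assumption that the real-time view holds only events newer than the last batch load are not from the slides.

```python
from collections import Counter


def merge_views(batch_view, realtime_view):
    """Add per-metric counts from the real-time view onto the batch view.

    Assumes the real-time view only contains events that arrived after
    the last batch load, so no event is counted twice.
    """
    merged = Counter(batch_view)
    merged.update(realtime_view)  # Counter.update adds counts, it does not overwrite
    return dict(merged)


# Hypothetical sample data: batch totals plus the most recent real-time increments.
batch_view = {"page_view": 1000, "signup": 40}
realtime_view = {"page_view": 12, "purchase": 3}
report = merge_views(batch_view, realtime_view)
```

When the next batch load completes, the real-time view would be reset to cover only the events after the new cutoff.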

Tracking data
• Protocol
– HTTP
– TCP
– UDP
• Format
– JSON
– Apache Avro
– Apache Thrift
– Google Protobuf
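
As an illustration of the JSON option, the sketch below builds one tracking event and encodes it for sending over any of the listed protocols. The field names (`event_id`, `user_id`, `event`, `timestamp`, `properties`) are hypothetical; the slides do not define a tracking schema.

```python
import json
import time
import uuid


def build_event(user_id, event_name, properties=None, timestamp=None):
    """Build one tracking event as a plain dict (field names are assumptions)."""
    return {
        "event_id": str(uuid.uuid4()),
        "user_id": user_id,
        "event": event_name,
        # Milliseconds since the Unix epoch unless the caller supplies a value.
        "timestamp": timestamp if timestamp is not None else int(time.time() * 1000),
        "properties": properties or {},
    }


def encode_event(event):
    """Serialize an event to compact UTF-8 JSON bytes."""
    return json.dumps(event, separators=(",", ":"), sort_keys=True).encode("utf-8")


def decode_event(payload):
    """Parse UTF-8 JSON bytes back into a dict."""
    return json.loads(payload.decode("utf-8"))
```

A binary format such as Avro, Thrift, or Protobuf would replace `encode_event` and `decode_event` with schema-based serialization, typically yielding smaller payloads.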

Message Server
• Must be implemented yourself
• Nginx / Netty
• Needs a load balancer under high traffic
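
To make the role of the message server concrete, here is a minimal sketch using only the Python standard library rather than Nginx or Netty. It accepts a JSON tracking event over HTTP POST, validates it, and hands it to an in-memory queue that stands in for a Kafka producer. The names `handle_payload`, `TrackingHandler`, `serve`, and the required `event` field are all assumptions for illustration.

```python
import json
import queue
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for a Kafka producer (assumption): accepted events go onto this queue.
EVENT_QUEUE = queue.Queue()


def handle_payload(body):
    """Validate one raw request body and return an HTTP status code."""
    try:
        event = json.loads(body.decode("utf-8"))
    except ValueError:  # raised for both invalid UTF-8 and invalid JSON
        return 400
    if not isinstance(event, dict) or "event" not in event:
        return 400
    EVENT_QUEUE.put(event)
    return 204


class TrackingHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status = handle_payload(self.rfile.read(length))
        self.send_response(status)
        self.end_headers()


def serve(port=8080):
    """Run the server until interrupted; the port is an arbitrary choice."""
    HTTPServer(("127.0.0.1", port), TrackingHandler).serve_forever()
```

Keeping the validation in a plain function makes it easy to run several identical server instances behind a load balancer, since no state is shared between requests.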

Kafka
• High throughput
• Scalable
• Persistent
• Replayable
• Guarantees message order within a partition
• Requires ZooKeeper
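
The ordering and replay properties can be illustrated with a toy in-memory model. This is not the Kafka client API: `PartitionedLog` and `choose_partition` are hypothetical names, and CRC32 is used here only as a simple stand-in for Kafka's own key hashing. The point is that messages with the same key land in the same partition, where they keep their append order and can be re-read from any offset.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a key to a partition; the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class PartitionedLog:
    """Toy append-only log split into partitions (not a real Kafka client)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Append a message and return its (partition, offset)."""
        partition = choose_partition(key, len(self.partitions))
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition, offset=0):
        """Return all messages in a partition from the given offset onward."""
        return list(self.partitions[partition][offset:])
```

Reading does not remove messages, so a consumer that fails can start again from an earlier offset and see the same sequence.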

Real-time processing System
• Storm
• Samza
• Spark Streaming
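
A typical job for any of these systems is counting events per time window as they arrive. The sketch below shows that computation in plain Python, without using the Storm, Samza, or Spark Streaming APIs. The function name and the `timestamp` and `event` fields are assumptions.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_ms):
    """Count events per (window start, event name) using fixed-size windows.

    Each event is a dict with an integer "timestamp" in milliseconds
    and an "event" name.
    """
    counts = defaultdict(int)
    for event in events:
        # Round the timestamp down to the start of its window.
        window_start = (event["timestamp"] // window_ms) * window_ms
        counts[(window_start, event["event"])] += 1
    return dict(counts)
```

In a real streaming system the counts would be updated incrementally and written to the real-time data store as each window closes.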

Real-time Data store
• MySQL
• KV store
– Cassandra
– HBase

Batch processing System
• AWS Redshift
– Only available on Amazon cloud
– Easy to operate
– Easy to use
– Extra Format Conversion + COPY cost
• HDFS + Spark/Presto
– Needs extra effort to operate and use
– No Format Conversion + COPY cost
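
The "Format Conversion + COPY" step for Redshift can be sketched as follows: JSON tracking records are flattened into CSV, uploaded to S3 (not shown), and loaded with a COPY statement. The column list, table name, S3 path, and IAM role are hypothetical placeholders, and the statement is only built as a string here, not executed.

```python
import csv
import io
import json

# Hypothetical column order of the target table.
COLUMNS = ["event_id", "user_id", "event", "timestamp"]


def json_lines_to_csv(json_lines, columns=COLUMNS):
    """Convert JSON records (one per line) into CSV text in a fixed column order."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in json_lines:
        record = json.loads(line)
        # Missing fields become empty strings.
        writer.writerow([record.get(col, "") for col in columns])
    return out.getvalue()


def build_copy_statement(table, s3_path, iam_role):
    """Build a Redshift COPY statement that loads CSV files from S3."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV;"
    )
```

With HDFS + Spark/Presto, the query engine reads the stored files directly, which is why the slide lists no conversion or COPY cost for that option.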

Cloud vs In-house cluster
• Operation cost
• Hardware cost
