Data Infrastructure
Training
Aug 2015
A Big Data Startup
Infra Advisor: Min Zhou
Batch Track
• Diagram components: User’s end, SDK, Message Server, Kafka, AWS Redshift, MySQL Cluster (bin log replication), Reports
• Diagram labels: Send, Tracking, Streaming, Load (Format Conversion + COPY), Batch Query, Query
• Tracking schema: JSON format, Avro format
Real-time Track
• Diagram components: User’s end, SDK, Message Server, Kafka, Storm, MySQL, Cassandra, MySQL Cluster (bin log replication), Reports
• Diagram labels: Send, Tracking, Streaming
• Tracking schema: JSON format, Avro format
Lambda Architecture
• Diagram components: User’s end, SDK, Message Server, Kafka, Storm, MySQL, Cassandra, AWS Redshift, MySQL Cluster (bin log replication), Reports
• Diagram labels: Send, Tracking, Streaming, Load (Format Conversion + COPY), Batch Query, Query
• Tracking schema: JSON format, Avro format
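The Lambda Architecture slide combines a batch path and a real-time path behind the same reports. A minimal sketch of that idea in Python, assuming a report simply adds a batch view to a real-time view at query time. The function names and sample events are hypothetical, and plain in-memory counters stand in for Redshift and the real-time store:

```python
from collections import Counter


def count_events(events):
    """Build a view: number of occurrences per event name."""
    return Counter(e["event"] for e in events)


def query(batch_view, realtime_view):
    """Serving step: merge the (stale but complete) batch view with the
    (fresh but partial) real-time view when a report is requested."""
    return dict(batch_view + realtime_view)


# Hypothetical data: 'history' was already processed by the last batch run,
# 'recent' arrived afterwards and is only visible to the real-time path.
history = [{"event": "click"}, {"event": "view"}, {"event": "click"}]
recent = [{"event": "click"}, {"event": "purchase"}]

merged = query(count_events(history), count_events(recent))
print(merged)
```

When the next batch run absorbs the recent events, the real-time view for that period can be discarded, which keeps the real-time store small.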
Tracking data
• Protocol
– HTTP
– TCP
– UDP
• Format
– JSON
– Apache Avro
– Apache Thrift
– Google Protobuf
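As an illustration of one protocol/format pairing from the list above, the sketch below sends a JSON tracking event over UDP to a receiver on the local machine. The field names (`user_id`, `event`, `ts`) and helper names are assumptions made for the example, not a schema from the slides:

```python
import json
import socket
import time


def encode_event(user_id, event, ts=None):
    """Serialize one tracking event as compact JSON bytes."""
    payload = {
        "user_id": user_id,
        "event": event,
        "ts": ts if ts is not None else int(time.time()),
    }
    return json.dumps(payload, separators=(",", ":")).encode("utf-8")


def send_udp(data, host, port):
    """Fire-and-forget send: UDP gives no delivery or ordering guarantee."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(data, (host, port))


# Local demo: a receiver bound to an ephemeral port plays the message server.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)
host, port = receiver.getsockname()

send_udp(encode_event("u-1", "page_view", ts=1438387200), host, port)
data, _ = receiver.recvfrom(4096)
receiver.close()

received = json.loads(data)
print(received)
```

UDP keeps the SDK cheap on the client but can drop events. HTTP or TCP trades some overhead for delivery feedback, and a binary format such as Avro, Thrift, or Protobuf would replace the `json.dumps` step.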
Message Server
• Needs to be implemented in-house
• Nginx / Netty
• Needs a load balancer under high traffic
Kafka
• High throughput
• Scalable
• Persistent
• Replayable
• Guarantees message order within a partition
• Requires ZooKeeper
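The "persistent", "replayable", and ordering properties above can be pictured with a toy append-only log. This is an in-memory stand-in written for illustration, not the Kafka client API. The class `ToyLog` is hypothetical, and CRC32 is used only as a simple deterministic hash in place of Kafka's own partitioner:

```python
import zlib


class ToyLog:
    """In-memory model of a partitioned, append-only log."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Route by key so one key always lands in one partition.
        Returns (partition, offset) of the stored message."""
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1

    def read(self, partition, offset):
        """Replay: messages are never removed by reading, so a consumer
        can re-read from any earlier offset."""
        return list(self.partitions[partition][offset:])


log = ToyLog(num_partitions=3)
p, first = log.append("u-1", "login")
log.append("u-1", "click")
log.append("u-1", "logout")

replayed = log.read(p, first)   # full replay from the first offset
tail = log.read(p, first + 1)   # resume after the first message
print(replayed, tail)
```

Because ordering holds per partition, events that must stay in order (for example, all events of one user) should share a key.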
Real-time processing System
• Storm
• Samza
• Spark Streaming
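A typical job for any of these systems is a windowed aggregation over the event stream. The sketch below shows the logic of a tumbling-window count in plain Python. It does not use the Storm, Samza, or Spark Streaming APIs, and the function name and sample events are hypothetical:

```python
from collections import defaultdict


def tumbling_counts(events, window_seconds):
    """Count events per (window_start, event_name).

    Each event is a (timestamp_seconds, event_name) pair and falls into
    exactly one fixed-size, non-overlapping window.
    """
    counts = defaultdict(int)
    for ts, name in events:
        window_start = ts - (ts % window_seconds)
        counts[(window_start, name)] += 1
    return dict(counts)


events = [(0, "click"), (5, "click"), (12, "view"), (14, "click"), (21, "click")]
result = tumbling_counts(events, window_seconds=10)
print(result)
```

In a real deployment the per-window counts would be written to the real-time data store as each window closes, rather than returned at the end.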
Real-time Data store
• MySQL
• KV store
– Cassandra
– HBase
Batch processing System
• AWS Redshift
– Only available on Amazon cloud
– Easy to operate
– Easy to use
– Extra Format Conversion + COPY cost
• HDFS + Spark/Presto
– Needs extra effort to operate and use
– No Format Conversion + COPY cost
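The "Format Conversion + COPY" cost on the Redshift path can be sketched as two steps: rewrite the tracked records into a load-friendly file, then issue a COPY. The example below assumes JSON-lines input and CSV output. The table name, S3 path, and IAM role are placeholders, and the statement is only built as a string, not executed:

```python
import csv
import io
import json


def json_lines_to_csv(lines, columns):
    """Format conversion: one JSON object per input line, one CSV row out.
    Missing fields become empty strings."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for line in lines:
        record = json.loads(line)
        writer.writerow([record.get(c, "") for c in columns])
    return buf.getvalue()


def copy_statement(table, s3_path, iam_role):
    """Build a Redshift COPY statement for a CSV file staged in S3."""
    return f"COPY {table} FROM '{s3_path}' IAM_ROLE '{iam_role}' FORMAT AS CSV;"


lines = [
    '{"user_id": "u-1", "event": "click", "ts": 1438387200}',
    '{"user_id": "u-2", "event": "view", "ts": 1438387260}',
]
csv_text = json_lines_to_csv(lines, ["user_id", "event", "ts"])
sql = copy_statement(
    "tracking_events",                              # placeholder table
    "s3://example-bucket/events.csv",               # placeholder path
    "arn:aws:iam::123456789012:role/ExampleRole",   # placeholder role
)
print(csv_text)
print(sql)
```

With HDFS + Spark/Presto the query engine reads the stored files in place, so this conversion and load step does not apply.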
Cloud vs. In-house Cluster
• Operation cost
• Hardware cost

Big Data Analytics Infrastructure