Hadoop Ecosystem and Low
Latency Streaming Architecture
InSemble Inc.
http://www.insemble.com
Agenda
1. What is Big Data and why it is relevant?
2. Hadoop Ecosystem
3. Reference Architecture for Low Latency Streaming
4. Flume, Kafka and Storm
5. Demo
Big Data Definitions
• Wikipedia defines it as “Data Sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage and process
data within a tolerable elapsed time”
• Gartner defines it as Data with the following characteristics
– High Velocity
– High Variety
– High Volume
• Another definition is “Big Data is large-volume, unstructured data
which cannot be handled by traditional database management systems”
Why a game changer
• Schema on Read
– Interpreting data at processing time
– Keys and values are not intrinsic properties of the data; they are
chosen by the person analyzing the data
• Move code to data
– With traditional systems, we bring data to the code and I/O becomes a
bottleneck
– With distributed systems, we have to deal with our own
checkpointing/recovery
• More data beats better algorithms
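
As a rough illustration of schema on read, the sketch below stores raw JSON lines untouched and lets each reader pick the key and value fields at processing time. The sample records and the `read_with_schema` helper are hypothetical and are not taken from the slides.

```python
import json

# Raw events are stored exactly as they arrived; no schema is applied on write.
RAW_EVENTS = [
    '{"user": "alice", "text": "loving #hadoop", "lang": "en"}',
    '{"user": "bob", "text": "storm is fast", "lang": "en"}',
    '{"user": "carol", "text": "hola #kafka", "lang": "es"}',
]


def read_with_schema(raw_lines, key_field, value_field):
    """Interpret raw JSON lines at read time as (key, value) pairs.

    The caller, not the storage layer, decides which fields act as key and value.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield record[key_field], record[value_field]


# Two analysts read the same stored data with two different "schemas".
by_user = list(read_with_schema(RAW_EVENTS, "user", "text"))
by_lang = list(read_with_schema(RAW_EVENTS, "lang", "user"))
```

The stored data never changes; only the interpretation applied at read time does.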
Enterprise Relevance
• Missed Opportunities
– Channels
– Data that is analyzed
• Constraint was high cost
– Storage
– Processing
• Future-proof your business
– Schema on Read
– Access pattern not as relevant
– Not just future-proofing your architecture
Hadoop Ecosystem
Source: Apache Hadoop Documentation
Hadoop 2 with YARN
Source: Hadoop In Practice by Alex Holmes
Big Data Journey
➢ Real time Insight from all channels
➢ IT is key differentiator for your business
➢ Perfect alignment of Business and IT
➢ Ad Hoc Data Exploration
➢ Batch, Interactive, Real time use cases
➢ Predictive Analytics, Machine Learning
➢ Consolidated Analytics
➢ ETL
➢ Time Constraints
➢ Security standards defined
➢ Governance Standards Defined
➢ Integrated with the Enterprise
➢ Evaluate Business Benefits
➢ Understand Ecosystem
➢ Identify Platform
Stages of the journey:
• Aware of Benefits - Scout for Opportunities
• Execute - Pilot project
• Expand - Multiple Use cases
• Managed - Governance Model
• Optimized - Core competency
[Figure: stages plotted as Journey Over Time against Business Value; effects range from GOOD to GREAT]
Real time Stream Processing
Architecture with Hadoop
Flume Architecture
• Distributed system for collecting and aggregating data from multiple sources into a centralized data store
• An agent is a JVM process that hosts the Flume components (source, channel, sink)
• A channel stores each event until it is picked up by a sink
• Different types of Flume sources are available
• Source and sink are decoupled
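
The decoupling of source and sink through a channel can be sketched in a few lines of plain Python. This is only an in-memory model of the idea, not Flume's API: `MemoryChannel`, `run_source` and `run_sink` are hypothetical names chosen for the illustration.

```python
from collections import deque


class MemoryChannel:
    """Buffers events between a source and a sink (stand-in for a Flume channel)."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        # Returns None when the channel is empty.
        return self._events.popleft() if self._events else None

    def __len__(self):
        return len(self._events)


def run_source(lines, channel):
    """Source: turns incoming lines into events and puts them on the channel."""
    for line in lines:
        channel.put({"body": line})


def run_sink(channel, store):
    """Sink: drains the channel into a central store; returns how many events it delivered."""
    delivered = 0
    while True:
        event = channel.take()
        if event is None:
            break
        store.append(event["body"])
        delivered += 1
    return delivered


channel = MemoryChannel()
central_store = []
run_source(["log line 1", "log line 2"], channel)  # the sink has not run yet
buffered = len(channel)                            # events wait in the channel
delivered = run_sink(channel, central_store)
```

Because the source only talks to the channel, it can keep accepting events while the sink is slow or temporarily stopped.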
Consolidation Architecture
Multiplexing Architecture
Kafka Introduction
• Messaging System which is distributed, partitioned and replicated
• Kafka brokers run as a cluster
• Producers and Consumers can be written in any language
Topic
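• Each partition is an ordered, immutable sequence of messages; each message is identified by a sequential offset
• Retains messages for a configured period of time
• The “offset” (a consumer's position in the partition) is controlled by the consumer
• Each partition is replicated and has one “leader” and zero or more “followers”.
Reads and writes go only through the leader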
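
A minimal sketch of a partition as an append-only log with consumer-controlled offsets follows. It models the concept only and does not use a Kafka client; the `Partition` and `Consumer` classes are hypothetical.

```python
class Partition:
    """Append-only log; a message's offset is its position in the log."""

    def __init__(self):
        self._log = []

    def append(self, message):
        self._log.append(message)
        return len(self._log) - 1  # offset assigned to the new message

    def read(self, offset, max_messages=10):
        return self._log[offset:offset + max_messages]


class Consumer:
    """Tracks its own offset; the partition keeps no per-consumer state."""

    def __init__(self, partition, offset=0):
        self.partition = partition
        self.offset = offset

    def poll(self, max_messages=10):
        batch = self.partition.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        # Rewinding is just a matter of resetting the consumer's own offset.
        self.offset = offset


partition = Partition()
for msg in ["m0", "m1", "m2"]:
    partition.append(msg)

consumer = Consumer(partition)
first = consumer.poll(max_messages=2)  # reads m0 and m1
consumer.seek(0)
replayed = consumer.poll()             # reads all three messages again
```

Since messages are retained rather than deleted on read, a consumer can replay history simply by seeking to an earlier offset.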
Producers and Consumers
• The producer controls which partition each message goes to
• Supports both Queuing and Pub/Sub
– Abstraction called Consumer group
• Ordering within Partition
– A subscriber sees messages in order only if it is the sole subscriber
(within its consumer group) reading that partition
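
The two ideas above, producer-side partition choice and consumer groups, can be sketched as follows. The CRC32 hash and the round-robin assignment are illustrative assumptions, not Kafka's actual partitioner or assignment strategy, and all names are hypothetical.

```python
import zlib


def choose_partition(key, num_partitions):
    """Producer-side partitioner: the same key always maps to the same partition.

    CRC32 is used only because it is deterministic across runs;
    it is an illustrative choice, not Kafka's default partitioner.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, consumers):
    """Give each partition to exactly one consumer in the group (round-robin)."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


# Queue semantics: one group, partitions are split among its members.
queue_style = assign_partitions(4, ["worker-a", "worker-b"])

# Pub/sub semantics: each group independently receives every partition.
pubsub_style = {
    "analytics-group": assign_partitions(4, ["analytics-1"]),
    "audit-group": assign_partitions(4, ["audit-1"]),
}
```

Because each partition has a single reader within a group, per-partition ordering is preserved for that group while the work is still spread across its members.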
Storm Introduction
• Distributed real time computational system
–Process unbounded streams of data
–Can use multiple programming languages
–Scalable, fault-tolerant and guarantees that data will be processed
• Use Cases
–Real time analytics, online machine learning
–Continuous Computation
–Distributed RPC
–ETL
• Concepts
–Topology
–Spouts
–Bolts
Concepts
• Storm Cluster
– Master node (Nimbus)
• Distributes code
• Assigns tasks to machines
• Monitors for failures
– Worker nodes (Supervisor)
• Starts/stops worker processes
• Each worker process executes a subset of a topology
– ZooKeeper
• Coordinates between Nimbus and Supervisors
• Nimbus and Supervisors are completely stateless
• State is maintained in ZooKeeper or on local disk
Details
• Stream
– Unbounded sequence of tuples
• Spout (you write the logic)
– Source of a stream. Emits tuples
• Bolt (you write the logic)
– Processes streams and emits tuples
• Topology
– DAG of spouts and bolts
– Submit a topology to a Storm cluster
– Each node runs in parallel, and the degree of parallelism is configurable
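
To make the spout, bolt and topology vocabulary concrete, here is a single-process sketch using Python generators. It does not use Storm's API and runs sequentially rather than in parallel; the function names and sample sentences are hypothetical.

```python
def sentence_spout():
    """Spout: source of the stream. A real spout is unbounded; this one is finite for the demo."""
    for sentence in ["storm processes streams", "kafka feeds storm"]:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count per word and emits (word, count) after each update."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


def run_topology():
    """Wire the DAG: sentence_spout -> split_bolt -> count_bolt."""
    return list(count_bolt(split_bolt(sentence_spout())))


results = run_topology()
final_counts = dict(results)  # later tuples overwrite earlier ones, leaving the last count per word
```

In a real cluster each of these nodes would run as several parallel tasks, which is where stream groupings come in.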
Stream groupings
• Tells a topology how to send tuples between two components
• Because each component runs as parallel tasks, the grouping determines
which task receives each tuple
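
Two common groupings, shuffle and fields, can be sketched as routing functions that map a tuple to a task index. This is an illustration of the idea only: Storm's shuffle grouping is randomized, while round-robin is used here so the output is deterministic, and the CRC32 hash is an assumption rather than Storm's implementation.

```python
import itertools
import zlib


def shuffle_grouping(num_tasks):
    """Return a router that spreads tuples evenly across tasks (round-robin stand-in)."""
    counter = itertools.count()
    return lambda tup: next(counter) % num_tasks


def fields_grouping(num_tasks, field_index):
    """Return a router that sends tuples with the same field value to the same task."""
    def route(tup):
        value = str(tup[field_index]).encode("utf-8")
        return zlib.crc32(value) % num_tasks
    return route


tuples = [("storm",), ("kafka",), ("storm",), ("flume",), ("storm",)]

shuffle = shuffle_grouping(num_tasks=2)
shuffle_targets = [shuffle(t) for t in tuples]  # alternates between tasks 0 and 1

by_word = fields_grouping(num_tasks=2, field_index=0)
storm_targets = {by_word(t) for t in tuples if t[0] == "storm"}  # always a single task
```

A fields grouping matters for stateful bolts such as a word counter: every occurrence of a given word must reach the same task for its count to be correct.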
Why Use Twitter as Data Source
Demo - Twitter TopN Trending Topic
• Method 1 — Flume with interceptor
• Method 2 — Storm with custom Twitter
Spout
• Method 3 — Flume + Kafka + Storm
Demo - Twitter TopN Trending Topic
• Use Flume Twitter Source to ingest data and
publish event to Kafka topic
• Use Kafka as messaging backbone
• Use Storm as a real-time event processing
system to calculate the TopN trending topics
• Use Redis to store the TopN Result
• Use Node.js/JQuery for visualization
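
The core TopN calculation can be sketched without any of the infrastructure. The deck's own ParseTweetBolt code is not reproduced in the text, so the following is an assumed, simplified equivalent: it extracts hashtags and counts them in memory, where the demo would do this inside Storm bolts and write the result to Redis. The sample tweets and function names are hypothetical.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def parse_tweet(text):
    """Extract lower-cased hashtags from a tweet's text."""
    return [tag.lower() for tag in HASHTAG.findall(text)]


def top_n(tweets, n):
    """Count hashtags across tweets and return the n most frequent as (tag, count) pairs."""
    counts = Counter()
    for text in tweets:
        counts.update(parse_tweet(text))
    return counts.most_common(n)


sample_tweets = [
    "Trying out #Storm with #Kafka",
    "#storm topologies are fun",
    "Ingesting tweets with #Flume and #kafka",
    "More #storm",
]
trending = top_n(sample_tweets, 2)
```

In a streaming setting the counts would normally be kept over a sliding time window so that old topics age out; this sketch counts over the whole input for simplicity.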
Flow Chart
Demo: Start Redis Server
Demo: Start Node.js server
Demo: Start Storm
Demo: Start Flume Agent
Demo: Storm Console Output
Demo: Trending Result
Flume Agent — Source
Flume Agent — Channel
Flume Agent — Sink
Storm Topology Design
Submit Topology to Storm
Production Cluster
Submit Topology to Test Cluster
ParseTweetBolt Code
Questions?


Vijay Mandava: vijay@insemble.com
Lan Jiang: lan@insemble.com / @Lan_Jiang


