Big Data and Hadoop Training
Session 5
Big Data - Pipeline
Big Data Pipeline
Lambda Architecture - Streaming(Real-Time) Layer
with
Apache Kafka
Apache Hadoop
Apache Spark
Apache Cassandra
on Amazon Web Services Cloud Platform
3 EC2 instances for the Kafka cluster
Repeat these commands on all 3 EC2 instances of the Kafka cluster
cat /etc/*-release
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
java -version
mkdir kafka
cd kafka
wget http://download.nextag.com/apache/kafka/0.10.0.0/kafka_2.11-0.10.0.0.tgz
tar -zxvf kafka_2.11-0.10.0.0.tgz
cd kafka_2.11-0.10.0.0
ZooKeeper ==> 172.31.48.208 / 52.91.1.93
Kafka-datanode1 ==> 172.31.63.203 / 54.173.215.211
Kafka-datanode2 ==> 172.31.9.25 / 54.226.29.194
Kafka-datanode1 (set the following properties in config/server.properties)
ubuntu@ip-172-31-63-203:~/kafka/kafka_2.11-0.10.0.0$ vi config/server.properties
broker.id=1
listeners=PLAINTEXT://172.31.63.203:9092
advertised.listeners=PLAINTEXT://54.173.215.211:9092
zookeeper.connect=52.91.1.93:2181
Kafka-datanode2 (set the following properties in config/server.properties)
ubuntu@ip-172-31-9-25:~/kafka/kafka_2.11-0.10.0.0$ vi config/server.properties
broker.id=2
listeners=PLAINTEXT://172.31.9.25:9092
advertised.listeners=PLAINTEXT://54.226.29.194:9092
zookeeper.connect=52.91.1.93:2181
Modify config/server.properties on
Kafka-datanode1 and Kafka-datanode2
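The two brokers differ only in broker.id and in the private and public addresses used for listeners and advertised.listeners. As a rough sketch (not part of the original deck), the per-broker overrides could be generated from the host table instead of typed by hand. The `Broker` class and the `render_overrides` helper are hypothetical names introduced for illustration; the IP addresses and property keys are the ones shown in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Broker:
    """One Kafka broker from the host table (hypothetical helper type)."""
    broker_id: int
    private_ip: str  # address the broker binds to inside the VPC
    public_ip: str   # address advertised to clients outside the VPC


# Values copied from the slides.
ZOOKEEPER = "52.91.1.93:2181"
BROKERS = [
    Broker(1, "172.31.63.203", "54.173.215.211"),  # Kafka-datanode1
    Broker(2, "172.31.9.25", "54.226.29.194"),     # Kafka-datanode2
]


def render_overrides(broker, zookeeper=ZOOKEEPER, port=9092):
    """Return the four server.properties lines that differ per broker."""
    return "\n".join([
        f"broker.id={broker.broker_id}",
        f"listeners=PLAINTEXT://{broker.private_ip}:{port}",
        f"advertised.listeners=PLAINTEXT://{broker.public_ip}:{port}",
        f"zookeeper.connect={zookeeper}",
    ])


if __name__ == "__main__":
    for broker in BROKERS:
        print(f"# Kafka-datanode{broker.broker_id}")
        print(render_overrides(broker))
        print()
```

The output still has to be merged into each instance's config/server.properties; the sketch only keeps the broker IDs and address pairs consistent with the host table.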
Launch ZooKeeper / Kafka-datanode1 / Kafka-datanode2
On small instances, limit the JVM heap in each shell before starting ZooKeeper or a Kafka server
export KAFKA_HEAP_OPTS="-Xmx256M -Xms256M"
1) Start ZooKeeper (on the ZooKeeper instance)
bin/zookeeper-server-start.sh config/zookeeper.properties
2) Start server on Kafka-datanode1
bin/kafka-server-start.sh config/server.properties
3) Start server on Kafka-datanode2
bin/kafka-server-start.sh config/server.properties
4) Create Topic & Start consumer
bin/kafka-topics.sh --zookeeper 52.91.1.93:2181 --create --topic data --partitions 1 --replication-factor 2
bin/kafka-console-consumer.sh --zookeeper 52.91.1.93:2181 --topic data --from-beginning
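The topic is created with a replication factor of 2, which works here only because two brokers are running; Kafka rejects a replication factor larger than the number of available brokers. A minimal sketch of that check, assuming a hypothetical helper `create_topic_command` that builds the same kafka-topics.sh command line used above (it does not run it):

```python
import shlex


def create_topic_command(zookeeper, topic, partitions, replication_factor,
                         broker_count):
    """Build the kafka-topics.sh --create argument list after basic checks.

    broker_count is the number of brokers expected to be running; it is an
    input to this sketch, not something queried from the cluster.
    """
    if partitions < 1 or replication_factor < 1:
        raise ValueError("partitions and replication factor must be >= 1")
    if replication_factor > broker_count:
        raise ValueError(
            f"replication factor {replication_factor} exceeds "
            f"{broker_count} available broker(s)"
        )
    return [
        "bin/kafka-topics.sh",
        "--zookeeper", zookeeper,
        "--create",
        "--topic", topic,
        "--partitions", str(partitions),
        "--replication-factor", str(replication_factor),
    ]


if __name__ == "__main__":
    # Values from the slides: topic "data", 1 partition, 2 replicas, 2 brokers.
    args = create_topic_command("52.91.1.93:2181", "data", 1, 2, broker_count=2)
    print(" ".join(shlex.quote(arg) for arg in args))
```

With both Kafka-datanode1 and Kafka-datanode2 up, the check passes; if only one broker had started, the helper would raise before the command was ever issued.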
Launch Kafka Cluster
(ZooKeeper / Kafka-datanode1 / Kafka-datanode2)
Execute Python / Kafka Spark Job
Sample data (CSV file) that the Java Kafka producer
will send to the Kafka server
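The deck uses a Java producer, and its source is not included in the slides. Purely as an illustration of the idea, the Python sketch below turns CSV lines into one message per row for the "data" topic. The sample rows, the header handling, and the `send` callable are all assumptions made for this sketch; in a real job `send` would be replaced by a Kafka client's send call.

```python
def csv_lines_to_messages(lines, skip_header=True):
    """Convert CSV lines to UTF-8 message payloads, one per non-blank row."""
    cleaned = [line.strip() for line in lines if line.strip()]
    if skip_header:
        cleaned = cleaned[1:]  # assumption: the first row is a header
    return [line.encode("utf-8") for line in cleaned]


def publish(messages, send, topic="data"):
    """Pass every message to send(topic, value); return the number sent."""
    for value in messages:
        send(topic, value)
    return len(messages)


if __name__ == "__main__":
    # Hypothetical sample rows; the real CSV from the slides is not shown.
    sample = ["id,value", "1,10", "", "2,20"]

    def print_send(topic, value):
        # Stand-in for a real producer: just print what would be sent.
        print(topic, value.decode("utf-8"))

    publish(csv_lines_to_messages(sample), print_send)
```

Keeping the send step injectable lets the row handling be checked without a running cluster.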
Python Spark Job Processing Data from AWS Kafka Cluster
Python Spark Streaming Application
Thank You
hkbhadraa@gmail.com
