messaging.pptx
Pilot Kafka Service
Manuel Martín Márquez
Kafka
• Kafka is a distributed streaming platform
• Highly scalable (partitioning)
• Fault tolerant (replication)
• Allows a high level of parallelism and decoupling between data producers and data consumers
• De facto standard for storing, accessing and processing data streams in near real time
• Critical component of most Big Data platforms, and therefore of the Hadoop ecosystem
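To make the partitioning and replication bullets concrete, here is a minimal sketch in plain Python. It is not Kafka code: the partition count, broker ids and replication factor are assumed values, CRC32 is a stand-in hash (Kafka's own default partitioner uses a different hash function), and round-robin replica placement is a simplification.

```python
import zlib

NUM_PARTITIONS = 6      # assumed partition count for an illustrative topic
BROKERS = [0, 1, 2, 3]  # assumed broker ids
REPLICATION_FACTOR = 3  # assumed replication factor


def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to a partition; the same key always maps to the same partition."""
    return zlib.crc32(key) % num_partitions


def replicas_for(partition: int, brokers=BROKERS, rf: int = REPLICATION_FACTOR) -> list:
    """Pick rf distinct brokers for a partition, round-robin from the partition index."""
    return [brokers[(partition + i) % len(brokers)] for i in range(rf)]


if __name__ == "__main__":
    for key in (b"host-a", b"host-b", b"host-c"):
        p = partition_for(key)
        print(key.decode(), "-> partition", p, "on brokers", replicas_for(p))
```

Different keys spread over partitions, which is what lets producers and consumers work in parallel, while each partition lives on several brokers so that losing one broker does not lose the data.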
Kafka Basic Concepts
Source: Hortonworks
Broker: a Kafka node in the cluster
Topic: a category of record stream
- Multiple writers and readers
- Partitioned
- Replicated
Consumer: pulls messages off a Kafka topic
Producer: pushes messages into a Kafka topic
Data Retention:
- Based on time or size
Zookeeper: stores Kafka metadata
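The concepts above (producer push, consumer pull by offset, retention by time or size) can be sketched with a toy in-memory log. This is an illustration only: the class name and its parameters are hypothetical, it models a single partition, and real Kafka applies retention to whole log segments rather than to individual records.

```python
from collections import deque


class ToyTopicLog:
    """In-memory stand-in for a single topic partition (illustrative only)."""

    def __init__(self, retention_seconds: float, retention_bytes: int):
        self.retention_seconds = retention_seconds
        self.retention_bytes = retention_bytes
        self.records = deque()  # entries are (offset, timestamp, payload)
        self.next_offset = 0
        self.size_bytes = 0

    def append(self, payload: bytes, now: float) -> int:
        """Producer side: push a record and return its offset."""
        offset = self.next_offset
        self.records.append((offset, now, payload))
        self.next_offset += 1
        self.size_bytes += len(payload)
        self._enforce_retention(now)
        return offset

    def _enforce_retention(self, now: float) -> None:
        """Drop the oldest records while they are too old or the log is over its size limit."""
        while self.records and (
            now - self.records[0][1] > self.retention_seconds
            or self.size_bytes > self.retention_bytes
        ):
            _, _, old = self.records.popleft()
            self.size_bytes -= len(old)

    def poll(self, from_offset: int, max_records: int = 10) -> list:
        """Consumer side: pull (offset, payload) pairs at or after from_offset."""
        return [(o, p) for o, _, p in self.records if o >= from_offset][:max_records]
```

Because the consumer asks for records by offset, several independent readers can consume the same topic at their own pace, and records stay available until retention removes them.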
Kafka entry points
• Custom implementation of producer and consumer using the Kafka client API
- Java, Scala, C++, Python
• Kafka Connectors
- LogFile, HDFS, JDBC, ElasticSearch…
• Logstash
- Source and sink
• Apache Flume can use Kafka out of the box as
- Source, Channel, Sink
• Other ingestion or processing tools support Kafka
- Apache Spark, LinkedIn Gobblin, Apache Storm…
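As an illustration of the first entry point, a custom producer and consumer in Python might look like the sketch below. It assumes the third-party kafka-python package and a reachable broker. The function names and the JSON encoding are illustrative choices rather than anything prescribed by the slides, and security settings such as Kerberos are left out.

```python
import json


def encode_event(event: dict) -> bytes:
    """Serialize an event as UTF-8 JSON with a stable key order."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


def decode_event(raw: bytes) -> dict:
    """Inverse of encode_event."""
    return json.loads(raw.decode("utf-8"))


def produce(bootstrap_servers: str, topic: str, events: list) -> None:
    """Push events into a topic (needs kafka-python and a reachable broker)."""
    from kafka import KafkaProducer  # imported lazily so the helpers work without it

    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=encode_event,
    )
    for event in events:
        producer.send(topic, value=event)
    producer.flush()  # block until buffered records have been sent
    producer.close()


def consume(bootstrap_servers: str, topic: str, group_id: str):
    """Pull events off a topic and yield them as dicts."""
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap_servers,
        group_id=group_id,
        auto_offset_reset="earliest",
        value_deserializer=decode_event,
    )
    for message in consumer:
        yield message.value
```

A call such as `produce("broker:9092", "metrics", [{"host": "n1", "value": 3}])` would use a hypothetical broker address and topic name; both must be replaced with real values.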
Kafka for Data Integration and Processing
[Diagram: several stream sources → Kafka as central data buffer → (flush periodically) HDFS, big files → batch processing; (flush immediately) events, indexed data → fast data access; real-time stream processing]
• No data is lost during downtime (scheduled or unscheduled) of a Hadoop cluster
• Kafka buffers protect recent data from being lost (before a daily HDFS snapshot can back it up)
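The buffering idea on this slide can be sketched as follows. This is a toy model, not Kafka: the class and its parameters are hypothetical, a batch-size threshold stands in for the periodic flush, and a `ConnectionError` stands in for Hadoop being unavailable.

```python
class CentralBuffer:
    """Toy stand-in for Kafka as a central buffer between sources and sinks."""

    def __init__(self, index_sink, batch_sink, batch_size: int):
        self.index_sink = index_sink  # called once per event ("flush immediately")
        self.batch_sink = batch_sink  # called with a list of events ("flush periodically")
        self.batch_size = batch_size
        self.pending = []             # events not yet accepted by the batch sink

    def publish(self, event) -> None:
        """Accept an event from a stream source and fan it out."""
        self.pending.append(event)
        self.index_sink(event)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> bool:
        """Try to write all pending events as one batch; keep them if the sink is down."""
        if not self.pending:
            return True
        try:
            self.batch_sink(list(self.pending))
        except ConnectionError:
            return False              # sink unavailable: nothing is dropped
        self.pending.clear()
        return True
```

While the batch sink is down, events keep accumulating in `pending` and are written out in one larger batch once it returns, which is the "no data lost during downtime" property in miniature.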
Kafka at CERN – IT monitoring
[Architecture diagram. Data Sources: FTS, Rucio, XRootD, Jobs, …, Lemon, syslog, app log, DB, HTTP feed. Transport: Flume agents (AMQ, DB, HTTP, Log GW, Metric GW; logs, Lemon metrics) with a Flume Kafka sink. Kafka cluster (buffering). Processing: Data enrichment, Data aggregation, Batch Processing. Flume sinks. Storage & Search: HDFS, ElasticSearch, …, Others (InfluxDB). Data Access: CLI, API.]
Kafka at CERN – IT monitoring (Requirements)
• Throughput and retention policy
- Currently 200 GB/day (forecast 500 GB/day)
- Retention policy: 12 hours in QA and 24 hours in production (the longest retention policy, to cover potential problems over weekends)
- ~4000 messages, with peaks of up to 10k
- ~50 topics
• Security (Kerberos)
- Flume can potentially be upgraded to 1.7 early in 2017 (work already in progress)
• Administration Capabilities
- Administrative operations: topic configuration, rebalancing, user management, start/stop cluster
- Possibility to increase retention policy, replication factor
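A rough sizing of the buffer follows from the throughput and retention figures above. The sketch below is back-of-the-envelope arithmetic only: the replication factor of 3 is an assumption (the slides do not state one), and compression, index overhead and free-space headroom are ignored.

```python
def retained_gb(daily_gb: float, retention_hours: float, replication_factor: int) -> float:
    """Approximate on-disk volume: ingest rate x retention window x number of replicas."""
    return daily_gb * retention_hours * replication_factor / 24.0


if __name__ == "__main__":
    ASSUMED_RF = 3  # assumption: not given in the slides
    for label, daily_gb in (("current", 200), ("forecast", 500)):
        for env, hours in (("qa", 12), ("prod", 24)):
            print(label, env, retained_gb(daily_gb, hours, ASSUMED_RF), "GB")
```

Raising either the retention policy or the replication factor, as the last requirement anticipates, scales the needed storage linearly.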
Kafka at CERN – CALS
Kafka at CERN – CALS (Requirements)
• Throughput and retention policy
- Currently 30 GB/hour, including only the logging processes
- Plan to incrementally include all the systems, which could potentially mean several TB
- Compression with Snappy will be evaluated to determine its performance
- Retention policy: 24 hours, the time needed to buffer the data, compact it and send it to Hadoop
• Security (Kerberos)
• Infrastructure
- OpenStack, under several conditions:
- TN needs to be supported, for several reasons
- High availability of the CALS service on top of the private cloud (no CALS, no beam in the LHC)
• Administration Capabilities
- Administrative operations: topic configuration, rebalancing, user management, start/stop cluster
- Possibility to increase retention policy, replication factor
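One way to approach the compression evaluation is to measure the compression ratio on representative payloads and see what it does to the 24-hour buffer. The sketch below uses `zlib` from the standard library purely as a stand-in, because Snappy bindings are a third-party package. The sample payload is invented, so the ratio it prints says nothing about real CALS data.

```python
import zlib


def compression_ratio(payload: bytes, level: int = 1) -> float:
    """Return compressed size divided by original size (lower is better)."""
    compressed = zlib.compress(payload, level)
    assert zlib.decompress(compressed) == payload  # round-trip sanity check
    return len(compressed) / len(payload)


def buffered_gb(gb_per_hour: float, retention_hours: float, ratio: float = 1.0) -> float:
    """Volume held in the buffer for one replica, after compression."""
    return gb_per_hour * retention_hours * ratio


if __name__ == "__main__":
    # Invented sample: repetitive log-like lines, likely to compress better than real data.
    sample = b"".join(
        b"2017-01-01T00:00:%02d device=%d value=42\n" % (i % 60, i) for i in range(1000)
    )
    ratio = compression_ratio(sample)
    print("ratio:", round(ratio, 3))
    print("uncompressed 24 h buffer:", buffered_gb(30, 24), "GB")
    print("compressed 24 h buffer:", round(buffered_gb(30, 24, ratio), 1), "GB")
```

At the current 30 GB/hour and 24 hours of retention, the uncompressed buffer is 720 GB per replica; a real evaluation would also time compression and decompression, since throughput matters as much as size.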
Kafka at CERN
• Security Team
- Already using Kafka for pattern matching
- Data integration
• LHC Postmortem
- Potentially ingested by CALS
• Industrial Control Systems
- WinCCOA Data
Pilot Kafka Service
• Scope
- Study the current Kafka use cases together with the different teams involved
- Collect requirements
- Understand the feasibility and added value of Kafka as a central service
Pilot Kafka Service
• Collect requirements (5 Major Use Cases):
- CALS, IT-Monitoring, Security Team, Industrial Control, Post-mortem
- Throughput, Retention Policy, Security, Infrastructure, Administration Capabilities
• Agreement to test the service from the first phase
- Ensure the service copes with their requirements
• More details: https://twiki.cern.ch/twiki/bin/viewauth/DB/CERNonly/KafkaService
Pilot Kafka Service – Current Development
• Pilot implementation: rapid iteration, which will help to understand the service and the use cases
• On-demand Kafka service approach
- Self-service cluster creation, management and expansion
- Allow users to perform administrative tasks that are traditionally carried out by administrators
- Facilitate operating system and engine updates (Kafka, Zookeeper)
- Transparently integrate all the needed services (Security, Storage, Procurement, etc.)
- Support for service continuity in case of hardware failure
Pilot Kafka Service – Current Development
• Configuration and Management REST API
• Security enabled: Kerberos on Kafka and Zookeeper (SSL optional)
• Monitoring Capabilities
• OpenStack on GPN
• Network storage
• Dedicated Kafka and Zookeeper per user
Towards Kafka Production Service
• Service evaluation phase and timeline
Towards Kafka Production Service
• Consolidation to Production
• Web interface to manage clusters (self-service)
• Evolution of the configuration management API
• Functionalities toward the self-service platform
• Integration with OpenStack
• Full monitoring beyond JMX metrics
• Kafka mirroring (high availability)
• Deploy the service in TN (thanks to the service design, this is transparent for us)
• Kafka as close as possible to consumers and producers

Editor's Notes

  • Slide 10, Kafka at CERN – CALS (Requirements): critical service (no data, no beam)