© 2016 DataTorrent
Chaitanya Chebolu
Committer, Apache Apex
Engineer, DataTorrent
Sep 14, 2016
Data Ingestion - Kafka ETL
Agenda
• Introduction to Apache Apex (Architecture, Application, Native Hadoop Integration)
• What is Data Ingestion?
• Use Case: Kafka ETL
• Brief about Kafka
• Kafka ETL App
• Kafka ETL Demo
Apache Apex
• Platform and runtime engine that enables development of scalable and fault-tolerant distributed applications
• Hadoop native (Hadoop >= 2.2)
  ○ No separate service to manage stream processing
  ○ Streaming engine built into the Application Master and containers
• Processes streaming or batch big data
• High throughput and low latency
• Library of commonly needed business logic
• Write any custom business logic in your application
Apex Architecture
An Apex Application is a DAG (Directed Acyclic Graph)
A DAG is composed of vertices (Operators) and edges (Streams).
A Stream is a sequence of data tuples that connects operators at end-points called Ports.
An Operator takes one or more input streams, performs computations, and emits one or more output streams.
● Each operator is either the user's business logic or a built-in operator from our open source library
● An operator may have multiple instances that run in parallel
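
The DAG vocabulary above can be illustrated with a small, library-free Python sketch. This is not the Apex API (Apex applications are written in Java); the `Operator` and `Dag` classes and all method names are hypothetical and exist only to show operators as vertices, streams as edges, and tuples flowing downstream in topological order. Ports and parallel operator instances are not modeled.

```python
from collections import defaultdict, deque
from typing import Any, Callable, Iterable


class Operator:
    """A vertex: applies user logic to one tuple and emits zero or more tuples."""

    def __init__(self, name: str, logic: Callable[[Any], Iterable[Any]]):
        self.name = name
        self.logic = logic


class Dag:
    """Hypothetical container for operators (vertices) and streams (edges)."""

    def __init__(self):
        self.operators = {}
        self.streams = defaultdict(list)  # upstream name -> downstream names

    def add_operator(self, name, logic):
        self.operators[name] = Operator(name, logic)
        return self.operators[name]

    def add_stream(self, source, sink):
        self.streams[source.name].append(sink.name)

    def topological_order(self):
        # Kahn's algorithm: repeatedly take operators with no pending inputs.
        indegree = {name: 0 for name in self.operators}
        for sinks in self.streams.values():
            for sink in sinks:
                indegree[sink] += 1
        ready = deque(name for name, degree in indegree.items() if degree == 0)
        order = []
        while ready:
            name = ready.popleft()
            order.append(name)
            for sink in self.streams[name]:
                indegree[sink] -= 1
                if indegree[sink] == 0:
                    ready.append(sink)
        if len(order) != len(self.operators):
            raise ValueError("graph has a cycle, so it is not a DAG")
        return order

    def run(self, source_name, tuples):
        # Feed tuples to the source operator and push emitted tuples downstream.
        inbox = defaultdict(list)
        inbox[source_name] = list(tuples)
        outputs = {}
        for name in self.topological_order():
            logic = self.operators[name].logic
            emitted = [out for item in inbox[name] for out in logic(item)]
            outputs[name] = emitted
            for sink in self.streams[name]:
                inbox[sink].extend(emitted)
        return outputs


dag = Dag()
reader = dag.add_operator("reader", lambda line: [line.strip()])
splitter = dag.add_operator("splitter", lambda line: line.split())
upper = dag.add_operator("upper", lambda word: [word.upper()])
dag.add_stream(reader, splitter)
dag.add_stream(splitter, upper)

result = dag.run("reader", ["hello apex ", "kafka etl"])
print(result["upper"])  # ['HELLO', 'APEX', 'KAFKA', 'ETL']
```

Unlike this batch-style sketch, a streaming engine keeps operators running continuously and processes tuples as they arrive.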
Apex - Native Hadoop Integration
• YARN is the resource manager
• HDFS is used for storing any persistent state
What is Data Ingestion?
• Data Ingestion
  ○ A process of obtaining, importing, and analyzing data for later use or storage in a database
• Big Data Ingestion
  ○ Discovering the data sources
  ○ Importing the data
  ○ Processing data to produce intermediate data
  ○ Sending data out to durable data stores
Use Case: Kafka ETL
• Consuming data from Kafka
• Processing data to produce intermediate data
• Writing the processed data to HDFS
Brief about Kafka
● Distributed messaging system
● Data partitioning capability
● Fast reads and writes
● Basic Terminology
○ Topic
○ Producer
○ Consumer
○ Broker
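
To make the terminology concrete, here is a minimal in-memory Python sketch of how the four terms relate. It does not use any Kafka client library, and the `Broker`, `Producer`, and `Consumer` classes are hypothetical stand-ins. The CRC32-based partition choice is an assumption made for illustration and is only a placeholder for whatever partitioner a real client applies.

```python
import zlib
from collections import defaultdict


class Broker:
    """Holds topics; each topic is a list of append-only partition logs."""

    def __init__(self):
        self.topics = {}

    def create_topic(self, name, num_partitions):
        self.topics[name] = [[] for _ in range(num_partitions)]

    def append(self, topic, partition, message):
        log = self.topics[topic][partition]
        log.append(message)
        return len(log) - 1  # offset of the message just written

    def read(self, topic, partition, offset):
        return self.topics[topic][partition][offset:]


class Producer:
    """Writes keyed messages; the key decides the partition."""

    def __init__(self, broker):
        self.broker = broker

    def send(self, topic, key, value):
        num_partitions = len(self.broker.topics[topic])
        # Assumption: a deterministic hash of the key, modulo partition count.
        partition = zlib.crc32(key.encode("utf-8")) % num_partitions
        offset = self.broker.append(topic, partition, (key, value))
        return partition, offset


class Consumer:
    """Reads every partition of a topic, remembering the next offset for each."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offsets = defaultdict(int)  # partition -> next offset to read

    def poll(self):
        records = []
        for partition in range(len(self.broker.topics[self.topic])):
            batch = self.broker.read(self.topic, partition, self.offsets[partition])
            self.offsets[partition] += len(batch)
            records.extend(batch)
        return records


broker = Broker()
broker.create_topic("events", num_partitions=3)
producer = Producer(broker)
keys = ["user-1", "user-2", "user-1"]
placements = [producer.send("events", key, n) for n, key in enumerate(keys)]

consumer = Consumer(broker, "events")
first_poll = consumer.poll()   # all three messages
second_poll = consumer.poll()  # nothing new since the last poll
```

Because the partition is derived from the key, both `user-1` messages land in the same partition and are read back in the order they were written, while messages with different keys may be spread across partitions.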
Kafka ETL App
Kafka → Parser → Dedup → Transform → Formatter → HDFS
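
The stage order in this pipeline can be sketched with plain Python generators. This is an illustration under stated assumptions, not the Apex application from the talk: the JSON input format, the `id` and `name` fields, the comma-delimited output, and every function name are hypothetical, and a Python list stands in for both the Kafka source and the HDFS sink.

```python
import json


def parse(raw_messages):
    """Parser: decode JSON strings into dicts, skipping malformed input."""
    for raw in raw_messages:
        try:
            yield json.loads(raw)
        except json.JSONDecodeError:
            continue


def dedup(records, key="id"):
    """Dedup: pass through only the first record seen for each key value."""
    seen = set()
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            yield record


def transform(records):
    """Transform: rewrite a field (here, upper-case the name)."""
    for record in records:
        yield {**record, "name": record["name"].upper()}


def format_csv(records, fields=("id", "name")):
    """Formatter: render each record as one comma-delimited line."""
    for record in records:
        yield ",".join(str(record[field]) for field in fields)


def run_pipeline(raw_messages, sink):
    """Chain the stages and append each formatted line to the sink."""
    for line in format_csv(transform(dedup(parse(raw_messages)))):
        sink.append(line)
    return sink


messages = [
    '{"id": 1, "name": "ada"}',
    "not json",
    '{"id": 1, "name": "ada"}',
    '{"id": 2, "name": "lin"}',
]
output = run_pipeline(messages, [])
print(output)  # ['1,ADA', '2,LIN']
```

The `seen` set here grows without bound, which is acceptable for a short list but not for an endless stream; a streaming dedup stage would need bounded or expiring state.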
Kafka ETL Demo
Demo
Resources
• http://apex.apache.org/
• Learn more: http://apex.apache.org/docs.html
• Subscribe: http://apex.apache.org/community.html
• Download: http://apex.apache.org/downloads.html
• Follow @ApacheApex: https://twitter.com/apacheapex
• Meetups: http://www.meetup.com/pro/apacheapex/
• More examples: https://github.com/DataTorrent/examples
• Slideshare: http://www.slideshare.net/ApacheApex/presentations
• https://www.youtube.com/results?search_query=apache+apex
• Free Enterprise License for Startups: https://www.datatorrent.com/product/startup-accelerator/

Kafka to Hadoop Ingest with Parsing, Dedup and other Big Data Transformations
