Version 1.0
StreamSets for Data Engineering
In Data Engineer's Lunch #57, we will discuss StreamSets and how
it can be used for data engineering.
Arpan Patel
Engineer @ Anant
StreamSets
● Data Integration Platform Built for
DataOps
● Build streaming, batch, CDC, ETL, and
ML pipelines from a single UI and
deploy data and workloads to any cloud
● DataOps Platform
○ Free tier (no credit card required) with
Data Collector Engine, Transformer
Engine, Control Hub
○ Self Managed Deployments via
Docker
○ 2 active jobs, 2 active users, 10
published pipelines
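
The free-tier limits listed above can be written down as a small check. This is only an illustrative sketch: `FreeTierLimits` and `exceeded` are hypothetical names made up for this example and are not part of any StreamSets API. The numbers are the ones quoted on this slide and may have changed since.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FreeTierLimits:
    # Values taken from the slide; hypothetical container, not a StreamSets type.
    active_jobs: int = 2
    active_users: int = 2
    published_pipelines: int = 10


def exceeded(usage: Dict[str, int], limits: FreeTierLimits = FreeTierLimits()) -> List[str]:
    """Return the names of the limits that the given usage goes over."""
    return [name for name, cap in asdict(limits).items() if usage.get(name, 0) > cap]


# Example: three active jobs is over the limit, ten published pipelines is not.
over = exceeded({"active_jobs": 3, "active_users": 1, "published_pipelines": 10})
```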
StreamSets
● Control Hub
● Data Collector Engine
○ Open-source
● Transformer Engine
○ Can natively execute on Apache Spark,
Snowflake, AWS EMR, Google Cloud
Dataproc, and Databricks platforms
● Pre-built connectors and native integrations
○ Applications, Big Data, SQL/NoSQL DBs,
Storage/Warehouses, Streaming
○ Tons of Sources + Destinations
● StreamSets Academy + Tutorials
Demo
● Spin up Data Collector Deployment from Control Hub + Docker
● Create Sample Pipeline and Preview Data
● Schedule / Run Pipeline Job and View Metrics
● Spin up Transformer Engine Deployment from Control Hub + Docker
● Create Sample ETL Pipeline and Preview Data
● Submit ETL Pipeline to Local Spark
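
For readers who have not seen the demo, the "create pipeline, preview data, then run" flow can be approximated in plain Python. This is a conceptual sketch only: it does not use StreamSets, Control Hub, or Spark, and every name in it (`Pipeline`, `sample_origin`, `uppercase_name`, `sink`) is hypothetical. It assumes a pipeline is an origin, a chain of processors, and a destination, and that a preview pushes a few records through the processors without writing anything.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


@dataclass
class Pipeline:
    origin: Callable[[], Iterable[Record]]          # produces records
    processors: List[Callable[[Record], Record]]    # applied in order
    destination: Callable[[Record], None]           # receives processed records

    def _stream(self) -> Iterator[Record]:
        # Lazily push each origin record through every processor.
        for record in self.origin():
            for process in self.processors:
                record = process(record)
            yield record

    def preview(self, batch_size: int = 2) -> List[Record]:
        """Process the first few records and return them without writing."""
        return list(islice(self._stream(), batch_size))

    def run(self) -> int:
        """Process all records, write them out, and return the record count."""
        count = 0
        for record in self._stream():
            self.destination(record)
            count += 1
        return count


def sample_origin() -> List[Record]:
    return [
        {"id": 1, "name": "ada"},
        {"id": 2, "name": "grace"},
        {"id": 3, "name": "edsger"},
    ]


def uppercase_name(record: Record) -> Record:
    # Return a new record so the origin data is left untouched.
    return {**record, "name": str(record["name"]).upper()}


sink: List[Record] = []
pipeline = Pipeline(sample_origin, [uppercase_name], sink.append)
```

Calling `pipeline.preview(2)` returns the first two transformed records and leaves `sink` empty, while `pipeline.run()` writes every record to `sink` and returns the count, which loosely mirrors the preview and run steps in the demo list.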
Strategy: Scalable Fast Data
Architecture: Cassandra, Spark, Kafka
Engineering: Node, Python, JVM, CLR
Operations: Cloud, Container
Rescue: Downtime!! I need help.
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037


Data Engineer's Lunch #57: StreamSets for Data Engineering
