Streaming Time Series Data with Apache Kafka and MongoDB
Kenny Gorman, Principal Product Manager - Streaming, MongoDB
Elena Cuevas, Manager, Cloud Partner Solutions Engineering, Confluent
Streaming Time Series Data With Kenny Gorman and Elena Cuevas | Current 2022
IoT - data
generators
{
'station_id': 72117,
'station_name': 'Hill Country',
'createdAt': 1664220412603,
'obs':
[ {'wind_avg': 0.9, 'uv': 4.26, 'wind_gust': 2.5 } ],
'timezone': 'America/Chicago',
'elevation': 264.83969,
'longitude': -97.84221,
'latitude': 30.36495
}
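A payload like this usually needs light reshaping before it is stored as time series data. The sketch below shows one possible way in plain JavaScript. It assumes that createdAt is epoch milliseconds and that each entry of obs is one reading; the helper name toMeasurements and the output field names timestamp and station are illustrative and do not come from the talk.

```javascript
// Hypothetical helper: flatten one station payload into measurement documents.
// Assumes `createdAt` is epoch milliseconds and each `obs` entry is one reading.
function toMeasurements(payload) {
  const { obs = [], createdAt, ...stationInfo } = payload;
  return obs.map((reading) => ({
    timestamp: new Date(createdAt), // candidate timeField
    station: stationInfo,           // candidate metaField (rarely changes)
    ...reading,                     // measurement values
  }));
}

const payload = {
  station_id: 72117,
  station_name: "Hill Country",
  createdAt: 1664220412603,
  obs: [{ wind_avg: 0.9, uv: 4.26, wind_gust: 2.5 }],
  timezone: "America/Chicago",
  elevation: 264.83969,
  longitude: -97.84221,
  latitude: 30.36495,
};

const measurements = toMeasurements(payload);
console.log(JSON.stringify(measurements, null, 2));
```

Keeping the station attributes together in one sub-document makes it easy to use them as the series label later, while the readings stay at the top level.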
Powerful Cloud
Platforms unite!
MongoDB Developer Data Platform
Confluent and MongoDB in the Cloud
[Architecture diagram; labels only]
Sources: sensors, digital content, transactions, clients, security, legacy systems of record and sources of truth, and more
MongoDB (real-time online data store): primary and two secondaries; high-volume real-time operational data; analytical; time series; real-time analytical data; data tiering (DL/DWH, archiving); Lucene-based text search; sharding and replica sets
Confluent (high-volume real-time event processing): Confluent Platform and Confluent Cloud; bridge to cloud; bidirectional sync, native connectors; registry, real-time processing; highly available, scalable; Kafka sink and Kafka source (supported connectors, fully managed Atlas connectors); Stream, kSQL, Registry
Consumers: mobile and web apps; personalized marketing; research & analytics; AML / AFM / ...; real-time vertical solutions; BI Connector / real-time analytics; Apache Spark; team collaboration; flexible API and microservices; data facilitation; Realm Mobile Sync; many others
Produce event
Connector Time Series Query
Confluent Kafka
Real-time &
Historical
Data
A sale
A shipment
A trade
A customer
interaction
A new paradigm is required for Data in Motion
Continuously process streams of data in real time
“We need to shift our thinking from everything
at rest, to everything in motion.” —
Real-Time Stream Processing
Rich, front-end
customer experiences
Real-time, software-driven
business operations
Confluent Cloud
Cloud-native data streaming platform built
by the founders of Apache Kafka®
Everywhere
Connect your data in real
time with a platform that
spans from on-prem to
cloud and across clouds
Complete
Go above & beyond Kafka
with all the essential tools
for a complete data
streaming platform
Cloud-Native
Apache Kafka®, fully
managed and
re-architected to harness
the power of the cloud
Stream confidently on the world’s most trusted data streaming platform built by the founders of
Apache Kafka®, with resilience, security, compliance, and privacy built-in by default.
Serverless
● Elastic scaling up &
down from 0 to GBps
● Auto capacity mgmt,
load balancing, and
upgrades
High Availability
● 99.99% SLA
● Multi-region / AZ availability
across cloud providers
● Patches deployed in Confluent
Cloud before Apache Kafka
Infinite Storage
● Store data cost-effectively at any scale
without growing
compute
DevOps Automation
● API-driven and/or
point-and-click ops
● Service portability &
consistency across cloud
providers and on-prem
Network
Flexibility
● Public, VPC, and
Private Link
● Self-managed
option for
air-gapped
environments
Elastic: Instantly scale to meet any demand
Seamlessly provision and deploy fully managed,
elastically scaling clusters with infinite storage that
expand & shrink to cost-effectively support all streaming
use cases
Reliable: Power all your streaming apps &
analytics with resilience
Maintain high availability of your clusters and data
streams with our 99.99% uptime SLA, multi-AZ /
region clusters, and no-touch Kafka patches &
upgrades
Agile: Focus on innovation, not infrastructure
Fully automate management of serverless clusters
through code via Terraform integration and REST APIs,
paying only for what you use when you use it
Cloud-Native
Apache Kafka®, fully managed and
re-architected to harness the power of
the cloud
“Before Confluent, when we had broker outages that
required rebuilds, it could take up to three days of
developer time to resolve. Now, Confluent takes care
of everything for us, so our developers can focus on
building new features and applications.”
Complete
Go above & beyond Kafka with all the
essential tools for a complete data
streaming platform
Connectors & Stream Processing: Connect to
and from any app / system and process your
data streams in-flight
Reduce TCO and architectural complexity with our
portfolio of 120+ pre-built connectors and stream
processing powered by ksqlDB, all available fully
managed and built-in with Confluent Cloud
Stream Designer: Quickly build and deploy
streaming apps & pipelines
Rapidly build, test, and deploy streaming data
pipelines with Stream Designer, extensible with SQL,
while reducing the need to write boilerplate code
Security & Governance: Secure, discover, and
organize your data streams
Build trust and put your data streams to work with
enterprise-grade security and the only Stream
Governance suite for data in motion
“BHG is a fast-moving company, and Confluent is
quickly becoming not only a central highway for
our data with their vast connector portfolio, but a
streaming transformation engine as well for a vast
number of use cases… We are making Confluent the
true backbone of BHG, including leveraging 20+
Confluent connectors across both modern,
cloud-based technologies & legacy systems, to help
integrate our critical apps & data systems together.”
Connectors • Security • Data Governance • Stream Processing • Monitoring • Global Resilience • Stream Designer
Everywhere
Connect your data in real time with a
platform that spans from on-prem to
cloud and across clouds
Run Anywhere: Deploy across any environment
Provision Confluent as a fully managed service on
AWS, Azure, and Google Cloud across 60+ regions w/
Confluent Cloud, or on-premises w/ Confluent
Platform
Unified: Unify data across hybrid and
multi-clouds
Provide consistent, self-service access to real-time data
across all your environments with Cluster Linking and
globally connected clusters that perfectly mirror data
Consistent: Learn one platform for all
environments
Remove the burden of learning new tools for each
environment with a consistent experience spanning
across cloud, on-prem, and hybrid / multicloud
“Our transformation to a cloud-native, agile
company required a large-scale migration from
open source Apache Kafka. With Confluent, we now
support real-time data sharing across all of our
environments, and see a clear path forward for our
hybrid cloud roadmap.”
Using fully managed connectors is the fastest, most
efficient way to break data silos
Accelerated time-to-value • Increased developer productivity • Reduced operational burden
Self-managed connector
● Pre-built but requires manual
installation / config efforts to
set-up and deploy connectors
● Perpetual management and
maintenance of connectors that
leads to ongoing tech debt
● Risk of downtime and business
disruption due to connector /
Connect cluster related issues
Fully managed connector
● Streamlined configurations and
on-demand provisioning of your
connectors
● Eliminates operational overhead
and management complexity
with seamless scaling and load
balancing
● Reduced risk of downtime with
Confluent Cloud’s 99.99% SLA for
all your mission critical use cases
Custom-built connector
● Costly to allocate resources to
design, build, test, and maintain
non-differentiated data
integration components
● Delays time-to-value, taking up
to 3-6+ engineering months to
develop
● Perpetual management and
maintenance increases tech debt
and risk of downtime
Connect IoT data sources
Leverage existing
infrastructure investments
Reduce operational complexity
Avoid the need for third party
MQTT brokers
Ensure IoT data delivery
Compatible with all QoS
levels of the MQTT protocol
[Diagram labels: Devices, Gateways, MQTT Proxy, Broker]
MQTT Proxy (1)
Easily connect with IoT data sources
(1) Support for self-managed components with a CC subscription with Business support tier or higher.
MongoDB
Connector for
Kafka
MongoDB Connector for Apache Kafka
● Enables users to easily integrate MongoDB with Kafka
● Users can configure MongoDB as a source to publish data changes from MongoDB
into Kafka topics for streaming to consuming applications
● Users can configure MongoDB as a sink to easily persist events from Kafka topics
directly to MongoDB collections
● Dead letter queue
● Time series integration
● JMX Integration
● Available from Confluent Hub and Verified Gold
● Fully managed using Confluent Cloud
● Configured via Confluent Cloud or Kafka Connect REST endpoint.
● Certified against Apache Kafka 2.3 and Confluent Platform 5.3 (or later)
MongoDB Connector for Kafka
Sink: Kafka Cluster (topicA, topicB, topicC) → MongoDB Sink Connector → Destination: MongoDB Database
● Receives events from Kafka topic(s)
● Writes documents to DB collection
Source: MongoDB Database → Change Streams → MongoDB Source Connector → Kafka Cluster (topicA, topicB, topicC)
● Receives documents from DB collection
● Writes events to Kafka topic(s)
Sink Connector Specifics
• Reads messages from topic (based on pointer to message in topic)
• Writes message into MongoDB database collection
• Moves pointer to next message based on write to database
[Diagram labels: Kafka topic, connector, database collection]
1: offset to message to read
2: bulk write to db
3: on successful write (of batch), moves offset to next batch
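The three steps on this slide can be sketched as a small loop. This is a simplified model in plain JavaScript, not the connector's code: the topic is an array, the offset is an index into it, and bulkWrite is a stand-in callback supplied by the caller. The name drainTopic is hypothetical.

```javascript
// Simplified model of the sink loop: read a batch at the current offset,
// bulk-write it, and advance the offset only after the write succeeds.
// `bulkWrite` is a stand-in callback, not the connector's real API.
function drainTopic(topic, batchSize, bulkWrite, startOffset = 0) {
  let offset = startOffset;
  while (offset < topic.length) {
    const batch = topic.slice(offset, offset + batchSize); // 1: read from offset
    try {
      bulkWrite(batch);                                    // 2: bulk write to db
    } catch (err) {
      return { offset, error: err.message };               // offset unchanged, batch can be retried
    }
    offset += batch.length;                                // 3: move offset past the batch
  }
  return { offset, error: null };
}

const topic = ["m0", "m1", "m2", "m3", "m4"];
const collection = [];
const ok = drainTopic(topic, 2, (batch) => collection.push(...batch));

// The second batch fails here, so the offset stays at the start of that batch.
let calls = 0;
const failed = drainTopic(topic, 2, () => {
  calls += 1;
  if (calls === 2) throw new Error("write failed");
});
console.log(ok, failed);
```

Because the offset moves only after a successful write, a batch that fails partway may be written again on retry, so this pattern gives at-least-once delivery rather than exactly-once.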
MongoDB Time
Series Collections
Time Series Collection
An optimized column oriented
collection for time-series data
which organizes writes so that
data for the same source is
stored in the same bucket,
alongside other data points from
a similar point in time
Launched with
5.0
Increases developer productivity
Reduces complexity for working with Time Series data
Reduces I/O for read operations
Massive reduction in storage size and index size
Optimized WiredTiger cache usage
Creating a Time Series
Collection
TO CREATE A TIME SERIES COLLECTION, USE THE
timeseries OPTION
Launched with
5.0
db.createCollection("weather", {
timeseries: {
timeField: "timestamp",
metaField: "sensorId",
granularity: "minutes"
},
expireAfterSeconds: 9000
})
The timeField is the only required parameter for a Time Series
collection
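To make the required and optional settings concrete, here is a minimal sketch of a helper that builds the options object passed to db.createCollection(). The helper timeSeriesOptions is hypothetical; it only assembles and checks a plain object and does not talk to a database. The granularity values it accepts are the three that MongoDB 5.0 defines.

```javascript
// Hypothetical helper that builds the options object for db.createCollection().
// Only timeField is mandatory; the other settings are passed through when present.
const GRANULARITIES = ["seconds", "minutes", "hours"];

function timeSeriesOptions({ timeField, metaField, granularity, expireAfterSeconds } = {}) {
  if (typeof timeField !== "string" || timeField.length === 0) {
    throw new Error("timeField is required");
  }
  if (granularity !== undefined && !GRANULARITIES.includes(granularity)) {
    throw new Error(`granularity must be one of ${GRANULARITIES.join(", ")}`);
  }
  const timeseries = { timeField };
  if (metaField !== undefined) timeseries.metaField = metaField;
  if (granularity !== undefined) timeseries.granularity = granularity;

  const options = { timeseries };
  // expireAfterSeconds sits next to `timeseries`, not inside it.
  if (expireAfterSeconds !== undefined) options.expireAfterSeconds = expireAfterSeconds;
  return options;
}

const minimal = timeSeriesOptions({ timeField: "timestamp" });
const full = timeSeriesOptions({
  timeField: "timestamp",
  metaField: "sensorId",
  granularity: "minutes",
  expireAfterSeconds: 9000,
});
console.log(JSON.stringify(minimal), JSON.stringify(full));
```

In mongosh the result would be used as db.createCollection("weather", full).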
Terminology & concepts: metaField
> db.createCollection ("weather", { timeseries: { ..., metaField: "sensorId" } } )
{
"sensorId": 123,
"timestamp": ISODate("..."),
"temperature": 47.0
},
{
"sensorId": 456,
"timestamp": ISODate("..."),
"temperature": 69.8
},
{
"sensorId": 789,
"timestamp": ISODate("..."),
"temperature": 97.0
}
● Label or tag that uniquely identifies a time series
● Never/rarely changes over time
[Chart: one series per sensorId (123, 456, 789); value axis from 25 to 100]
Terminology & concepts: measurement
● A set of related key-value pairs at a specific time
● Any other fields except metadata and time
[Chart: one series per sensorId (123, 456, 789); value axis from 25 to 100]
> db.createCollection ("weather", { timeseries: { ..., metaField: "sensorId" } } )
{
"sensorId": 123,
"timestamp": ISODate("..."),
"temperature": 47.0
},
{
"sensorId": 456,
"timestamp": ISODate("..."),
"temperature": 69.8
},
{
"sensorId": 789,
"timestamp": ISODate("..."),
"temperature": 97.0
}
Metadata
Measurements
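The split between time, metadata, and measurement can be expressed in a few lines. The sketch below is only an illustration of the terminology; splitDocument is a hypothetical helper, and the sample document reuses the sensor 456 reading from the slides.

```javascript
// Hypothetical helper: classify the fields of one document the way the slide does.
// Whatever is neither the timeField nor the metaField counts as measurement.
function splitDocument(doc, { timeField, metaField }) {
  const { [timeField]: time, [metaField]: meta, ...measurement } = doc;
  return { time, meta, measurement };
}

const parts = splitDocument(
  { sensorId: 456, timestamp: new Date("2022-05-30T09:05:00.000Z"), temperature: 69.8 },
  { timeField: "timestamp", metaField: "sensorId" }
);
console.log(parts);
```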
Internal
{
"_id": ObjectId("629487903149047dd18f7e3e"),
"control": {
"count": 2,
"min": {
"_id": ObjectId("62951bb262fbb35f79c3b472"),
"timestamp": ISODate("2022-05-30T09:00:00.000Z"),
"temperature": 69.8
},
"max": {
"_id": ObjectId("62951bb262fbb35f79c3b474"),
"timestamp": ISODate("2022-05-30T09:15:00.000Z"),
"temperature": 70.0
}
},
"meta": 456,
"data": {
"temperature": {
0: 69.8,
1: 70.0
},
"_id": {
0: ObjectId("62951bb262fbb35f79c3b472"),
1: ObjectId("62951bb262fbb35f79c3b474")
},
"timestamp": {
0: ISODate("2022-05-30T09:05:00.000Z"),
1: ISODate("2022-05-30T09:15:00.000Z")
}
}
}
> db.weather.insertMany([
{
"sensorId": 789,
…
},
{
"sensorId": 456,
"timestamp": ISODate("2022-05-30T09:05:00.000Z"),
"temperature": 69.8,
"_id": ObjectId("6290cdcf62fbb35f79c3b472")
},
{
"sensorId": 789,
…
},
{
"sensorId": 456,
"timestamp": ISODate("2022-05-30T09:15:00.000Z"),
"temperature": 70.0,
"_id": ObjectId("6290cdcf62fbb35f79c3b474")
}
])
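The relationship between the inserted documents and the internal bucket can be sketched as follows. This is a deliberately simplified model for intuition only: it groups by the metaField alone and ignores time ranges, bucket size limits, and the rounding of control.min.timestamp visible on the slide. The function toBuckets and the sample reading for sensor 789 are assumptions made for the example.

```javascript
// Simplified model of bucketing, for illustration only.
// Documents that share a metaField value end up in one bucket whose `data`
// is laid out by column: one object per field, keyed by measurement position.
function toBuckets(docs, metaField) {
  const buckets = new Map();
  for (const doc of docs) {
    const { [metaField]: meta, ...fields } = doc;
    if (!buckets.has(meta)) {
      buckets.set(meta, { meta, control: { count: 0, min: {}, max: {} }, data: {} });
    }
    const bucket = buckets.get(meta);
    const index = bucket.control.count;
    bucket.control.count += 1;
    for (const [name, value] of Object.entries(fields)) {
      if (!(name in bucket.data)) bucket.data[name] = {};
      bucket.data[name][index] = value;
      // Track per-field minimum and maximum, as in the `control` section.
      const { min, max } = bucket.control;
      if (!(name in min) || value < min[name]) min[name] = value;
      if (!(name in max) || value > max[name]) max[name] = value;
    }
  }
  return [...buckets.values()];
}

const buckets = toBuckets(
  [
    { sensorId: 789, timestamp: new Date("2022-05-30T09:00:00.000Z"), temperature: 97.0 }, // sample values
    { sensorId: 456, timestamp: new Date("2022-05-30T09:05:00.000Z"), temperature: 69.8 },
    { sensorId: 456, timestamp: new Date("2022-05-30T09:15:00.000Z"), temperature: 70.0 },
  ],
  "sensorId"
);
console.log(JSON.stringify(buckets, null, 2));
```

Interleaved inserts for different sensors still land in separate buckets, which is what keeps data from the same source physically close together.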
Time Series Collection Columnar
Compression
Columnar compression adds a number of
innovations that work together to significantly
improve practical compression before on-disk
compression
Launched with
5.2
Dramatically reduce database storage footprint
Improves read performance
Increases cache efficiency, fitting more data in memory
and using less I/O
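The slides do not name the individual techniques involved. As a rough intuition for why storing values by column helps, regular timestamps and slowly changing readings turn into small, repetitive differences that compress well. The sketch below uses delta encoding purely as an assumed example of that idea; it is not a description of MongoDB's implementation, and the timestamps are made-up epoch milliseconds one minute apart.

```javascript
// Illustration only: delta encoding of one column of values.
// The first value is kept as-is, every later value is stored as a difference.
function deltaEncode(values) {
  return values.map((v, i) => (i === 0 ? v : v - values[i - 1]));
}

// Reverse the encoding by accumulating the differences.
function deltaDecode(deltas) {
  const out = [];
  for (const d of deltas) {
    out.push(out.length === 0 ? d : out[out.length - 1] + d);
  }
  return out;
}

const timestamps = [1653901500000, 1653901560000, 1653901620000, 1653901680000];
const encoded = deltaEncode(timestamps);
const decoded = deltaDecode(encoded);
console.log(encoded, decoded);
```

After encoding, the column is one large number followed by the same small number repeated, which a general-purpose compressor handles far better than four unrelated 13-digit values.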
Time Series Collection Columnar Compression Example
[Bar chart: Uncompressed BSON vs. Storage Size (Weather Data). Series: Uncompressed BSON Size; Time Series Collection Compressed Bucket Size; Time Series Collection Compressed Storage Size. Data labels on the chart: 107MB, 6MB, 2.2MB, -97%]
Querying Time Series
Collections
> db.weather.find()
Launched with
5.0
When querying time-series
collections, two main things happen
under the hood:
● Query rewrites
● Bucket “unpacking”
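Bucket unpacking can be pictured as the reverse of the columnar layout shown earlier in the deck. The sketch below is a simplified model in plain JavaScript, assuming a bucket shaped like the one on the "Internal" slide; unpackBucket is a hypothetical name and the timestamps are kept as strings to stay self-contained.

```javascript
// Simplified "unpacking": turn one columnar bucket back into per-measurement documents.
// `metaField` is the field name the collection was created with.
function unpackBucket(bucket, metaField) {
  const docs = [];
  for (let i = 0; i < bucket.control.count; i++) {
    const doc = { [metaField]: bucket.meta };
    for (const [name, column] of Object.entries(bucket.data)) {
      if (i in column) doc[name] = column[i]; // a column may lack a value at some position
    }
    docs.push(doc);
  }
  return docs;
}

const bucket = {
  meta: 456,
  control: { count: 2 },
  data: {
    temperature: { 0: 69.8, 1: 70.0 },
    timestamp: { 0: "2022-05-30T09:05:00.000Z", 1: "2022-05-30T09:15:00.000Z" },
  },
};

const docs = unpackBucket(bucket, "sensorId");
console.log(docs);
```

Query rewriting is the other half: a filter on the metaField can be answered against the bucket's meta value before any unpacking happens, which is what the parsedQuery on meta in the query plan near the end of the deck reflects.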
A Concrete
Example
An event in Apache Kafka
> _
> confluent kafka topic create stockData
> confluent kafka topic produce stockData --parse-key --delimiter ,
keyABC1, {
"tx_time": "2021-06-30T15:47:31.000Z",
"company_symbol": "SCL",
"company_name": "SILKY CORNERSTONE LLC",
"price": 94.0999984741211
}
keyABC2, {...}
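The key and value on each produced line are separated by the chosen delimiter. The sketch below illustrates that split in plain JavaScript. It assumes the line is split at the first occurrence of the delimiter, so commas inside the JSON value are left untouched; parseKeyedLine is a hypothetical helper and the event is a shortened form of the one above.

```javascript
// Assumption: the line is split at the first occurrence of the delimiter,
// so commas inside the JSON value are left alone.
function parseKeyedLine(line, delimiter = ",") {
  const at = line.indexOf(delimiter);
  if (at === -1) throw new Error("no delimiter found");
  return {
    key: line.slice(0, at).trim(),
    value: line.slice(at + delimiter.length).trim(),
  };
}

const record = parseKeyedLine('keyABC1,{"company_symbol":"SCL","price":94.0999984741211}');
const event = JSON.parse(record.value);
console.log(record.key, event);
```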
Creating a TS collection
db.createCollection(
"StockDataTS",
{ timeseries:
{ timeField: "tx_time",
metaField: "company_symbol",
granularity: "minutes"
}
}
);
db.StockDataTS.stats().timeseries
{
bucketsNs: 'demo.system.buckets.StockDataTS',
avgBucketSize: 393,
avgNumMeasurementsPerCommit: 1,
bucketCount: 1,
numBucketInserts: 1,
numBucketUpdates: 0,
. . .
}
Configuring the connector for TS
$> curl -X POST http://${URL}:${PORT}/connectors -H "Content-Type: application/json" -d @- <<'EOF'
{ "name": "mongo-sink-stockdata",
"config": {
"connector.class":"com.mongodb.kafka.connect.MongoSinkConnector",
"tasks.max":"1",
"topics":"stockData",
"connection.uri":"<MONGODB SINK CONNECTION STRING from MongoDB Atlas>",
"database":"Stocks",
"collection":"StockDataTS",
"key.converter":"org.apache.kafka.connect.storage.StringConverter",
"value.converter":"org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable":"false",
"timeseries.metafield":"company_symbol",
"timeseries.timefield":"tx_time",
"timeseries.timefield.auto.convert":"true",
"timeseries.timefield.auto.convert.date.format":"yyyy-MM-dd'T'HH:mm:ss'Z'" }
}
EOF
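The time field settings matter because a time series collection needs the timeField to hold a real date, while the Kafka event carries tx_time as a string. The sketch below illustrates the effect of that conversion in plain JavaScript. The connector performs it internally; convertTimeField is a hypothetical stand-in that relies on JavaScript's own ISO 8601 parsing rather than the configured date format.

```javascript
// Illustration of what converting the time field means: the string value of
// tx_time becomes a real date before the document is written.
// convertTimeField is a hypothetical stand-in, not part of the connector.
function convertTimeField(doc, timeField) {
  const parsed = new Date(doc[timeField]);
  if (Number.isNaN(parsed.getTime())) {
    throw new Error(`cannot parse ${timeField}: ${doc[timeField]}`);
  }
  return { ...doc, [timeField]: parsed };
}

const converted = convertTimeField(
  { tx_time: "2021-06-30T15:47:31.000Z", company_symbol: "SCL", price: 94.0999984741211 },
  "tx_time"
);
console.log(converted);
```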
Query TS collections
db.StockDataTS.createIndex({ 'company_symbol': 1 });
db.StockDataTS.aggregate([
{ '$match': { company_symbol: 'SCL' } },
{
'$setWindowFields': {
partitionBy: '$company_name',
sortBy: { tx_time: 1 },
output: {
averagePrice: {
'$avg': '$price',
window: { documents: [ 'unbounded', 'current' ] }
}}}
}]);
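The window [ 'unbounded', 'current' ] means that each document's averagePrice covers every earlier document in its partition plus itself, in other words a running average. The sketch below reproduces that calculation in plain JavaScript to show the semantics; cumulativeAverage is a hypothetical helper, and the prices are made-up round numbers rather than values from the demo.

```javascript
// Plain-JavaScript model of a running average per partition:
// partition the documents, sort each partition, then average everything
// from the start of the partition up to and including the current document.
function cumulativeAverage(docs, { partitionBy, sortBy, field }) {
  const partitions = new Map();
  for (const doc of docs) {
    const key = doc[partitionBy];
    if (!partitions.has(key)) partitions.set(key, []);
    partitions.get(key).push(doc);
  }
  const out = [];
  for (const rows of partitions.values()) {
    rows.sort((a, b) => (a[sortBy] < b[sortBy] ? -1 : a[sortBy] > b[sortBy] ? 1 : 0));
    let sum = 0;
    rows.forEach((row, i) => {
      sum += row[field];
      out.push({ ...row, averagePrice: sum / (i + 1) });
    });
  }
  return out;
}

const averaged = cumulativeAverage(
  [
    { company_name: "SILKY CORNERSTONE LLC", tx_time: "2021-06-30T15:47:33.000Z", price: 30 },
    { company_name: "SILKY CORNERSTONE LLC", tx_time: "2021-06-30T15:47:31.000Z", price: 10 },
    { company_name: "SILKY CORNERSTONE LLC", tx_time: "2021-06-30T15:47:32.000Z", price: 20 },
  ],
  { partitionBy: "company_name", sortBy: "tx_time", field: "price" }
);
console.log(averaged.map((d) => d.averagePrice));
```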
Results
{
tx_time: 2021-06-30T15:47:45.000Z,
_id: '60dc922172f0f39e2cd6cbeb',
company_name: 'SILKY CORNERSTONE LLC',
price: 94.06999969482422,
company_symbol: 'SCL',
averagePrice: 94.1346669514974
},
{
tx_time: 2021-06-30T15:47:47.000Z,
_id: '60dc922372f0f39e2cd6cbf0',
company_name: 'SILKY CORNERSTONE LLC',
price: 94.1500015258789,
company_symbol: 'SCL',
averagePrice: 94.13562536239624
}....
queryPlanner: {
namespace: 'demo.system.buckets.StockDataTS',
parsedQuery: { meta: { '$eq': 'SCL' } },
winningPlan: {
stage: 'FETCH',
inputStage: {
stage: 'IXSCAN',
keyPattern: { meta: 1 },
indexName: 'company_symbol_1',
isMultiKey: false,
multiKeyPaths: { meta: [] },
. . .
}
}
}
More info on MongoDB time series
collections
Thank you for
your time.

More Related Content

PDF
DIMT 2023 SG - Hands-on Workshop_ Getting started with Confluent Cloud.pdf
PDF
Why Cloud-Native Kafka Matters: 4 Reasons to Stop Managing it Yourself
PDF
DIMT '23 Session_Demo_ Latest Innovations Breakout.pdf
PDF
Confluent per il settore FSI: Accelerare l'Innovazione con il Data Streaming...
PPTX
Data In Motion Paris 2023
PDF
Strategies For Migrating From SQL to NoSQL — The Apache Kafka Way
PPTX
Best Practices for Building Hybrid-Cloud Architectures | Hans Jespersen
PDF
Bridge to Cloud: Using Apache Kafka to Migrate to AWS
DIMT 2023 SG - Hands-on Workshop_ Getting started with Confluent Cloud.pdf
Why Cloud-Native Kafka Matters: 4 Reasons to Stop Managing it Yourself
DIMT '23 Session_Demo_ Latest Innovations Breakout.pdf
Confluent per il settore FSI: Accelerare l'Innovazione con il Data Streaming...
Data In Motion Paris 2023
Strategies For Migrating From SQL to NoSQL — The Apache Kafka Way
Best Practices for Building Hybrid-Cloud Architectures | Hans Jespersen
Bridge to Cloud: Using Apache Kafka to Migrate to AWS

Similar to Streaming Time Series Data With Kenny Gorman and Elena Cuevas | Current 2022 (20)

PDF
Reinventing Kafka in the Data Streaming Era - Jun Rao
PDF
The Never Landing Stream with HTAP and Streaming
PDF
Modern application delivery with Consul
PPTX
AWS Immersion Day Mapfre - Confluent
PDF
Luciano Moreira_Jacob Bogie-BRSP005-10.3_22_FINAL.pdf
PDF
Confluent Operator as Cloud-Native Kafka Operator for Kubernetes
PPTX
Amazon AWS vs Azure Cloud vs Kubernetes
PDF
Billions of Messages in Real Time: Why Paypal & LinkedIn Trust an Engagement ...
PDF
.NET Cloud-Native Bootcamp- Los Angeles
PPTX
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...
PDF
Au delà des brokers, un tour de l’environnement Kafka | Florent Ramière
PPTX
Adaptiva OneSite Cloud: Software Delivery Everywhere
PDF
Confluent Partner Tech Talk with Synthesis
PPTX
An Introduction to Confluent Cloud: Apache Kafka as a Service
PPTX
Unlock value with Confluent and AWS.pptx
PPTX
Transform Your Mainframe Data for the Cloud with Precisely and Apache Kafka
PPTX
Intro to Google Cloud Platform Data Engineering.
PDF
Microsoft Azure Explained - Hitesh D Kesharia
PDF
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with K...
PDF
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Reinventing Kafka in the Data Streaming Era - Jun Rao
The Never Landing Stream with HTAP and Streaming
Modern application delivery with Consul
AWS Immersion Day Mapfre - Confluent
Luciano Moreira_Jacob Bogie-BRSP005-10.3_22_FINAL.pdf
Confluent Operator as Cloud-Native Kafka Operator for Kubernetes
Amazon AWS vs Azure Cloud vs Kubernetes
Billions of Messages in Real Time: Why Paypal & LinkedIn Trust an Engagement ...
.NET Cloud-Native Bootcamp- Los Angeles
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...
Au delà des brokers, un tour de l’environnement Kafka | Florent Ramière
Adaptiva OneSite Cloud: Software Delivery Everywhere
Confluent Partner Tech Talk with Synthesis
An Introduction to Confluent Cloud: Apache Kafka as a Service
Unlock value with Confluent and AWS.pptx
Transform Your Mainframe Data for the Cloud with Precisely and Apache Kafka
Intro to Google Cloud Platform Data Engineering.
Microsoft Azure Explained - Hitesh D Kesharia
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Ad

More from HostedbyConfluent (20)

PDF
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
PDF
Renaming a Kafka Topic | Kafka Summit London
PDF
Evolution of NRT Data Ingestion Pipeline at Trendyol
PDF
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
PDF
Exactly-once Stream Processing with Arroyo and Kafka
PDF
Fish Plays Pokemon | Kafka Summit London
PDF
Tiered Storage 101 | Kafla Summit London
PDF
Building a Self-Service Stream Processing Portal: How And Why
PDF
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
PDF
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
PDF
Navigating Private Network Connectivity Options for Kafka Clusters
PDF
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
PDF
Explaining How Real-Time GenAI Works in a Noisy Pub
PDF
TL;DR Kafka Metrics | Kafka Summit London
PDF
A Window Into Your Kafka Streams Tasks | KSL
PDF
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
PDF
Data Contracts Management: Schema Registry and Beyond
PDF
Code-First Approach: Crafting Efficient Flink Apps
PDF
Debezium vs. the World: An Overview of the CDC Ecosystem
PDF
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Renaming a Kafka Topic | Kafka Summit London
Evolution of NRT Data Ingestion Pipeline at Trendyol
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Exactly-once Stream Processing with Arroyo and Kafka
Fish Plays Pokemon | Kafka Summit London
Tiered Storage 101 | Kafla Summit London
Building a Self-Service Stream Processing Portal: How And Why
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Navigating Private Network Connectivity Options for Kafka Clusters
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
Explaining How Real-Time GenAI Works in a Noisy Pub
TL;DR Kafka Metrics | Kafka Summit London
A Window Into Your Kafka Streams Tasks | KSL
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
Data Contracts Management: Schema Registry and Beyond
Code-First Approach: Crafting Efficient Flink Apps
Debezium vs. the World: An Overview of the CDC Ecosystem
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Ad

Recently uploaded (20)

PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Empathic Computing: Creating Shared Understanding
DOCX
The AUB Centre for AI in Media Proposal.docx
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PPTX
MYSQL Presentation for SQL database connectivity
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
cuic standard and advanced reporting.pdf
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PPTX
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
Encapsulation theory and applications.pdf
PDF
Encapsulation_ Review paper, used for researhc scholars
PPTX
Cloud computing and distributed systems.
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
CIFDAQ's Market Insight: SEC Turns Pro Crypto
“AI and Expert System Decision Support & Business Intelligence Systems”
Empathic Computing: Creating Shared Understanding
The AUB Centre for AI in Media Proposal.docx
20250228 LYD VKU AI Blended-Learning.pptx
MYSQL Presentation for SQL database connectivity
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
cuic standard and advanced reporting.pdf
Digital-Transformation-Roadmap-for-Companies.pptx
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Encapsulation theory and applications.pdf
Encapsulation_ Review paper, used for researhc scholars
Cloud computing and distributed systems.
The Rise and Fall of 3GPP – Time for a Sabbatical?
Per capita expenditure prediction using model stacking based on satellite ima...
Reach Out and Touch Someone: Haptics and Empathic Computing
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf

Streaming Time Series Data With Kenny Gorman and Elena Cuevas | Current 2022

  • 1. Streaming Time Series Data with Apache Kafka and MongoDB DATE AND TIME GOES HERE IN ALL CAPS Kenny Gorman Principal Product Manager - Streaming, MongoDB Elena Cuevas Manager, Cloud Partner Solutions Engineering, Confluent
  • 3. IoT - data generators { 'station_id': 72117, 'station_name': 'Hill Country' 'createdAt': 1664220412603, 'obs': [ {'wind_avg': 0.9, 'uv': 4.26, 'wind_gust': 2.5 } ] 'timezone': 'America/Chicago', 'elevation': 264.83969, 'longitude': -97.84221, 'latitude': 30.36495 }
  • 5. MongoDB Developer Data Platform Confluent and MongoDB in the Cloud Real-Time Online Data Store Primary Secondary Secondary High Volume Real-Time Operational Data Analytical Time Series Real Time Analytical Data Data tiering (DL/DWH, Archiving) Lucene based Text Search Sharding and Replica Sets Sensors Digital Content Transactions Clients Security More Legacy Systems of record Sources of truth And Mobile and Web Apps Personalized Marketing Research & Analytics AML / AFM / ... ? Many Others... Real Time Vertical Solutions BI Connector / Real Time Analytics Apache Spark Team Collaboration Flexible API and Microservices Data Facilitation REALM Mobile Sync Real-Time Online Data Store High Volume Real-Time Event Processing Bridge to Cloud Bidirectional sync, native connectors Registry, Real-time Processing Highly available, scalable Confluent Platform Confluent Cloud Kafka Sink Kafka Source Supported connectors Fully managed Atlas connectors Stream kSQL Registry
  • 6. MongoDB Developer Data Platform Confluent and MongoDB in the Cloud Real-Time Online Data Store Primary Secondary Secondary High Volume Real-Time Operational Data Analytical Time Series Real Time Analytical Data Data tiering (DL/DWH, Archiving) Lucene based Text Search Sharding and Replica Sets Sensors Digital Content Transactions Clients Security More Legacy Systems of record Sources of truth And Mobile and Web Apps Personalized Marketing Research & Analytics AML / AFM / ... ? Many Others... Real Time Vertical Solutions BI Connector / Real Time Analytics Apache Spark Team Collaboration Flexible API and Microservices Data Facilitation REALM Mobile Sync Real-Time Online Data Store High Volume Real-Time Event Processing Bridge to Cloud Bidirectional sync, native connectors Registry, Real-time Processing Highly available, scalable Confluent Platform Confluent Cloud Kafka Sink Kafka Source Supported connectors Fully managed Atlas connectors Stream kSQL Registry Produce event Connector Time Series Query
  • 8. Real-time & Historical Data A sale A shipment A trade A customer interaction A new paradigm is required for Data in Motion Continuously process streams of data in real time “We need to shift our thinking from everything at rest, to everything in motion.” — Real-Time Stream Processing Rich, front-end customer experiences Real-time, software-driven business operations
  • 9. Confluent Cloud Cloud-native data streaming platform built by the founders of Apache Kafka® Everywhere Connect your data in real time with a platform that spans from on-prem to cloud and across clouds Complete Go above & beyond Kafka with all the essential tools for a complete data streaming platform Cloud-Native Apache Kafka© , fully managed and re-architected to harness the power of the cloud Stream confidently on the world’s most trusted data streaming platform built by the founders of Apache Kafka©, with resilience, security, compliance, and privacy built-in by default. 9
  • 10. Serverless ● Elastic scaling up & down from 0 to GBps ● Auto capacity mgmt, load balancing, and upgrades High Availability ● 99.99% SLA ● Multi-region / AZ availability across cloud providers ● Patches deployed in Confluent Cloud before Apache Kafka Infinite Storage ● Store data cost- effectively at any scale without growing compute DevOps Automation ● API-driven and/or point-and-click ops ● Service portability & consistency across cloud providers and on-prem Network Flexibility ● Public, VPC, and Private Link ● Self-managed option for air-gapped environments Elastic: Instantly scale to meet any demand Seamlessly provision and deploy fully managed, elastically scaling clusters with infinite storage that expand & shrink to cost-effectively support all streaming use cases Reliable: Power all your streaming apps & analytics with resilience Maintain high availability of your clusters and data streams with our 99.99% uptime SLA, multi-AZ / region clusters, and no-touch Kafka patches & upgrades Agile: Focus on innovation, not infrastructure Fully automate management of serverless clusters through code via Terraform integration and REST APIs, paying only for what you use when you use it Cloud-Native Apache Kafka® , fully managed and re-architected to harness the power of the cloud “Before Confluent, when we had broker outages that required rebuilds, it could take up to three days of developer time to resolve. Now, Confluent takes care of everything for us, so our developers can focus on building new features and applications.”
  • 11. Complete Go above & beyond Kafka with all the essential tools for a complete data streaming platform Connectors & Stream Processing: Connect to and from any app / system and process your data streams in-flight Reduce TCO and architectural complexity with our portfolio of 120+ pre-built connectors and stream processing powered by ksqlDB, all available fully managed and built-in with Confluent Cloud Stream Designer: Quickly build and deploy streaming apps & pipelines Rapidly build, test, and deploy streaming data pipelines with Stream Designer, extensible with SQL, while reducing the need to write boilerplate code Security & Governance: Secure, discover, and organize your data streams Build trust and put your data streams to work with enterprise-grade security and the only Stream Governance suite for data in motion “BHG is a fast-moving company, and Confluent is quickly becoming not only a central highway for our data with their vast connector portfolio, but a streaming transformation engine as well for a vast number of use cases… We are making Confluent the true backbone of BHG, including leveraging 20+ Confluent connectors across both modern, cloud-based technologies & legacy systems, to help integrate our critical apps & data systems together.” 11 Connectors Security Data Governance Stream Processing Monitoring Global Resilience Stream Designer
  • 12. Everywhere Connect your data in real time with a platform that spans from on-prem to cloud and across clouds Run Anywhere: Deploy across any environment Provision Confluent as a fully managed service on AWS, Azure, and Google Cloud across 60+ regions w/ Confluent Cloud, or on-premises w/ Confluent Platform Unified: Unify data across hybrid and multi-clouds Provide consistent, self-service access to real-time data across all your environments with Cluster Linking and globally connected clusters that perfectly mirror data Consistent: Learn one platform for all environments Remove the burden of learning new tools for each environment with a consistent experience spanning across cloud, on-prem, and hybrid / multicloud “Our transformation to a cloud-native, agile company required a large-scale migration from open source Apache Kafka. With Confluent, we now support real-time data sharing across all of our environments, and see a clear path forward for our hybrid cloud roadmap.” 12
  • 13. Using fully managed connectors is the fastest, most efficient way to break data silos. Accelerated time-to-value • Increased developer productivity • Reduced operational burden. Custom-built connector: ● Costly to allocate resources to design, build, test, and maintain non-differentiated data integration components ● Delays time-to-value, taking up to 3-6+ engineering months to develop ● Perpetual management and maintenance increases tech debt and risk of downtime. Self-managed connector: ● Pre-built but requires manual installation / config efforts to set up and deploy connectors ● Perpetual management and maintenance of connectors that leads to ongoing tech debt ● Risk of downtime and business disruption due to connector / Connect cluster related issues. Fully managed connector: ● Streamlined configurations and on-demand provisioning of your connectors ● Eliminates operational overhead and management complexity with seamless scaling and load balancing ● Reduced risk of downtime with Confluent Cloud’s 99.99% SLA for all your mission critical use cases
  • 14. Easily connect with IoT data sources. Connect IoT data sources: leverage existing infrastructure investments. Reduce operational complexity: avoid the need for third-party MQTT brokers. Ensure IoT data delivery: compatible with all QoS levels of the MQTT protocol. Diagram labels: Devices, Gateways, MQTT Proxy [1], Broker. [1] Support for self-managed components with a CC subscription with Business support tier or higher.
  • 16. MongoDB Connector for Apache Kafka ● Enables users to easily integrate MongoDB with Kafka ● Users can configure MongoDB as a source to publish data changes from MongoDB into Kafka topics for streaming to consuming applications ● Users can configure MongoDB as a sink to easily persist events from Kafka topics directly to MongoDB collections ● Dead letter queue ● Time series integration ● JMX integration ● Available from Confluent Hub and Verified Gold ● Fully managed using Confluent Cloud ● Configured via Confluent Cloud or Kafka Connect REST endpoint ● Certified against Apache Kafka 2.3 and Confluent Platform 5.3 (or later)
  • 17. MongoDB Connector for Kafka. Sink: Kafka Cluster (topicA, topicB, topicC) → MongoDB Sink Connector → Destination: MongoDB Database. The sink connector receives events from Kafka topic(s) and writes documents to a DB collection. Source: MongoDB Database → Change Streams → MDB Source Connector → Kafka Cluster (topicA, topicB, topicC). The source connector receives documents from a DB collection and writes events to Kafka topic(s).
  • 18. Sink Connector Specifics • Reads messages from topic (based on pointer to message in topic) • Writes message into MongoDB database collection • Moves pointer to next message based on write to database. Diagram: Kafka topic → connector → database collection. 1: offset to message to read; 2: bulk write to db; 3: on successful write (of batch), moves offset to next batch
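The three numbered steps on this slide can be sketched in plain JavaScript. This is a simulation only, not the connector's code: `runSink`, `topic`, `collection`, `flakyWrite`, and the batch size of 2 are hypothetical names and values, and the write failure is injected by hand.

```javascript
// Simulation of the sink flow: read a batch at the current offset,
// bulk write it, and advance the offset only if the write succeeded.
// All names here are illustrative, not MongoDB connector APIs.
function runSink(topic, collection, batchSize, bulkWrite) {
  let offset = 0; // 1: offset of the next message to read
  while (offset < topic.length) {
    const batch = topic.slice(offset, offset + batchSize);
    try {
      bulkWrite(collection, batch); // 2: bulk write to the database
    } catch (err) {
      break; // write failed: the offset stays put, so the batch can be re-read
    }
    offset += batch.length; // 3: move the offset past the written batch
  }
  return offset;
}

const topic = ["e0", "e1", "e2", "e3", "e4"];
const collection = [];
let calls = 0;
// Hypothetical writer that fails on its second call.
const flakyWrite = (coll, batch) => {
  calls += 1;
  if (calls === 2) throw new Error("simulated write failure");
  coll.push(...batch);
};

const committed = runSink(topic, collection, 2, flakyWrite);
console.log(committed, collection);
```

Because the offset only moves after a successful write, the failed batch is neither lost nor partially acknowledged in this model; a retry would start again from the same offset.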
  • 20. Time Series Collection (launched with 5.0): An optimized, column-oriented collection for time-series data which organizes writes so that data for the same source is stored in the same bucket, alongside other data points from a similar point in time. ● Increases developer productivity ● Reduces complexity for working with time series data ● Reduces I/O for read operations ● Massive reduction in storage size and index size ● Optimized WiredTiger cache usage
  • 21. Creating a Time Series Collection (launched with 5.0). To create a time series collection, use the timeseries option: db.createCollection("weather", { timeseries: { timeField: "timestamp", metaField: "sensorId", granularity: "minutes" }, expireAfterSeconds: 9000 }) The timeField is the only required parameter for a time series collection.
  • 22. Terminology & concepts: metaField ● Label or tag that uniquely identifies a time series ● Never/rarely changes over time > db.createCollection("weather", { timeseries: { ..., metaField: "sensorId" } }) { "sensorId": 123, "timestamp": ISODate("..."), "temperature": 47.0 }, { "sensorId": 456, "timestamp": ISODate("..."), "temperature": 69.8 }, { "sensorId": 789, "timestamp": ISODate("..."), "temperature": 97.0 } [Chart: values plotted for sensorId 123, 456, 789]
  • 23. Terminology & concepts: measurement ● A set of related key-value pairs at a specific time ● Any other fields except metadata and time > db.createCollection("weather", { timeseries: { ..., metaField: "sensorId" } }) { "sensorId": 123, "timestamp": ISODate("..."), "temperature": 47.0 }, { "sensorId": 456, "timestamp": ISODate("..."), "temperature": 69.8 }, { "sensorId": 789, "timestamp": ISODate("..."), "temperature": 97.0 } [Chart: values plotted for sensorId 123, 456, 789]
  • 24. Measurements: > db.weather.insertMany([ { "sensorId": 789, … }, { "sensorId": 456, "timestamp": ISODate("2022-05-30T09:05:00.000Z"), "temperature": 69.8, "_id": ObjectId("6290cdcf62fbb35f79c3b472") }, { "sensorId": 789, … }, { "sensorId": 456, "timestamp": ISODate("2022-05-30T09:15:00.000Z"), "temperature": 70.0, "_id": ObjectId("6290cdcf62fbb35f79c3b474") } ]) Internal (the bucket holding metadata 456): { "_id": ObjectId("629487903149047dd18f7e3e"), "control": { "count": 2, "min": { "_id": ObjectId("62951bb262fbb35f79c3b472"), "timestamp": ISODate("2022-05-30T09:00:00.000Z"), "temperature": 69.8 }, "max": { "_id": ObjectId("62951bb262fbb35f79c3b474"), "timestamp": ISODate("2022-05-30T09:15:00.000Z"), "temperature": 70.0 } }, "meta": 456, "data": { "temperature": { 0: 69.8, 1: 70.0 }, "_id": { 0: ObjectId("62951bb262fbb35f79c3b472"), 1: ObjectId("62951bb262fbb35f79c3b474") }, "timestamp": { 0: ISODate("2022-05-30T09:05:00.000Z"), 1: ISODate("2022-05-30T09:15:00.000Z") } } }
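To make the bucket layout on this slide concrete, here is a simplified model in plain JavaScript. It is a sketch under stated assumptions, not MongoDB's storage code: `toBuckets` is a hypothetical helper, it creates one bucket per metaField value only, and it records the true minimum timestamp, whereas the slide's bucket shows a minimum rounded down to 09:00.

```javascript
// Simplified model of how measurements are grouped into buckets:
// one bucket per metaField value, with every other field stored as a column.
// Real buckets are also split by time span and size; that is omitted here.
function toBuckets(docs, metaField) {
  const byMeta = new Map();
  for (const doc of docs) {
    const meta = doc[metaField];
    if (!byMeta.has(meta)) {
      byMeta.set(meta, { meta, control: { count: 0, min: {}, max: {} }, data: {} });
    }
    const bucket = byMeta.get(meta);
    const i = bucket.control.count; // position of this measurement in each column
    for (const [key, value] of Object.entries(doc)) {
      if (key === metaField) continue; // metadata is stored once per bucket
      if (!(key in bucket.data)) bucket.data[key] = {};
      bucket.data[key][i] = value;
      const { min, max } = bucket.control;
      if (!(key in min) || value < min[key]) min[key] = value;
      if (!(key in max) || value > max[key]) max[key] = value;
    }
    bucket.control.count += 1;
  }
  return [...byMeta.values()];
}

const docs = [
  { sensorId: 456, timestamp: new Date("2022-05-30T09:05:00.000Z"), temperature: 69.8 },
  { sensorId: 789, timestamp: new Date("2022-05-30T09:10:00.000Z"), temperature: 97.0 },
  { sensorId: 456, timestamp: new Date("2022-05-30T09:15:00.000Z"), temperature: 70.0 },
];
const buckets = toBuckets(docs, "sensorId");
console.log(JSON.stringify(buckets, null, 2));
```

The 09:10 timestamp for sensor 789 is invented for the example, since the slide elides that document.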
  • 25. Time Series Collection Columnar Compression (launched with 5.2): Columnar compression adds a number of innovations that work together to significantly improve practical compression before on-disk compression. ● Dramatically reduces database storage footprint ● Improves read performance ● Increases cache efficiency, fitting more data in memory and using less I/O
  • 26. Time Series Collection Columnar Compression Example [Chart: Uncompressed BSON vs. Storage Size (Weather Data). Series: Uncompressed BSON Size, Time Series Collection Compressed Bucket Size, Time Series Collection Compressed Storage Size. Values shown: 107MB, 2.2MB, 6MB; reduction shown: -97%.]
  • 27. Querying Time Series Collections > db.weather.find() Launched with 5.0 When querying time-series collections, two main things happen under the hood: ● Query rewrites ● Bucket “unpacking”
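Both steps can be illustrated with a small plain-JavaScript sketch. It is only a model of the idea, assuming the bucket shape shown on the earlier "Internal" slide; `rewriteFilter` and `unpackBucket` are hypothetical helpers, and the server's real rewrite rules (for example for time predicates) are more involved.

```javascript
// Query rewrite (simplified): a predicate on the metaField becomes a
// predicate on the bucket's "meta" field.
function rewriteFilter(filter, metaField) {
  const rewritten = {};
  for (const [key, value] of Object.entries(filter)) {
    rewritten[key === metaField ? "meta" : key] = value;
  }
  return rewritten;
}

// Bucket unpacking (simplified): turn the columnar "data" object back
// into one document per measurement.
function unpackBucket(bucket, metaField) {
  const unpackedDocs = [];
  for (let i = 0; i < bucket.control.count; i++) {
    const doc = { [metaField]: bucket.meta };
    for (const [field, column] of Object.entries(bucket.data)) {
      if (i in column) doc[field] = column[i];
    }
    unpackedDocs.push(doc);
  }
  return unpackedDocs;
}

const sampleBucket = {
  meta: 456,
  control: { count: 2 },
  data: {
    temperature: { 0: 69.8, 1: 70.0 },
    timestamp: { 0: "2022-05-30T09:05:00.000Z", 1: "2022-05-30T09:15:00.000Z" },
  },
};

const rewritten = rewriteFilter({ sensorId: 456 }, "sensorId");
const unpacked = unpackBucket(sampleBucket, "sensorId");
console.log(rewritten, unpacked);
```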
  • 29. An event in Apache Kafka > confluent kafka topic create stockData > confluent kafka topic produce stockData --parse-key --delimiter "," keyABC1, { tx_time: 2021-06-30T15:47:31.000Z, company_symbol: 'SCL', company_name: 'SILKY CORNERSTONE LLC', price: 94.0999984741211 } keyABC2, {...}
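With `--parse-key` and a delimiter, each produced line carries a key and a value. The split can be sketched in plain JavaScript as follows; `splitRecord` is a hypothetical helper, and both splitting at the first delimiter and trimming surrounding whitespace are assumptions of this sketch rather than documented CLI behavior.

```javascript
// Sketch of splitting a produced line into key and value at the first
// delimiter. Later delimiters stay inside the value, which matters because
// the value itself contains commas.
function splitRecord(line, delimiter) {
  const at = line.indexOf(delimiter);
  if (at === -1) throw new Error("no delimiter found in: " + line);
  return {
    key: line.slice(0, at).trim(),
    value: line.slice(at + delimiter.length).trim(),
  };
}

const record = splitRecord(
  "keyABC1, { company_symbol: 'SCL', price: 94.0999984741211 }",
  ","
);
console.log(record.key);
console.log(record.value);
```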
  • 30. Creating a TS collection db.createCollection( "StockDataTS", { timeseries: { timeField: "tx_time", metaField: "company_symbol", granularity: "minutes" } } ); db.StockDataTS.stats().timeseries { bucketsNs: 'demo.system.buckets.StockDataTS', avgBucketSize: 393, avgNumMeasurementsPerCommit: 1, bucketCount: 1, numBucketInserts: 1, numBucketUpdates: 0, . . . }
  • 31. Configuring the connector for TS $> curl -X POST http://${URL}:${PORT}/connectors -H "Content-Type: application/json" -d ' { "name": "mongo-sink-stockdata", "config": { "connector.class":"com.mongodb.kafka.connect.MongoSinkConnector", "tasks.max":"1", "topics":"stockData", "connection.uri":(MONGODB SINK CONNECTION STRING), /* from MongoDB Atlas */ "database":"Stocks", "collection":"StockDataTS", "key.converter":"org.apache.kafka.connect.storage.StringConverter", "value.converter":"org.apache.kafka.connect.json.JsonConverter", "timeseries.metafield":"company_symbol", "timeseries.timefield":"tx_time", "timeseries.timefield.auto.convert":"true", "timeseries.timefield.auto.convert.date.format":"yyyy-MM-dd'T'HH:mm:ss'Z'" } } '
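The Kafka Connect REST API accepts the same settings in two payload shapes, depending on the endpoint. The sketch below builds both in plain JavaScript without sending anything; `createRequest` and `updateRequest` are hypothetical helpers, and the connection string remains a placeholder as on the slide.

```javascript
// Connector settings copied from the slide; the connection string is a placeholder.
const sinkConfig = {
  "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
  "tasks.max": "1",
  "topics": "stockData",
  "connection.uri": "<MONGODB SINK CONNECTION STRING>",
  "database": "Stocks",
  "collection": "StockDataTS",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "timeseries.metafield": "company_symbol",
  "timeseries.timefield": "tx_time",
  "timeseries.timefield.auto.convert": "true",
  "timeseries.timefield.auto.convert.date.format": "yyyy-MM-dd'T'HH:mm:ss'Z'",
};

// POST /connectors expects the name and the config wrapped together.
function createRequest(name, config) {
  return { method: "POST", path: "/connectors", body: { name, config } };
}

// PUT /connectors/{name}/config expects only the flat config map.
function updateRequest(name, config) {
  return { method: "PUT", path: `/connectors/${name}/config`, body: config };
}

const create = createRequest("mongo-sink-stockdata", sinkConfig);
const update = updateRequest("mongo-sink-stockdata", sinkConfig);
console.log(JSON.stringify(create.body, null, 2));
console.log(update.path);
```

The PUT form creates the connector if it does not exist and updates it otherwise, which makes it convenient for repeatable deployments.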
  • 32. Query TS collections db.StockDataTS.createIndex({ 'company_symbol': 1 }); db.StockDataTS.aggregate([ { '$match': { company_symbol: 'SCL' } }, { '$setWindowFields': { partitionBy: '$company_name', sortBy: { tx_time: 1 }, output: { averagePrice: { '$avg': '$price', window: { documents: [ 'unbounded', 'current' ] } } } } } ]);
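The window `[ 'unbounded', 'current' ]` asks for an average over every document from the start of the partition up to the current one, that is, a running average. The plain-JavaScript sketch below reproduces that arithmetic outside the database; `cumulativeAverage` is a hypothetical helper and the three prices are made-up round numbers, not the deck's data.

```javascript
// Running average per partition, ordered by a sort key: the same arithmetic
// that $avg computes over a documents window of ['unbounded', 'current'].
function cumulativeAverage(docs, partitionKey, sortKey, valueKey) {
  const sorted = [...docs].sort((a, b) =>
    a[sortKey] < b[sortKey] ? -1 : a[sortKey] > b[sortKey] ? 1 : 0
  );
  const running = new Map(); // partition value -> { sum, n }
  return sorted.map((doc) => {
    const state = running.get(doc[partitionKey]) || { sum: 0, n: 0 };
    state.sum += doc[valueKey];
    state.n += 1;
    running.set(doc[partitionKey], state);
    return { ...doc, averagePrice: state.sum / state.n };
  });
}

// Illustrative input, deliberately out of order to show the sort.
const trades = [
  { company_name: "SILKY CORNERSTONE LLC", tx_time: "2021-06-30T15:47:33.000Z", price: 96 },
  { company_name: "SILKY CORNERSTONE LLC", tx_time: "2021-06-30T15:47:31.000Z", price: 94 },
  { company_name: "SILKY CORNERSTONE LLC", tx_time: "2021-06-30T15:47:35.000Z", price: 98 },
];
const withAverages = cumulativeAverage(trades, "company_name", "tx_time", "price");
console.log(withAverages.map((d) => d.averagePrice)); // [ 94, 95, 96 ]
```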
  • 33. Results: { tx_time: 2021-06-30T15:47:45.000Z, _id: '60dc922172f0f39e2cd6cbeb', company_name: 'SILKY CORNERSTONE LLC', price: 94.06999969482422, company_symbol: 'SCL', averagePrice: 94.1346669514974 }, { tx_time: 2021-06-30T15:47:47.000Z, _id: '60dc922372f0f39e2cd6cbf0', company_name: 'SILKY CORNERSTONE LLC', price: 94.1500015258789, company_symbol: 'SCL', averagePrice: 94.13562536239624 } ... Explain output: queryPlanner: { namespace: 'demo.system.buckets.StockDataTS', parsedQuery: { meta: { '$eq': 'SCL' } }, winningPlan: { stage: 'FETCH', inputStage: { stage: 'IXSCAN', keyPattern: { meta: 1 }, indexName: 'company_symbol_1', isMultiKey: false, multiKeyPaths: { meta: [] }, . . . } } }
  • 34. More info on MongoDB time series collections