Complex Event Processing platform handling
millions of users
Krzysztof Zarzycki - CTO @ GetInData
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
About us
Founded in 2014 by ex-Spotify engineers.
Focus only on Big Data and Cloud (from day 1)
Community builders (Big Data Tech Warsaw organizers)
50+ Big Data engineers
What is it?
The application logic, analytics, and queries exist continuously, and data flows through them continuously.
Stream Processing
Why is it important for business?
Stream Processing
Actionable insights
Stream Processing
Stream Processing
Why is it important for engineering?
It’s NOT only about real-time
It’s just natural - data comes continuously.
Stream Processing
Stream Processing
User sessions spanning minutes, hours, or days
Batch boundaries are often artificial.
[Figure: rows of user events (♪) on a timeline spanning the batch windows [9:00 - 10:00) and [10:00 - 11:00)]
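To make the point about artificial batch boundaries concrete, here is a minimal sketch (not taken from the talk) that groups event timestamps into sessions by an inactivity gap instead of by fixed hourly windows. The class name `Sessionizer`, the 15-minute gap, and the minute-of-day timestamps are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper: groups timestamps into sessions separated by inactivity gaps. */
class Sessionizer {

    /**
     * Splits sorted timestamps into sessions. A new session starts whenever
     * the pause since the previous event is longer than gap.
     * Each result entry is {sessionStart, sessionEnd}.
     */
    static List<long[]> sessions(long[] sortedTimestamps, long gap) {
        List<long[]> result = new ArrayList<>();
        if (sortedTimestamps.length == 0) {
            return result;
        }
        long start = sortedTimestamps[0];
        long last = start;
        for (int i = 1; i < sortedTimestamps.length; i++) {
            long ts = sortedTimestamps[i];
            if (ts - last > gap) {
                // The pause was too long: close the current session and open a new one.
                result.add(new long[] {start, last});
                start = ts;
            }
            last = ts;
        }
        result.add(new long[] {start, last});
        return result;
    }

    public static void main(String[] args) {
        // Minutes of the day: 9:50, 9:58, 10:05, 10:40 (made-up sample data).
        long[] minutesOfDay = {590, 598, 605, 640};
        for (long[] s : sessions(minutesOfDay, 15)) {
            System.out.println("session from minute " + s[0] + " to minute " + s[1]);
        }
    }
}
```

With this sample data the first three events form one session that crosses the 10:00 boundary, which an hourly batch would have cut in two.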
Complex Event Processing
● Analyze patterns, relations, cause-and-effect
○ If A & B then C
● Infer business-relevant events from raw technical stream
● ... and cascade extraction of even higher-level events
● Alerts, triggers, workflow automation
Complex Event Processing
● behavioral marketing
● product analytics
● business activity monitoring
● technical monitoring and anomaly detection
● IoT
● fraud detection
ESP vs CEP
● Difference is blurry and diminishing
● Traditional CEP
○ Complex processing, low latency, single-machine
○ high-level language like SQL
● Traditional ESP
○ straightforward, high-throughput, distributed
○ Broader, more generic and low-level
● NOW: Best of both!
○ Often called “Streaming Analytics”
The need
● Streaming model and real-time
● On par with batch
○ Enrichment, Joins
○ Aggregation
○ Reprocessing of historical data
○ Machine Learning scoring, inference
○ Complex Event Processing
● large scale, high-throughput
● correctness and fault tolerance
The solution
Apache Flink
open-source stateful processor over massive data streams
Who uses Flink
We use Flink!
Banks Telcos Automotive Adtech
Committers to Flink
Late & Out-of-sequence
events
Breaks correctness.
Often handled with very tedious user code.
Or solved in batch by “waiting enough” and “processing twice”.
Late & Out-of-sequence
events
Handled by Flink in the framework
Based on watermarks, a heuristic that marks the progress of event time in the stream.
A watermark asserts that all earlier events have probably arrived.
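The watermark idea can be sketched in a few lines of plain Java. This is a simplified illustration, not Flink's own watermark generator: the class name `BoundedDelayWatermark` and the fixed tolerance are assumptions. The watermark trails the highest timestamp seen so far by a configured delay, and an event at or behind the watermark is treated as late.

```java
/** Simplified illustration of a watermark with a bounded out-of-orderness tolerance. */
class BoundedDelayWatermark {
    private final long maxDelayMs;                 // assumed tolerance, chosen per source
    private long maxTimestampSeen = Long.MIN_VALUE;

    BoundedDelayWatermark(long maxDelayMs) {
        this.maxDelayMs = maxDelayMs;
    }

    /** Registers an event and returns true if it arrived at or behind the watermark. */
    boolean onEvent(long eventTimestampMs) {
        boolean late = eventTimestampMs <= currentWatermark();
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestampMs);
        return late;
    }

    /** "All events up to this time have probably arrived." */
    long currentWatermark() {
        if (maxTimestampSeen == Long.MIN_VALUE) {
            return Long.MIN_VALUE; // no events seen yet
        }
        return maxTimestampSeen - maxDelayMs;
    }

    public static void main(String[] args) {
        BoundedDelayWatermark wm = new BoundedDelayWatermark(5_000L);
        long[] timestamps = {10_000L, 7_000L, 4_000L, 20_000L}; // made-up, out of order
        for (long ts : timestamps) {
            boolean late = wm.onEvent(ts);
            System.out.println("event " + ts + " late=" + late
                    + " watermark=" + wm.currentWatermark());
        }
    }
}
```

The tolerance is a trade-off: a larger delay accepts more out-of-order events but postpones results, which is why the slide calls watermarks a heuristic.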
Local State
Operational state obligatory for analytics
Used for accumulators, windows, source offsets, tracking patterns, ...
[Figure: a "sum" operator holding state 6 while the numbers 1, 3, 2, 4, 1, 1 stream through it]
Local State
Persistent operational state local to computation
Maximize performance with millions of updates per second per core.
Enable out-of-core (more than RAM) processing, with RocksDB
[Figure: Task 1 ... Task N, each with its own logic and its own local state]
Fault tolerance
(Checkpointing)
State survives abrupt crashes or just maintenance
Checkpointed regularly to resilient external storage.
Accurate - keeps stream offsets, accumulators and windows consistent and in perfect sync.
Efficient - almost no impact on the processing.
Flink Cluster
APIs
Java/Scala
high-level Table API
mid-level dataflow
low-level advanced for tricky cases
Developer Data Scientist Analyst
SQL
Incl. analytical functions
MATCH_RECOGNIZE
and UDF extensibility
Python
Based on Table API
Big picture
Assisting Millions of Users in Real-Time
Kcell case
About Kcell
Largest GSM operator in Kazakhstan
> 10 000 000 subscribers
Great network coverage
4G (40%), 3G (73%), 2G (96%) of the population
We like innovation
Kcell has a strong software development team and lots of experience in building services and products
Not only telco
An ongoing process of company-wide digital transformation
Business needs
Assisting Millions of Users in Real-Time
SMS events
Voice usage events
Data usage events
Roaming events
Location events
Input Process Actions
Use Cases
Use case scenarios. Just a few of many.
Balance Top Up Case
If a subscriber tops up her balance too often within a short period of time, we can offer her a less expensive tariff or auto-payment services.
Trigger UI
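A rule like the balance top-up case boils down to counting events per subscriber inside a time window. The sketch below is an illustration in plain Java, not the platform's actual trigger code: the class name `FrequentTopUpTrigger`, the limit of 2 top-ups, and the 60-second window are made-up values.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-subscriber trigger: fires when top-ups within a window exceed a limit. */
class FrequentTopUpTrigger {
    private final int maxTopUps;   // illustrative limit, not from the talk
    private final long windowMs;   // illustrative window length
    private final Map<String, Deque<Long>> recentTopUps = new HashMap<>();

    FrequentTopUpTrigger(int maxTopUps, long windowMs) {
        this.maxTopUps = maxTopUps;
        this.windowMs = windowMs;
    }

    /**
     * Records a top-up and returns true if the subscriber now has more than
     * maxTopUps top-ups within the last windowMs. Assumes timestamps arrive
     * in order per subscriber.
     */
    boolean onTopUp(String subscriberId, long timestampMs) {
        Deque<Long> recent =
                recentTopUps.computeIfAbsent(subscriberId, id -> new ArrayDeque<>());
        recent.addLast(timestampMs);
        // Evict top-ups that have fallen out of the window.
        while (!recent.isEmpty() && recent.peekFirst() <= timestampMs - windowMs) {
            recent.removeFirst();
        }
        return recent.size() > maxTopUps;
    }

    public static void main(String[] args) {
        FrequentTopUpTrigger trigger = new FrequentTopUpTrigger(2, 60_000L);
        long[] topUps = {0L, 10_000L, 20_000L, 100_000L}; // made-up sample data
        for (long ts : topUps) {
            System.out.println(ts + " -> offer cheaper tariff? "
                    + trigger.onTopUp("user-1", ts));
        }
    }
}
```

The per-subscriber map plays the role of keyed local state; in a streaming job that state would live in the framework rather than in a plain `HashMap`.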
Roaming case
Trigger to the Marketing Platform if a subscriber visited country X and/or registered in visited mobile network Y, and the device type is Z.
Fraud case in roaming
Send an email to the anti-fraud unit if a subscriber registered in roaming while the balance at that moment equals 0.
This situation is impossible in the standard case.
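The fraud case in roaming correlates two event types per subscriber: balance updates and roaming registrations. A minimal sketch under assumed names (`RoamingZeroBalanceDetector` and its methods are hypothetical, not the platform's API):

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical detector: flags a roaming registration made with a zero balance. */
class RoamingZeroBalanceDetector {
    // Last known balance per subscriber (the "local state" of this rule).
    private final Map<String, BigDecimal> lastBalance = new HashMap<>();

    void onBalanceUpdate(String subscriberId, BigDecimal balance) {
        lastBalance.put(subscriberId, balance);
    }

    /**
     * Returns true if the anti-fraud unit should be alerted. Subscribers
     * with an unknown balance do not raise an alert in this sketch.
     */
    boolean onRoamingRegistration(String subscriberId) {
        BigDecimal balance = lastBalance.get(subscriberId);
        return balance != null && balance.signum() == 0;
    }

    public static void main(String[] args) {
        RoamingZeroBalanceDetector detector = new RoamingZeroBalanceDetector();
        detector.onBalanceUpdate("user-1", new BigDecimal("0.00"));
        System.out.println("alert for user-1? " + detector.onRoamingRegistration("user-1"));
    }
}
```

`signum()` is used instead of `equals` so that `0` and `0.00` are both treated as a zero balance.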
Personalized Notifications
Business Automation
Regulatory
Future Work
We have already done a lot. But more great things are coming.
2020 Q3 2020 Q4 2021 Q1 Bright Future
More Data Sources, More Triggers
● Geolocation data
● Equipment logs
Commoditize Machine Learning
● Extract value from ML company-wide!
● Enable easy ML training and productization of models in real-time
Real-time BI
● Intraday view on business and operations
Data Monetization
● Monetize valuable insights from our combined rich data sources.
Predictive maintenance
Network Optimization
● To lower operational costs
● And make better investments
Customer 360 view
● Create a behavioral profile of the customers for better personalised serving
And many more...
Old System
Why did we start to look for a new solution?
External Vendor Solution
1. Blackbox Solution
● Kcell developers can’t fix, tweak or optimize it
2. Scalability issues
● Limited to ~2000 events / sec
● Can’t support all needed data sources
3. Not reliable
● Multiple incidents which took too much time to resolve
Scale
Required system throughput
500K events / second
10M subscribers
40 TB / month
New Solution
Real-time Stream Processing
[Diagram: ingestion (HTTP push/pull, FTP, NFS, MQ) -> events hub -> events processing -> outgestion (HTTP push/pull, FTP, MQ)]
New Solution
Real-time Stream Processing
[Diagram: ingestion (HTTP push/pull, FTP, NFS, MQ) -> events hub -> events processing -> outgestion (HTTP push/pull, FTP, MQ), annotated with three Flink logos]
New Solution (Operations)
Web UI, Monitoring, Security
[Diagram: ingestion (HTTP push/pull, FTP, NFS, MQ) -> events hub -> events processing -> outgestion (HTTP push/pull, FTP, MQ), annotated with three Flink logos]
Admin UI (Triggers workbench)
● API (Kafka based)
Monitoring
● Loki - logs
● Prometheus/Grafana - metrics
Security
● FreeIPA
● Kerberos
● LDAP/AD
New Solution (Data Lake)
Data Lake and Sub-second OLAP Analytics
[Diagram: ingestion (HTTP push/pull, FTP, NFS, MQ) -> events hub -> events processing -> outgestion (HTTP push/pull, FTP, MQ), annotated with three Flink logos]
Data Lake
● Historical Storage (HDFS)
● Batch (Spark), SQL (Hive)
● Keep history, Report, Explore
Column-oriented data store
● OLAP (Druid)
● Interactive BI
Processing Flow
Real-time Stream Processing
[Diagram: raw call events and data usage events -> ingestion -> transform -> transformed events -> triggers (local state in RocksDB) -> notification events -> outgestion -> HTTP calls; the Admin UI submits/stops triggers through a control topic]
Dynamic Rules Design
Some treats for Squirrels
Dynamic Rules Design
Key Points
● We want to run 100s of triggers/business rules
● A typical approach: job per rule
● Won’t work in our case:
○ Run 100s of topologies/jobs = multiplied resources cost
○ Pull data from Kafka 100s of times
○ State (user features) replicated 100s of times
○ Starting a rule requires deployment of a job
Dynamic Rules Design
Key Points
● Our approach: One job to run all triggers/rules
○ And to consume all the sources
● Trigger “templates” still coded in Java
● Adding/removing rules without restarting the application
● 100s of rules running efficiently
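The key points above can be sketched as a single component that receives both control messages and events. This is a plain-Java illustration of the idea, not the project's code: `DynamicRuleEngine`, the template names, and the notification format are hypothetical. Templates are compiled into the job, while rule instances (template name plus parameter) are added and removed at runtime.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

/** Hypothetical engine: one "job" evaluating every active rule on every event. */
class DynamicRuleEngine {

    private static final class Rule {
        final BiPredicate<Double, Double> template;
        final double threshold;

        Rule(BiPredicate<Double, Double> template, double threshold) {
            this.template = template;
            this.threshold = threshold;
        }
    }

    // Templates are code: adding a new one still means redeploying the job.
    private final Map<String, BiPredicate<Double, Double>> templates = new HashMap<>();
    // Active rules are data: they change while the job keeps running.
    private final Map<String, Rule> activeRules = new LinkedHashMap<>();

    DynamicRuleEngine() {
        templates.put("balanceBelow", (balance, threshold) -> balance < threshold);
        templates.put("balanceAbove", (balance, threshold) -> balance > threshold);
    }

    /** Control-stream side: start a rule from an existing template. */
    void addRule(String ruleId, String templateName, double threshold) {
        BiPredicate<Double, Double> template = templates.get(templateName);
        if (template == null) {
            throw new IllegalArgumentException("Unknown template: " + templateName);
        }
        activeRules.put(ruleId, new Rule(template, threshold));
    }

    /** Control-stream side: stop a rule. */
    void removeRule(String ruleId) {
        activeRules.remove(ruleId);
    }

    /** Event side: evaluate all active rules and collect notifications. */
    List<String> onEvent(String userId, double balance) {
        List<String> notifications = new ArrayList<>();
        for (Map.Entry<String, Rule> entry : activeRules.entrySet()) {
            Rule rule = entry.getValue();
            if (rule.template.test(balance, rule.threshold)) {
                notifications.add(entry.getKey() + ":" + userId);
            }
        }
        return notifications;
    }

    public static void main(String[] args) {
        DynamicRuleEngine engine = new DynamicRuleEngine();
        engine.addRule("low-50", "balanceBelow", 50.0);
        System.out.println(engine.onEvent("user-1", 40.0));
        engine.removeRule("low-50");
        System.out.println(engine.onEvent("user-1", 40.0));
    }
}
```

Each event is read once and checked against every rule, which is where the savings over a job per rule come from. It is also why one slow or faulty rule can affect all the others.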
Dynamic Rules Design
The Overview
[Diagram: billing events and roaming events -> Sort by time -> Deduplicate -> Router -> Apply Triggers (CoProcess Function, keyed by user: State Updater, Trigger 1, Trigger 2) -> notification events; the control topic feeds Apply Triggers; late events are handled separately]
Dynamic Rules Design
Pros and Cons
Pros
● Shared resources and costs
○ CPU, RAM, state, shuffle
○ Pulling data from Kafka
● No job restart on start of new rules
○ Rules started by business, no IT involved
● Sharing of state
○ Build customer features that can be seen by all rules
Cons
● One bad rule affects the whole system
○ Watermarks are shared
○ Failures are shared
● Still need to code the rule template in the job
○ No way to use SQL, Table API, CEP
● Can be tricky to debug
○ Code is shared
○ Code paths enabled externally
Dynamic Rules Design
Issue: lagging sources slow down all rules
Problem
● Source A (highly unordered, late) and Source B (ordered, low latency) feed one group of triggers
● All notifications are late
Solution
● Split the triggers into groups
● Triggers Group 1 reads Source A: late notifications
● Triggers Group 2 reads Source B: low latency notifications
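The reason a lagging source slows everything down is that an operator reading several inputs can only advance its event-time clock to the slowest input's watermark. A small sketch of that arithmetic (the class name `CombinedWatermark` and the sample values are illustrative, not from the talk):

```java
import java.util.Arrays;

/** Illustrates why one lagging source holds back every rule that shares its operator. */
class CombinedWatermark {

    /** The combined watermark is the minimum over all input watermarks. */
    static long combine(long... inputWatermarksMs) {
        return Arrays.stream(inputWatermarksMs).min().orElse(Long.MIN_VALUE);
    }

    public static void main(String[] args) {
        long sourceA = 9 * 60_000L;   // hypothetical: source A has only reached minute 9
        long sourceB = 59 * 60_000L;  // hypothetical: source B has reached minute 59

        // Problem: one trigger group reads both sources and waits for A.
        System.out.println("shared group:     " + combine(sourceA, sourceB));

        // Solution: separate groups progress independently.
        System.out.println("group 1 (A only): " + combine(sourceA));
        System.out.println("group 2 (B only): " + combine(sourceB));
    }
}
```

Splitting the triggers by source trades some of the resource sharing for independent progress of event time.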
Flink Changes Wishlist
What could be even better?
Dynamic Topologies
● Attach a new branch to an existing topology that receives the same data
Share inputs between topologies
● Cheaper topologies
● Graph of topologies that pass data locally in Flink
● In other words: local proxying/fan-out of Kafka traffic
Dynamic SQL
Decisions made
Some decisions our team made before or during project implementation
Powerful Real-Time Analytics
Streaming-first approach
Apache Kafka for event hub
Apache Flink
Apache Avro
Performance
Keep state local to the process
Ingest reference data for local joins and enrichment
● No need to query external systems while processing
● Data time correlation correctness
[Diagram: transformed events joined with subscriber profile data (events) held in local state; external lookups marked "Not at >100K events / sec"]
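Local joins can be pictured as follows: reference data arrives as a stream of updates and is kept in state next to the processing logic, so enriching an event is a local lookup instead of a remote query. A minimal sketch with hypothetical names (`LocalProfileEnricher`, the tariff field, and the output format are assumptions):

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical enricher: joins events with subscriber profiles held in local state. */
class LocalProfileEnricher {
    // Reference data ingested as events and kept locally, keyed by subscriber.
    private final Map<String, String> tariffBySubscriber = new HashMap<>();

    /** Reference-data side: a profile update replaces the previous value. */
    void onProfileUpdate(String subscriberId, String tariff) {
        tariffBySubscriber.put(subscriberId, tariff);
    }

    /** Event side: enrich with the profile as known at this point in the stream. */
    String enrich(String subscriberId, String eventType) {
        String tariff = tariffBySubscriber.getOrDefault(subscriberId, "unknown");
        return eventType + "|" + subscriberId + "|tariff=" + tariff;
    }

    public static void main(String[] args) {
        LocalProfileEnricher enricher = new LocalProfileEnricher();
        enricher.onProfileUpdate("user-1", "basic");
        System.out.println(enricher.enrich("user-1", "call"));
        System.out.println(enricher.enrich("user-2", "sms"));
    }
}
```

Because profile updates and events flow through the same pipeline, an event is enriched with the profile version that was current when it was processed, which is one way to read the "data time correlation correctness" bullet.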
Ease of Use
NiFi for data ingestion (no coding)
● but not for CEP
Web UI for configuring triggers
Reliability and battle-tested techniques
Flink on YARN, with HDFS
HA for redundancy and running ~24/7
Prometheus & Grafana for monitoring & alerting
Loki for logs collection and aggregation
Security
Kerberos and AD thanks to FreeIPA
Apache Ranger for authorization
Extensiveness
One platform for the whole Enterprise
Batch (ad hoc) queries too
● Spark, Hive/Presto
Online analytics
● OLAP
Cost-Efficiency
Open-source technologies
HDP as a licence-free distribution
Just start with a bunch of servers
Testing
def "should notify when user's balance drops below threshold"() {
    given: "a trigger that fires when the balance drops below 50.0"
    BalanceDropTrigger trigger = balanceDropTrigger()
        .threshold(50.0)
        .outgestionSystem('campaignSystem')
        .build()
    admin.createsTrigger(trigger)

    and: "a user whose balance starts above the threshold"
    user.withBalance(60.0)

    when: "a call brings the balance down to 40.0"
    user.makesCall(phoneCall().amountSpent(20.0))

    then: "give late events time to arrive"
    wait(allowedEventLateness)

    and:
    List<Notification> actualNotifications =
        campaignSystem.getNotifications(user, trigger)

    and: "exactly one notification with the expected content was delivered"
    actualNotifications.size() == 1
    assertThat(actualNotifications.first())
        .hasMsisdn(user.msisdn)
        .hasBalanceAfter(40.0)

    cleanup:
    admin.deletesTrigger(trigger)
}
[Diagram: test event generators -> ingestion (HTTP push/pull, FTP, NFS, MQ) -> events hub -> events processing (Flink) -> outgestion -> Fake Campaign System, all running in the preprod environment]
Our Collaboration
Two heads are better than one
Joint development team
● Not a vendor solution
● Development as one team
Code quality
● Code review and automated tools for code quality control
Agile Practices
● Distant geographic locations, but daily standups
Deliver
● Go live quickly!
● <4 months to first production case running 24/7!
DevOps/Automation
Knowledge sharing
● Constant knowledge exchange in areas of expertise
Testing
● Separate testing environment
● Automated Unit/E2E tests
Q&A

  • 1. Complex Event Processing platform handling millions of users Krzysztof Zarzycki - CTO @ Getindata
  • 2. © Copyright. All rights reserved. Not to be reproduced without prior written consent. About us Founded in 2014 by ex-Spotify engineers. Focus only on Big Data and Cloud (from day 1) Community builders (Big Data Tech Warsaw organizers) 50+ Big Data engineers
  • 3. © Copyright. All rights reserved. Not to be reproduced without prior written consent. What is it? The application logic, analytics, and queries exist continuously, and data flows through them continuously. Stream Processing
  • 4. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Why is it important for business? Stream Processing
  • 5. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Actionable insights Stream Processing
  • 6. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Stream Processing Why is it important for engineering?
  • 7. © Copyright. All rights reserved. Not to be reproduced without prior written consent. It’s NOT only about real-time It’s just natural - data comes continuously. Stream Processing
  • 8. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Stream Processing User sessions spanning minutes, hours, or days Batch boundaries are often artificial. ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ [9:00 - 10:00) [10:00-11:00)
  • 9. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Complex Event Processing ● Analyze patterns,relations, cause-and-effect ○ If A & B then C ● Infer business-relevant events from raw technical stream ● .. and cascade extraction of even higher-level events ● Alerts, triggers, workflow automation
  • 10. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Complex Event Processing ● behavioral marketing ● product analytics ● business activity monitoring ● technical monitoring and anomaly detection ● IoT ● fraud detection
  • 11. © Copyright. All rights reserved. Not to be reproduced without prior written consent. ESP vs CEP ● Difference is blurry and diminishing ● Traditional CEP ○ Complex proc, low latency, single-machine ○ high-level language like SQL ● Traditional ESP ○ straightforward, high-throughput, distributed ○ Broader, more generic and low-level ● NOW: Best of both! ○ Often called “Streaming Analytics”
  • 12. © Copyright. All rights reserved. Not to be reproduced without prior written consent. The need ● Streaming model and real-time ● On par with batch ○ Enrichment, Joins ○ Aggregation ○ Reprocessing of historical data ○ Machine Learning scoring, inference ○ Complex Event Processing ● large scale, high-throughput ● correctness and fault tolerance
  • 13. © Copyright. All rights reserved. Not to be reproduced without prior written consent. The solution Apache Flink open-source stateful processor over massive data streams
  • 14. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Who uses Flink
  • 15. © Copyright. All rights reserved. Not to be reproduced without prior written consent. We use Flink! Banks Telcos Automotive Adtech Commiters to Flink
  • 16. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Late & Out-of-sequence events Breaks correctness. Often handled with very tedious user code. Or solved in batch by “waiting enough” and “processing twice”.
  • 17. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Late & Out-of-sequence events Handled by Flink in the framework Based on watermarks heuristics, that marks the progress of event time in the stream. Asserts that all earlier events have probably arrived.
  • 18. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Local State Operational state obligatory for analytics Used for accumulators, windows, source offsets, tracking patterns, ... 6 sum 1 3 2 4 1 1
  • 19. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Local State Persistent operational state local to computation Maximize performance with millions of updates per second & core. Enable out-of-core (more than RAM) processing, with RocksDb State Task 1 Logic State Task N Logic
  • 20. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Fault tolerance (Checkpointing) State survives abrupt crashes or just maintenance Checkpointed regularly to resilient external storage. Accurate - keeps stream offsets, accumulators or windows in perfect sync, consistency. Efficient - almost no impact on the processing.
  • 21. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Flink Cluster
  • 22. © Copyright. All rights reserved. Not to be reproduced without prior written consent. APIs Java/Scala high-level Table API mid-level dataflow low-level advanced for tricky cases Developer Data Scientist Analyst SQL Incl. analytical functions MATCH_RECOGNIZE and UDF extensibility Python Based on Table API
  • 23. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Big picture
  • 24. Assisting Millions of User in Real-Time Kcell case
  • 25. About Kcell Kcell has a strong software development team and lots of experience in building services and products We like innovations > 10 000 000 subscribers Largest GSM operator in Kazakhstan 4G (40%), 3G (73%), 2G (96%) population Great network coverage There is the ongoing process of company digital transformation Not only telco
  • 26. Business needs Assisting Millions of Users in Real-Time SMS events Voice usage events Data usage events Roaming events Location events Input Process Actions
  • 27. Use Cases Use case scenarios. Just few of many. Case If subscriber top-ups her balance too often in short period of time. We can offer her a less expensive tariff or auto-payment services. Balance Top Up Case Trigger UI
  • 28. Roaming Fraud Trigger to Marketing Platform if subscriber visited X country OR/AND registered in Y visited mobile network and his device's type is Z Roaming case Send an email to the anti-fraud unit if subscriber registered in roaming but his balance at the moment is equal to 0. This situation is impossible in standard case. Fraud case in roaming
  • 30. Future Work We have already done a lot. But more great things are coming. 2020 Q3 2020 Q4 2021 Q1 Bright Future More Data Sources More Triggers Geolocation data Equipment logs Commoditize Machine Learning Extract value from ML company-wide! Enable easy ML training and productization of models in real-time Real-time BI Intraday view on business and operations Monetize valuable insights from our combined rich data sources. Data Monetization Predictive maintenance Network Optimization To lower operational costs And make better investments And many more... Create behavioral profile of the customers for better personalised serving Customer 360 view
  • 31. Old System Why did we start to look for the new solution? External Vendor Solution 1. Blackbox Solution: Kcell developers can’t fix, tweak or optimize it 2. Scalability issues: limited to ~2000 events / sec; can’t support all needed data sources 3. Not reliable: multiple incidents which took too much time to resolve
  • 32. Scale Required system throughput 500K Events / second 10M Subscribers 40 TB / month
  • 33. New Solution Real-time Stream Processing ingestion outgestion events hub events processing HTTP push/pull FTP NFS MQ HTTP push/pull FTP MQ
  • 34. New Solution Real-time Stream Processing flink ingestion outgestion events hub events processing HTTP push/pull FTP NFS MQ HTTP push/pull FTP MQ flink flink
  • 35. New Solution (Operations) Web UI, Monitoring, Security flink ingestion outgestion events hub events processing HTTP push/pull FTP NFS MQ HTTP push/pull FTP MQ Admin UI (Triggers workbench) Monitoring Loki - logs Prometheus/Grafana - metrics Security FreeIPA Kerberos LDAP/AD API (kafka based) flink flink
  • 36. New Solution (Data Lake) Data Lake and Sub-second OLAP Analytics flink ingestion outgestion events hub events processing HTTP push/pull FTP NFS MQ HTTP push/pull FTP MQ Data Lake Historical Storage (HDFS) Batch (Spark) SQL (Hive) Keep history, Report, Explore Column-oriented Data store OLAP (Druid) Interactive BI flink flink
  • 37. Processing Flow Real-time Stream Processing raw call events data usage events transform transformed events transform transformed events local state RocksDB control topic Admin UI HTTP calls notification events outgestion ingestion ingestion submit/stop triggers
  • 39. Dynamic Rules Design Key Points ● We want to run 100s of triggers/business rules ● A typical approach: job per rule ● Won’t work in our case: ○ Running 100s of topologies/jobs = multiplied resource cost ○ Data pulled from Kafka 100s of times ○ State (user features) replicated 100s of times ○ Starting a rule requires deploying a job
  • 40. Dynamic Rules Design Key Points ● Our approach: one job to run all triggers/rules ○ And to consume all the sources ● Trigger “templates” still coded in Java ● Adding/removing rules without restarting the application ● 100s of rules running efficiently
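The "one job, many rules" idea can be sketched in plain Java as a single processor with two inputs: a control side that adds and removes rule instances, and a data side that evaluates every active rule per event. This sketch uses no Flink APIs (the overview slide names a CoProcess Function for this step), and all class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical event: a balance update for one subscriber. */
class BalanceEvent {
    final String msisdn;
    final double balanceAfter;

    BalanceEvent(String msisdn, double balanceAfter) {
        this.msisdn = msisdn;
        this.balanceAfter = balanceAfter;
    }
}

/** A rule template is coded in Java; a rule instance is the template plus its parameters. */
interface Rule {
    boolean matches(BalanceEvent event);
}

/** Example template: fire when the balance drops below a configurable threshold. */
class BalanceBelowThreshold implements Rule {
    private final double threshold;

    BalanceBelowThreshold(double threshold) {
        this.threshold = threshold;
    }

    @Override
    public boolean matches(BalanceEvent event) {
        return event.balanceAfter < threshold;
    }
}

/** One long-running processor that evaluates every active rule against every event. */
class DynamicRuleProcessor {
    private final Map<String, Rule> activeRules = new LinkedHashMap<>();

    /** Control side: start a rule without restarting the processor. */
    void onRuleSubmitted(String ruleId, Rule rule) {
        activeRules.put(ruleId, rule);
    }

    /** Control side: stop a rule. */
    void onRuleStopped(String ruleId) {
        activeRules.remove(ruleId);
    }

    /** Data side: returns the ids of all rules that fire for this event. */
    List<String> onEvent(BalanceEvent event) {
        List<String> fired = new ArrayList<>();
        for (Map.Entry<String, Rule> entry : activeRules.entrySet()) {
            if (entry.getValue().matches(event)) {
                fired.add(entry.getKey());
            }
        }
        return fired;
    }
}
```

The event is read once and shared by all rules, which is where the savings over job-per-rule come from; the cost is that every rule shares the same code path and failure domain.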
  • 41. Dynamic Rules Design The Overview billing events roaming Sort by time control topic notification events Deduplicate Router Late events Trigger 1 Trigger 2 State Updater Apply Triggers (CoProcess Function) Keyed by User
  • 42. Dynamic Rules Design Pros and Cons Pros: Shared resources and costs ● CPU, RAM, state, shuffle ● Pulling data from Kafka; No job restart on start of new rules ● Rules started by business, no IT involved; Sharing of state ● Build customer features that can be seen by all rules. Cons: One bad rule affects the whole system ● Watermarks are shared ● Failures are shared; Still need to code the rule template in the job ● No way to use SQL, Table API, CEP; Can be tricky to debug ● Code is shared ● Code paths enabled externally
  • 43. Dynamic Rules Design Issue: lagging sources slow down all rules Problem: Source A (highly unordered, late) and Source B (ordered, low latency) feed a single Triggers Group, so all notifications are late. Solution: Source A feeds Triggers Group 1 (late notifications); Source B feeds Triggers Group 2 (low latency notifications).
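Why one lagging source delays every rule: when an operator has several inputs, its event-time watermark is the minimum of the input watermarks, so the slowest source sets the pace. A minimal plain-Java sketch of that bookkeeping follows (the class name is hypothetical and no Flink APIs are used).

```java
import java.util.HashMap;
import java.util.Map;

/** Tracks per-source watermarks; the combined watermark is the minimum across sources. */
class CombinedWatermark {
    private final Map<String, Long> perSource = new HashMap<>();

    CombinedWatermark(String... sources) {
        for (String source : sources) {
            perSource.put(source, Long.MIN_VALUE); // no progress reported yet
        }
    }

    /** Advances one source's watermark (never backwards) and returns the combined watermark. */
    long advance(String source, long watermark) {
        perSource.merge(source, watermark, Long::max);
        return current();
    }

    /** The combined watermark: the slowest source decides. */
    long current() {
        long min = Long.MAX_VALUE;
        for (long w : perSource.values()) {
            min = Math.min(min, w);
        }
        return min;
    }
}
```

With sources A and B sharing one tracker, B at 2000 is still held back to A's 100; a tracker that only sees B moves straight to 2000, which is the effect of splitting triggers into groups per source.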
  • 44. Flink Changes Wishlist What could be even better? Dynamic Topologies: attach a new branch to an existing topology that receives the same data. Cheaper topologies / Share inputs between topologies ● Graph of topologies that pass data locally in Flink ● In other words: local proxying/fan-out of Kafka traffic. Dynamic SQL
  • 45. Decisions made Some decisions our team made before or during project implementation Streaming-first approach Apache Kafka for event hub Apache Flink Powerful Real-Time Analytics
  • 46. Apache Avro Keep state local to the process Ingest reference data for local joins and enrichment ● No need to query external systems while processing ● Data time correlation correctness Performance transformed events transformed events Subscriber profile data (events) Local State Not at >100K events / sec
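The local-join idea can be sketched as follows: reference data arrives as events and is kept in process-local state, so usage events are enriched without calling an external system. This is a plain-Java illustration with hypothetical names; a `HashMap` stands in for the local state (the Processing Flow slide mentions RocksDB), and the tariff field is an assumed example of profile data.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of enrichment from process-local state. */
class LocalEnricher {
    // Latest known tariff per subscriber, fed by profile-update events.
    private final Map<String, String> tariffByMsisdn = new HashMap<>();

    /** Reference-data side: a profile event updates local state. */
    void onProfileUpdate(String msisdn, String tariff) {
        tariffByMsisdn.put(msisdn, tariff);
    }

    /** Event side: enrich with the profile as it is known at this point in the stream. */
    String enrich(String msisdn, String eventType) {
        String tariff = tariffByMsisdn.getOrDefault(msisdn, "unknown");
        return msisdn + "," + eventType + "," + tariff;
    }
}
```

Because profile changes are processed as events in the same stream order as usage events, each usage event sees the profile version that was current when it happened, which is the time-correlation point on the slide.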
  • 47. Ease of Use NiFi for data ingestion (no coding) ● but not for CEP Web UI for configuring triggers
  • 48. Flink on YARN, with HDFS HA for redundancy and running ~24/7 Prometheus & Grafana for monitoring & alerting Loki for logs collection and aggregation Reliability and battle-tested techniques Kerberos and AD thanks to FreeIPA Apache Ranger for authorization Security
  • 49. One platform for the whole Enterprise Batch (adhoc) queries too ● Spark, Hive/Presto Online analytics ● OLAP Extensiveness HDP Open-source technologies HDP as a licence-free distribution Just start with a bunch of servers Cost-Efficiency
  • 50. Testing
      def "should notify when user's balance drops below threshold"() {
          given:
          BalanceDropTrigger trigger = balanceDropTrigger()
                  .threshold(50.0)
                  .outgestionSystem('campaignSystem')
                  .build()
          admin.createsTrigger(trigger)
          and:
          user.withBalance(60.0)
          when:
          user.makesCall(phoneCall().amountSpent(20.0))
          then:
          wait(allowedEventLateness)
          and:
          List<Notification> actualNotifications = campaignSystem.getNotifications(user, trigger)
          and:
          actualNotifications.size() == 1
          assertThat(actualNotifications.first())
                  .hasMsisdn(user.msisdn)
                  .hasBalanceAfter(40.0)
          cleanup:
          admin.deletesTrigger(trigger)
      }
      flink ingestion outgestion events hub events processing Fake Campaign System HTTP push/pull FTP NFS MQ Test event generators Preprod environment
  • 51. Our Collaboration Two heads are better than one Joint development team Not a vendor solution Development as one team Code quality Code review and automated tools for code quality control Agile Practices Distant geographic locations, but everyday standups Go live quickly! <4 months to first production case running 24/7! Deliver DevOps/Automation Knowledge sharing Constant knowledge exchange in areas of expertise Testing Separate testing environment Automated Unit/E2E tests
  • 52. Q&A