In-Flux Limiting for a Multi-Tenant Logging Service
Ambud Sharma & Suma Cherukuri
Cloud Platform Engineering @ Symantec
In-Flux Limiting for a Multi-Tenant Logging Service 1
Overview
• Who are we?
• Architecture
• Streaming Pipeline
• Influx Issue
• Influx Limiting Design & Solution
• Conclusion
• Q & A
Who are we?
• Symantec’s internal cloud team
• Host applications generating over $1B in revenue
• Team
– Logging as a Service (LaaS) – Elasticsearch/Kibana
– Metering as a Service (MaaS) – InfluxDB/Grafana
– Alerting as a Service (AaaS) – Hendrix
We are hiring!
Also check out Hendrix: https://github.com/Symantec/hendrix
Our Data
Logs
• Application and system log data from VMs and containers
• Used for troubleshooting

Metrics
• Application and system telemetry
• Used for Application Performance Monitoring (APM)
Log Event:
{
  "message": "User logged in from 1.1.1.1",
  "@version": "1",
  "@timestamp": "2014-07-16T06:49:39.919Z",
  "host": "value",
  "path": "/opt/logstash/sample.log",
  "tenant_id": "291167ebed3221a006eb",
  "apikey": "06be8a-28ef-4568-8cb8-612",
  "string_boolean": "true",
  "host_ip": "192.168.99.01"
}

Metric Event:
{
  "@version": "1",
  "@timestamp": "2014-07-16T06:49:39.919Z",
  "host": "host1.symantec.com",
  "tenant_id": "291167ebed3221a006ebf6",
  "apikey": "06be8a-28ef-4568-8cb8-618",
  "value": 0.65,
  "name": "cpu"
}
LMM Architecture
Customer Agents (e.g. Logstash) → Kafka (open to customers)
Kafka → Log Topology → Elasticsearch
Kafka → Metrics Topology → InfluxDB
Redis: tenant IDs and API keys
Users query and graph the stored data
Streaming Pipeline
• Validate events against the published schema to optimize indexing
• Authenticate events to route data to the correct index
• One index per day per tenant

Kafka → Validate → Auth → Index
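The deck doesn't show how the daily per-tenant index name is derived; here is a hedged sketch, assuming a `logs-<tenant>-<YYYY.MM.DD>` naming pattern (the pattern itself is an illustration, not the actual scheme used in the service):

```python
from datetime import datetime, timezone

def index_name(tenant_id: str, event_ts: str) -> str:
    """Derive a daily, per-tenant index name from an event's @timestamp.

    The "logs-<tenant>-<YYYY.MM.DD>" pattern is illustrative only.
    """
    # Events carry ISO 8601 timestamps with a trailing "Z" (UTC).
    day = datetime.fromisoformat(event_ts.replace("Z", "+00:00")).astimezone(timezone.utc)
    return f"logs-{tenant_id}-{day:%Y.%m.%d}"
```

For the sample log event above, `index_name("291167ebed3221a006eb", "2014-07-16T06:49:39.919Z")` yields `"logs-291167ebed3221a006eb-2014.07.16"`, so each tenant's events land in their own daily index.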
Influx Issue
• You know your data store's performance limits (find EPS from benchmarks/capacity planning)
• Tenants send a lot of data, and the ingestion rate is never linear
• Ingestion spikes are bound to happen in a real-time streaming application
• Wouldn't it be great if you could normalize these spikes?
Influx Limiting
• Normalize the EPS curve using buffers
• Like a hydro dam, explicitly allocate the EPS resource to tenants

(Before/after EPS graphs shown on the slide.)
Design - Options
Approach 1: route excess traffic to a separate Kafka topic
• No back-pressure in the primary queue
• Secondary queue is drained at a slower pace
• Events may appear out of order

Approach 2: controlled back-pressure in the primary queue
• Selectively reduce the ingestion rate for tenants
• Events will always appear in order
Customer Requirements
• Customers want threshold quotas defined for them
• Thresholds are defined as policies (window duration in seconds)
• Policies are saved in a data store
Tenant A: { "threshold": 100, "window": 90 }
Tenant B: { "threshold": 700, "window": 10 }
Tenant C: { "threshold": 900, "window": 1 }
Bolt Design
1. Track the event rate for each tenant over the policy window
2. If the threshold is exceeded, throttle; otherwise allow the events
3. Reset the window when the time interval completes (tumbling window)

Kafka → Validate → Auth → Throttle → Index
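The three steps above can be sketched as a small tumbling-window throttler (this is an assumption-laden illustration, not the actual Storm bolt code; the real bolt is driven by tick tuples and Kafka offsets):

```python
from collections import defaultdict

class Throttler:
    """Tumbling-window throttle: allow up to a tenant's threshold per window."""

    def __init__(self, policies: dict):
        self.policies = policies          # tenant_id -> {"threshold", "window"}
        self.counters = defaultdict(int)  # tenant_id -> events seen this window

    def allow(self, tenant_id: str) -> bool:
        """Count the event; True if it is within this window's threshold."""
        self.counters[tenant_id] += 1
        return self.counters[tenant_id] <= self.policies[tenant_id]["threshold"]

    def reset(self, tenant_id: str) -> None:
        """Called when the tenant's window tumbles (step 3)."""
        self.counters[tenant_id] = 0
```

With a threshold of 2 per window, the third event in a window is throttled; after `reset()` the tenant's quota is restored for the next window.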
Scheduled-task design pattern
• Clock is maintained using the Storm tick tuple
• A tenant's counter is incremented when an event is received from it
• Counters are reset when the modulated value matches: is time % throttle duration == 0?

(Diagram: the clock time is taken modulo each tenant's throttle duration; when the result is 0, counters for each tenant in that slice are reset; otherwise there is nothing to reset.)
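The scheduled-task pattern above can be sketched without extra threads: group tenants by their policy window, and on every one-second tick reset only the tenants whose window divides the current clock value. This is a simplification — in the real topology the clock is a Storm tick tuple, and the class name here is made up for illustration:

```python
from collections import defaultdict

class TickResetter:
    """Single-threaded reset schedule driven by a once-per-second tick."""

    def __init__(self, policies: dict):
        # Bucket tenants by their throttle duration (window, in seconds).
        self.by_window = defaultdict(list)
        for tenant, p in policies.items():
            self.by_window[p["window"]].append(tenant)
        self.clock = 0

    def on_tick(self) -> list:
        """Called once per second (the tick tuple). Returns tenants whose
        counters should be reset on this tick: time % duration == 0."""
        self.clock += 1
        due = []
        for window, tenants in self.by_window.items():
            if self.clock % window == 0:
                due.extend(tenants)
        return due
```

With a 2-second and a 3-second policy, tenant "a" is reset at t=2, 4, 6, … and tenant "b" at t=3, 6, …; no per-tenant timers or threads are needed, only one modulo check per distinct window length per tick.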
Results
• Reduced EPS to Elasticsearch
• We can normalize the flow rate based on load
Conclusion
• Overview of real-time log and metric indexing
• Approaches to rate limiting in a real-time streaming application
• A design pattern to efficiently perform counting in Storm
That’s all folks!
Questions?
Editor's Notes
  • #2: Welcome to the talk, In-Flux Limiting for a Multi-Tenant Logging Service. Introduce yourselves; we are from Cloud Platform Engineering at Symantec.
  • #3: Today we are here to talk about how we do event throttling and rate limiting for real-time streaming. We will go over the architecture and internal details of the streaming pipeline, the influx issue, and the different approaches to solving the problem, and show you the results. We also want to cover an efficient pattern for counting. If there are any pressing questions, please feel free to stop us, but we would prefer to take questions at the end.
  • #4: We are part of Symantec's internal cloud team, which hosts applications generating over $1B in revenue. Specifically, our team builds, owns, and runs three primary services, which we call Logging as a Service, Metering as a Service, and Alerting as a Service. Side note: we are hiring — if anyone is interested in joining the effort to build the biggest security data lake in the world, please stop by after the presentation. Another side note: we have open-sourced a project called Hendrix, which is our Alerting as a Service; please feel free to check it out at github.com/Symantec/hendrix.
  • #5: Before we jump into the actual design and architecture of our system, let's talk about the data we get and the problem we are solving. We offer logging and APM (application performance monitoring) as a service. Our customers are Symantec product teams, and they send us application and system logs generated on VMs and containers; the teams use these for troubleshooting their applications. This is basically our own version of Splunk. On the metrics side, we get application and system telemetry, which the teams use for application performance monitoring. Here are our sample events: we accept data in JSON format, and this is what it looks like — on the left is the log event and on the right is the metric event. If you look at the log event, you will notice two special fields: one called tenant_id and the other the API key. So what are a tenant and an API key? Every customer — that is, a P&S team at Symantec — has something called tenants; the concept comes from our OpenStack cloud. A given P&S team can have more than one tenant; for example, their production app A can have one tenant and production app B another. Every tenant is a unit of isolation for us. An API key is a token used to allow and revoke the flow of logs for a given tenant: if you want to stop a tenant from sending data, you can revoke the API key, which means we start discarding its events. We call this process event authentication.
  • #6: Now let's get into our architecture. Customers run agents like Flume, Logstash, collectd, and StatsD, which send data to our Kafka cluster, exposed over load balancers. We then run a set of Storm topologies that write the data to the destination data stores: Elasticsearch for logs and InfluxDB for metrics. We use Kibana as a front end for Elasticsearch and Grafana as a front end for InfluxDB so that customers can graph and query the data. Redis is where we store tenant IDs and API keys.
  • #7: Here's what happens inside our streaming pipeline, that is, the Storm topology. First, as we showed earlier, events arrive in Kafka; we use the Storm Kafka spout to read them, and then we validate the events against the format and schema specifications we publish to our customers — for example, if an event is malformed JSON, we drop it. Next, we check whether the tenant_id and API key are valid. Lastly, we index the data into Elasticsearch or insert it into InfluxDB. Each of these stages is a separate Storm bolt.
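  The validate and auth stages described in this note can be sketched as follows. The required-field set and the API-key lookup are simplified assumptions (the real bolts validate a published schema and read keys from Redis):

```python
import json

# Assumed minimal schema; the real published schema has more fields.
REQUIRED_FIELDS = {"@timestamp", "tenant_id", "apikey"}

def validate(raw: str):
    """Parse one raw event; return the dict, or None if it should be dropped."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed JSON is dropped, as the note describes
    if not REQUIRED_FIELDS <= event.keys():
        return None  # missing required fields
    return event

def authenticate(event: dict, apikeys: dict) -> bool:
    """Check the event's apikey against the key stored for its tenant
    (stored in Redis in the actual system; a plain dict here)."""
    return apikeys.get(event["tenant_id"]) == event["apikey"]
```

  Revoking a tenant's API key then amounts to removing (or changing) its entry in the key store, after which `authenticate` fails and the pipeline discards that tenant's events — the "event authentication" process from note #5.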
  • #8: Now that you have a fair idea of our pipeline, let's understand the influx issue. Influx means the arrival of a large quantity of something in a short time — in this case, events. When you are writing data to a data store like HBase, Cassandra, or Elasticsearch, you provision capacity in the cluster: your cluster will have X nodes and can support, say, 10,000 inserts per second. You can gauge this number by running benchmarks. For us, these inserts per second can also be referred to as events per second, or EPS. The EPS sent by our tenants is never linear; it fluctuates quite a lot, as you can see in the graph on the right, where each line represents the EPS from one tenant. At times we get spikes, which is bound to happen in any real-time event processing system: when load increases on the applications, they generate a lot more logs, and we get a spike. When a spike happens, we don't have the provisioned capacity to index the additional influx of data. So wouldn't it be great if you could normalize these spikes — that is, have an almost flat EPS curve for every tenant?
  • #9: Let's understand how we can limit the influx and normalize these spikes. Think of event streams as a river: if there's a cloudburst (no pun intended), the river gets temporarily flooded, so much so that the banks overflow. To fix this, we can build a dam, which buffers the additional influx of water, and we can control the rate at which the dam is drained. Since we are using Kafka, we already have a buffer; however, we have no control over it, because it is governed by the back pressure our Elasticsearch cluster creates, since it can take only so many writes. The purpose of this work was to have controlled back pressure into Kafka for our streaming pipelines, letting us quantitatively determine how many events to let flow through the pipeline into Elasticsearch. And we want to do this on a tenant-by-tenant basis: as you can see in the diagrams, different tenants send different quantities of data, shown in different colors. With a controlled system, we can normalize and evenly divide the capacity among all of them, or knowingly make it uneven — if one customer has more need, we allocate more capacity to them.
  • #10: How can we do this? There are two approaches we thought of. One is to write a substream: if a tenant exceeds their allocated throughput capacity, we divert the extra event traffic to a separate queue, which we then drain at a slower pace; technically, that means a separate Kafka topic and a separate Storm topology with a lower parallelism configuration. The other way is to pause the processing of events in the existing streaming pipeline for the tenant that is sending more data. Both approaches have pros and cons. With the first, you will see events out of order: some data appears right away because it flows through the main pipeline, while some is delayed because it flows through the slower pipeline. With the second approach, you always see events in order, but if the queue back-pressures too much, you may lose data — though that is true for either approach if you share the Kafka cluster, because for a given cluster your disk space is limited.
  • #13: What do we do inside the bolt, and how do we track the event rate for tenants? To track event counts, we keep a hashtable of tenant ID to an integer counter, which we increment every time we see an event from that tenant. But our customers wanted policies that define event rates differently for every tenant: one wants to be allowed to send 300 events in 2 minutes, while another wants 5,000 in 10 minutes — not the same EPS. So we had to come up with a way to track this per tenant, and we found an interesting way of solving the problem without using multiple threads. What we built is logically a sort of merry-go-round where every tenant is allowed to go around once. Every tenant's influx-limiting policy has two parts: the number of events they would like to send, and the time duration for those events. We take the time duration and place tenants on this virtual merry-go-round based on their policy's duration.