DOD 2016 - Rafał Kuć - Building a Resilient Log Aggregation Pipeline Using Elasticsearch and Kafka

Building a Resilient Log Aggregation Pipeline Using Elasticsearch and Kafka
Rafał Kuć @ Sematext Group, Inc.

Sematext & I
Logsene: logs
SPM: metrics

Next 30 minutes…
Log shipping
- buffers
- protocols
- parsing
Central buffering
- Kafka
- Redis
Storage & Analysis
- Elasticsearch
- Kibana
- Grafana

Log shipping architecture
[Diagram: three file shippers send data to a centralized buffer, which feeds a cluster of Elasticsearch (ES) nodes]

Focus: Elasticsearch

Elasticsearch cluster architecture
[Diagram: 3 client nodes, 6 data nodes, 3 master nodes, 3 ingest nodes]

Dedicated masters please
discovery.zen.minimum_master_nodes -> N/2 + 1 (N = number of master eligible nodes, N/2 rounded down)
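The quorum rule for discovery.zen.minimum_master_nodes can be sketched in a few lines of Python. This is only an illustration of the N/2 + 1 arithmetic; the function name is made up for this example.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum of master eligible nodes: N/2 (rounded down) + 1."""
    if master_eligible < 1:
        raise ValueError("at least one master eligible node is required")
    return master_eligible // 2 + 1


if __name__ == "__main__":
    # Print the quorum for a few cluster sizes.
    for nodes in (1, 2, 3, 5):
        print(nodes, "master eligible ->", minimum_master_nodes(nodes))
```

With three dedicated masters the quorum is two, so the cluster keeps a master when one of them is lost and avoids split brain.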

One big index is a no-go
Not scalable enough for time based data
Indexing slows down with time
Expensive merges
Delete by query needed for data retention

Daily indices are a good start
2016.11.18 2016.11.19 2016.11.22 2016.11.23 ...
Indexing is faster for smaller indices
Deletes are cheap: we delete whole indices
Search can be performed only on the indices that are needed
Static indices are cache friendly
[Diagram: indexing and most searches hit the newest index]
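Retention with daily indices boils down to comparing index names with a cutoff date. The sketch below assumes the logs_YYYY.MM.DD naming used later in this deck; the function names and the keep_days parameter are hypothetical.

```python
from datetime import date, datetime, timedelta


def daily_index_name(day: date, prefix: str = "logs_") -> str:
    """Build a daily index name such as logs_2016.11.22."""
    return f"{prefix}{day:%Y.%m.%d}"


def expired_indices(existing, today: date, keep_days: int, prefix: str = "logs_"):
    """Return the daily indices that are older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in existing:
        if not name.startswith(prefix):
            continue  # not one of our daily indices
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix is not a date, leave the index alone
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


if __name__ == "__main__":
    names = [daily_index_name(date(2016, 11, d)) for d in (18, 19, 22, 23)]
    print(expired_indices(names, today=date(2016, 11, 23), keep_days=2))
```

Each returned name can then be dropped as a whole index, which is much cheaper than a delete by query.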

Daily indices are sub-optimal
[Chart: load on Black Friday, Saturday and Sunday]
Load is not even

Size based indices are optimal
Size limit for indices: around 5 – 10GB per shard on AWS
[Diagram: indexing goes to logs_01; once the size limit is reached, logs_02 is created and takes over indexing, and so on up to logs_N]
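The size based scheme can be sketched as a small decision function: keep writing to the current index until its largest shard hits the limit, then move on to the next numbered index. The helper names and the way shard sizes are passed in are assumptions made for this example; in practice the sizes would come from the cluster's statistics.

```python
import re

GB = 1024 ** 3


def next_index_name(current: str) -> str:
    """logs_01 -> logs_02, keeping the zero padding of the counter."""
    match = re.fullmatch(r"(.*_)(\d+)", current)
    if match is None:
        raise ValueError(f"unexpected index name: {current!r}")
    prefix, number = match.groups()
    return f"{prefix}{int(number) + 1:0{len(number)}d}"


def write_index(current: str, shard_sizes_bytes, limit_bytes: int) -> str:
    """Return the index that should receive new documents."""
    if max(shard_sizes_bytes, default=0) >= limit_bytes:
        return next_index_name(current)
    return current


if __name__ == "__main__":
    print(write_index("logs_01", [4 * GB, 11 * GB], limit_bytes=10 * GB))
    print(write_index("logs_01", [3 * GB], limit_bytes=10 * GB))
```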

Slice using size
Predictable searching and indexing performance
Better index balancing
Fewer shards
Easier handling of spiky loads
Lower costs because of better hardware utilization

Proper Elasticsearch configuration
Keep index.refresh_interval at maximum possible value
1 sec -> 100%, 5 sec -> 125%, 30 sec -> 175% (relative indexing throughput)
You can loosen up merges
- possible because of heavy aggregation use
- segments_per_tier -> higher
- max_merge_at_once -> higher
- max_merged_segment -> lower
All prefixed with index.merge.policy
Result: higher indexing throughput
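As a rough illustration, the settings named above could be collected into one index settings body. The concrete values below (30s, 50, 50, 1gb) are placeholders chosen for this sketch, not recommendations from the talk, and the function name is hypothetical.

```python
import json


def indexing_settings(refresh_interval="30s",
                      segments_per_tier=50,
                      max_merge_at_once=50,
                      max_merged_segment="1gb"):
    """Settings body that trades search freshness for indexing throughput.

    All values are illustrative assumptions; tune them against real load.
    """
    return {
        "index.refresh_interval": refresh_interval,
        "index.merge.policy.segments_per_tier": segments_per_tier,
        "index.merge.policy.max_merge_at_once": max_merge_at_once,
        "index.merge.policy.max_merged_segment": max_merged_segment,
    }


if __name__ == "__main__":
    # The JSON printed here is what would be sent as the settings payload.
    print(json.dumps(indexing_settings(), indent=2))
```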

Proper Elasticsearch configuration
Index only needed fields
Use doc values
Do not index _source
Do not store _all

Optimization time
We can optimize data nodes for time based data

Hot – cold architecture
ES hot: -Enode.attr.tag=hot
ES cold: -Enode.attr.tag=cold
ES cold: -Enode.attr.tag=cold

logs_2016.11.22
curl -XPUT localhost:9200/logs_2016.11.22 -d '{
  "settings" : {
    "index.routing.allocation.exclude.tag" : "cold",
    "index.routing.allocation.include.tag" : "hot"
  }
}'

[Diagram: logs_2016.11.22 is indexed on the hot node; the next day logs_2016.11.23 is created on the hot node and takes over indexing]
Move index after day ends
curl -XPUT localhost:9200/logs_2016.11.22/_settings -d '{
  "index.routing.allocation.exclude.tag" : "hot",
  "index.routing.allocation.include.tag" : "cold"
}'
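The two request bodies differ only in which tag is included and which is excluded, so they can be generated from one helper. This is a sketch that only builds the JSON payloads; the function name is hypothetical and the attribute name tag follows the node settings shown earlier.

```python
import json


def allocation_settings(include: str, exclude: str, attribute: str = "tag"):
    """Allocation filter that pins an index to nodes with a given attribute value."""
    return {
        f"index.routing.allocation.include.{attribute}": include,
        f"index.routing.allocation.exclude.{attribute}": exclude,
    }


# Body used when the index is created on the hot tier.
create_body = {"settings": allocation_settings(include="hot", exclude="cold")}

# Body sent to the index's _settings endpoint after the day ends.
move_body = allocation_settings(include="cold", exclude="hot")

if __name__ == "__main__":
    print(json.dumps(create_body, indent=2))
    print(json.dumps(move_body, indent=2))
```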

[Diagram: logs_2016.11.22 now sits on a cold node while logs_2016.11.23 is indexed on the hot node; the same move repeats every day, so once logs_2016.11.24 takes over indexing, logs_2016.11.23 moves to a cold node as well]

Hot ES Tier
Good CPU
Lots of I/O
Cold ES Tier
Memory bound
Decent I/O

Hot – cold architecture summary
Optimize costs – different hardware for different tiers
Performance – use case optimized hardware
Isolation – long running searches don’t affect indexing

Elasticsearch client node needs
No data = no IOPS
Large query throughput = high CPU usage
Lots of results = high memory usage
Lots of concurrent queries = higher resource utilization

Elasticsearch ingest node needs
No data = no IOPS
Large index throughput = high CPU & memory usage
Complicated rules = high CPU usage
Larger documents = higher resource utilization

Elasticsearch master node needs
No data = no IOPS
Large number of indices = high CPU & memory usage
Complicated mappings = high memory usage
Daily indices = spikes in resource utilization

Focus: Centralized Buffer

Why Apache Kafka?
Fast & easy to use
Easy to scale
Fault tolerant and highly available
Supports streaming
Works in publish/subscribe mode

Kafka architecture
[Diagram: three ZooKeeper nodes and four Kafka nodes]

Kafka & topics
Kafka stores data in topics written on disk
Example topics: security_logs, access_logs, app1_logs, app2_logs

Kafka & topics & partitions & replicas
[Diagram: the logs topic split into partitions 1 to 4, each partition with its own replica]

Scaling Kafka
[Diagram: the logs topic grows from 1 partition to 4 partitions and then to 16 partitions]

Things to remember when using Kafka
Scales by adding more partitions, not threads
The more IOPS the better
Keep the # of consumers equal to the # of partitions
Replicas are used for HA and FT only
Offsets are stored per consumer – multiple destinations are easily possible
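Why the number of consumers should match the number of partitions can be shown with a toy model. The sketch below hands out partitions round robin inside one consumer group; it is a simplification written for this example and not Kafka's actual assignment code.

```python
def assign_partitions(partition_count: int, consumers):
    """Toy model: each partition is read by exactly one consumer of the group."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(partition_count):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


def idle_consumers(assignment):
    """Consumers that received no partition and therefore do no work."""
    return [consumer for consumer, parts in assignment.items() if not parts]


if __name__ == "__main__":
    six = [f"consumer-{i}" for i in range(1, 7)]
    print(idle_consumers(assign_partitions(4, six)))      # two consumers sit idle
    print(assign_partitions(4, six[:2]))                  # two partitions each
```

More consumers than partitions leaves the extra consumers idle, while fewer consumers means each one has to keep up with several partitions.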

Focus: Shipper

What about the shipper?
[Diagram: logs -> centralized buffer]
Which shipper to use?
Which protocol should be used?
What about the buffering?
Log to JSON or parse, and how?

Buffers
performance: batches & threads
availability: when central buffer is gone

Buffer types
Disk || memory || combined hybrid approach
On source || centralized
On source [Diagram: each App with its own Buffer, then ES]
- file or local log shipper
- easy scaling – fewer moving parts
- often with the use of lightweight shipper
Centralized [Diagram: Apps -> Kafka / Redis / Logstash / etc… -> ES]
- one place for all changes
- extra features made easy (like TTL)
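The combined (hybrid) buffer idea can be sketched as a bounded in-memory queue that spills to a file once it is full. This is a minimal single-threaded illustration, not how any particular shipper implements it; the class name, the JSON-lines spill format and the max_in_memory parameter are assumptions.

```python
import json
import os
from collections import deque


class HybridBuffer:
    """Keep events in memory and spill to disk when the memory limit is hit."""

    def __init__(self, spill_path: str, max_in_memory: int = 1000):
        self._memory = deque()
        self._max_in_memory = max_in_memory
        self._spill_path = spill_path

    def put(self, event: dict) -> None:
        if len(self._memory) < self._max_in_memory:
            self._memory.append(event)
            return
        # Memory is full (for example the destination is down): append to disk.
        with open(self._spill_path, "a", encoding="utf-8") as spill:
            spill.write(json.dumps(event) + "\n")

    def drain(self) -> list:
        """Return all buffered events, oldest first, and empty the buffer."""
        events = list(self._memory)
        self._memory.clear()
        if os.path.exists(self._spill_path):
            with open(self._spill_path, encoding="utf-8") as spill:
                events.extend(json.loads(line) for line in spill if line.strip())
            os.remove(self._spill_path)
        return events
```

A shipper built this way would call drain() once the destination is reachable again and send the events in batches.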

Buffers Summary
[Diagram: two setups side by side, labeled Simple (each App with its own Buffer, then ES) and Reliable (App, App, then ES)]

Protocols
UDP – fast, cool for the application, not reliable
TCP – reliable (almost): the application gets an ACK when data is written to the buffer
Application level ACKs may be needed
HTTP: Logstash, rsyslog, Fluentd
RELP: Logstash, rsyslog
Beats: Logstash, Filebeat
Kafka: Logstash, rsyslog, Filebeat, Fluentd

Choosing the shipper
[Diagram: application -> (socket) -> rsyslog with memory & disk assisted queues -> (http) -> Elasticsearch]
[Diagram, second build adds the labels: application, file, rsyslog, filebeat, consumer]

What about OS?
Say NO to swap
Set the right disk scheduler
- CFQ for spinning disks
- deadline for SSD
Use proper mount options for ext4
- noatime
- nodiratime
- data=writeback, nobarrier
For bare metal
- check CPU governor
- disable transparent huge pages
- /proc/sys/vm/nr_hugepages=0

We are engineers!
We develop DevOps tools!
We are DevOps people!
We do fun stuff ;)
http://sematext.com/jobs

Thank you for listening! Get in touch!
Rafał
rafal.kuc@sematext.com
@kucrafal
http://sematext.com
@sematext http://sematext.com/jobs
Come talk to us
at the booth
