Kafka elastic search meetup 09242018

Streaming DynamoDB Changelog to Elasticsearch
Ying Xu - Core Streaming Team
Dan Fan - Core Datastores Team

Agenda
● DynamoDB -> Elasticsearch use case
● DynamoDB changelog to Elasticsearch -- deep dive
● Elasticsearch cluster management at Lyft
● Summary

What happens in the backend when extending a ride pass package?
[Diagram labels: UI, HTTP PUT, Update, HTTP GET, DynamoDB streams, changelog data ingestion system]

Streaming DynamoDB changelog to Elasticsearch

Requirements on DynamoDB changelog data ingestion
• Real-time data
‒ Data available for downstream consumption within seconds
• High durability
‒ Data loss implies inconsistency at the sink
• Strong ordering
‒ Out-of-order implies inconsistency at the sink
• Heterogeneity
‒ Distinct data sources/sinks
• Highly available, fault-tolerant
‒ Self-recover from occasional failures
• Real-time transformation
‒ Ex: JSON->protobuf

Data ordering and consistency
[Diagram: three versions of a record: v1 (EXP: 09/18), v2 (EXP: 10/18), v3 (EXP: 11/18)]
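A minimal sketch of why strong ordering matters, assuming a last-write-wins sink and reusing the three record versions from the diagram. The key `pass-1` and the function `apply_changelog` are hypothetical names for illustration, not part of the pipeline described in the talk.

```python
def apply_changelog(events):
    """Apply (key, version, expiry) updates in arrival order; the last write wins."""
    sink = {}
    for key, version, expiry in events:
        sink[key] = (version, expiry)
    return sink


# The three versions of one record, in the order they were written at the source.
in_order = [
    ("pass-1", "v1", "09/18"),
    ("pass-1", "v2", "10/18"),
    ("pass-1", "v3", "11/18"),
]

# The same events, but v2 arrives after v3.
out_of_order = [in_order[0], in_order[2], in_order[1]]

print(apply_changelog(in_order))      # sink ends on v3, matching the source
print(apply_changelog(out_of_order))  # sink ends on v2, a stale expiry
```

With the reordered input the sink finishes on the stale v2 record, which is the inconsistency the requirement is meant to rule out.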

Overview of E2E data pipeline
[Diagram: DynamoDB (tableA) -> DynamoDB streams -> Flink job (dynamostreams->kafka): Dynamostreams flink connector, map (JSON, PROTO), Flink kafka connector -> Kafka cluster (cdc2es-tableA) -> Flink job (kafka->Elasticsearch): Flink kafka connector, map, Flink Elasticsearch connector -> Elasticsearch cluster. Label: NEW DEVELOPMENT]

Kafka -- distributed event log
• Apache Kafka: state-of-the-art pubsub technology
‒ High durability and strong ordering guarantee
‒ Excellent fanout
‣ No hard limitation on number of consumer groups
‒ High throughput and low latency
‒ Multi-language client support
‣ Python/Go (librdkafka)
‣ Java (native or flink kafka connector)
‒ Mature technology with wide adoption

Running Kafka on AWS
• Confluent cloud
‒ Managed Kafka clusters running on AWS
‣ VPC peered with Lyft AWS account
‣ High availability, multi-AZ config
‣ SASL authentication
‣ Encryption on the wire & at rest
‒ Monitoring and control
‣ Cloud portal with monitoring/control panel
‣ Broker-side metrics**
‒ General SLOs
‣ Uptime: 99.95%
‣ Message success rate: 99.99%
‣ Write latency: p95 < 50ms
‒ Changelog data retention: 4 days
** on support roadmap

Flink based data pipeline
• Apache Flink: modern stream compute framework
‒ Per-event based stream processing
‒ Flexible API unifying stream and batch processing
‒ State persisted through checkpoints/savepoints (fault tolerance)
‒ Multi-language (Python/Go) support through the Apache Beam framework
• Flink at Lyft
‒ Tooling for job lifecycle management
‒ Flink jobs running as services
‒ Distributed execution -- standalone flink cluster
Graph from flink website: https://ci.apache.org/projects/flink/flink-docs-release-1.6/concepts/runtime.html

E2E data pipeline -- recap
Upstream flink job: DynamoDB changelogs -> Kafka
Downstream flink job: Kafka -> Elasticsearch
[Diagram, repeated from the overview slide: DynamoDB (tableA) -> DynamoDB streams -> Flink job (dynamostreams->kafka) -> Kafka cluster (cdc2es-tableA) -> Flink job (kafka->Elasticsearch) -> Elasticsearch cluster. Label: NEW DEVELOPMENT]

Flink DynamoDB streams connector
• DynamoDB streams
‒ Changelog self-scaling along with DynamoDB
‒ Exactly-once delivery of change operations
‣ Event type
‣ Primary key
‣ Old and/or new images of the record
‒ Special Kinesis stream
• Cannot directly use existing Flink-Kinesis connector
[Figure: Two ways to interact with DynamoDB streams]

Flink DynamoDB streams connector
• NEW Flink source connector developed at Lyft
‒ Integrates the official DynamoDB Streams Kinesis Adapter
‒ Adapts the necessary Kinesis client methods
‣ describeStream
‣ getRecords
‣ getShardIterator
‒ Reuses the Flink-Kinesis connector's state management and checkpoint mechanism
• Other adjustments
‒ Handle DynamoDB streams shardID format
‒ Handle AWS DynamoDB local container compatibility
[Diagram labels: DynamoDB, DynamoDB streams, Dynamostreams low-level API, Dynamo Kinesis adapter, Kinesis protocol handler, Flink DynamoDB streams connector]

Sync Changelog From Kafka to Elasticsearch

● Overview of Kafka -> Elasticsearch flink job
● How to handle 429 Too Many Requests
● How to address access control issue
● How to achieve seamless pipeline migration & ES upgrade

Overview of Kafka -> Elasticsearch Flink Job
● Elasticsearch @ Lyft
○ All clusters are AWS-managed Elasticsearch clusters
○ Orchestrated by salt
○ One ES cluster per service for full isolation
● Why not use the open source Flink Elasticsearch connector?
○ Lyft uses AWS-managed Elasticsearch clusters
○ The open source connector is based on the ES transport client
○ The transport client is not allowed by AWS-managed clusters

[Diagram: Kafka cluster (Kafka topic) -> Flink job: Flink Kafka consumer -> map (Kafka records to Elasticsearch actions) -> Jest HTTP client -> bulk request -> Elasticsearch]
Bulk request sent every X documents or every Y seconds, whichever comes first
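The "X documents or every Y seconds, whichever comes first" rule can be sketched as a small buffer. This is an illustration only: the class `BulkBuffer`, its parameters, and the injected `flush_fn` are hypothetical stand-ins for the job's HTTP bulk call, not the actual Lyft connector or the Jest client API.

```python
import time


class BulkBuffer:
    """Collect actions and flush on a size or age threshold, whichever comes first."""

    def __init__(self, flush_fn, max_actions, max_interval_s, clock=time.monotonic):
        self.flush_fn = flush_fn              # hypothetical: sends one bulk request
        self.max_actions = max_actions        # the "X documents" threshold
        self.max_interval_s = max_interval_s  # the "Y seconds" threshold
        self.clock = clock
        self.actions = []
        self.last_flush = clock()

    def add(self, action):
        self.actions.append(action)
        self.maybe_flush()

    def maybe_flush(self):
        too_many = len(self.actions) >= self.max_actions
        too_old = self.clock() - self.last_flush >= self.max_interval_s
        if self.actions and (too_many or too_old):
            # Send a copy first; clear only after the send returns, so a
            # failed send leaves the actions in the buffer.
            self.flush_fn(list(self.actions))
            self.actions.clear()
            self.last_flush = self.clock()
```

A real job would also need a timer that calls `maybe_flush()` while no records are arriving; the sketch leaves that out.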

● Quick overview of Kafka -> Elasticsearch sink
● How to handle 429 Too Many Requests
● How to address access control issue
● How to achieve seamless pipeline migration & ES upgrade

Things may not always go right ...
Pic © Copyright 2018. From @mikeleeorg.

Recall the requirements ...
● Zero data loss
● Strong ordering
● Duplication is fine
Delay is bad, but still better than being wrong
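A minimal sketch of why duplication is tolerable, assuming each changelog event carries the full new image of the record and is written to the sink under its document id, so a write overwrites whatever was there. The `replay` function and the sample log are hypothetical.

```python
def replay(sink, log, start_offset):
    """Re-apply the log from start_offset; each write overwrites the doc by id."""
    for doc_id, doc in log[start_offset:]:
        sink[doc_id] = doc
    return sink


log = [("x", {"value": 5}), ("x", {"value": 6}), ("y", {"value": 1})]

# Clean run: every event is applied exactly once.
exactly_once = replay({}, log, 0)

# Failure run: all events were applied, but the last checkpoint was at
# offset 1, so events 1 and 2 are delivered a second time after restart.
with_duplicates = replay(replay({}, log, 0), log, 1)

print(exactly_once == with_duplicates)
```

Because the replay preserves order and every write is a full overwrite, the duplicated events leave the sink in the same final state.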

How to handle 429 Too Many Requests?
● Retry with exponential backoff
● No checkpointing until success
● Replay from the last checkpoint when an exception is thrown
[Diagram: the Flink job loads "Update x = 5" and "Update x = 6" from the Kafka topic and syncs them to Elasticsearch. On a 200 response the job checkpoints; on 429 Too Many Requests there is no checkpointing and the job syncs to ES again.]
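The retry policy above can be sketched as follows. This is a simplified illustration, not the connector's code: `send_bulk` is a hypothetical callable that returns an HTTP status code, and the retry count and base delay are made-up defaults. The point is that the function only returns on success, so the caller never checkpoints past an unsent batch, and that exhausting the retries raises so the job can restart from its last checkpoint.

```python
import time


class TooManyRequests(Exception):
    """Raised when the sink keeps answering 429 after all retries."""


def send_with_backoff(send_bulk, batch, max_retries=5, base_delay_s=0.5,
                      sleep=time.sleep):
    """Send batch, retrying on 429 with exponentially growing delays.

    Returns the number of retries that were needed. Raising instead of
    returning signals the caller to fail the job and replay from the
    last checkpoint.
    """
    for attempt in range(max_retries + 1):
        status = send_bulk(batch)
        if status == 200:
            return attempt
        if status != 429:
            raise RuntimeError(f"unexpected status {status}")
        if attempt < max_retries:
            sleep(base_delay_s * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise TooManyRequests(f"still throttled after {max_retries} retries")
```

Injecting `sleep` keeps the sketch testable without real waiting; a production version would likely also cap the delay and add jitter.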

VPC & security group for access control
VPC / Security Group / Elasticsearch Config
● Virtual private cloud
● Fully configurable
● Define inbound and outbound policies
● Under Lyft VPC
● Dedicated security group

Access Control with AWS Managed ES Cluster
Create a security group:
    Ensure AWS sg exists:
      ….
      ….
      - rules:
        - ip_protocol: tcp
          from_port: 443
          to_port: 443
          source_group_name:
            - coupon-service-iad
            - {{other services}}
      - vpc_id:
Add the security group for ES:
    Elasticsearch Cluster Config
      …
      …
      - VPCOptions:
        - SubnetIds:
        - SecurityGroupIds:
          - security_group_id
[Diagram: the Security Group Id links the two configs]

Access control - VPC & security group
● Benefits
○ More secure
○ Faster development and debugging
○ Centralize the access permissions in one place
● More restrictions - IAM policy
○ For example: readonly

Elasticsearch upgrade/migration
● Why not upgrade in place
○ Backward incompatibility
○ Encryption at rest
● Challenges:
○ No service downtime
○ Backfilling ES could be time consuming

[Diagram labels: web service, migration service, Dynamo, Dynamo changelog, old pipeline, Elasticsearch, Kafka cluster (buffers the changelog), Flink job, Elasticsearch, Flink job]
Conclusions and lessons learned
• Kafka as event storage is essential for
‒ Data ingestion with high durability, low latency & strong ordering guarantee
• Flink connector is essential for
‒ Zero data loss
‒ Easy recovery from failure and ES degradations
• AWS managed Elasticsearch Cluster
‒ Trades a little flexibility for simplicity, scalability & better security
‒ Backward incompatibility is a pain
‒ Seamless migration is an important factor to consider for pipeline design

Thank You!
Q&A
We are hiring!
lyft.com/careers
