Capital One Delivers Risk Insights in Real Time with Stream Processing

Jeff Sharpe and Ravi Dubey
Capital One Retail Bank
Confluent Online Talk
May 30, 2018

Ravi is a senior manager working for Capital One in Virginia. Ravi
has over 25 years of software development and management
experience across a range of products in support of government
and commercial industries. His most recent experience includes
full stack development of web apps, cloud-based enterprise-facing
support applications and a high-throughput, low-latency,
distributed cloud-hosted data processing platform.
Ravi Dubey
Senior Manager, Software Engineering, Capital One
Jeff is a senior software engineer working for Capital One in
Virginia. He’s been an engineer for almost 18 years, with major
projects spanning five different languages. Though he began his
work on kernel drivers and web applications, he’s been repeatedly
drawn into high volume, high throughput data processing
projects.
Jeff Sharpe
Senior Software Engineer, Capital One

Housekeeping Items
● This session will last about an hour.
● This session will be recorded.
● You can submit your questions by entering them into the GoToWebinar panel.
● The last 10-15 minutes will consist of Q&A.
● The slides and recording will be available after the talk.

Thanks…
• Bobby Calderwood
– @bobbycalderwood
– https://www.confluent.io/blog/author/bobby/
• Keith Gasser
– Keith.Gasser@capitalone.com

Real Time Decisioning Platform - Introduction
• Decisioning with ML models and rules, using low-latency
processing
• Streamed, batched, or micro-batched messages

Real Time Decisioning Platform - Introduction
[Figure: streaming “window”]

RT Decisioning Platform - Introduction
• High Speed Durable Message Bus – Apache Kafka
• Enterprise Data Sources – Streams, Databases, and
Warehouses
• ETL – Apache NiFi, Kafka Connect, Confluent Schema Registry
• Distributed Processing – Apache Flink and others
• Feature Caching – Apache Flink, Redis, Kafka Compacted
Topics
• Prometheus, Grafana – Metrics, Alert Management
• Supplemented with Cloud compute, RDBMS, and Caching
services
• Containerization – Docker and Kubernetes

RT Decisioning Platform - Kafka Messaging
• Durable, fast, and clustered Kafka topics act as data streams
for decisioning input and decision-scoring output
• DataStream window intervals correlate to Kafka topic retention
(log.retention.ms on the broker, retention.ms per topic),
typically between 30 and 180+ days
• DataStream objects are aggregated into cached features,
such as average daily balance for a specific account holder
• Ten brokers in total per AWS region, dozens of topics
• Producers include NiFi, Kafka Connect, external Streams
[Figure: a Producer (data source) writes messages to a Kafka topic. Each message carries a producer-maintained transaction ID plus payload; the IDs can arrive out of order, and the slide shows a run of IDs with several pairs swapped.]

[Figure: the same producer and Kafka topic, now consumed by Apache Flink. A DataStream structure holds the messages (sorts, aggregates, etc.) in ID order and writes results to Kafka compacted topics.]
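As a rough illustration of the flow in the diagram, the sketch below re-orders records by their producer-maintained transaction ID and folds them into a per-account average daily balance feature. It is plain Python rather than the Flink DataStream API, and the record fields (`txn_id`, `account`, `day`, `balance`) and function names are assumptions made for this example only.

```python
from collections import defaultdict
from statistics import mean

def sort_by_txn_id(records):
    """Re-order records that arrived out of order, using the producer-maintained ID."""
    return sorted(records, key=lambda r: r["txn_id"])

def average_daily_balance(records):
    """Average, per account, the last balance seen on each day."""
    last_balance = defaultdict(dict)  # account -> {day: closing balance}
    for r in sort_by_txn_id(records):
        last_balance[r["account"]][r["day"]] = r["balance"]
    return {acct: mean(days.values()) for acct, days in last_balance.items()}

# Hypothetical records; IDs 3 and 2 arrive swapped.
records = [
    {"txn_id": 1, "account": "A", "day": "2018-05-01", "balance": 100.0},
    {"txn_id": 3, "account": "A", "day": "2018-05-02", "balance": 300.0},
    {"txn_id": 2, "account": "A", "day": "2018-05-01", "balance": 200.0},
]
features = average_daily_balance(records)
```

Sorting first matters here: without it, the closing balance for a day would depend on arrival order rather than on transaction order.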

• Independent and Interdependent Decisioning
Patterns: Kafka decouples models and rules
• Downstream Topics Support Dependent Scoring
[Figure: an independent Model Consumer and a Rules Consumer each read the Source Topic (messages keyed by Producer-Defined ID, with payload) and write payload + model score and payload + rules score to downstream topics. A Dependent Model Consumer reads the rules output and emits payload + rules score + model score.]
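The independent and dependent patterns can be sketched with plain lists standing in for topics. Everything here is hypothetical: the scoring logic, the `amount` field, and the function names are invented for illustration and are not the platform's models or rules.

```python
def model_consumer(msg):
    """Independent model: reads the source topic and adds a model score."""
    return {**msg, "model_score": round(min(msg["amount"] / 1000.0, 1.0), 3)}

def rules_consumer(msg):
    """Independent rules: reads the source topic and adds a rules score."""
    return {**msg, "rules_score": 1 if msg["amount"] > 500 else 0}

def dependent_model_consumer(msg):
    """Dependent model: reads the rules output topic, so it sees the rules score."""
    score = 0.9 if msg["rules_score"] else 0.1
    return {**msg, "model_score": score}

# Lists stand in for Kafka topics; "id" is the producer-defined ID.
source_topic = [{"id": 3, "amount": 750}, {"id": 5, "amount": 120}]
model_topic = [model_consumer(m) for m in source_topic]
rules_topic = [rules_consumer(m) for m in source_topic]
dependent_topic = [dependent_model_consumer(m) for m in rules_topic]
```

The two independent consumers never see each other's output, while the dependent consumer is coupled only to a topic, not to the rules implementation that feeds it.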

Enterprise Compliance: Image Rehydration
• Cloud VM Machine Images require periodic update
• RT Platform stack has 100+ distinct containers – underlying
image rehydration best handled with an abstraction layer
• Simple Blue-Green approaches can work for stateless
components, BUT…
• Network Storage and other Disk Volumes add complexity for
stateful components such as Kafka Brokers
• Kafka Clustering provides fault tolerance and failover during
rehydration, though we needed a solution to manage Kafka
logs mounted on Cloud Storage
Storage mount points broken
during instance recreation

Kubernetes
• Kubernetes (k8s) is OSS that manages container lifecycle,
addressing, and networking among other things
• Scheduler “moves” both Pods and associated storage volumes
defined in Stateful Sets in coordination between VM nodes
enabling clean rolling rehydration of Kafka Brokers
• Services allow Kafka Brokers and Kafka Connect to be accessed
by a logical service name by all platform components.
• Software Networking enables single TLS solution between all
components, common DNS, and integrated cloud Load
Balancing
• For external access to Kafka on the RT Platform, we refresh the
external DNS mapping of IP to common name at a configurable
interval (20 sec)

Kafka Considerations – Cluster 1
• RT Platform hosts all containers on instance types… 150GB RAM,
40 Cores, 10Gb network performance. Good for most stack
components
– Instance node affinity set so each node runs at most one Kafka broker and at most one ZK node.
– ZooKeeper cluster shared with other RT Platform components
– In AWS, st1 (throughput-optimized HDD) EBS volumes suit Kafka's
sequential write pattern
• Brokers increase demand on instance and platform shared
resources
– Platform Zookeeper state
– Instance OS open files
– Instance RAM
– Instance Network Access
– Instance Storage IO

• Kafka Brokers utilize RAM, including Java heap and page cache,
in proportion to the size of topics.
• Replication Factor of 3 means three times the disk space consumed
Deeper Topics = More Disk Space, More Page Cache RAM
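A back-of-the-envelope sizing sketch makes the replication arithmetic concrete. The ingest rate and retention below are invented numbers, and the calculation ignores compression, index files, and uneven partition placement.

```python
def cluster_disk_bytes(bytes_per_day, retention_days, replication_factor):
    """Total log bytes across the cluster: every message is stored once per replica."""
    return bytes_per_day * retention_days * replication_factor

GIB = 1024 ** 3

# Hypothetical topic: 10 GiB/day ingested, 180-day retention, replication factor 3.
total = cluster_disk_bytes(10 * GIB, 180, 3)

# Spread evenly over ten brokers (an idealization).
per_broker = total / 10
```

With these assumed inputs the cluster holds 5400 GiB of log data, or 540 GiB per broker, three times what a single unreplicated copy would need.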

[Charts: Kubernetes Pod Memory Usage; EC2 Node Memory Usage]

Larger (m4.10xlarge, n1-standard-32, n1-highmem-32)
instance/machine types: Faster network speeds, 100+ GB of RAM,
30+ cores, noisier neighbors competing for RAM, Network IO, “Blast
Radius”
[Figure: large nodes, each hosting a Kafka broker (Kafka), a ZooKeeper node (Z), and many other containers (C) that share TLS and IO.]

Smaller instance/machine types (m4.2xlarge, n1-highmem-4,
standard-8), dedicated ZK, single broker node affinity, Connect, and/or
Schema Registry. Tradeoff: risk, predictability, simplicity vs. faster
networking and high-end CPU
[Figure: the large shared nodes (Kafka, Z, and many C containers) plus small dedicated nodes, each hosting Kafka together with Z, KC, and/or SR.]

Kafka Real-Time Upgrades
• RT Platform supports multiple active tenants, so
uniform downtime during version upgrades is not
usually an option.
• Rolling upgrades potentially pose compatibility risks
between Kafka versions.

1- Green Cluster provisioned and Topic Offsets
captured
[Figure: the Producer writes to Kafka1Svc; each topic offset is captured.]

2- Tooling Backfills new Topics
• Depending on desired window size, tooling may be used to
backfill data for topics on new clusters, respecting timestamps
for a consistent retention policy.
• Possible Candidate Process for Mirroring
[Figure: backfill tooling (possibly mirroring) copies messages from Kafka1Svc to the new cluster while the Producer keeps writing to Kafka1Svc.]

3- Producer flows set to load second Kafka cluster as
required
• Producers reference newly upgraded Kafka Clusters by
new k8s service name and upgrade to new cluster
independently
[Figure: the Producer writes to Kafka2Svc alongside Kafka1Svc; one of the two topics shows message 14 twice.]

Kafka Real-Time Upgrades- Consequences
• Overlaps between 2) and 3) are likely to create
duplicates (better than gaps)
• If downstream state based on the original cluster, or the
original offsets, are not preserved, all messages in the
window may need to be replayed to recover
[Figure: the two topics after cut-over; one shows message 14 twice.]
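Since duplicates are the expected side effect of the overlap, consumers can drop them by tracking producer-defined IDs. The sketch below is one simple way to do that, not the tooling used on the platform; the `deduplicate` helper and the message shape are assumptions, and the unbounded `seen` set would need to be limited to the retention window in practice.

```python
def deduplicate(messages, seen=None):
    """Yield each producer-defined ID once, dropping replays from the overlap."""
    seen = set() if seen is None else seen
    for msg in messages:
        if msg["id"] in seen:
            continue
        seen.add(msg["id"])
        yield msg

# A hypothetical topic after cut-over, in arrival order, with ID 14 present twice.
new_cluster = [{"id": i} for i in (7, 5, 8, 10, 9, 11, 12, 13, 14, 14)]
unique_ids = [m["id"] for m in deduplicate(new_cluster)]
```

Passing the same `seen` set across calls lets a consumer carry its deduplication state from the old cluster's topic to the new one.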

Kafka Across Regions
• Regional Clusters
• Why Do This?
– Partitioned Strategy
• Active-Active
• Latency or Partition Routed, Increased
Performance and Efficiency
– Disaster Recovery
• Active-Passive, Active-Active
• Redundantly Constructed and Routed,
Increased Reliability
• Issues
– Syncing Data
– Latency
• Inefficient Operation Across Great
Distance
• Kafka Cluster Replication not
recommended

Kafka Across Regions – Data Syncing Options
• Duplicate Common Upstream Sources
• Producer-Driven Replication
• Mirroring
• Mirroring + Consolidation

Common Upstream
• Local Producers use Common Source
• 2 Topics Represent 1 Logical Topic
• Pros
• Fewest Number of Topics
• Consumer behavior minimally impacted
• Cons
• Each Local Producer needs to know about Each Regional
Deployment

[Figure: Common Upstream. Producers in Region A and Region B each pull (ETL) from a common upstream source and write to a local topic. The 2 topics represent 1 logical set of messages, read by consumers in each region.]

Producer-Driven Replication
• Producers maintain Topic consistency across multiple
regions
• 2 Topics Represent 1 Logical Topic, Clusters
• Pros
• Fewest Number of Topics
• Cons
• Each Producer needs to know about Each Regional
Deployment
• Failure strategy, Reliability Tracking, SLA, etc. must be
implemented by each Producer, likely using shadow topics
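The burden this places on each producer can be sketched as follows. This is one possible reading of "shadow topics" (a place to park writes that failed for a region so they can be replayed later); the slides do not define the mechanism, and `RegionalCluster`, `replicate`, and `replay` are hypothetical names with in-memory lists standing in for Kafka.

```python
class RegionalCluster:
    """Stand-in for one region's Kafka cluster; `available` simulates an outage."""
    def __init__(self, name):
        self.name = name
        self.available = True
        self.topic = []

    def send(self, msg):
        if not self.available:
            raise ConnectionError(f"{self.name} unreachable")
        self.topic.append(msg)

def replicate(msg, clusters, shadow_topic):
    """Write msg to every region; record failed deliveries on the shadow topic."""
    for cluster in clusters:
        try:
            cluster.send(msg)
        except ConnectionError:
            shadow_topic.append({"region": cluster.name, "msg": msg})

def replay(shadow_topic, clusters_by_name):
    """Retry shadowed messages; return those whose region is still unavailable."""
    remaining = []
    for entry in shadow_topic:
        try:
            clusters_by_name[entry["region"]].send(entry["msg"])
        except ConnectionError:
            remaining.append(entry)
    return remaining

# Usage: Region B is down for the first write, then recovers.
a, b = RegionalCluster("A"), RegionalCluster("B")
shadow = []
b.available = False
replicate({"id": 1}, [a, b], shadow)
pending_during_outage = len(shadow)
b.available = True
shadow = replay(shadow, {"A": a, "B": b})
```

Even this toy version needs retry and bookkeeping logic in every producer, which is the cost the "Cons" list points at.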

[Figure: Producer-Driven Replication. Producers in Region A and Region B write both A-routed and B-routed data to Topic AB and Topic BA, with a shadow topic in each region; consumers read in their own region.]

Mirroring
• Tooling Automatically Replicates Topics
• Confluent Replicator (Licensed)
• Mirror Maker, uReplicator (OSS)
• Pros
• Producer behavior minimally impacted
• Cons
• Each Consumer needs to know about Each Replicated
Topic
• Complexity–More Topics

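[Figure: Mirroring. Producers write Topic A in Region A and Topic B in Region B; a mirror copies them across regions as Topic A’ and Topic B’, and consumers in each region read the local topic and the mirrored copy.]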

Mirroring + Consolidation
• Tooling Automatically Replicates Topics
• Additional Tooling merges Topics for Consumers
• ETL Tooling, NiFi, etc.
• Kafka Connect
• Pros
• Producer behavior minimally impacted
• Cons
• Custom tooling must implement failure strategy, reliability
tracking, etc.
• Complexity– Lots More Topics, flow logic, and associated
resource consumption
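The consolidation step amounts to merging a local topic with a mirrored one. A minimal sketch, assuming each message carries a `ts` timestamp field and each input is already ordered by it (both assumptions of this example, not facts from the talk):

```python
import heapq

def consolidate(local_topic, mirrored_topic):
    """Merge a local topic and a mirrored topic into one timestamp-ordered stream."""
    return list(heapq.merge(local_topic, mirrored_topic, key=lambda m: m["ts"]))

# Lists stand in for Topic A and the mirrored copy of Topic B.
topic_a = [{"ts": 1, "origin": "A"}, {"ts": 4, "origin": "A"}]
topic_b_mirror = [{"ts": 2, "origin": "B"}, {"ts": 3, "origin": "B"}]
topic_ab = consolidate(topic_a, topic_b_mirror)
```

A real flow would also have to handle mirror lag and failures, which is where the custom failure strategy and reliability tracking come in.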

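[Figure: Mirroring + Consolidation. Topic A (Region A) and Topic B (Region B) are mirrored across regions as Topic A’ and Topic B’. ETL in each region merges the local and mirrored topics into a consolidated topic (Topic AB, Topic BA) that consumers read.]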

So What Do We Use?
• Multiple Tenant Use Cases and Risk
Tolerances
• Combination of Solutions
– Common Upstream
– Confluent Replication

Kafka – Moving Forward
• Exactly Once Semantics/Transactionality
• Hyper Partitioning
• Alternate Backends to Support Indefinite
Retention (S3, etc.)

Kafka for Real Time Bank Decisions
Handling Private Information
Real-Time Request and Response

Handling PII (not) on Kafka
Goal:
Remove the possibility of exposing PII

Encrypted Volume: Simple & Effective
[Figure: tokenization replaces sensitive fields before the Producer publishes (Library Card# 8675309 becomes Library Card# TOK:113581321). Kafka persists the topic on storage with encryption, and Consumers read the tokenized data.]
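A toy version of the tokenization step is sketched below. The talk does not say how tokens are generated, so the HMAC-based scheme, the `Tokenizer` class, and the in-memory vault are all assumptions for illustration; a production system would use a hardened tokenization service.

```python
import hashlib
import hmac

class Tokenizer:
    """Toy token vault: deterministic tokens, reversible only via the vault."""
    def __init__(self, secret: bytes):
        self._secret = secret
        self._vault = {}  # token -> original value; this would live outside Kafka

    def tokenize(self, value: str) -> str:
        digest = hmac.new(self._secret, value.encode(), hashlib.sha256).hexdigest()
        token = "TOK:" + digest[:12]
        self._vault[token] = value
        return token

    def detokenize(self, token: str) -> str:
        return self._vault[token]

tokenizer = Tokenizer(secret=b"example-only-secret")
record = {"library_card": "8675309", "event": "checkout"}

# Only the tokenized record is handed to the producer.
safe_record = {**record, "library_card": tokenizer.tokenize(record["library_card"])}
```

Because the token is deterministic, consumers can still join and aggregate on the field without ever seeing the original value.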

Encrypted Volume:
Following the Path of Least Resistance
Good
• Highly durable across Kafka
restarts
• Simple disaster recovery
planning
• Follows recommended
Kafka configuration
practices
Not So Good
• Information privacy
regulations require extra
levels of protection
• Durability is based on
additional storage volumes
being managed with the
Kafka service

Volatile Storage: Performance & Privacy
[Figure: the same tokenizing Producer and Consumers, but Kafka topic persistence sits on tmpfs storage; the initial state is copied from storage on startup.]

Volatile Storage: Strange Trade-offs
Improvements
• Noticeably better
performance
• Data is always “in flight”, so
extra encryption shouldn’t
be needed
• Effectively stateless images
Complications
• Needs scripting to bootstrap
• Topic contents are cleared
on host reboot
• Zookeeper won’t be able to
manage offsets between
reboots

Volatile Storage: Why We Aren’t Using It
• We need long-term storage of data and RAM is already a
precious resource.
• Our recovery strategy is built on Kafka as our state storage
mechanism. Losing that state complicates recovery efforts.
• Host disk caching gives us most of the benefit of volatile
storage.

Request-Response Pattern
/rəˈkwest rəˈspans ˈpadərn/
noun
1. A pattern of interaction with a remote service where the
local task submits a request for remote work and
expects a response before continuing work.
2. A specialized use of Kafka using dedicated topic pairs
to communicate with a shared service

Request Response Basics
1. Initialize Consumer
2. Initialize Producer
3. Prepare Data
4. Assign a unique ID
5. Put request on request topic
(Service does work, and builds a response with the Request ID)
6. Read Response topic until Request ID is seen
[Figure: the Application's Producer writes ID: 14159-26535 to the Request Topic; its Consumer scans the Response Topic (IDs 14159-26532, 14159-26531, 14159-26533, 14159-26535, 14159-26536) for the matching ID.]
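The numbered steps can be sketched with lists standing in for the request and response topics. No Kafka client is used; `service`, `request_response`, and the message fields are hypothetical, and the service runs inline instead of remotely.

```python
import uuid

def service(request_topic, response_topic):
    """Stand-in service: answers every pending request, echoing its request ID."""
    while request_topic:
        req = request_topic.pop(0)
        response_topic.append({"id": req["id"], "result": req["data"].upper()})

def request_response(data, request_topic, response_topic, run_service):
    # Steps 1-2: with a real client, the consumer and producer would be
    # created here, with the consumer ready before the request is sent.
    request_id = str(uuid.uuid4())                           # step 4: unique ID
    request_topic.append({"id": request_id, "data": data})   # steps 3 and 5
    run_service()                                            # remote work happens
    for response in response_topic:                          # step 6: scan for our ID
        if response["id"] == request_id:
            return response["result"]
    raise TimeoutError(f"no response for request {request_id}")

# The response topic is shared, so it already holds someone else's response.
requests, responses = [], [{"id": "someone-else", "result": "IGNORED"}]
answer = request_response("score me", requests, responses,
                          lambda: service(requests, responses))
```

The scan in step 6 is the part that makes this pattern awkward: every caller reads responses meant for others until its own ID appears.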

How Request-Response Feels
[Figure: the Application sends Data directly to the Service.]

How Request-Response Actually Works
[Figure: the Application's Data passes through a Request Topic and a Response Topic, each shared with many other messages.]

When Failures Occur
[Figure: the same flow through the Request Topic and Response Topic, with some responses missing.]

The Slow Failure Problem
Failures

The Request-Response Pattern
This is actually the
“Background Job” pattern:
1. Submit Job
2. Get assigned a Job ID
3. Poll the service until the Job ID is
marked as complete
4. Retrieve the results of the job
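A minimal sketch of the four steps, with a deadline added so that a job that never completes is reported as a timeout instead of blocking forever. `JobService` is a toy in-memory stand-in and `run_job` is a hypothetical helper; neither comes from the talk.

```python
import time

class JobService:
    """Toy service: a job completes after a fixed number of status polls."""
    def __init__(self, polls_until_done):
        self._polls_until_done = polls_until_done
        self._jobs = {}

    def submit(self, payload):
        job_id = len(self._jobs) + 1             # 2. get assigned a job ID
        self._jobs[job_id] = {"polls": 0, "payload": payload}
        return job_id

    def is_complete(self, job_id):
        job = self._jobs[job_id]
        job["polls"] += 1
        return job["polls"] >= self._polls_until_done

    def result(self, job_id):
        return self._jobs[job_id]["payload"][::-1]

def run_job(service, payload, timeout_s=1.0, interval_s=0.01):
    job_id = service.submit(payload)             # 1. submit job
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:           # 3. poll until complete
        if service.is_complete(job_id):
            return service.result(job_id)        # 4. retrieve the results
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} not complete after {timeout_s}s")
```

For example, `run_job(JobService(3), "abc")` returns on the third poll, while a job that never finishes only surfaces as a `TimeoutError` once the deadline passes, so the caller learns of the failure late.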

Request Response: Serverless Considerations
• Try to reuse Producers and Consumers
• Explicitly assign Consumer partitions
• Attempt to read from the Consumer before
submitting to the Producer
• Remember to commit offsets before sending
responses

Slightly Better: The Real-Time Tap Pattern
[Figure: the Application sends a Request over REST, gRPC, etc. to a Real Time Service, which reads the request, processes it, and sends the Response. A Processing Service consumes the Input Topic and delivers Precomputed Values to the Real Time Service.]

Real-Time Tap Pattern
• Real-time request is handled by a session-based
protocol
• Resilient data processing is handled by Kafka
• Failures are reported when they happen via
real-time protocol
• Kafka interactions can be optimized by the handler
service, rather than relying on clients
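The split of responsibilities can be sketched as a small service that is fed from a topic but answers requests synchronously. The class, the message fields, and the HTTP-style status codes are assumptions for this example; a real handler would sit behind REST or gRPC and consume from Kafka.

```python
class RealTimeService:
    """Serves requests synchronously from values precomputed off a Kafka topic."""
    def __init__(self):
        self._precomputed = {}

    def consume(self, input_topic):
        """Processing side: fold input-topic messages into precomputed values."""
        for msg in input_topic:
            self._precomputed[msg["account"]] = msg["score"]

    def handle(self, account):
        """Request side: answer immediately, or fail immediately."""
        if account not in self._precomputed:
            return {"status": 404, "error": f"no score for {account}"}
        return {"status": 200, "score": self._precomputed[account]}

svc = RealTimeService()
svc.consume([{"account": "A", "score": 0.12}, {"account": "A", "score": 0.4}])
ok = svc.handle("A")
missing = svc.handle("B")
```

Unlike the topic-pair approach, the caller gets an explicit error as soon as the request cannot be served, and only the service needs to know how to talk to Kafka.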
