Confidential
Capital One Delivers Risk Insights in
Real Time with Stream Processing
Jeff Sharpe and Ravi Dubey
Capital One Retail Bank
Confluent Online Talk
May 30, 2018
Ravi is a senior manager working for Capital One in Virginia. Ravi
has over 25 years of software development and management
experience across a range of products in support of government
and commercial industries. His most recent experience includes
full stack development of web apps, cloud-based enterprise-facing
support applications and a high-throughput, low-latency,
distributed cloud-hosted data processing platform.
Ravi Dubey
Senior Manager, Software Engineering, Capital One
Jeff is a senior software engineer working for Capital One in
Virginia. He’s been an engineer for almost 18 years, with major
projects spanning five different languages. Though he began his
work on kernel drivers and web applications, he’s been repeatedly
drawn into high volume, high throughput data processing
projects.
Jeff Sharpe
Senior Software Engineer, Capital One
Housekeeping Items
● This session will last about an hour.
● This session will be recorded.
● You can submit your questions by entering them into the GoToWebinar panel.
● The last 10-15 minutes will consist of Q&A.
● The slides and recording will be available after the talk.
Thanks…
• Bobby Calderwood
– @bobbycalderwood
– https://www.confluent.io/blog/author/bobby/
• Keith Gasser
– Keith.Gasser@capitalone.com
Real Time Decisioning Platform - Introduction
• Decisioning with ML models and rules, using low-latency
processing
• Streamed, batched, or micro-batched messages
Real Time Decisioning Platform - Introduction
Streaming “Window”
RT Decisioning Platform - Introduction
• High Speed Durable Message Bus – Apache Kafka
• Enterprise Data Sources – Streams, Databases, and
Warehouses
• ETL – Apache NiFi, Kafka Connect, Confluent Schema Registry
• Distributed Processing – Apache Flink and others
• Feature Caching – Apache Flink, Redis, Kafka Compacted
Topics
• Prometheus, Grafana – Metrics, Alert Management
• Supplemented with Cloud compute, RDBMS, and Caching
services
• Containerization – Docker and Kubernetes
RT Decisioning Platform - Kafka Messaging
• Durable, fast, clustered Kafka topics act as data streams
for decisioning input and decision scoring output
• DataStream window intervals correlate to Kafka Topic
log.retention.ms, typically between 30 and 180+ days
• DataStream objects are aggregated into cached features,
such as average daily balance for a specific account holder
• Ten brokers in total per AWS region, dozens of topics
• Producers include NiFi, Kafka Connect, external Streams
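The window and feature bullets above can be illustrated with a small sketch. It is not the platform's Flink code: the event shape `(account_id, day, end_of_day_balance)` and the function names are assumptions made for illustration. It shows how a window length in days maps to a `log.retention.ms` value, and how a cached feature such as average daily balance could be derived from the events inside that window.

```python
from collections import defaultdict
from datetime import date, timedelta

MS_PER_DAY = 24 * 60 * 60 * 1000


def retention_ms(days):
    """Convert a window length in days into a log.retention.ms value."""
    return days * MS_PER_DAY


def average_daily_balance(events, window_days, today):
    """Average the end-of-day balances that fall inside the window.

    events: iterable of (account_id, day, end_of_day_balance) tuples,
    a hypothetical message shape chosen for this sketch.
    """
    cutoff = today - timedelta(days=window_days)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for account_id, day, balance in events:
        if cutoff < day <= today:  # ignore anything older than the window
            totals[account_id] += balance
            counts[account_id] += 1
    return {a: totals[a] / counts[a] for a in totals}


events = [
    ("acct-1", date(2018, 5, 28), 100.0),
    ("acct-1", date(2018, 5, 29), 300.0),
    ("acct-1", date(2018, 1, 1), 9999.0),  # outside a 30-day window
]
features = average_daily_balance(events, window_days=30, today=date(2018, 5, 30))
```

Here `retention_ms(30)` is 2,592,000,000 ms, and only the two May events contribute to the feature for `acct-1`.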
[Figure: a Producer (data source) writes messages, each carrying a producer-maintained transaction ID + payload, to a Kafka topic. IDs can arrive out of order, e.g. … 12 11 9 10 8 5 7 6 4 3 2 1]
RT Decisioning Platform - Kafka Messaging
[Figure: the same out-of-order Kafka topic (producer-maintained transaction ID + payload) feeds Apache Flink, which builds an ordered DataStream structure (sorts, aggregates, etc.) and writes to Kafka compacted topics]
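As a rough illustration of the two ideas in this diagram, the sketch below reorders out-of-order IDs with a bounded buffer and then keeps only the latest value per key, which is the effect of a compacted topic. It is a plain-Python stand-in, not the Flink API: `reorder`, `compact`, and the `max_lateness` parameter are names invented for this example, and the arrival order assumes the diagram is read right to left.

```python
import heapq


def reorder(ids, max_lateness):
    """Yield IDs in ascending order, buffering up to max_lateness items.

    An ID that arrives more than max_lateness positions late would be
    emitted out of order; a real stream processor needs a policy for that.
    """
    heap = []
    for item in ids:
        heapq.heappush(heap, item)
        if len(heap) > max_lateness:
            yield heapq.heappop(heap)
    while heap:
        yield heapq.heappop(heap)


def compact(records):
    """Keep only the most recent value for each key."""
    latest = {}
    for key, value in records:
        latest[key] = value
    return latest


arrival_order = [1, 2, 3, 4, 6, 7, 5, 8, 10, 9, 11, 12]
ordered = list(reorder(arrival_order, max_lateness=3))
features = compact([("acct-1", 100), ("acct-2", 50), ("acct-1", 120)])
```

The buffer size trades latency for tolerance of lateness: a larger `max_lateness` absorbs more disorder but delays every output.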
RT Decisioning Platform - Kafka Messaging
• Independent and interdependent decisioning patterns: Kafka decouples models and rules
• Downstream topics support dependent scoring
[Figure: an independent Model Consumer and a Rules Consumer each read the Source Topic (producer-defined ID + payload) and write scored messages (+ Model Score, + Rules Score) to downstream topics; a dependent Model Consumer reads a rules-scored downstream topic and emits messages carrying both the Rules Score and the Model Score]
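The independent and dependent patterns can be sketched with in-memory lists standing in for topics. The topic names, scoring functions, and message shape below are all hypothetical; the point is only that each consumer reads one topic and writes an enriched copy to another, so a dependent consumer can be added downstream without touching the upstream ones.

```python
from collections import defaultdict

topics = defaultdict(list)  # topic name -> list of messages (dicts)


def rules_score(message):
    # Hypothetical rule: flag large amounts.
    return 1 if message["payload"]["amount"] > 500 else 0


def model_score(message):
    # Hypothetical model: a score derived from the payload alone.
    return message["payload"]["amount"] // 100


def dependent_model_score(message):
    # Dependent scoring: also uses the rules score added upstream.
    return model_score(message) + 10 * message["rules_score"]


def run_consumer(source, sink, field, score_fn):
    """Read every message on source and write a scored copy to sink."""
    for message in topics[source]:
        topics[sink].append({**message, field: score_fn(message)})


topics["source"] = [
    {"id": 3, "payload": {"amount": 250}},
    {"id": 5, "payload": {"amount": 900}},
]
# Independent consumers both read the source topic.
run_consumer("source", "model-scored", "model_score", model_score)
run_consumer("source", "rules-scored", "rules_score", rules_score)
# The dependent consumer reads a downstream topic instead.
run_consumer("rules-scored", "rules-and-model-scored", "model_score", dependent_model_score)
```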
Enterprise Compliance: Image Rehydration
• Cloud VM images require periodic updates
• RT Platform stack has 100+ distinct containers – underlying
image rehydration best handled with an abstraction layer
• Simple Blue-Green approaches can work for stateless
components, BUT…
• Network Storage and other Disk Volumes add complexity for
stateful components such as Kafka Brokers
• Kafka Clustering provides fault tolerance and failover during
rehydration, though we needed a solution to manage Kafka
logs mounted on Cloud Storage
Storage mount points broken
during instance recreation
Kubernetes
• Kubernetes (k8s) is OSS that manages container lifecycle,
addressing, and networking among other things
• The scheduler “moves” both Pods and their associated storage volumes
(defined in StatefulSets) between VM nodes in a coordinated way,
enabling clean rolling rehydration of Kafka Brokers
• Services allow Kafka Brokers and Kafka Connect to be accessed
by a logical service name by all platform components.
• Software Networking enables single TLS solution between all
components, common DNS, and integrated cloud Load
Balancing
• For external access to Kafka on the RT Platform, we refresh the
external DNS record that maps IP to common name at a configurable
interval (20 sec)
Kafka Considerations – Cluster 1
• RT Platform hosts all containers on instance types… 150GB RAM,
40 Cores, 10 Gigabit network performance. Good for most stack
components
– Node affinity set so each instance hosts at most one Kafka broker and at most one ZK node
– ZooKeeper cluster shared with other RT Platform components
– In AWS, st1 EBS volume types, optimized for throughput and well
suited to Kafka
• Brokers increase demand on instance and platform shared
resources
– Platform ZooKeeper state
– Instance OS open files
– Instance RAM
– Instance Network Access
– Instance Storage IO
Kafka Considerations – Cluster 1
• Kafka Brokers use RAM for both the Java heap and the page cache;
page cache use correlates with the size of topics.
• Replication Factor of 3 means three times the disk space consumed
Deeper Topics = More Disk Space, More Page Cache RAM
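The disk arithmetic is simple enough to write down. A replication factor of 3 keeps three copies of every partition, so raw disk use is roughly ingest rate × retention × replication factor. The sketch below ignores compression, index files, and headroom, and the 10 GB/day ingest figure is an invented example, not a number from the talk.

```python
GB = 1024 ** 3


def topic_disk_bytes(bytes_per_day, retention_days, replication_factor):
    """Approximate cluster-wide disk consumed by one topic."""
    return bytes_per_day * retention_days * replication_factor


shallow = topic_disk_bytes(10 * GB, retention_days=30, replication_factor=3)
deep = topic_disk_bytes(10 * GB, retention_days=180, replication_factor=3)
```

With these example inputs, a 30-day topic needs about 900 GB across the cluster and a 180-day topic about 5,400 GB, which is the sense in which deeper topics cost more disk.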
Kafka Considerations – Cluster 1
[Figure: charts of Kubernetes Pod Memory Usage and EC2 Node Memory Usage]
Kafka Considerations – Cluster 1
Larger (m4.10xlarge, n1-standard-32, n1-highmem-32)
instance/machine types: faster network speeds, 100+ GB of RAM,
30+ cores, noisier neighbors competing for RAM and network IO, larger
“Blast Radius”
[Figure: two large nodes, each packed with many containers (C) alongside one Kafka broker and one ZooKeeper node (Z); contention points are labelled TLS and IO]
Kafka Considerations – Cluster 2
Smaller instance/machine types (m4.2xlarge, n1-highmem-4,
standard-8) with dedicated ZK and single-broker node affinity, plus
Connect and/or Schema Registry. Tradeoff: risk, predictability, simplicity
vs. faster networking and high-end CPU
[Figure: the large shared nodes full of containers (C), plus (+) dedicated smaller nodes that each run one Kafka broker and one ZooKeeper node (Z) alongside components labelled KC and SR, presumably Kafka Connect and Schema Registry]
Kafka Real-Time Upgrades
• RT Platform supports multiple active tenants, so
uniform downtime during version upgrades is not
usually an option.
• Rolling upgrades potentially pose compatibility risks
between Kafka versions.
Kafka Real-Time Upgrades
1- Green Cluster provisioned and Topic Offsets
captured
[Figure: the Producer writes to the existing cluster (Kafka1Svc), whose topic holds IDs 12 11 9 10 8 5 7 6 4 3 2 1; each topic offset is captured]
Kafka Real-Time Upgrades
2- Tooling Backfills new Topics
• Depending on the desired window size, tooling may be used to
backfill data for topics on the new cluster, preserving message
timestamps so the retention policy applies consistently.
• Possible Candidate Process for Mirroring
[Figure: the Producer keeps writing to Kafka1Svc while backfill tooling (possibly mirroring) copies messages from the old topic to the new cluster's topic]
Kafka Real-Time Upgrades
3- Producer flows set to load second Kafka cluster as
required
• Producers reference newly upgraded Kafka Clusters by
new k8s service name and upgrade to new cluster
independently
[Figure: the Producer now writes to Kafka2Svc instead of Kafka1Svc; the new topic contains ID 14 twice]
Kafka Real-Time Upgrades- Consequences
• Overlaps between 2) and 3) likely to create
duplicates (better than gaps)
• If downstream state is based on the original cluster, or
original offsets are not preserved, all messages in the
window may need to be replayed to recover
[Figure: after the Producer cutover, the new topic holds a duplicate of ID 14]
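The three upgrade steps and the duplicate they tend to produce can be simulated with lists standing in for topics. This is only a sketch: `backfill` and `dedupe` are hypothetical helpers, and deduplicating on the producer-defined ID is an assumption about how a consumer could tolerate the overlap, not something the talk prescribes.

```python
def backfill(old_log, captured_offset, start_offset=0):
    """Step 2: copy records from the old cluster up to the captured offset."""
    return list(old_log[start_offset:captured_offset])


def dedupe(records):
    """Drop records whose producer-defined ID has already been seen."""
    seen = set()
    unique = []
    for record_id, payload in records:
        if record_id not in seen:
            seen.add(record_id)
            unique.append((record_id, payload))
    return unique


# Old cluster holds IDs 1..14 by the time tooling finishes copying.
old_log = [(i, f"payload-{i}") for i in range(1, 15)]

# Steps 1 and 2: capture the offset, then backfill the new cluster up to it.
captured_offset = len(old_log)
new_log = backfill(old_log, captured_offset)

# Step 3: the producer cuts over and, to avoid a gap, re-sends its last
# message before continuing. The overlap creates a duplicate of ID 14.
new_log += [(14, "payload-14"), (15, "payload-15")]

ids_on_new_cluster = [record_id for record_id, _ in new_log]
clean = dedupe(new_log)
```

Erring on the side of overlap means consumers see ID 14 twice but never miss it, which matches the "better than gaps" trade-off above.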
Kafka Across Regions
• Regional Clusters
• Why Do This?
– Partitioned Strategy
• Active-Active
• Latency or Partition Routed, Increased
Performance and Efficiency
– Disaster Recovery
• Active-Passive, Active-Active
• Redundantly Constructed and Routed,
Increased Reliability
• Issues
– Syncing Data
– Latency
• Inefficient Operation Across Great
Distance
• Kafka Cluster Replication not
recommended
Kafka Across Regions – Data Syncing Options
• Duplicate Common Upstream Sources
• Producer-Driven Replication
• Mirroring
• Mirroring + Consolidation
Kafka Across Regions – Data Syncing Options
Common Upstream
• Local Producers use Common Source
• 2 Topics Represent 1 Logical Topic
• Pros
– Fewest Number of Topics
– Consumer behavior minimally impacted
• Cons
– Each Local Producer needs to know about Each Regional
Deployment
Kafka Across Regions – Data Syncing Options
[Figure: Common Upstream. In Region A and Region B, a local Producer does an ETL pull from the common upstream source and writes to its regional Topic, which local Consumers read. 2 Topics Represent 1 Logical set of Messages]
Kafka Across Regions – Data Syncing Options
Producer-Driven Replication
• Producers maintain Topic consistency across multiple
regions
• 2 Topics Represent 1 Logical Topic, Clusters
• Pros
– Fewest Number of Topics
– Consumer behavior minimally impacted
• Cons
– Each Producer needs to know about Each Regional
Deployment
– Failure strategy, reliability tracking, SLA, etc. must be
implemented by each Producer, likely using shadow topics
Kafka Across Regions – Data Syncing Options
[Figure: Producer-Driven Replication. The Producer in each region writes A-routed and B-routed data to the topics in both regions (Topic AB, Topic BA), with a Shadow Topic in each region; Consumers read in each region. 2 Topics Represent 1 Logical set of Messages]
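One possible reading of the shadow-topic idea is sketched below: the producer writes to both regions, and when the remote write fails it parks the record on a local shadow topic for later replay. The talk does not spell out how shadow topics were used, so the `Region` class, `replicate`, and `replay_shadow` are assumptions made purely for illustration.

```python
class Region:
    """In-memory stand-in for one regional Kafka cluster."""

    def __init__(self, name):
        self.name = name
        self.available = True
        self.topic = []   # the replicated logical topic
        self.shadow = []  # records still owed to the other region

    def send(self, record):
        if not self.available:
            raise ConnectionError(f"region {self.name} unreachable")
        self.topic.append(record)


def replicate(record, local, remote):
    """Write locally, then remotely; park the record if the remote fails."""
    local.send(record)
    try:
        remote.send(record)
    except ConnectionError:
        local.shadow.append(record)


def replay_shadow(local, remote):
    """Drain the shadow topic once the remote region is reachable again."""
    while local.shadow:
        remote.send(local.shadow[0])  # raises if still down; record stays parked
        local.shadow.pop(0)


region_a, region_b = Region("A"), Region("B")
region_b.available = False
replicate("txn-1", local=region_a, remote=region_b)
parked = list(region_a.shadow)
region_b.available = True
replay_shadow(region_a, region_b)
```

Every producer has to carry logic like this, which is the cost listed under Cons.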
Kafka Across Regions – Data Syncing Options
Mirroring
• Tooling Automatically Replicates Topics
– Confluent Replicator (Licensed)
– MirrorMaker, uReplicator (OSS)
• 4 Topics Represent 1 Logical Topic
• Pros
– Producer behavior minimally impacted
• Cons
– Each Consumer needs to know about Each Replicated
Topic
– Complexity: More Topics
Kafka Across Regions – Data Syncing Options
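[Figure: Mirroring. The Region A Producer writes Topic A and the Region B Producer writes Topic B; a mirror copies Topic A into Region B as Topic A’ and Topic B into Region A as Topic B’. Consumers in each region read the local topic and the mirrored copy. 4 Topics Represent 1 Logical set of Messages]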
Kafka Across Regions – Data Syncing Options
Mirroring + Consolidation
• Tooling Automatically Replicates Topics
• Additional Tooling merges Topics for Consumers
– ETL Tooling, NiFi, etc.
– Kafka Connect
• 6 Topics Represent 1 Logical Topic
• Pros
– Producer behavior minimally impacted
– Consumer behavior minimally impacted
• Cons
– Custom tooling must implement failure strategy, reliability
tracking, etc.
– Complexity: Lots More Topics, flow logic, and associated
resource consumption
Kafka Across Regions – Data Syncing Options
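[Figure: Mirroring + Consolidation. As in Mirroring, Topic A and Topic B are mirrored across regions as Topic A’ and Topic B’; in each region an ETL step merges the local topic and the mirrored copy into a consolidated topic (Topic AB in one region, Topic BA in the other) that Consumers read. 6 Topics Represent 1 Logical set of Messages]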
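The consolidation step amounts to merging a local topic with a mirrored one into a single stream for consumers. The sketch below assumes each record is a `(timestamp, id, payload)` tuple and that each input is already ordered by timestamp; both are assumptions for illustration, as is the `consolidate` name. Real tooling (NiFi, Kafka Connect) would also need the failure handling listed under Cons.

```python
import heapq


def consolidate(local_topic, mirrored_topic):
    """Merge two timestamp-ordered topics, dropping repeated IDs."""
    seen = set()
    merged = []
    for record in heapq.merge(local_topic, mirrored_topic, key=lambda r: r[0]):
        _, record_id, _ = record
        if record_id not in seen:
            seen.add(record_id)
            merged.append(record)
    return merged


topic_a = [(1, "a-1", "x"), (4, "a-2", "y")]          # written in Region A
topic_b_mirror = [(2, "b-1", "p"), (3, "b-2", "q")]   # mirrored from Region B
topic_ab = consolidate(topic_a, topic_b_mirror)
```

Consumers then read only `topic_ab`, which is why this option leaves both producer and consumer behavior largely unchanged at the price of extra topics.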
Kafka Across Regions – Data Syncing Options
So What Do We Use?
• Multiple Tenant Use Cases and Risk
Tolerances
• Combination of Solutions
– Common Upstream
– Confluent Replication
Kafka – Moving Forward
• Exactly Once Semantics/Transactionality
• Hyper Partitioning
• Alternate Backends to Support Indefinite
Retention (S3, etc.)
Kafka for Real Time Bank Decisions
Handling Private Information
Real-Time Request and Response
Handling PII (not) on Kafka
Goal:
Remove the possibility of exposing PII
Encrypted Volume: Simple & Effective
[Figure: the Producer sends “Library Card# 8675309” through Tokenization, so Kafka receives “Library Card# TOK:113581321”; the topic is persisted to storage with encryption, and Consumers read from the topic]
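Tokenization before the producer can be sketched as follows. The in-memory `TokenVault` is a hypothetical stand-in for whatever tokenization service sits in front of Kafka, and the field names are invented; the only detail taken from the slide is that the sensitive value is replaced by a `TOK:`-prefixed token before it reaches a topic.

```python
import secrets


class TokenVault:
    """Maps sensitive values to opaque tokens; lives outside Kafka."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value):
        # Reuse the same token for a repeated value so joins still work.
        if value not in self._token_by_value:
            token = "TOK:" + secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def detokenize(self, token):
        return self._value_by_token[token]


def to_kafka_record(record, vault, sensitive_fields):
    """Return a copy of record that is safe to put on a topic."""
    return {
        key: vault.tokenize(value) if key in sensitive_fields else value
        for key, value in record.items()
    }


vault = TokenVault()
raw = {"library_card": "8675309", "branch": "main"}
safe = to_kafka_record(raw, vault, sensitive_fields={"library_card"})
```

Only `safe` would be handed to the producer; consumers that are entitled to the original value would call the vault, not Kafka.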
Encrypted Volume:
Following the Path of Least Resistance
Good
• Highly durable across Kafka
restarts
• Simple disaster recovery
planning
• Follows recommended
Kafka configuration
practices
Not So Good
• Information privacy
regulations require extra
levels of protection
• Durability is based on
additional storage volumes
being managed with the
Kafka service
Volatile Storage: Performance & Privacy
[Figure: as before, the Producer's “Library Card# 8675309” is tokenized to “Library Card# TOK:113581321” before reaching Kafka; topic persistence now uses tmpfs storage, with the initial state copied from storage on startup, and Consumers read from the topic]
Volatile Storage: Strange Trade-offs
Improvements
• Noticeably better
performance
• Data is always “in flight”, so
extra encryption shouldn’t
be needed
• Effectively stateless images
Complications
• Needs scripting to bootstrap
• Topic contents are cleared
on host reboot
• ZooKeeper won’t be able to
manage offsets between
reboots
Volatile Storage: Why We Aren’t Using It
• We need long-term storage of data and RAM is already a
precious resource.
• Our recovery strategy is built on Kafka as our state storage
mechanism. Losing that state complicates recovery efforts.
• Host disk caching gives us most of the benefit of volatile
storage.
Request-Response Pattern
/rəˈkwest rəˈspans ˈpadərn/
noun
1. A pattern of interaction with a remote service where the
local task submits a request for remote work and
expects a response before continuing work.
2. A specialized use of Kafka using dedicated topic pairs
to communicate with a shared service
Request Response Basics
Application (Consumer on the Response Topic, Producer on the Request Topic):
1. Initialize Consumer
2. Initialize Producer
3. Prepare Data
4. Assign a unique ID (e.g. ID: 14159-26535)
5. Put request on request topic
6. Read Response topic until Request ID is seen
Service: (Service does work, and builds a response with the Request ID)
[Figure: the Response Topic carries responses for many requests (ID:14159-26531, ID:14159-26532, ID:14159-26533, ID:14159-26535, ID:14159-26536); the Application reads until it sees ID:14159-26535]
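The steps above can be sketched with two in-process queues standing in for the request and response topics. This is not Kafka client code: `queue.Queue` removes a message when it is read, whereas a Kafka consumer only advances an offset, and the `call` and `service_loop` names and the upper-casing "work" are invented for the example.

```python
import queue
import threading
import time
import uuid

request_topic = queue.Queue()
response_topic = queue.Queue()


def service_loop():
    """Service: do the work and echo the request ID in the response."""
    while True:
        request = request_topic.get()
        if request is None:  # shutdown signal for this sketch
            return
        response_topic.put({"id": request["id"], "result": request["data"].upper()})


def call(data, timeout_s=1.0):
    """Application: send a request, then read responses until our ID is seen."""
    request_id = str(uuid.uuid4())
    request_topic.put({"id": request_id, "data": data})
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"no response for request {request_id}")
        try:
            response = response_topic.get(timeout=remaining)
        except queue.Empty:
            raise TimeoutError(f"no response for request {request_id}")
        if response["id"] == request_id:
            return response["result"]
        # Otherwise the response belongs to another request; keep reading.


worker = threading.Thread(target=service_loop, daemon=True)
worker.start()
result = call("hello")
request_topic.put(None)
worker.join()
```

Note that the caller learns about a failed or missing service only when the timeout expires, never sooner.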
How Request-Response Feels
[Figure: the Application hands its Data directly to the Service]
How Request-Response Actually Works
[Figure: the Application's Data waits behind other messages queued on the Request Topic, and its reply waits behind other messages on the Response Topic]
When Failures Occur
[Figure: the same flow, but some requests never produce a reply; the gaps on the Response Topic are labelled “Missing Responses”]
The Slow Failure Problem
Failures
The Request-Response Pattern
This is actually the
“Background Job” pattern:
1. Submit Job
2. Get assigned a Job ID
3. Poll the service until the Job ID is
marked as complete
4. Retrieve the results of the job
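The four steps map onto a very small client loop. The `JobService` class below is a hypothetical in-memory service, and its `tick` method stands in for the service making progress between polls; a real client would sleep between polls instead.

```python
import itertools


class JobService:
    """In-memory stand-in for a service that runs submitted jobs."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._jobs = {}

    def submit(self, work):
        job_id = next(self._next_id)
        self._jobs[job_id] = {"work": work, "done": False, "result": None}
        return job_id

    def tick(self):
        """Finish one pending job, if there is one."""
        for job in self._jobs.values():
            if not job["done"]:
                job["result"] = job["work"]()
                job["done"] = True
                return

    def is_complete(self, job_id):
        return self._jobs[job_id]["done"]

    def result(self, job_id):
        return self._jobs[job_id]["result"]


def submit_and_wait(service, work, max_polls=10):
    job_id = service.submit(work)            # steps 1 and 2
    for _ in range(max_polls):               # step 3
        if service.is_complete(job_id):
            return service.result(job_id)    # step 4
        service.tick()  # stand-in for waiting between polls
    raise TimeoutError(f"job {job_id} did not complete after {max_polls} polls")


service = JobService()
answer = submit_and_wait(service, lambda: 6 * 7)
```

The bounded poll count makes the failure mode explicit: the caller gives up after a fixed wait rather than being told that the job failed.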
Request Response: Serverless Considerations
• Try to reuse Producers and Consumers
• Explicitly assign Consumer partitions
• Attempt to read from the Consumer before
submitting to the Producer
• Remember to commit offsets before sending
responses
Slightly Better: The Real-Time Tap Pattern
[Figure: a Processing Service reads the Input Topic and delivers data as Precomputed Values; the Application sends a Request (REST, gRPC, etc.) to the Real Time Service, which reads the request, processes it, and sends the Response]
Real-Time Tap Pattern
• Real-time request is handled by a session-based
protocol
• Resilient data processing is handled by Kafka
• Failures are reported when they happen via
real-time protocol
• Kafka interactions can be optimized by the handler
service, rather than relying on clients
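A minimal sketch of the tap pattern, with plain Python objects in place of Kafka and the session-based protocol: one component consumes the input topic and maintains precomputed values, and a separate handler answers requests from those values and reports failure immediately. The class names, the running-total "feature", and the HTTP-style status codes are assumptions for illustration.

```python
class ProcessingService:
    """Consumes the input topic and keeps precomputed values current."""

    def __init__(self):
        self.precomputed = {}

    def consume(self, input_topic):
        for account, amount in input_topic:
            self.precomputed[account] = self.precomputed.get(account, 0) + amount


class RealTimeService:
    """Answers requests synchronously from the precomputed values."""

    def __init__(self, precomputed):
        self._precomputed = precomputed

    def handle(self, request):
        account = request["account"]
        if account not in self._precomputed:
            # The failure is reported on the spot, not discovered by timeout.
            return {"status": 404, "error": f"no data for {account}"}
        return {"status": 200, "value": self._precomputed[account]}


processing = ProcessingService()
processing.consume([("acct-1", 100), ("acct-2", 50), ("acct-1", 25)])
real_time = RealTimeService(processing.precomputed)

ok = real_time.handle({"account": "acct-1"})
missing = real_time.handle({"account": "acct-9"})
```

The client never touches Kafka here, so consumer reuse, partition assignment, and offset handling stay inside the services that own them.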
Questions?
Thank you for joining us!

Capital One Delivers Risk Insights in Real Time with Stream Processing

  • 1. Confidential Capital One Delivers Risk Insights in Real Time with Stream Processing Jeff Sharpe and Ravi Dubey Capital One Retail Bank Confluent Online Talk May 30, 2018
  • 2. 2 Ravi is a senior manager working for Capital One in Virginia. Ravi has over 25 years of software development and management experience across a range of products in support of government and commercial industries. His most recent experience includes full stack development of web apps, cloud-based enterprise-facing support applications and a high-throughput, low-latency, distributed cloud-hosted data processing platform. Ravi Dubey Senior Manager, Software Engineering, Capital One Jeff is a senior software engineer working for Capital One in Virginia. He’s been an engineer for almost 18 years, with major projects spanning five different languages. Though he began his work on kernel drivers and web applications, he’s been repeatedly drawn into high volume, high throughput data processing projects. Jeff Sharpe Senior Software Engineer, Capital One
  • 3. 3 Housekeeping Items ● This session will last about an hour. ● This session will be recorded. ● You can submit your questions by entering them into the GoToWebinar panel. ● The last 10-15 minutes will consist of Q&A. ● The slides and recording will be available after the talk.
  • 4. Thanks… • Bobby Calderwood – @bobbycalderwood – https://www.confluent.io/blog/author/bobby/ • Keith Gasser – Keith.Gasser@capitalone.com
  • 5. Real Time Decisioning Platform - Introduction • Decisioning using ML models and Rules using low-latency processing • Streamed, batched, or micro-batched messages
  • 6. Real Time Decisioning Platform - Introduction Streaming “Window”
  • 7. RT Decisioning Platform - Introduction • High Speed Durable Message Bus – Apache Kafka • Enterprise Data Sources – Streams, Databases, and Warehouses • ETL – Apache NiFi, Kafka Connect, Confluent Schema Registry • Distributed Processing – Apache Flink and others • Feature Caching – Apache Flink, Redis, Kafka Compacted Topics • Prometheus, Grafana – Metrics, Alert Management • Supplemented with Cloud compute, RDBMS, and Caching services • Containerization – Docker and Kubernetes
  • 8. RT Decisioning Platform - Kafka Messaging • Durable, fast, and clustered Kafka topics act as data streams regarding decisioning input and decision scoring output • DataStream window intervals correlate to Kafka Topic log.retention.ms, typically between 30 and 180+ days • DataStream objects are aggregated into cached features, such as average daily balance for a specific account holder • Ten brokers in total per AWS region, dozens of topics • Producers include NiFi, Kafka Connect, external Streams Producer-Maintained Transaction IDs (Can arrive out of order) Producer (Data Source) 22 21 19 20 18 17 16 15 14 13 12 11 9 10 8 5 7 6 4 3 2 124 Kafka Topic + Payload
  • 9. RT Decisioning Platform - Kafka Messaging Producer-Maintained Transaction IDs (Can arrive out of order) Producer (Data Source) 22 21 19 20 18 17 16 15 14 13 12 11 9 10 8 5 7 6 4 3 2 124 Kafka Topic + Payload Apache Flink 20 19 18 … 7 6 5 4 3 DataStream Structure (sorts, Aggregates, etc.) Kafka Compacted Topics
  • 10. RT Decisioning Platform - Kafka Messaging Independent Model Consumer Rules Consumer 12 11 9 10 8 5 7 6 4 3 2 1 • Independent and Interdependent Decisioning Patterns, Kafka decouples models and rules Source Topic 8 5 7 6 4 3 2 1 Downstream Topics Support Dependent Scoring 2 110 Dependent Model Consumer 5 + Payload + Model Score + Payload + Rules Score 3 + Payload + Rules Score 10 + Payload 13 + Payload 5 + Payload 3 + Payload + Rules Score + Model Score Producer-Defined ID
  • 11. Enterprise Compliance: Image Rehydration • Cloud VM Machine Images require periodic update • RT Platform stack has 100+ distinct containers – underlying image rehydration best handled with an abstraction layer • Simple Blue-Green approaches can work for stateless components, BUT… • Network Storage and other Disk Volumes add complexity for stateful components such as Kafka Brokers • Kafka Clustering provides fault tolerance and failover during rehydration, though we needed a solution to manage Kafka logs mounted on Cloud Storage Storage mount points broken during instance recreation
  • 12. Kubernetes • Kubernetes (k8s) is OSS that manages container lifecycle, addressing, and networking among other things • Scheduler “moves” both Pods and associated storage volumes defined in Stateful Sets in coordination between VM nodes enabling clean rolling rehydration of Kafka Brokers • Services allow Kafka Brokers and Kafka Connect to be accessed by a logical service name by all platform components. • Software Networking enables single TLS solution between all components, common DNS, and integrated cloud Load Balancing • For external access to Kafka on the RT Platform, we recycle external DNS mapping IP to common name at configurable intervals (20 sec)
  • 13. Kafka Considerations – Cluster 1 • RT Platform hosts all containers on instance types… 150GB RAM, 40 Cores, 10GB network performance. Good for most stack components – Instance Node affinity set so max one Kafka broker and max one ZK node. – Shared ZooKeeper cluster with other RT Platform components – In AWS, st1 EBS volume types optimized for write throughput, optimized for Kafka • Brokers increase demand on instance and platform shared resources – Platform Zookeeper state – Instance OS open files – Instance RAM – Instance Network Access – Instance Storage IO
  • 14. • Kafka Brokers utilize RAM including Java heap and page cache correlating to the size of topics. • Replication Factor of 3 means four times the disk space consumed Kafka Considerations – Cluster 1 Deeper Topics = More Disk Space More Page Cache RAM
  • 15. Kubernetes Pod Memory Usage EC2 Node Memory Usage Kafka Considerations – Cluster 1
  • 16. C Kafka C C C C C C Z C C C C C C CC C CC Kafka C Z C C C C C C C C C Larger (m4.10xlarge , n1-standard-32 , n1-highmem-32) instance/machine types: Faster network speeds, 100+ GB of RAM, 30+ cores, noisier neighbors competing for RAM, Network IO, “Blast Radius” TLS IOIO Kafka Considerations – Cluster 1
  • 17. Smaller instance/machine types (m4.2xlarge , n1-highmem-4 , standard-8), dedicated ZK, single broker node affinity, Connect, and or Schema Registry. Tradeoff: risk, predictability, simplicity vs. faster networking network and high-end CPU Kafka C C C C C C C Z C C C C C C CC C CC CC CC Z KCSR KC Kafka Z KCSR Kafka Z KC KC Kafka Z KC KC + Kafka Considerations – Cluster 2
  • 18. Kafka Real-Time Upgrades • RT Platform supports multiple active tenants, so uniform downtime during version upgrades is not usually an option. • Rolling upgrades potentially pose compatibility risks between Kafka versions.
  • 19. Kafka Real-Time Upgrades 1- Green Cluster provisioned and Topic Offsets captured 12 11 9 10 8 5 7 6 4 3 2 1 Producer Kafka1Svc Capture Each Topic Offset
  • 20. Kafka Real-Time Upgrades 2- Tooling Backfills new Topics • Depending on desired window size, tooling may be used to backfill data for topics on new clusters, respecting time stamp for consistent retention policy. • Possible Candidate Process for Mirroring 13 12 11 9 10 8 5 7 6 4 3 12 11 9 10 8 5 7 6 4 3 2 1 Backfill Tooling, Possible Mirroring Producer Kafka1Svc
  • 21. Kafka Real-Time Upgrades 3- Producer flows set to load second Kafka cluster as required • Producers reference newly upgraded Kafka Clusters by new k8s service name and upgrade to the new cluster independently [Diagram: Producer, Kafka2Svc, Kafka1Svc]
  • 22. Kafka Real-Time Upgrades – Consequences • Overlaps between steps 2) and 3) are likely to create duplicates (better than gaps) • If downstream state based on the original cluster or the original offsets is not preserved, all messages in the window may need to be replayed to recover [Diagram: Producer and the message sequences on both clusters, with overlapping messages]
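Because the cutover favors duplicates over gaps, consumers need to tolerate seeing a message twice. One common way to do that is sketched below, assuming each message carries a unique ID (the talk does not say how duplicates were handled); `Deduplicator` and `max_ids` are hypothetical names.

```python
from collections import OrderedDict


class Deduplicator:
    """Remember recently seen message IDs and reject repeats."""

    def __init__(self, max_ids: int = 10000):
        self._seen = OrderedDict()
        self._max_ids = max_ids

    def is_new(self, message_id) -> bool:
        if message_id in self._seen:
            return False
        self._seen[message_id] = True
        # Bound memory by forgetting the oldest ID once the cache is full.
        if len(self._seen) > self._max_ids:
            self._seen.popitem(last=False)
        return True


if __name__ == "__main__":
    dedup = Deduplicator()
    stream = [5, 6, 7, 8, 7, 8, 9, 10]  # 7 and 8 arrive from both clusters
    print([m for m in stream if dedup.is_new(m)])  # [5, 6, 7, 8, 9, 10]
```

The cache should be sized to cover at least the overlap window; IDs evicted from it would be treated as new again.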
  • 23. Kafka Across Regions • Regional Clusters • Why Do This? – Partitioned Strategy • Active-Active • Latency or Partition Routed, Increased Performance and Efficiency – Disaster Recovery • Active-Passive, Active-Active • Redundantly Constructed and Routed, Increased Reliability • Issues – Syncing Data – Latency • Inefficient Operation Across Great Distance • Kafka Cluster Replication not recommended
  • 24. Kafka Across Regions – Data Syncing Options • Duplicate Common Upstream Sources • Producer-Driven Replication • Mirroring • Mirroring + Consolidation
  • 25. Kafka Across Regions – Data Syncing Options Common Upstream • Local Producers use Common Source • 2 Topics Represent 1 Logical Topic • Pros • Fewest Number of Topics • Consumer behavior minimally impacted • Cons • Each Local Producer needs to know about Each Regional Deployment
  • 26. Kafka Across Regions – Data Syncing Options: Common Upstream [Diagram: Region A and Region B each have a Producer, a Topic, and Consumers; both Producers load from the Common Upstream via ETL Pull. 2 Topics Represent 1 Logical set of Messages]
  • 27. Kafka Across Regions – Data Syncing Options Producer-Driven Replication • Producers maintain Topic consistency across multiple regions • 2 Topics Represent 1 Logical Topic, Clusters • Pros • Fewest Number of Topics • Consumer behavior minimally impacted • Cons • Each Producer needs to know about Each Regional Deployment • Failure strategy, Reliability Tracking, SLA, etc. must be implemented by each Producer, likely using shadow topics
  • 28. Kafka Across Regions – Data Syncing Options: Producer-Driven Replication [Diagram: Region A and Region B each have a Producer, Consumers, and a Shadow Topic; Topic AB and Topic BA carry A Routed Data and B Routed Data. 2 Topics Represent 1 Logical set of Messages]
  • 29. Kafka Across Regions – Data Syncing Options Mirroring • Tooling Automatically Replicates Topics • Confluent Replicator (Licensed) • Mirror Maker, uReplicator (OSS) • 4 Topics Represent 1 Logical Topic • Pros • Producer behavior minimally impacted • Cons • Each Consumer needs to know about Each Replicated Topic • Complexity–More Topics
  • 30. Kafka Across Regions – Data Syncing Options: Mirroring [Diagram: Region A and Region B each have a Producer and Consumers; Topic A and Topic B are mirrored across regions as Topic A’ and Topic B’. 4 Topics Represent 1 Logical set of Messages]
  • 31. Kafka Across Regions – Data Syncing Options Mirroring + Consolidation • Tooling Automatically Replicates Topics • Additional Tooling merges Topics for Consumers • ETL Tooling, NiFi, etc. • Kafka Connect • 6 Topics Represent 1 Logical Topic • Pros • Producer behavior minimally impacted • Consumer behavior minimally impacted • Cons • Custom tooling must implement failure strategy, reliability tracking, etc. • Complexity– Lots More Topics, flow logic, and associated resource consumption
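The consolidation step can be pictured as a merge of the local topic with the mirrored copy of the remote topic. The sketch below is an in-memory illustration only: it assumes each input is already ordered by timestamp and models a message as a hypothetical `(timestamp_ms, origin_region, value)` tuple. Real tooling (NiFi, Kafka Connect, or custom ETL) would also need the failure handling the slide mentions.

```python
import heapq
from typing import Iterable, List, Tuple

# (timestamp_ms, origin_region, value); a made-up shape for this example
Message = Tuple[int, str, str]


def consolidate(local: Iterable[Message],
                mirrored: Iterable[Message]) -> List[Message]:
    """Merge a local topic and a mirrored topic into one consolidated view.

    Both inputs must already be sorted by timestamp; heapq.merge then
    yields a single timestamp-ordered sequence without re-sorting.
    """
    return list(heapq.merge(local, mirrored, key=lambda m: m[0]))


if __name__ == "__main__":
    topic_a = [(1, "A", "a1"), (4, "A", "a2")]
    topic_b_mirror = [(2, "B", "b1"), (3, "B", "b2")]
    print([m[2] for m in consolidate(topic_a, topic_b_mirror)])
    # ['a1', 'b1', 'b2', 'a2']
```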
  • 32. Kafka Across Regions – Data Syncing Options: Mirroring + Consolidation [Diagram: Region A and Region B each have a Producer and Consumers; Topic A and Topic B are mirrored as Topic A’ and Topic B’, and ETL merges them into Topic AB and Topic BA for Consumers. 6 Topics Represent 1 Logical set of Messages]
  • 33. Kafka Across Regions – Data Syncing Options: So What Do We Use? • Multiple Tenant Use Cases and Risk Tolerances • Combination of Solutions – Common Upstream – Confluent Replication
  • 34. Kafka – Moving Forward • Exactly Once Semantics/Transactionality • Hyper Partitioning • Alternate Backends to Support Indefinite Retention (S3, etc.)
  • 35. Kafka for Real Time Bank Decisions • Handling Private Information • Real-Time Request and Response
  • 36. Handling PII (not) on Kafka Goal: Remove the possibility of exposing PII
  • 37. Encrypted Volume: Simple & Effective [Diagram: Producer sends “Library Card# 8675309” through Tokenization, which emits “Library Card# TOK:113581321” to a KAFKA Topic read by Consumers; Topic Persistence uses Storage Encryption]
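The tokenization step in the diagram replaces a sensitive value with a token before the record reaches Kafka. The talk does not describe how tokens are generated, so the sketch below uses a keyed hash (HMAC-SHA256) purely as a stand-in for a real tokenization service; `tokenize`, `scrub_record`, the field names, and the demo key are all hypothetical.

```python
import hashlib
import hmac


def tokenize(value: str, secret_key: bytes) -> str:
    """Return a stable, non-reversible token for a sensitive value."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "TOK:" + digest[:16]


def scrub_record(record: dict, pii_fields: set, secret_key: bytes) -> dict:
    """Copy a record, replacing every PII field with its token."""
    return {
        key: tokenize(str(value), secret_key) if key in pii_fields else value
        for key, value in record.items()
    }


if __name__ == "__main__":
    raw = {"library_card": "8675309", "branch": "main"}
    # Only the scrubbed copy would be handed to the Kafka producer.
    print(scrub_record(raw, {"library_card"}, b"demo-key-not-for-production"))
```

A keyed hash cannot be reversed, so a system that must recover the original value would need a vault-backed token service instead.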
  • 38. Encrypted Volume: Following the Path of Least Resistance Good • Highly durable across Kafka restarts • Simple disaster recovery planning • Follows recommended Kafka configuration practices Not So Good • Information privacy regulations require extra levels of protection • Durability is based on additional storage volumes being managed with the Kafka service
  • 39. Volatile Storage: Performance & Privacy [Diagram: Producer sends “Library Card# 8675309” through Tokenization, which emits “Library Card# TOK:113581321” to a KAFKA Topic read by Consumers; Topic Persistence uses tmpfs Storage, populated from Initial State Storage by Copy on Startup]
  • 40. Volatile Storage: Strange Trade-offs Improvements • Noticeably better performance • Data is always “in flight”, so extra encryption shouldn’t be needed • Effectively stateless images Complications • Needs scripting to bootstrap • Topic contents are cleared on host reboot • Zookeeper won’t be able to manage offsets between reboots
  • 41. Volatile Storage: Why We Aren’t Using It • We need long-term storage of data and RAM is already a precious resource. • Our recovery strategy is built on Kafka as our state storage mechanism. Losing that state complicates recovery efforts. • Host disk caching gives us most of the benefit of volatile storage.
  • 42. Request-Response Pattern /rəˈkwest rəˈspans ˈpadərn/ noun 1. A pattern of interaction with a remote service where the local task submits a request for remote work and expects a response before continuing work. 2. A specialized use of Kafka using dedicated topic pairs to communicate with a shared service
  • 43. Request Response Basics 1. Initialize Consumer 2. Initialize Producer 3. Prepare Data 4. Assign a unique ID 5. Put request on request topic (Service does work, and builds a response with the Request ID) 6. Read Response topic until Request ID is seen [Diagram: Application (Consumer, Producer), Request Topic, Service, Response Topic; the request carries ID: 14159-26535, and the Response Topic holds IDs 14159-26532, 14159-26531, 14159-26533, 14159-26535, 14159-26536]
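The six steps on this slide can be walked through with in-memory stand-ins for the two topics. This is a single-process sketch, not Kafka client code: `request_topic`, `response_topic`, `service_poll`, and `request_response` are hypothetical names, and the "service" simply upper-cases the payload.

```python
import uuid
from collections import deque

request_topic = deque()  # stand-in for the request topic
response_topic = []      # stand-in for the response topic (append-only log)


def service_poll():
    """Hypothetical shared service: answer each pending request, echoing its ID."""
    while request_topic:
        request = request_topic.popleft()
        response_topic.append(
            {"id": request["id"], "result": request["data"].upper()})


def request_response(data, service=service_poll, max_polls=5):
    read_pos = len(response_topic)      # 1. consumer starts at the end of the log
    request_id = str(uuid.uuid4())      # 3-4. prepare data, assign a unique ID
    request_topic.append({"id": request_id, "data": data})  # 2, 5. produce request
    for _ in range(max_polls):
        service()                       # a real service would run on its own
        while read_pos < len(response_topic):  # 6. read until our ID is seen
            message = response_topic[read_pos]
            read_pos += 1
            if message["id"] == request_id:
                return message["result"]
    raise TimeoutError(f"no response for request {request_id}")
```

Calling `request_response("hello")` returns `"HELLO"`. Passing a service that never answers shows the weakness of the pattern: the caller only learns about the failure when its own polling budget runs out.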
  • 45. How Request-Response Actually Works [Diagram: Application, Request Topic, and Response Topic, each holding many Data messages]
  • 46. When Failures Occur [Diagram: Application, Request Topic, and Response Topic, each holding many Data messages, with Missing Responses]
  • 47. The Slow Failure Problem [Figure: Failures]
  • 48. The Request-Response Pattern This is actually the “Background Job” pattern: 1. Submit Job 2. Get assigned a Job ID 3. Poll the service until the Job ID is marked as complete 4. Retrieve the results of the job
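The four steps of the “Background Job” pattern map directly onto a small client loop. The sketch below uses a toy in-memory `JobService` (hypothetical, as are `pending_polls` and `run_job`) that reports a job as pending for a fixed number of polls and then returns the sum of the payload.

```python
import itertools


class JobService:
    """Toy job service: a job is pending for `pending_polls` status checks."""

    def __init__(self, pending_polls: int = 2):
        self._ids = itertools.count(1)
        self._jobs = {}
        self._pending_polls = pending_polls

    def submit(self, payload):
        job_id = next(self._ids)  # 2. the caller gets assigned a Job ID
        self._jobs[job_id] = {"payload": payload,
                              "polls_left": self._pending_polls}
        return job_id

    def is_complete(self, job_id) -> bool:
        job = self._jobs[job_id]
        if job["polls_left"] > 0:
            job["polls_left"] -= 1
            return False
        return True

    def result(self, job_id):
        return sum(self._jobs[job_id]["payload"])


def run_job(service, payload, max_polls: int = 10):
    job_id = service.submit(payload)         # 1. submit job
    for _ in range(max_polls):
        if service.is_complete(job_id):      # 3. poll until marked complete
            return service.result(job_id)    # 4. retrieve the results
    raise TimeoutError(f"job {job_id} did not complete in {max_polls} polls")
```

With the defaults, `run_job(JobService(), [1, 2, 3])` returns `6` on the third poll; a service that stays pending longer than `max_polls` raises `TimeoutError`.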
  • 49. Request Response: Serverless Considerations • Try to reuse Producers and Consumers • Explicitly assign Consumer partitions • Attempt to read from the Consumer before submitting to the Producer • Remember to commit offsets before sending responses
  • 50. Slightly Better: The Real-Time Tap Pattern [Diagram labels: Application; Request and Response over REST, gRPC, etc.; Real Time Service (Read Request, Process, Send Response); Input Topic; Processing Service; Precomputed Values; Deliver Data]
  • 51. Real-Time Tap Pattern • Real-time request is handled by a session-based protocol • Resilient data processing is handled by Kafka • Failures are reported when they happen via real-time protocol • Kafka interactions can be optimized by the handler service, rather than relying on clients
  • 53. Thank you for joining us!