Karthik Aaravabhoomi
July 20, 2016
Welcome Data Enthusiasts
• More than 65 million customer accounts
• More than 44,000 associates
• Largest US direct bank
• 3rd largest independent auto loan originator
• 4th largest credit card issuer in the US
Capital One at a glance
• Overview of Cyber – Technology Data and Analytics Frameworks: motivation,
vision, and roadmap.
• Architecture overview
• Machine Learning use case
• Governance and Progression
• Key Benefits
The Focus of Today’s Discussion
By leveraging big data, we can create a single pane of glass and automate and enrich alerts to ease the burden on our analysts
Bad Actors Attack Capital One and Our Tools Monitor and Generate Lots of
Alerts in Disparate Tools for Our Analysts to Analyze
Technology Analytics
Security Analytics
Sample Use Cases
• Malware using brute-force attempts to log in
• Malware detection acceleration due to watering-hole attack
• Traffic to/from high-risk geolocations
• Full assessment of a security breach, pulling together all relevant security and non-security events involved
• Evaluation of privileged user behavior to identify outliers from normal patterns
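As a rough illustration of the first use case above, brute-force login attempts can be flagged by counting failed logins per source inside a sliding time window. The event tuple layout, the threshold values, and the function name are assumptions made for this sketch; none of them come from the talk.

```python
from collections import defaultdict, deque


def brute_force_sources(events, max_failures=5, window_seconds=60):
    """Return source IPs with more than max_failures failed logins in any window.

    events: iterable of (timestamp_seconds, source_ip, outcome) tuples
    (a hypothetical layout chosen for this sketch).
    """
    recent = defaultdict(deque)  # source IP -> timestamps of recent failures
    flagged = set()
    for ts, src_ip, outcome in sorted(events):
        if outcome != "FAIL":
            continue
        window = recent[src_ip]
        window.append(ts)
        # Drop failures that have fallen out of the sliding window.
        while window and ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_failures:
            flagged.add(src_ip)
    return flagged


events = [(t, "10.0.0.7", "FAIL") for t in range(0, 12, 2)]
events += [(5, "10.0.0.9", "FAIL"), (30, "10.0.0.9", "OK")]
suspects = brute_force_sources(events)
```

Here `10.0.0.7` produces six failures within twelve seconds and is flagged, while `10.0.0.9` has a single failure and is not.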
Sample Use Cases
• Predict performance and workload profile for complex multi-tenant environments
• Unified dashboard that displays real-time backup status of servers and databases
• Recommend device locations and failure impact based on resiliency requirements
• Provide capacity answers to the business in real time
Primary Focus: Security
“What threats are occurring in our environment and where do we need to take action to address bad actors?”
Primary Focus: Technology
“What is the health of the Capital One environment and where do we see degradation in performance?”
Common Requirements
• Data aggregation
• Event correlation
• Data visualization & reporting
• Data enrichment
• Predictive Modeling
The Cyber-Tech Data Lake provides the data processing capabilities to meet the analytical needs of Security and Technology Operations
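Two of the common requirements listed above, data enrichment and event correlation, can be sketched in a few lines. The lookup tables, field names, and country codes below are hypothetical placeholders; a real deployment would presumably draw on a geo-IP service and the data lake's own schemas.

```python
# Hypothetical lookup tables used only for this sketch.
GEO_BY_IP = {"203.0.113.5": "XX", "198.51.100.8": "ZZ"}
HIGH_RISK_COUNTRIES = {"XX"}


def enrich(event):
    """Data enrichment: attach a country and a high-risk flag to one event."""
    country = GEO_BY_IP.get(event["src_ip"], "unknown")
    return {**event, "country": country, "high_risk": country in HIGH_RISK_COUNTRIES}


def correlate_by_ip(events):
    """Event correlation: source IPs that appear in more than one log source."""
    sources_by_ip = {}
    for event in events:
        sources_by_ip.setdefault(event["src_ip"], set()).add(event["source"])
    return {ip: sorted(s) for ip, s in sources_by_ip.items() if len(s) > 1}


events = [
    {"source": "firewall", "src_ip": "203.0.113.5"},
    {"source": "web_proxy", "src_ip": "203.0.113.5"},
    {"source": "syslog", "src_ip": "198.51.100.8"},
]
enriched = [enrich(e) for e in events]
correlated = correlate_by_ip(enriched)
```

Keeping enrichment as a pure function over a single event makes it easy to apply the same logic in both batch and streaming paths.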
The Cyber Data Lake will provide new capabilities:
• Predict Insider Threats
• Identify Cyber Criminals
• Predict Sophisticated Attacks
• Automate Incident Management
• Alert on phishing attacks
• Centralize storage
Log Data Sources
• Web Proxy
• Syslog
• Email
• Firewall
Enrichment
Visualization
Machine Learning
The Cyber Data Lake will be a Differentiator for Our Cybersecurity Program
Create value through fast prototyping.
Bridge the gap between prototype and production.
Show how open collaboration produces network effects.
Accelerate our partners’ transformation.
The Frameworks and Platform Team’s Mission Centers on Facilitating
Innovation and Transformation within the Organization
Unsupervised Learning
Supervised Learning
Supervised and unsupervised learning are two highly complementary
techniques for understanding data and building smart decisioning
Feature Engineering
Machine Learning Enables Algorithms to Learn Iteratively,
which Allows Us to Find Hidden Insights without Explicit Programming
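To make the contrast between the two techniques concrete, here is a minimal sketch using only the Python standard library. The unsupervised part flags outliers without any labels; the supervised part learns a decision threshold from labeled examples. The data, the z-score cutoff, and both function names are illustrative assumptions and are not the models used in the talk.

```python
from statistics import mean, pstdev


def zscore_outliers(values, cutoff=2.0):
    """Unsupervised: indices of values more than `cutoff` std devs from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > cutoff]


def fit_threshold(values, labels):
    """Supervised: threshold t maximizing accuracy of the rule `value >= t`."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(values)):
        hits = sum((v >= t) == bool(y) for v, y in zip(values, labels))
        acc = hits / len(values)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


# Failed-login counts per user; no labels needed for the outlier check.
login_counts = [3, 4, 5, 4, 3, 4, 5, 4, 3, 40]
outliers = zscore_outliers(login_counts)

# Labeled history: 1 = confirmed malicious, 0 = benign.
threshold = fit_threshold([1, 2, 3, 10, 12], [0, 0, 0, 1, 1])
```

The outlier check can surface candidates for analysts to label, and those labels can then feed the supervised step, which is one way the two approaches complement each other.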
Many models can be combined and applied to multiple use cases to detect
broad, complex threat patterns.
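One simple way to combine several models is a weighted average of their scores. The sketch below assumes each model emits an anomaly score between 0 and 1; the model names and weights are hypothetical, and the talk does not say which combination method was used.

```python
def combine_scores(model_scores, weights=None):
    """Weighted average of per-model anomaly scores (each assumed to be in [0, 1])."""
    if weights is None:
        weights = {name: 1.0 for name in model_scores}
    total_weight = sum(weights[name] for name in model_scores)
    weighted_sum = sum(model_scores[name] * weights[name] for name in model_scores)
    return weighted_sum / total_weight


scores = {"login_outlier": 0.9, "geo_risk": 0.5, "proxy_volume": 0.1}
equal_weighted = combine_scores(scores)
login_heavy = combine_scores(
    scores, {"login_outlier": 2.0, "geo_risk": 1.0, "proxy_volume": 1.0}
)
```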
Model build process
• Data collection
• Data exploration
• Variable reduction
• Variable cleaning
• Model selection
• Validation
• Deployment
• Documentation
Model builds are a highly iterative process composed of several universal steps
Easy to use
• Users must be able to add features easily
Highly efficient
• Product must have high performance and minimize waste due to rework and errors
Scalable
• We should have the ability to scale this to multiple applications and entities
Platform agnostic
• The attributes must be able to work on any platform: Hadoop, AWS, and potentially others
Well-governed
• Attributes must protect our IP
Based on 5 Core Principles
Leveraging H2O
Mission
Augment human judgment by harnessing machine learning
Objectives
• Best Practices: Develop implementations of established modeling best practices for Data Scientists using general-purpose programming languages (e.g., Python, Java, Scala).
• Automation: Enable end-to-end automation of a model build, including generation of risk management and regulatory artifacts, to reduce iteration times and enable more thorough analysis.
• Portability: Abstract over tool choice so analytics can be scaled from laptops to next-generation Big Data tools with minimal rework.
A supervised/unsupervised learning and model risk management framework
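The automation objective mentions generating risk management artifacts as part of the model build. A minimal sketch of that idea is to run the build as a list of named steps and record a hash of each step's output as an audit trail. The step names, the row layout, and the `run_model_build` helper are all hypothetical; the actual framework is not described at this level in the slides.

```python
import hashlib
import json


def run_model_build(steps, data):
    """Run named steps in order and record an audit entry for each output."""
    audit = []
    for name, step in steps:
        data = step(data)
        # Hash a canonical JSON form so identical outputs give identical digests.
        payload = json.dumps(data, sort_keys=True).encode("utf-8")
        audit.append({"step": name, "output_sha256": hashlib.sha256(payload).hexdigest()})
    return data, audit


steps = [
    # Variable reduction: keep only the column the model needs.
    ("variable_reduction", lambda rows: [{"failed_logins": r["failed_logins"]} for r in rows]),
    # Variable cleaning: drop rows with missing values.
    ("variable_cleaning", lambda rows: [r for r in rows if r["failed_logins"] is not None]),
]
rows = [{"failed_logins": 3, "host": "a"}, {"failed_logins": None, "host": "b"}]
result, audit = run_model_build(steps, rows)
```

Because the audit trail is produced by the same code that runs the build, rerunning the build on the same input yields the same digests, which makes the artifact easy to verify.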
How?
A supervised/unsupervised learning and model risk management framework
Objectives
• Best Practices: Work closely with Model Risk office, Decision Sciences, and Engineering teams to identify and prioritize best practices for implementation.
• Automation: Build on top of H2O, a framework for automating complex data processing workflows involving multiple frameworks.
• Portability: Develop a high-level API focused on modeling tasks, with a variety of implementations enabling tool substitution “under the hood”.
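The portability objective, a high-level API with interchangeable implementations, can be sketched with an abstract base class and two backends. Everything here is hypothetical: `ModelBackend`, both backend classes, and the toy "distance from the mean" model are stand-ins chosen for the sketch and are not H2O APIs.

```python
from abc import ABC, abstractmethod


class ModelBackend(ABC):
    """Hypothetical high-level modeling interface (not an H2O API)."""

    @abstractmethod
    def fit(self, values):
        """Train on a list of numbers and return self."""

    @abstractmethod
    def score(self, value):
        """Return the absolute deviation of `value` from the training mean."""


class InMemoryBackend(ModelBackend):
    """Laptop-scale implementation: one pass over an in-memory list."""

    def fit(self, values):
        self.mean = sum(values) / len(values)
        return self

    def score(self, value):
        return abs(value - self.mean)


class ChunkedBackend(ModelBackend):
    """Stands in for a distributed engine: aggregates partial sums per chunk."""

    def __init__(self, chunk_size=2):
        self.chunk_size = chunk_size

    def fit(self, values):
        total, count = 0.0, 0
        for start in range(0, len(values), self.chunk_size):
            chunk = values[start:start + self.chunk_size]
            total += sum(chunk)
            count += len(chunk)
        self.mean = total / count
        return self

    def score(self, value):
        return abs(value - self.mean)


def deviation_score(backend, training_values, new_value):
    """Caller code depends only on the interface, so backends can be swapped."""
    return backend.fit(training_values).score(new_value)


training = [4, 6, 5, 5]
local = deviation_score(InMemoryBackend(), training, 9)
chunked = deviation_score(ChunkedBackend(), training, 9)
```

The calling code never names a concrete engine, which is the property that lets tools be substituted under the hood with minimal rework.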
Data Extraction, Data Parsing, Feature Selection, Model Development, Model Management, Model Comparison, Model(s)
• Extract Load Transform
• Adaptors/Connectors
Data Pipeline
Format Conversion
Data Prep
• Group, sort, selection, impute, etc.
• Create tabular output for feature selection
Data Munging
Feature Imputation
• Create feature extraction routines
• Algorithms to check and validate selected features
Feature Pipeline, Model Pipeline, Deployment
Data Pipelines
Continuous Integration
• Model metrics and selection
• Model management
• Scoring Services
• Build Integration
• Pipeline Integration
Development and Deployment Pipeline using H2O
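The data prep and munging stages above (group, impute, create tabular output for feature selection) might look roughly like the following. The event fields, the choice of median imputation, and the output columns are assumptions for illustration; the slides do not specify them.

```python
from collections import defaultdict
from statistics import median


def to_feature_table(events):
    """Group raw events per user and impute missing byte counts with the median."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["user"]].append(event)

    # Imputation value: median of all observed byte counts (0 if none observed).
    known = [e["bytes"] for e in events if e["bytes"] is not None]
    fill = median(known) if known else 0

    table = []
    for user in sorted(grouped):  # sort for a stable, tabular output
        rows = grouped[user]
        total = sum(e["bytes"] if e["bytes"] is not None else fill for e in rows)
        table.append({"user": user, "event_count": len(rows), "total_bytes": total})
    return table


events = [
    {"user": "alice", "bytes": 100},
    {"user": "bob", "bytes": None},
    {"user": "alice", "bytes": 300},
    {"user": "bob", "bytes": 200},
]
feature_table = to_feature_table(events)
```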
Component Architecture – Model Building
Machine Logs
Firewall Logs
Device Logs
Log Aggregation (Raw events)
Amazon S3
Feature Pipeline
Model Pipeline
Row
Incremental Batch
Large Batch
User Interface
Alerts Batch Processing API
Data Pipeline and Munging
Incremental Load
In-Memory Data store
Feature Extraction
Streaming Data Integration
Feature Imputation
H2O Model Execution Pipeline – Batch & Real Time
Real Time Events
DStream (Raw Data over time window)
Sparkling Water
UI
Spark Streaming
Spark RDD
H2O Frame
Raw Data
H2O Frames (Feature Data using Feat-Ext.py)
Bolt
Feat-Ext.py
Bolt
Storm
H2O POJO
S3 Events
Sparkling Water
Feat-Ext.py
Row
Incremental Batch
Large Batch
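The real-time path in this pipeline groups raw events over a time window, runs feature extraction, and scores the result with an exported model. The sketch below imitates that flow in plain Python so it can run anywhere; it does not use Spark Streaming, Storm, Sparkling Water, or an H2O POJO. The event layout, the fixed ten-second window, and the threshold "model" are all assumptions.

```python
from collections import Counter, defaultdict


def window_batches(events, window_seconds):
    """Group (timestamp, user, outcome) events into fixed, non-overlapping windows."""
    batches = defaultdict(list)
    for ts, user, outcome in events:
        batches[ts // window_seconds].append((user, outcome))
    return [batches[key] for key in sorted(batches)]


def extract_features(batch):
    """Stand-in for a feature-extraction script: failed-login count per user."""
    return Counter(user for user, outcome in batch if outcome == "FAIL")


def score(features, threshold=3):
    """Stand-in for an exported model: alert on users at or above the threshold."""
    return sorted(user for user, count in features.items() if count >= threshold)


events = [
    (1, "alice", "FAIL"), (2, "alice", "FAIL"), (3, "alice", "FAIL"),
    (4, "bob", "OK"), (12, "bob", "FAIL"), (15, "alice", "FAIL"),
]
alerts = [score(extract_features(batch)) for batch in window_batches(events, 10)]
```

The first window contains three failures for `alice` and raises an alert; the second window has one failure per user and raises none. Keeping feature extraction separate from scoring mirrors the diagram, where the same extraction script appears in both the batch and real-time paths.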
H2O Model Execution Pipeline – Batch & Real Time
AUTOMATE RELENTLESSLY
Automated processes are testable, less error prone, and clear away drudgery to make space for creativity.
STRIVE FOR REPRODUCIBILITY
It enables results to be validated and built upon. Our data products touch the financial lives of millions.
BE OPEN
Build for openness, insist that your work be of value to others, and enjoy the network effects.
EXHIBIT TECHNICAL LEADERSHIP
Team leaders are hands-on and write great code. Performers see themselves as architects generating building blocks of enduring value.
Our Methodology Reflects a Commitment to Usability and Collaboration
• Frees up our risk officers and data scientists to solve business problems rather than shepherd individual tasks
• Encodes the accepted best practices of the risk and modeling communities
• Building blocks have a unified API, which allows developers to handle the newest technologies while letting users explore their business value
• Analysis is in code, hence reproducible, loggable, testable, and under version control
Automation has many benefits
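A small example of what "analysis in code, hence reproducible" can mean in practice: a train/test split driven by an explicit seed always produces the same partition, so results can be validated by rerunning the code. The function name, the default fraction, and the seed value are arbitrary choices for this sketch.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=20160720):
    """Deterministic shuffle-and-split: same rows and seed give the same split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # private RNG, no global state touched
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


rows = list(range(8))
train_a, test_a = split_train_test(rows)
train_b, test_b = split_train_test(rows)
```

Using a private `random.Random` instance rather than the module-level functions keeps the split independent of any other random calls in the program.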
What To Remember
Building a Real-Time Security Application Using Log Data and Machine Learning- Karthik Aaravabhoomi, Capital One

