THE PATH TO AUTOMATION IN A HIGHLY REGULATED SPACE
Paul Wilkinson and Naveen Gupta
© Cloudera, Inc. All rights reserved.
SPEAKERS (1)
Naveen Gupta
Background:
• Architect / Developer ~ 20 Years
• Financial Services ~ 14 Years
Areas of Interest:
• Enterprise Risk and Data Governance
• Regulatory Demand in Market Risk
• Data Architecture
• White-Label Product Development in Risk
Professional Experience:
• Financial Services, Software, Civil Aviation, Healthcare
Senior Vice President
Market Risk, Deutsche Bank
SPEAKERS (2)
Paul Wilkinson
Background:
• Architect / Developer ~ 15 Years
• Big Data ~ 10 Years
Areas of Interest:
• Data Engineering at Scale
• Platform Architecture
• Application Development Specialist
Professional Experience:
• DB Resident Solution Architect ~ 3 Years
• Financial Services, Cyber Security, Telco
Principal Solution Architect
Professional Services, Cloudera
Author: Architecting Modern Data Platforms
INTRODUCTION
CLOUDERA AND FINANCIAL SERVICES
19 Banks “Too Big to Fail” (G-SIB*)
12 Banks with Risk Use Cases
10,000+ Nodes Deployed
5+ Years in Production
(* Global, Systemically Important Bank)
BIG DATA USE CASES IN FINANCIAL SERVICES
Customer Journey/360
• Customer Loyalty, Retention
• Personalization
• Next Best Offer
Risk & Compliance
• Risk Analytics, Underwriting, Actuarial
• BASEL, BCBS-239, FRTB, IFRS-9
• Risk Aggregation, Stress Testing
Financial Crime
• AML, Payments Fraud
• Insider Threats
• Cyber Security
Product & Service Efficiencies
• Investment Research
• Portfolio Analytics, Robo-Advisors
• IoT, Blockchain, ML/AI
Modern Data Architecture
• ETL/Warehouse Optimization, Active Archive, Storage Optimization & Embedded Analytics
POST-CRISIS REGULATIONS AFFECTING RISK TECHNOLOGIES
Each crisis leads to a desire to prevent future occurrences:
• Increased stress testing, more scenarios (CCAR).
• Historical simulation of VaR (HISTSIM)
• Backtesting requirements (US Basel III)
• Improved risk data, technology and process controls (BCBS-239).
• Individual model regulatory review and back-testing (FRTB).
REGULATORY IMPACT (1)
Huge Storage and Processing Growth – FRTB Estimates
Data Storage Impact
Position Size | Basel 2.5: 1-Yr Size (TB) | Basel 2.5: 5-Yr History (PB) | FRTB: 1-Yr Size (TB) | FRTB: 5-Yr History (PB)
100,000 Positions | 3 | 0.15 | 72 | 3.6
1 Million Positions | 30 | 1.5 | 720 | 36
Computational Impact
Position Size | Regulation | Type | Computations
100,000 Positions | Basel 2.5 | Scenario Valuations | 50 Million (Daily) & 50 Million (Week)
100,000 Positions | FRTB | Scenario Valuations | 1.57 Billion (Daily)
Approximately 20x increase in Historical Data Storage Requirements
Approximately 30x increase in Computational Requirements
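The growth factors quoted on this slide can be checked against the table's own figures. The following is a minimal sketch only: the variable names are illustrative, and the computational comparison assumes that the daily Basel 2.5 valuation count (50 million) is the right baseline for the FRTB daily figure.

```python
# Figures copied from the FRTB estimates table (100,000 positions).
basel_1yr_tb = 3.0
frtb_1yr_tb = 72.0
basel_5yr_pb = 0.15
frtb_5yr_pb = 3.6

# Storage growth from Basel 2.5 to FRTB.
storage_growth_1yr = frtb_1yr_tb / basel_1yr_tb
storage_growth_5yr = frtb_5yr_pb / basel_5yr_pb

# Scenario valuations per day (assumption: daily figures are compared).
basel_daily_valuations = 50e6
frtb_daily_valuations = 1.57e9
compute_growth = frtb_daily_valuations / basel_daily_valuations

print(f"storage growth (1 yr): {storage_growth_1yr:.0f}x")
print(f"storage growth (5 yr): {storage_growth_5yr:.0f}x")
print(f"compute growth (daily): {compute_growth:.1f}x")
```

On these inputs both storage ratios work out to 24x and the daily computation ratio to roughly 31x, which is in line with the rounded "approximately 20x" and "approximately 30x" headlines.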
REGULATORY IMPACT (2)
BCBS-239: Principles for Effective Risk Data Aggregation and Risk Reporting
• Overarching Principles
  • Governance
  • Data Architecture and IT Infrastructure
• Risk Data Aggregation
  • Accuracy and Integrity
  • Completeness
  • Timeliness
  • Adaptability
• Risk Reporting Practices
  • Accuracy
  • Comprehensiveness
  • Clarity
  • Frequency
  • Distribution
Significant impact on system design
THE EARLY JOURNEY
EARLY ADOPTION AT DEUTSCHE BANK
Challenges and Successes
People
• Difficult to find talent
• Most developers had a relational mind-set
Suppliers
• Few suppliers had capabilities
Technology
• Apache Hadoop & Spark Standalone – no YARN
• Open source with no vendor support
Successes!
• Delivered portfolio stress testing!
• Historical simulation ingestion
• Platform gained wider acceptance
• Still needed to build an enterprise-class capability
• Beginning of engagement with Cloudera PS
EARLY RISK PLATFORM POC ~ 5 YEARS AGO
Stress Testing, Historical and Monte Carlo Simulation
• Poor Performance and Scalability
  • Lots of resource contention
  • Inconsistent SLAs
  • Platform bugs
• Slow time-to-market
  • Bespoke solutions
  • No component reusability
• No Automation
  • Manual QA reconciliation
  • Long release process
• No Security or Governance
  • Compliance issue
• Unsupported Platform
  • Difficult to get issues resolved
  • Platform support used developer time
• Lack of Production Sandbox Capability
  • Lack of resource management
  • Activity contended with production
• Lack of Standardisation
  • Segregated models for each function
  • Implicit Data Glossary
MODERNISING RISK DATA ANALYSIS
Architect: Naveen Gupta
“A CENTRALLY-MANAGED, UNIFIED DATA PLATFORM THAT COORDINATES RISK PROCESSING.”
RISK DATA SERVICES - VISION
• Shorten time-to-market
• For data acquisition, processing and distribution
• Provide workflow management
• Decouple processing to enable scheduling
• Standardise information architecture
• Provide monitoring & notifications
• Dismantle Legacy Systems
• Accelerate Strategic Delivery
• Rollout Strategy
• Decommission
• Improve Efficiency
• Process data with continuous context
• Provide Standardised Data Models
• Conceptual, logical and physical
• Glossary for data discovery
• Increased agility
• Measure Data Quality
• Record-level and dataset-level acceptance criteria
• Automated data profiling
• Quality assessment & reporting
• Implement Data Lineage
• Full traceability from acquisition to distribution
• Change management
• Automate Testing
• Continuous integration
• Automated QA reconciliation
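The "record-level and dataset-level acceptance criteria" named under Measure Data Quality can be pictured as two layers of checks. This is a hedged sketch, not the platform's implementation: the field names (`trade_id`, `notional`), the thresholds and both function names are hypothetical.

```python
def check_record(record):
    """Record-level criteria (illustrative): an id is present and the notional is non-negative."""
    return record.get("trade_id") is not None and record.get("notional", -1) >= 0


def check_dataset(records, min_rows, max_reject_ratio):
    """Dataset-level criteria (illustrative): enough rows and few enough rejected records."""
    valid = [r for r in records if check_record(r)]
    rejected = len(records) - len(valid)
    reject_ratio = rejected / len(records) if records else 1.0
    return {
        "rows": len(records),
        "rejected": rejected,
        "accepted": len(records) >= min_rows and reject_ratio <= max_reject_ratio,
    }


records = [
    {"trade_id": "T1", "notional": 100.0},
    {"trade_id": None, "notional": 5.0},    # fails: missing id
    {"trade_id": "T3", "notional": -1.0},   # fails: negative notional
    {"trade_id": "T4", "notional": 0.0},
]
report = check_dataset(records, min_rows=3, max_reject_ratio=0.25)
print(report)
```

Here half of the records are rejected, so the dataset fails its acceptance threshold even though the row count is sufficient; a report like this is the kind of input that quality assessment and reporting could build on.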
MOTIVATING EXAMPLE: WORKLOAD MANAGEMENT
Direct Execution
• Consider a simple two-stage workload:
1. Load CSV data into HDFS staging:

    hdfs dfs -put localfiles/*.csv staging

2. Run a Spark ETL job to load into Hive/Parquet:

    spark-submit \
      --class <main-class> \
      --master <master-url> \
      --deploy-mode <deploy-mode> \
      --conf <key>=<value> \
      ... # other options
      <application-jar> \
      [application-arguments]

[Diagram: Risk Components, Metadata Driven Automation: Metadata, Data Processing, Egress, Metrics, Alerts, Notifications, SLA Management, Housekeeping, Data Quality, Test Automation, Ingress]
[Chart: Resources over Time, showing LOAD then PROCESS against the platform Max]
MOTIVATING EXAMPLE: WORKLOAD MANAGEMENT
The Problem with Direct Execution
• Users can execute workloads at arbitrary times
• Without YARN, this can lead to platform contention and instability
• No contextual understanding of the data / action
• Did the process run at the right time?
• Were the right number of results produced?
• Are there downstream processes that should now be run?
• The context is implicit
• SLAs were difficult to guarantee to the business
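One way to picture what "making the context explicit" could mean is to record the expectations for a run and check them afterwards. The sketch below is an assumption-laden illustration: `RunContext`, `review_run`, the process names and the time window are all hypothetical and do not come from the platform described in the talk.

```python
from dataclasses import dataclass, field
from datetime import datetime, time


@dataclass
class RunContext:
    """Hypothetical record of what a process run was expected to do."""
    process: str
    window_start: time          # earliest acceptable start
    window_end: time            # latest acceptable start
    expected_rows: int          # row count the run should produce
    downstream: list = field(default_factory=list)


def review_run(ctx: RunContext, started_at: datetime, rows_written: int) -> dict:
    """Answer the three questions that direct execution leaves implicit."""
    on_time = ctx.window_start <= started_at.time() <= ctx.window_end
    complete = rows_written == ctx.expected_rows
    return {
        "ran_at_right_time": on_time,
        "right_number_of_results": complete,
        # Only trigger follow-on work when the run itself looks healthy.
        "downstream_to_run": list(ctx.downstream) if on_time and complete else [],
    }


ctx = RunContext(
    process="load_positions",
    window_start=time(1, 0),
    window_end=time(3, 0),
    expected_rows=100_000,
    downstream=["stress_test", "histsim"],
)
result = review_run(ctx, datetime(2018, 11, 13, 2, 15), rows_written=100_000)
print(result)
```

With direct execution none of these expectations are written down anywhere, which is why the questions on this slide cannot be answered automatically.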

[Chart: Resources over Time, showing several overlapping LOAD and PROCESS workloads plotted against the platform Max]
WORKLOAD MANAGEMENT
Alternative Approach: Managed Execution
• What if we decoupled ingress and processing?
• A workload manager could run processes intelligently…
… but what else could it do?
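As a rough illustration of managed execution, a workload manager can hold submitted work in a queue and start it only when capacity allows. The sketch below is a deliberately simple FIFO admission loop under a fixed core limit; the `Job` type, the tick-based clock and the numbers are assumptions for illustration, not the scheduling policy used on the platform.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int      # resources the job needs while running
    duration: int   # run time in abstract ticks


def schedule(jobs, max_cores):
    """Start queued jobs in FIFO order without ever exceeding max_cores.

    Returns a list of (start_tick, job_name) pairs.
    """
    for job in jobs:
        if job.cores > max_cores:
            raise ValueError(f"{job.name} can never fit within {max_cores} cores")
    pending = deque(jobs)
    running = []            # (finish_tick, job) pairs
    starts = []
    tick = 0
    while pending or running:
        # Release resources held by jobs that have finished.
        running = [(end, j) for end, j in running if end > tick]
        used = sum(j.cores for _, j in running)
        # Admit queued jobs while the head of the queue still fits.
        while pending and used + pending[0].cores <= max_cores:
            job = pending.popleft()
            running.append((tick + job.duration, job))
            starts.append((tick, job.name))
            used += job.cores
        tick += 1
    return starts


jobs = [
    Job("LOAD-A", cores=4, duration=2),
    Job("PROCESS-A", cores=8, duration=3),
    Job("LOAD-B", cores=4, duration=2),
    Job("PROCESS-B", cores=8, duration=3),
]
for start, name in schedule(jobs, max_cores=12):
    print(start, name)
```

The second LOAD and PROCESS are held back until earlier work releases its cores, so total usage stays at or below the limit instead of spiking when users submit at arbitrary times.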

[Chart: Resources over Time, showing a Workload Manager ahead of the LOAD and PROCESS workloads, plotted against the platform Max]
WORKLOAD MANAGEMENT
Metadata-Driven Execution
The platform:
• Chooses when to run processes
  • SLA / QoS Management
• Configures them at runtime
• Monitors their execution
• Alerts operators when needed
• Notifies up- and down-stream systems
• Performs follow-on work:
  • Data quality analysis
  • Completeness & accuracy
  • Retention & Replication to DR
• Is complementary to YARN Resourcing
• Provides a powerful data context
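The idea that the platform configures processes at runtime and performs follow-on work can be sketched as expanding a metadata entry into an ordered plan. Everything named here is hypothetical: the metadata keys, dataset paths, step names and the `plan_execution` function are illustrative assumptions, not the platform's schema.

```python
# Hypothetical process metadata; keys and values are illustrative only.
PROCESS_METADATA = {
    "etl_positions": {
        "input": "staging/positions",
        "output": "warehouse/positions",
        "follow_on": ["data_quality", "retention"],
        "notify": ["downstream-risk-reports"],
    },
}


def plan_execution(process_name, business_date, metadata=PROCESS_METADATA):
    """Expand a metadata entry into an ordered list of platform actions."""
    spec = metadata[process_name]
    # Runtime configuration: the caller supplies only the business date.
    config = {
        "input": f"{spec['input']}/{business_date}",
        "output": f"{spec['output']}/{business_date}",
    }
    actions = [("run", process_name, config)]
    actions += [
        ("follow_on", step, {"dataset": config["output"]})
        for step in spec["follow_on"]
    ]
    actions += [
        ("notify", target, {"process": process_name})
        for target in spec["notify"]
    ]
    return actions


plan = plan_execution("etl_positions", "2018-11-13")
for action in plan:
    print(action)
```

Because the plan is derived from metadata rather than hard-coded in each job, adding a follow-on step or a notification target becomes a metadata change rather than a code change.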
[Diagram: platform components: Workflow Management, Data Processing, Egress, Metrics, Alerts, Notifications, SLA Management, Metadata, Sandbox, State Management, Data Management, Data Quality, Test Automation, Workload Replay, Ingress]
METADATA / DATASET ABSTRACTION
The risk platform maintains metadata:
• Datasets: Storage system, location, format, partitioning
• Processes: Spark Scripts, HDFS / Hive actions
• Reports: Impala SQL queries, Spark DataFrames
Many clients access data through REST:
• Provides API stability
• Hides implementation detail
• Provides runtime format conversion
• Enables direct HDFS access for bulk data egress
Multiple storage adapters:
• Available for Spark, Impala, Hive, Kafka, …
[Sequence diagram: participants Client, Platform, Impala; steps: fetch, query, de-abstract, resultset, render]
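The de-abstraction step can be pictured as a lookup from a logical dataset name to storage-specific details, followed by query generation. This is a sketch under stated assumptions: the `Dataset` fields mirror the metadata listed on the slide (storage system, location, format, partitioning), but the catalog contents, the table name `risk.positions` and the `de_abstract` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical metadata entry for one logical dataset."""
    storage: str        # e.g. "impala", "hdfs"
    location: str       # table name or path
    fmt: str            # e.g. "parquet"
    partition_key: str  # column the data is partitioned by


CATALOG = {
    "positions": Dataset("impala", "risk.positions", "parquet", "business_date"),
}


def de_abstract(name: str, partition_value: str, catalog=CATALOG) -> str:
    """Translate a logical dataset request into a storage-specific query."""
    ds = catalog[name]
    if ds.storage != "impala":
        raise NotImplementedError(f"no adapter sketched for {ds.storage}")
    # Real code should use bound parameters rather than string formatting.
    return f"SELECT * FROM {ds.location} WHERE {ds.partition_key} = '{partition_value}'"


query = de_abstract("positions", "2018-11-13")
print(query)
```

A client asks only for "positions" on a given date; where the data lives and how it is partitioned stay behind the API, which is what allows the storage layout to change without breaking clients.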
CURRENT RISK PLATFORM
Stress Testing, Historical Simulation, FRTB, US Basel III, CCAR
• High Performance and Scalability
  • YARN Resource Management
  • Very large dataset processing now possible
  • 10x-20x faster, depending on use case
• Fast time-to-market
  • Metadata-driven framework
  • Component reusability
  • Impact analysis / workflow tracking
• Automation
  • Fully automated QA analysis
  • Load and performance benchmarking
• Strong Security and Governance
• Supported Platform
  • Internal Managed Hadoop Service
  • 24 x 5 Support from Cloudera
  • Full audit compliance
• Large Scale Deployment
  • Multiple petabytes
  • 1,000s of cores
• Production Sandbox Capability
  • Faster data exploration
  • Activity segregated from production
• Standardisation
  • Built-in data lineage / glossary
THE NEXT STAGE OF THE JOURNEY…
RISK PLATFORM: FUTURE WORK
• Framework Enhancements
  • Robotic Process Automation (RPA)
  • Self-service access for end-users
• Increasing Scale and Adoption
  • 15-20x data volume expected
  • Ever increasing regulatory demands
• More Automation
  • Machine-learning for release management
  • End-to-end container-based testing
• Data Science
  • Sandbox capability extended to DS users
  • Opening up greater access to historical data
  • CDSW now available in lab!
WE’RE HIRING!
• Looking for:
  • Architects
  • Data Engineers
  • Data Scientists
  • Developers
• Get in touch!
  • Email: naveen-a.gupta@db.com
  • In Person: Come over and chat!
db.com/careers
THANK YOU

Big Data LDN 2018: DEUTSCHE BANK: THE PATH TO AUTOMATION IN A HIGHLY REGULATED SPACE
