Reaching scale limits on a Hadoop platform.
Lessons from a journey of speed and agility
Data innovation
Santander UK
Customer
Technology
Data People
Nici Bullivant
Head of Engineering
Data innovation| Santander UK
Santander UK is a scale challenger
150 years of banking in our DNA. Culture of retail banking. Moderate
risk. A very robust and traditional bank… with a culture of technology.
5th biggest bank in the UK. Growth by successful mergers. From 2004
to 2013, Santander bought Abbey, Bradford & Bingley,
Alliance & Leicester and other smaller portfolios.
Now transforming into a data-driven organisation.
Santander UK
Data transformation
journey
Proof of Concept
• 3 months
• Test, find potential,
identify limits, plan the
future
• Hadoop, Hive, Impala,
Spark, Oozie, Hue, Solr
T+3months
Analytics
• Get a tool immediately
• Data, business and
technology in
collaboration
T+5months
Customer facing product - Spendlytics
• The showcase: real time
analytics for customers
• Building, installing, breaking,
fixing
• Flume, Kafka, HBase, APIs
T+9months
Broadening
• Early adopters
• Feedback and improve
• Open up for business
• Establish the foundations
for growth
T+12months
Scaling up
• Expansion of data: 10X; Projects:
12X; Users: 30X
• Continuously accelerate by
automating and reusing
• Define value and build the
operating model around it
T+24months
Democratisation
• Guided self service
• Data is a public good
• Love your data
• Care for your data
T+24months
The Data Driven Organisation
• Scientific
• Embedded
• Real time
• Cultural change
T+?
Times of
challenge and
transformation
{Our current scale}
{3,500 tables regularly ingested, Over 20,000 tables}
{200+ tables streaming from core banking}
{3+ PB in total}
{30+ products in Production}
{1,000 users}
{50 BI applications in development}
{CDH 5.5.4 after upgrade from CDH 5.4.2 early 2017}
ACQUIRE EXPLOIT
UNDERSTAND
Trace
Describe
Engines
Visualise
(MicroStrategy)
Iterate
(notebooks)
Organise
Analyse
(SAS)
Model
(DataRobot)
Ingest Certify
Capture
(CDC)
Stream
SERVE
Interfaces
New
databases
APIs
MANAGE
Access Monitor Support Evolve Operate
Data
from
source
systems
Analyst
Internal
user
Maintain the
information at
rest.
STORE
Process the data.
COMPUTE
Govern
Discover
Technology - Conceptual architecture
Applications
Data
scientist
Limits to performance
Metastore - too many partitions, and ungraceful stops
Namenode - large memory footprint
Small files, Joining files
Navigator & Replication
HBase
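The NameNode and small-files limits above can be illustrated with a rough sizing sketch. It assumes the widely quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file or block); that figure, the 128 MiB block size, and the function names are assumptions for illustration, not numbers from this talk.

```python
import math

MB = 1024 * 1024

# Assumption: commonly cited rule of thumb, not a measured value.
BYTES_PER_OBJECT = 150


def namenode_objects(file_sizes, block_size=128 * MB):
    """Count namespace objects: one per file plus one per block."""
    blocks = sum(math.ceil(size / block_size) for size in file_sizes)
    return len(file_sizes) + blocks


def estimated_heap_bytes(file_sizes, block_size=128 * MB):
    """Rough NameNode heap needed to track the given files."""
    return namenode_objects(file_sizes, block_size) * BYTES_PER_OBJECT


def join_files(file_sizes, target_size=128 * MB):
    """Simulate joining files: same total bytes, packed into target-sized files."""
    full, rest = divmod(sum(file_sizes), target_size)
    return [target_size] * full + ([rest] if rest else [])


small = [1 * MB] * 1280      # 1,280 files of 1 MiB each
joined = join_files(small)   # the same bytes in 128 MiB files

print(namenode_objects(small), estimated_heap_bytes(small))
print(namenode_objects(joined), estimated_heap_bytes(joined))
```

Under these assumptions the same volume of data needs far fewer namespace objects once the files are joined, which is why small files put pressure on NameNode memory long before disk capacity runs out.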
Overcoming the
limits
{But upgrading is a challenge in itself}
{Duplicate environments}
{Recompile and deploy}
{Use upgrade to surface latent business products}
{Automate, automate, automate}
Lessons from our
upgrade experience
UPGRADE!
Lessons from the
journey
{Is cheap and is now mature}
{Improvements in security, management,
operation}
{New processing frameworks: Spark, Flink…}
{Streaming reaches maturity}
{More and more integrated tools}
{Going into Big data now could mean
leapfrogging the early entrants}
{Incremental data}
{Snapshots occupy too much space}
{Incremental copies are more difficult to manage but
make better use of space}
{Incremental loads still run full table scans, unless you
implement Change Data Capture}
{…}
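The trade-off between the two incremental approaches can be sketched in a few lines. This is a minimal illustration using in-memory dictionaries keyed by primary key; the function names and the (operation, key, row) change format are hypothetical, not the platform's actual implementation.

```python
def diff_snapshots(previous, current):
    """Derive an increment by comparing two full snapshots.

    Every row in both snapshots is visited, which is the full-scan cost
    that remains when the source does not publish its changes.
    """
    changes = []
    for key, row in current.items():
        if key not in previous:
            changes.append(("insert", key, row))
        elif previous[key] != row:
            changes.append(("update", key, row))
    for key in previous:
        if key not in current:
            changes.append(("delete", key, None))
    return changes


def apply_changes(table, changes):
    """Apply a CDC-style change stream; work scales with the number of changes."""
    result = dict(table)
    for operation, key, row in changes:
        if operation == "delete":
            result.pop(key, None)
        else:
            result[key] = row
    return result


previous = {1: "a", 2: "b", 3: "c"}
current = {1: "a", 2: "B", 4: "d"}

changes = diff_snapshots(previous, current)
print(changes)
print(apply_changes(previous, changes) == current)
```

With Change Data Capture the source emits the change list directly, so only `apply_changes` runs on the platform and the comparison of full snapshots is no longer needed.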
{Automate everything}
{The only path to speed and scale is automation}
{Architecture as a Service}
{Test driven development, test automation}
{CI/CD: Automation of deployment}
{Compliance as a Service:
Automation of documentation}
{Machine learning for Data Governance}
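As one possible reading of "automation of documentation" combined with test automation, the sketch below renders a plain-text data dictionary from table metadata and flags columns that lack a description, so the check can fail a build. The schema layout, table name and field names are hypothetical and chosen only for illustration.

```python
def undocumented_columns(schema):
    """Return the names of columns that have no usable description."""
    return [
        column["name"]
        for column in schema["columns"]
        if not (column.get("description") or "").strip()
    ]


def render_data_dictionary(schema):
    """Render a plain-text data dictionary from table metadata."""
    lines = [f"Table: {schema['table']}", f"Owner: {schema['owner']}", "Columns:"]
    for column in schema["columns"]:
        description = (column.get("description") or "").strip() or "MISSING DESCRIPTION"
        lines.append(f"  {column['name']} ({column['type']}): {description}")
    return "\n".join(lines)


# Hypothetical metadata for an ingested table.
schema = {
    "table": "card_transactions",
    "owner": "data-innovation",
    "columns": [
        {"name": "txn_id", "type": "string", "description": "Unique transaction id"},
        {"name": "amount", "type": "decimal", "description": ""},
    ],
}

print(render_data_dictionary(schema))
print(undocumented_columns(schema))
```

Running a check like this in the deployment pipeline keeps the documentation generated from the same metadata that drives ingestion, rather than maintained by hand.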
Generate value quickly
Identify use cases that are valuable, create learning and have a wow
factor.
As use cases increase, define products that need to remain and
patterns that repeat.
For those patterns, create services by using the learnings and tools
delivered by the use cases.
And maintain the fresh water…
Get original data
Minimise data duplication and inconsistencies
Speed of Innovation and Self-service overtakes governance
Balance self service with self governance
Cultural transformation slower than technology change
With value comes growth.
Growth creates a challenge of scale.
Scale requires planning and design.
A robust foundation is necessary for a data platform
to maintain agility.
Make your foundation flexible too.
Find a simple way to establish security, governance,
architecture and automation.
Thank you…
and get in touch
nicolette.bullivant@santander.co.uk
P.S. We are recruiting
https://www.santandertechnology.co.uk/businessareas/data-innovation

Reaching scale limits on a Hadoop platform: issues and errors created by speed and agility

Editor's Notes

  • #3: Antonio has an extensive background in consultancy and financial services. With a strong track record of leading international transformation projects, he integrated many legacy platforms and processes into the Santander strategic solution. Now he is the visionary leading Santander UK's transformation into a digital, data-driven organisation, transforming its technology, processes and culture.
  • #20: But what about launching your own new platform? Now, that is a game changer. I believe innovators don’t set out to launch a platform; they may have a vision of a platform in their subconscious mind and, by launching their products, advance towards that vision. APIs, microservices, service-oriented architecture.
  • #23: How do you integrate governance into continuous delivery and scale it up (traditionally, a central team controls the checkpoints)? The traditional fight over the customer number: convince Risk, Finance and Marketing to share their customer view so that the differences can be understood together. Consistency, diversity understood, synergies found. The cop and the villain of the movie.
  • #24: Speed of growth also caused chaos, as knowledge did not spread fast enough. Quick productionisation in innovation and self-service skipped governance; controls are then requested after a failure instead of proactively, and it is too late to fix the design if governance was not involved in the earlier phases. Democratisation reduces the complexity of access, and new areas have more data and more opportunities, but these new areas need to be educated to do the right thing (heavy regulations and responsibilities in financial services). The biggest challenge is transforming the culture: “carrot and stick”, give them a benefit so that they WANT to collaborate. How do you evidence benefits when you are starting?