Simply Business’
Data Platform
By Dani Solà
Table of contents
1. Introductions
2. Some context
3. Data platform evolution
4. Cool stuff we’ve done
5. Lessons learned
6. Peeking into the future
7. References
1. Introductions
Nice to meet you
Hello! I’m Dani :)
This is Simply Business
● Largest UK business insurance provider
● Over 450,000 policyholders
● Using BML, tech and data to disrupt the business insurance market
● Acquired in 2016 (£120M) and again by Travelers in 2017 (£402M)
● #1 best company to work for in 2015 and 2016, among other awards
● Certified B Corporation since 2017
2. Context, context, co...!
Is everything
Mission:
To enable Simply Business to create value through data
Data Environment - The 5Vs
● ⏬ Low volume: ~1M events/day
● High variety: nearly 100 event types and growing
● High velocity: sub-second for apps that need it
● ⏫ High veracity: using strong schemas for most data points
● ⏫ High value: as a data-driven company, all departments use data on a daily basis
Data and Analytics team values
● Simplicity: simple is easier to maintain and understand (it’s hard!)
● Adaptability: data tools and techniques change very fast, don’t fight it
● Empowerment and self-serve: we provide a platform that makes the easy things easy
● Pioneering: we push the boundaries of what’s possible with data
Data Platform Capabilities
● KPIs and MI: obviously
● Product Analytics: understand how our products perform
● Customer Analytics: understand how our customers behave
● Experimentation Tools: to test all our assumptions
● Data Integration: bringing all our data in one place
● Customer Comms: it’s very data intensive
● Machine Learning: because understanding the present is not enough!
Analytics usage
3. Data platform evolution
“Change is the only constant” - A data engineer
The batch days: 2014-2015
Team: 2-3 data platform engineers
Tech:
● Vanilla Snowplow Analytics for the event pipeline, which ran on EMR
● Homegrown Change Data Capture (CDC) pipeline to flatten MongoDB collections
● Looker for web and product analytics, SQL Server for top-level KPIs
[Architecture diagram] Sources: Website, MongoDB, Adwords, Email, ... | Ingest: Event Collector, Change Data Capture, Batch Importer | Process: Scalding on EMR (hourly job) | Store: S3, Redshift | Serve: Data modelling (cron jobs), Batch Exporter
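To make the batch layer concrete, here is a minimal sketch of the kind of hourly Scalding job we mean, assuming tab-separated Snowplow enriched events as input; the column positions and job structure are illustrative, not the production code:

```scala
import com.twitter.scalding._

// Hypothetical hourly job: read Snowplow enriched events (TSV) from S3,
// keep the columns the warehouse needs, and write them back out for the
// Redshift load step. Column positions are assumptions.
class FlattenEventsJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .map(_.split("\t", -1))
    .collect { case cols if cols.length >= 3 =>
      (cols(0), cols(1), cols(2)) // event_id, event_type, collector_tstamp
    }
    .write(TypedTsv[(String, String, String)](args("output")))
}
```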
NRT first steps: 2016-2017
Team: 3-4 data platform engineers
Changes:
● We added an NRT pipeline to expose event data back to transactional apps
● We used Kinesis as the message bus; we didn’t want to manage anything
● The data is stored in MongoDB for real-time access
[Architecture diagram] Sources: Website, MongoDB, Adwords, Email, ... | Ingest: Event Collector, Change Data Capture, Batch Importer | Process: Scalding on EMR (hourly job), Spark Streaming (4s batches) | Store: S3, Redshift, MongoDB | Serve: Data modelling (cron jobs), Batch Exporter, API
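A hedged sketch of what a 4-second Spark Streaming consumer on Kinesis looks like; the stream, app and region names are made up, and the MongoDB write is replaced by a placeholder sink:

```scala
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

object NrtEventConsumer {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("nrt-events"), Seconds(4))

    // Stream, application and region names are hypothetical.
    val records = KinesisUtils.createStream(
      ssc, "nrt-events-app", "event-stream",
      "https://kinesis.eu-west-1.amazonaws.com", "eu-west-1",
      InitialPositionInStream.LATEST, Seconds(4), StorageLevel.MEMORY_AND_DISK_2)

    records
      .map(bytes => new String(bytes, "UTF-8")) // raw event JSON
      .foreachRDD { rdd =>
        // In the real pipeline each micro-batch is upserted into MongoDB;
        // a stand-in sink keeps the sketch self-contained.
        rdd.foreachPartition(_.foreach(println))
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Kinesis checkpointing is DynamoDB-backed, which is what lets a consumer like this run without us managing any queueing infrastructure ourselves.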
Current pipeline: 2017-2018
Team: 4-5 data platform engineers
Tech:
● We have gone NRT by default: there’s no batch layer
● We’ve introduced Airflow for batch job orchestration
● We’ve got rid of S3 to comply with GDPR without having to fiddle with files
[Architecture diagram] Sources: Website, MongoDB, Adwords, Email, ... | Ingest: Event Collector, Change Data Capture, Batch Importer | Process: Spark Streaming (3min batches), Spark Streaming (4s batches) | Store: Redshift, MongoDB | Serve: Data modelling (Airflow), Batch Exporter, API
Potential changes in the near future
Migrate from Spark Streaming to Kafka Streams:
● Streaming-native API, much more powerful than Spark’s
● No need for external storage for stateful operations
● No need for a YARN or Mesos cluster: any JVM app can have a streaming component
● Can expose APIs to other services!
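As an illustration of the stateful point above, a minimal Kafka Streams sketch (topic names are assumptions) that keeps a running count per event type in a local, changelog-backed store instead of an external database:

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._

// Counts events per type; state lives in a local RocksDB store backed by a
// changelog topic, so no external storage is needed. Topic names are made up.
object EventTypeCounts extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-type-counts")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()
  builder
    .stream[String, String]("events") // key assumed to be the event type
    .groupByKey
    .count()
    .toStream
    .to("event-type-counts")

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```

The same store can then be queried over HTTP via interactive queries, which is what makes the “expose APIs to other services” bullet possible.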
Potential changes in the near future
Migrate from Redshift to Snowflake:
● Decoupling storage from processing
● Handles semi-structured data natively
● Lets us isolate workloads much better
● Near-instant scaling, including stopping it when no one is using the cluster
● Infinite storage!
Potential changes in the near future
Migrate from EMR to Databricks for Spark batch jobs:
● Would allow us to have a dedicated cluster per app
● Easier to upgrade to newer Spark versions
● No cluster maintenance required, they’re transient
[Architecture diagram] Sources: Website, MongoDB, Adwords, Email, ... | Ingest: Event Collector, Change Data Capture, Batch Importer | Process: Kafka Streams (3min batches), Kafka Streams + API | Store: Snowflake, MongoDB | Serve: Data modelling (Airflow), Batch Exporter
4. Cool stuff we’ve done
Not everything is infrastructure!
Full Contact - A Kafka Streams App
Full Contact is the brain behind the decisions related to calling Simply Business
customers and prospects. It decides:
● If we need to call someone
● The reason to call someone
● The importance of a call (priority)
● When to make the call (scheduling)
Visualization made with https://zz85.github.io/kafka-streams-viz/
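The production topology is richer than this, but a hypothetical sketch of its shape, with made-up topic names, event names and rules, is a stream of prospect events mapped to call decisions:

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._

// Hypothetical shape of the Full Contact topology: prospect events keyed by
// prospect id come in, call decisions (why/priority/when) go out as JSON.
// Topic names, event names and the rules themselves are all made up.
object FullContactSketch extends App {
  def decide(event: String): String = event match {
    case "quote_abandoned" => """{"reason":"abandoned quote","priority":1,"delayMins":15}"""
    case "policy_expiring" => """{"reason":"renewal","priority":2,"delayMins":60}"""
    case _                 => """{"reason":"follow-up","priority":5,"delayMins":1440}"""
  }

  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "full-contact-sketch")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()
  builder
    .stream[String, String]("prospect-events") // key = prospect id (assumed)
    .mapValues(decide _)
    .to("call-decisions")

  new KafkaStreams(builder.build(), props).start()
}
```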
Visitor graphs analysis
We used GraphFrames to understand customer behaviour. Among other things, we learned:
● Cross-device customer behaviour
● How people refer Simply Business to their friends
● That we have some brokers that buy on behalf of customers
Visualization made with gephi.org
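A hedged GraphFrames sketch of the cross-device part of this analysis, with assumed paths and column names; connected components over identifier co-occurrence edges approximate one component per person:

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

// Stitch visitor identifiers into people via connected components.
// Input paths and schemas ("id"; "src", "dst") are assumptions.
object VisitorGraphs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("visitor-graphs").getOrCreate()

    val vertices = spark.read.parquet("/data/visitor-ids")   // column: id
    val edges    = spark.read.parquet("/data/visitor-links") // columns: src, dst

    // connectedComponents requires a checkpoint directory.
    spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

    val components = GraphFrame(vertices, edges).connectedComponents.run()

    // Each component approximates one person across devices/referrals.
    components.groupBy("component").count()
      .orderBy(org.apache.spark.sql.functions.desc("count"))
      .show(20)
  }
}
```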
Lead scoring
● We developed a lead scoring algorithm using AdaBoost that predicts, from customer behaviour, how likely a prospect is to convert
● This approach notably improved retargeting efficiency
● We are now developing a streaming version using LightGBM to plug into Full Contact and improve call centre efficiency
● We can tune it so that we don’t bother people who we think aren’t interested in buying at all
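For flavour, a stand-in training sketch: Spark ML doesn’t ship AdaBoost, so gradient-boosted trees play its role here, and the feature names, label column and input path are all hypothetical:

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

// Stand-in sketch: the real model used AdaBoost; gradient-boosted trees are
// the closest Spark ML equivalent. All names and the path are made up.
object LeadScoring {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("lead-scoring").getOrCreate()

    val leads = spark.read.parquet("/data/lead-features") // label: converted (0/1)

    val features = new VectorAssembler()
      .setInputCols(Array("pages_viewed", "quotes_started", "days_since_visit"))
      .setOutputCol("features")

    val model = new GBTClassifier()
      .setLabelCol("converted")
      .setFeaturesCol("features")
      .setMaxIter(50)
      .fit(features.transform(leads))

    // "probability" gives the conversion likelihood used for retargeting.
    model.transform(features.transform(leads))
      .select("probability").show(5, truncate = false)
  }
}
```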
5. Lessons learned
Remember, these are our lessons
Distributed FS aren’t for everyone
Distributed FS have a set of properties that, in many cases, aren’t unique or all that useful:
● Immutability: really cool until you need to mutate data
● Distributed: there are many options for distributed storage
● Schema-less data ingestion: you need to know what you are storing, especially if it contains PII
● Files: do you really want to manage files?
● Other quirks: eventual consistency (S3), managing backups (HDFS), ...
Schemas everywhere!
Schemas are key to:
● Enforce data quality across multiple systems, right when it is created
● Allow multiple groups of people to talk and collaborate around data
● Make the data discoverable
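As a small illustration of the first point, a sketch of schema enforcement at creation time, assuming a circe-decoded case class as the schema (the event shape is made up); in a Snowplow pipeline the same role is played by its self-describing JSON Schemas:

```scala
import io.circe.generic.auto._
import io.circe.parser.decode

// Hypothetical event schema: a payload either decodes cleanly or is rejected
// the moment it is produced, long before it can pollute the warehouse.
object SchemaCheck extends App {
  final case class QuoteStarted(eventId: String, visitorId: String,
                                product: String, timestamp: Long)

  val good = decode[QuoteStarted](
    """{"eventId":"e1","visitorId":"v1","product":"shop","timestamp":1514764800000}""")
  println(good) // Right(QuoteStarted(e1,v1,shop,1514764800000))

  val bad = decode[QuoteStarted]("""{"eventId":"e2"}""")
  println(bad.isLeft) // true: missing fields are caught at the source
}
```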
Plan for flexibility and agility
Using the right tools, or our love-hate relationship with SQL:
● It’s great for querying, testing stuff and hacking things together quickly
● Not so good for building complex logic: lots of repetition and difficult to test
Make your architecture loosely coupled so that you can change bits at a time:
● Use Kafka to decouple real-time applications
● Use S3/HDFS/DB to decouple batch applications
6. Peeking into the future
Will probably get it wrong
Size doesn’t matter, so let’s go big
● Setting up and using “big data” tools is getting easier and easier
● Cloud providers and vendors host them for you
● Most tools are fine with small data volumes and scale horizontally
● CPU, storage and network are getting cheaper faster than (our) data needs
● Examples:
○ Spark: from a local notebook to processing petabytes
○ Kafka Streams: useful regardless of volume
Machine learning is commoditized
● Everyone is giving their algorithms away for free: TensorFlow, Keras, MLflow, …
● Cloud providers even provide infrastructure to train and serve models
● Invest in the things that will make a difference:
○ Skills
○ Data
Data and analytics are transactional
● Long gone are the days when data warehousing was done overnight and isolated
from the transactional systems
● Many products require real-time, reliable access to data systems:
○ Visible: Twitter reactions, bank account spending, ...
○ Invisible: marketing warehouses, transportation, recommenders, ...
The best is yet to come
● Data is one of the most effective competitive advantages; everyone will invest in it
● Data will be used to self-optimize pretty much everything that can be optimized
● Data-centric ways of thinking about software engineering:
○ Software changes constantly, but data survives much longer
○ Event-driven architectures and microservices
● Make sure you learn how to teach machines :)
7. References
Learning from the best
References
● The Art of Platform Thinking - ThoughtWorks
● Sharing is Caring: Multi-tenancy in Distributed Data Systems - Jay Kreps
● Machine Learning: The High-Interest Credit Card of Technical Debt - Google
● Ways to think about machine learning - Benedict Evans
Questions?