Simply Business’
Data Platform
By Dani Solà
Table of contents
1. Introductions
2. Some context
3. Data platform evolution
4. Cool stuff we’ve done
5. Lessons learned
6. Peeking into the future
7. References
1. Introductions
Nice to meet you
Hello! I’m Dani :)
This is Simply Business
● Largest UK business insurance provider
● Over 450,000 policyholders
● Using BML, tech and data to disrupt the business insurance market
● Acquired in 2016 (£120M) and again by Travelers in 2017 (£402M)
● #1 best company to work for in 2015 and 2016, among other awards
● Certified B Corporation since 2017
2. Context, context, co...!
Is everything
Mission:
To enable Simply Business to create value through data
Data Environment - The 5Vs
● ⏬ Low volume: ~1M events/day
● High variety: nearly 100 event types and growing
● High velocity: sub-second for apps that need it
● ⏫ High veracity: using strong schemas for most data points
● ⏫ High value: as a data-driven company, all departments use data on a daily basis
Data and Analytics team values
● Simplicity: simple is easier to maintain and understand (it’s hard!)
● Adaptability: data tools and techniques change very fast, don’t fight it
● Empowerment and self-serve: we provide a platform that makes the easy things easy
● Pioneering: we push the boundaries of what’s possible with data
Data Platform Capabilities
● KPIs and MI: obviously
● Product Analytics: understand how our products perform
● Customer Analytics: understand how our customers behave
● Experimentation Tools: to test all our assumptions
● Data Integration: bringing all our data in one place
● Customer Comms: it’s very data intensive
● Machine Learning: because understanding the present is not enough!
Analytics usage
3. Data platform evolution
“Change is the only constant” - A data engineer
The batch days: 2014-2015
Team: 2-3 data platform engineers
Tech:
● Vanilla Snowplow Analytics for the event pipeline, which ran on EMR
● Homegrown Change Data Capture (CDC) pipeline to flatten MongoDB collections
● Looker for web and product analytics, SQL Server for top-level KPIs
[Architecture diagram] Sources: Website, MongoDB, Adwords, Email, ... | Ingest: Event Collector, Change Data Capture, Batch Importer | Process: Scalding on EMR (hourly job) | Store: S3, Redshift | Serve: Data modelling (cron jobs), Batch Exporter
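To make the batch layer concrete, here is a minimal sketch of the kind of hourly Scalding job we mean, assuming tab-separated Snowplow enriched events as input; the column positions and job structure are illustrative, not the production code:

```scala
import com.twitter.scalding._

// Hypothetical hourly job: read Snowplow enriched events (TSV) from S3,
// keep the columns the warehouse needs, and write them back out for the
// Redshift load step. Column positions are assumptions.
class FlattenEventsJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .map(_.split("\t", -1))
    .collect { case cols if cols.length >= 3 =>
      (cols(0), cols(1), cols(2)) // event_id, event_type, collector_tstamp
    }
    .write(TypedTsv[(String, String, String)](args("output")))
}
```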
NRT first steps: 2016-2017
Team: 3-4 data platform engineers
Changes:
● We added an NRT pipeline to expose event data back to transactional apps
● We used Kinesis as the message bus; we didn’t want to manage anything
● The data is stored in MongoDB for real-time access
[Architecture diagram] Sources: Website, MongoDB, Adwords, Email, ... | Ingest: Event Collector, Change Data Capture, Batch Importer | Process: Scalding on EMR (hourly job), Spark Streaming (4s batches) | Store: S3, Redshift, MongoDB | Serve: Data modelling (cron jobs), Batch Exporter, API
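A hedged sketch of what a 4-second Spark Streaming consumer on Kinesis looks like; the stream, app and region names are made up, and the MongoDB write is replaced by a placeholder sink:

```scala
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

object NrtEventConsumer {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("nrt-events"), Seconds(4))

    // Stream, application and region names are hypothetical.
    val records = KinesisUtils.createStream(
      ssc, "nrt-events-app", "event-stream",
      "https://kinesis.eu-west-1.amazonaws.com", "eu-west-1",
      InitialPositionInStream.LATEST, Seconds(4), StorageLevel.MEMORY_AND_DISK_2)

    records
      .map(bytes => new String(bytes, "UTF-8")) // raw event JSON
      .foreachRDD { rdd =>
        // In the real pipeline each micro-batch is upserted into MongoDB;
        // a stand-in sink keeps the sketch self-contained.
        rdd.foreachPartition(_.foreach(println))
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Kinesis checkpointing is DynamoDB-backed, which is what lets a consumer like this run without us managing any queueing infrastructure ourselves.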
Current pipeline: 2017-2018
Team: 4-5 data platform engineers
Tech:
● We have gone NRT by default: there’s no batch layer
● We’ve introduced Airflow for batch job orchestration
● We’ve got rid of S3 to comply with GDPR without having to fiddle with files
[Architecture diagram] Sources: Website, MongoDB, Adwords, Email, ... | Ingest: Event Collector, Change Data Capture, Batch Importer | Process: Spark Streaming (3min batches), Spark Streaming (4s batches) | Store: Redshift, MongoDB | Serve: Data modelling (Airflow), Batch Exporter, API
Potential changes in the near future
Migrate from Spark Streaming to Kafka Streams:
● Streaming-native API, much more powerful than Spark’s
● No need for external storage for stateful operations
● No need for a YARN or Mesos cluster: any JVM app can have a streaming component
● Can expose APIs to other services!
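As an illustration of the stateful point above, a minimal Kafka Streams sketch (topic names are assumptions) that keeps a running count per event type in a local, changelog-backed store instead of an external database:

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._

// Counts events per type; state lives in a local RocksDB store backed by a
// changelog topic, so no external storage is needed. Topic names are made up.
object EventTypeCounts extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-type-counts")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()
  builder
    .stream[String, String]("events") // key assumed to be the event type
    .groupByKey
    .count()
    .toStream
    .to("event-type-counts")

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```

The same store can then be queried over HTTP via interactive queries, which is what makes the “expose APIs to other services” bullet possible.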
Potential changes in the near future
Migrate from Redshift to Snowflake:
● Decoupling storage from processing
● Handles semi-structured data natively
● Lets us isolate workloads much better
● Near-instant scaling, including stopping it when no one is using the cluster
● Infinite storage!
Potential changes in the near future
Migrate from EMR to Databricks for Spark batch jobs:
● Would allow us to have a dedicated cluster per app
● Easier to upgrade to newer Spark versions
● No cluster maintenance required, they’re transient
[Architecture diagram] Sources: Website, MongoDB, Adwords, Email, ... | Ingest: Event Collector, Change Data Capture, Batch Importer | Process: Kafka Streams (3min batches), Kafka Streams + API | Store: Snowflake, MongoDB | Serve: Data modelling (Airflow), Batch Exporter
4. Cool stuff we’ve done
Not everything is infrastructure!
Full Contact - A Kafka Streams App
Full Contact is the brain behind the decisions related to calling Simply Business
customers and prospects. It decides:
● If we need to call someone
● The reason to call someone
● The importance of a call (priority)
● When to make the call (scheduling)
Visualization made with https://zz85.github.io/kafka-streams-viz/
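The production topology is richer than this, but a hypothetical sketch of its shape, with made-up topic names, event names and rules, is a stream of prospect events mapped to call decisions:

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._

// Hypothetical shape of the Full Contact topology: prospect events keyed by
// prospect id come in, call decisions (why/priority/when) go out as JSON.
// Topic names, event names and the rules themselves are all made up.
object FullContactSketch extends App {
  def decide(event: String): String = event match {
    case "quote_abandoned" => """{"reason":"abandoned quote","priority":1,"delayMins":15}"""
    case "policy_expiring" => """{"reason":"renewal","priority":2,"delayMins":60}"""
    case _                 => """{"reason":"follow-up","priority":5,"delayMins":1440}"""
  }

  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "full-contact-sketch")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()
  builder
    .stream[String, String]("prospect-events") // key = prospect id (assumed)
    .mapValues(decide _)
    .to("call-decisions")

  new KafkaStreams(builder.build(), props).start()
}
```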
Visitor graphs analysis
We used GraphFrames to understand customer behaviour. Among other things, we learned:
● Cross-device customer behaviour
● How people refer Simply Business to their friends
● That we have some brokers that buy on behalf of customers
Visualization made with gephi.org
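A hedged GraphFrames sketch of the cross-device part of this analysis, with assumed paths and column names; connected components over identifier co-occurrence edges approximate one component per person:

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

// Stitch visitor identifiers into people via connected components.
// Input paths and schemas ("id"; "src", "dst") are assumptions.
object VisitorGraphs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("visitor-graphs").getOrCreate()

    val vertices = spark.read.parquet("/data/visitor-ids")   // column: id
    val edges    = spark.read.parquet("/data/visitor-links") // columns: src, dst

    // connectedComponents requires a checkpoint directory.
    spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

    val components = GraphFrame(vertices, edges).connectedComponents.run()

    // Each component approximates one person across devices/referrals.
    components.groupBy("component").count()
      .orderBy(org.apache.spark.sql.functions.desc("count"))
      .show(20)
  }
}
```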
Lead scoring
● We developed a lead scoring algorithm using AdaBoost that predicts, from customer behaviour, how likely a prospect is to convert
● This approach notably improved retargeting efficiency
● We are now developing a streaming version using LightGBM to plug into Full Contact and improve call centre efficiency
● We can tune it so that we don’t bother people who we think aren’t interested in buying at all
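For flavour, a stand-in training sketch: Spark ML doesn’t ship AdaBoost, so gradient-boosted trees play its role here, and the feature names, label column and input path are all hypothetical:

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

// Stand-in sketch: the real model used AdaBoost; gradient-boosted trees are
// the closest Spark ML equivalent. All names and the path are made up.
object LeadScoring {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("lead-scoring").getOrCreate()

    val leads = spark.read.parquet("/data/lead-features") // label: converted (0/1)

    val features = new VectorAssembler()
      .setInputCols(Array("pages_viewed", "quotes_started", "days_since_visit"))
      .setOutputCol("features")

    val model = new GBTClassifier()
      .setLabelCol("converted")
      .setFeaturesCol("features")
      .setMaxIter(50)
      .fit(features.transform(leads))

    // "probability" gives the conversion likelihood used for retargeting.
    model.transform(features.transform(leads))
      .select("probability").show(5, truncate = false)
  }
}
```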
5. Lessons learned
Remember, these are our lessons
Distributed FS aren’t for everyone
Distributed FS have a set of properties that, in many cases, aren’t unique or all that useful:
● Immutability: really cool until you need to mutate data
● Distributed: there are many options for distributed storage
● Schema-less data ingestion: you need to know what you are storing, especially if it contains PII
● Files: do you really want to manage files?
● Other quirks: eventual consistency (S3), managing backups (HDFS), ...
Schemas everywhere!
Schemas are key to:
● Enforce data quality across multiple systems, right when it is created
● Allow multiple groups of people to talk and collaborate around data
● Make the data discoverable
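As a small illustration of the first point, a sketch of schema enforcement at creation time, assuming a circe-decoded case class as the schema (the event shape is made up); in a Snowplow pipeline the same role is played by its self-describing JSON Schemas:

```scala
import io.circe.generic.auto._
import io.circe.parser.decode

// Hypothetical event schema: a payload either decodes cleanly or is rejected
// the moment it is produced, long before it can pollute the warehouse.
object SchemaCheck extends App {
  final case class QuoteStarted(eventId: String, visitorId: String,
                                product: String, timestamp: Long)

  val good = decode[QuoteStarted](
    """{"eventId":"e1","visitorId":"v1","product":"shop","timestamp":1514764800000}""")
  println(good) // Right(QuoteStarted(e1,v1,shop,1514764800000))

  val bad = decode[QuoteStarted]("""{"eventId":"e2"}""")
  println(bad.isLeft) // true: missing fields are caught at the source
}
```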
Plan for flexibility and agility
Using the right tools, or our love-hate relationship with SQL:
● It’s great for querying, testing stuff and hacking things together quickly
● Not so good for building complex logic: lots of repetition and difficult to test
Make your architecture loosely coupled so that you can change bits at a time:
● Use Kafka to decouple real-time applications
● Use S3/HDFS/DB to decouple batch applications
6. Peeking into the future
Will probably get it wrong
Size doesn’t matter, so let’s go big
● Setting up and using “big data” tools is getting easier and easier
● Cloud providers and vendors host them for you
● Most tools are fine with small data volumes and scale horizontally
● CPU, storage and network are getting cheaper faster than (our) data needs
● Examples:
○ Spark: from a local notebook to processing petabytes
○ Kafka Streams: useful regardless of volume
Machine learning is commoditized
● Everyone is giving their algorithms away for free: TensorFlow, Keras, MLflow, …
● Cloud providers even provide infrastructure to train and serve models
● Invest in the things that will make a difference:
○ Skills
○ Data
Data and analytics are transactional
● Long gone are the days when data warehousing was done overnight and isolated
from the transactional systems
● Many products require real-time, reliable access to data systems:
○ Visible: Twitter reactions, bank account spending, ...
○ Invisible: marketing warehouses, transportation, recommenders, ...
The best is yet to come
● Data is one of the most effective competitive advantages; everyone will invest in it
● Data will be used to self-optimize pretty much everything that can be optimized
● Data-centric ways of thinking about software engineering:
○ Software changes constantly, but data survives much longer
○ Event-driven architectures and microservices
● Make sure you learn how to teach machines :)
7. References
Learning from the best
References
● The Art of Platform Thinking - ThoughtWorks
● Sharing is Caring: Multi-tenancy in Distributed Data Systems - Jay Kreps
● Machine Learning: The High-Interest Credit Card of Technical Debt - Google
● Ways to think about machine learning - Benedict Evans
Questions?