Building real time data-driven products

Building Real-time
Data-driven Products
Øyvind Løkling & Lars Albertsson
Version 1.3 - 2016.10.12

Øyvind Løkling
Staff Software Engineer
Schibsted Product & Technology
Oslo - Stockholm - London - Barcelona - Krakow
http://jobs.schibsted.com/

Lars Albertsson
Independent Consultant
Mapflat.com

Presentation goals
● Spark your interest in building data driven products.
● Give an overview of components and how these relate.
● Suggest technologies and approaches that can be used in practice.
● Event Collection
● The Unified Log
● Stream Processing
● Serving Results
● Schema
5

Data driven products
• Services and applications
primarily driven by capturing
and making sense of data
• Health trackers
• Recommendations
• Analytics
CC BY © https://500px.com/tommrazek

Data driven products
• Hosted services need to
• Handle large volumes of data
• Cleaning and structuring data
• Serve individual users fast
CC BY © https://500px.com/tommrazek

Big Data, Fast Data, Smart Data
• Accelerating data volumes and
speeds
• Internet of Things
• AB Testing and Experiments
• Personalised products
CC BY © https://500px.com/erfgorini

Big Data, Fast Data, Smart Data
A need to make sense of data and
act on fast
• Faster development cycle
• Data driven organisations
• Data driven products
CC BY © https://500px.com/erfgorini

Time scales - what parts become obsolete?
10
Credits: Confluent
Credits: Netflix

The Log
• Other common names
• Commit log
• Journal
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

Jay Kreps - I <3 Logs
The state machine replication principle:
If two identical, deterministic processes begin in
the same state and get the same inputs in the
same order, they will produce the same output
and end in the same state.

The Unified Log
Simple idea; All of the organizations data,
available in one unified log that is;
• Unified: the one source of truth
• Append only: Data items are immutable
• Ordered: addressable by offset unique per partition
• Fast and scalable: Able to handle 1000´s msg/sec
• Distributed, robust and fault tolerant.

Kafka
• “Kafka is used for
building real-time data
pipelines and streaming
apps. It is horizontally
scalable, fault-tolerant,
wicked fast, and runs in
production in
thousands of
companies.”
But… :-) http://www.confluent.io/kafka-summit-2016-101-ways-to-configure-kafka-badly

A streaming data product’s anatomy
18
Pub / sub
Unified log
Ingress Stream processing Egress
DB
Service
TopicJob
Pipeline
Service
Export
Visualisation
DB

Architectural patterns
The Unified Log and Lambda, Kappa architectures

Lambda Architecture
Example: Recommendation Engine @ Schibsted

Building real time data-driven products

Kappa Architecture
A software architecture pattern… where the canonical data
store is an append-only immutable log. From the log, data is
streamed through a computational system and fed into
auxiliary stores for serving.
A Lambda architecture system with batch processing
removed.
https://www.oreilly.com/ideas/questioning-the-lambda-architecture

Lambda vs Kappa
● Lambda
○ Leverages existing batch infrastructure
○ Cognitive overhead maintaining two approaches in parallel
● Kappa
○ Is real-time processing is inherently approximate, less powerful, and more
lossy than batch processing. True?
○ Simpler model
https://www.oreilly.com/ideas/questioning-the-lambda-architecture

Cold Storage
• Source of truth for replay in case of failure
• Available for ad-hoc batch querying (Spark)
• Wishlist; Fast writes, Reliable, Cheap
• Cloud storage - s3 (with Glacier)
• Traditional - HDFS, SAN
• Consider what read performance do you need for
a) error recovery
b) bootstrapping new deployments

Event collection
33
Cluster storage
HDFS
(NFS, S3, Google CS, C*)
Service
Reliable, simple, write available
Kafka
Event Bus with history
(Secor,
Camus)
Immediate handoff to append-only replicated log.
Once in the log, events eventually arrive in storage.
Unified log
Immutable events, append-only,
source of truth

Event collection - guarantees
34
Unified log
Service
(unimportant)
Events are safe
from here
Replicated
Topics
Non-critical data: Asynchronous fire-and-forget handoff
Critical data: Synchronous, replicated, with acknowledgement
Service
(important)

Event collection
• Source Types
• Firehose api´s
• Mobile apps and websites
• Internet of things / embedded sensors
• Event sourcing from existing systems

Event collection
• Considerations
• Can you control the data flow, ask sender to wait?
• Can clients be expected to have their logic updated?
• Can you afford to loose some data, make tradeoffs and
still solve your task?

Pipeline graph
• Parallelised jobs read and write to Kafka topics
• Jobs are completely decoupled
• Downstream jobs do not impact upstream
• Usually an acyclic graph
CC BY © https://500px.com/thakurdalipsingh
Illustration from Apache Samza Doc - Concepts

Stream processing components
● Building blocks
○ Aggregate
■ Calculate time windows
■ Aggregate state (database/in memory)
○ Filter
■ Slim down stream
■ Privacy, Security concerns
○ Join
■ Enrich by joining with datasets (geoip)
○ Transform
■ Bring data into same “shape”, schema

Stream Processing Platforms
• Spark Streaming
• Ideal if you are already using Spark, same model
• Bridges gap between data science / data
engineers, batch and stream
• Kafka Stream
• Library - New, positions itself as a lightweight
alternative
• Tightly coupled to on Kafka
• Others
○ Storm, Flink, Samza, Google Dataflow, AWS
Lambda
http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/

Schemas
• You always have a schema
• Schema on write
• Requires upfront schema design before data can be
received
• Synchronised deployment of whole pipeline
• Schema on read
• Allows data to be captured -as is-.
• Suitable for “Data lake”
• Often requires cleaning and transform bring datasets into
consistency downstream.

Schema on read or write?
43
DB
DB
DB
Service
Service
Export
Business
intelligenceChange agility important here
Production stability important here

Schemas
• Options In streaming applications
• Schema bundled with every record
• Schema registry + id in record
• Schema formats
• JSON Schema
• AVRO

Evolving Schemas
• Declare schema version even if no
guarantee. Captures intention of
source.
• Be prepared for bad and
non-validating data.
• Decide on strategy for bringing schema
versions in alignment.
• Maintain upgrade path through
transforms.
• What are the needs of the consumer.
• Data exploration vs Stable services.

Serving Results
● As streams
○ Internal Consumer
○ External Consumer bridges
■ ex. REST post to external ingest endpoint
● As Views
○ Serving index, NoSQL
○ SQL / cubes for BI

Reactive Streams
• [...] an initiative to provide a standard for asynchronous stream processing with
non-blocking back pressure. [...] aimed at runtime environments (JVM and
JavaScript) as well as network protocols.
• The scope [...] is to find a minimal set of interfaces, methods and protocols that will
describe the necessary operations and entities to achieve this goal.
• “Glue” between libraries.
Reactive Kafka ->
Akka Stream ->
RxJava

https://imgflip.com/memetemplate/17759370/dog-meditation-funny

Schemas
• You always have a schema
• Even if you are “Schemaless”
• Build tooling and workflows for handling schema changes

Building real time data-driven products

More Related Content

What's hot (20)

Similar to Building real time data-driven products (20)

More from Lars Albertsson (20)

Recently uploaded (20)

Building real time data-driven products