2013 © Trivadis
BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN

WELCOME
Big Data and Fast Data
big and fast combined – is it possible?
Guido Schmutz and Albert Blarer
24 April 2013
Guido Schmutz
•  Working for Trivadis for more than 16 years
•  Oracle ACE Director for Fusion Middleware and SOA
•  Co-Author of different books
•  Consultant, Trainer, Software Architect for Java, Oracle, SOA and EDA
•  Member of Trivadis Architecture Board
•  Technology Manager @ Trivadis
•  More than 25 years of software development experience
•  Contact: guido.schmutz@trivadis.com
•  Blog: http://guidoschmutz.wordpress.com
•  Twitter: gschmutz
Trivadis – the company
With more than 600 IT and subject-matter experts on site with you.
•  11 Trivadis offices with more than 600 employees
•  200 service level agreements
•  More than 4,000 training participants
•  Research and development budget: CHF 5.0 million / EUR 4 million
•  Financially independent and sustainably profitable
•  Experience from more than 1,900 projects per year with more than 800 customers
As of 12/2012
Offices: Hamburg, Düsseldorf, Frankfurt, Freiburg, München, Wien, Basel, Zürich, Bern, Lausanne, Stuttgart
Credits
Nathan Marz
Author of “Big Data – Principles and best practices of scalable realtime data systems” – Manning Press
Previously worked at BackType and Twitter
Creator of
•  Storm
•  Cascalog
•  ElephantDB
Agenda
1.  Big Data, what is it?
2.  Motivation
3.  The Lambda Architecture
4.  Implementing the Lambda Architecture
5.  Summary
Big Data Definition (Gartner et al)
Characteristics of Big Data: its Volume, Velocity and Variety in combination
•  Tera-, Peta-, Exa-, Zetta-, Yottabytes and constantly growing
•  “Traditional” computing in RDBMS is not scalable enough. We search for “linear scalability”
•  “Only … structured information is not enough” – “95% of produced data is unstructured”
+ Veracity (IBM) - information uncertainty
+ Time to action? – Big Data + Event Processing = Fast Data
Big Data Emerging Technologies
§  MapReduce (e.g. Apache Hadoop)
§  Event Stream Processing & CEP (e.g. Storm or Esper)
§  New messaging systems (e.g. Apache Kafka)
§  Integration tools (e.g. Spring or Camus)
§  New database paradigms (e.g. NoSQL or NewSQL)
§  Data mining tools (e.g. Apache Mahout)
§  Data extraction and detection tools (e.g. Apache Tika)
Volume Development
[Chart: global data volume in exabytes (0 to 8000) and aggregate uncertainty in % (0 to 100) by year, 2005 to 2015]
Data sources shown:
•  Sensors: “internet of things”
•  Social Media: video, audio, text
•  VoIP: Skype, MSN, ICQ, ...
•  Enterprise Data: data dictionary, ERD, ...
Velocity
§  Velocity requirement examples:
§  Recommendation Engine
§  Predictive Analytics
§  Marketing Campaign Analysis
§  Customer Retention and Churn Analysis
§  Social Graph Analysis
§  Capital Markets Analysis
§  Risk Management
§  Rogue Trading
§  Fraud Detection
§  Retail Banking
§  Network Monitoring
§  Research and Development
What is a data system?
•  A system that manages the storage and querying of data with a
lifetime measured in years encompassing every version of the
application to ever exist, every hardware failure and every human
mistake ever made.
•  A data system answers questions based on information that was
acquired in the past
•  Not all bits of information are equal
•  Some information is derived from other information
Desired Properties of a (Big) Data System
Robust and fault-tolerant
Low latency reads and updates
Scalable
General
Extensible
Allows ad hoc queries
Minimal maintenance
Debuggable
Typical problem in today’s architecture/systems
Bugs will be deployed to production over the lifetime of a data system
Operational mistakes will be made
Humans are part of the overall system
•  Just like hard disks, CPUs, memory, software
•  design for human error like you design for any other fault
Examples of human error
•  Deploy a bug that increments counters by two instead of by one
•  Accidentally delete data from database
•  Accidental DoS on an important internal service
Worst two consequences: data loss or data corruption
As long as an error doesn’t lose or corrupt good data, you can fix what
went wrong
Lack of Human Fault Tolerance
Mutability
The U and D in CRUD
A mutable system updates the current state of the world
Mutable systems inherently lack human fault-tolerance
Easy to corrupt or lose data
Capturing change traditionally
Before the update:
Name    City
Guido   Berne
Albert  Zurich
After the update:
Name    City
Guido   Basel
Albert  Zurich
Immutability
An immutable system captures historical records of events
Each event happens at a particular time and is always true
Capturing change by storing events
Before the new event:
Name    City    Timestamp
Guido   Berne   1.8.1999
Albert  Zurich  10.5.1988
After the new event is appended:
Name    City    Timestamp
Guido   Berne   1.8.1999
Albert  Zurich  10.5.1988
Guido   Basel   1.4.2013
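Capturing change by storing events can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the names `events`, `record_move` and `current_city` are hypothetical, and the dates are read as day.month.year. The point it shows is that a move is recorded by appending a new fact, while the current city is derived from the log.

```python
from datetime import date

# Append-only event log: each record is a fact that was true at its timestamp.
events = [
    ("Guido", "Berne", date(1999, 8, 1)),
    ("Albert", "Zurich", date(1988, 5, 10)),
]

def record_move(log, name, city, when):
    """Capture a change by appending a new fact; existing records are never modified."""
    log.append((name, city, when))

def current_city(log, name):
    """Derive the current city as the most recent fact recorded for the person."""
    facts = [(when, city) for (n, city, when) in log if n == name]
    return max(facts)[1] if facts else None

# Guido moves to Basel: the old record stays in the log.
record_move(events, "Guido", "Basel", date(2013, 4, 1))
```

If a buggy writer appends a wrong record, the earlier facts are still there, so the derived answer can be corrected by ignoring or compensating for the bad record.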
Immutability
Immutability greatly restricts the range of errors that can cause data loss or
data corruption
Vastly more human fault-tolerant
Much easier to reason about systems based on immutability
Conclusion: Your source of truth should always be immutable
What about traditional/today’s architectures? Source of Truth is mutable!
Rather than build systems like this ….
[Diagram: applications (Mobile, Web, RIA, Rich Client) query a mutable database (RDBMS, NoSQL, NewSQL), which is the source of truth]
A different kind of architecture with immutable source of truth
… why not build them like this?
[Diagram: applications (Mobile, Web, RIA, Rich Client) query views on the data; the views are derived from immutable data, which is the source of truth. Storage technologies shown: HDFS, NoSQL, NewSQL, RDBMS]
How to create the views on the Immutable data?
On the fly ?
Materialized, i.e. Pre-computed ?
[Diagram: on the fly, the query runs against a view computed directly from the immutable data; materialized, the query runs against pre-computed views]
Data = the most raw information
Data is information which is not derived from anywhere else
•  The most raw form of information
•  Data is the special information from which everything else is derived
Questions on data can be answered by running functions that take data
as input
The most general purpose data system can answer questions by running
functions that take the entire dataset as input
query = function (all data)
The lambda architecture provides a general purpose approach for
implementing arbitrary functions on arbitrary datasets
Data = the most raw information
Favorite Product List Changes (raw information => data)
1.2.13   Add     iPAD 64GB
10.3.13  Add     Sony RX-100
11.3.13  Add     Canon GX-10
11.3.13  Remove  Sony RX-100
12.3.13  Add     Nikon S-100
14.4.13  Add     BoseQC-15
15.4.13  Add     MacBook Pro 15
20.4.13  Remove  Canon GX-10
Current Favorite Product List (information => derived)
iPAD 64GB
Nikon S-100
BoseQC-15
MacBook Pro 15
Current Product Count (information => derived): 4
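The favorite-list example is an instance of query = function(all data), and a rough Python sketch makes the derivation concrete. The function names are hypothetical, the list is assumed to be in chronological order, and "Canon GX10" and "Canon GX-10" from the slide are assumed to be the same product.

```python
# Raw data: the complete list of favorite-list changes, oldest first.
changes = [
    ("1.2.13", "add", "iPAD 64GB"),
    ("10.3.13", "add", "Sony RX-100"),
    ("11.3.13", "add", "Canon GX-10"),
    ("11.3.13", "remove", "Sony RX-100"),
    ("12.3.13", "add", "Nikon S-100"),
    ("14.4.13", "add", "BoseQC-15"),
    ("15.4.13", "add", "MacBook Pro 15"),
    ("20.4.13", "remove", "Canon GX-10"),
]

def current_favorites(all_changes):
    """Derived information: replay every change to get the current list."""
    favorites = {}  # dict keeps insertion order, used here as an ordered set
    for _when, action, product in all_changes:
        if action == "add":
            favorites[product] = True
        elif action == "remove":
            favorites.pop(product, None)
    return list(favorites)

def favorite_count(all_changes):
    """Derived information: number of products currently on the list."""
    return len(current_favorites(all_changes))
```

Both answers are recomputed from the raw changes, so neither the list nor the count has to be stored as a source of truth.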
Big Data and Batch Processing
[Diagram: Incoming Data -> Immutable data -> Batch View -> Query]
How to compute the batch views?
How to compute queries from the views?
Big Data and Batch Processing
[Timeline: data up to the last full batch period is fully batch-processed; data that arrived during the time for the batch job, up to now, is non-processed]
§  Using only batch processing always leaves you with a portion of non-processed data.
Adapted from Ted Dunning (March 2012):
http://www.youtube.com/watch?v=7PcmbI5aC20
But we are not done yet …
Adding Real-Time Processing
[Diagram: Incoming Data feeds both the immutable data, from which Batch Views are computed, and a Data Stream, from which Realtime Views are computed; the Query reads from the views]
How to compute real-time views?
How to compute queries from the views?
Adding Real-Time Processing
Immutable data: Favorite Product List Changes
1.2.13   Add     iPAD 64GB
10.3.13  Add     Sony RX-100
11.3.13  Add     Canon GX-10
11.3.13  Remove  Sony RX-100
12.3.13  Add     Nikon S-100
14.4.13  Add     BoseQC-15
15.4.13  Add     MacBook Pro 15
20.4.13  Remove  Canon GX-10
Data Stream: Stream of Favorite Product List Changes
Now      Add     Canon Scanner
Views (computed from the immutable data and the data stream, read by the Query)
Current Favorite Product List: iPAD 64GB, Nikon S-100, BoseQC-15, MacBook Pro 15, Canon Scanner
Current Product Count: 5
Big Data and Real Time Processing
[Timeline: batch processing (e.g. Hadoop) worked fine for the fully processed data up to the last full batch period; real time processing works for the data that arrived during the time for the batch job, up to now; the end user gets a blended view]
Adapted from Ted Dunning (March 2012):
http://www.youtube.com/watch?v=7PcmbI5aC20
Lambda Architecture
[Diagram: Incoming Data, Immutable data, Batch View, Data Stream, Realtime View and Query arranged across the Batch Layer, Speed Layer and Serving Layer, with callouts A to G]
Lambda Architecture
A.  All data is sent to both the batch and speed layer
B.  Master data set is an immutable, append-only set of data
C.  Batch layer pre-computes query functions from scratch, result is called Batch
Views. Batch layer constantly re-computes the batch views.
D.  Batch views are indexed and stored in a scalable database to get particular
values very quickly. Swaps in new batch views when they are available
E.  Speed layer compensates for the high latency of updates to the Batch Views in
the Serving layer.
F.  Uses fast incremental algorithms and read/write databases to produce real-time views
G.  Queries are resolved by getting results from both batch and real-time views
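Points A to G can be traced in a toy, single-process Python sketch. Everything here is an illustrative assumption: the class `LambdaCounter`, the choice of "count events per key" as the query, and in-memory containers standing in for HDFS, a serving database and a stream processor.

```python
from collections import Counter

class LambdaCounter:
    """Toy lambda architecture that answers 'how many events for this key?'."""

    def __init__(self):
        self.master = []                 # B: immutable, append-only master dataset
        self.batch_view = Counter()      # D: precomputed view in the serving layer
        self.realtime_view = Counter()   # F: incrementally updated real-time view

    def ingest(self, key):
        # A: every incoming record goes to both the batch and the speed layer
        self.master.append(key)
        self.realtime_view[key] += 1

    def run_batch(self):
        # C: recompute the batch view from scratch over all data
        snapshot = list(self.master)
        new_view = Counter(snapshot)
        # D: swap in the new batch view once it is available
        self.batch_view = new_view
        # E: the speed layer only has to cover records the batch view lacks
        self.realtime_view = Counter(self.master[len(snapshot):])

    def query(self, key):
        # G: resolve the query from both the batch and the real-time view
        return self.batch_view[key] + self.realtime_view[key]
```

In this sketch the batch run is instantaneous, so the real-time view is empty right after `run_batch`; in a real system records keep arriving while the batch job runs, which is the gap the speed layer covers.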
Layered Architecture
Batch Layer
•  Stores the immutable, constantly growing dataset
•  Computes arbitrary views from this dataset using BigData technologies (can take hours)
•  Can always be recreated
Serving Layer
•  Responsible for indexing and exposing the pre-computed batch views so that they can be queried
•  Exposes the incremented real-time views
•  Merges the batch and the real-time views into a consistent result
Speed Layer
•  Computes the views from the constant stream of data it receives
•  Needed to compensate for the high latency of the batch layer
•  Incremental model and views are transient
Lambda Architecture
[Diagram: Incoming Data enters the Batch Layer (All data, Precompute Views, Precomputed information; batch recompute) and the Speed Layer (Process stream, Incremented information; realtime increment). The Serving Layer holds the batch views and real time views, which a query merges.]
Source: Marz, N. & Warren, J. (2013) Big Data. Manning.
Lambda Architecture
one possible product/framework mapping
[Diagram: the same Lambda Architecture layers (Batch Layer, Speed Layer, Serving Layer) with products and frameworks mapped onto them]
Implementing Batch Layer
Immutable Data
•  Append only
•  Normalized
•  Stores master copy of all data
Pre-computed information
•  Function that takes all data as input
query = function(all-data)
•  High Latency, Batch processing
•  Unrestrained computation
•  Horizontally scalable
[Diagram: in the Batch Layer, Batch Views are computed from the immutable data (all data) by batch recompute and passed to the Serving Layer]
Apache Hadoop HDFS
HDFS = the Hadoop Distributed File System
A distributed file storage system
Redundant storage
Designed to reliably store data using commodity hardware
Designed to expect hardware failures
Intended for large files
Designed for batch inserts
Apache Hadoop Map Reduce
§  Hadoop Map Reduce is an open source implementation of the
MapReduce framework.
§  Map Reduce is
§  a programming model, introduced by Google, for processing large data sets,
in a distributed environment
§  De-facto standard to compute huge amounts of data
§  An execution framework for organizing and performing such computations
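The programming model can be illustrated without Hadoop. The following is a single-process Python sketch of map, shuffle and reduce for a word count; the function names are hypothetical and this is not the Hadoop API, which distributes these phases across worker nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return (key, sum(values))

def map_reduce(documents):
    """Run the three phases in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each map call sees one record and each reduce call sees one key, both phases can be spread over many machines, which is what makes the model scale.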
[Diagram: problem data -> MAP -> master node and worker nodes 1 to 3 -> REDUCE -> solution data]
Hadoop MapReduce Flow
Source: Bill Graham, Twitter Inc.
Hadoop MapReduce
Cascading
Application framework for Java developers to simply develop robust Data
Analytics and Data Management applications on Apache Hadoop
Adds an abstraction layer over the Hadoop API
Core concepts of the Cascading API:
•  Pipe: a series of processing steps (parsing, looping, filtering, etc) defining the
data processing to be done
•  Flow: association of a pipe (or set of pipes) with a data-source and data-sink
Cascading
Apache Pig
Apache Pig is a platform for analyzing large data sets
Key Properties
•  Ease of programming
•  Optimization opportunities
•  Extensibility
Implementing Serving Layer for Batch Views
Need a database that
•  Is batch-writable
•  Adding new information is atomic
•  Has fast random reads
•  Is scalable
•  Is highly available
•  Can be optimized for Storage
•  Information can be de-normalized
•  But no Random writes required!
•  Can be a simple database
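These requirements describe a store that is written in bulk and read at random. A minimal in-memory Python sketch, assuming a hypothetical `BatchViewStore` class and a plain dict as the index, could look like this:

```python
class BatchViewStore:
    """Read-only key/value store whose content is replaced in one bulk step."""

    def __init__(self):
        self._view = {}

    def load(self, new_view):
        """Batch write: build the new view on the side, then swap it in.

        The swap is a single reference assignment, so a reader sees either
        the old view or the new one, never a half-written mix.
        """
        fresh = dict(new_view)
        self._view = fresh

    def get(self, key, default=None):
        """Fast random read by key."""
        return self._view.get(key, default)

    # Deliberately no put/update/delete: random writes are not required.
```

Leaving out random writes is what allows the database to stay simple: there is no need for write-ahead logging, compaction or conflict handling in this layer.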
[Diagram: Batch Views computed from the immutable data in the Batch Layer are loaded into the Serving Layer as batch views]
SploutSQL
Full SQL => unlike NoSQL
For BigData => unlike RDBMS
Web latency & throughput => unlike Apache Hive, Apache Drill
Why does it scale
•  Data is partitioned
•  Partitions are distributed across nodes
•  Adding more nodes increases capacity
•  Generation does not impact serving
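The effect of partitioning can be shown with a generic hash-partitioning sketch in Python. This is an assumption for illustration only: it is not SploutSQL's actual partitioning scheme or API, and the function names are hypothetical.

```python
import zlib

def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def build_partitions(rows, num_partitions):
    """Generation step: split a key/value view into independent partitions."""
    partitions = [dict() for _ in range(num_partitions)]
    for key, value in rows.items():
        partitions[partition_for(key, num_partitions)][key] = value
    return partitions

def lookup(partitions, key):
    """Serving step: a read touches exactly one partition."""
    return partitions[partition_for(key, len(partitions))].get(key)
```

Since every lookup touches a single partition, partitions can live on different nodes and capacity grows with the number of nodes.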
Source: Datasalt.
SploutSQL
Implementing Speed Layer
Stream Processing
Continuous computation
Transactional
Storing a limited window of data
•  Compensating for the last few hours of data
All the complexity is isolated in the speed layer
•  If anything goes wrong, it’s autocorrected by the next batch run
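The "limited window" idea can be sketched as a real-time view that counts incrementally per hour and drops hours once the batch layer has covered them. The class `RealtimeView`, the hourly buckets and the counting query are assumptions made for this illustration.

```python
from collections import Counter, defaultdict

class RealtimeView:
    """Incremental counts for the recent hours the batch views do not cover yet."""

    def __init__(self):
        self._buckets = defaultdict(Counter)  # hour -> Counter of keys

    def update(self, hour, key):
        """Incremental update: touch only the counter of the affected hour."""
        self._buckets[hour][key] += 1

    def expire(self, batch_covered_through_hour):
        """Discard hours that the latest batch run has already processed."""
        covered = [h for h in self._buckets if h <= batch_covered_through_hour]
        for hour in covered:
            del self._buckets[hour]

    def count(self, key):
        """Count of a key over the window that is still held."""
        return sum(counter[key] for counter in self._buckets.values())
```

Because expired buckets are replaced by freshly recomputed batch results, an error in the incremental counts only lives until the next batch run.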
[Diagram: in the Speed Layer, Realtime Views are derived from the Data Stream by realtime increment and exposed in the Serving Layer]
Apache Kafka
A high throughput distributed messaging system
Originated at LinkedIn
Sequential disk access
Twitter Storm – the “real-time Hadoop”
§  Storm is a distributed and fault-tolerant real-time computing platform
§  Data flow model: data flows through a network of transformation entities
§  Key concepts
§  Tuple: ordered list of elements
§  Streams: unbounded sequence of tuples
§  Spouts: Source of streams
§  Bolts: Process tuples and create new streams
§  Topologies: directed graph of Spouts and Bolts
§  Use Cases
§  Stream Processing
§  Continuous Computation
§  Distributed RPC
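The key concepts map naturally onto Python generators. The sketch below is only an analogy, not the Storm API (Storm topologies are defined through its own builder classes and run distributed); the spout, bolts and sample sentences are hypothetical.

```python
def sentence_spout():
    """Spout: source of a stream. A short list stands in for an unbounded stream."""
    for sentence in ["big data", "fast data"]:
        yield (sentence,)  # a tuple: ordered list of elements

def split_bolt(stream):
    """Bolt: consumes tuples and emits a new stream, one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    """Bolt: keeps running counts and emits the updated count per word."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])

def run_topology():
    """Topology: the directed graph spout -> split bolt -> count bolt."""
    return list(count_bolt(split_bolt(sentence_spout())))
```

Each tuple flows through the graph as soon as it is emitted, so counts are updated continuously instead of once per batch job.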
[Diagram: a data source feeds problem data into a SPOUT; BOLTs act as “MAP”, “REDUCE” and “PERSIST” steps and produce the solution data, spanning the Speed Layer and Serving Layer]
Twitter Storm
Twitter Trident
Higher level abstraction over Storm
Trident State
Grouped Stream
Functions, Filters
Aggregators
Query
Similar to Pig and Cascading
Twitter Trident
Implementing Serving Layer for Real-Time Views
Incremental updates are made available as real-time views
Requires a database that supports random reads and random writes
•  Relational, NoSQL or NewSQL (in memory) databases can be used
•  Here we are typically not in the BigData range
Results are only needed until the data has made it through the batch layer
Complexity isolation
[Diagram: Realtime Views derived from the Data Stream in the Speed Layer are exposed in the Serving Layer as real time views (incremented information)]
Cassandra
Fully distributed, no single-point-of-failure
Linearly scalable
Fault tolerant
Performant
Durable
Integrated caching
Tunable consistency
Implementing Serving Layer: Merge of Batch and Realtime Views
An interesting feature of Storm /
Trident is the ability to execute
distributed RPC (DRPC) calls in
parallel
This can be used to implement the
merge functionality when a query is
executed
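The merge step can be illustrated in plain Python, with a thread pool standing in for the parallel calls. This is not Storm or Trident DRPC code: the function `merged_count` is hypothetical, and two dicts stand in for the batch view and real-time view databases.

```python
from concurrent.futures import ThreadPoolExecutor

def merged_count(key, batch_view, realtime_view):
    """Query both views in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        batch = pool.submit(batch_view.get, key, 0)
        realtime = pool.submit(realtime_view.get, key, 0)
        # For a count, merging means adding the two partial counts.
        return batch.result() + realtime.result()
```

How the partial results are merged depends on the query: counts can be added, while other views may need a union or a "latest value wins" rule.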
[Diagram: in the Serving Layer, a query merges the batch views and the real time views]
Storm / Trident DRPC
Summary – The lambda architecture
§  The Lambda Architecture
§  Can discard batch views and real-time views and recreate everything from
scratch
§  Mistakes corrected via re-computation
§  Data storage layer optimized independently from query resolution layer
§  Still in a very early …. But a very interesting idea!
-  Today a zoo of technologies is needed => Operations won’t like it
§  Different query language for batch and real time
§  An abstraction over batch and speed layer needed
-  Cascading and Trident are already similar
§  Industry standards needed!
THANK YOU.
Trivadis AG
Guido Schmutz & Albert Blarer
Europa-Strasse 5

CH-8095 Glattbrugg
info@trivadis.com

www.trivadis.com
Big Data and Fast Data - big and fast combined, is it possible?

  • 1. 2013 © Trivadis BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN
 WELCOME Big Data and Fast Data big and fast combined – is it possible? Guido Schmutz und Albert Blarer 24. April 2013 24. April 2013 Big Data und Fast Data 1
  • 2. Guido Schmutz
      •  Working for Trivadis for more than 16 years
      •  Oracle ACE Director for Fusion Middleware and SOA
      •  Co-author of several books
      •  Consultant, trainer and software architect for Java, Oracle, SOA and EDA
      •  Member of the Trivadis Architecture Board
      •  Technology Manager @ Trivadis
      •  More than 25 years of software development experience
      •  Contact: guido.schmutz@trivadis.com
      •  Blog: http://guidoschmutz.wordpress.com
      •  Twitter: gschmutz
  • 4. Trivadis: the company (as of 12/2012)
      •  More than 600 IT and domain experts on site with you
      •  11 Trivadis branch offices with more than 600 employees
      •  200 Service Level Agreements
      •  More than 4,000 training participants
      •  Research and development budget: CHF 5.0 million / EUR 4 million
      •  Financially independent and sustainably profitable
      •  Experience from more than 1,900 projects per year with more than 800 customers
      •  Locations: Hamburg, Düsseldorf, Frankfurt, Freiburg, München, Wien, Basel, Zürich, Bern, Lausanne, Stuttgart
  • 5. Credits: Nathan Marz
      •  Author of "Big Data: Principles and best practices of scalable realtime data systems" (Manning Press)
      •  Formerly worked at BackType and Twitter
      •  Creator of Storm, Cascalog and ElephantDB
  • 6. Agenda: 1. Big Data, what is it? 2. Motivation 3. The Lambda Architecture 4. Implementing the Lambda Architecture 5. Summary
  • 7. Big Data Definition (Gartner et al)
      •  Tera-, peta-, exa-, zetta-, yottabytes and constantly growing
      •  "Traditional" computing in RDBMS is not scalable enough. We search for "linear scalability"
      •  "Only … structured information is not enough": "95% of produced data is unstructured"
      •  Characteristics of Big Data: its Volume, Velocity and Variety in combination
      •  + Veracity (IBM): information uncertainty
      •  + Time to action? Big Data + Event Processing = Fast Data
  • 8. Big Data Emerging Technologies
      •  MapReduce (e.g. Apache Hadoop)
      •  Event Stream Processing & CEP (e.g. Storm or Esper)
      •  New messaging systems (e.g. Apache Kafka)
      •  Integration tools (e.g. Spring or Camus)
      •  New database paradigms (e.g. NoSQL or NewSQL)
      •  Data mining tools (e.g. Apache Mahout)
      •  Data extraction and detection tools (e.g. Apache Tika)
  • 10. Volume Development
      [Chart: global data volume in exabytes and aggregate uncertainty in %, by year, 2005 to 2015]
      •  Sensors: "internet of things"
      •  Social Media: video, audio, text
      •  VoIP: Skype, MSN, ICQ, ...
      •  Enterprise Data: data dictionary, ERD, ...
  • 11. Velocity
      •  Velocity requirement examples: Recommendation Engine, Predictive Analytics, Marketing Campaign Analysis, Customer Retention and Churn Analysis, Social Graph Analysis, Capital Markets Analysis, Risk Management, Rogue Trading, Fraud Detection, Retail Banking, Network Monitoring, Research and Development
  • 12. Agenda: 1. Big Data, what is it? 2. Motivation 3. The Lambda Architecture 4. Implementing the Lambda Architecture 5. Summary
  • 13. What is a data system?
      •  A system that manages the storage and querying of data with a lifetime measured in years, encompassing every version of the application to ever exist, every hardware failure and every human mistake ever made
      •  A data system answers questions based on information that was acquired in the past
      •  Not all bits of information are equal
      •  Some information is derived from other information
  • 14. Desired Properties of a (Big) Data System
      •  Robust and fault-tolerant
      •  Low latency reads and updates
      •  Scalable
      •  General
      •  Extensible
      •  Allows ad hoc queries
      •  Minimal maintenance
      •  Debuggable
  • 15. Typical problem in today's architecture/systems: Lack of Human Fault Tolerance
      •  Bugs will be deployed to production over the lifetime of a data system
      •  Operational mistakes will be made
      •  Humans are part of the overall system, just like hard disks, CPUs, memory and software
      •  Design for human error like you design for any other fault
      •  Examples of human error: deploying a bug that increments counters by two instead of by one; accidentally deleting data from the database; an accidental DoS on an important internal service
      •  The two worst consequences: data loss or data corruption
      •  As long as an error doesn't lose or corrupt good data, you can fix what went wrong
  • 16. Mutability (Lack of Human Fault Tolerance): capturing change traditionally
      •  The U and D in CRUD
      •  A mutable system updates the current state of the world
      •  Mutable systems inherently lack human fault-tolerance
      •  Easy to corrupt or lose data
      •  Example (Name, City): before the update: (Guido, Berne), (Albert, Zurich); after the update: (Guido, Basel), (Albert, Zurich)
  • 17. Immutability (Lack of Human Fault Tolerance): capturing change by storing events
      •  An immutable system captures historical records of events
      •  Each event happens at a particular time and is always true
      •  Example (Name, City, Timestamp): before the change: (Guido, Berne, 1.8.1999), (Albert, Zurich, 10.5.1988); after the change: the same two records plus the appended record (Guido, Basel, 1.4.2013)
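The contrast between updating in place and appending events can be sketched in a few lines of Python. This is only an illustration of the idea on the two slides above: the `Fact` record and the helpers `current_city` and `city_history` are hypothetical names, and the dates are rewritten in ISO form so that string order matches time order.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    name: str
    city: str
    timestamp: str  # ISO date (assumption), so string order equals time order


# Mutable approach: the update overwrites the old value for good.
mutable = {"Guido": "Berne", "Albert": "Zurich"}
mutable["Guido"] = "Basel"  # "Berne" can no longer be recovered

# Immutable approach: every change is appended as a new fact.
log = [
    Fact("Guido", "Berne", "1999-08-01"),
    Fact("Albert", "Zurich", "1988-05-10"),
]
log.append(Fact("Guido", "Basel", "2013-04-01"))


def current_city(facts, name):
    """Derive the current city from the latest fact about `name`."""
    own = [f for f in facts if f.name == name]
    return max(own, key=lambda f: f.timestamp).city


def city_history(facts, name):
    """Return every city recorded for `name`, oldest first."""
    own = sorted((f for f in facts if f.name == name), key=lambda f: f.timestamp)
    return [f.city for f in own]


print(current_city(log, "Guido"))
print(city_history(log, "Guido"))
```

Because nothing is overwritten, a wrong fact can be corrected by appending or filtering and then deriving the current state again.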
  • 18. Immutability (Lack of Human Fault Tolerance)
      •  Immutability greatly restricts the range of errors that can cause data loss or data corruption
      •  Vastly more human fault-tolerant
      •  Much easier to reason about systems based on immutability
      •  Conclusion: your source of truth should always be immutable
  • 19. What about traditional/today's architectures? The source of truth is mutable!
      •  Rather than build systems like this ...
      [Diagram labels: Mutable Database (RDBMS, NoSQL, NewSQL) as Source of Truth; Application (Query); Mobile, Web, RIA, Rich Client]
  • 20. A different kind of architecture with an immutable source of truth
      •  ... why not build them like this?
      [Diagram labels: HDFS, NoSQL, NewSQL, RDBMS; View on Data; Immutable data as Source of Truth; Application (Query); Mobile, Web, RIA, Rich Client]
  • 21. How to create the views on the immutable data?
      •  On the fly?
      •  Materialized, i.e. pre-computed?
      [Diagram: a query against a view on the immutable data, next to a query against pre-computed views of the immutable data]
  • 22. Data = the most raw information
      •  Data is information which is not derived from anywhere else: the most raw form of information
      •  Data is the special information from which everything else is derived
      •  Questions on data can be answered by running functions that take data as input
      •  The most general purpose data system can answer questions by running functions that take the entire dataset as input: query = function(all data)
      •  The lambda architecture provides a general purpose approach for implementing arbitrary functions on arbitrary datasets
  • 23. Data = the most raw information
      •  Favorite Product List Changes (raw information => data): 1.2.13 Add iPAD 64GB; 10.3.13 Add Sony RX-100; 11.3.13 Add Canon GX-10; 11.3.13 Remove Sony RX-100; 12.3.13 Add Nikon S-100; 14.4.13 Add BoseQC-15; 15.4.13 Add MacBook Pro 15; 20.4.13 Remove Canon GX-10
      •  Current Favorite Product List (derived information): iPAD 64GB, Nikon S-100, BoseQC-15, MacBook Pro 15
      •  Current Product Count (derived information): 4
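The "query = function(all data)" idea can be tried out on the favorite-list example from this slide. The following is a minimal sketch, not the presenters' code: the function names are hypothetical, and it assumes that "Canon GX-10" and "Canon GX10" on the slide refer to the same product.

```python
# Raw data: the immutable list of favorite-list changes from the slide.
# Assumption: "Canon GX-10" and "Canon GX10" are the same product.
CHANGES = [
    ("1.2.13", "add", "iPAD 64GB"),
    ("10.3.13", "add", "Sony RX-100"),
    ("11.3.13", "add", "Canon GX-10"),
    ("11.3.13", "remove", "Sony RX-100"),
    ("12.3.13", "add", "Nikon S-100"),
    ("14.4.13", "add", "BoseQC-15"),
    ("15.4.13", "add", "MacBook Pro 15"),
    ("20.4.13", "remove", "Canon GX-10"),
]


def current_favorites(changes):
    """Derived information: replay all changes to get the current list."""
    favorites = []
    for _date, action, product in changes:
        if action == "add" and product not in favorites:
            favorites.append(product)
        elif action == "remove" and product in favorites:
            favorites.remove(product)
    return favorites


def favorite_count(changes):
    """Derived information: number of products currently on the list."""
    return len(current_favorites(changes))


print(current_favorites(CHANGES))
print(favorite_count(CHANGES))  # 4, as on the slide
```

Both views are pure functions of the full change list, so they can be thrown away and recomputed at any time.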
  • 24. Big Data and Batch Processing
      [Diagram labels: Incoming Data, Immutable data, Batch View, Query]
      •  How to compute the batch views?
      •  How to compute queries from the views?
  • 25. Big Data and Batch Processing
      [Timeline diagram labels: fully processed data, last full batch period, time for batch job, now, batch-processed data, non-processed data]
      •  Using only batch processing always leaves you with a portion of unprocessed data
      •  But we are not done yet ...
      Adapted from Ted Dunning (March 2012): http://www.youtube.com/watch?v=7PcmbI5aC20
  • 26. Adding Real-Time Processing
      [Diagram labels: Incoming Data, Immutable data, Batch Views, Data Stream, Realtime Views, Query]
      •  How to compute queries from the views?
      •  How to compute real-time views?
  • 27. Adding Real-Time Processing
      •  Favorite Product List Changes (immutable data): 1.2.13 Add iPAD 64GB; 10.3.13 Add Sony RX-100; 11.3.13 Add Canon GX-10; 11.3.13 Remove Sony RX-100; 12.3.13 Add Nikon S-100; 14.4.13 Add BoseQC-15; 15.4.13 Add MacBook Pro 15; 20.4.13 Remove Canon GX-10
      •  Stream of Favorite Product List Changes (data stream): Now Add Canon Scanner
      •  Current Favorite Product List (computed view): iPAD 64GB, Nikon S-100, BoseQC-15, MacBook Pro 15, and now Canon Scanner
      •  Current Product Count (computed view): 5
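A small Python sketch of the step shown on this slide: the batch-computed list stays untouched, and the change arriving on the stream is applied incrementally on top of it. The helper name `apply_change` is hypothetical and the stream is modelled as a plain list.

```python
def apply_change(favorites, action, product):
    """Return a new favorites list with one change applied incrementally."""
    if action == "add" and product not in favorites:
        return favorites + [product]
    if action == "remove":
        return [p for p in favorites if p != product]
    return list(favorites)


# View computed by the last batch run (the four products on the slide).
batch_view = ["iPAD 64GB", "Nikon S-100", "BoseQC-15", "MacBook Pro 15"]

# Changes that arrived after the batch run started.
stream = [("add", "Canon Scanner")]

realtime_view = batch_view
for action, product in stream:
    realtime_view = apply_change(realtime_view, action, product)

print(realtime_view)
print(len(realtime_view))  # 5, as on the slide
```

The incremental update touches only the new event instead of replaying the whole history, which is what keeps the latency low.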
  • 28. Big Data and Real Time Processing
      [Timeline diagram: batch processing (e.g. Hadoop) worked fine for the fully processed data up to the last full batch period; real time processing works for the time from there until now, including the time for the batch job; the end user gets a blended view]
      Adapted from Ted Dunning (March 2012): http://www.youtube.com/watch?v=7PcmbI5aC20
  • 29. Agenda: 1. Big Data, what is it? 2. Motivation 3. The Lambda Architecture 4. Implementing the Lambda Architecture 5. Summary
  • 30. Lambda Architecture
      [Diagram labels: Incoming Data; Batch Layer (Immutable data, Batch View); Speed Layer (Data Stream, Realtime View); Serving Layer; Query; elements marked A to G]
  • 31. Lambda Architecture
      A.  All data is sent to both the batch and the speed layer
      B.  The master data set is an immutable, append-only set of data
      C.  The batch layer pre-computes query functions from scratch; the results are called Batch Views. The batch layer constantly re-computes the batch views
      D.  Batch views are indexed and stored in a scalable database to get particular values very quickly. New batch views are swapped in when they are available
      E.  The speed layer compensates for the high latency of updates to the Batch Views in the Serving layer
      F.  The speed layer uses fast incremental algorithms and read/write databases to produce real-time views
      G.  Queries are resolved by getting results from both batch and real-time views
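Steps A to G can be followed in a single-process Python sketch. It is a toy model under stated assumptions, not a real implementation: the class `LambdaSketch` and its methods are hypothetical, the "query function" is a simple count per key, and the three layers live in one object instead of separate distributed systems.

```python
from collections import Counter


class LambdaSketch:
    """Toy single-process model of the three layers (illustration only)."""

    def __init__(self):
        self.master = []                 # B: immutable, append-only master data set
        self.batch_view = Counter()      # D: batch view held by the serving layer
        self.realtime_view = Counter()   # F: incrementally updated real-time view

    def ingest(self, key):
        # A: every record goes to both the batch and the speed layer.
        self.master.append(key)
        self.realtime_view[key] += 1     # F: fast incremental update

    def run_batch(self):
        # C: recompute the view from scratch over all data seen so far.
        covered = len(self.master)
        new_view = Counter(self.master[:covered])
        # D: swap in the freshly computed batch view.
        self.batch_view = new_view
        # E: the speed layer only has to cover data the batch view lacks.
        self.realtime_view = Counter(self.master[covered:])

    def query(self, key):
        # G: merge the batch and the real-time result.
        return self.batch_view[key] + self.realtime_view[key]


system = LambdaSketch()
for page in ["home", "about", "home"]:
    system.ingest(page)
before = system.query("home")   # answered from the real-time view only

system.run_batch()              # the batch view now covers all three records
system.ingest("home")           # arrives after the batch run
after = system.query("home")    # batch view plus real-time increment
print(before, after)
```

In this sketch a bug in the incremental path would be corrected by the next `run_batch`, because the batch view is always rebuilt from the master data set.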
  • 32. Layered Architecture
      •  Batch Layer: stores the immutable, constantly growing dataset; computes arbitrary views from this dataset using BigData technologies (can take hours); can always be recreated
      •  Serving Layer: responsible for indexing and exposing the pre-computed batch views so that they can be queried; exposes the incremented real-time views; merges the batch and the real-time views into a consistent result
      •  Speed Layer: computes the views from the constant stream of data it receives; needed to compensate for the high latency of the batch layer; incremental model, and views are transient
  • 33. Agenda: 1. Big Data, what is it? 2. Motivation 3. The Lambda Architecture 4. Implementing the Lambda Architecture 5. Summary
  • 34. Lambda Architecture
      [Diagram labels: Incoming Data; Batch Layer (All data, Precompute Views, Batch recompute, Precomputed information); Speed Layer (Process stream, Realtime increment, Incremented information); Serving Layer (batch views, real time views, Merge); query]
      Source: Marz, N. & Warren, J. (2013) Big Data. Manning.
  • 35. Lambda Architecture: one possible product/framework mapping
      [Diagram labels: Incoming Data; Batch Layer (All data, Precompute Views, Batch recompute, Precomputed information); Speed Layer (Process stream, Realtime increment, Incremented information); Serving Layer (batch views, real time views, Merge); query]
  • 36. Implementing Batch Layer
      •  Immutable data: append only; normalized; stores the master copy of all data
      •  Pre-computed information: a function that takes all data as input, query = function(all-data); high latency, batch processing; unrestrained computation; horizontally scalable
      [Diagram labels: Immutable data, compute, Batch Views; Batch Layer, Serving Layer]
  • 37. Apache Hadoop HDFS (Batch Layer)
      •  HDFS = the Hadoop Distributed File System
      •  A distributed file storage system
      •  Redundant storage
      •  Designed to reliably store data using commodity hardware
      •  Designed to expect hardware failures
      •  Intended for large files
      •  Designed for batch inserts
  • 38. Apache Hadoop Map Reduce (Batch Layer)
      •  Hadoop Map Reduce is an open source implementation of the MapReduce framework
      •  Map Reduce is a programming model, introduced by Google, for processing large data sets in a distributed environment
      •  It is the de-facto standard to compute huge amounts of data
      •  It is also an execution framework for organizing and performing such computations
      [Diagram labels: problem data, master node, MAP, worker node 1, worker node 2, worker node 3, REDUCE, solution data]
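The programming model itself fits into a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce phases using word count as the job; it does not use the Hadoop API, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data fast data", "fast data"])
print(counts)
```

In a real cluster the map calls run on different worker nodes and the shuffle moves data across the network; the structure of the computation stays the same.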
  • 39. Hadoop MapReduce Flow (Batch Layer). Source: Bill Graham, Twitter Inc.
  • 40. Hadoop MapReduce (Batch Layer)
  • 41. Cascading (Batch Layer)
      •  Application framework for Java developers to simply develop robust Data Analytics and Data Management applications on Apache Hadoop
      •  Adds an abstraction layer over the Hadoop API
      •  Core concepts of the Cascading API:
      •  Pipe: a series of processing steps (parsing, looping, filtering, etc.) defining the data processing to be done
      •  Flow: association of a pipe (or set of pipes) with a data-source and data-sink
  • 42. Cascading
  • 43. Apache Pig (Batch Layer)
      •  Apache Pig is a platform for analyzing large data sets
      •  Key properties: ease of programming, optimization opportunities, extensibility
  • 44. Implementing Serving Layer for Batch Views
      •  Need a database that is batch-writable, where adding new information is atomic
      •  Has fast random reads
      •  Is scalable
      •  Is highly available
      •  Can be optimized for storage
      •  Information can be de-normalized
      •  But no random writes required!
      •  Can be a simple database
      [Diagram labels: Batch Layer (Immutable data, compute, Precomputed information); Serving Layer (batch views)]
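These requirements (batch-writable, atomic swap, fast random reads, no random writes) can be illustrated with a small read-only store in Python. This is a sketch under simplifying assumptions: the class `BatchViewStore` is hypothetical, everything is held in memory in one process, and a sorted tuple with binary search stands in for an on-disk index.

```python
from bisect import bisect_left


class BatchViewStore:
    """Read-only key/value store that is replaced wholesale, never updated in place."""

    def __init__(self):
        self._index = ((), ())  # (sorted keys, values in the same order)

    def load(self, view):
        """Batch write: build a complete new index, then swap it in."""
        keys = tuple(sorted(view))
        values = tuple(view[k] for k in keys)
        # One assignment replaces the old view; readers see either the old
        # or the new index, never a half-written one.
        self._index = (keys, values)

    def get(self, key, default=None):
        """Random read via binary search over the sorted keys."""
        keys, values = self._index
        i = bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return values[i]
        return default


store = BatchViewStore()
store.load({"home": 2, "about": 1})
print(store.get("home"))

# The next batch run produces a complete new view that replaces the old one.
store.load({"home": 3, "contact": 1})
print(store.get("home"), store.get("about"))
```

There is deliberately no `put` method: without random writes the store stays simple, which is the point made on the slide.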
  • 45. SploutSQL (Serving Layer)
      •  Full SQL => unlike NoSQL
      •  For BigData => unlike RDBMS
      •  Web latency & throughput => unlike Apache Hive, Apache Drill
      •  Why does it scale? Data is partitioned; partitions are distributed across nodes; adding more nodes increases capacity; generation does not impact serving
      Source: Datasalt.
  • 46. SploutSQL (Serving Layer)
  • 47. Implementing Speed Layer
      •  Stream processing
      •  Continuous computation
      •  Transactional
      •  Storing a limited window of data: compensating for the last few hours of data
      •  All the complexity is isolated in the speed layer: if anything goes wrong, it's autocorrected by the next batch run
      [Diagram labels: Data Stream, derive, Realtime Views; Speed Layer (Process stream, Realtime increment, Incremented information); Serving Layer]
  • 48. Apache Kafka
      •  A high throughput distributed messaging system
      •  Originated at LinkedIn
      •  Sequential disk access
  • 49. Twitter Storm: the "real-time Hadoop" (Speed Layer, Serving Layer)
      •  Storm is a distributed and fault-tolerant real-time computing platform
      •  Data flow model: data flows through a network of transformation entities
      •  Key concepts:
      •  Tuple: ordered list of elements
      •  Streams: unbounded sequence of tuples
      •  Spouts: source of streams
      •  Bolts: process tuples and create new streams
      •  Topologies: directed graph of Spouts and Bolts
      •  Use cases: stream processing, continuous computation, distributed RPC
      [Diagram labels: data source, problem data, SPOUT, BOLT "MAP", BOLT "REDUCE", BOLT "PERSIST", solution data]
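The key concepts can be mimicked with Python generators to show how tuples flow through a topology. This is a conceptual sketch only: it does not use the Storm API, the function names are hypothetical, and a finite list stands in for an unbounded stream.

```python
def spout(lines):
    """Spout: source of a stream; emits one tuple per input line."""
    for line in lines:
        yield (line,)


def split_bolt(stream):
    """Bolt: consumes line tuples and emits a new stream of word tuples."""
    for (line,) in stream:
        for word in line.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) after each tuple."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


def topology(lines):
    """Topology: directed graph spout -> split_bolt -> count_bolt."""
    return count_bolt(split_bolt(spout(lines)))


results = list(topology(["big data", "fast data"]))
print(results)
```

Each tuple is processed as soon as it arrives, so a count is emitted per event instead of once at the end of a batch. In Storm the spouts and bolts would additionally run in parallel across a cluster.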
  • 50. Twitter Storm (Speed Layer, Serving Layer)
  • 51. Twitter Trident (Speed Layer, Serving Layer)
      •  Higher level abstraction over Storm
      •  Concepts: Trident State, Grouped Stream, Functions, Filters, Aggregators, Query
      •  Similar to Pig and Cascading
  • 52. Twitter Trident (Speed Layer, Serving Layer)
  • 53. Implementing Serving Layer for Real-Time Views
      •  Incremental updates are made available as real-time views
      •  Requires a database that supports random reads and random writes
      •  Relational, NoSQL or NewSQL (in memory) databases can be used
      •  Here we are typically not in the BigData range
      •  Results are only needed until the data has made it through the batch layer
      •  Complexity isolation
      [Diagram labels: Data Stream, derive, Realtime Views; Speed Layer (Incremented information); Serving Layer (real time views)]
  • 54. Cassandra (Serving Layer)
      •  Fully distributed, no single-point-of-failure
      •  Linearly scalable
      •  Fault tolerant
      •  Performant
      •  Durable
      •  Integrated caching
      •  Tunable consistency
  • 55. Implementing Serving Layer: Merge of Batch and Realtime Views
      •  An interesting feature of Storm / Trident is the ability to execute distributed RPC (DRPC) calls in parallel
      •  This can be used to implement the merge functionality when a query is executed
      [Diagram labels: Serving Layer (batch views, real time views, Merge); query]
  • 56. Storm / Trident DRPC (Serving Layer)
  • 57. Agenda: 1. Big Data, what is it? 2. Motivation 3. The Lambda Architecture 4. Implementing the Lambda Architecture 5. Summary
  • 58. Summary: The lambda architecture
      •  The Lambda Architecture
      •  Can discard batch views and real-time views and recreate everything from scratch
      •  Mistakes are corrected via re-computation
      •  Data storage layer is optimized independently from the query resolution layer
      •  Still in a very early …. But a very interesting idea!
          -  Today a zoo of technologies is needed => Operations won't like it
      •  Different query languages for batch and real time
      •  An abstraction over batch and speed layer is needed
          -  Cascading and Trident are already similar
      •  Industry standards needed!
  • 59. THANK YOU. Trivadis AG, Guido Schmutz & Albert Blarer, Europa-Strasse 5, CH-8095 Glattbrugg, info@trivadis.com, www.trivadis.com