"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias Johansson, Lead Developer at Valo.io

Tobias Johansson
@ntjohansson
27/10/2016
Big data analytics
Einstürzenden Neudaten: Building an analytics engine from scratch

What is Valo
• Big data analytics engine
• Focusing on simplicity from a usage perspective
• Single process containing:
  • Time-series repository
  • Semi-structured repository
  • Execution engine
  • Etc.
• Written in Scala/C++/Lua

What is Valo
• REST based
PUT /streams/sensors/environment/air
{
  "sampleTime": { "type": "datetime" },
  "sensor": { "type": "contributor" },
  "pollution": { "type": "double" }
}
POST /streams/sensors/environment/air
{
  "sampleTime": "2016/10/27 15:13:00",
  "sensor": "131e90ad-e32a",
  "pollution": 85.6
}
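As a rough illustration of the two calls above, the sketch below builds the same PUT (define the stream schema) and POST (publish one event) requests with Python's standard library. The base URL is hypothetical, since the slides do not say where a Valo node listens, and the requests are only constructed, not sent.

```python
import json
import urllib.request

# Hypothetical address: the slides do not state a host or port.
BASE_URL = "http://localhost:8888"
STREAM = "/streams/sensors/environment/air"


def build_request(method: str, path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON request for the given path."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


# Schema and event bodies copied from the slide.
schema = {
    "sampleTime": {"type": "datetime"},
    "sensor": {"type": "contributor"},
    "pollution": {"type": "double"},
}
event = {
    "sampleTime": "2016/10/27 15:13:00",
    "sensor": "131e90ad-e32a",
    "pollution": 85.6,
}

define_stream = build_request("PUT", STREAM, schema)
publish_event = build_request("POST", STREAM, event)
```

Passing either object to `urllib.request.urlopen` would send it to a running server at that address.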

What is Valo
• Data friendly
Content-Type: application/json
Content-Type: application/cbor
Content-Type: application/csv
Content-Type: application/bson
• Time-series and semi-structured data

What is Valo
• Real-time and historical queries

Looks simple?
Trust me, it is not.
Dynamo style clustering and vector-clocks
Eventual consistency
Gossip protocols
Distributed algorithms
Distributed execution engine
Expression trees and runtime code generation
Query rewriting and optimization
Consistent hashing
Time-series repository
Semi-structured repository
Data atomicity
Back pressure
Elasticity
Advanced ML algorithms
IO
Actor systems
Data distribution
Cluster management
B+ trees
Query language
KV-store
REST API
Jump consistent hashing
Off-heap memory
Data formats
Distributed joins
Time semantics
Gap-filling
Statistical models
Distributed CRDTs
Transports
Real-time queries

Know your cluster
It will crash

Know your cluster
• You need a cluster to run big data analytics on, but it is built on:
  • Commodity hardware, which can fail
  • An unreliable network

Know your cluster
• Issues:
  • Unreachable nodes
  • Dropped messages
  • Delayed messages
  • No response
• Split network:
  • Multiple working clusters
  • Mutable state is likely to diverge
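Divergence after a split cannot be prevented, but it can be detected. The deck names vector clocks for this kind of job (it mentions them for configuration). The sketch below is a generic textbook vector clock, not Valo's implementation; the function names and the dict-based representation are assumptions made for illustration.

```python
from typing import Dict

VectorClock = Dict[str, int]  # node id -> number of updates seen from that node


def increment(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify a relative to b: 'before', 'after', 'equal' or 'concurrent'."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"  # both sides changed independently: state diverged
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum, used once the diverged values are reconciled."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


# Two halves of a split cluster update the same value independently.
shared = increment({}, "A")
left = increment(shared, "A")
right = increment(shared, "B")
```

Here `compare(left, right)` reports `"concurrent"`, which is the signal that the two sides need reconciling rather than one simply overwriting the other.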

Know your cluster
• Accept these issues and don’t try to fight them. Make life simpler by:
  • Not having a single point of failure
    • No leaders
    • No master/slave
    • No special nodes
  • Making it eventually consistent
    • Use CRDTs for sets, counters, etc.
    • Use vector clocks for configuration
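To make the CRDT bullet concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. It is a textbook construction, not code from Valo; the class and method names are chosen for illustration.

```python
from typing import Dict


class GCounter:
    """Grow-only counter: each node increments only its own slot."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.counts: Dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        """The counter value is the sum of all per-node slots."""
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        """Take the per-node maximum; safe to repeat and to apply in any order."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


# Two nodes count independently, then exchange state in both directions.
a = GCounter("A")
b = GCounter("B")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
a.merge(b)  # a repeated merge changes nothing
```

Because merging is commutative, associative and idempotent, the replicas converge to the same value without a leader, which fits the "no special nodes" rule above.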

Know your data
• Do not treat all data the same
  • Time-series repository
    • CPU data, market data, ECG
  • Semi-structured repository
    • Log files, emails
  • KV repository
    • Configuration
• Unless you are Oracle or Microsoft, make your data immutable and append-only
  • Streams are facts at points in time, and facts do not change

Know your data
• Build properties into your data distribution policies. Properties which:
  • Maximise resilience
    • Avoid replicas on the same physical server rack
  • Optimise data locality
  • Minimise the number of data transfers required when adding/removing nodes
  • Deterministically tell where data lives in the cluster
    • Where does data for T0 to T1 sit in the cluster?

Know your data
• Consistent hashing
  • Minimises the number of data transfers in the cluster
• Time-based distribution
  • Distribute data in the cluster in second, minute, hour, day buckets
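One way to combine the two ideas is to hash a (stream, time bucket) pair onto a node. The sketch below uses jump consistent hashing (Lamping and Veach), which the deck lists among its topics; whether Valo applies it in exactly this way is an assumption, as are the function names, the hourly bucket size and the SHA-256 key derivation.

```python
import hashlib
from datetime import datetime, timezone


def jump_hash(key: int, num_buckets: int) -> int:
    """Jump consistent hash: map a 64-bit key to a bucket in [0, num_buckets)."""
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step, masked to emulate unsigned overflow.
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


def bucket_key(stream: str, ts: datetime, bucket_seconds: int = 3600) -> int:
    """Derive a 64-bit key from the stream name and the time bucket of ts."""
    bucket_start = int(ts.timestamp()) // bucket_seconds
    digest = hashlib.sha256(f"{stream}|{bucket_start}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def node_for(stream: str, ts: datetime, num_nodes: int) -> int:
    """Deterministically answer 'which node holds this stream at this time?'."""
    return jump_hash(bucket_key(stream, ts), num_nodes)


stream = "/streams/sensors/environment/air"
t_early = datetime(2016, 10, 27, 15, 13, tzinfo=timezone.utc)
t_late = datetime(2016, 10, 27, 15, 59, tzinfo=timezone.utc)
same_bucket = node_for(stream, t_early, 10) == node_for(stream, t_late, 10)
```

Both timestamps fall in the same hourly bucket, so they resolve to the same node. When the cluster grows from n to n + 1 nodes, a key either stays where it was or moves to the new node, which is the "minimise data transfers" property.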

Know your data
[Figure: two node-by-segment placement grids side by side (nodes A to N, segment columns 1 2 3 4 5 6 8 9); an x marks a segment held by a node]

Know your data
[Figure: the same two placement grids, with the marks on nodes E, F and G in the left grid highlighted (X)]

Know your algos
from historical /streams/demo/infrastructure/cpu
select avg(kernel)

Know your algos
select avg(kernel)
[Figures: three diagrams, each showing a different arrangement of several Avg operators]

Know your algos
Init: () -> β
Apply: β -> 'a list -> β
Reduce: β -> β -> β
Finalise: β -> 'r
class AverageDouble {
  def apply(value: NamedDouble): Unit
  def reset(): Unit
  def merge(state: Parser): Unit
  def restore(state: Parser): Unit
  def getResult: NamedDouble
  def save(gen: Generator): Unit
}
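The four signatures above describe an aggregate whose partial state can be computed separately and combined later. A minimal sketch for an average follows, assuming the state β is a (sum, count) pair; the `AvgState` name and the Python function names are illustrative and not taken from Valo.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AvgState:
    """Partial state of an average: enough to merge without losing precision."""
    total: float = 0.0
    count: int = 0


def init() -> AvgState:
    """Init: () -> state."""
    return AvgState()


def apply(state: AvgState, values: Iterable[float]) -> AvgState:
    """Apply: fold a batch of local values into the state."""
    batch = list(values)
    return AvgState(state.total + sum(batch), state.count + len(batch))


def reduce(left: AvgState, right: AvgState) -> AvgState:
    """Reduce: combine two partial states, for example from two nodes."""
    return AvgState(left.total + right.total, left.count + right.count)


def finalise(state: AvgState) -> float:
    """Finalise: turn the merged state into the result."""
    if state.count == 0:
        raise ValueError("average of no values")
    return state.total / state.count


# Each node aggregates its own data; only the small states travel.
node_a = apply(init(), [10.0, 20.0])
node_b = apply(init(), [30.0])
result = finalise(reduce(node_a, node_b))
```

Merging the states gives 20.0, the true average of the three values. Averaging the two per-node averages (15.0 and 30.0) would give 22.5, which is why the state carries the count and not just the running average.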

Travelling algos
[Figure: Avg operators placed over a node-by-segment grid (nodes A to N, segment columns 1 2 3 4 5 6 8 9)]
from historical /streams/demo/infrastructure/itime
group by timeStamp window of 5 minutes every 5 minutes fill last, alpha
select alpha, timeStamp, last(a) as la
partition every 1 hour as implicit
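The query above groups events into 5-minute windows, takes the last value per window and fills gaps with `fill last`. The sketch below shows one plausible reading of that behaviour on a single series: empty windows repeat the most recent earlier value. The exact semantics of Valo's query language are not given in the slides, so this is an assumption; the `alpha` grouping key and the hourly partitioning are left out, and the function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple


def window_last_fill(
    events: List[Tuple[datetime, float]],
    start: datetime,
    end: datetime,
    width: timedelta = timedelta(minutes=5),
) -> List[Tuple[datetime, Optional[float]]]:
    """Emit (window start, last value) for each window in [start, end).

    Windows with no events carry the previous window's value forward;
    windows before the first event yield None.
    """
    last_in_window: Dict[int, float] = {}
    for ts, value in sorted(events):
        if start <= ts < end:
            index = (ts - start) // width  # timedelta // timedelta gives an int
            last_in_window[index] = value  # later events overwrite earlier ones

    result: List[Tuple[datetime, Optional[float]]] = []
    carried: Optional[float] = None
    index = 0
    window_start = start
    while window_start < end:
        if index in last_in_window:
            carried = last_in_window[index]
        result.append((window_start, carried))
        index += 1
        window_start += width
    return result


start = datetime(2016, 10, 27, 15, 0)
events = [
    (datetime(2016, 10, 27, 15, 1), 1.0),
    (datetime(2016, 10, 27, 15, 3), 2.0),
    (datetime(2016, 10, 27, 15, 12), 5.0),
]
filled = window_last_fill(events, start, start + timedelta(minutes=20))
```

The 15:05 and 15:15 windows contain no events, so they repeat the values from the windows before them.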

Dynamo style clustering and vector-clocks
Eventual consistency
Gossip protocols
Distributed algorithms
Distributed execution engine
Expression trees and runtime code generation
Query rewriting and optimization
Consistent hashing
Time-series repository
Semi-structured repository
Data atomicity
Back pressure
Elasticity
Advanced ML algorithms
IO
Actor systems
Data distribution
Cluster management
B+ trees
Query language
KV-store
REST API
Jump consistent hashing
Off-heap memory
Data formats
Distributed joins
Time semantics
Gap-filling
Statistical models
Distributed CRDTs
Transports
Real-time queries
./valo

www.valo.io
Thank you
Meet us at the Startup Area
tobias@valo.io
@ntjohansson

Algos
MicroTickFrequency
MicroVolatility
OnlineMisraGries
Anomaly
Histogram
Bivar
Univar
Skyline
EMA
MovingKurtosis
MovingDerivative
RecursiveEMA
MovingVariance
Average
Sum
TopK
Quantiles

What has brought us here today