Tobias Johansson
@ntjohansson
27/10/2016
Big data analytics
Einstürzenden Neudaten: Building an analytics engine from scratch
What is Valo
• Big data analytics engine
• Focusing on simplicity from a usage perspective
• Single process containing
  • Time-series repository
  • Semi-structured repository
  • Execution engine
  • Etc.
• Written in Scala/C++/Lua
• REST based
PUT /streams/sensors/environment/air
{
  "sampleTime": { "type": "datetime" },
  "sensor": { "type": "contributor" },
  "pollution": { "type": "double" }
}
POST /streams/sensors/environment/air
{
  "sampleTime": "2016/10/27 15:13:00",
  "sensor": "131e90ad-e32a",
  "pollution": 85.6
}
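A minimal sketch of how a client might build the two requests above with Python's standard library. The base URL `http://localhost:8888` and the helper name `build_request` are assumptions for illustration; only the path and the JSON bodies come from the slide. The requests are constructed here but not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8888"  # assumption: placeholder host and port
STREAM = "/streams/sensors/environment/air"  # path shown on the slide


def build_request(method, path, payload):
    """Build (but do not send) a JSON request for the given method and path."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        method=method,
        headers={"Content-Type": "application/json"},
    )


# Stream schema, as in the PUT on the slide.
schema_request = build_request("PUT", STREAM, {
    "sampleTime": {"type": "datetime"},
    "sensor": {"type": "contributor"},
    "pollution": {"type": "double"},
})

# One event, as in the POST on the slide.
event_request = build_request("POST", STREAM, {
    "sampleTime": "2016/10/27 15:13:00",
    "sensor": "131e90ad-e32a",
    "pollution": 85.6,
})

# To send a request against a running server: urllib.request.urlopen(event_request)
```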
• Data friendly
POST /streams/sensors/environment/air
Content-Type: application/json
POST /streams/sensors/environment/air
Content-Type: application/cbor
POST /streams/sensors/environment/air
Content-Type: application/csv
POST /streams/sensors/environment/air
Content-Type: application/bson
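As a rough illustration of posting the same event under different content types, the sketch below encodes one event as JSON and as CSV using only the standard library. The CSV layout (a header row followed by one data row) is an assumption, since the slides do not show the expected CSV shape; CBOR and BSON are left out because they need third-party libraries.

```python
import csv
import io
import json

EVENT = {
    "sampleTime": "2016/10/27 15:13:00",
    "sensor": "131e90ad-e32a",
    "pollution": 85.6,
}


def encode_json(event):
    """Serialise the event as a UTF-8 JSON document."""
    return json.dumps(event).encode("utf-8")


def encode_csv(event):
    """Serialise the event as a header row plus one data row (assumed layout)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(event), lineterminator="\n")
    writer.writeheader()
    writer.writerow(event)
    return buffer.getvalue().encode("utf-8")


# Content-Type header value -> encoder for the request body.
ENCODERS = {
    "application/json": encode_json,
    "application/csv": encode_csv,
}


def encode(event, content_type):
    """Pick the encoder that matches the Content-Type header."""
    return ENCODERS[content_type](event)
```

The point of the dispatch table is that the stream path stays the same and only the header and body encoding change.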
Time-series Semi-structured
• Real-time and historical queries
Looks simple?
Trust me, it is not.
Dynamo style clustering and vector-clocks
Eventual consistency
Gossip protocols
Distributed algorithms
Distributed execution engine
Expression trees and runtime code generation
Query rewriting and optimization
Consistent hashing
Time-series repository
Semi-structured repository
Data atomicity
Back pressure
Elasticity
Advanced ML algorithms
IO
Actor systems
Data distribution
Cluster management
B+ trees
Query language
KV-store
REST API
Jump consistent hashing
Off-heap memory
Data formats
Distributed joins
Time semantics
Gap-filling
Statistical models
Distributed CRDTs
Transports
Real-time queries
Know your cluster
It will crash
• You need a cluster to run big data analytics on, but it is built on:
  • Commodity hardware, which can fail
  • An unreliable network
• Issues:
  • Unreachable nodes
  • Dropped messages
  • Delayed messages
  • No response
  • Split network
    • Multiple working clusters
    • Mutable state is likely to diverge
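One common way to notice that state has diverged after a split network is to compare vector clocks. The sketch below is a generic illustration, not Valo's implementation; the function names `tick`, `merge` and `compare` are hypothetical.

```python
def tick(clock, node):
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a, b):
    """Element-wise maximum: the smallest clock that dominates both inputs."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


def compare(a, b):
    """Classify a relative to b: 'equal', 'before', 'after' or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither side has seen the other's update


# Both sides start from the same configuration version.
base = tick({}, "A")
# During a network split, each side updates independently.
left = tick(base, "A")
right = tick(base, "B")
```

Here `compare(left, right)` reports `"concurrent"`, which is the signal that the two sides diverged and need reconciling rather than one simply overwriting the other.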
• Accept these issues and don’t try to fight them. Make life simpler by:
  • Not having a single point of failure
    • No leaders
    • No master/slave
    • No special nodes
  • Making it eventually consistent
    • Use CRDTs for sets, counters, etc.
    • Use vector clocks for configuration
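To make the CRDT point concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. It is a textbook construction shown for illustration; the slides do not say which CRDT designs Valo uses, and the class name `GCounter` is an assumption.

```python
class GCounter:
    """Grow-only counter: each node increments only its own slot."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node id -> number of increments seen from that node

    def increment(self, amount=1):
        """Record local increments under this node's own slot."""
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        """The counter value is the sum over all nodes' slots."""
        return sum(self.counts.values())

    def merge(self, other):
        """Take the per-node maximum; safe to repeat and to apply in any order."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


# Two replicas count independently, then exchange state.
a = GCounter("A")
b = GCounter("B")
a.increment()
a.increment()
b.increment()
a.merge(b)
b.merge(a)
a.merge(b)  # merging the same state again changes nothing
```

Because merge is commutative, associative and idempotent, replicas converge without a leader, which is what makes this style fit a cluster with no special nodes.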
Know your data
• Do not treat all data the same
  • Time-series repository
    • CPU data, market data, ECG
  • Semi-structured repository
    • Log files, emails
  • KV repository
    • Configuration
• Unless you are Oracle or Microsoft, make your data immutable and append-only.
• Streams are facts at points in time, and facts do not change
• Build properties into your data distribution policies. Properties which:
  • Maximise resilience
    • Avoid replicas on the same physical server rack
  • Optimise data locality
  • Minimise the number of data transfers required when adding/removing nodes
  • Deterministically tell where data lives in the cluster
    • Where does data for T0 to T1 sit in the cluster?
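The question "where does data for T0 to T1 sit?" can be answered without a lookup table if placement is a pure function of the time bucket. The sketch below combines fixed-size time buckets with jump consistent hashing (Lamping and Veach). The one-hour bucket size, the node list and the helper names are assumptions for illustration, and the slides do not state that Valo places buckets exactly this way.

```python
from datetime import datetime, timedelta, timezone


def jump_hash(key, num_buckets):
    """Jump consistent hash: map a 64-bit integer key to [0, num_buckets)."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b


BUCKET = timedelta(hours=1)  # assumption: one-hour buckets
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_index(ts):
    """Number of whole buckets between the epoch and the timestamp."""
    return (ts - EPOCH) // BUCKET


def nodes_for_range(t0, t1, nodes):
    """Return {bucket_start: node} for every bucket overlapping [t0, t1]."""
    placement = {}
    for index in range(bucket_index(t0), bucket_index(t1) + 1):
        placement[EPOCH + index * BUCKET] = nodes[jump_hash(index, len(nodes))]
    return placement


NODES = ["A", "B", "C", "D", "E"]  # hypothetical cluster members
t0 = datetime(2016, 10, 27, 15, 13, tzinfo=timezone.utc)
placement = nodes_for_range(t0, t0 + timedelta(hours=2), NODES)
```

Any node can run `nodes_for_range` and get the same answer, and when the node count grows from n to n + 1 a key either stays put or moves to the new bucket n.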
• Consistent hashing
  • Minimises the number of data transfers in the cluster
• Time-based distribution
  • Distribute data in the cluster in second, minute, hour, day buckets
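A small sketch of why consistent hashing keeps data transfers low: on a hash ring with virtual nodes, adding a node only takes over keys from its new ring positions, so every key that moves, moves to the new node. This is a generic ring, not Valo's code; the class name `HashRing`, the use of MD5 for ring positions and the figure of 64 virtual nodes are assumptions.

```python
import bisect
import hashlib


def _point(label):
    """Map a string to a position on the ring (a 64-bit integer)."""
    digest = hashlib.md5(label.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Insert this node's virtual positions into the ring."""
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_point(f"{node}#{i}"), node))

    def owner(self, key):
        """The first ring position at or after the key's position owns the key."""
        index = bisect.bisect(self._ring, (_point(key), ""))
        return self._ring[index % len(self._ring)][1]


ring = HashRing(["A", "B", "C", "D", "E", "F", "G", "K", "L", "M"])
keys = [f"segment-{i}" for i in range(1000)]
before = {key: ring.owner(key) for key in keys}

ring.add("N")  # a new node joins the cluster
after = {key: ring.owner(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
```

Keys in `moved` all end up on `N`; nothing is shuffled between the existing nodes.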
Node / Segment grids for segments 1 2 3 4 5 6 8 9 (two grids side by side; marks per node):
Node   Left grid   Right grid
A      2           1
B      3           2
C      4           3
D      4           3
E      3           3
F      2           3
G      1           3
K      2           3
L      2           2
M      1           1
N      0           0
Node / Segment grids for segments 1 2 3 4 5 6 8 9 (two grids side by side; marks per node, highlighted marks in brackets):
Node   Left grid   Right grid
A      2           1
B      3           2
C      4           3
D      3           3
E      3 (1)       3
F      2 (1)       3
G      2 (1)       3
K      2           3
L      2           2
M      1           1
N      0           0
Know your algos
from historical /streams/demo/infrastructure/cpu
select avg(kernel)
[Figure: the query's avg shown as several Avg operators, built up over successive slides]
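Init: () -> β
Apply: β -> 'a list -> β
Reduce: β -> β -> β
Finalise: β -> 'r
// Comments give the likely role of each method, read against the four signatures above.
abstract class AverageDouble {
  def apply(value: NamedDouble): Unit  // Apply: fold one value into the running state
  def reset(): Unit                    // Init: return to the empty state
  def merge(state: Parser): Unit       // Reduce: combine with another instance's saved state
  def restore(state: Parser): Unit     // load a previously saved state
  def getResult: NamedDouble           // Finalise: produce the average
  def save(gen: Generator): Unit       // write the state out so it can be moved
}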
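The four signatures above can be sketched for an average as follows. This is an illustrative reading, not the Valo implementation: the state type `AvgState` and the lower-case function names are assumptions. The key point is that the intermediate state is a (total, count) pair, because partial averages cannot be merged correctly on their own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvgState:
    """Intermediate state (the β in the signatures): running total and count."""
    total: float = 0.0
    count: int = 0


def init():
    """Init: () -> β"""
    return AvgState()


def apply(state, values):
    """Apply: β -> 'a list -> β, folding a batch of local values into the state."""
    values = list(values)
    return AvgState(state.total + sum(values), state.count + len(values))


def reduce(left, right):
    """Reduce: β -> β -> β, merging two partial states from different nodes."""
    return AvgState(left.total + right.total, left.count + right.count)


def finalise(state):
    """Finalise: β -> 'r, returning None for an empty input."""
    return state.total / state.count if state.count else None


# Two nodes each process their local data, then the partial states are merged.
node_a = apply(init(), [10.0, 20.0])
node_b = apply(init(), [60.0])
merged = reduce(node_a, node_b)
result = finalise(merged)
```

Here `result` is 30.0, whereas averaging the two per-node averages (15.0 and 60.0) would give 37.5. Keeping total and count separate until the final step is what lets the partial work be done wherever the data sits.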
Travelling algos
[Figure: six Avg operators shown above the node/segment grid]
Node / Segment grid for segments 1 2 3 4 5 6 8 9 (marks per node):
Node   Marks
A      1
B      2
C      3
D      3
E      3
F      3
G      3
K      3
L      2
M      1
N      0
from historical /streams/demo/infrastructure/itime
group by timeStamp window of 5 minutes every 5 minutes fill last, alpha
select alpha, timeStamp, last(a) as la
partition every 1 hour as implicit
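A rough sketch of what the windowing part of this query appears to ask for: 5-minute tumbling windows, `last(a)` per window, and gaps filled with the previous value. The semantics are inferred from the query text, so treat them as assumptions; the grouping by `alpha` and the hourly partitioning are left out to keep the example short, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)


def window_start(ts):
    """Floor a timestamp to the start of its 5-minute window."""
    return ts - timedelta(
        minutes=ts.minute % 5, seconds=ts.second, microseconds=ts.microsecond
    )


def last_per_window(events):
    """events: (timeStamp, a) pairs sorted by time. Returns {window_start: last a}."""
    result = {}
    for ts, a in events:
        result[window_start(ts)] = a  # later events overwrite earlier ones
    return result


def fill_last(windows, start, end):
    """Emit one row per window in [start, end), carrying the last value into gaps."""
    rows, previous = [], None
    current = start
    while current < end:
        previous = windows.get(current, previous)
        rows.append((current, previous))
        current += WINDOW
    return rows


events = [
    (datetime(2016, 10, 27, 10, 1), 1.0),
    (datetime(2016, 10, 27, 10, 3), 2.0),
    (datetime(2016, 10, 27, 10, 12), 5.0),
]
rows = fill_last(
    last_per_window(events),
    datetime(2016, 10, 27, 10, 0),
    datetime(2016, 10, 27, 10, 20),
)
```

The 10:05 window has no events, so it repeats 2.0 from the 10:00 window; the 10:15 window repeats 5.0 in the same way.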
./valo
www.valo.io
Thank you
Meet us at the Startup Area
tobias@valo.io
@ntjohansson
Algos
MicroTickFrequency
MicroVolatility
OnlineMisraGries
Anomaly
Histogram
Bivar
Univar
Skyline
EMA
MovingKurtosis
MovingDerivative
RecursiveEMA
MovingVariance
MovingVariance
Average
Sum
Sum
TopK
Quantiles
What has brought us here today


"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias Johansson, Lead Developer at Valo.io
