About VisualDNA Architecture @ Rubyslava 2014
Michal Hariš : michal.haris@visualdna.com
- Technical Architect, joined VisualDNA in 2012
Where were we 3 years ago
● 10 people working around one MySQL table holding 50M+ user profiles
● LAMP architecture
SCALABILITY ISSUES
DECISION TO GO BIG (DATA)!
Where were we 18 months ago
● 30-strong team, of which a single tech team of roughly 15 people
● Basically a batch architecture
  ● no longer MySQL but Cassandra + Hadoop at the back
  ● HTTP + PHP trackers with a piped custom log batch process
  ● S3 upload every 5 min
  ● daily HDFS distcp
  ● POC = daily Hadoop inference -> 6-node Cassandra -> batch integrations
  ● POC was a daily batch job which on bad days took 30 hours
● One of the first commercial Cassandra clusters in the world
  ● very unstable
Where are we today
● Stack
● Java
● Scala
● Hadoop
● Cassandra
● Kafka
● Redis
● R
● AngularJS for the front-end
Where are we today
● Auto-scaling geo-located Tracker Clusters - well, almost auto-scaling
● Robust Streaming Infrastructure - aggregation of all data streams in central infrastructure
  ● bringing in 8.5k events/second at peak
  ● These are primary events which get multiplied by various speed-layer
● Real-time end-user products, scoring services, integrations with third parties where possible, pre-computation infrastructure that scales more predictably
● ETL Pipeline - offloading data streams and pre-computing materialised views onto HDFS > 30TB of primary data
  ● some data we keep only for the last 60 or 90 days, other data we keep forever
● Decision Analytics Pipeline (or RD Pipe) > 100TB+ of secondary data i
  ● Using feature-extraction machine learning methods
Where are we today
● Still one Cassandra ring, just bigger and more stable: 16 nodes, 250M+ active user profiles
● Lambda Architecture for real-time products like WHY Analytics
  ● RD Pipe is the "batch" layer (daily) that generates active profiles in Cassandra ("view layer")
  ● Primary events are enriched with the user profiles produced daily, by the Enrichment service ("speed layer")
  ● A combination of probabilistic counters and Redis cubes calculates the current audience profiles for subscribed websites ("speed layer")
  ● An API on top of the Redis cubes serves the current audience profiles to the front-end suite of real-time analytics products ("serving layer")
  ● The Audience Analytics product suite is the good-looking bit - http://www.visualdna.com/why/
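The slide does not say which probabilistic counters are used, so the following is only a minimal sketch of the idea, assuming a linear-counting bitmap kept in memory instead of Redis. The class name `DistinctCounter`, the `mix` helper and the `long` user ids are all hypothetical; one such counter could stand for a single cell of an audience cube (for example one website and one profile segment).

```java
import java.util.BitSet;

/** Linear-counting sketch: estimates the number of distinct ids from a fixed-size bitmap. */
final class DistinctCounter {
    private final int slots;   // bitmap size m (assumption: chosen well above the expected audience)
    private final BitSet bits;

    DistinctCounter(int slots) {
        if (slots <= 0) {
            throw new IllegalArgumentException("slots must be positive");
        }
        this.slots = slots;
        this.bits = new BitSet(slots);
    }

    /** 64-bit mixing step (the SplitMix64 finaliser) so that similar ids land in unrelated slots. */
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /** Records one sighting of a user; repeat sightings of the same id set the same bit. */
    void add(long userId) {
        int slot = (int) Math.floorMod(mix(userId), (long) slots);
        bits.set(slot);
    }

    /** Estimate n = -m * ln(V), where V is the fraction of bits that are still zero. */
    double estimate() {
        int zeros = slots - bits.cardinality();
        if (zeros == slots) {
            return 0.0;                          // nothing seen yet
        }
        if (zeros == 0) {
            return Double.POSITIVE_INFINITY;     // bitmap saturated, estimate is meaningless
        }
        return -slots * Math.log((double) zeros / slots);
    }
}
```

Memory stays fixed at `slots` bits no matter how many events arrive, which is the property that makes counters of this kind attractive in a speed layer; the price is an approximate answer.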
Where are we today
● 120-strong team, of which tech is roughly 60:
  ● Sysadmin Team
  ● Architecture Tech Team
  ● Decision Analytics Tech Team
  ● Consumer Tech Team
  ● WHY Analytics Team
What have we learned
● Architecture:
  ● Updating JSON blobs in Cassandra columns is a trap
    ● Logging is better: http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  ● Metrics are crucial in large distributed systems
    ● yammer metrics + graphite + icinga work well for infrastructure
    ● but complex event/anomaly detection and pattern analysis give the edge
  ● Real-time processing of data streams is not only cool, but scales well ... until you find a bottleneck in a single component, which will limit the entire system
  ● Batch still matters
    ● but it could be much faster than Hadoop, which suffers from too much redundant I/O and requires a coordinated ETL pipeline
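The first lesson on this slide (append to a log instead of rewriting a JSON blob) can be illustrated with a small in-memory sketch. It is not the VisualDNA schema: `ProfileLog`, `Event` and the attribute names are hypothetical, and a real system would keep the log in Kafka or in Cassandra rows, not in a Java list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Append-only profile log: writers only append, readers fold the events into a view. */
final class ProfileLog {
    /** One immutable fact about a user (hypothetical shape). */
    static final class Event {
        final String userId;
        final String attribute;
        final String value;

        Event(String userId, String attribute, String value) {
            this.userId = userId;
            this.attribute = attribute;
            this.value = value;
        }
    }

    private final List<Event> events = new ArrayList<>();

    /** Appending never reads the current profile, so there is no read-modify-write cycle to lose updates in. */
    synchronized void append(String userId, String attribute, String value) {
        events.add(new Event(userId, attribute, value));
    }

    /** Replays the log in order; the latest value of each attribute wins. */
    synchronized Map<String, String> materialise(String userId) {
        Map<String, String> profile = new TreeMap<>();
        for (Event e : events) {
            if (e.userId.equals(userId)) {
                profile.put(e.attribute, e.value);
            }
        }
        return profile;
    }
}
```

With a blob, two writers that each read the profile, change one field and write the whole document back can silently overwrite each other. With a log, both facts survive and the view can be rebuilt at any time.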
What have we learned
● Engineering:
  ● the Unix philosophy of building short, simple, clear, modular and extensible code also applies to the design of distributed systems, not just an OS
  ● bad tests are better than no tests, but they are still bad, and most tests only test the positive outcome
    ● the story of Math.abs() -> it can actually return a negative number -> but none of the unit tests anticipated this -> which is why metrics and systems with feedback control are crucial
● Process:
  ● It is possible to co-operate remotely even on complex and not well-defined systems - at the moment some of the architecture team is working remotely on a permanent basis
  ● QA is intrinsic to Architecture and local to products
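The Math.abs() story can be reproduced in a few lines. The slide does not say where the call was used, so the hash-bucketing context below is an assumption (it is a common place for this bug), and `Partitioner`, `naiveBucket` and `safeBucket` are hypothetical names. The underlying fact is documented Java behaviour: `Math.abs(Integer.MIN_VALUE)` overflows and returns `Integer.MIN_VALUE`.

```java
/** Demonstrates the Math.abs edge case and a bucketing helper that avoids it. */
final class Partitioner {
    /** Naive version: Math.abs(Integer.MIN_VALUE) is still negative, so the bucket can be negative too. */
    static int naiveBucket(int hash, int buckets) {
        return Math.abs(hash) % buckets;
    }

    /** Math.floorMod returns a value in [0, buckets) for any int hash, given a positive bucket count. */
    static int safeBucket(int hash, int buckets) {
        return Math.floorMod(hash, buckets);
    }

    public static void main(String[] args) {
        System.out.println(Math.abs(Integer.MIN_VALUE));             // -2147483648
        System.out.println(naiveBucket(Integer.MIN_VALUE, 10));      // -8
        System.out.println(safeBucket(Integer.MIN_VALUE, 10));       // 2
    }
}
```

Only one input in roughly four billion triggers the naive version, which is why a test suite of positive cases can pass for a long time while production eventually hits it.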
Interesting issues we’re facing
1. SLAs vs. start-up dynamics - separate process (and to some degree architecture) for different levels of guarantee of service
2. Globally-distributed highly-available API for random access to our profiles - enabling decisions based on VDNA profiles on demand
3. Our Lambda has a bottleneck at the enrichment point
   - although if we solve (2.) we will be half-way through
4. Complex data pooling attribution model
5. Cassandra still gives us some pain - it's the drivers! - an interesting read about consistency: http://aphyr.com/posts/294-call-me-maybe-cassandra/
6. Preserving start-up dynamics and culture in a company of 200+ with offices in several cities
We’re hiring for the Bratislava office!
● We’re looking for engineers, analysts and more to be based in Bratislava
careers-cee@visualdna.com
