Big Data :
Bits of History, Words of Advice
Venu Vasudevan
GLSEC Big Data Meetup
Big Data Past
Big
Fast
intelligent
media
IoT
satellites
Big Data : Behavioral
Big Data
- The ‘V’ view of Big Data challenges
- Number of V’s up for debate
Big Data : Architectural
[Diagram: untidy data firehose → clean analytics, via a 'fast & good' path and a 'slower & much better' path. Labels: Lambda architecture, Lake architecture, Stream architecture.]
Technical
Technical
This Talk
Behavioral
View
Technology
Solution
Stack
‘Middleware’
(benefit of
hindsight)
some more some
governance culture (gap)
data economics
ownership
food fights
3 data points
Big
Fast
intelligent
media
IoT
satellites
Iridium
• mobile routers (10K mph), fixed
people
• no repeated patterns
• satellites N-S movement
• earth E-W movement
• regular topology, irregular
exceptions
• solar flares
• military satellite presence
Fast Data Problem
• cellular frequency allocation
(graph coloring problem)
• frequent fast recalculations (fast
routers + semi-fast earth)
• transmit-no transmit (solar flares,
military satellite presence)
• moving ‘seam’
seam
irregularities
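
The slide frames cellular frequency allocation as a graph coloring problem. A minimal sketch of that framing, assuming cells are graph nodes, an edge joins two cells that would interfere, and a "color" is a channel index. The greedy heuristic, the function name and the `silenced` parameter are illustrative assumptions, not the method used on Iridium.

```python
from typing import Dict, Hashable, Iterable, Set


def allocate_frequencies(
    interference: Dict[Hashable, Set[Hashable]],
    silenced: Iterable[Hashable] = (),
) -> Dict[Hashable, int]:
    """Greedy graph coloring: give each cell the lowest channel index
    not already used by an interfering neighbour.

    `interference` maps each cell to the cells it would interfere with.
    `silenced` lists cells that must not transmit at all (a hypothetical
    stand-in for the transmit/no-transmit constraints on the slide).
    """
    muted = set(silenced)
    channels: Dict[Hashable, int] = {}
    # Visit the most constrained cells first (highest degree first).
    # This is a common heuristic, not a guarantee of the fewest channels.
    order = sorted(interference, key=lambda c: len(interference[c]), reverse=True)
    for cell in order:
        if cell in muted:
            continue
        taken = {channels[n] for n in interference[cell] if n in channels}
        channel = 0
        while channel in taken:
            channel += 1
        channels[cell] = channel
    return channels


# Toy topology: four cells in a ring, each interfering with its two neighbours.
ring = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C", "A"}}
plan = allocate_frequencies(ring)
muted_plan = allocate_frequencies(ring, silenced=["B"])
```

Because the satellites and the earth both move, the interference graph keeps changing, so an allocation like this would have to be recomputed often; that recomputation rate is what makes it a fast-data problem.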
Fast Data Problem
• cellular frequency allocation
(graph coloring problem)
• frequent fast recalculations (fast
routers + semi-fast earth)
• transmit-no transmit (solar flares,
military satellite presence)
• moving ‘seam’
• + ‘France’
seam
irregularities
broadcast
= +$$$
broadcast
= -$$$ (lawsuit)
Fast Data Problem
• quest for (OO)DB technology to
address ‘France’ as a
make-or-break use case
• query expressive power
• complex constraint satisfaction
• query handling throughput
• 3-4 month benchmarking effort
seam
broadcast
= +$$$
broadcast
= -$$$ (lawsuit)
Fast Data Problem
• quest for (OO)DB technology to
address ‘France’
• query expressive power
• query handling throughput
• 3-4 month benchmarking effort
• France solved ‘out-of-band’
(legally)
seam
broadcast
= +$$$
broadcast
= -$$$ (lawsuit)
Don’t overfit your architecture to
an extreme requirement
unless it’s from an extreme (paying) user
Big Data Problem
• systems management
• manage 66 ‘nodes’
• nodes moving at 10K mph
• ‘seam’ moving at 20K mph
• sounds harder than trivial, but
not too hard
‘Pre’ Lambda Solution
• Dumb edge | smart core
approach
• 15K events/sec/satellite
• ~1M events/sec in aggregate
(66 satellites × 15K)
• Fast & Approximate - FMEA:
‘compiled’ lookup table for
failure modes
• Slow & Precise - Model-based
reasoning on satellite models
[Diagram: ‘Pre’ Lambda architecture. Untidy satellite firehose (1M events/sec) → FMEA and Model-Based Reasoning → actionable insights.]
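
The two paths on this slide can be sketched in a few lines. This is a minimal illustration only, assuming a dictionary stands in for the 'compiled' FMEA table and a simple predicted-versus-observed check stands in for model-based reasoning. The event fields, table entries, nominal values and tolerance are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

SATELLITES = 66                 # node count from the slides
EVENTS_PER_SATELLITE = 15_000   # events/sec/satellite from the slides
AGGREGATE_RATE = SATELLITES * EVENTS_PER_SATELLITE  # 990,000, roughly 1M events/sec


@dataclass(frozen=True)
class Event:
    satellite: int
    subsystem: str
    symptom: str
    reading: float


# Fast & approximate: a precompiled failure-mode table.
# Keys and diagnoses are invented for illustration.
FMEA_TABLE: Dict[Tuple[str, str], str] = {
    ("power", "undervoltage"): "battery cell degradation",
    ("thermal", "overtemp"): "radiator fault",
}


def fast_path(event: Event) -> Optional[str]:
    """Constant-time lookup; None means the table has no matching entry."""
    return FMEA_TABLE.get((event.subsystem, event.symptom))


def slow_path(
    events: List[Event], predict: Callable[[Event], float], tolerance: float
) -> List[Event]:
    """Slow & precise stand-in: keep events whose reading departs from the
    model's prediction by more than `tolerance`."""
    return [e for e in events if abs(e.reading - predict(e)) > tolerance]


def process(
    events: List[Event], predict: Callable[[Event], float], tolerance: float
) -> Tuple[List[Tuple[Event, str]], List[Event]]:
    """Run both paths over the same events, as a lambda-style design would."""
    quick: List[Tuple[Event, str]] = []
    for event in events:
        diagnosis = fast_path(event)
        if diagnosis is not None:
            quick.append((event, diagnosis))
    return quick, slow_path(events, predict, tolerance)


# Hypothetical nominal readings per subsystem, used as the "model".
NOMINAL = {"power": 28.0, "thermal": 40.0, "attitude": 0.0}
events = [
    Event(1, "power", "undervoltage", 21.0),
    Event(2, "thermal", "nominal", 40.0),
    Event(3, "attitude", "drift", 5.0),
]
quick, anomalies = process(events, lambda e: NOMINAL[e.subsystem], tolerance=2.0)
```

The lookup answers immediately but only for failure modes someone anticipated; the model check is slower but also catches the attitude drift that has no table entry.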
‘Pre’ Lambda Solution
• Dumb edge | smart core
approach
• 15K events/sec/satellite
• Fast & Approximate - FMEA:
‘compiled’ lookup table for
failure modes
• Slow & Precise - Model-based
reasoning on satellite models
• Simple, straightforward &
wrong.
[Diagram: ‘Pre’ Lambda architecture. Untidy satellite firehose (1M events/sec) → FMEA and Model-Based Reasoning (real-time expert system) → actionable insights.]
Yet, an architecture that is
‘rinsed and repeated’
over the years
why does dumb edge
smart cloud endure?
• edges are expensive ($2B)
• when edges go wrong
(break/blow up/collide),
they make headlines
$
$$$$$
why dumb edge smart
cloud
• edges are expensive ($2B)
• when edges go wrong
(break/blow up/collide),
they make headlines
• nobody messes with an
‘edge’ once it works
• clouds don’t make for good
news headlines
$
T-0
$$$$$
T-30 yrs
why dumb edge smart
cloud
• edges are expensive ($2B)
• when edges go wrong
(break/blow up/collide),
they make news headlines
• nobody messes with an
‘edge’ once it works
• thus, implementing an
end-to-end architecture causes
culture clashes
over my
dead body
iterate &
refine
an almost repeat
(Industrial IoT)
• edges are messy & domain
specific
• creating them means
dealing with culture clashes
• but .. an ounce of edge is
worth a pound of cloud
$$$$$
T-30 yrs
$
T-0
Things to consider
• Problem statement. What’s your ‘France’?
• colorful sub-problem. strategy overfit.
• Architecture. small fixes to IT/OT gap can go a long way to
a simpler problem
• Technology Choices. best practices & the risk of ‘rewardless
risk’
• done right - new tech makes average programmers
productive
• more frequently - it turns great programmers into average ones
Big Data to Deep Metadata
streaming video (TV) ~ 1 petabyte/day
second
minute
hour
day/week
epochal
detect &
replace ads
Create Playlists by
Player,
Play, Sentiment
Identify minor characters
with rabid fan following
rejuvenate old content
derive new content
‘chapterize’ by
Player,
Play, Sentiment
Platform Triage Challenge
new Product, new market
• one core technology, many
markets
• platform triaging challenge.
what drives the platform?
• highest (but uncertain) $
potential?
• ‘extreme’ requirement?
• sparsest competition?
• use case outlier is your biggest
customer
deep
metadata
technology
SaaS
data
platform
Advertising
Search
Video
concept
maps
ad replacement use case
• speed
• few days (on-demand content)
• few seconds (real-time rebroadcast with
new ads)
• precision
• low - best effort, for low cost
international content for niche audiences
• high - frame level for expensive content,
e.g. sports or $10M/episode programming
• errors
• 90% accuracy - ok for long tail content
• ‘five nines’ for premium content
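
To make the gap between the two ends of this requirement space concrete, here is a small worked sketch. The grouping of the slide's values into two named tiers, and the specific latency figures chosen for "few days" and "few seconds", are assumptions for illustration; only the 90% and 'five nines' accuracy targets come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_latency_s: float   # how long replacement may take
    frame_accurate: bool   # high precision = frame-level splice points
    accuracy: float        # fraction of replacements that must be correct


# Assumed values: 3 days for "few days", 5 seconds for "few seconds".
LONG_TAIL = Tier("long tail", max_latency_s=3 * 24 * 3600, frame_accurate=False, accuracy=0.90)
PREMIUM = Tier("premium", max_latency_s=5.0, frame_accurate=True, accuracy=0.99999)


def expected_errors(tier: Tier, replacements: int) -> int:
    """Errors tolerated by the tier's accuracy target over `replacements` ads."""
    return round((1 - tier.accuracy) * replacements)


long_tail_errors = expected_errors(LONG_TAIL, 1_000_000)
premium_errors = expected_errors(PREMIUM, 1_000_000)
```

Per million ad replacements, the long-tail target tolerates about 100,000 errors while 'five nines' tolerates about 10, a difference of four orders of magnitude for the same platform.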
[Diagram: ad replacement opportunity space, with axes precision, accuracy and speed; the largest customer is marked.]
occam’s razor works (again)
• build to simplicity
• loose coupling between data
engineering & equipment engineering
• modularize complexity
• ‘differentiate your product’
changes
• ‘necessary evil’ changes
data-only
approach
+1st party integration
(dynamically configure
ad splicers)
3rd party knobs
(dynamically refresh CDN)
Architecture
but, what if ..
• Data is untidy
• Interpretation is subjective/cultural
• Automation is aspirational but quixotic
human-powered analytics
• some analytics tasks are too
‘slippery’ for machines
• data hard to characterize
• uneven video quality of ‘old’
archives
• untidy
• insights are subjective
human-powered analytics
• some analytics tasks are too
‘slippery’ for machines
• need for human
augmentation
• humans generate ‘training’
sets to bootstrap machine learning
• humans completely take over
some tasks
machines vs humans
• crowdsourcing &
human-powered computing
• has been the ‘next big thing’
for a while
• checkered history:
• uneven output
• fraud
• uneven throughput
Machines     Humans
fast         slow
brittle      malleable
objective    subjective
clear        nuanced
machines vs humans
• much of that has changed
• Amazon Mechanical Turk
• 500K active users
• the ‘human machine’ can
return substantial jobs in
under 30 mins
• quantifiable as a machine for
many media tasks - latency,
quality, error rate, throughput
Hybrid Architecture
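
One common way to combine the two, sketched here under assumptions: the machine labels everything, and only low-confidence items go to a human queue, whose answers also become training examples. The confidence threshold, clip names and labels are hypothetical, and nothing here is specific to Mechanical Turk's API.

```python
from typing import Callable, Dict, List, Tuple

Prediction = Tuple[str, float]  # (label, confidence in [0, 1])


def route(
    clips: List[str],
    machine: Callable[[str], Prediction],
    threshold: float = 0.8,
) -> Tuple[Dict[str, str], List[str]]:
    """Accept confident machine labels; queue the rest for human review."""
    accepted: Dict[str, str] = {}
    human_queue: List[str] = []
    for clip in clips:
        label, confidence = machine(clip)
        if confidence >= threshold:
            accepted[clip] = label
        else:
            human_queue.append(clip)
    return accepted, human_queue


def merge_human_labels(
    accepted: Dict[str, str], human_labels: Dict[str, str]
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Combine both sources; human answers double as new training examples."""
    training_examples = list(human_labels.items())
    return {**accepted, **human_labels}, training_examples


# Stand-in for a real classifier: canned (label, confidence) outputs.
scores = {
    "clip-1": ("goal", 0.97),
    "clip-2": ("foul", 0.55),
    "clip-3": ("goal", 0.80),
}
accepted, human_queue = route(list(scores), scores.__getitem__)
final, training_examples = merge_human_labels(accepted, {"clip-2": "dive"})
```

The threshold is the knob that trades cost and latency (more human work) against error rate, which is why the human side needs to be measurable in the same terms as the machine side.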
Things to consider
• Beware ‘France’ in other forms:
• customer with loudest voice & ‘holy grail’ hairball
• Dealing with data quality & variability
• crowdsourcing has come a long way as credible ‘engine’
• If big data is the answer, what is the question? (have strong opinions,
weakly held)
• decision rationalization
• process automation
• human ‘power tool’ (e.g. compelling visualization) vs imperfect
automation
startup data jiu-jitsu
• How to create a
data-driven strategy before
the data shows up?
• rationalize future
SaaS revenue
models
• justify product
decisions in a
data-driven manner
need data
for product
need product
for data
startup data jiu-jitsu
• How to create a
data-driven strategy before
the data shows up?
• how ‘intelligent’ can
lighting control be
with 50-100K users?
• how do people use
dimmers (continuous
or quantized) — UX
implications
data set dilemma
• standard sources (e.g. Kaggle & UCI) insufficient
• few ‘physical world’ datasets
• expensive to collect
• may be specialized (vendor-specific)
• dataset proxies for IoT actuation may not work
• energy utilization != switch usage
big data, small start
• physical world data likely to
be smaller (1-10 homes, few
months)
• setup costs limit size of public
datasets
• e.g. UMass Smart* light switch
dataset
big data, small start
• consider data
‘augmentation’
• standard practice in AI (deep
learning) - horizontal flips,
random crops …
• under-used in data space
• may need some thought on
perturbation models for your
domain
real
synthesized
https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
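
For non-image data the perturbation model has to be chosen per domain. A minimal sketch for light-switch logs, assuming that shifting each event by a few minutes yields a plausible alternative day; that assumption, the 10-minute bound and the sample events are illustrative and not from the talk.

```python
import random
from typing import List, Tuple

SwitchEvent = Tuple[int, str]  # (seconds since midnight, "on" or "off")


def jitter_day(
    events: List[SwitchEvent], rng: random.Random, max_shift_s: int = 600
) -> List[SwitchEvent]:
    """Return a synthetic copy of one day's events, moving each timestamp by
    up to `max_shift_s` seconds while keeping the original event order."""
    synthetic: List[SwitchEvent] = []
    previous = 0
    for timestamp, state in events:
        shifted = timestamp + rng.randint(-max_shift_s, max_shift_s)
        # Never move an event before its predecessor or past the end of the day.
        shifted = min(max(shifted, previous), 86_399)
        synthetic.append((shifted, state))
        previous = shifted
    return synthetic


def augment(events: List[SwitchEvent], copies: int, seed: int = 0) -> List[List[SwitchEvent]]:
    """Generate `copies` synthetic days from one real day, reproducibly."""
    rng = random.Random(seed)
    return [jitter_day(events, rng) for _ in range(copies)]


# One hypothetical real day: on at 07:00, off at 07:30, on at 18:00, off at 23:00.
real_day = [(25_200, "on"), (27_000, "off"), (64_800, "on"), (82_800, "off")]
synthetic_days = augment(real_day, copies=5)
```

Whether time jitter preserves what matters depends on the question being asked; a study of exact wake-up times, for example, would need a different perturbation.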
In short ..
• big data success - equal parts tech & non-tech
• solving right problem, not just problem right
• revisit problem, and what success means
@venuv62
venu.vasudevan@nextio.co



