Big Data is changing abruptly, and where it is likely heading

Big Data Spain
2014-11-17
bigdataspain.org
Paco Nathan
@pacoid

Some questions, and an emerging idea…
The economics of datacenters has shifted toward
warehouse scale, based on commodity hardware
that includes multicore and large memory spaces.
Incumbent technologies for Big Data do not
embrace those changes; however, newer OSS
projects have emerged to leverage them.

Some questions, and an emerging idea…
Also, the data requirements are changing abruptly
as sensor data, microsatellites, and other IoT use
cases boost data rates by orders of magnitude.
We have math available to address those industrial
needs, but have software frameworks kept up?

If you’re not using containers, you’re missing out…
See recent announcements by Google, Amazon, etc.
Containers are the tip of the iceberg: operating systems
have evolved quietly, while practices closer to customers
have perhaps not kept pace
Containers

Key benefits based on Mesos case studies:
• mixed workloads
• higher utilization rates
• lower latency for data products
• significantly lower Ops overhead
Examples: Mesos @Twitter, eBay, Netflix, many
startups, etc.; Borg/Omega @Google,
Symphony @IBM, Autopilot @Microsoft, etc.
Major cloud users understand this shift; however,
much of the world still thinks in terms of VMware
Containers

We need much more than Docker: we also need better
definitions of what is to be used outside of the
container – e.g., for microservices
• Intro to Mesos: goo.gl/PbUfNe
• https://mesosphere.com/
• A project to watch: Weave
Containers

One problem: how does an institution, such
as a major bank, navigate this kind of change?

Speaking of clouds, forgive me for being biased…
For nearly a decade, I’ve been talking with people
about whether cloud economics makes sense.
Consider: datacenter utilization rates in the single digits (~8%)
Clouds

Companies accustomed to cheap energy rates,
with staff who think in terms of individual VMs…
these will not find clouds to be friendly territory.
This is not so much about software as it is about
sophisticated engineering practice.
The change recalls the shift of BI ⇒ Data Science
That was not simple…
Clouds

The majors understand this new world: Google,
Amazon, Microsoft, IBM, Apple, etc.
Sub-majors can create their own clouds: Facebook,
Twitter, LinkedIn, etc.
Clouds
What about industries approaching Big Data now
which cannot afford to buy small armies of expert
Ops people? How they must proceed becomes a
key question.

Current research points to significant changes
ahead, where machine learning plays a key role
in the context of advanced cluster scheduling:
Clouds
Improving Resource Efficiency with Apache Mesos
Christina Delimitrou
youtu.be/YpmElyi94AA
Experiences with Quasar+Mesos showed:
• 88% of apps get >95% performance
• ~10% overprovisioning instead of 500%
• up to 70% cluster utilization at steady state
• 23% shorter scenario completion

A question on Stack Exchange in 2013 challenged
Twitter’s use of Abstract Algebra libraries for their
revenue apps…
cs.stackexchange.com/questions/9648/what-use-are-groups-monoids-and-rings-in-database-computations
The answers are enlightening: think of a kind of
containerization for business logic – leading to
an exponential increase in performance, particularly
in the context of streaming cases.
Abstract Algebra

Abstract Algebra
A cheat-sheet:
• Semigroup: a non-empty set that has a binary associative
operation, with closure
• Monoid: a semigroup that has an identity element
• Group: a monoid in which each element has an inverse
• Ring: has two binary associative operations:
addition and multiplication

Add ALL the Things: Abstract Algebra Meets Analytics
infoq.com/presentations/abstract-algebra-analytics
Avi Bryant (@avibryant), Strange Loop (2013)
• grouping doesn’t matter (associativity)
• ordering doesn’t matter (commutativity)
• zeros get ignored
In other words, while partitioning data
at scale is quite difficult, you can let the
math allow your code to be flexible at
scale
Abstract Algebra
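The three bullets above can be made concrete with a minimal sketch in plain Python. The `Monoid` class, its `combine_all` helper, and the sample data are hypothetical names chosen for illustration; they are not the API of Algebird or of any library mentioned in the talk.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Generic, Iterable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Monoid(Generic[T]):
    """Hypothetical helper: an identity element plus an associative operation."""
    zero: T
    plus: Callable[[T, T], T]

    def combine_all(self, items: Iterable[T]) -> T:
        # Fold the items left to right, starting from the identity element.
        return reduce(self.plus, items, self.zero)


int_add = Monoid(0, lambda a, b: a + b)
max_monoid = Monoid(float("-inf"), max)

data = [3, 1, 4, 1, 5, 9, 2, 6]

# Reference answer: a single pass over all the data.
whole = int_add.combine_all(data)

# Grouping doesn't matter: split the data into arbitrary partitions
# (one of them empty), aggregate each, then aggregate the partial results.
partitions = [data[:3], data[3:4], [], data[4:]]
partials = [int_add.combine_all(p) for p in partitions]
regrouped = int_add.combine_all(partials)

# Ordering doesn't matter for a commutative operation such as addition:
# the partial aggregates may arrive in any order.
reordered = int_add.combine_all(reversed(partials))

# Zeros get ignored: the empty partition contributed the identity element.
assert partials[2] == int_add.zero
assert whole == regrouped == reordered
```

The same partition-then-merge shape works for `max_monoid`, or for any other operation that satisfies the monoid laws, which is what lets the partitioning strategy change without touching the aggregation code.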

Algebra for Analytics
speakerdeck.com/johnynek/algebra-for-analytics
Oscar Boykin (@posco), Strata SC (2014)
• “Associativity allows parallelism
in reducing” by letting you put
the () where you want
• “Lack of associativity increases
latency exponentially”
Abstract Algebra
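One way to read the two quotes above is to count how many sequential rounds a reduction needs. The sketch below is an illustration under stated assumptions, not code from the talk: `sequential_reduce` and `tree_reduce` are hypothetical names, the input is assumed non-empty, and the "parallel" rounds are only simulated in a single thread.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sequential_reduce(items: Sequence[T], op: Callable[[T, T], T]) -> Tuple[T, int]:
    """Left fold: each step depends on the previous one, so n - 1 sequential steps."""
    acc = items[0]
    steps = 0
    for x in items[1:]:
        acc = op(acc, x)
        steps += 1
    return acc, steps


def tree_reduce(items: Sequence[T], op: Callable[[T, T], T]) -> Tuple[T, int]:
    """Pairwise reduction: about log2(n) rounds. Valid only if op is associative."""
    level: List[T] = list(items)
    rounds = 0
    while len(level) > 1:
        # Every pair in a round is independent of the others, so a real
        # system could evaluate all of them at the same time.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element carries over to the next round
        level = nxt
        rounds += 1
    return level[0], rounds


numbers = list(range(1024))
seq_total, seq_steps = sequential_reduce(numbers, lambda a, b: a + b)
tree_total, tree_rounds = tree_reduce(numbers, lambda a, b: a + b)

# String concatenation is associative but not commutative; adjacent pairing
# keeps the original order, so the result is unchanged.
seq_text, _ = sequential_reduce(list("abcdefg"), lambda a, b: a + b)
tree_text, text_rounds = tree_reduce(list("abcdefg"), lambda a, b: a + b)
```

For 1024 inputs the left fold needs 1023 dependent steps while the pairwise version needs 10 rounds; without associativity only the left fold is safe, which is the gap the second quote points at.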

Functional Programming for Big Data
Theory, eight decades ago:
what can be computed?
Haskell Curry (haskell.org), Alonzo Church (wikipedia.org)
Praxis, four decades ago:
algebra for applicative systems
John Backus (acm.org), David Turner (wikipedia.org)
Reality, two decades ago:
machine data from web apps
Pattie Maes (MIT Media Lab)

circa 2002:
mitigate risk of large distributed workloads lost
due to disk failures on commodity hardware…
Google File System
Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung
research.google.com/archive/gfs.html
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean, Sanjay Ghemawat
research.google.com/archive/mapreduce.html

DryadLINQ influenced a new class of workflow
abstractions based on functional programming:
Spark, Flink, Scalding, Scoobi, Scrunch, etc.
• needed for leveraging the newer hardware:
multicore, large memory spaces, etc.
• significant reduction in code volume ⇒ lower eng costs

2002: MapReduce @ Google
2004: MapReduce paper
2006: Hadoop @ Yahoo!
2008: Hadoop Summit
2010: Spark paper
2014: Apache Spark top-level

General Batch Processing: MapReduce
Specialized Systems (iterative, interactive, streaming, graph, etc.):
Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4, F1, MillWheel
MR doesn’t compose well for large applications,
and so specialized systems emerged as workarounds

circa 2010:
a unified engine for enterprise data workflows,
based on commodity hardware a decade later…
Spark: Cluster Computing with Working Sets
Matei Zaharia, Mosharaf Chowdhury,
Michael Franklin, Scott Shenker, Ion Stoica
people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for
In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

One minor problem: where do we find
many graduating students who are good at
Scala, Clojure, F#, etc.?

More than one generation of computer scientists has
grown up thinking that Data implies SQL…
SQL is excellent as a kind of declarative, functional
language for describing data transformations
However, in the long run we should move away
from thinking in terms of SQL
Computational Thinking as a rubric in
general would be preferred
Databases

A step further is to move away from thinking of
data as an engineering issue – it is about business
So much of the US math curriculum is still stuck
in the Cold War: years of calculus to identify the
best people to build ICBMs
We need more business people thinking in terms
of math, beyond calculus, to contend with difficult
problems in supply chain, maintenance schedules,
energy needs, transportation routes, etc.
Databases

Databases
1. start with real-world data ⇒
2. leverage graph queries for representation ⇒
3. convert to sparse matrix, use FP and abstract algebra ⇒
4. achieve high-ROI parallelism at scale, mostly about optimization

Approximations
19th–20th century statistics emphasized defensibility
in lieu of predictability, based on analytic
variance and goodness-of-fit tests
That approach inherently led toward a
manner of computational thinking based
on batch windows
They missed a subtle point…

21st-century shift toward modeling based on probabilistic
approximations: trade bounded errors for greatly
reduced resource costs
highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
Approximations

Twitter catch-phrase:
“Hash, don’t sample”
Approximations

• a fascinating and relatively new area, pioneered
by relatively few people – e.g., Philippe Flajolet
• provides approximation, with error bounds –
in general uses significantly fewer resources
(RAM, CPU, etc.)
• many algorithms can be constructed from
combinations of read and write monoids
• aggregate different ranges by composing
hashes, instead of repeating full queries
Approximations
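To illustrate the last two points, here is a minimal Count-Min Sketch in plain Python whose `merge` acts as the monoid "plus": summaries for separate time ranges are combined cell by cell instead of re-scanning the raw data. This is a sketch under assumptions, not a production implementation: the class layout, the default `width` and `depth`, and the use of SHA-256 with a row prefix as the hash family are all choices made for this example.

```python
import hashlib
from typing import Iterable, List


class CountMinSketch:
    """Approximate frequency counts; estimates can over-count, never under-count."""

    def __init__(self, width: int = 2000, depth: int = 5) -> None:
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One hash function per row, derived by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate cells, so the minimum is the tightest bound.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

    def merge(self, other: "CountMinSketch") -> "CountMinSketch":
        # Cell-wise addition is associative and commutative, and an empty
        # sketch is the identity, so sketches of equal shape form a monoid.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same shape")
        merged = CountMinSketch(self.width, self.depth)
        merged.table = [[a + b for a, b in zip(r1, r2)]
                        for r1, r2 in zip(self.table, other.table)]
        return merged


def sketch_of(items: Iterable[str]) -> CountMinSketch:
    cms = CountMinSketch()
    for item in items:
        cms.add(item)
    return cms


hour1 = ["a", "b", "a"]
hour2 = ["a", "c"]

# Compose the two hourly summaries instead of repeating a full query.
both_hours = sketch_of(hour1).merge(sketch_of(hour2))
full_scan = sketch_of(hour1 + hour2)
```

Merging the hourly sketches yields exactly the table a full scan would have built, so any range of hours can be answered by composing stored summaries.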

algorithm          use case
Count-Min Sketch   frequency summaries
HyperLogLog        set cardinality
Bloom Filter       set membership
MinHash            set similarity
DSQ                streaming quantiles
SkipList           ordered sequence search
Approximations
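As one concrete row from the table, here is a minimal Bloom filter for set membership. It is a sketch with assumed parameters: `num_bits`, `num_hashes`, and the SHA-256-based hash family are picked for illustration, and a Python integer stands in for the bit array.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Set membership with possible false positives and no false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions by prefixing the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        # True only if every position for this item has been set.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        # Bitwise OR merges two filters of the same shape (another monoid).
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must have the same shape")
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


seen_eu = BloomFilter()
seen_us = BloomFilter()
for user in ["ana", "bruno"]:
    seen_eu.add(user)
for user in ["carla"]:
    seen_us.add(user)

seen_anywhere = seen_eu.union(seen_us)
```

A membership test answers "definitely not seen" or "probably seen"; the false-positive rate depends on the number of bits, the number of hashes, and how many items were added.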

suggestion: consider these
as your most quintessential
collections data types at scale
Approximations

• sketch algorithms: trade bounded errors for
orders of magnitude fewer required resources,
e.g., fit more complex apps in memory
• multicore + large memory spaces (off heap) are
increasing the resources per node in a cluster
• containers allow for finer-grain allocation of
cluster resources and multi-tenancy
• monoids, etc.: guarantees of associativity within
the code allow for more effective distributed
computing, e.g., partial aggregates
• fewer resources must be spent sorting/windowing
data prior to working with a data set
• real-time apps, which don’t have the luxury of
anticipating data partitions, can respond quickly
Approximations

• Probabilistic Data Structures for Web Analytics and Data Mining
Ilya Katsov (2012-05-01)
• A collection of links for streaming algorithms and data structures
Debasish Ghosh
• Aggregate Knowledge blog (now Neustar)
Timon Karnezos, Matt Curcio, et al.
• Probabilistic Data Structures and Breaking Down Big Sequence Data
C. Titus Brown, O'Reilly (2010-11-10)
• Algebird
Avi Bryant, Oscar Boykin, et al. Twitter (2012)
• Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman, Cambridge (2011)
Approximations

Spreadsheets represented a revolution in how
to think about computing
What Fernando Perez, et al., have done with
IPython is subtle and powerful
Technology is not just about writing code; it is
about people using that code
Enterprise data workflows are moving toward
cloud-based notebooks
Notebooks

Early emphasis on tools in Big Data now gives
way to workflows: people, automation, process,
integration, test, maintenance, scale
Google had the brilliant approach of shared
documents…
Now we see cloud-based notebooks, which
begin to displace web apps
Notebooks
http://nbviewer.ipython.org/
http://jupyter.org/

Notebooks
nature.com/news/ipython-interactive-demo-7.21492

What is Spark?
Developed in 2009 at UC Berkeley AMPLab, then
open sourced in 2010, Spark has since become
one of the largest OSS communities in big data,
with over 200 contributors in 50+ organizations
spark.apache.org
“Organizations that are looking at big data challenges –
including collection, ETL, storage, exploration and analytics –
should consider Spark for its in-memory performance and
the breadth of its model. It supports advanced analytics
solutions on Hadoop clusters, including the iterative model
required for machine learning and graph analysis.”
Gartner, Advanced Analytics and Data Science (2014)

What is Spark?
Spark Core is the general execution engine for the
Spark platform that other functionality is built atop:
• in-memory computing capabilities deliver speed
• general execution model supports wide variety
of use cases
• ease of development – native APIs in Java, Scala,
Python (+ SQL, Clojure, R)

What is Spark?
WordCount in 3 lines of Spark
WordCount in 50+ lines of Java MR
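The code listings from this slide did not survive extraction. As a rough stand-in, the sketch below mirrors the usual flatMap → map → reduceByKey shape of a Spark word count using only plain Python; it is not the Spark API, and `reduce_by_key` and the sample `lines` are hypothetical names introduced for this example.

```python
from operator import add
from typing import Callable, Dict, Iterable, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")


def reduce_by_key(pairs: Iterable[Tuple[K, V]], op: Callable[[V, V], V]) -> Dict[K, V]:
    """Hypothetical helper: combine all values that share a key using op."""
    out: Dict[K, V] = {}
    for key, value in pairs:
        out[key] = op(out[key], value) if key in out else value
    return out


lines = ["to be or not to be", "to see or not to see"]

# The three logical steps of the word count:
words = (word for line in lines for word in line.split())  # like flatMap
pairs = ((word, 1) for word in words)                      # like map
counts = reduce_by_key(pairs, add)                         # like reduceByKey
```

Because `add` is associative and commutative, the per-key reduction could be split across partitions and merged afterwards, which is what makes the short functional form practical at scale.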

What is Spark?
Sustained exponential growth, as one of the most
active Apache projects
ohloh.net/orgs/apache

• generalized patterns
⇒ unified engine for many use cases
• lazy evaluation of the lineage graph
⇒ reduces wait states, better pipelining
• generational differences in hardware
⇒ off-heap use of large memory spaces
• functional programming / ease of use
⇒ reduction in cost to maintain large apps
• lower overhead for starting jobs
• less expensive shuffles
What is Spark?
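The "lazy evaluation of the lineage graph" point can be sketched with a toy dataset class: transformations only record themselves, and nothing runs until a result is requested, at which point each element flows through the whole chain in one pass. `LazyDataset` and its methods are hypothetical and far simpler than Spark's RDDs; the sketch shows only the deferral and pipelining idea.

```python
from typing import Any, Callable, Iterable, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class LazyDataset:
    """Toy dataset that records its lineage instead of computing eagerly."""

    def __init__(self, source: Iterable[Any], lineage: Tuple[Step, ...] = ()) -> None:
        self._source = source
        self._lineage = lineage

    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        # No work happens here: only the lineage grows.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        return LazyDataset(self._source, self._lineage + (("filter", pred),))

    def collect(self) -> List[Any]:
        # Each element is pushed through the whole recorded chain in a single
        # pass, so no intermediate collection is materialized between steps.
        results = []
        for item in self._source:
            for kind, fn in self._lineage:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):  # a filter step rejected the element
                    break
            else:
                results.append(item)
        return results


calls: List[int] = []


def traced_square(x: int) -> int:
    calls.append(x)  # record when the function actually runs
    return x * x


dataset = LazyDataset(range(6)).map(traced_square).filter(lambda x: x % 2 == 0)
calls_before_collect = list(calls)  # still empty: nothing has executed yet
even_squares = dataset.collect()
```

Since the full chain is known before anything executes, an engine has room to fuse steps and skip unnecessary work, which is one reading of "reduces wait states, better pipelining".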

What is Spark?
Kafka (data streams) + Spark (unified compute) + Cassandra (columnar key-value)
datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkIntro.html
http://helenaedelson.com/?p=991
github.com/datastax/spark-cassandra-connector
github.com/dibbhatt/kafka-spark-consumer

databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
What is Spark?

What is Spark?
Databricks Cloud
databricks.com/blog/2014/07/14/databricks-cloud-making-big-data-easy.html
youtube.com/watch?v=dJQ5lV5Tldw#t=883

certification:
Apache Spark developer certificate program
• http://oreilly.com/go/sparkcert
• defined by Spark experts @Databricks
• assessed by O’Reilly Media
• establishes the bar for Spark expertise

community:
spark.apache.org/community.html
video+slide archives: spark-summit.org
events worldwide: goo.gl/2YqJZK
resources: databricks.com/spark-training-resources
workshops: databricks.com/spark-training
new: MOOCs via edX and University of California

books:
• Fast Data Processing with Spark
Holden Karau, Packt (2013)
shop.oreilly.com/product/9781782167068.do
• Spark in Action
Chris Fregly, Manning (2015*)
sparkinaction.com/
• Learning Spark
Holden Karau, Andy Konwinski, Matei Zaharia, O’Reilly (2015*)
0636920028512.do

events:
• Strata EU, Barcelona, Nov 19-21: strataconf.com/strataeu2014
• Data Day Texas, Austin, Jan 10: datadaytexas.com
• Strata CA, San Jose, Feb 18-20: strataconf.com/strata2015
• Spark Summit East, NYC, Mar 18-19: spark-summit.org/east
• Strata EU, London, May 5-7: strataconf.com/big-data-conference-uk-2015
• Spark Summit 2015, SF, Jun 15-17: spark-summit.org

presenter:
monthly newsletter for updates,
events, conf summaries, etc.:
liber118.com/pxn/
Just Enough Math
O’Reilly, 2014
justenoughmath.com
preview: youtu.be/TQ58cWgdCpA
Enterprise Data Workflows
with Cascading
O’Reilly, 2013
0636920028536.do
