big data - an overview
Arvind Kalyan
Engineer at LinkedIn
What makes data big?
Size?
The 100 GB 1980 U.S. Census database was considered ‘big’ at the time and required sophisticated machinery
http://www.columbia.edu/cu/computinghistory/mss.html
Size?
7 billion people on earth…
storing name (~20 chars), age (7 bits), and gender (1 bit) for everyone on earth:
7 billion * (20 * 8 + 7 + 1) / 8 = 147 GB
in the last 25 years that number hasn’t grown much and is still in the same order of magnitude..
Size?
i.e., the cardinality of real-world data doesn’t change or grow very fast… so that by itself really isn’t that much data…
Size?
A server with 128 GB of RAM and multiple TB of disk space is not difficult to find in 2015
Size?
but the observations on those entities can be too many
Those 7 billion people interacting with other people, web-sites, products, etc. at different points in time and in different locations quickly explode into a huge number of data points
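to get a feel for that scale, here is a rough back-of-the-envelope sketch; the per-person interaction rate is an assumed, illustrative figure, not something from real data:

  // illustrative only: the interaction rate below is an assumption
  object ObservationCount {
    def main(args: Array[String]): Unit = {
      val people              = 7e9     // ~7 billion entities
      val interactionsPerDay  = 100.0   // assumed: clicks, views, messages, ...
      val days                = 365.0
      val observationsPerYear = people * interactionsPerDay * days
      println(f"~$observationsPerYear%.2e observations per year")   // ~2.56e14
    }
  }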
Analysis
‘big-data’ is a challenge when you want to analyze all of those observations to identify trends & patterns
RDBMS for performing analysis
RDBMS these days can hold *much* larger
volumes on a single machine
RDBMS for performing analysis
RDBMS is good at storing data & fetching the individual rows that satisfy your query
RDBMS for performing analysis
MySQL, for example, when used right, can guarantee data durability and at the same time provide some of the lowest read latencies
RDBMS for performing analysis
RDBMS can also be used for ad-hoc analysis
the performance of such analysis can be improved by sacrificing some ‘relational’ principles:
for example, by denormalizing tables (cost of copy vs seek)
RDBMS for performing analysis
But this doesn’t scale when you want to run the analysis across all rows for every user
some queries can end up running for seconds, or even days, depending on the size of the tables
RDBMS for performing analysis
.. and then consider the overhead of doing this on an on-going basis
RDBMS for performing analysis
RDBMS is not the right system if you want to look at & process the full data-set
RDBMS for performing analysis
These days data is an asset, and businesses & organizations need to extrapolate trends, patterns, etc. from that data
What’s the solution?
.. if not RDBMS
“Numbers Everyone Should Know”
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lock/unlock 100 ns
Main memory reference 100 ns
Compress 1K bytes with Zippy 10,000 ns
Send 2K bytes over 1 Gbps network 20,000 ns
Read 1 MB sequentially from memory 250,000 ns
Round trip within same datacenter 500,000 ns
Disk seek 10,000,000 ns
Read 1 MB sequentially from network 10,000,000 ns
Read 1 MB sequentially from disk 30,000,000 ns
Send packet CA->Netherlands->CA 150,000,000 ns
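a quick illustration of what these numbers imply for big-data scans; a minimal back-of-the-envelope sketch using the memory and disk figures above:

  // scan-time estimate derived from the figures above:
  //   1 MB sequential read: memory 250,000 ns, disk 30,000,000 ns
  object ScanTime {
    def main(args: Array[String]): Unit = {
      val mbPerTB     = 1024.0 * 1024.0
      val memNsPerMB  = 250000.0
      val diskNsPerMB = 30000000.0
      val memSeconds  = mbPerTB * memNsPerMB  / 1e9   // ~262 s
      val diskSeconds = mbPerTB * diskNsPerMB / 1e9   // ~31,000 s (~8.7 hours)
      println(f"scan 1 TB from memory: $memSeconds%.0f s, from disk: ${diskSeconds / 3600}%.1f h")
    }
  }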
Let’s look closely at disk vs memory
Source: http://deliveryimages.acm.org/10.1145/1570000/1563874/jacobs3.jpg
Let’s look closely at disk vs memory
in general:
disk is slower than SSD, and
SSD is slower than memory
but more importantly…
Let’s look closely at disk vs memory
Memory is not always faster than disk
SSD is not always faster than disk
(it depends on the access pattern: a sequential scan from disk can beat random access to memory, which is the point of the figure above)
.. and at network vs disk..
the network is not always slower than disk
plus, the local machine and its disk have limits, but the network enables you to grow!
Lesson learned
access pattern is important
depending on the data-set (size) & use-case, the access pattern alone can make the difference between possible & not possible!
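to make that concrete, here is a rough sketch comparing random vs sequential reads of the same 1 GB, again using the disk figures from the latency table (the 4 KB block size is an illustrative assumption):

  object AccessPattern {
    def main(args: Array[String]): Unit = {
      val totalMB     = 1024.0
      val seekNs      = 10000000.0   // one disk seek
      val seqNsPerMB  = 30000000.0   // sequential read of 1 MB
      val randomReads = totalMB * 1024 / 4           // 262,144 reads of 4 KB
      val randomSecs  = randomReads * seekNs / 1e9   // ~2,600 s (~44 min) on seeks alone
      val seqSecs     = totalMB * seqNsPerMB / 1e9   // ~31 s
      println(f"random: ${randomSecs / 60}%.0f min vs sequential: $seqSecs%.0f s")
    }
  }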
The solution
more often, these days, it is better to have a distributed system crunch the data
one node gathers partial results from many other nodes and produces the final result
individual nodes are in turn optimized to return results quickly from their partial, local data
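a minimal, single-process sketch of that scatter/gather idea; in a real system each partition would live on a different node, here they are just local slices:

  object ScatterGather {
    def main(args: Array[String]): Unit = {
      val data       = (1 to 1000000).toVector
      val partitions = data.grouped(100000).toVector   // each node's "local data"
      // each node computes a partial result over its own partition
      val partials   = partitions.map(part => part.map(_.toLong).sum)
      // one coordinator gathers the partials and produces the final result
      val total      = partials.sum
      println(s"total = $total")                       // 500000500000
    }
  }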
big-data tech
‘big data’ technologies & tools help you define your task in a higher-level language, abstracting out the details of the distributed system underneath
big-data tech
i.e., they encourage/force you to follow a certain
access pattern
big-data tech
these are still tools
following the suggested best-practices helps make the best use of the tool
at the same time, following anti-patterns can make the situation appear more challenging
big-data tech: DSLs
some of the most popular frameworks happen to be DSLs that generate another set of instructions that actually execute:
Pig, Hive, Scalding, Cascading, etc.
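for a flavor of what such a DSL looks like, here is roughly the canonical Scalding word-count job; the field names and argument-driven paths are the usual tutorial placeholders, and the Scala you write here is translated into Cascading/Hadoop jobs that actually execute:

  import com.twitter.scalding._

  class WordCountJob(args: Args) extends Job(args) {
    TextLine(args("input"))                                            // read lines from HDFS
      .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
      .groupBy('word) { _.size }                                       // count occurrences per word
      .write(Tsv(args("output")))
  }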
big-data tech: DSLs
the biggest challenges with these DSLs are getting them to work, and long-term maintenance
big-data tech: DSLs
when using DSLs, code is written in one language and executed in another
if anything fails, the error message is usually associated with the latter, and you have to know enough about the abstracted layer to translate it back to your code
big-data tech: DSLs
but it usually works for the most part, because..
some of these popular frameworks have active
user-groups and blog posts that’ll help you get
the job done
big-data tech: DSLs
so, the more popular the technology, the better the documentation & support
also, the bigger the community, the better the likelihood that the technology will keep evolving
so start with the most popular/common technology that does the job, even if it means compromising on some ‘cool’ feature provided by another, currently less popular tool, unless that feature is absolutely necessary
big-data tech: map/reduce
most of the current (as of Feb 2015) ‘big-data’
frameworks revolve around the map/reduce
paradigm
… and use hadoop technologies underneath
big-data tech: map/reduce
hadoop technologies for big-data processing can be seen as 2 major components:
hdfs => a distributed filesystem for storage
‘map/reduce’ => a programming model
big-data tech: hdfs
hdfs stores data in an immutable format
data is ‘partitioned’ & stored on different machines
there are also multiple copies of each ‘part’, i.e., it is replicated
big-data tech: hdfs
‘partitioning’ enables faster processing in parallel
it also helps with data locality: code can run where the data is, and not the other way around
big-data tech: hdfs
‘replication’ increases the availability of the data itself
it also helps with the overall performance & availability of tasks running on it (speculative execution): run the same task on multiple replicas, and wait for one of them to finish
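a toy sketch of that idea using plain Scala Futures; fetchFromReplica is a hypothetical stand-in for running the task against one replica’s local copy:

  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration._

  object SpeculativeExecution {
    // hypothetical stand-in: run the task against one replica's copy of the data
    def fetchFromReplica(replica: String): Future[String] =
      Future { s"result from $replica" }

    def main(args: Array[String]): Unit = {
      val attempts = Seq("replica-1", "replica-2", "replica-3").map(fetchFromReplica)
      val first    = Future.firstCompletedOf(attempts)   // whichever replica answers first wins
      println(Await.result(first, 10.seconds))
    }
  }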
big-data tech: map/reduce
‘map/reduce’ is a functional programming concept
map => processes & transforms each set of data points in parallel and outputs (key, value) pairs
reduce => gathers those partial results and produces the final result
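the same idea expressed with ordinary Scala collections, to make the (key, value) flow concrete with a word count:

  object MapReduceSketch {
    def main(args: Array[String]): Unit = {
      val docs = Seq("the quick brown fox", "the lazy dog")
      // map phase: each record is turned into (key, value) pairs
      val mapped  = docs.flatMap(_.split(" ").map(word => (word, 1)))
      // reduce phase: values are aggregated per key
      val reduced = mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }
      println(reduced)   // Map(the -> 2, quick -> 1, brown -> 1, ...)
    }
  }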
big-data tech: map/reduce
‘map/reduce’ in hadoop also has one more
important step between map and reduce
shuffle: makes sure all values for a given ‘key’ end up on the same reducer
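a rough sketch of what the shuffle guarantees: a deterministic partitioner maps every key to exactly one reducer, so all values for that key land together (the reducer count here is arbitrary):

  object ShuffleSketch {
    val numReducers = 4

    // deterministic: the same key always maps to the same reducer
    def partition(key: String): Int =
      (key.hashCode % numReducers + numReducers) % numReducers

    def main(args: Array[String]): Unit = {
      val pairs     = Seq(("the", 1), ("fox", 1), ("the", 1))
      val byReducer = pairs.groupBy { case (key, _) => partition(key) }
      println(byReducer)   // both ("the", 1) pairs end up in the same group
    }
  }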
big-data tech: map/reduce
‘map/reduce’ in hadoop also has a slightly
different ‘reduce’
reduce: values are aggregated per key, not across the whole dataset
big-data tech: map/reduce
in general, map/reduce shines for ‘embarrassingly parallel’ problems
trying to run non-parallelizable jobs on hadoop (like those requiring global ordering, or something similar) might work now, but may not scale in the long run
http://en.wikipedia.org/wiki/Embarrassingly_parallel
big-data tech: map/reduce
But surprisingly a *lot* of ‘big-data’ problems can
be modeled directly on map/reduce with little to
no change
big-data tech: map/reduce
map/reduce on hadoop is a multi-tenant
distributed system running on disk-local data
non-interactive analysis
big-data analysis/processing has typically been associated with non-interactive jobs, i.e., the user doesn’t expect the results to come back in a few seconds
the job usually takes a few minutes to a few hours, or even days
the need for speed
What if we made it run on in-memory data?
the need for speed
this is the current trend
Spark and Presto are some noteworthy examples
they are not based on the map/reduce programming paradigm, but still take advantage of the underlying distributed filesystem (hdfs)
faster big-data: spark
Spark is a Scala DSL
Resilient Distributed Datasets (RDDs) are the primary abstraction in Spark
RDDs: collections of data kept in-memory
Spark provides a collections API, comparable to the Scala language’s, whose operations transform one RDD into another
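a minimal word-count sketch against the Spark 1.x-era RDD API; the app name and HDFS paths are placeholders, and this would normally be packaged and launched with spark-submit:

  import org.apache.spark.{SparkConf, SparkContext}

  object SparkWordCount {
    def main(args: Array[String]): Unit = {
      val sc     = new SparkContext(new SparkConf().setAppName("word-count"))
      val lines  = sc.textFile("hdfs:///path/to/input")   // RDD[String]
      val counts = lines
        .flatMap(_.split("\\s+"))                         // each transformation returns a new RDD
        .map(word => (word, 1))
        .reduceByKey(_ + _)                               // aggregation happens per key
      counts.saveAsTextFile("hdfs:///path/to/output")
      sc.stop()
    }
  }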
faster big-data: spark
fault-tolerance: Spark retries the computation on certain failures
so it’s also good for ‘backend’ / scheduled jobs in addition to interactive usage
faster big-data: spark
where it shines: complex, iterative algorithms like those in machine learning
since RDDs can be cached in-memory and reused across the cluster, the computation speeds up considerably in the absence of disk I/O
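a rough sketch of why that matters for iterative jobs: the input RDD is cached once and reused on every iteration instead of being re-read from disk; the ‘gradient’ step is just a stand-in for a real machine-learning update, and the path is a placeholder:

  import org.apache.spark.{SparkConf, SparkContext}

  object IterativeSketch {
    def main(args: Array[String]): Unit = {
      val sc     = new SparkContext(new SparkConf().setAppName("iterative-sketch"))
      val points = sc.textFile("hdfs:///path/to/points").map(_.toDouble).cache()   // kept in memory

      var weight = 0.0
      for (_ <- 1 to 10) {   // every iteration reuses the cached RDD, no disk re-read
        val gradient = points.map(p => p - weight).sum() / points.count()
        weight += 0.1 * gradient
      }
      println(s"final weight: $weight")
      sc.stop()
    }
  }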
faster big-data: presto
Presto is from Facebook and is written in Java
it is essentially a distributed, read-only SQL query engine
designed specifically for interactive ad-hoc analysis over petabytes of data
data stays on disk, but the processing pipeline is fully in-memory
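a sketch of firing an ad-hoc query at Presto over JDBC from Scala; the driver class name, JDBC URL format, credentials, and table are assumptions from memory of the prestodb JDBC driver and should be checked against prestodb.io:

  import java.sql.DriverManager

  object PrestoAdHocQuery {
    def main(args: Array[String]): Unit = {
      Class.forName("com.facebook.presto.jdbc.PrestoDriver")    // assumed driver class
      val conn = DriverManager.getConnection(
        "jdbc:presto://presto-coordinator:8080/hive/default",   // assumed URL format
        "analyst", null)
      val rs = conn.createStatement()
        .executeQuery("SELECT country, count(*) FROM page_views GROUP BY country")   // hypothetical table
      while (rs.next()) println(s"${rs.getString(1)}: ${rs.getLong(2)}")
      conn.close()
    }
  }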
faster big-data: presto
no fault-tolerance, but extremely fast
ideal for ad-hoc queries on extremely large data that finish fast, but not for long-running or scheduled jobs
as of today there is no UDF support
references
http://www.akkadia.org/drepper/cpumemory.pdf
http://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/stanford-295-talk.pdf
http://queue.acm.org/detail.cfm?id=1563874
http://en.wikipedia.org/wiki/Apache_Hadoop
https://prestodb.io/
Leave questions & comments below, or reach out through LinkedIn!
Arvind Kalyan
https://www.linkedin.com/in/base16