@PennAnData
The State of Big Data
2016
Summary
1 Data Facts
2 Hadoop Basics
3 Beyond Batch: Streaming
4 Columnar Storage
5 Ecosystem
Big Data Facts
PART 1
3V's
Volume Velocity Variety
Volume ...
Data Production is constantly growing
Data Retention is increasing widely
Extract Value from your data
Storage Cost is decreasing
Velocity ...
Data is produced Faster
Get Real-Time insight
Move from capture to Analysis
Get Actionable insight
Variety ...
Not only Structured Data
Toward mostly Unstructured
Text (articles, comments, tweets, ...)
Images (ID cards, bills, ...)
Logs, metrics, ...
The State of BigData  -  meetup bigdata @ovh
Seek Time
• 5-10 ms per seek
• Up to 200 seeks/s
Data Transfer Rate (time needed to transfer a given volume)
Link speed (Mbps)    100      1000     10000
Throughput (MB/s)    12.5     125      1250
1 MB                 80 ms    8 ms     0.8 ms
1 CD (700 MB)        56 s     5.6 s    0.56 s
1 GB (1000 MB)       1m20     8 s      0.8 s
1 DVD (4700 MB)      6m16     37.6 s   3.76 s
1 TB (1000 GB)       22h13    2h13     13m
Data Transfer Rate (volume transferred in a given time)
Link speed (Mbps)    100      1000     10000
Throughput (MB/s)    12.5     125      1250
1 min                750 MB   7.5 GB   75 GB
15 min               11 GB    112 GB   1 TB
1 hour               45 GB    450 GB   4.5 TB
1 day                1 TB     10.8 TB  108 TB
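The figures in the two transfer-rate tables all come from one conversion: a link speed in Mbps divided by 8 gives the throughput in MB/s. Here is a minimal Python sketch of that arithmetic, assuming decimal units (1 GB = 1000 MB) and no protocol overhead. The function names are illustrative and not part of the original slides.

```python
def mbps_to_mb_per_s(mbps):
    """Convert a link speed in megabits/s to megabytes/s (8 bits per byte)."""
    return mbps / 8.0


def transfer_seconds(size_mb, mbps):
    """Seconds needed to move size_mb megabytes over a link of mbps megabits/s."""
    return size_mb / mbps_to_mb_per_s(mbps)


def volume_mb(duration_s, mbps):
    """Megabytes moved in duration_s seconds over a link of mbps megabits/s."""
    return duration_s * mbps_to_mb_per_s(mbps)


if __name__ == "__main__":
    one_tb_in_mb = 1_000_000  # assumption: decimal units, 1 TB = 1000 GB
    for mbps in (100, 1000, 10000):
        hours = transfer_seconds(one_tb_in_mb, mbps) / 3600
        print(f"{mbps} Mbps: 1 TB in about {hours:.2f} h")
```

The table entries are rounded, so the computed values can differ slightly from the ones shown on the slide.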
Payload
Definition
“Big Data really is about having insights and making an
impact on your business. If you aren’t taking advantage of
the data you’re collecting, then you just have a pile of data,
you don’t have Big Data.”
#BigData
Introducing Hadoop
PART 2
#DougCutting
#Tools
Timeline
#HDFS
#Blocks
HDFS
/ HDFS
File → Blocks → DataNodes
/ HDFS / Replication
DataNodes
NameNode
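As a rough illustration of the File / Blocks / DataNodes picture, the sketch below cuts a file into fixed-size blocks and assigns each block to several distinct DataNodes, the kind of mapping a NameNode keeps track of. It is a toy model and not HDFS code. The 128 MB block size and the round-robin placement are assumptions (real HDFS placement is rack-aware); the replication factor of 3 follows the #ReplicationFactor3 slide.

```python
BLOCK_SIZE_MB = 128  # assumption: a common HDFS block size, not stated in the slides
REPLICATION = 3      # follows the #ReplicationFactor3 slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes of the blocks a file is cut into; the last one may be smaller."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy placement: round-robin over DataNodes, each replica on a distinct node."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(300)
    print(blocks)
    print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```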
/ HDFS / NameNode
DataNodes
NameNodes
/ HDFS / Namespace
#Federation
/ HDFS / HA
#HighAvailability
NN1 NN2
/ HDFS / HA
Failover Controller
● NameNode Side
  ● Health monitor
  ● Manage HA State
● ZooKeeper Side
  ● Monitor State
  ● Maintain or Try to get Active Lock
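The two sides of the failover controller can be sketched as a small loop: check the local NameNode's health, then either hold on to the active lock or try to get it. The Python below is a toy simulation under loud assumptions. `ToyLock` is an in-memory stand-in for the ZooKeeper lock, and `ToyFailoverController` is a hypothetical class rather than Hadoop's implementation. Fencing and session timeouts are left out.

```python
class ToyLock:
    """In-memory stand-in for the ZooKeeper active lock (hypothetical)."""

    def __init__(self):
        self.owner = None

    def try_acquire(self, name):
        # Take the lock if it is free; report whether `name` now holds it.
        if self.owner is None:
            self.owner = name
        return self.owner == name

    def release(self, name):
        if self.owner == name:
            self.owner = None


class ToyFailoverController:
    """One controller per NameNode: monitors health and manages the HA state."""

    def __init__(self, name, lock):
        self.name = name
        self.lock = lock
        self.healthy = True      # would come from the health monitor
        self.state = "standby"

    def tick(self):
        if not self.healthy:
            # Unhealthy NameNode: give up the lock so the peer can take over.
            self.lock.release(self.name)
            self.state = "standby"
        elif self.lock.try_acquire(self.name):
            self.state = "active"
        else:
            self.state = "standby"
        return self.state


if __name__ == "__main__":
    lock = ToyLock()
    nn1, nn2 = ToyFailoverController("NN1", lock), ToyFailoverController("NN2", lock)
    print(nn1.tick(), nn2.tick())
    nn1.healthy = False
    print(nn1.tick(), nn2.tick())
```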
#Five9rulez
/ HDFS / Client
#Read
#DataLocality
/ HDFS / Client
#Write
#ReplicationFactor3
#MapReduce
MapReduce
#MAP
MapReduce
#SHUFFLE
MapReduce
#REDUCE
MapReduce
Input → Input Pairs <key1, val1> ... <key502, val502> → map
      → Intermediate Pairs <ikey1, ival1> ... <ikey150, ival522> → reduce
      → Output Pairs <okey1, oval1> ... <okey150, oval150> → Output
Step 1: Split
Step 2: Map
Step 3: Shuffle / Sort
Step 4: Reduce
Step 5: Store
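The five steps can be followed end to end with the classic word-count example. The sketch below runs everything in a single Python process, so it only illustrates the data flow (input pairs, intermediate pairs, output pairs) and is not Hadoop code. The function names and the word-count task are illustrative choices, not taken from the slides.

```python
from collections import defaultdict


def split_input(text):
    """Step 1: split the input into records (here, one record per line)."""
    return text.splitlines()


def map_phase(line):
    """Step 2: emit one intermediate <word, 1> pair per word."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Step 3: group intermediate values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Step 4: fold all values of one key into a single output pair."""
    return key, sum(values)


def run(text):
    """Run steps 1 to 4 and return the output pairs (step 5 would store them)."""
    intermediate = [pair for line in split_input(text) for pair in map_phase(line)]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run("big data\nbig hadoop"))
```

In a real cluster the map calls run in parallel near the data blocks, and the shuffle moves intermediate pairs over the network to the reducers.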
MapReduce
MapReduce
#Pig & #Hive
Hive
● Tez
● Impala
● Presto.io
#HBase
HBase
#Model
HBase
#Model
HBase
#Model
HBase
#Model
HBase
#PhysicalStorage
HBase
#Scale
HBase
#Scale
HBase
#Meta
#HBaseArch
HBase
#SQL
HBase
#Features
● Coprocessor
● Auto-sharding
● Scan (full,range)
● Schemaless
● Cell versioning
● Battle tested
● Compactions
● Replication
● Custom filters
● Transactional
● Low Latency
● Active Community
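Several of these features (schemaless columns, cell versioning, full and range scans) follow from viewing a table as a sorted map from row key to column to timestamped values. The toy Python class below sketches that mental model only. It is not the HBase client API, `ToyTable` is a hypothetical name, and the default of 3 kept versions is an arbitrary choice.

```python
import bisect


class ToyTable:
    """Toy model: row key -> column -> list of (timestamp, value), newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions  # arbitrary choice for the sketch
        self.rows = {}
        self.sorted_keys = []  # row keys kept sorted to allow range scans

    def put(self, row, column, value, timestamp):
        if row not in self.rows:
            bisect.insort(self.sorted_keys, row)
            self.rows[row] = {}
        # Schemaless: any column name can appear on any row.
        versions = self.rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # cell versioning: drop the oldest

    def get(self, row, column):
        """Return the newest value of a cell, or None if it does not exist."""
        versions = self.rows.get(row, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start=None, stop=None):
        """Return row keys in [start, stop); no bounds means a full scan."""
        lo = 0 if start is None else bisect.bisect_left(self.sorted_keys, start)
        hi = (len(self.sorted_keys) if stop is None
              else bisect.bisect_left(self.sorted_keys, stop))
        return self.sorted_keys[lo:hi]


if __name__ == "__main__":
    table = ToyTable()
    table.put("row2", "cf:name", "bob", timestamp=1)
    table.put("row1", "cf:name", "alice", timestamp=1)
    table.put("row1", "cf:name", "alicia", timestamp=2)
    print(table.get("row1", "cf:name"), table.scan())
```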
Beyond Batch: Streaming
PART 3
/ Streaming / Data Platform #Transport
/ Streaming / Data Platform / Kafka
/ Streaming / Frameworks
/ Streaming / Storm / Topology #Storm
/ Streaming / Storm / Topology #Parallelism
/ Streaming / Flink #Job
/ Streaming / Flink
#DataSet API #DataStream API
OK Steven, but a new DSL for each new hyped tool?
Come on...
Apache Beam
#Features
● Open-sourced Google Dataflow
● Unifies big data development
● Beam Model (from the Dataflow model)
● Parallel data processing pipelines
● Pluggable runners: Flink or Google Cloud Dataflow
● Portability
● SDKs: Java / Python
#Architecture
Lambda Architecture
Drawbacks
• Hard to merge batch and real-time results for the serving layer
• Hard to maintain and operate: real-time and batch code must be kept in sync
Kappa Architecture
From Storm to Flink
#Yarn
Yarn
#MapReduce
Yarn
#MessagePassing
Yarn
#StreamProcessing
Yarn
#DistributedLoadTest
Yarn
#ResourceManagement
Yarn Frameworks
#Mesos
Columnar Storage
PART 4
Columnar Storage
#ORC
#Parquet
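The idea behind columnar formats can be sketched without either library: store the values of each column together instead of row by row, so a query reads only the columns it needs and repeated values compress well. The Python below is a toy illustration with made-up sample data. It does not use the ORC or Parquet file formats, and run-length encoding is shown only as one example of an encoding that suits column layouts.

```python
def to_columns(rows):
    """Pivot a list of row dicts (all with the same keys) into column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Compress consecutive repeats into (value, count) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    rows = [
        {"user": "alice", "country": "FR", "bytes": 120},
        {"user": "bob", "country": "FR", "bytes": 300},
        {"user": "carol", "country": "US", "bytes": 80},
    ]
    columns = to_columns(rows)
    # An aggregate over one column touches only that column's values.
    print(sum(columns["bytes"]))
    print(run_length_encode(columns["country"]))
```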
Ecosystem
PART 5
Vendors
Integration
?
@StevenLeRoux
2016
