Solving Big Data Problems: Storage to the Rescue?
John Webster
Evaluator Group
2015 Data Storage Innovation Conference. © Insert Your Company Name. All Rights Reserved.
Agenda
Big Data Analytics Storage Maxims
The Fundamental JBOD and DAS Architecture
Overview of Disk-based Alternatives
What are the Advantages and Disadvantages?
The Solid State and In-memory Alternatives
Summary and Q&A
Note: References to specific vendors and products are used as
real-world examples and do not imply an endorsement
04/10/15 2
Big Data Storage Maxim #1
Deliver storage performance at large scale
and at low cost, and all at the same time
(Think early stage Google, Facebook, Twitter)
Big Data Storage Maxim #2
Minimize the “distance” between processing
and data storage
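One way to picture this maxim is a scheduler that prefers to run a task on a node that already holds the data it needs. The sketch below is a simplified illustration only, not Hadoop's actual scheduler: the function name `pick_node` and its arguments are hypothetical, and rack awareness is ignored.

```python
from typing import Dict, List, Optional


def pick_node(block_replicas: List[str], free_slots: Dict[str, int]) -> Optional[str]:
    """Choose a node for a task that reads one data block (hypothetical helper).

    block_replicas: nodes that hold a copy of the block.
    free_slots: free task slots per node.
    """
    # Shortest "distance": a replica holder with a free slot,
    # so the block is read from local disk and nothing crosses the network.
    for node in block_replicas:
        if free_slots.get(node, 0) > 0:
            return node
    # Fallback: any node with a free slot; the block must travel over the network.
    for node, slots in free_slots.items():
        if slots > 0:
            return node
    # No capacity anywhere in the cluster.
    return None
```

For example, `pick_node(["n1", "n2"], {"n1": 0, "n2": 1, "n3": 4})` picks `n2`, because it holds a replica and has a free slot, even though `n3` has more free capacity.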
Big Data Storage Maxim #3
Big Data analytics is dominated by open
source
Big Data Storage Maxim #4
Big Data analytics software developers manage
data at the clustered server level. Storage vendors
manage data at the storage system level.
Shared Nothing, Asymmetrical Distributed Computing
[Diagram: a control node and nodes 1 through n, each with its own DAS]
Network layer: 1 Gb Ethernet
Compute layer: commodity servers
Storage layer: 6-12 disks in each server, typically JBOD
Scale to thousands of nodes
Only the Ethernet network is shared
In Hadoop, Control = Name Node; Node 1,2… = Data Node
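The split between the control node and the data nodes can be sketched as a toy model: the control node keeps only metadata (which nodes hold each block), while the blocks themselves live on each node's DAS. This is an illustration under stated assumptions, not HDFS code. The class name `ToyNameNode`, the 128 MB block size, and the round-robin placement are all assumptions made for the example; real HDFS placement is rack-aware.

```python
import math
from typing import Dict, List

BLOCK_SIZE_MB = 128  # assumed block size, for illustration only


class ToyNameNode:
    """Holds metadata only: which data nodes store each block of each file."""

    def __init__(self, data_nodes: List[str], replication: int = 3) -> None:
        if replication > len(data_nodes):
            raise ValueError("need at least as many data nodes as replicas")
        self.data_nodes = data_nodes
        self.replication = replication
        self.block_map: Dict[str, List[List[str]]] = {}
        self._next = 0  # index of the node that receives the next block's first replica

    def add_file(self, name: str, size_mb: int) -> List[List[str]]:
        """Split a file into blocks and assign each block to `replication` distinct nodes."""
        n_blocks = max(1, math.ceil(size_mb / BLOCK_SIZE_MB))
        placements = []
        for _ in range(n_blocks):
            # Simple round-robin placement (an assumption, not the HDFS policy).
            nodes = [
                self.data_nodes[(self._next + i) % len(self.data_nodes)]
                for i in range(self.replication)
            ]
            self._next = (self._next + 1) % len(self.data_nodes)
            placements.append(nodes)
        self.block_map[name] = placements
        return placements
```

With four data nodes, a 300 MB file becomes three blocks, each placed on three different nodes. Losing one node still leaves two copies of every block, while the metadata lives in one place: the control node.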
Apache Hadoop: A Platform for All Applications?
[Diagram: Hadoop platform stack]
Presentation & Application: Enable both existing and new applications to provide value to the organization
Operations: Empower existing operations and security tools to manage Hadoop
Data Access:
  Batch: MapReduce
  Script: Pig
  SQL: Hive
  Online: HBase, Accumulo
  Real-Time: Storm
  In-Memory: Spark
  Others
  Metadata Management: HCatalog
Data Management:
  Multitenant Processing: YARN (Hadoop Operating System)
  Storage: HDFS (Hadoop Distributed File System)
Data Integration & Governance:
  Data Workflow, Data Lifecycle: Falcon
  Real-time and Batch Ingest: Flume, Sqoop, WebHDFS, NFS
Security: Authentication, Authorization, Accountability, Data Protection across
  Storage: HDFS
  Resources: YARN
  Access: Hive, …
  Pipeline: Falcon
  Cluster: Knox
Operations:
  Provision, Manage & Monitor: Ambari
  Scheduling: Oozie
Environment: Linux, Windows; On Premise, Virtualize; Commodity HW, Appliance; Cloud/Hosted
Source: Hortonworks
HDFS as a Persistent Storage Layer
Advantages
Storage performance at large scale and low cost
Minimize distance between data and compute
Node failures tolerated
Open Source
Disadvantages
Hadoop NameNode lacks active/active failover (i.e. it’s a SPOF)
For data integrity and protection, HDFS creates three full clone copies of data
3x the storage for each file – slow and inefficient
If all three copies are corrupted, you’re still hosed (reload and start over)
No storage tiering (recognition of different storage types is available as of Hadoop 2.3)
Limited ways to respond to corporate security and data governance policies
Data in/out processes can take longer than the actual query process
What is the single source of the truth?
Inability to disaggregate storage from compute so that the two can be scaled independently
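The cost of keeping three full copies can be worked out with simple arithmetic. The sketch below is a rough sizing aid under stated assumptions: the function names are hypothetical, and it ignores temporary space, compression, and filesystem overhead.

```python
import math


def raw_capacity_tb(usable_tb: float, replication: int = 3) -> float:
    """Raw disk capacity needed to hold `usable_tb` of data at the given replication factor."""
    return usable_tb * replication


def nodes_needed(usable_tb: float, disks_per_node: int, disk_tb: float,
                 replication: int = 3) -> int:
    """Number of DAS nodes needed, rounding up to whole nodes."""
    raw = raw_capacity_tb(usable_tb, replication)
    return math.ceil(raw / (disks_per_node * disk_tb))
```

Under these assumptions, 100 TB of data needs 300 TB of raw disk. With 12 disks of 4 TB per node (48 TB per node), that is 7 nodes for storage alone, compared with 3 nodes if a single copy were enough.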
Shared Storage as Primary Storage
[Diagram: a control node and nodes 1 through n in the compute layer, connected through the network layer to a shared storage layer]
Storage layer: SAN or NAS, but more commonly scale-out NAS
Shared Storage as Secondary Storage
[Diagram: a control node and nodes 1 through n in the compute layer, with the network layer and storage layer shown alongside]
Storage layer: SAN/NAS/Object Storage
Hadoop On Scale-out Storage
Scale-out storage replaces node-level DAS
HDFS implemented as “over the wire” protocol or CDMI interface to underlying FS
NameNode SPOF eliminated
Decoupled storage and compute layers
Data services, data protection, and DR by storage-resident services
Examples include EMC Isilon, IBM Elastic Storage, Ceph
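The effect of decoupling storage from compute can be illustrated with a small sizing comparison. This is a hedged sketch, not a vendor sizing method: the function names and per-node figures are hypothetical, and cost, replication, and network limits are left out.

```python
import math
from typing import Tuple


def coupled_nodes(storage_tb: float, compute_cores: int,
                  tb_per_node: float, cores_per_node: int) -> int:
    """Node-level DAS: every node adds storage and compute together,
    so the larger of the two requirements sets the node count."""
    return max(math.ceil(storage_tb / tb_per_node),
               math.ceil(compute_cores / cores_per_node))


def decoupled_nodes(storage_tb: float, compute_cores: int,
                    tb_per_storage_node: float,
                    cores_per_compute_node: int) -> Tuple[int, int]:
    """Shared scale-out storage: each layer is sized on its own.
    Returns (storage nodes, compute nodes)."""
    return (math.ceil(storage_tb / tb_per_storage_node),
            math.ceil(compute_cores / cores_per_compute_node))
```

For a workload needing 960 TB and 160 cores, with 48 TB and 16 cores per node, the coupled design needs 20 nodes and so carries 320 cores when only 160 are required. The decoupled design needs 20 storage nodes and 10 compute nodes, and either count can grow later without the other.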
Shared Primary/Secondary Storage
Advantages
Addresses the enterprise storage management requirements
• Data protection/disaster recovery/business continuance
• Data governance/compliance/archiving
• Single source of the truth
Disadvantages
Additional cost
Potential performance impact
Using a vendor specific solution introduces
proprietary data/storage management software
What About SSD?
[Diagram: a control node and nodes 1 through n, each with its own DAS]
Network layer: 10+ Gb Ethernet
Compute layer: commodity servers
Storage layer: SSD in/attached to each server
Scale to thousands of nodes
Only the Ethernet network is shared
In Hadoop, Control = Name Node; Node 1,2… = Data Node
What About SSD?
[Diagram: a control node and nodes 1 through n in the compute layer, connected through the network layer to a shared storage layer]
Storage layer: scale-out flash storage
What About In-Memory Computing?
Tachyon
UC Berkeley AMPLab project
“Reliable, memory-centric storage for Big Data
Analytics clusters” (i.e. memory as persistent data
store across cluster nodes)
One in-memory copy of the data inside the JVM; operation “lineage” is used to recompute data after a failure
Initial use in Apache Spark environments
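The lineage idea can be sketched in a few lines: keep a single in-memory copy of each dataset, remember the operation and inputs that produced it, and rerun that operation if the copy is lost. This is an illustration of the concept only and does not use Tachyon's API. The class `LineageStore` and its methods are hypothetical, and a zero-input function stands in for reloading source data from durable storage.

```python
from typing import Callable, Dict, List, Tuple


class LineageStore:
    """Keeps one in-memory copy per dataset plus the recipe that produced it."""

    def __init__(self) -> None:
        self._memory: Dict[str, List[int]] = {}
        self._lineage: Dict[str, Tuple[Callable[..., List[int]], List[str]]] = {}

    def derive(self, name: str, fn: Callable[..., List[int]], inputs: List[str]) -> None:
        """Record how `name` is computed, then compute and cache it."""
        self._lineage[name] = (fn, inputs)
        self._memory[name] = fn(*[self.get(i) for i in inputs])

    def get(self, name: str) -> List[int]:
        """Return the cached copy, recomputing it from lineage if it was lost."""
        if name not in self._memory:
            fn, inputs = self._lineage[name]
            self._memory[name] = fn(*[self.get(i) for i in inputs])
        return self._memory[name]

    def lose(self, name: str) -> None:
        """Simulate a node failure that wipes the in-memory copy."""
        self._memory.pop(name, None)
```

A short usage example: after `store.derive("raw", lambda: [1, 2, 3], [])` and `store.derive("doubled", lambda xs: [2 * x for x in xs], ["raw"])`, calling `store.lose("doubled")` followed by `store.get("doubled")` recomputes the result from `raw` instead of reading a second stored copy. The trade-off is recomputation time after a failure in exchange for not holding replicas in memory.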
What About In-memory Computing?
Apache Ignite
In-memory “data fabric”
Distributed in-memory platform for computing and
transacting on large-scale data sets in real-time
“Orders of magnitude faster than possible with
traditional disk-based or flash technologies.”
Tier -1 storage?
Originated as GridGain Data Fabric
In-Memory Computing Summit 6/29-30
imcsummit.org
Summary and Q&A
The need for a longer-term, persistent storage
layer is now recognized
For Hadoop, HDFS may or may not be that
storage layer
Enterprise storage architects and administrators
will be more directly involved in managing Big
Data analytics storage over time
Now is the time to research and understand the
options
