1
2
Storage Policies & Disk Types (Hadoop 2.6 and up)
Disk Type → flexible; can be assigned to any local filesystem
Storage Policy → set on a file or inherited from the parent directory
Hadoop HDFS Tiering Support
aka – Heterogeneous Storage
Storage Policy Name   Storage Types (n replicas)
Lazy_Persist          RAM_DISK: 1, DISK: n-1
All_SSD               SSD: n
One_SSD               SSD: 1, DISK: n-1
Hot (default)         DISK: n
Warm                  DISK: 1, ARCHIVE: n-1
Cold                  ARCHIVE: n
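For readers who want to try these policies outside the slides, here is a minimal Java sketch (not code from the talk) that assigns one of the policies above through the HDFS client API; the path /data/results and the COLD choice are just examples, and the cluster address is assumed to come from the local Hadoop configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ApplyStoragePolicy {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml / hdfs-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // On Hadoop 2.6/2.7 setStoragePolicy is exposed on DistributedFileSystem;
        // from 2.8 onward it is also available on the FileSystem base class.
        DistributedFileSystem dfs = (DistributedFileSystem) fs;

        // Policy names match the table above:
        // HOT, WARM, COLD, ONE_SSD, ALL_SSD, LAZY_PERSIST.
        dfs.setStoragePolicy(new Path("/data/results"), "COLD");

        // CLI equivalent on 2.7+ releases:
        //   hdfs storagepolicies -setStoragePolicy -path /data/results -policy COLD
        //   hdfs storagepolicies -listPolicies
    }
}
```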
3
Hadoop HDFS Tiering Support
aka – Heterogeneous Storage
/data/results/query2.csv is written to the Hot Nodes; the Archive Nodes hold no replicas of it yet.
The Storage Policy default is Hot and the Storage Type default is DISK, so the file starts out as:
Storage Policy: HOT
Storage Type: DISK
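Which tier a datanode volume belongs to is declared by tagging entries in dfs.datanode.data.dir with a storage type such as [DISK] or [ARCHIVE]; untagged volumes default to DISK, which is why a freshly written file sits on DISK under the default HOT policy. Below is a hedged sketch of reading a path's effective policy back, assuming a Hadoop 2.8+ client where getStoragePolicy is exposed on FileSystem (older clients use the CLI shown in the comment).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockStoragePolicySpi;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowStoragePolicy {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // CLI equivalent on 2.7+:
        //   hdfs storagepolicies -getStoragePolicy -path /data/results/query2.csv
        BlockStoragePolicySpi policy =
                fs.getStoragePolicy(new Path("/data/results/query2.csv"));

        // A file that has never had a policy set reports the default, HOT,
        // so its blocks are placed on storage type DISK.
        System.out.println("Effective policy: " + policy.getName());
    }
}
```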
4
Hadoop HDFS Tiering Support
aka – Heterogeneous Storage
The file's Storage Policy is changed to Cold, but the file remains on the same storage type (and on the Hot Nodes, not the Archive Nodes) until the mover is run:
/data/results/query2.csv
Storage Policy: Cold
Storage Type: DISK
5
Hadoop HDFS Tiering Support
aka – Heterogeneous Storage
After the mover is run, all replicas move to storage type ARCHIVE on the Archive Nodes:
/data/results/query2.csv
Storage Policy: Cold
Storage Type: ARCHIVE
Note: the file has not logically moved in HDFS; its path is unchanged.
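The physical migration above is done by the stock mover tool that ships with HDFS (hdfs mover). Here is a hedged sketch of kicking it off for the example file from a Java job, assuming the hdfs launcher is on the PATH of the machine running the code.

```java
import java.io.IOException;

public class RunMover {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Shell out to the stock mover; -p takes the HDFS files/dirs to migrate.
        ProcessBuilder pb = new ProcessBuilder(
                "hdfs", "mover", "-p", "/data/results/query2.csv");
        pb.inheritIO(); // stream the mover's progress output to this console
        int exit = pb.start().waitFor();
        System.out.println("hdfs mover exited with code " + exit);
        // Until the mover completes, the file keeps its old storage type even
        // though its policy already says COLD.
    }
}
```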
6
WHY TIER HADOOP STORAGE?
ISN’T IT ALREADY COMMODITY STORAGE?
(aka – The cheapest stuff on the planet)
7
Lower Disk Capacity to Compute
Better job scalability, performance, and consistent results
5x to 10x more expensive per GB
8
Much Denser Disk to Compute
Much less $ per GB
Could impact performance and produce inconsistent results
9
Cold Goes to Archive. Hot Gets More Resources
Archive nodes (dense disk): much less $ per GB
Compute (hot) nodes: more resources are free to process jobs
Better Performance & Lower Infrastructure Costs
10
SO → How do I utilize archive storage to lower my storage costs without impacting performance?
Answer: Intelligent Tiering
11
• Access frequency of data is the most important metric for effective tiering (see the sketch after this list).
• Age is easiest to determine. CAUTION: some data stays active long-term, so age cannot be the only criterion.
• Zero-byte and small files should be archived differently when tiering Hadoop.
• Large cold files should get priority for archive.
• Knowing how long data keeps being accessed after it is ingested provides better capacity planning for your tiers.
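A minimal sketch of how these criteria could be evaluated against a live namespace using only the stock client API. The 90-day and 128 MB thresholds and the /data root are illustrative assumptions, and last-access times are only meaningful if dfs.namenode.accesstime.precision has not been disabled.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class TieringSurvey {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        long now = System.currentTimeMillis();
        long coldAfterMs = 90L * 24 * 60 * 60 * 1000;  // illustrative: 90 days
        long smallFileBytes = 128L * 1024 * 1024;       // illustrative: under one block

        long largeCold = 0, smallFiles = 0, other = 0;

        // Recursively walk the namespace; this touches metadata only.
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/data"), true);
        while (it.hasNext()) {
            LocatedFileStatus f = it.next();
            boolean cold = (now - f.getAccessTime()) > coldAfterMs;
            if (f.getLen() < smallFileBytes) {
                smallFiles++;   // archive differently (e.g. compact before tiering)
            } else if (cold) {
                largeCold++;    // first-priority archive candidates
            } else {
                other++;
            }
        }
        System.out.printf("large+cold: %d, small: %d, other: %d%n",
                largeCold, smallFiles, other);
    }
}
```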
12
Tier Hadoop HDFS By Heat, Age, Size & Usage
In Three Easy Steps
01/ INSTALL WITHOUT CHANGES TO CLUSTER
02/ VISUALIZE & REPORT
03/ AUTOMATE OPTIMIZATION
• Installed on a server or VM outside your existing Hadoop cluster, without inserting any proprietary technology on the cluster or in the data path.
• Report data usage (heat), small files, user activity, replication, and HDFS tier utilization.
• Customize rules and queries to properly utilize infrastructure and plan better for future scale.
• Automatically archive, promote, or change the replication factor of data based on usage patterns and user-defined rules (see the sketch below).
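Step 03 mentions changing the replication factor as well as the storage policy; both are ordinary client calls. A hedged sketch follows (the directory and the drop from 3 to 2 replicas are invented examples, not a description of FactorData's behavior).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class DemoteColdData {
    public static void main(String[] args) throws Exception {
        DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(new Configuration());
        Path dir = new Path("/data/results/2015");  // hypothetical cold directory

        // Archive block placement for everything under the directory...
        dfs.setStoragePolicy(dir, "COLD");

        // ...and lower per-file replication to reclaim further capacity
        // (setReplication acts on files, so walk the directory).
        RemoteIterator<LocatedFileStatus> it = dfs.listFiles(dir, true);
        while (it.hasNext()) {
            dfs.setReplication(it.next().getPath(), (short) 2);
        }
    }
}
```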
13
Completely out of the data path
FactorData HDFSplus sits outside the Hadoop cluster and collects only metadata from it.
No software to install on the existing Hadoop cluster
Because HDFSplus leverages only existing Hadoop APIs and features, there is no software to install on the cluster (see the sketch below).
Provides a highly scalable solution in a small footprint
HDFS visibility and automation for thousands of Hadoop nodes from a single node, VM, or server.
HDFSplus communicates with the Namenodes through the existing Hadoop API.
Sizing: a VM or physical machine with 32 GB RAM, 4 CPUs or vCPUs, and 500 GB of free disk.
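The "metadata only, no agents" claim corresponds to ordinary client calls that an external VM can make against the namenode. Here is a hedged sketch of the kind of numbers such a collector might pull; the namenode address is a placeholder and getContentSummary is a stock API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataProbe {
    public static void main(String[] args) throws Exception {
        // Point at the remote cluster; nothing is installed on the cluster itself.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        ContentSummary cs = fs.getContentSummary(new Path("/data"));
        System.out.println("files:         " + cs.getFileCount());
        System.out.println("directories:   " + cs.getDirectoryCount());
        System.out.println("logical bytes: " + cs.getLength());
        System.out.println("raw bytes:     " + cs.getSpaceConsumed()); // includes replication
    }
}
```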
14
Simplify and Automate Archive and Tiering in Hadoop Today
• Move seldom-accessed data to storage-dense archive nodes
• Lower software licensing costs with less infrastructure
• Free resources on existing namenodes and datanodes
Who or what application is creating all these small files in the cluster?
How can we move data not accessed for 90 days to archive nodes?
How can we better plan for future scale with real Hadoop storage metrics?
Result: Better Performance, Lower Hardware Costs, Lower Software Costs
Plus: Get Necessary Storage Visibility To Answer These Questions & More with FactorData HDFSplus
15
16
17
HDFSplus
1. Query list based on size, heat, activity, and age
2. Apply a storage policy based on the custom query
3. Files are optimized during the normal balancing window
Custom Query Example:
• Move all files 120 days old and not accessed for 90 days to ARCHIVE…
• FactorData creates a data list based on the query
Automated Tiering:
• Limit an automated run by max files or capacity
• FactorData tracks completion of each run
• Data can be excluded from a run according to path, size, and application (a sketch of such a run follows below)
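To make the example rule concrete, here is a hedged sketch of what one automated run could look like with nothing but stock client calls. The 120/90-day thresholds come from the bullet above; the per-run cap, the exclusion prefix, and the decision to leave the physical moves to the regular mover window are illustrative assumptions, not FactorData's implementation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class AutomatedTieringRun {
    public static void main(String[] args) throws Exception {
        DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(new Configuration());

        long now = System.currentTimeMillis();
        long day = 24L * 60 * 60 * 1000;
        long maxFilesPerRun = 10_000;        // illustrative cap on one run
        String excludePrefix = "/data/tmp";  // illustrative exclusion by path

        long tagged = 0;
        RemoteIterator<LocatedFileStatus> it = dfs.listFiles(new Path("/data"), true);
        while (it.hasNext() && tagged < maxFilesPerRun) {
            LocatedFileStatus f = it.next();
            boolean oldEnough = now - f.getModificationTime() > 120 * day;
            boolean notRead   = now - f.getAccessTime()       > 90 * day;
            boolean excluded  = f.getPath().toUri().getPath().startsWith(excludePrefix);
            if (oldEnough && notRead && !excluded) {
                dfs.setStoragePolicy(f.getPath(), "COLD");
                tagged++;
            }
        }
        // Blocks migrate physically later, when `hdfs mover` runs in the
        // normal balancing/maintenance window.
        System.out.println("Files re-tagged COLD this run: " + tagged);
    }
}
```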