1
2
Storage Policies & Disk Types (Hadoop 2.6 and up)
Disk Type → flexible; can be assigned to any local filesystem
Storage Policy → set on a file or inherited from the parent directory
Hadoop HDFS Tiering Support
aka – Heterogeneous Storage
Storage Policy Name   Storage Types (n replicas)
Lazy_Persist          RAM_DISK: 1, DISK: n-1
All_SSD               SSD: n
One_SSD               SSD: 1, DISK: n-1
Hot (default)         DISK: n
Warm                  DISK: 1, ARCHIVE: n-1
Cold                  ARCHIVE: n
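For readers who want to try these policies outside the slides, here is a minimal Java sketch (not code from the talk) that assigns one of the policies above through the HDFS client API; the path /data/results and the COLD choice are just examples, and the cluster address is assumed to come from the local Hadoop configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ApplyStoragePolicy {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml / hdfs-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // On Hadoop 2.6/2.7 setStoragePolicy is exposed on DistributedFileSystem;
        // from 2.8 onward it is also available on the FileSystem base class.
        DistributedFileSystem dfs = (DistributedFileSystem) fs;

        // Policy names match the table above:
        // HOT, WARM, COLD, ONE_SSD, ALL_SSD, LAZY_PERSIST.
        dfs.setStoragePolicy(new Path("/data/results"), "COLD");

        // CLI equivalent on 2.7+ releases:
        //   hdfs storagepolicies -setStoragePolicy -path /data/results -policy COLD
        //   hdfs storagepolicies -listPolicies
    }
}
```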
3
Hadoop HDFS Tiering Support
aka – Heterogeneous Storage
/data/results/query2.csv is written to the Hot Nodes; the Archive Nodes hold no replicas of it yet.
The Storage Policy default is Hot and the Storage Type default is DISK, so the file starts out as:
Storage Policy: HOT
Storage Type: DISK
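Which tier a datanode volume belongs to is declared by tagging entries in dfs.datanode.data.dir with a storage type such as [DISK] or [ARCHIVE]; untagged volumes default to DISK, which is why a freshly written file sits on DISK under the default HOT policy. Below is a hedged sketch of reading a path's effective policy back, assuming a Hadoop 2.8+ client where getStoragePolicy is exposed on FileSystem (older clients use the CLI shown in the comment).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockStoragePolicySpi;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowStoragePolicy {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // CLI equivalent on 2.7+:
        //   hdfs storagepolicies -getStoragePolicy -path /data/results/query2.csv
        BlockStoragePolicySpi policy =
                fs.getStoragePolicy(new Path("/data/results/query2.csv"));

        // A file that has never had a policy set reports the default, HOT,
        // so its blocks are placed on storage type DISK.
        System.out.println("Effective policy: " + policy.getName());
    }
}
```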
4
Hadoop HDFS Tiering Support
aka – Heterogeneous Storage
The file's Storage Policy is changed to Cold, but the file remains on the same storage type (and on the Hot Nodes, not the Archive Nodes) until the mover is run:
/data/results/query2.csv
Storage Policy: Cold
Storage Type: DISK
5
Hadoop HDFS Tiering Support
aka – Heterogeneous Storage
After the mover is run, all replicas move to storage type ARCHIVE on the Archive Nodes:
/data/results/query2.csv
Storage Policy: Cold
Storage Type: ARCHIVE
Note: the file has not logically moved in HDFS; its path is unchanged.
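The physical migration above is done by the stock mover tool that ships with HDFS (hdfs mover). Here is a hedged sketch of kicking it off for the example file from a Java job, assuming the hdfs launcher is on the PATH of the machine running the code.

```java
import java.io.IOException;

public class RunMover {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Shell out to the stock mover; -p takes the HDFS files/dirs to migrate.
        ProcessBuilder pb = new ProcessBuilder(
                "hdfs", "mover", "-p", "/data/results/query2.csv");
        pb.inheritIO(); // stream the mover's progress output to this console
        int exit = pb.start().waitFor();
        System.out.println("hdfs mover exited with code " + exit);
        // Until the mover completes, the file keeps its old storage type even
        // though its policy already says COLD.
    }
}
```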
6
WHY TIER HADOOP STORAGE?
ISN’T IT ALREADY COMMODITY STORAGE?
(aka – The cheapest stuff on the planet)
7
Lower Disk Capacity to Compute
Better job scalability, performance, and consistent results
5x to 10x more expensive per GB
8
Much Denser Disk to Compute
Much less $ per GB
Could impact performance and produce inconsistent results
9
Cold Goes to Archive. Hot Gets More Resources
Archive nodes (dense disk): much less $ per GB
Compute (hot) nodes: more resources are free to process jobs
Better Performance & Lower Infrastructure Costs
10
SO → How do I utilize archive storage to lower my storage costs without impacting performance?
Answer: Intelligent Tiering
11
• Access frequency of data is the most important metric for effective tiering (see the sketch after this list).
• Age is easiest to determine. CAUTION: some data stays active long-term, so age cannot be the only criterion.
• Zero-byte and small files should be archived differently when tiering Hadoop.
• Large cold files should get priority for archive.
• Knowing how long data keeps being accessed after it is ingested provides better capacity planning for your tiers.
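A minimal sketch of how these criteria could be evaluated against a live namespace using only the stock client API. The 90-day and 128 MB thresholds and the /data root are illustrative assumptions, and last-access times are only meaningful if dfs.namenode.accesstime.precision has not been disabled.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class TieringSurvey {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        long now = System.currentTimeMillis();
        long coldAfterMs = 90L * 24 * 60 * 60 * 1000;  // illustrative: 90 days
        long smallFileBytes = 128L * 1024 * 1024;       // illustrative: under one block

        long largeCold = 0, smallFiles = 0, other = 0;

        // Recursively walk the namespace; this touches metadata only.
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/data"), true);
        while (it.hasNext()) {
            LocatedFileStatus f = it.next();
            boolean cold = (now - f.getAccessTime()) > coldAfterMs;
            if (f.getLen() < smallFileBytes) {
                smallFiles++;   // archive differently (e.g. compact before tiering)
            } else if (cold) {
                largeCold++;    // first-priority archive candidates
            } else {
                other++;
            }
        }
        System.out.printf("large+cold: %d, small: %d, other: %d%n",
                largeCold, smallFiles, other);
    }
}
```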
12
Tier Hadoop HDFS By Heat, Age, Size & Usage
In Three Easy Steps
01/ INSTALL WITHOUT CHANGES TO CLUSTER
02/ VISUALIZE & REPORT
03/ AUTOMATE OPTIMIZATION
• Installed on a server or VM outside your existing Hadoop cluster, without inserting any proprietary technology on the cluster or in the data path.
• Report data usage (heat), small files, user activity, replication, and HDFS tier utilization.
• Customize rules and queries to properly utilize infrastructure and plan better for future scale.
• Automatically archive, promote, or change the replication factor of data based on usage patterns and user-defined rules (see the sketch below).
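Step 03 mentions changing the replication factor as well as the storage policy; both are ordinary client calls. A hedged sketch follows (the directory and the drop from 3 to 2 replicas are invented examples, not a description of FactorData's behavior).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class DemoteColdData {
    public static void main(String[] args) throws Exception {
        DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(new Configuration());
        Path dir = new Path("/data/results/2015");  // hypothetical cold directory

        // Archive block placement for everything under the directory...
        dfs.setStoragePolicy(dir, "COLD");

        // ...and lower per-file replication to reclaim further capacity
        // (setReplication acts on files, so walk the directory).
        RemoteIterator<LocatedFileStatus> it = dfs.listFiles(dir, true);
        while (it.hasNext()) {
            dfs.setReplication(it.next().getPath(), (short) 2);
        }
    }
}
```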
13
Completely out of the data path
FactorData HDFSplus sits outside the Hadoop cluster and collects only metadata from it.
No software to install on the existing Hadoop cluster
Because HDFSplus leverages only existing Hadoop APIs and features, there is no software to install on the cluster (see the sketch below).
Provides a highly scalable solution in a small footprint
HDFS visibility and automation for thousands of Hadoop nodes from a single node, VM, or server.
HDFSplus communicates with the Namenodes through the existing Hadoop API.
Sizing: a VM or physical machine with 32 GB RAM, 4 CPUs or vCPUs, and 500 GB of free disk.
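The "metadata only, no agents" claim corresponds to ordinary client calls that an external VM can make against the namenode. Here is a hedged sketch of the kind of numbers such a collector might pull; the namenode address is a placeholder and getContentSummary is a stock API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataProbe {
    public static void main(String[] args) throws Exception {
        // Point at the remote cluster; nothing is installed on the cluster itself.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        ContentSummary cs = fs.getContentSummary(new Path("/data"));
        System.out.println("files:         " + cs.getFileCount());
        System.out.println("directories:   " + cs.getDirectoryCount());
        System.out.println("logical bytes: " + cs.getLength());
        System.out.println("raw bytes:     " + cs.getSpaceConsumed()); // includes replication
    }
}
```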
14
Simplify and Automate Archive and Tiering in Hadoop Today
• Move seldom-accessed data to storage-dense archive nodes
• Lower software licensing costs with less infrastructure
• Free resources on existing namenodes and datanodes
Who or what application is creating all these small files in the cluster?
How can we move data not accessed for 90 days to archive nodes?
How can we better plan for future scale with real Hadoop storage metrics?
Result: Better Performance, Lower Hardware Costs, Lower Software Costs
Plus: Get Necessary Storage Visibility To Answer These Questions & More with FactorData HDFSplus
15
16
17
HDFSplus
1. Query list based on size, heat, activity, and age
2. Apply a storage policy based on the custom query
3. Files are optimized during the normal balancing window
Custom Query Example:
• Move all files 120 days old and not accessed for 90 days to ARCHIVE…
• FactorData creates a data list based on the query
Automated Tiering:
• Limit an automated run by max files or capacity
• FactorData tracks completion of each run
• Data can be excluded from a run according to path, size, and application (a sketch of such a run follows below)
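To make the example rule concrete, here is a hedged sketch of what one automated run could look like with nothing but stock client calls. The 120/90-day thresholds come from the bullet above; the per-run cap, the exclusion prefix, and the decision to leave the physical moves to the regular mover window are illustrative assumptions, not FactorData's implementation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class AutomatedTieringRun {
    public static void main(String[] args) throws Exception {
        DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(new Configuration());

        long now = System.currentTimeMillis();
        long day = 24L * 60 * 60 * 1000;
        long maxFilesPerRun = 10_000;        // illustrative cap on one run
        String excludePrefix = "/data/tmp";  // illustrative exclusion by path

        long tagged = 0;
        RemoteIterator<LocatedFileStatus> it = dfs.listFiles(new Path("/data"), true);
        while (it.hasNext() && tagged < maxFilesPerRun) {
            LocatedFileStatus f = it.next();
            boolean oldEnough = now - f.getModificationTime() > 120 * day;
            boolean notRead   = now - f.getAccessTime()       > 90 * day;
            boolean excluded  = f.getPath().toUri().getPath().startsWith(excludePrefix);
            if (oldEnough && notRead && !excluded) {
                dfs.setStoragePolicy(f.getPath(), "COLD");
                tagged++;
            }
        }
        // Blocks migrate physically later, when `hdfs mover` runs in the
        // normal balancing/maintenance window.
        System.out.println("Files re-tagged COLD this run: " + tagged);
    }
}
```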