Analyzing Small Files in HDFS Cluster
Presenters: Rohit Jangid, Raman Goyal
HDFS Analysis for Small Files
Outline
▪ What are small files and their problems?
▪ Small Files Analysis
▪ Architecture
▪ FsImage Processing and Aggregation
▪ Implementation and tool
▪ Dashboards and Results
▪ Dashboards
▪ Results
▪ Future Work
▪ Conclusions
Expedia’s HDFS Cluster
HDFS Doesn’t Like Lots of Small Files…
Problem? INEFFICIENT DATA ACCESS PATTERN
MAKES JOBS SLOW...
Trivial Solution?
Solution? Compaction
BUT WHERE...?
SMALL FILES ANALYSIS
ARCHITECTURE
HDFS Cluster → Raw FsImage → Interpreted FsImage → Attributed Files and Directory information → Aggregated Files and Directory information → Storage → Dashboard
FsIMAGE PROCESSING
▪ Fetched from NameNode: HDFS Cluster → Raw FsImage
▪ OIV to LSR interpreter: Raw FsImage → Interpreted FsImage
▪ Custom OIV interpreter for reduced memory usage
▪ Processed 20 GB FsImage in ~20 minutes
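One way to keep memory usage low when reading an interpreted FsImage is to stream it row by row instead of loading it whole. The sketch below illustrates that idea only; it is not the presenters' interpreter. The tab-separated column layout (path, replication, modification date, size, owner, group, file/directory flag) and the names `FsEntry` and `parse_rows` are assumptions made for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FsEntry(NamedTuple):
    """One row of a hypothetical interpreted FsImage dump."""
    path: str
    replication: int
    modified: str
    size: int
    owner: str
    group: str
    is_dir: bool


def parse_rows(lines: TextIO) -> Iterator[FsEntry]:
    """Yield one FsEntry per non-empty line, so only one row is held at a time."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumed layout: seven tab-separated fields, last one "d" or "f".
        path, repl, modified, size, owner, group, kind = line.split("\t")
        yield FsEntry(path, int(repl), modified, int(size), owner, group, kind == "d")


# Tiny in-memory sample standing in for a multi-gigabyte interpreted file.
sample = io.StringIO(
    "/data\t0\t2017-01-01\t0\thdfs\thadoop\td\n"
    "/data/a.csv\t3\t2017-01-02\t1048576\talice\tetl\tf\n"
)
entries = list(parse_rows(sample))
```

Because `parse_rows` is a generator, a downstream step can consume rows one at a time; the `list(...)` call here is only for the small sample.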
Attribution and Aggregation
Attributes Found Directly
▪ Owner Name
▪ Group Name
▪ Size of File
▪ Replication Factor
▪ Number of Direct File objects
▪ Last Modified Date
▪ Level of File
▪ Is File or Is Directory?
Aggregated Attributes (If Directory)
▪ Number of Small File objects
▪ Number of Namespace objects
▪ Smallest, Largest, Avg File size
▪ Difference in Size since Last run
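The directory-level attributes listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the Pig/UDF code from the talk: the 10 MB small-file threshold, the input shape (pairs of path and size in bytes), and the function names `directory_stats` and `size_change` are all hypothetical.

```python
import posixpath
from collections import defaultdict

SMALL_FILE_BYTES = 10 * 1024 * 1024  # assumed threshold for a "small" file


def directory_stats(files):
    """Compute per-directory statistics over each directory's direct files.

    files: iterable of (path, size_in_bytes) pairs.
    """
    sizes = defaultdict(list)
    for path, size in files:
        sizes[posixpath.dirname(path)].append(size)

    stats = {}
    for directory, values in sizes.items():
        stats[directory] = {
            "direct_files": len(values),
            "small_files": sum(1 for v in values if v < SMALL_FILE_BYTES),
            "smallest": min(values),
            "largest": max(values),
            "average": sum(values) / len(values),
            "total_size": sum(values),
        }
    return stats


def size_change(current, previous):
    """Difference in total size per directory since the previous run."""
    return {
        directory: info["total_size"] - previous.get(directory, {}).get("total_size", 0)
        for directory, info in current.items()
    }


files = [
    ("/logs/a", 1024),
    ("/logs/b", 3072),
    ("/warehouse/t/part-0", 200 * 1024 * 1024),
]
stats = directory_stats(files)
delta = size_change(stats, {"/logs": {"total_size": 1000}})
```

A directory missing from the previous run is treated as having had size 0, so its whole current size counts as growth.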
Attribution and Aggregation
▪ Generate Small Files / Total Files metrics
▪ Roll up attributes to parent directories
▪ Custom UDFs and Pig scripts
▪ Stored in HDFS
▪ Using Sqoop: Attributed and Aggregated Files and Directory information → Storage
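Rolling attributes up to parent directories means that every file contributes to the counts of each of its ancestors, not just its immediate parent. The talk did this with Pig scripts and custom UDFs; the Python sketch below only illustrates the idea. The 10 MB threshold and the names `ancestors` and `roll_up` are assumptions.

```python
import posixpath
from collections import Counter

SMALL_FILE_BYTES = 10 * 1024 * 1024  # assumed threshold for a "small" file


def ancestors(path):
    """Yield every ancestor directory of path, from its parent up to the root."""
    current = posixpath.dirname(path)
    while True:
        yield current
        if current in ("/", ""):
            break
        current = posixpath.dirname(current)


def roll_up(files):
    """Return {directory: (small_files, total_files, small/total ratio)}.

    files: iterable of (path, size_in_bytes) pairs. Counts include files at
    any depth below the directory.
    """
    total, small = Counter(), Counter()
    for path, size in files:
        for directory in ancestors(path):
            total[directory] += 1
            if size < SMALL_FILE_BYTES:
                small[directory] += 1
    return {d: (small[d], total[d], small[d] / total[d]) for d in total}


files = [
    ("/a/b/x", 1),
    ("/a/b/y", 20 * 1024 * 1024),
    ("/a/c/z", 5),
]
rolled = roll_up(files)
```

Walking the ancestors costs work proportional to path depth per file; at the scale of hundreds of millions of objects this would be run as a distributed job rather than in a single process.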
STORAGE AND REPORTING
Storage → REST API → Dashboard
▪ Storage: relational database and REST API
▪ Dashboards: different dashboards showing user level and overall level
▪ Powered by Cyclotron: http://cyclotron.io
Implementation and Tool
▪ Download and interpret: HDFS NameNode FsImage, using the OIV interpreter
▪ Files and directories attributed: by splitting FsImage rows
▪ Small file and directory information: statistics like smallest file calculated at directory level
▪ Storage, REST API and dashboards: can easily add new clusters in the tool
DASHBOARDS AND RESULTS
Dashboards Information
▪ 3 possible bucketing models: file size less than 10 MB; between 10 MB and 70 MB; between 70 MB and ~100 MB
▪ Goes up to all levels in HDFS
▪ Distribution of owners of small files
▪ Top 10 directories to be investigated for deletion, re-partition, compaction
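The size buckets and the top-10 ranking can be sketched as below. This is a minimal sketch under assumptions, not the dashboard's actual logic: 1 MB is taken as 1024 × 1024 bytes, bucket ranges are half-open, "~100 MB" is treated as exactly 100 MB, and the names `bucket_for` and `top_directories` are hypothetical.

```python
MB = 1024 * 1024

# (label, inclusive lower bound, exclusive upper bound), in bytes.
BUCKETS = [
    ("<10 MB", 0, 10 * MB),
    ("10-70 MB", 10 * MB, 70 * MB),
    ("70-100 MB", 70 * MB, 100 * MB),
]


def bucket_for(size_bytes):
    """Return the bucket label for a file size, or None if it is not small."""
    for label, low, high in BUCKETS:
        if low <= size_bytes < high:
            return label
    return None


def top_directories(small_counts, n=10):
    """Return the n directories with the most small files, largest first."""
    return sorted(small_counts.items(), key=lambda kv: kv[1], reverse=True)[:n]


labels = [bucket_for(s) for s in (5 * MB, 10 * MB, 99 * MB, 100 * MB)]
ranking = top_directories({"/a": 5, "/b": 9, "/c": 1}, n=2)
```

Directories at the top of such a ranking are the natural first candidates for the deletion, re-partition, or compaction review mentioned above.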
Overall Dashboard containing all Information
Distribution of Owners of Small Files
Sample Directories Containing Small Files
Top 10: Files vs Small Files
Daily Small Files per Directory
Future Work
▪ The tool doesn’t yet have real-time analysis with alerting.
▪ The cluster has 200+ million namespace objects, which we get as a memory dump from the Hadoop server.
▪ Translating and attributing each directory and file is a time-consuming process.
▪ Developing a customisable compaction utility.
Conclusions
EDWPMonitoring@expedia.com
