HDFS Analysis for Small Files

Analyzing Small Files in HDFS Cluster
Presenters: Rohit Jangid, Raman Goyal

Outline
▪ What are small files and their problems?
▪ Small Files Analysis
▪ Architecture
▪ FsImage Processing and Aggregation
▪ Implementation and tool
▪ Dashboards and Results
▪ Dashboards
▪ Results
▪ Future Work
▪ Conclusions

HDFS Doesn’t Like Lots of Small Files…
Problem?

INEFFICIENT DATA ACCESS PATTERN

ARCHITECTURE
▪ HDFS Cluster
▪ Raw FsImage
▪ Interpreted FsImage (LSR)
▪ Attributed Files and Directory information
▪ Aggregated Files and Directory information
▪ Storage
▪ Dashboard

FsIMAGE PROCESSING
▪ FsImage fetched from the NameNode and fed through OIV to the LSR interpreter
▪ Custom OIV interpreter for reduced memory usage
▪ Processed a 20 GB FsImage in ~20 minutes
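A minimal sketch of how interpreted FsImage rows could be consumed one at a time, so that memory use stays flat regardless of image size. It assumes the LSR output resembles `ls -lR` rows (permissions, replication, owner, group, size, date, time, path); that column layout, and the names `Entry`, `parse_lsr_line` and `iter_entries`, are illustrative assumptions rather than the presenters' interpreter.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Entry(NamedTuple):
    path: str
    is_dir: bool
    replication: int  # 0 for directories (assumed to print "-")
    owner: str
    group: str
    size: int         # bytes
    modified: str     # "YYYY-MM-DD HH:MM"


def parse_lsr_line(line: str) -> Optional[Entry]:
    """Parse one ls -lR style row; return None for blank or malformed rows."""
    # maxsplit=7 keeps a path that contains spaces in one piece.
    parts = line.split(None, 7)
    if len(parts) != 8:
        return None
    perms, repl, owner, group, size, date, time, path = parts
    try:
        return Entry(
            path=path.rstrip("\n"),
            is_dir=perms.startswith("d"),
            replication=0 if repl == "-" else int(repl),
            owner=owner,
            group=group,
            size=int(size),
            modified=f"{date} {time}",
        )
    except ValueError:
        # Non-numeric replication or size: treat the row as malformed.
        return None


def iter_entries(lines: Iterable[str]) -> Iterator[Entry]:
    """Stream entries lazily instead of loading the whole dump."""
    for line in lines:
        entry = parse_lsr_line(line)
        if entry is not None:
            yield entry
```

Passing an open file object to `iter_entries` reads the dump line by line, which is the point of the sketch: nothing larger than one row is held at a time.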


Attribution and Aggregation
Attributes Found Directly
▪ Owner Name
▪ Group Name
▪ Size of File
▪ Replication Factor
▪ Number of Direct File objects
▪ Last Modified Date
▪ Level of File
▪ Is File or Is Directory?
Aggregated Attributes (If Directory)
▪ Number of Small File objects
▪ Number of Namespace objects
▪ Smallest, Largest, Avg File size
▪ Difference in Size since Last run
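The aggregated attributes listed above can be sketched as running per-directory statistics over each directory's direct files. This is an illustration only: the 10 MB small-file threshold, the `(path, size)` input shape and the function names `direct_stats` and `size_delta` are assumptions, not the presenters' implementation.

```python
import posixpath

SMALL_FILE_BYTES = 10 * 1024 * 1024  # assumed threshold for a "small" file


def direct_stats(files, small_bytes=SMALL_FILE_BYTES):
    """files: iterable of (absolute_path, size_in_bytes) for files only.

    Returns {directory: stats} covering each directory's direct files.
    """
    stats = {}
    for path, size in files:
        s = stats.setdefault(posixpath.dirname(path), {
            "files": 0, "small_files": 0, "total_bytes": 0,
            "smallest": size, "largest": size,
        })
        s["files"] += 1
        s["small_files"] += int(size < small_bytes)
        s["total_bytes"] += size
        s["smallest"] = min(s["smallest"], size)
        s["largest"] = max(s["largest"], size)
    for s in stats.values():
        s["average"] = s["total_bytes"] / s["files"]
    return stats


def size_delta(current, previous):
    """Difference in total bytes per directory since the last run.

    Directories missing from the previous run count from zero.
    """
    return {
        d: s["total_bytes"] - previous.get(d, {}).get("total_bytes", 0)
        for d, s in current.items()
    }
```

Keeping only running minimum, maximum and totals avoids storing every file size, which matters when the namespace holds hundreds of millions of objects.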

Attribution and Aggregation
▪ Generate Small Files / Total Files Metrics
▪ Roll-up Attributes to Parent Directories
▪ Custom UDFs and Pig Scripts
▪ Aggregated Files and Directory information stored in HDFS
▪ Exported to Storage using Sqoop
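The deck does this step with custom UDFs and Pig scripts; the Python below is a swapped-in sketch of the same idea, not that code. It rolls file counts up to every ancestor directory and derives a small files / total files ratio. The 10 MB threshold and the names `ancestors` and `roll_up` are assumptions for illustration, and paths are assumed to be absolute.

```python
from collections import Counter
import posixpath

SMALL_FILE_BYTES = 10 * 1024 * 1024  # assumed threshold for a "small" file


def ancestors(path):
    """Yield each ancestor directory of an absolute path, nearest first."""
    while path not in ("/", ""):
        path = posixpath.dirname(path)
        yield path


def roll_up(files, small_bytes=SMALL_FILE_BYTES):
    """files: iterable of (absolute_path, size_in_bytes) for files only.

    Returns {directory: (small_files, total_files, small_ratio)} where the
    counts include files in all nested subdirectories.
    """
    total = Counter()
    small = Counter()
    for path, size in files:
        for directory in ancestors(path):
            total[directory] += 1
            if size < small_bytes:
                small[directory] += 1
    return {d: (small[d], total[d], small[d] / total[d]) for d in total}
```

Each file touches one counter per path level, so the cost grows with directory depth rather than with the number of directories.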


STORAGE AND REPORTING
▪ Relational Database and REST API Dashboards
▪ Different Dashboards Showing User Level and Overall Level
▪ Storage → REST API → Dashboard
▪ Powered by Cyclotron: http://cyclotron.io

Implementation and Tool
▪ Download and Interpret: FsImage from the HDFS NameNode, using OIV Interpreter
▪ Files and Directories Attributed: by splitting FsImage rows
▪ Small file & Directory information: at Directory level, statistics like Smallest File calculated
▪ Storage, REST API and Dashboards: can easily add new Clusters in Tool

Dashboards Information
▪ 3 possible bucketing models
  ▪ For file size less than 10 MB
  ▪ For file size between 10 MB and 70 MB
  ▪ For file size between 70 MB and ~100 MB
▪ Goes up to all levels in HDFS
▪ Distribution of owners of small files
▪ Top 10 directories to be investigated for deletion, re-partition, compaction
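A minimal sketch of the bucketing and Top 10 selection described above. It reads the three size ranges as adjacent buckets; whether each boundary is inclusive, the use of binary megabytes, and the names `bucket_for` and `top_directories` are all assumptions made for illustration.

```python
MB = 1024 * 1024

# (exclusive upper bound in bytes, label); boundaries are assumed.
BUCKETS = [
    (10 * MB, "<10 MB"),
    (70 * MB, "10-70 MB"),
    (100 * MB, "70-100 MB"),
]


def bucket_for(size):
    """Return the bucket label for a file size, or None if it is not small."""
    for upper, label in BUCKETS:
        if size < upper:
            return label
    return None


def top_directories(small_counts, n=10):
    """small_counts: {directory: number_of_small_files}.

    Returns the n directories with the most small files, largest first.
    """
    ranked = sorted(small_counts.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]
```

The ranked list is what would feed a "directories to investigate" view; candidates for deletion, re-partitioning or compaction still need a human decision.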

Overall Dashboard containing all Information

Distribution of Owners of Small Files

Sample Directories Containing Small Files

Top 10: Files vs Small Files

Daily Small Files per Directory

Future Work
1. Real-time analysis with alerting. The tool doesn’t have real-time analysis yet: the cluster has 200+ million namespace objects that we get as a memory dump from the Hadoop server, and translating and attributing each directory and file is a time-consuming process.
2. Developing a customisable compaction utility.

Conclusions
Contact: EDWPMonitoring@expedia.com
