Hadoop Summit 2010 Data Management On Grid

Data Management on Hadoop
@ Yahoo!

Srikanth Sundarrajan
Principal Engineer

Why is Data Management important?
• Large datasets are an incentive for users to come to the grid
• Volume of data movement
• Cluster access / partitioning (Research &
Production purposes)
• Resource consumption
• SLAs on data availability
• Data Retention
• Regulatory compliance
• Data conversion

Data volumes

• Steady growth in data volumes (data movement into the grid per day)

[Chart: daily data movement into the grid, in TB; y-axis from 0 to 40]

Data Acquisition Service

[Diagram: the Data Acquisition Service moves data from a Source to target Clusters 1, 2 and 3, each with its own JT and HDFS]

• Replication & Retention are additional services that handle cross-cluster data movement and data purge, respectively

Pluggable interfaces

• Different warehouses may use different
interfaces to expose data (e.g. http, scp, ftp or
some proprietary mechanism)
• The acquisition service should be generic and able
to plug in new interfaces easily to support
newer warehouses
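
One way to picture the plug-in idea is a registry keyed by the URI scheme of the feed. This is a minimal sketch, not the service's actual design: the class names, the `register` decorator, `plugin_for` and the example URIs are all hypothetical, and the `list_files` bodies are placeholders rather than real protocol calls.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class SourceInterface(ABC):
    """Hypothetical plugin contract: how to list files exposed by one warehouse."""

    @abstractmethod
    def list_files(self, feed_uri):
        ...


# Registry mapping a URI scheme to the plugin class that handles it.
_PLUGINS = {}


def register(scheme):
    """Class decorator that registers a plugin for a URI scheme."""
    def wrap(cls):
        _PLUGINS[scheme] = cls
        return cls
    return wrap


@register("http")
class HttpSource(SourceInterface):
    def list_files(self, feed_uri):
        # Placeholder: a real plugin would query the warehouse over HTTP.
        return [feed_uri + "/part-00000"]


@register("ftp")
class FtpSource(SourceInterface):
    def list_files(self, feed_uri):
        # Placeholder: a real plugin would list the remote FTP directory.
        return [feed_uri + "/part-00000"]


def plugin_for(feed_uri):
    """Pick the plugin from the URI scheme; fail loudly for unknown schemes."""
    scheme = urlparse(feed_uri).scheme
    if scheme not in _PLUGINS:
        raise ValueError("no plugin registered for scheme %r" % scheme)
    return _PLUGINS[scheme]()
```

Under this sketch, supporting a newer warehouse means adding one registered class; the generic acquisition code that calls `plugin_for` stays unchanged.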

Data load & conversion

• Heavy lifting delegated to map-reduce jobs,
keeping the acquisition service light
• Data load executed as a map-reduce job
• Data conversion executed as a map-reduce job (to enable
faster data processing after acquisition)
– Fields inclusion/removal
– Data filtering
– Data Anonymization
– Data format conversion (raw delimited / Hadoop
sequence file)
• Cluster to cluster copy is a map-reduce job
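
The conversion steps listed above can be sketched as a per-record map function. This is an illustration under assumptions: the function name, its parameters and the sample schema are hypothetical, and a truncated SHA-256 hash stands in for whatever anonymization the service actually applies (an unsalted hash is not strong anonymization on its own).

```python
import hashlib


def convert_record(line, schema, keep, anonymize, predicate, delimiter="\t"):
    """Convert one raw delimited record on the map side.

    schema:    field names, in the order they appear in the raw line
    keep:      field names to include in the output, in output order
    anonymize: field names whose values are replaced by a hash
    predicate: callable(record dict) -> bool; False drops the record
    Returns the converted line, or None if the record is filtered out.
    """
    values = line.rstrip("\n").split(delimiter)
    record = dict(zip(schema, values))

    # Data filtering
    if not predicate(record):
        return None

    out = []
    for name in keep:                      # field inclusion/removal
        value = record[name]
        if name in anonymize:              # data anonymization (stand-in)
            value = hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]
        out.append(value)
    return delimiter.join(out)
```

In a real job, a function like this would run inside the mapper, with the output written in the target format (for example a Hadoop sequence file) instead of delimited text.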

Warehouse & Cluster isolation

• Source warehouses have diverse capacity,
often constrained
• Different clusters can have different versions
of Hadoop and cluster performance may not
be uniform
• Need for isolation at a warehouse & cluster
level and resource usage limits at a warehouse
level

Job throttling

[Diagram: discovery threads place work on a queue per source; job execution threads take work from the queues and, after resource negotiation, launch asynchronous map-reduce jobs on Cluster 1 through Cluster N]
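
The queue-per-source arrangement can be sketched with standard threads and queues. This is a simplified sketch under assumptions: `run_throttled` and its parameters are hypothetical names, discovery is replaced by a pre-built dictionary, and `submit_job` blocks until the job finishes, whereas the slide describes asynchronous submission after resource negotiation.

```python
import queue
import threading


def run_throttled(discovered, limits, submit_job):
    """Run every discovered feed with at most limits[source] jobs in flight per source.

    discovered: dict of source name -> list of feed names (stand-in for discovery threads)
    limits:     dict of source name -> maximum concurrent jobs for that source
    submit_job: callable(source, feed) that blocks until the job finishes
    """
    workers = []
    for source, feeds in discovered.items():
        pending = queue.Queue()            # one queue per source
        for feed in feeds:
            pending.put(feed)

        def drain(source=source, pending=pending):
            # Job execution thread: take feeds until this source's queue is empty.
            while True:
                try:
                    feed = pending.get_nowait()
                except queue.Empty:
                    return
                submit_job(source, feed)

        # The number of execution threads per source acts as the throttle.
        for _ in range(limits[source]):
            thread = threading.Thread(target=drain)
            thread.start()
            workers.append(thread)

    for thread in workers:
        thread.join()
```

Because each source has its own queue and its own threads, a slow or constrained warehouse only delays its own feeds.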

Other things in consideration

• SLA, feed priority and frequency are taken into
consideration when scheduling data loads
• Retention to remove old data (as required for
legal compliance and for capacity purposes)
• Interoperability across Hadoop versions
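
As an illustration of the retention point, the selection of data to purge can be sketched as a cutoff on partition dates. The function name and the assumption that data is organised in dated partitions are hypothetical; the slides do not say how the retention service identifies old data.

```python
from datetime import date, timedelta


def expired_partitions(partitions, retention_days, today):
    """Return, oldest first, the partition dates that fall outside the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partitions if d < cutoff)
```

A retention service built this way would delete the returned partitions and leave everything on or after the cutoff date in place.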
