Data Management on Hadoop @ Yahoo!

Srikanth Sundarrajan
Principal Engineer

Why is Data Management important?
• Large datasets are an incentive for users to come to
  the grid
• Volume of data movement
• Cluster access / partitioning (Research &
  Production purposes)
• Resource consumption
• SLAs on data availability
• Data Retention
• Regulatory compliance
• Data conversion
Data volumes

• Steady growth in data volumes (data
  movement per day, into the grid)

[Chart: data movement into the grid per day, in TB; y-axis from 0 to 40]
Data Acquisition Service

[Diagram: Source -> Data Acquisition Service -> Targets (Cluster 1,
Cluster 2, Cluster 3; each cluster has its own JT and HDFS)]

• Replication & Retention are additional
  services that handle cross-cluster data
  movement and data purge respectively
Pluggable interfaces

• Different warehouses may expose data through
  different interfaces (e.g. HTTP, SCP, FTP or
  some proprietary mechanism)
• The acquisition service should be generic and
  make it easy to plug in new interfaces to
  support newer warehouses
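
One way to picture the pluggable design is a small registry keyed by URI scheme. This is a minimal sketch, not the service's actual API: `WarehouseInterface`, `register`, `interface_for` and the in-memory `mem` plugin are all hypothetical names introduced for illustration.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class WarehouseInterface(ABC):
    """Hypothetical plugin contract: one implementation per transfer mechanism."""

    @abstractmethod
    def fetch(self, uri):
        """Return the raw bytes stored at the given URI."""


# Maps a URI scheme (e.g. "http", "ftp") to the plugin class that handles it.
_REGISTRY = {}


def register(scheme):
    """Class decorator that plugs an interface in under a URI scheme."""
    def decorator(cls):
        _REGISTRY[scheme] = cls
        return cls
    return decorator


def interface_for(uri):
    """Pick the plugin for a URI; the caller never names a concrete class."""
    scheme = urlparse(uri).scheme
    try:
        return _REGISTRY[scheme]()
    except KeyError:
        raise ValueError(f"no interface registered for scheme {scheme!r}") from None


@register("mem")
class InMemoryInterface(WarehouseInterface):
    """Stand-in plugin used only for this example (assumption, not a real source)."""

    DATA = {"mem://warehouse-a/feed1": b"row1\nrow2\n"}

    def fetch(self, uri):
        return self.DATA[uri]
```

Under these assumptions, supporting a newer warehouse means adding one decorated class; the code that calls `interface_for(uri).fetch(uri)` stays unchanged.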
Data load & conversion

• Heavy lifting is delegated to MapReduce jobs,
  keeping the acquisition service light
• Data load is executed as a MapReduce job
• Data conversion runs as a MapReduce job (to enable
  faster data processing post acquisition)
  –   Field inclusion/removal
  –   Data filtering
  –   Data anonymization
  –   Data format conversion (raw delimited / Hadoop
      sequence file)
• Cluster-to-cluster copy is a MapReduce job
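
The conversion steps listed above can be sketched as a map-only function over delimited records. This is plain Python standing in for the map phase, not Hadoop code; the function names, the tab delimiter and the truncated SHA-256 used for anonymization are assumptions chosen for illustration (a real deployment would likely need a salted or keyed scheme).

```python
import hashlib


def convert_record(line, field_names, keep, anonymize, predicate, delimiter="\t"):
    """Convert one raw delimited record; return None if it is filtered out."""
    values = line.rstrip("\n").split(delimiter)
    record = dict(zip(field_names, values))

    # Data filtering: drop records the predicate rejects.
    if not predicate(record):
        return None

    out = []
    # Field inclusion/removal: only fields named in `keep` survive.
    for name in keep:
        value = record[name]
        # Data anonymization: replace the value with a one-way digest.
        if name in anonymize:
            value = hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]
        out.append(value)
    return delimiter.join(out)


def map_phase(lines, **options):
    """Apply convert_record to every input line, as a mapper would."""
    for line in lines:
        converted = convert_record(line, **options)
        if converted is not None:
            yield converted
```

For example, calling `map_phase` with `field_names=["user", "status", "path"]`, `keep=["user", "path"]`, `anonymize={"user"}` and a predicate that accepts only status `"200"` would emit the hashed user and the path for matching rows and drop the rest.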
Warehouse & Cluster isolation

• Source warehouses have diverse capacity,
  often constrained
• Different clusters can have different versions
  of Hadoop and cluster performance may not
  be uniform
• Need for isolation at a warehouse & cluster
  level and resource usage limits at a warehouse
  level
Job throttling

[Diagram: Discovery -> discovery threads -> queue per source -> job
execution threads -> async MapReduce job post resource negotiation ->
Cluster 1 to Cluster N]
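
The queue-per-source idea can be sketched with standard threads and queues. This is a minimal sketch under assumptions: `SourceThrottle` and its methods are hypothetical, discovery and resource negotiation are left out, and each job is just a callable standing in for a MapReduce job submission.

```python
import queue
import threading


class SourceThrottle:
    """One queue per source, drained by a fixed number of execution threads.

    The thread count per source caps how many jobs can hit that source at
    once, so a slow or constrained warehouse cannot starve the others.
    """

    def __init__(self, limits):
        # limits: source name -> maximum number of concurrent jobs
        self.queues = {source: queue.Queue() for source in limits}
        self.failures = []  # (source, exception) pairs from failed jobs
        for source, limit in limits.items():
            for _ in range(limit):
                threading.Thread(
                    target=self._execute, args=(source,), daemon=True
                ).start()

    def submit(self, source, job):
        """Enqueue a job (a zero-argument callable) for the given source."""
        self.queues[source].put(job)

    def _execute(self, source):
        jobs = self.queues[source]
        while True:
            job = jobs.get()
            try:
                job()
            except Exception as exc:
                # Record the failure and keep the worker alive.
                self.failures.append((source, exc))
            finally:
                jobs.task_done()

    def wait(self):
        """Block until every queued job has finished."""
        for jobs in self.queues.values():
            jobs.join()
```

With `SourceThrottle({"warehouse-a": 2, "warehouse-b": 1})`, at most two jobs run against `warehouse-a` at any moment regardless of how many are queued.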
Other things in consideration

• SLA, feed priority & frequency are taken into
  account when scheduling data loads
• Retention removes old data (as required for
  legal compliance and for capacity purposes)
• Interoperability across Hadoop versions
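
The retention rule can be illustrated as a pure selection step. This is only a sketch: it assumes data is organized into daily partitions identified by date, and `partitions_to_purge` is a hypothetical helper, not part of the retention service described here.

```python
from datetime import timedelta


def partitions_to_purge(partitions, retention_days, today):
    """Return the partition dates that fall outside the retention window.

    partitions: iterable of datetime.date, one per daily partition (assumed layout)
    retention_days: how many days of data must be kept
    today: the date the purge is evaluated on
    """
    cutoff = today - timedelta(days=retention_days)
    # Anything strictly older than the cutoff is eligible for removal.
    return sorted(p for p in partitions if p < cutoff)
```

Keeping the selection separate from the actual delete makes the policy easy to check before any data is removed.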
Thanks!

Hadoop Summit 2010 Data Management On Grid
