Data Management on Hadoop @ Y!
Seetharam Venkatesh (venkatesh@innerzeal.com)
Hadoop Data Infrastructure Lead/Architect
Agenda
1 Introduction
2 Challenging Data Landscape
3 The Solution
4 Future opportunities
Introduction
Why is Data Management Critical?
Challenging Data Landscape
[Diagram: data landscape spanning data warehouses, databases, NAS, and tape]
Data Acquisition
Challenge → Solution
• Steady growth in data volumes → Heavy lifting delegated to map-only jobs
• Diverse Interfaces, API contracts → Pluggable interfaces, Adaptors for specific API
• Data Sources have diverse serving capacity → Throttling enables uniform load on the Data Source
• Data Source Isolation → Asynchronous Scheduling, progress monitored per source
• Varying Data Formats, file sizes and long tails, failures → Conversion as map-reduce job; Coalesce, Chunking, Checkpoint
• Data Quality → Pluggable validations
• BCP → Supports Hot-Hot, Hot-Warm
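To make the "pluggable interfaces" row concrete, the acquisition service can be pictured as loading one adaptor per source protocol. The sketch below is a hedged illustration only; the interface name and methods are hypothetical, not the actual Yahoo! API.

```java
// Hypothetical adaptor contract for plugging new data sources into an
// acquisition service; names are illustrative, not the actual Yahoo! code.
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

public interface DataSourceAdaptor {

  /** Protocol this adaptor speaks, e.g. "http", "scp", "ftp". */
  String protocol();

  /** List the files available for a feed at the source. */
  List<String> listFiles(String feed) throws IOException;

  /** Open one source file; a map task streams it into HDFS. */
  InputStream open(String file) throws IOException;

  /** Maximum concurrent connections the source can serve (used for throttling). */
  int maxConcurrentConnections();
}
```

Under this reading, supporting a new source only means supplying an implementation of the contract; the rest of the acquisition pipeline stays unchanged.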
Data Replication
Challenge → Solution
• Steady growth in data volumes → Heavy lifting delegated to map-only jobs (DistCp v2)
• Cluster proximity, availability → Tree of copies with at most one cross-datacenter copy
• Long tails → Dynamic split assignment; each map picks up only one file at a time (DistCp v2)
• Data Export → Export as Replication target (push); ad hoc uses HDFS Proxy (pull)
[Diagram: replication between datacenters]
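One way to read the "tree of copies with at most one cross-datacenter copy" row: the source cluster copies a dataset once into each remote colo, and further copies fan out locally inside that colo. A minimal sketch under that assumption; class and method names are illustrative, not the deck's implementation.

```java
// Hypothetical sketch of planning a "tree of copies" with at most one
// cross-datacenter copy per colo: the source cluster copies once into each
// remote colo, and the remaining copies fan out locally within that colo.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public final class ReplicationPlanner {

  /** One copy step from a source cluster to a target cluster. */
  public record Copy(String fromCluster, String toCluster) {}

  private ReplicationPlanner() {}

  public static List<Copy> plan(String sourceCluster, String sourceColo,
                                Map<String, List<String>> targetClustersByColo) {
    List<Copy> copies = new ArrayList<>();
    for (Map.Entry<String, List<String>> colo : targetClustersByColo.entrySet()) {
      List<String> clusters = colo.getValue();
      if (clusters.isEmpty()) {
        continue;
      }
      boolean local = colo.getKey().equals(sourceColo);
      // In a remote colo, the first target receives the single cross-datacenter copy.
      String landing = local ? sourceCluster : clusters.get(0);
      if (!local) {
        copies.add(new Copy(sourceCluster, landing));
      }
      // Every other cluster in the colo is fed from the landing cluster, locally.
      for (String cluster : clusters) {
        if (!cluster.equals(landing)) {
          copies.add(new Copy(landing, cluster));
        }
      }
    }
    return copies;
  }
}
```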
Data Lifecycle Management
Challenge → Solution
• Aging data expires → Retention to remove old data (as required for legal compliance and for capacity purposes)
• Data Privacy → Anonymization of Personally Identifiable Information
• SOX Compliance & Audit → Archival/restoration to/from tape (13 months)
• SEC Compliance & Audit → Archival/restoration to/from tape (7 years)
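A retention pass of the kind the first row describes can be sketched with the standard Hadoop FileSystem API: list a feed's instance directories and delete those older than the retention period. The paths, directory layout, and 13-month example below are assumptions for illustration, not the actual service.

```java
// Hedged sketch of a retention pass: delete feed instances whose modification
// time is older than the configured retention period. Layout is hypothetical.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class RetentionJob {

  private RetentionJob() {}

  /** Remove instances of a feed that have aged past the retention window. */
  public static void enforce(Configuration conf, Path feedRoot, long retentionMillis)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    long cutoff = System.currentTimeMillis() - retentionMillis;
    for (FileStatus instance : fs.listStatus(feedRoot)) {
      if (instance.isDirectory() && instance.getModificationTime() < cutoff) {
        // Recursive delete of the expired instance directory.
        fs.delete(instance.getPath(), true);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // e.g. keep roughly 13 months of a feed, as in the SOX row above.
    long thirteenMonths = 13L * 30 * 24 * 60 * 60 * 1000;
    enforce(new Configuration(), new Path("/data/feeds/example"), thirteenMonths);
  }
}
```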
Operability, Manageability
Challenge → Solution
• Monitor and administer data loading across clusters, colos → Central dashboard for monitoring and administration; integrated view of jobs running across clusters, colos
• Interoperability across incompatible Hadoop versions → Support various Hadoop versions using a Reverse Class loader; one data loading instance per colo that can work across clusters
• Maintenance windows, failures; system shutdown → Partial copy + auto resume; automatic resume upon restart; SLA management + introspection via metrics
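The "partial copy + auto resume" row can be read as checkpointing which files of a feed instance have already landed, so a restart skips them. A hedged sketch follows; the marker-file scheme is purely illustrative and not described in the deck.

```java
// Illustrative sketch of partial copy + auto resume: record a marker per
// completed file so a restarted workflow skips files already copied.
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class ResumableCopy {

  private ResumableCopy() {}

  /** Copy each source file, skipping ones marked done by a previous run. */
  public static void copyWithResume(Configuration conf, List<Path> sources,
                                    Path checkpointDir) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    fs.mkdirs(checkpointDir);
    for (Path src : sources) {
      Path marker = new Path(checkpointDir, src.getName() + ".done");
      if (fs.exists(marker)) {
        continue;                 // copied in an earlier, interrupted run
      }
      copyOneFile(fs, src);       // the actual transfer (a map task in practice)
      fs.createNewFile(marker);   // record completion so a restart skips it
    }
  }

  private static void copyOneFile(FileSystem fs, Path src) {
    // Placeholder: in the real system a map-reduce job performs the transfer.
  }
}
```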
Architecture
Highlights
• “Workflows” abstraction over MR Jobs
• More workflows than Oozie within Y!
• Amounts to >30% of jobs launched on the clusters
• Occupies less than 10% of cluster capacity (slots)
• Solves recurring batch data transfers
• 2300+ feeds with varying periodicity (5 minutes to monthly)
• 100+ TB/day of data movement
• SLAs
• Central Dashboard
• SLA monitoring with ETA on feeds
Future
Editor's Notes
  • #2: Hello, this is Venkatesh, working as part of the greater Hadoop team at Yahoo in Bangalore. I'm going to talk about how we manage data into and out of our Hadoop clusters at Yahoo.
  • #3: I’ll cover what we mean by Data Management and why is it critical for us at Yahoo.I’ll look at the challenges in the space and then the solutions for the major components.Will briefly go over what we are thinking ahead.
  • #4: Data Management is a suite of services that manage the lifecycle of data on the Hadoop clusters. The focus is on bulk data, usually recurring in batches, agnostic to the data itself; the aim is to provide these services with strong SLA and quality guarantees. Data Management can broadly be decomposed into two major areas, Data Movement and Lifecycle Management.
    - Data Movement is further decomposed into the following services and components:
      - Data Acquisition Service: movement of data into Grids.
      - Data Outbound Service & Interfaces: movement of data out of Grids.
      - Data Replication: replicate data across Grids.
    - Data Lifecycle Management:
      - Data Retention: cleaning of expired data.
      - A29 [Data formats]: creates anonymized versions of raw datasets; a set of rules governing the anonymization of personally-identifiable information in our data.
      - Compliance Archival: archival and restoration of data on Grids.
      - Policy Management (ACL): independent retention and replication policies for both the raw and anonymized versions of datasets.
      - Compaction
      - Restatement
  • #5:
    - Productivity gains: large datasets are incentives for users to come to Hadoop; SLAs on data availability and data quality guarantees; users focus on utilizing the data and not how to get it.
    - Data loading optimized for space, time, and bandwidth: volume of data movement; resource consumption; impedance mismatch between producers and HDFS.
    - Regulatory compliance: we cannot rely on users to adhere to data governance policies (SEC, SOX, A29 of PII).
  • #6:
    - Producers/Consumers:
      - Multiple data sources
      - Various interfaces & API contracts (e.g. HTTP, scp, FTP, or some proprietary mechanism)
      - Performance may not be uniform
      - Each source has a different capacity to serve
      - Maintenance windows
      - Mutable tables
    - Hadoop:
      - Clusters can be running on various versions which are incompatible
      - Consumers of data and downstreamers are SLA-sensitive
    - Data Management:
      - Provide a secure, robust, and efficient way to manage data on Hadoop
      - Bridge producers and consumers of data
      - Automatic scaling
      - Data quality
      - Manageability
    - Data Sources (Producers/Consumers):
      - Diverse interfaces, API contracts
      - Varying authentication methods
      - Data sources have diverse serving capacity (QoS)
      - Data source isolation (maintenance windows)
      - Data quality, stats
      - Steady growth in data volumes
      - Wide array of data producers (Sherpa hosting 2 GB each on 1000 hosts; WCC hosting log files on 2000 nodes)
      - SLA
      - Incompatible versions (Hadoop only)
    - Data:
      - Batch (staged), streaming (event-level data)
      - Structured, unstructured (video stream, force fit into a blob?)
      - Mutable, immutable (dimension data)
      - Full dump vs. deltas
      - Encoding impedance mismatch (export)
      - Varies in format and size, resulting in long tails
      - Access frequency drops with age
    - Data Movement:
      - Periodic, ad hoc
      - SLA, latency requirements: high, low
      - Mechanism: push vs. pull
      - BCP requirements
      - Replication (copy within and across colos)
      - Quota policy; one feed can choke the pipe
    - Data Lifecycle:
      - Cannot rely on users to adhere to data governance policies
      - Regulatory compliance
      - Retention policy
      - Restoration
      - Restatements, reprocessing
      - Compaction
      - RAID, reduced replication factor
    - Operability:
      - Single integrated view of data loading jobs
      - SLA management + introspection
      - Partial copy & auto-resume
      - Automatic resume upon restart
      - Quota policy enforcement
  • #7:
    - Data volumes are growing: the heavy lifting is delegated to map-reduce jobs, keeping the acquisition service light.
    - Different sources may use different interfaces to expose data (e.g. HTTP, scp, FTP, or some proprietary mechanism): the acquisition service is generic and can plug in interfaces easily to support newer data sources. We had three choices for where to host the adaptors: as part of MR jobs, on the client, or on a bank of proxies.
    - Each data source has a different capacity to serve, and resource limits are to be honored while pulling data (both for acquisition and replication):
      - Queue per data source (this allows datasets from other sources to make progress even if one source is slow).
      - Every load explicitly states resource requirements that have to be satisfied before the load can proceed; if resources are unavailable, it is blocked until they become available.
      - One resource of significance is "bandwidth", defined between a source and a target colo. The number of concurrent tasks used to pull is decided by the bandwidth requested for the feed. If the requested bandwidth is not available, the workflow waits until other workflows holding the resource relinquish it. This ensures the sources are never overwhelmed.
      - Bandwidth-based throttling is enforced for both acquisition and replication.
    - Data source performance may not be uniform: there is a mix of slow and well-performing JobTrackers, clusters' maintenance windows have to be factored in, and SLA, feed priority, and frequency differ. All Hadoop jobs are scheduled asynchronously and their progress is monitored independently, so issues with one cluster (e.g. a slow JobTracker or JobTracker OOM issues) don't affect loading to other clusters.
    - Source data can be in multiple formats: data conversion runs as a map-reduce job (to enable faster data processing post acquisition).
    - File sizes typically vary significantly at the source: coalescing and chunking give an even spread for processing in MR jobs, better locality for jobs, and no long tails; gzip is not splittable, but every map now gets equal 128 MB gzip files to process.
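The bandwidth point above maps directly to a small calculation: the number of concurrent pull tasks granted to a feed follows from the bandwidth requested between the source and target colo. A hypothetical sketch; the per-task throughput constant and names are assumptions, not Yahoo!'s code.

```java
// Hypothetical sketch of bandwidth-based throttling: derive the cap on
// concurrent pull tasks for a feed from its requested bandwidth.
public final class BandwidthThrottle {

  // Assumed sustained throughput of a single pull task, in MB/s.
  private static final double MB_PER_SEC_PER_TASK = 10.0;

  private BandwidthThrottle() {}

  // Requested bandwidth (MB/s) between a source and a target colo
  // translates into a cap on concurrent map tasks for the feed.
  public static int concurrentTasksFor(double requestedMbPerSec) {
    int tasks = (int) Math.floor(requestedMbPerSec / MB_PER_SEC_PER_TASK);
    return Math.max(1, tasks);
  }

  public static void main(String[] args) {
    // A feed granted 200 MB/s between source and target colo gets ~20 pull tasks.
    System.out.println(concurrentTasksFor(200.0)); // prints 20
  }
}
```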
  • #8: Copying dataset instances from cluster to cluster, within and across colos.
    - Non-trivial workflow generation, accommodating cluster availability (maintenance windows) and cluster proximity (colos).
    - Chaining replications immediately after acquisition via web hooks.
    - A single replication instance talks to grids running multiple Hadoop versions.
    - Maintenance window handling; partial and failed workflows.
    - DistCp: the DynamicInputFormat implements org.apache.hadoop.mapreduce.InputFormat and is new to DistCp. The listing file is split into several "chunk files", the exact number of chunk files being a multiple of the number of maps requested for the Hadoop job. Each map task is "assigned" one of the chunk files (by renaming the chunk to the task's ID) before the job is launched. Paths are read from each chunk using the DynamicRecordReader and processed in the CopyMapper. After all the paths in a chunk are processed, the current chunk is deleted and a new chunk is acquired. The process continues until no more chunks are available. This "dynamic" approach allows faster map tasks to consume more paths than slower ones, thus speeding up the DistCp job overall.
    - HDFS Proxy: HDFS read/write access from other platforms (via HTTP REST); decouples cluster upgrades from client upgrades.
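The DynamicInputFormat description above boils down to work stealing via atomic renames: a map task claims a chunk file by renaming it, processes the paths it lists, deletes it, and claims the next one. The sketch below illustrates that idea with the public FileSystem API; it is a simplification with hypothetical names, not the actual DistCp code.

```java
// Illustrative sketch (not the actual DistCp implementation) of dynamic chunk
// acquisition: each map task claims chunk files by renaming them, so faster
// tasks naturally process more chunks than slower ones.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DynamicChunkWorker {

  // Claim and process chunks until none are left.
  public static void run(Configuration conf, Path chunkDir, String taskId) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    while (true) {
      Path claimed = claimNextChunk(fs, chunkDir, taskId);
      if (claimed == null) {
        break;                       // no more chunks: this task is done
      }
      copyPathsInChunk(fs, claimed); // copy every source path listed in the chunk
      fs.delete(claimed, false);     // chunk fully processed; remove it
    }
  }

  // A rename is atomic in HDFS, so only one task can own a given chunk.
  private static Path claimNextChunk(FileSystem fs, Path chunkDir, String taskId) throws IOException {
    for (FileStatus stat : fs.listStatus(chunkDir)) {
      Path candidate = stat.getPath();
      if (candidate.getName().contains(".owned.")) {
        continue;                    // already claimed by some task
      }
      Path owned = new Path(chunkDir, candidate.getName() + ".owned." + taskId);
      if (fs.rename(candidate, owned)) {
        return owned;                // we won the race for this chunk
      }
    }
    return null;
  }

  private static void copyPathsInChunk(FileSystem fs, Path chunk) throws IOException {
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(chunk)))) {
      String srcPath;
      while ((srcPath = reader.readLine()) != null) {
        // In real DistCp, the CopyMapper copies srcPath to the target cluster here.
      }
    }
  }
}
```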
  • #9: Lifecycle management deals with the lifecycle of data from the time it is acquired until it is purged. We cannot rely on users to adhere to data governance policies.
  • #10: How do we make this solution operable and manageable? All workflows running when the system is shut down are automatically resumed upon restart.
  • #11:
    - Three daemons per colo, for acquisition, replication & retention; state stored in a database with BCP.
    - One instance serves multiple clusters in a data center.
    - Automatic scaling: all processing happens as map-reduce jobs; a bank of proxies and map tasks ensure bandwidth adjustment.
    - An integrated console/dashboard for management.
  • #14:
    - Eat our own dogfood; self-service: a tool to assemble a data movement workflow, with libraries and an interface specification for common data movement use cases. The user takes the bundle, customizes it, and registers it with Oozie; it runs as the user submitting it.
    - There are lots of solutions for event-level, streaming data; how do we build a solution so customers can push data into the cluster?
    - As data ages, it becomes storage bound; how do we save on the footprint of data on the clusters?
  • #15: This concludes what I wanted to share about Data Management. Thank you for listening. Let me know if you have any questions.