Data Management on Hadoop @ Y!
Seetharam Venkatesh (venkatesh@innerzeal.com)
Hadoop Data Infrastructure Lead/Architect
Agenda
1 Introduction
2 Challenging Data Landscape
3 The Solution
4 Future opportunities
Introduction
Why is Data Management Critical?
Challenging Data Landscape
[Diagram: data landscape spanning data warehouses, databases, NAS, and tape]
Data Acquisition
Challenge → Solution
• Steady growth in data volumes → Heavy lifting delegated to map-only jobs
• Diverse Interfaces, API contracts → Pluggable interfaces, Adaptors for specific API
• Data Sources have diverse serving capacity → Throttling enables uniform load on the Data Source
• Data Source Isolation → Asynchronous Scheduling, progress monitored per source
• Varying Data Formats, file sizes and long tails, failures → Conversion as map-reduce job; Coalesce, Chunking, Checkpoint
• Data Quality → Pluggable validations
• BCP → Supports Hot-Hot, Hot-Warm
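To make the "pluggable interfaces" row concrete, the acquisition service can be pictured as loading one adaptor per source protocol. The sketch below is a hedged illustration only; the interface name and methods are hypothetical, not the actual Yahoo! API.

```java
// Hypothetical adaptor contract for plugging new data sources into an
// acquisition service; names are illustrative, not the actual Yahoo! code.
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

public interface DataSourceAdaptor {

  /** Protocol this adaptor speaks, e.g. "http", "scp", "ftp". */
  String protocol();

  /** List the files available for a feed at the source. */
  List<String> listFiles(String feed) throws IOException;

  /** Open one source file; a map task streams it into HDFS. */
  InputStream open(String file) throws IOException;

  /** Maximum concurrent connections the source can serve (used for throttling). */
  int maxConcurrentConnections();
}
```

Under this reading, supporting a new source only means supplying an implementation of the contract; the rest of the acquisition pipeline stays unchanged.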
Data Replication
Challenge → Solution
• Steady growth in data volumes → Heavy lifting delegated to map-only jobs (DistCp v2)
• Cluster proximity, availability → Tree of copies with at most one cross-datacenter copy
• Long tails → Dynamic split assignment; each map picks up only one file at a time (DistCp v2)
• Data Export → Export as Replication target (push); ad hoc uses HDFS Proxy (pull)
[Diagram: replication between datacenters]
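One way to read the "tree of copies with at most one cross-datacenter copy" row: the source cluster copies a dataset once into each remote colo, and further copies fan out locally inside that colo. A minimal sketch under that assumption; class and method names are illustrative, not the deck's implementation.

```java
// Hypothetical sketch of planning a "tree of copies" with at most one
// cross-datacenter copy per colo: the source cluster copies once into each
// remote colo, and the remaining copies fan out locally within that colo.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public final class ReplicationPlanner {

  /** One copy step from a source cluster to a target cluster. */
  public record Copy(String fromCluster, String toCluster) {}

  private ReplicationPlanner() {}

  public static List<Copy> plan(String sourceCluster, String sourceColo,
                                Map<String, List<String>> targetClustersByColo) {
    List<Copy> copies = new ArrayList<>();
    for (Map.Entry<String, List<String>> colo : targetClustersByColo.entrySet()) {
      List<String> clusters = colo.getValue();
      if (clusters.isEmpty()) {
        continue;
      }
      boolean local = colo.getKey().equals(sourceColo);
      // In a remote colo, the first target receives the single cross-datacenter copy.
      String landing = local ? sourceCluster : clusters.get(0);
      if (!local) {
        copies.add(new Copy(sourceCluster, landing));
      }
      // Every other cluster in the colo is fed from the landing cluster, locally.
      for (String cluster : clusters) {
        if (!cluster.equals(landing)) {
          copies.add(new Copy(landing, cluster));
        }
      }
    }
    return copies;
  }
}
```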
Data Lifecycle Management
Challenge → Solution
• Aging data expires → Retention to remove old data (as required for legal compliance and for capacity purposes)
• Data Privacy → Anonymization of Personally Identifiable Information
• SOX Compliance & Audit → Archival/restoration to/from tape (13 months)
• SEC Compliance & Audit → Archival/restoration to/from tape (7 years)
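A retention pass of the kind the first row describes can be sketched with the standard Hadoop FileSystem API: list a feed's instance directories and delete those older than the retention period. The paths, directory layout, and 13-month example below are assumptions for illustration, not the actual service.

```java
// Hedged sketch of a retention pass: delete feed instances whose modification
// time is older than the configured retention period. Layout is hypothetical.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class RetentionJob {

  private RetentionJob() {}

  /** Remove instances of a feed that have aged past the retention window. */
  public static void enforce(Configuration conf, Path feedRoot, long retentionMillis)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    long cutoff = System.currentTimeMillis() - retentionMillis;
    for (FileStatus instance : fs.listStatus(feedRoot)) {
      if (instance.isDirectory() && instance.getModificationTime() < cutoff) {
        // Recursive delete of the expired instance directory.
        fs.delete(instance.getPath(), true);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // e.g. keep roughly 13 months of a feed, as in the SOX row above.
    long thirteenMonths = 13L * 30 * 24 * 60 * 60 * 1000;
    enforce(new Configuration(), new Path("/data/feeds/example"), thirteenMonths);
  }
}
```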
Operability, Manageability
Challenge → Solution
• Monitor and administer data loading across clusters, colos → Central dashboard for monitoring and administration; integrated view of jobs running across clusters, colos
• Interoperability across incompatible Hadoop versions → Support various Hadoop versions using a Reverse Class loader; one data loading instance per colo that can work across clusters
• Maintenance windows, failures; system shutdown → Partial copy + auto resume; automatic resume upon restart; SLA management + introspection via metrics
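The "partial copy + auto resume" row can be read as checkpointing which files of a feed instance have already landed, so a restart skips them. A hedged sketch follows; the marker-file scheme is purely illustrative and not described in the deck.

```java
// Illustrative sketch of partial copy + auto resume: record a marker per
// completed file so a restarted workflow skips files already copied.
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class ResumableCopy {

  private ResumableCopy() {}

  /** Copy each source file, skipping ones marked done by a previous run. */
  public static void copyWithResume(Configuration conf, List<Path> sources,
                                    Path checkpointDir) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    fs.mkdirs(checkpointDir);
    for (Path src : sources) {
      Path marker = new Path(checkpointDir, src.getName() + ".done");
      if (fs.exists(marker)) {
        continue;                 // copied in an earlier, interrupted run
      }
      copyOneFile(fs, src);       // the actual transfer (a map task in practice)
      fs.createNewFile(marker);   // record completion so a restart skips it
    }
  }

  private static void copyOneFile(FileSystem fs, Path src) {
    // Placeholder: in the real system a map-reduce job performs the transfer.
  }
}
```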
Architecture
Highlights
• “Workflows” abstraction over MR Jobs
• More workflows than Oozie within Y!
• Amounts to >30% of jobs launched on the clusters
• Occupies less than 10% of cluster capacity (slots)
• Solves recurring batch data transfers
• 2300+ feeds with varying periodicity (5 minutes to monthly)
• 100+ TB/day of data movement
• SLAs
• Central Dashboard
• SLA monitoring with ETA on feeds
Future
Editor's Notes
  • #2: Hello, this is Venkatesh, working as part of the greater Hadoop team at Yahoo in Bangalore. I'm going to talk about how we manage data into and out of our Hadoop clusters at Yahoo.
  • #3: I’ll cover what we mean by Data Management and why is it critical for us at Yahoo.I’ll look at the challenges in the space and then the solutions for the major components.Will briefly go over what we are thinking ahead.
  • #4: Data Management is a suite of services that manage the lifecycle of data on the Hadoop clusters. The focus is on bulk data, usually recurring in batches, agnostic to the data itself; the aim is to provide these services with strong SLA and quality guarantees. Data Management can broadly be decomposed into two major areas, Data Movement and Lifecycle Management.
    - Data Movement is further decomposed into the following services and components:
      - Data Acquisition Service: movement of data into Grids.
      - Data Outbound Service & Interfaces: movement of data out of Grids.
      - Data Replication: replicate data across Grids.
    - Data Lifecycle Management:
      - Data Retention: cleaning of expired data.
      - A29 [Data formats]: creates anonymized versions of raw datasets; a set of rules governing the anonymization of personally-identifiable information in our data.
      - Compliance Archival: archival and restoration of data on Grids.
      - Policy Management (ACL): independent retention and replication policies for both the raw and anonymized versions of datasets.
      - Compaction
      - Restatement
  • #5:
    - Productivity gains: large datasets are incentives for users to come to Hadoop; SLAs on data availability and data quality guarantees; users focus on utilizing the data and not how to get it.
    - Data loading optimized for space, time, and bandwidth: volume of data movement; resource consumption; impedance mismatch between producers and HDFS.
    - Regulatory compliance: we cannot rely on users to adhere to data governance policies (SEC, SOX, A29 of PII).
  • #6:
    - Producers/Consumers:
      - Multiple data sources
      - Various interfaces & API contracts (e.g. HTTP, scp, FTP, or some proprietary mechanism)
      - Performance may not be uniform
      - Each source has a different capacity to serve
      - Maintenance windows
      - Mutable tables
    - Hadoop:
      - Clusters can be running on various versions which are incompatible
      - Consumers of data and downstreamers are SLA-sensitive
    - Data Management:
      - Provide a secure, robust, and efficient way to manage data on Hadoop
      - Bridge producers and consumers of data
      - Automatic scaling
      - Data quality
      - Manageability
    - Data Sources (Producers/Consumers):
      - Diverse interfaces, API contracts
      - Varying authentication methods
      - Data sources have diverse serving capacity (QoS)
      - Data source isolation (maintenance windows)
      - Data quality, stats
      - Steady growth in data volumes
      - Wide array of data producers (Sherpa hosting 2 GB each on 1000 hosts; WCC hosting log files on 2000 nodes)
      - SLA
      - Incompatible versions (Hadoop only)
    - Data:
      - Batch (staged), streaming (event-level data)
      - Structured, unstructured (video stream, force fit into a blob?)
      - Mutable, immutable (dimension data)
      - Full dump vs. deltas
      - Encoding impedance mismatch (export)
      - Varies in format and size, resulting in long tails
      - Access frequency drops with age
    - Data Movement:
      - Periodic, ad hoc
      - SLA, latency requirements: high, low
      - Mechanism: push vs. pull
      - BCP requirements
      - Replication (copy within and across colos)
      - Quota policy; one feed can choke the pipe
    - Data Lifecycle:
      - Cannot rely on users to adhere to data governance policies
      - Regulatory compliance
      - Retention policy
      - Restoration
      - Restatements, reprocessing
      - Compaction
      - RAID, reduced replication factor
    - Operability:
      - Single integrated view of data loading jobs
      - SLA management + introspection
      - Partial copy & auto-resume
      - Automatic resume upon restart
      - Quota policy enforcement
  • #7:
    - Data volumes are growing: the heavy lifting is delegated to map-reduce jobs, keeping the acquisition service light.
    - Different sources may use different interfaces to expose data (e.g. HTTP, scp, FTP, or some proprietary mechanism): the acquisition service is generic and can plug in interfaces easily to support newer data sources. We had three choices for where to host the adaptors: as part of MR jobs, on the client, or on a bank of proxies.
    - Each data source has a different capacity to serve, and resource limits are to be honored while pulling data (both for acquisition and replication):
      - Queue per data source (this allows datasets from other sources to make progress even if one source is slow).
      - Every load explicitly states resource requirements that have to be satisfied before the load can proceed; if resources are unavailable, it is blocked until they become available.
      - One resource of significance is "bandwidth", defined between a source and a target colo. The number of concurrent tasks used to pull is decided by the bandwidth requested for the feed. If the requested bandwidth is not available, the workflow waits until other workflows holding the resource relinquish it. This ensures the sources are never overwhelmed.
      - Bandwidth-based throttling is enforced for both acquisition and replication.
    - Data source performance may not be uniform: there is a mix of slow and well-performing JobTrackers, clusters' maintenance windows have to be factored in, and SLA, feed priority, and frequency differ. All Hadoop jobs are scheduled asynchronously and their progress is monitored independently, so issues with one cluster (e.g. a slow JobTracker or JobTracker OOM issues) don't affect loading to other clusters.
    - Source data can be in multiple formats: data conversion runs as a map-reduce job (to enable faster data processing post acquisition).
    - File sizes typically vary significantly at the source: coalescing and chunking give an even spread for processing in MR jobs, better locality for jobs, and no long tails; gzip is not splittable, but every map now gets equal 128 MB gzip files to process.
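The bandwidth point above maps directly to a small calculation: the number of concurrent pull tasks granted to a feed follows from the bandwidth requested between the source and target colo. A hypothetical sketch; the per-task throughput constant and names are assumptions, not Yahoo!'s code.

```java
// Hypothetical sketch of bandwidth-based throttling: derive the cap on
// concurrent pull tasks for a feed from its requested bandwidth.
public final class BandwidthThrottle {

  // Assumed sustained throughput of a single pull task, in MB/s.
  private static final double MB_PER_SEC_PER_TASK = 10.0;

  private BandwidthThrottle() {}

  // Requested bandwidth (MB/s) between a source and a target colo
  // translates into a cap on concurrent map tasks for the feed.
  public static int concurrentTasksFor(double requestedMbPerSec) {
    int tasks = (int) Math.floor(requestedMbPerSec / MB_PER_SEC_PER_TASK);
    return Math.max(1, tasks);
  }

  public static void main(String[] args) {
    // A feed granted 200 MB/s between source and target colo gets ~20 pull tasks.
    System.out.println(concurrentTasksFor(200.0)); // prints 20
  }
}
```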
  • #8: Copying dataset instances from cluster to cluster, within and across colos.
    - Non-trivial workflow generation, accommodating cluster availability (maintenance windows) and cluster proximity (colos).
    - Chaining replications immediately after acquisition via web hooks.
    - A single replication instance talks to grids running multiple Hadoop versions.
    - Maintenance window handling; partial and failed workflows.
    - DistCp: the DynamicInputFormat implements org.apache.hadoop.mapreduce.InputFormat and is new to DistCp. The listing file is split into several "chunk files", the exact number of chunk files being a multiple of the number of maps requested for the Hadoop job. Each map task is "assigned" one of the chunk files (by renaming the chunk to the task's ID) before the job is launched. Paths are read from each chunk using the DynamicRecordReader and processed in the CopyMapper. After all the paths in a chunk are processed, the current chunk is deleted and a new chunk is acquired. The process continues until no more chunks are available. This "dynamic" approach allows faster map tasks to consume more paths than slower ones, thus speeding up the DistCp job overall.
    - HDFS Proxy: HDFS read/write access from other platforms (via HTTP REST); decouples cluster upgrades from client upgrades.
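The DynamicInputFormat description above boils down to work stealing via atomic renames: a map task claims a chunk file by renaming it, processes the paths it lists, deletes it, and claims the next one. The sketch below illustrates that idea with the public FileSystem API; it is a simplification with hypothetical names, not the actual DistCp code.

```java
// Illustrative sketch (not the actual DistCp implementation) of dynamic chunk
// acquisition: each map task claims chunk files by renaming them, so faster
// tasks naturally process more chunks than slower ones.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DynamicChunkWorker {

  // Claim and process chunks until none are left.
  public static void run(Configuration conf, Path chunkDir, String taskId) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    while (true) {
      Path claimed = claimNextChunk(fs, chunkDir, taskId);
      if (claimed == null) {
        break;                       // no more chunks: this task is done
      }
      copyPathsInChunk(fs, claimed); // copy every source path listed in the chunk
      fs.delete(claimed, false);     // chunk fully processed; remove it
    }
  }

  // A rename is atomic in HDFS, so only one task can own a given chunk.
  private static Path claimNextChunk(FileSystem fs, Path chunkDir, String taskId) throws IOException {
    for (FileStatus stat : fs.listStatus(chunkDir)) {
      Path candidate = stat.getPath();
      if (candidate.getName().contains(".owned.")) {
        continue;                    // already claimed by some task
      }
      Path owned = new Path(chunkDir, candidate.getName() + ".owned." + taskId);
      if (fs.rename(candidate, owned)) {
        return owned;                // we won the race for this chunk
      }
    }
    return null;
  }

  private static void copyPathsInChunk(FileSystem fs, Path chunk) throws IOException {
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(chunk)))) {
      String srcPath;
      while ((srcPath = reader.readLine()) != null) {
        // In real DistCp, the CopyMapper copies srcPath to the target cluster here.
      }
    }
  }
}
```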
  • #9: Lifecycle management deals with the lifecycle of data from the time it is acquired until it is purged. We cannot rely on users to adhere to data governance policies.
  • #10: How do we make this solution operable and manageable? All workflows running when the system is shut down are automatically resumed upon restart.
  • #11:
    - Three daemons per colo, for acquisition, replication & retention; state stored in a database with BCP.
    - One instance serves multiple clusters in a data center.
    - Automatic scaling: all processing happens as map-reduce jobs; a bank of proxies and map tasks ensure bandwidth adjustment.
    - An integrated console/dashboard for management.
  • #14:
    - Eat our own dogfood; self-service: a tool to assemble a data movement workflow, with libraries and an interface specification for common data movement use cases. The user takes the bundle, customizes it, and registers it with Oozie; it runs as the user submitting it.
    - There are lots of solutions for event-level, streaming data; how do we build a solution so customers can push data into the cluster?
    - As data ages, it becomes storage bound; how do we save on the footprint of data on the clusters?
  • #15: This concludes what I wanted to share about Data Management. Thank you for listening. Let me know if you have any questions.