HBase Global Indexing to
support large-scale data
ingestion @ Uber
May 21, 2019
Danny Chen
dannyc@uber.com
● Engineering Manager on Hadoop
Data Platform team
● Leading Data Ingestion team
● Previously worked on
storage team (Manhattan)
● Enjoy playing basketball, biking,
and spending time w/my kids.
Uber Apache Hadoop
Platform Team Mission
Build products to support reliable,
scalable, easy-to-use, compliant, and
efficient data transfer (both ingestion
& dispersal) as well as data storage
leveraging the Apache Hadoop
ecosystem.
Apache Hadoop is either a registered trademark or trademark of the Apache Software Foundation in the United States and/or other
countries. No endorsement by The Apache Software Foundation is implied by the use of this mark.
Overview
● High-Level Ingestion & Dispersal
introduction
● Different types of workloads
● Need for Global Index
● How Global Index Works
● Generating Global Indexes with
HFiles
● Throttling HBase Access
● Next Steps
High Level
Ingestion/Dispersal
Introduction
Hadoop Data Ecosystem at Uber
Diagram labels: Apache Hadoop Data Lake, Schemaless, Analytical Processing
Apache Kafka, Cassandra, Spark, and HDFS logos are either registered trademarks or trademarks of the Apache Software Foundation in the
United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
Diagram labels: Data Ingestion, Data Dispersal
Different
Types of
Workloads
Bootstrap
● One time only at beginning of lifecycle
● Large amounts of data
● Millions of QPS throughput
● Need to finish in a matter of hours
● NoSQL stores cannot keep up
Incremental
● Dominates lifecycle of Hive table ingestion
● Incremental upstream changes from Kafka
or other data sources.
● Thousands of QPS per dataset
● Reasonable throughput requirements for
NoSQL stores
Cell vs Row Changes
Need for Global
Index
Requirements for Global Index
● Large amounts of historical data ingested in short
amount of time
● Append only vs Append-plus-update
● Data layout and partitioning
● Bookkeeping for data layout
● Strong consistency
● High Throughput
● Horizontally scalable
● Required a NoSQL store
● Decision was to use HBase
● Trade Availability for Consistency
● Automatic Rebalancing of HBase tables via region splitting
● Global view of dataset via master/slave architecture
VS
How Global Index
Works
Generating Global
Indexes
Batch and One Time Index Upload
Data Model For Global Index
Spark & RDD Transformations for
index generation
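Editor's note #22 describes the core of this step: after records are expanded into index entries, ordering is lost, so entries are re-sorted inside each partition while the partitioning itself is left alone. The following is a minimal pure-Python sketch of that idea, not the actual Spark job. The function names, the column names (`partition`, `file_id`) and the split point are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def range_partition(records, split_points):
    """Place each (key, payload) record in the partition owning its key range.

    Partition i holds keys in [split_points[i-1], split_points[i]).
    """
    partitions = [[] for _ in range(len(split_points) + 1)]
    for key, payload in records:
        partitions[bisect_right(split_points, key)].append((key, payload))
    return partitions


def to_index_cells(record):
    """Expand one record into (row_key, column, value) cells, like a flatMap.

    Columns are emitted in reverse order on purpose, to mimic lost ordering.
    """
    key, payload = record
    return [(key, col, payload[col]) for col in sorted(payload, reverse=True)]


def sort_within_partitions(partitions):
    """Sort cells by (row_key, column) inside each partition only.

    Cells never cross partition boundaries, so key ranges stay disjoint.
    """
    result = []
    for part in partitions:
        cells = [cell for record in part for cell in to_index_cells(record)]
        result.append(sorted(cells, key=lambda c: (c[0], c[1])))
    return result


# Hypothetical sample data: record key -> location of the record in the lake.
records = [
    ("k7", {"partition": "2019/05/21", "file_id": "f3"}),
    ("k2", {"partition": "2019/05/20", "file_id": "f1"}),
    ("k5", {"partition": "2019/05/21", "file_id": "f2"}),
    ("k1", {"partition": "2019/05/20", "file_id": "f1"}),
]
sorted_partitions = sort_within_partitions(range_partition(records, ["k4"]))
```

Because the sort is local to each partition, no extra shuffle is needed, and each sorted partition can be written out as one file covering its own key range.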
HFile Upload Process
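Editor's note #23 says the table is presplit into as many regions as there are HFiles, so that each HFile falls inside one region. One way to derive the split points is sketched below, assuming each HFile is described by its first and last row key and that a region's start key is inclusive. The function name and the sample key ranges are hypothetical; this is an illustration of the idea, not the talk's implementation.

```python
def presplit_points(hfile_ranges):
    """Return region split keys so that each HFile lands in its own region.

    hfile_ranges: iterable of (first_key, last_key) pairs, one per HFile.
    The key ranges must not overlap, otherwise a file would span regions.
    """
    ranges = sorted(hfile_ranges)
    for (_, prev_last), (next_first, _) in zip(ranges, ranges[1:]):
        if next_first <= prev_last:
            raise ValueError("HFile key ranges overlap; cannot presplit cleanly")
    # N files need N regions, which means N - 1 split keys: the first key of
    # every file except the first becomes the start key of a new region.
    return [first for first, _ in ranges[1:]]


# Hypothetical example: three HFiles yield two split keys and three regions.
splits = presplit_points([("a", "f"), ("n", "z"), ("g", "m")])
```

With split keys chosen this way, no HFile straddles a region boundary, which is the condition the note gives for avoiding HFile splitting during upload.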
HFile Index Job Tuning
● Explicitly register classes with Kryo Serialization
● Reduce 3 shuffle stages to one
● Proper HFile Size
● Proper Partition Counting Size
● 13 TB index data with 54 billion indexes
○ 2 hours to generate indexes
○ 10 min to load
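To put the figures above in perspective, the short calculation below derives the average entry size and the generation rate they imply. It assumes decimal terabytes (1 TB = 10^12 bytes) and that the 2 hours and 10 minutes both refer to the full 13 TB; the variable names are illustrative only.

```python
INDEX_BYTES = 13 * 10**12      # 13 TB of index data (decimal TB, an assumption)
INDEX_ENTRIES = 54 * 10**9     # 54 billion index entries
GENERATE_SECONDS = 2 * 60 * 60  # 2 hours to generate
LOAD_SECONDS = 10 * 60          # 10 minutes to load

# Average size of one index entry, in bytes.
bytes_per_entry = INDEX_BYTES / INDEX_ENTRIES

# Sustained rates during index generation.
entries_per_second = INDEX_ENTRIES / GENERATE_SECONDS
generate_gb_per_second = INDEX_BYTES / GENERATE_SECONDS / 10**9

# Effective rate of the load step; a bulk load hands files over to HBase
# rather than writing every entry, so this is not a write throughput.
load_gb_per_second = INDEX_BYTES / LOAD_SECONDS / 10**9
```

Under these assumptions each entry averages roughly 240 bytes, and generation sustains about 7.5 million entries per second, which matches the "millions of QPS" bootstrap requirement stated earlier in the deck.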
Throttling HBase
Access
The need for throttling HBase Access
Horizontal Scalability & Throttling
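Editor's note #27 states that a dataset's QPS grows linearly with the number of HBase servers while its QPSFraction stays constant. A simple model consistent with that statement is sketched below. The formula (fraction times server count times per-server capacity), the function names, the dataset names and the capacity figure are all assumptions made for illustration, not the documented implementation.

```python
def dataset_qps(qps_fraction, num_region_servers, max_qps_per_server):
    """QPS budget for one dataset under the assumed linear model."""
    if not 0 < qps_fraction <= 1:
        raise ValueError("qps_fraction must be in (0, 1]")
    return qps_fraction * num_region_servers * max_qps_per_server


def allocate(fractions, num_region_servers, max_qps_per_server):
    """Compute a QPS budget per dataset from its fixed QPSFraction.

    fractions: mapping of dataset name -> QPSFraction. The fractions must not
    add up to more than 1, or the cluster would be oversubscribed.
    """
    if sum(fractions.values()) > 1.0 + 1e-9:
        raise ValueError("QPS fractions exceed total cluster capacity")
    return {
        name: dataset_qps(fraction, num_region_servers, max_qps_per_server)
        for name, fraction in fractions.items()
    }


# Hypothetical datasets and capacity: doubling the servers doubles each
# dataset's budget while the fractions themselves are untouched.
fractions = {"trips": 0.5, "riders": 0.25}
budget_10 = allocate(fractions, num_region_servers=10, max_qps_per_server=1000)
budget_20 = allocate(fractions, num_region_servers=20, max_qps_per_server=1000)
```

Keeping the fraction as the configured value means capacity can be added to the cluster without touching per-dataset settings.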
Next Steps
● Handle non-append-only
data during bootstrap
● Explore other indexing
solutions
Useful Links
https://github.com/uber/marmaray
https://github.com/uber/hudi
https://eng.uber.com/data-partitioning-global-indexing/
https://eng.uber.com/uber-big-data-platform/
https://eng.uber.com/marmaray-hadoop-ingestion-open-source/
Other Dataworks Summit Talks
Marmaray: Uber’s Open-sourced Generic Hadoop Data
Ingestion and Dispersal Framework
Wednesday at 11 am
Attribution
Kaushik
Devarajaiah
Nishith
Agarwal
Jing
Li
Positions available: Seattle, Palo Alto & San
Francisco
email : hadoop-platform-jobs@uber.com
We are hiring!
Thank you
Proprietary © 2018 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying, recording, or by any
information storage or retrieval systems, without permission in writing from
Uber. This document is intended only for the use of the individual or entity to
whom it is addressed. All recipients of this document are notified that the
information contained herein includes proprietary information of Uber, and
recipient may not make use of, disseminate, or in any way disclose this
document or any of the enclosed information to any person other than
employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.
Questions: email ospo@uber.com
Follow our Facebook page: www.facebook.com/uberopensource


Editor's Notes

  • #5: A lot of effort went into making the onboarding process completely self-serve. Analytical users with little technical knowledge of Spark, Hadoop, Hive, etc. can still take advantage of our platform. Our assertion is that when relevant data is discoverable in the appropriate data stores for analytical purposes, there can be substantial gains in efficiency and value for your business. Marmaray is critical for ensuring data is in the appropriate data store. Familiarity with the suite of tools in our Hadoop ecosystem helps with the many potential use cases for extracting insights out of raw data.
  • #7: This completes the Hadoop ecosystem of tools at Uber and the original vision of the Data Processing Platform: Heatpipe/Watchtower produce quality schematized data; Marmaray ingests the data; the Workflow Management System orchestrates jobs to run analytics and generate derived datasets, or to build models using Michelangelo; Marmaray disperses the data to stores with low-latency semantics. What sets it apart: it is a generic ingestion framework that is not tightly coupled to any specific source or sink (product teams focus on this).
  • #13: Dividing bootstrap and incremental ingestion allows us to choose a key-value store that can scale for incremental-phase indexing but not necessarily for bootstrapping of data.
  • #15: HBase automatically rebalances tables within a cluster by splitting up key ranges when a region gets too large. It can also balance load by moving new regions to other servers. The master-slave architecture gives a global view of how a dataset is spread across the cluster, which we use to customize dataset-specific throughput to our HBase cluster.
  • #17: During incremental ingestion we work in mini-batches. It is the job of the work unit calculator to provide the required level of throttling.
  • #18: We work in mini-batches. It is the job of the work unit calculator to provide the required level of throttling.
  • #21: Our Big Data ecosystem’s model of indexes stored in HBase contains entities shown in green that help identify files that need to be updated corresponding to a given record in an append-plus-update dataset. The layout of index entries in HFiles lets us sort based on key value and column.
  • #22: This is for the one-time upload case. The flatMapToPair transformation in Apache Spark does not preserve the ordering of entries, so a partition-isolated sort is performed. The partitioning is unchanged to ensure each partition still corresponds to a non-overlapping key range.
  • #23: HFiles are written to the cluster where HBase is hosted to ensure HBase region servers have access to them during the upload process. HFile upload can be severely affected by splitting, so we presplit the HBase table into as many regions as there are HFiles, so that each HFile fits within a separate region with non-overlapping keys. This avoids splitting HFiles based on HFile size, which severely impacts upload time; without splitting, the upload takes about 10 minutes even for tens of TB.
  • #24: HFiles are written to the cluster where HBase is hosted to ensure HBase region servers have access to them during the upload process.
  • #26: Three Apache Spark jobs corresponding to three different datasets access their respective HBase index tables, creating load on the HBase region servers hosting these tables.
  • #27: Adding more servers to the HBase cluster for a single dataset that uses the global index yields a linear increase in QPS, although the dataset's QPSFraction remains constant.
  • #29: Explore other indexing solutions to possibly merge bootstrap and incremental indexing solutions for easier maintenance.