HBase Global Indexing to support large-scale data ingestion at Uber

May 21, 2019

Danny Chen
dannyc@uber.com
● Engineering Manager on Hadoop
Data Platform team
● Leading Data Ingestion team
● Previously worked on
storage team (Manhattan)
● Enjoys playing basketball, biking,
and spending time with my kids.

Uber Apache Hadoop
Platform Team Mission
Build products to support reliable,
scalable, easy-to-use, compliant, and
efficient data transfer (both ingestion
& dispersal) as well as data storage
leveraging the Apache Hadoop
ecosystem.
Apache Hadoop is either a registered trademark or trademark of the Apache Software Foundation in the United States and/or other
countries. No endorsement by The Apache Software Foundation is implied by the use of this mark.

Overview
● High-Level Ingestion & Dispersal
introduction
● Different types of workloads
● Need for Global Index
● How Global Index Works
● Generating Global Indexes with
HFiles
● Throttling HBase Access
● Next Steps

High Level
Ingestion/Dispersal
Introduction

Hadoop Data Ecosystem at Uber
[Diagram labels: Data Ingestion, Apache Hadoop Data Lake, Data Dispersal, Schemaless, Analytical Processing]
Apache Kafka, Cassandra, Spark, and HDFS logos are either registered trademarks or trademarks of the Apache Software Foundation in the
United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Bootstrap
● One time only at beginning of lifecycle
● Large amounts of data
● Millions of QPS throughput
● Need to finish in a matter of hours
● NoSQL stores cannot keep up

Incremental
● Dominates lifecycle of Hive table ingestion
● Incremental upstream changes from Kafka
or other data sources.
● Thousands of QPS per dataset
● Reasonable throughput requirements for
NoSQL stores

Requirements for Global Index
● Large amounts of historical data ingested in short
amount of time
● Append only vs Append-plus-update
● Data layout and partitioning
● Bookkeeping for data layout
● Strong consistency
● High Throughput
● Horizontally scalable
● Required a NoSQL store
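
The append-plus-update and bookkeeping requirements above can be pictured with a small sketch. It assumes the index maps each record key to the file that holds the record, so an incoming key is either an insert (never seen) or an update (already placed). The class name, method name, keys, and file names are hypothetical, and an in-memory dict stands in for the NoSQL store.

```python
from dataclasses import dataclass, field


@dataclass
class GlobalIndex:
    """Toy stand-in for a key-value store mapping record key -> file location."""
    locations: dict = field(default_factory=dict)

    def tag(self, record_keys, target_file):
        """Split incoming keys into inserts and updates.

        A key already in the index is an update routed to the file that
        holds it; an unseen key is an insert assigned to target_file.
        """
        inserts, updates = [], []
        for key in record_keys:
            if key in self.locations:
                updates.append((key, self.locations[key]))
            else:
                self.locations[key] = target_file
                inserts.append((key, target_file))
        return inserts, updates


# Hypothetical keys and file names, for illustration only.
index = GlobalIndex()
first_inserts, first_updates = index.tag(["trip-1", "trip-2"], "part-0001")
second_inserts, second_updates = index.tag(["trip-2", "trip-3"], "part-0002")
```

In the second batch, "trip-2" is routed back to "part-0001" as an update, while "trip-3" becomes a new insert. This lookup must be strongly consistent, because a stale answer would write the same key into two files.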

● Decision was to use HBase
● Trade Availability for Consistency
● Automatic Rebalancing of HBase tables via region splitting
● Global view of dataset via master/slave architecture

Batch and One Time Index Upload

Spark & RDD Transformations for
index generation
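
A minimal sketch of the data shaping that HFile generation needs, in plain Python rather than Spark. HBase bulk loading expects each HFile to hold row keys in sorted order and to fall within a single region, so index entries are grouped by region boundary and sorted inside each group. The function name, split keys, and entries are hypothetical; the slides do not show the actual RDD transformations.

```python
from bisect import bisect_right


def partition_for_hfiles(entries, split_keys):
    """Group (row_key, value) pairs by region and sort each group by key.

    split_keys are the sorted start keys of every region after the first,
    so len(split_keys) + 1 regions exist.
    """
    regions = [[] for _ in range(len(split_keys) + 1)]
    for row_key, value in entries:
        # bisect_right sends a key equal to a split key to the region it starts.
        regions[bisect_right(split_keys, row_key)].append((row_key, value))
    return [sorted(region) for region in regions]


# Hypothetical index entries: row key -> file holding the record.
entries = [("k7", "file-b"), ("k2", "file-a"), ("k5", "file-a"), ("k9", "file-c")]
hfile_batches = partition_for_hfiles(entries, split_keys=["k4", "k8"])
```

In Spark, `repartitionAndSortWithinPartitions` with a range-based partitioner does the grouping and sorting in a single shuffle. Whether the job described here uses that call is an assumption.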

HFile Index Job Tuning
● Explicitly register classes with Kryo Serialization
● Reduce 3 shuffle stages to one
● Proper HFile Size
● Proper Partition Counting Size
● 13 TB index data with 54 billion indexes
○ 2 hours to generate indexes
○ 10 min to load

The need for throttling HBase Access

Horizontal Scalability & Throttling
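
One common way to throttle access to a shared store is a token bucket, sketched below. This is an assumption for illustration: the slides do not say which throttling mechanism is used, and the class, parameter names, and rates are hypothetical.

```python
import time


class TokenBucket:
    """Allow up to rate_per_sec requests on average, with bursts up to capacity."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def try_acquire(self, n=1):
        """Take n tokens if available; return False so the caller can back off."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


# Hypothetical budget: 1000 index lookups per second, bursts of up to 200.
limiter = TokenBucket(rate_per_sec=1000, capacity=200)
allowed = limiter.try_acquire(50)
```

A limiter like this runs inside each ingestion job, so the total load on HBase grows with the number of concurrent jobs. Per-job budgets would have to be set with the cluster's overall capacity in mind.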

Next Steps
● Handle non-append-only
data during bootstrap
● Explore other indexing
solutions

Useful Links
https://github.com/uber/marmaray
https://github.com/uber/hudi
https://eng.uber.com/data-partitioning-global-indexing/
https://eng.uber.com/uber-big-data-platform/
https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

Other Dataworks Summit Talks
Marmaray: Uber’s Open-sourced Generic Hadoop Data
Ingestion and Dispersal Framework
Wednesday at 11 am

Attribution
Kaushik Devarajaiah
Nishith Agarwal
Jing Li

We are hiring!
Positions available: Seattle, Palo Alto & San Francisco
Email: hadoop-platform-jobs@uber.com

Thank you
Proprietary © 2018 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying, recording, or by any
information storage or retrieval systems, without permission in writing from
Uber. This document is intended only for the use of the individual or entity to
whom it is addressed. All recipients of this document are notified that the
information contained herein includes proprietary information of Uber, and
recipient may not make use of, disseminate, or in any way disclose this
document or any of the enclosed information to any person other than
employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.
Questions: email ospo@uber.com
Follow our Facebook page: www.facebook.com/uberopensource
