Hive and HBase Integration

Contents
I. HBase
II. Hive
III. Hive+HBase Motivation
IV. Integration
V. StorageHandler
VI. Schema/Type Mapping
VII. Data Flows
VIII. Use Cases

HBase
• Apache HBase in a few words:
“HBase is an open-source, distributed, column-oriented, versioned NoSQL database modeled after Google's Bigtable”
• Used for:
– Powering websites/products, such as StumbleUpon and Facebook’s Messages
– Storing data that’s used as a sink or a source for analytical jobs (usually MapReduce)
• Main features:
– Horizontal scalability
– Machine failure tolerance
– Row-level atomic operations, including compare-and-swap and counter increments
– Augmented key-value schemas: the user can group columns into families, which are configured independently
– Multiple clients, such as its native Java library, Thrift, and REST

Apache HBase Architecture

Hive
• Apache Hive in a few words:
“A data warehouse infrastructure built on top of Apache Hadoop”
• Used for:
– Ad-hoc querying and analysis of large data sets without having to learn MapReduce
• Main features:
– SQL-like query language called HiveQL
– Built-in user-defined functions (UDFs) for manipulating dates and strings, plus other data-mining tools
– Plug-in capabilities for custom mappers, reducers, and UDFs
– Support for different storage types such as plain text, RCFiles, HBase, and others
– Multiple clients, such as a shell, JDBC, and Thrift

Apache Hive Architecture

Hive+HBase Motivation
• Hive and HBase have different characteristics:
Hive            HBase
High latency    Low latency
Structured      Unstructured
Analysts        Programmers
• Hive data warehouses on Hadoop are high latency
– Long ETL times
– Access to real-time data
• Analyzing HBase data with MapReduce requires custom coding
• Hive and SQL are already known by many analysts

Integration
• Reasons to use Hive on HBase:
– A lot of data sits in HBase because of its use in a real-time environment, but is never used for analysis
– To give people who don’t code (business analysts) access to HBase data that is usually only queried through MapReduce
– When a more flexible storage solution is needed, so that rows can be updated live by either a Hive job or an application, and the change is immediately visible to the other
• Reasons not to do it:
– Running SQL queries on HBase to answer live user requests (it’s still an MR job)
– Hoping to see interoperability with other SQL analytics systems

Integration
• How it works:
– Hive can use tables that already exist in HBase or manage its own, but they all still reside in the same HBase instance
[Figure: Hive table definitions pointing into HBase; one definition points to some column, another points to other columns under different names]
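
The two options described on this slide could look roughly like the sketch below. All table, column, and family names (events, hive_events, rowkey, payload, cf) are hypothetical and only serve the illustration.

```sql
-- Option 1: register an HBase table that already exists.
-- Dropping the Hive table leaves the HBase table in place.
CREATE EXTERNAL TABLE events(rowkey STRING, payload STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:payload")
TBLPROPERTIES ("hbase.table.name" = "events");

-- Option 2: let Hive create and manage the HBase table.
-- Dropping the Hive table also drops the HBase table.
CREATE TABLE hive_events(rowkey STRING, payload STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:payload")
TBLPROPERTIES ("hbase.table.name" = "hive_events");
```

In both cases the data lives in HBase; the only difference is which system owns the table's lifecycle.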

Integration
• How it works:
– Columns are mapped however you want, changing names and giving types
Hive table definition          HBase table
name STRING                    d:fullname
age INT                        d:age
(not mapped)                   d:address
siblings MAP<string,string>    f:
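
As a rough illustration of the mapping on this slide, the DDL below sketches how such a table could be declared. The table name people and the row-key column id are assumptions added for the example: the slide does not show them, but the storage handler needs one :key entry. The HBase column d:address is simply left unmapped.

```sql
-- Hypothetical names: Hive table "people", HBase table "people", key column "id".
-- name is stored as d:fullname, age as d:age.
-- siblings is backed by the whole family f: each qualifier becomes a map key.
CREATE EXTERNAL TABLE people(
  id STRING,
  name STRING,
  age INT,
  siblings MAP<STRING,STRING>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key,d:fullname,d:age,f:")
TBLPROPERTIES
("hbase.table.name" = "people");
```

The mapping entries are positional: the first entry applies to the first Hive column, the second to the second, and so on.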

Integration
• Drawbacks (that can be fixed with brain juice):
– Binary keys and values (like integers represented on 4 bytes) aren’t supported, since Hive prefers string representations (HIVE-1634)
– Compound row keys aren’t supported; there’s no way of using multiple parts of a key as different “fields”
– This means that concatenated binary row keys, which people often use in HBase, are completely unusable
– Filters are applied at the Hive level instead of being pushed to the region servers
– Partitions aren’t supported

Apache Hive+HBase Architecture

Example: Hive+HBase (HBase table)
hbase(main):001:0> create 'short_urls', {NAME => 'u'}, {NAME => 's'}
hbase(main):014:0> scan 'short_urls'
ROW            COLUMN+CELL
bit.ly/aaaa    column=s:hits, value=100
bit.ly/aaaa    column=u:url, value=hbase.apache.org/
bit.ly/abcd    column=s:hits, value=123
bit.ly/abcd    column=u:url, value=example.com/foo

Example: Hive+HBase (Hive table)
-- EXTERNAL because the short_urls table was already created in the HBase shell
CREATE EXTERNAL TABLE short_urls(
short_url string,
url string,
hit_count int
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key,u:url,s:hits")
TBLPROPERTIES
("hbase.table.name" = "short_urls");
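
Once the table is registered, it can be queried like any other Hive table. The query below is a small sketch against the short_urls definition on this slide; the threshold of 100 is arbitrary. Depending on the Hive version and settings, such a query is typically compiled into a MapReduce job that scans the HBase table.

```sql
-- Short URLs with more than 100 hits, most popular first.
SELECT short_url, url, hit_count
FROM short_urls
WHERE hit_count > 100
ORDER BY hit_count DESC;
```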

Storage Handler
• Hive defines the HiveStorageHandler interface for different storage backends: HBase, Cassandra, MongoDB, etc.
• A storage handler has hooks for:
– Getting input/output formats
– Metadata operations: CREATE TABLE, DROP TABLE, etc.
• A storage handler is a table-level concept
– It does not support Hive partitions or buckets

Schema Mapping
• Hive table + columns + column types <=> HBase table + column families (+ column qualifiers)
• Every field in the Hive table is mapped, in order, to one of:
– The table key (using :key as selector)
– A column family (cf:) -> MAP fields in Hive
– A column (cf:cq)
• The Hive table does not need to include all columns in HBase
short_url string,
url string,
hit_count int,
props map<string,string>
)
("hbase.columns.mapping" = ":key,u:url,s:hits,p:")
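
A column family mapped to a Hive MAP is read with the usual map-access syntax. The sketch below assumes the table on this slide is registered in Hive as short_urls with props mapped to family p; the qualifier title is hypothetical.

```sql
-- 'title' is a hypothetical qualifier in family p.
-- Rows that have no p:title cell are filtered out by the WHERE clause.
SELECT short_url, props['title'] AS title
FROM short_urls
WHERE props['title'] IS NOT NULL;
```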

Type Mapping
• Recently added to Hive (0.9.0)
• Previously, all types were converted to strings in HBase
• Hive has:
– Primitive types: INT, STRING, BINARY, DATE, etc.
– ARRAY<Type>
– MAP<PrimitiveType, Type>
– STRUCT<a:INT, b:STRING, c:STRING>
• HBase does not have types; every value is a byte array
– Bytes.toBytes()

Type Mapping
• Table-level property:
"hbase.table.default.storage.type" = "binary"
• Type mapping can be given per column after #
– Any prefix of “binary”, e.g. u:url#b
– Any prefix of “string”, e.g. u:url#s
– The dash char “-”, e.g. u:url#- (use the table-level default)
short_url string,
url string,
hit_count int,
props map<string,string>
)
("hbase.columns.mapping" = ":key#b,u:url#b,s:hits#b,p:#s")
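
A hedged sketch of how the table-level default and per-column overrides might be combined. The names short_urls_bin (both the Hive and the HBase table) are hypothetical, and the example assumes an application wrote the hit counters into HBase as 4-byte binary integers. Placing the default in TBLPROPERTIES follows the Hive HBase integration documentation.

```sql
-- Hypothetical table whose s:hits cells hold 4-byte binary integers.
-- :key and u:url are explicitly read as strings (#s).
-- s:hits uses "#-", so it falls back to the table default (binary).
CREATE EXTERNAL TABLE short_urls_bin(
  short_url string,
  url string,
  hit_count int
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key#s,u:url#s,s:hits#-")
TBLPROPERTIES
("hbase.table.name" = "short_urls_bin",
 "hbase.table.default.storage.type" = "binary");
```

The storage type must match how the bytes were actually written: reading string-encoded cells such as "100" as binary integers would yield wrong values.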

Type Mapping
• If the type is not a primitive or a Map, it is converted to a JSON string and serialized
• Still a few rough edges for schema and type mapping:
– No Hive BINARY support in HBase mapping
– No mapping of the HBase timestamp (can only provide the put timestamp)
– No arbitrary mapping of Structs/Arrays into the HBase schema

Data Flows
• Data is being generated all over the place:
– Apache logs
– Application logs
– MySQL clusters
– HBase clusters

Data Flows
• Moving application log files

Data Flows
• Moving MySQL data

Data Flows
• Moving HBase data

Use Cases
• Front-end engineers
– They need some statistics regarding their latest product
• Research engineers
– Ad-hoc queries on user data to validate some assumptions
– Generating statistics about recommendation quality
• Business analysts
– Statistics on growth and activity
– Effectiveness of advertiser campaigns
– Users’ behavior vs. past activities to determine, for example, why certain groups react better to email communications
– Ad-hoc queries on stumbling behaviors of slices of the user base

Use Cases
• Using a simple table in HBase
CREATE EXTERNAL TABLE blocked_users(
userid INT,
blockee INT,
blocker INT,
created BIGINT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key,f:blockee,f:blocker,f:created")
TBLPROPERTIES("hbase.table.name" = "m2h_repl-userdb.stumble.blocked_users");
• HBase is a special case here: it has a unique row key, mapped with :key
• Not all the columns in the table need to be mapped
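
With the table registered, an analyst can run ordinary aggregations over it. The query below is only an illustrative sketch against the blocked_users definition on this slide; the limit of 10 is arbitrary.

```sql
-- Ten users who have blocked the most other users.
SELECT blocker, COUNT(*) AS blocks
FROM blocked_users
GROUP BY blocker
ORDER BY blocks DESC
LIMIT 10;
```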

Use Cases
• Using a complicated table in HBase
CREATE EXTERNAL TABLE ratings_hbase(
userid INT,
created BIGINT,
urlid INT,
rating INT,
topic INT,
modified BIGINT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key#b@0,:key#b@1,:key#b@2,default:rating#b,default:topic#b,default:modified#b")
TBLPROPERTIES("hbase.table.name" = "ratings_by_userid");
• #b means binary; @ means position in the composite key (a StumbleUpon-specific hack)

Wrapping up
• Hive is a good complement to HBase for ad-hoc querying, without having to write a new MR job each time.
(All you need to know is SQL)
• Even though it enables relational queries, it is not meant for live systems.
(Not a MySQL replacement)
• The Hive/HBase integration is functional but still lacks some features before it can be called ready.
(Unless you want to get your hands dirty)

Thank you
