Introduction to Apache HBase Training
Jesse Anderson
Curriculum Developer and Instructor
Agenda
• Why Cloudera Training?
• Target Audience and Prerequisites
• Course Outline
• Short Presentation Based on Actual Course Material
- Using Scans to Access Data
• Q&A
Rising demand for Big Data and analytics experts, coupled with a deficiency of talent, will result in a shortfall of 32,000 trained professionals by 2015.
Source: Accenture, “Analytics in Action,” March 2013.
Cloudera Trains the Top Companies
• Employees from 55% of the Fortune 100 have attended live Cloudera training
• Cloudera has trained Big Data professionals from 100% of the top 20 global technology firms that use Hadoop
Source: Fortune, “Fortune 500” and “Global 500,” May 2012.
Learning Path: Developers
• Developer Training: Learn to code and write MapReduce programs for production; Master advanced API topics required for real-world data analysis
• Data Analyst Training: Run full analyses natively on Big Data without BI software; Eliminate complexity to perform ad hoc queries in real time
• HBase Training: Design schemas to minimize latency on massive data sets; Scale hundreds of thousands of operations per second
• Intro to Data Science: Implement recommenders and data experiments; Draw actionable insights from analysis of disparate data
Learning Path: Administrators
• Administrator Training: Configure, install, and monitor clusters for optimal performance; Implement security measures and multi-user functionality
• Enterprise Training: Use Cloudera Manager to speed deployment and scale the cluster; Learn which tools and techniques improve cluster performance
• HBase Training: Implement massively distributed, columnar storage at scale; Enable random, real-time read/write access to all data
• Data Analyst Training: Vertically integrate basic analytics into data management; Transform and manipulate data to drive high-value utilization
Why Cloudera Training?
1. Broadest Range of Courses – Developer, Admin, Analyst, HBase, Data Science
2. Most Experienced Instructors – Over 15,000 students trained since 2009
3. Leader in Certification – Over 5,000 accredited Cloudera professionals
4. State of the Art Curriculum – Classes updated regularly as Hadoop evolves
5. Widest Geographic Coverage – Most classes offered: 50 cities worldwide plus online
6. Most Relevant Platform & Community – CDH deployed more than all other distributions combined
7. Depth of Training Material – Hands-on labs and VMs support live instruction
8. Ongoing Learning – Video tutorials and e-learning complement training
“Cloudera is the best vendor evangelizing the Big Data movement and is doing a great service promoting Hadoop in the industry. Developer training was a great way to get started on my journey.”
Cloudera Training for Apache HBase
About the Course
 This course was created for people in developer and operations roles,
including
–Developers
–DevOps
–Database Administrators
–Data Warehouse Engineers
–Administrators
 Also useful for others who want to access HBase
–Business Intelligence Developers
–ETL Developers
–Quality Assurance Engineers
Intended Audience
 Developers who want to learn details of MapReduce programming
–Recommend Cloudera Developer Training for Apache Hadoop
 System administrators who want to learn how to install/configure tools
–Recommend Cloudera Administrator Training for Apache Hadoop
Who Should Not Take this Course
 No prior knowledge of Hadoop is required
 What is required is an understanding of
–Basic end-user UNIX commands
 An optional understanding of
–Basic relational database concepts
–Basic knowledge of SQL
Course Prerequisites
SELECT id, first_name, last_name
FROM customers
ORDER BY last_name;
$ mkdir /data
$ cd /data
$ rm /home/tomwheeler/salesreport.txt
During this course, you will learn:
 The core technologies of Apache HBase
 How HBase and HDFS work together
 How to work with the HBase shell, Java API, and Thrift API
 The HBase storage and cluster architecture
 The fundamentals of HBase administration
 Best practices for installing and configuring HBase
 Advanced features of the HBase API
 The importance of schema design in HBase
 How to work with HBase ecosystem projects
Course Objectives
 Hadoop Introduction
–Hands-On Exercise - Using HDFS
 Introduction to HBase
 HBase Concepts
–Hands-On Exercise - HBase Data Import
 The HBase Administration API
–Hands-On Exercise - Using the HBase Shell
 Accessing Data with the HBase API Part 1
–Hands-On Exercise - Data Access in the HBase Shell
 Accessing Data with the HBase API Part 2
–Hands-On Exercise - Using the Developer API
Course Outline
 Accessing Data with the HBase API Part 3
–Hands-On Exercise - Filters
 HBase Architecture Part 1
–Hands-On Exercise - Exploring HBase
 HBase Architecture Part 2
–Hands-On Exercise - Flushes and Compactions
 Installation and Configuration Part 1
 Installation and Configuration Part 2
–Hands-On Exercise - Administration
 Row Key Design in HBase
Course Outline (cont’d)
 Schema Design in HBase
–Hands-On Exercise - Detecting Hot Spots
 The HBase Ecosystem
–Hands-On Exercise - Hive and HBase
Course Outline (cont’d)
Scans
 A Scan can be used when:
–The exact row key is not known
–A group of rows needs to be accessed
 Scans can be bounded by a start and a stop row key
–The start row key is included in the results
–The stop row key is not included in the results; the Scan stops returning data when it reaches the stop row key
 Scans can be limited to certain column families or column descriptors
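To illustrate the last point in the Java API, here is a minimal sketch (not from the course slides) that limits a Scan to a whole column family and to one specific column. The names fam1, fam2, and col2 are placeholders borrowed from the shell examples later in this deck, and table is an open HTable handle as in the Java examples that follow.
Scan scan = new Scan();
// Limit the scan to one entire column family...
scan.addFamily(Bytes.toBytes("fam1"));
// ...and additionally to a single family:qualifier pair
scan.addColumn(Bytes.toBytes("fam2"), Bytes.toBytes("col2"));
ResultScanner results = table.getScanner(scan);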
Scanning
 A scan without a start and stop row will scan the entire table
 With a start row of "jordena" and a stop row of "turnerb"
–The scan will return all rows starting at "jordena" and will not include "turnerb"
Users Table (row key, then columns):
aaronsona fname: Aaron lname: Aaronson
harrise fname: Ernest lname: Harris
jordena fname: Adam lname: Jorden
laytonb fname: Bennie lname: Layton
millerb fname: Billie lname: Miller
nununezw fname: Willam lname: Nunez
rossw fname: William lname: Ross
sperberp fname: Phyllis lname: Sperber
turnerb fname: Brian lname: Turner
walkerm fname: Martin lname: Walker
zykowskiz fname: Zeph lname: Zykowski
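In the Java API (covered on the following slides), the bounded scan described above could be written as the following sketch; the table handle and the println are illustrative assumptions, not slide content. setStartRow and setStopRow take byte arrays, so the row keys are converted with Bytes.toBytes.
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("jordena"));  // start row is included
scan.setStopRow(Bytes.toBytes("turnerb"));   // stop row is excluded
ResultScanner results = table.getScanner(scan);
for (Result r : results) {
    // prints jordena, laytonb, millerb, nununezw, rossw, sperberp
    System.out.println(Bytes.toString(r.getRow()));
}
results.close();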
Scanning Rows With scan in HBase Shell
 Retrieve a group of rows with scan
 General form:
hbase> scan 'tablename' [,options]
 Examples:
hbase> scan 'table1'
hbase> scan 'table1', {LIMIT => 10}
hbase> scan 'table1', {STARTROW => 'start', STOPROW => 'stop'}
hbase> scan 'table1', {COLUMNS => ['fam1:col1', 'fam2:col2']}
Scan Java API: Complete Code
Scan s = new Scan();
ResultScanner rs = table.getScanner(s);
for (Result r : rs) {
    String rowKey = Bytes.toString(r.getRow());
    byte[] b = r.getValue(FAMILY_BYTES, COLUMN_BYTES);
    String user = Bytes.toString(b);
}
rs.close();
Scan Java API: Scan and ResultScanner
Scan s = new Scan();
ResultScanner rs = table.getScanner(s);
for (Result r : rs) {
    String rowKey = Bytes.toString(r.getRow());
    byte[] b = r.getValue(FAMILY_BYTES, COLUMN_BYTES);
    String user = Bytes.toString(b);
}
rs.close();
The Scan object is created and will scan all rows. The scan
is executed on the table and a ResultScanner object is
returned.
Scan Java API: Iterating
Scan s = new Scan();
ResultScanner rs = table.getScanner(s);
for (Result r : rs) {
    String rowKey = Bytes.toString(r.getRow());
    byte[] b = r.getValue(FAMILY_BYTES, COLUMN_BYTES);
    String user = Bytes.toString(b);
}
rs.close();
Using a for loop, you iterate through all Result objects in the ResultScanner. Each Result can be used to get the values.
Python Scan Code: Complete Code
scannerId = client.scannerOpen("tablename")
row = client.scannerGet(scannerId)
while row:
    columnvalue = row.columns.get(columnwithcf).value
    row = client.scannerGet(scannerId)
client.scannerClose(scannerId)
Python Scan Code: Open Scanner
scannerId = client.scannerOpen("tablename")
row = client.scannerGet(scannerId)
while row:
    columnvalue = row.columns.get(columnwithcf).value
    row = client.scannerGet(scannerId)
client.scannerClose(scannerId)
Call scannerOpen to create a scan object on the Thrift
server. This returns a scanner id that uniquely identifies the
scanner on the server.
Python Scan Code: Get the List
scannerId = client.scannerOpen("tablename")
row = client.scannerGet(scannerId)
while row:
    columnvalue = row.columns.get(columnwithcf).value
    row = client.scannerGet(scannerId)
client.scannerClose(scannerId)
The scannerGet method needs to be called with the
unique id. This returns a row of results.
Python Scan Code: Iterating Through
scannerId = client.scannerOpen("tablename")
row = client.scannerGet(scannerId)
while row:
    columnvalue = row.columns.get(columnwithcf).value
    row = client.scannerGet(scannerId)
client.scannerClose(scannerId)
The while loop continues as long as the scanner returns a new row. Columns must be addressed with the column family, ":", and the column descriptor. row is populated by another call to scannerGet and the loop repeats.
Python Scan Code: Closing the Scanner
scannerId = client.scannerOpen("tablename")
row = client.scannerGet(scannerId)
while row:
    columnvalue = row.columns.get(columnwithcf).value
    row = client.scannerGet(scannerId)
client.scannerClose(scannerId)
The scannerClose method call is very important. This
closes the Scan object on the Thrift server. Not calling this
method can leak Scan objects on the server.
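The same discipline applies on the Java side: a ResultScanner holds resources on the server until it is closed. A common pattern (a sketch, not part of the slides) is Java 7 try-with-resources, which closes the scanner even if the loop throws an exception.
Scan s = new Scan();
try (ResultScanner rs = table.getScanner(s)) {   // ResultScanner implements Closeable
    for (Result r : rs) {
        String rowKey = Bytes.toString(r.getRow());
    }
}   // rs is closed automatically here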
Scanner Caching
 Scan results can be retrieved in batches to improve performance
–Performance will improve, but memory usage will increase
 Java API:
Scan s = new Scan();
s.setCaching(20);
 Python with Thrift:
rowsArray = client.scannerGetList(scannerId, 10)
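Putting the caching fragment into context, here is a hedged sketch of a cached scan in Java; the value 20 simply mirrors the slide, and an appropriate setting depends on row size and available client memory.
Scan s = new Scan();
s.setCaching(20);                          // ask the server for 20 rows per RPC
ResultScanner rs = table.getScanner(s);
try {
    for (Result r : rs) {
        byte[] b = r.getValue(FAMILY_BYTES, COLUMN_BYTES);
        String user = Bytes.toString(b);   // process each row as before
    }
} finally {
    rs.close();                            // always release the scanner
}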

Editor's Notes

  • #19 (Scanning Rows With scan in HBase Shell): scan 'table1' scans the entire table; scan 'table1', {LIMIT => 10} scans the first 10 rows in the table; scan 'table1', {STARTROW => 'start', STOPROW => 'stop'} scans between the start and stop rows; scan 'table1', {COLUMNS => ['fam1:col1', 'fam2:col2']} scans the entire table for just those two column families.
  • #20 (Scan Java API: Complete Code): The full code listing. A virtual line-by-line discussion follows.
  • #28 (Scanner Caching): Note that for the Python code, the row comes back as an array.