Introduction to Apache HBase Training
Jesse Anderson
Curriculum Developer and Instructor
Agenda
• Why Cloudera Training?
• Target Audience and Prerequisites
• Course Outline
• Short Presentation Based on Actual Course Material
- Using Scans to Access Data
• Q&A
Rising demand for Big Data and analytics experts, coupled with a deficiency of talent, will result in a shortfall of 32,000 trained professionals by 2015.
Source: Accenture, “Analytics in Action,” March 2013.
Cloudera Trains the Top Companies
• Employees from 55% of the Fortune 100 have attended live Cloudera training
• Cloudera has trained Big Data professionals from 100% of the top 20 global technology firms that use Hadoop
Source: Fortune, “Fortune 500” and “Global 500,” May 2012.
Learning Path: Developers
• Developer Training: Learn to code and write MapReduce programs for production; Master advanced API topics required for real-world data analysis
• Data Analyst Training: Run full analyses natively on Big Data without BI software; Eliminate complexity to perform ad hoc queries in real time
• HBase Training: Design schemas to minimize latency on massive data sets; Scale hundreds of thousands of operations per second
• Intro to Data Science: Implement recommenders and data experiments; Draw actionable insights from analysis of disparate data
Learning Path: Administrators
• Administrator Training: Configure, install, and monitor clusters for optimal performance; Implement security measures and multi-user functionality
• Enterprise Training: Use Cloudera Manager to speed deployment and scale the cluster; Learn which tools and techniques improve cluster performance
• HBase Training: Implement massively distributed, columnar storage at scale; Enable random, real-time read/write access to all data
• Data Analyst Training: Vertically integrate basic analytics into data management; Transform and manipulate data to drive high-value utilization
Why Cloudera Training?
1. Broadest Range of Courses – Developer, Admin, Analyst, HBase, Data Science
2. Most Experienced Instructors – Over 15,000 students trained since 2009
3. Leader in Certification – Over 5,000 accredited Cloudera professionals
4. State of the Art Curriculum – Classes updated regularly as Hadoop evolves
5. Widest Geographic Coverage – Most classes offered: 50 cities worldwide plus online
6. Most Relevant Platform & Community – CDH deployed more than all other distributions combined
7. Depth of Training Material – Hands-on labs and VMs support live instruction
8. Ongoing Learning – Video tutorials and e-learning complement training
“Cloudera is the best vendor evangelizing the Big Data movement and is doing a great service promoting Hadoop in the industry. Developer training was a great way to get started on my journey.”
Cloudera Training for Apache HBase
About the Course
 This course was created for people in developer and operations roles,
including
–Developers
–DevOps
–Database Administrators
–Data Warehouse Engineers
–Administrators
 Also useful for others who want to access HBase
–Business Intelligence Developers
–ETL Developers
–Quality Assurance Engineers
Intended Audience
 Developers who want to learn details of MapReduce programming
–Recommend Cloudera Developer Training for Apache Hadoop
 System administrators who want to learn how to install/configure tools
–Recommend Cloudera Administrator Training for Apache Hadoop
Who Should Not Take this Course
 No prior knowledge of Hadoop is required
 What is required is an understanding of
–Basic end-user UNIX commands
 An optional understanding of
–Basic relational database concepts
–Basic knowledge of SQL
Course Prerequisites
SELECT id, first_name, last_name
FROM customers
ORDER BY last_name;
$ mkdir /data
$ cd /data
$ rm /home/tomwheeler/salesreport.txt
During this course, you will learn:
 The core technologies of Apache HBase
 How HBase and HDFS work together
 How to work with the HBase shell, Java API, and Thrift API
 The HBase storage and cluster architecture
 The fundamentals of HBase administration
 Best practices for installing and configuring HBase
 Advanced features of the HBase API
 The importance of schema design in HBase
 How to work with HBase ecosystem projects
Course Objectives
 Hadoop Introduction
–Hands-On Exercise - Using HDFS
 Introduction to HBase
 HBase Concepts
–Hands-On Exercise - HBase Data Import
 The HBase Administration API
–Hands-On Exercise - Using the HBase Shell
 Accessing Data with the HBase API Part 1
–Hands-On Exercise - Data Access in the HBase Shell
 Accessing Data with the HBase API Part 2
–Hands-On Exercise - Using the Developer API
Course Outline
 Accessing Data with the HBase API Part 3
–Hands-On Exercise - Filters
 HBase Architecture Part 1
–Hands-On Exercise - Exploring HBase
 HBase Architecture Part 2
–Hands-On Exercise - Flushes and Compactions
 Installation and Configuration Part 1
 Installation and Configuration Part 2
–Hands-On Exercise - Administration
 Row Key Design in HBase
Course Outline (cont’d)
 Schema Design in HBase
–Hands-On Exercise - Detecting Hot Spots
 The HBase Ecosystem
–Hands-On Exercise - Hive and HBase
Course Outline (cont’d)
Scans
 A Scan can be used when:
–The exact row key is not known
–A group of rows needs to be accessed
 Scans can be bounded by a start and a stop row key
–The start row key is included in the results
–The stop row key is not included in the results; the Scan stops returning data when it reaches the stop row key
 Scans can be limited to certain column families or column descriptors
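To illustrate the last point in the Java API, here is a minimal sketch (not from the course slides) that limits a Scan to a whole column family and to one specific column. The names fam1, fam2, and col2 are placeholders borrowed from the shell examples later in this deck, and table is an open HTable handle as in the Java examples that follow.
Scan scan = new Scan();
// Limit the scan to one entire column family...
scan.addFamily(Bytes.toBytes("fam1"));
// ...and additionally to a single family:qualifier pair
scan.addColumn(Bytes.toBytes("fam2"), Bytes.toBytes("col2"));
ResultScanner results = table.getScanner(scan);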
Scanning
 A scan without a start and stop row will scan the entire table
 With a start row of "jordena" and a stop row of "turnerb"
–The scan will return all rows starting at "jordena" and will not include "turnerb"
Users Table (row key, then columns):
aaronsona fname: Aaron lname: Aaronson
harrise fname: Ernest lname: Harris
jordena fname: Adam lname: Jorden
laytonb fname: Bennie lname: Layton
millerb fname: Billie lname: Miller
nununezw fname: Willam lname: Nunez
rossw fname: William lname: Ross
sperberp fname: Phyllis lname: Sperber
turnerb fname: Brian lname: Turner
walkerm fname: Martin lname: Walker
zykowskiz fname: Zeph lname: Zykowski
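In the Java API (covered on the following slides), the bounded scan described above could be written as the following sketch; the table handle and the println are illustrative assumptions, not slide content. setStartRow and setStopRow take byte arrays, so the row keys are converted with Bytes.toBytes.
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("jordena"));  // start row is included
scan.setStopRow(Bytes.toBytes("turnerb"));   // stop row is excluded
ResultScanner results = table.getScanner(scan);
for (Result r : results) {
    // prints jordena, laytonb, millerb, nununezw, rossw, sperberp
    System.out.println(Bytes.toString(r.getRow()));
}
results.close();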
Scanning Rows With scan in HBase Shell
 Retrieve a group of rows with scan
 General form:
hbase> scan 'tablename' [,options]
 Examples:
hbase> scan 'table1'
hbase> scan 'table1', {LIMIT => 10}
hbase> scan 'table1', {STARTROW => 'start', STOPROW => 'stop'}
hbase> scan 'table1', {COLUMNS => ['fam1:col1', 'fam2:col2']}
Scan Java API: Complete Code
Scan s = new Scan();
ResultScanner rs = table.getScanner(s);
for (Result r : rs) {
    String rowKey = Bytes.toString(r.getRow());
    byte[] b = r.getValue(FAMILY_BYTES, COLUMN_BYTES);
    String user = Bytes.toString(b);
}
rs.close();
Scan Java API: Scan and ResultScanner
Scan s = new Scan();
ResultScanner rs = table.getScanner(s);
for (Result r : rs) {
    String rowKey = Bytes.toString(r.getRow());
    byte[] b = r.getValue(FAMILY_BYTES, COLUMN_BYTES);
    String user = Bytes.toString(b);
}
rs.close();
The Scan object is created and will scan all rows. The scan
is executed on the table and a ResultScanner object is
returned.
Scan Java API: Iterating
Scan s = new Scan();
ResultScanner rs = table.getScanner(s);
for (Result r : rs) {
    String rowKey = Bytes.toString(r.getRow());
    byte[] b = r.getValue(FAMILY_BYTES, COLUMN_BYTES);
    String user = Bytes.toString(b);
}
rs.close();
Using a for loop, you iterate through all Result objects in the ResultScanner. Each Result can be used to get the values.
Python Scan Code: Complete Code
scannerId = client.scannerOpen("tablename")
row = client.scannerGet(scannerId)
while row:
    columnvalue = row.columns.get(columnwithcf).value
    row = client.scannerGet(scannerId)
client.scannerClose(scannerId)
Python Scan Code: Open Scanner
scannerId = client.scannerOpen("tablename")
row = client.scannerGet(scannerId)
while row:
    columnvalue = row.columns.get(columnwithcf).value
    row = client.scannerGet(scannerId)
client.scannerClose(scannerId)
Call scannerOpen to create a scan object on the Thrift
server. This returns a scanner id that uniquely identifies the
scanner on the server.
Python Scan Code: Get the List
scannerId = client.scannerOpen("tablename")
row = client.scannerGet(scannerId)
while row:
    columnvalue = row.columns.get(columnwithcf).value
    row = client.scannerGet(scannerId)
client.scannerClose(scannerId)
The scannerGet method needs to be called with the
unique id. This returns a row of results.
Python Scan Code: Iterating Through
scannerId = client.scannerOpen("tablename")
row = client.scannerGet(scannerId)
while row:
    columnvalue = row.columns.get(columnwithcf).value
    row = client.scannerGet(scannerId)
client.scannerClose(scannerId)
The while loop continues as long as the scanner returns a new row. Columns must be addressed with the column family, ":", and the column descriptor. row is populated by another call to scannerGet and the loop repeats.
Python Scan Code: Closing the Scanner
scannerId = client.scannerOpen("tablename")
row = client.scannerGet(scannerId)
while row:
    columnvalue = row.columns.get(columnwithcf).value
    row = client.scannerGet(scannerId)
client.scannerClose(scannerId)
The scannerClose method call is very important. This
closes the Scan object on the Thrift server. Not calling this
method can leak Scan objects on the server.
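The same discipline applies on the Java side: a ResultScanner holds resources on the server until it is closed. A common pattern (a sketch, not part of the slides) is Java 7 try-with-resources, which closes the scanner even if the loop throws an exception.
Scan s = new Scan();
try (ResultScanner rs = table.getScanner(s)) {   // ResultScanner implements Closeable
    for (Result r : rs) {
        String rowKey = Bytes.toString(r.getRow());
    }
}   // rs is closed automatically here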
Scanner Caching
 Scan results can be retrieved in batches to improve performance
–Performance will improve, but memory usage will increase
 Java API:
Scan s = new Scan();
s.setCaching(20);
 Python with Thrift:
rowsArray = client.scannerGetList(scannerId, 10)
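Putting the caching fragment into context, here is a hedged sketch of a cached scan in Java; the value 20 simply mirrors the slide, and an appropriate setting depends on row size and available client memory.
Scan s = new Scan();
s.setCaching(20);                          // ask the server for 20 rows per RPC
ResultScanner rs = table.getScanner(s);
try {
    for (Result r : rs) {
        byte[] b = r.getValue(FAMILY_BYTES, COLUMN_BYTES);
        String user = Bytes.toString(b);   // process each row as before
    }
} finally {
    rs.close();                            // always release the scanner
}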

Editor's Notes

  • #19 (Scanning Rows With scan in HBase Shell): scan 'table1' scans the entire table; scan 'table1', {LIMIT => 10} scans the first 10 rows in the table; scan 'table1', {STARTROW => 'start', STOPROW => 'stop'} scans between the start and stop rows; scan 'table1', {COLUMNS => ['fam1:col1', 'fam2:col2']} scans the entire table for just those two column families.
  • #20 (Scan Java API: Complete Code): The full code listing. A virtual line-by-line discussion follows.
  • #28 (Scanner Caching): Note that for the Python code, the row comes back as an array.