IBM PureData System for Analytics
© Copyright IBM Corp. 2012. All rights reserved
IBM Software
Information Management
IBM PureData System for Analytics
Hands-On Labs
Table of Contents:
Connecting to the Host and Database
Database Administration
Data Distribution
NzAdmin
Loading and Unloading Data
Backup & Restore
Query Optimization
Optimization Objects
Groom
Stored Procedures
IBM PureData System for Analytics … Powered by Netezza Technology
Connecting to the Host and Database
Hands-On Lab
Table of Contents
1 Introduction
1.1 VMware Basics
1.2 Architecture of the Virtual Machines
1.3 Tips and Tricks on Using the PureData System Virtual Machines
2 Connecting to PureData System Host
2.1 Open the Virtual Machines in VMware
2.2 Start the Virtual Machines
3 Connecting to System Database Using nzsql
3.1 Using PuTTY
3.2 Connect to the System Database Using nzsql
3.3 Commonly Used Commands and SQL Statements
3.4 Exit nzsql
1 Introduction
1.1 VMware Basics
VMware® Player and VMware Workstation are widely used for test beds and developer environments across the IT industry.
Among their many other functions, they allow an “up and running” PureData System to be distributed easily
to anybody’s computer, be it a notebook, desktop, or server. The VMware image can be deployed for
simple demos and educational purposes, or it can serve as the base for your own development and experiments on top of the given
environment.
What is a VMware image?
VMware provides a virtual computer environment on top of an existing operating system on Intel® or AMD™ processor based
systems. The virtual computer has all the usual components, such as a CPU, memory, and disks, as well as network, USB devices, and
sound. The CPU and memory are simply existing resources provided by the underlying operating system (visible as
processes whose names start with “vmware”). The virtual machine disks, on the other hand, are a collection of files in the host operating system
that can be copied between systems and platforms. The virtual disk files make up most of the image, while the actual
description file of the virtual machine is small.
1.2 Architecture of the Virtual Machines
For the hands-on lab portion of the bootcamp, we will be using two virtual machines (VMs) to demonstrate the usability of
PureData System. Because of the nature of the virtualized environment and the host hardware, performance will be limited.
Please use these exercises only as a guide to familiarize yourself with PureData System.
The virtual images are adaptations of an appliance, chosen for their portability and convenience. One virtual image acts
as the host machine and the other as an SPU, which typically resides in a PureData System appliance. The Host image is
the main gateway, where the Netezza Performance Server (NPS) code resides and is accessed. The second image is
the SPU, which contains 5 virtual hard drives of 20 GB each as well as a virtual FPGA. The hard disks here are not partitioned
into primary, mirror, and temp partitions as you would observe on a PureData System appliance. Instead, 4 of the disks contain
only primary data partitions and the fifth disk is used for temporary data.
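[Figure: the Host and SPU virtual machines running in VMware on the host operating system. Labels: NPS code (Host), temp disk and FPGA (SPU), PuTTY (host operating system).]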
1.3 Tips and Tricks on Using the PureData System Virtual Machines
The PureData System appliance is designed and fine-tuned for a specific set of hardware. In order to demonstrate the system in
a virtualized environment, some adaptations were made to the virtual machines. To ensure the labs run smoothly, we have
listed some pointers for using the VMs:
Always boot up the Host image before the SPU image
When booting up the VMs, start the Host image first. Once it is fully booted, start the SPU image. By then the
Host image is listening for connections from the SPU machine, and the connection should be made automatically.
After pausing the virtual machines, the nz services need to be restarted
The VMs may have been paused, for example because the host operating system went into sleep or hibernation mode or because the images were
paused in VMware Workstation. To continue using the VMs, run the following commands at the prompt of the Host image:
[nz@netezza ~]$ nzstop
[nz@netezza ~]$ nzstart
When starting the SPU image for the first time, VMware asks whether the image was copied or moved
The first time the SPU image is booted, VMware Workstation asks whether the image was copied or moved.
Click “I moved it”. This ensures that the SPU image keeps the same MAC address as before, which is
crucial for the Host and SPU images to connect.
2 Connecting to PureData System Host
In most bootcamp locations the steps in this chapter will already have been carried out by your bootcamp instructor. You can review them
to learn how the images are installed. If your NPS system is already running, jump straight to chapter 3.
2.1 Open the Virtual Machines in VMware
2.1.1 Unpacking the Images
The virtual machines for the PureData System Bootcamp are delivered as a self-extracting set of rar files. For easy handling the
files are split into 700 MB volumes. Download all the volumes to the same directory. Double-click the executable file and
select the destination folder to begin the extraction process.
2.1.2 Open the HOST Virtual Machine
There are two ways to open the VMware virtual machines:
Option 1: Double-click the file “HOST.vmx” in Windows Explorer or your Linux file browser.
Option 2: Select File > Open… in the VMware console. This brings up a dialog in which you browse to the folder
where the VM image resides, select “HOST.vmx”, and click the “Open” button.
Either option should bring up the VMware console.
2.1.3 Open the SPU Virtual Machine
Repeat the steps you chose in the previous section to open the “SPU.vmx” file. The VMware console should now show
a tab for each image.
2.2 Start the Virtual Machines
2.2.1 Start and Log In to the Host Virtual Machine
To start using the virtual machines, first boot up the Host machine. Click on the “HOST” tab, then press the “Power On”
button in the upper left corner of the VMware console. You should see the Red Hat operating system boot screen.
Allow it to boot for a couple of minutes until it reaches the PureData System login prompt.
At the login prompt, log in with the following credentials:
Username: nz
Password: nz
Once logged in, we can check the state of the machine by issuing the following command:
[nz@netezza ~]$ nzstate
The system state should be “Discovering”, which signifies that the host machine is ready for connection with the SPUs.
2.2.2 Starting the SPU Virtual Machine
Now we can start the SPU image. As with the Host image, click on the “SPU” tab in the VMware console, then
click on the “Power On” button.
The first time the SPU image is booted, a prompt asks whether the virtual machine was moved or copied.
Choose the “I moved it” radio button and click “OK”. This ensures that the previously configured MAC address of the SPU
image remains the same, which is crucial for communication between the Host and SPU virtual machines.
After the SPU is fully booted, the bottom right corner of its screen should show
5 virtual hard disks in a healthy state.
We can now go back to the Host image to check the status of the connection. Click on the “HOST” tab and enter the following
command at the prompt:
[nz@netezza ~]$ nzstate
The system state should display “Online”.
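Because the SPU may need some time to connect, the state check can be repeated until the system reports “Online”. The following is a minimal sketch, assuming a POSIX shell on the Host and assuming that nzstate prints the state name somewhere in its output. The helper wait_for_state and its parameters are hypothetical names introduced for illustration and are not part of the PureData System tools.

```shell
# wait_for_state STATE_CMD WANTED [TRIES] [DELAY]
# Runs STATE_CMD repeatedly until its output contains WANTED.
# Returns 0 on success, 1 if WANTED never appeared within TRIES attempts.
wait_for_state() {
    state_cmd=$1
    wanted=$2
    tries=${3:-30}    # assumed default: 30 attempts
    delay=${4:-10}    # assumed default: 10 seconds between attempts
    while [ "$tries" -gt 0 ]; do
        # A failing or silent command simply counts as "no match".
        if "$state_cmd" 2>/dev/null | grep -q "$wanted"; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# On the Host image this could be called as:
#   wait_for_state nzstate Online && echo "system is online"
```

Bounding the number of attempts keeps the check from waiting forever if the SPU image never connects, for example because its MAC address changed.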
3 Connecting to System Database Using nzsql
Most Bootcamp locations will have a predefined PuTTY entry netezza that already contains the IP address; open it by
double-clicking on the saved connection.
3.1 Using PuTTY
Since we will not be using any graphical tools on the Host virtual machine, we do not have to work at the
PureData System prompt directly in VMware. Instead, we can connect to the Host via SSH using a tool such as PuTTY. We will be
using the PuTTY console for the rest of the labs, since this better simulates the real-life scenario of connecting to a remote
PureData System.
First, locate the PuTTY executable in the folder where the VMs were extracted. Under the folder “Tools” you should
find the file “putty.exe”. Double-click it to run it. In the PuTTY interface, enter the IP address of the Host image, 192.168.239.2, and
select SSH as the connection type. Finally, click “Open” to start the session.
Once the prompt window is open, log in with the following credentials:
Username: nz
Password: nz
We are now ready to connect to the system database and execute commands at the PuTTY command prompt.
3.2 Connect to the System Database Using nzsql
Since we have not created any users or databases yet, we will connect to the default database as the default user, with the
following credentials:
Database: system
Username: admin
Password: password
When issuing the nzsql command, the user can supply the database, the user account, and the password as arguments.
The syntax is shown below. Do not execute this command; it only demonstrates the syntax:
nzsql -d [db_name] -u [user] -pw [password]
Alternatively, these values can be stored in the command shell, and nzsql uses them when it is issued without any
arguments. Let’s verify the database, user, and password values currently stored in the command shell by issuing the commands:
[nz@netezza ~]$ printenv NZ_DATABASE
[nz@netezza ~]$ printenv NZ_USER
[nz@netezza ~]$ printenv NZ_PASSWORD
The commands should print system, admin, and password, respectively.
Since the current values correspond to our desired values, no modification is required.
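Had the stored values differed, they could be set in the shell before starting nzsql. The following is a minimal sketch, assuming a POSIX shell and the lab defaults listed above. The function check_nz_env is a hypothetical helper introduced for illustration and is not part of the Netezza tools.

```shell
# Report any nzsql connection variable that is unset or empty.
# check_nz_env is a hypothetical helper, not part of the Netezza tools.
check_nz_env() {
    missing=0
    for name in NZ_DATABASE NZ_USER NZ_PASSWORD; do
        # Indirectly read the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Lab defaults from this section; replace them for a different system.
export NZ_DATABASE=system
export NZ_USER=admin
export NZ_PASSWORD=password

# With all three set, nzsql can be started without arguments.
check_nz_env && echo "nzsql will connect to $NZ_DATABASE as $NZ_USER"
```

Exports made this way last only for the current shell session. Storing a password in an environment variable is convenient for a lab image but would need review on a shared system.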
Next, let’s take a look at the options available when starting nzsql. Type in the following command:
[nz@netezza ~]$ nzsql -?
The -? option lists the usage and all options of the nzsql command. In this exercise, we will start nzsql without arguments. At
the command prompt, issue the command:
[nz@netezza ~]$ nzsql
This brings up the nzsql prompt, SYSTEM(ADMIN)=>, which shows a connection to the system database as user admin.
3.3 Commonly Used Commands and SQL Statements
nzsql has commonly used internal commands that start with “\”, which we will demonstrate in this section. First, we will run the two help
commands to familiarize ourselves with them. The \h command lists the available SQL commands, while
the \? command lists the internal slash commands. Examine the output of both commands:
SYSTEM(ADMIN)=> \h
SYSTEM(ADMIN)=> \?
In the output of the \? command, we find the \l internal command, which lists all the databases.
Let’s list the databases by entering:
SYSTEM(ADMIN)=> \l
List of databases
 DATABASE  | OWNER
-----------+-----------
 MASTER_DB | ADMIN
 SYSTEM    | ADMIN
(2 rows)
Secondly, we will use “\dSt” to list the system tables within the system database:
SYSTEM(ADMIN)=> \dSt
List of relations
 Name                         | Type         | Owner
------------------------------+--------------+-------
 _T_ACCESS_TIME               | SYSTEM TABLE | ADMIN
 _T_ACL                       | SYSTEM TABLE | ADMIN
 _T_ACTIONFRAG                | SYSTEM TABLE | ADMIN
 _T_AGGREGATE                 | SYSTEM TABLE | ADMIN
 _T_ALTBASE                   | SYSTEM TABLE | ADMIN
 _T_AM                        | SYSTEM TABLE | ADMIN
 _T_AMOP                      | SYSTEM TABLE | ADMIN
 _T_AMPROC                    | SYSTEM TABLE | ADMIN
 _T_ATTRDEF                   | SYSTEM TABLE | ADMIN
.
.
.
Note: press the space bar to scroll down the result set when you see “--More--” on the screen.
In the output of the previous command, we can see that there is a user table called “_T_USER”. To find out what is stored in that table, we
will use the describe command \d:
SYSTEM(ADMIN)=> \d _T_USER
This returns all the columns of the _T_USER system table. Next, we want to know which users are stored in the table. In
case too many rows would be returned at once, we first count the rows it contains by entering the following query:
SYSTEM(ADMIN)=> SELECT COUNT(*) FROM (SELECT * FROM _T_USER) AS "Wrapper";
The query above is essentially the same as “SELECT COUNT(*) FROM _T_USER;”. We use the sub-select
syntax here to demonstrate how the result set of a more complex query can be evaluated. The result should show that there is currently 1
entry in the user table. We can enter the following query to list the user names:
SYSTEM(ADMIN)=> SELECT USENAME FROM _T_USER;
3.4 Exit nzsql
To exit nzsql, use the command \q to return to the command prompt of the PureData System host.
© Copyright IBM Corporation 2011
All Rights Reserved.
IBM Canada
8200 Warden Avenue
Markham, ON
L6G 1C7
Canada
IBM, the IBM logo, ibm.com and Tivoli are trademarks or registered
trademarks of International Business Machines Corporation in the
United States, other countries, or both. If these and other
IBM trademarked terms are marked on their first occurrence in this
information with a trademark symbol (® or ™), these symbols indicate
U.S. registered or common law trademarks owned by IBM at the time
this information was published. Such trademarks may also be
registered or common law trademarks in other countries. A current list
of IBM trademarks is available on the Web at “Copyright and
trademark information” at ibm.com/legal/copytrade.shtml
Other company, product and service names may be trademarks or
service marks of others.
References in this publication to IBM products and services do not
imply that IBM intends to make them available in all countries in which
IBM operates.
No part of this document may be reproduced or transmitted in any form
without written permission from IBM Corporation.
Product data has been reviewed for accuracy as of the date of initial
publication. Product data is subject to change without notice. Any
statements regarding IBM’s future direction and intent are subject to
change or withdrawal without notice, and represent goals and
objectives only.
THE INFORMATION PROVIDED IN THIS DOCUMENT IS
DISTRIBUTED “AS IS” WITHOUT ANY WARRANTY, EITHER
EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE OR NON-INFRINGEMENT.
IBM products are warranted according to the terms and conditions of
the agreements (e.g. IBM Customer Agreement, Statement of Limited
Warranty, International Program License Agreement, etc.) under which
they are provided.
Database Administration
Hands-On Lab
Table of Contents
1 Introduction
1.1 Objectives
2 Creating IBM PureData System Users and Groups
2.1 Creating New PureData System Users
2.2 Creating New PureData System Groups
3 Creating a PureData System Database
3.1 Creating a Database and Transferring Ownership
3.2 Assigning Authorities and Privileges
3.3 Creating PureData System Tables
3.4 Using DML Queries
1 Introduction
A factory-configured and installed IBM PureData System will include some of the following components:
• An IBM PureData System warehouse appliance with pre-installed IBM PureData System software
• A preconfigured Linux operating system (with PureData System modifications)
• Several preconfigured Linux users and groups:
o The nz user is the default PureData System administration account
o The nz group is the default group
• An IBM PureData System database user named ADMIN. The ADMIN user is the database super-user, and has full
access to all system functions and objects
• A preconfigured database group named PUBLIC. All database users are automatically placed in the group PUBLIC and
therefore inherit all of its privileges
The IBM PureData System warehouse appliance includes a highly optimized SQL dialect called PureData System Structured
Query Language. You can use SQL commands to create and manage your PureData System databases, user access, and
permissions for the databases, as well as to query and modify the contents of the databases.
On a new IBM PureData System, there is typically one main database, SYSTEM, and a database template, MASTER_DB.
IBM PureData System uses MASTER_DB as a template for all other user databases that are created on the system.
Initially, only the ADMIN user can create new databases, but the ADMIN user can grant other users permission to create
databases as well. The ADMIN user can also make another user the owner of a database, which gives that user ADMIN-like
control over that database and its contents. The database creator becomes the default owner of the database. The owner can
remove the database and all its objects, even if other users own objects within the database. Within a database, permitted users
can create tables, populate them with data, and query their contents.
1.1 Objectives
This lab will guide you through the typical steps to create and manage new IBM PureData System users and groups after an
IBM PureData System has been delivered and configured. This will include creating a new database and assigning the
appropriate privileges. The users and the database that you create in this lab will be used as a basis for the remaining labs in
this bootcamp. After this lab you will have a basic understanding of how to plan and create an IBM PureData System database
environment.
• The first part of this lab will examine creating IBM PureData System users and groups
• The second part of this lab will explore creating and using a database and tables. The table schema to be used within
this bootcamp will be explained in the Data Distribution lab.
2 Creating IBM PureData System Users and Groups
The initial task after an IBM PureData System has been set up is to create the database environment. This typically begins by
creating a new set of database users and user groups before creating the database. You will use the ADMIN user to start
creating additional database users and user groups. Then you will assign the appropriate authorities after the database has
been created in the next section. The ADMIN user should only be used to perform administrative tasks within the IBM PureData
System and is not recommended for regular use. Also, it is highly advisable to develop a security access model to control user
access against the database and the database objects in an IBM PureData System. This will involve creating separate users to
perform certain tasks.
The security access model for this bootcamp environment will use three PureData System database users:
o LABADMIN
o LABUSER
o DBUSER
and two PureData System database user groups:
o LAGRP
o LUGRP
1. Connect to the Netezza image using PuTTY. Log in to 192.168.239.2 as user “nz” with password “nz”. 192.168.239.2 is
the default IP address for a local VM, which is used in most bootcamp environments. In some cases where the images
are hosted remotely, the instructors will provide the host IP addresses, which will vary between machines.
2. Connect to the system database as the PureData System database super-user, ADMIN, using the nzsql interface:
[nz@netezza ~]$ nzsql
or,
[nz@netezza ~]$ nzsql -d system -u admin -pw password
There are different options you can use with the nzsql interface. Here we present two options. The first option uses
the information set in the NZ environment variables NZ_DATABASE, NZ_USER, and NZ_PASSWORD. By default the
environment variables are set to the following values:
NZ_DATABASE=system
NZ_USER=admin
NZ_PASSWORD=password
So you do not need to specify the database name or the user. In the second option the information is stated explicitly using
the -d, -u, and -pw options, which specify the database name, the user, and the user’s password, respectively. This
option is useful when you want to connect to a different database or as a different user than specified in the NZ
environment variables.
You will see the following:
Welcome to nzsql, the Netezza SQL interactive terminal.
Type: \h for help with SQL commands
      \? for help on internal slash commands
      \g or terminate with semicolon to execute query
      \q to quit
SYSTEM(ADMIN)=>
2.1 Creating New PureData System Users
The three new PureData System database users will initially be created using the ADMIN user. The LABADMIN user will be the
full owner of the bootcamp database. The LABUSER user will be allowed to perform data manipulation language (DML)
operations (INSERT, UPDATE, DELETE) against all of the tables in the database, but will not be allowed to create new
objects such as tables in the database. Lastly, the DBUSER user will only be allowed to read tables in the database, that is, it
will only have the LIST and SELECT privileges against tables in the database.
The basic syntax to create a user is:
CREATE USER username WITH PASSWORD 'string';
1. As the PureData System database super-user, ADMIN, you can now create the first user, LABADMIN, which will
be the administrator of the database. (Note that user and group names are not case sensitive.)
SYSTEM(ADMIN)=> create user labadmin with password 'password';
Later in this lab you will assign administrative ownership of the lab database to this user.
2. Now you will create two additional PureData System database users that will have restricted access to the database.
The first user, LABUSER, will have full DML access to the data in the tables, but will not be able to create or alter tables.
For now you will just create the user. We will set the privileges after the database is created:
SYSTEM(ADMIN)=> create user labuser with password 'password';
3. Finally, create the user DBUSER. This user will have even more limited access to the database, since it will only be
allowed to select data from the tables within the database. Again, you will set the privileges after the database is
created:
SYSTEM(ADMIN)=> create user dbuser with password 'password';
4. To list the existing PureData System database users in the environment, use the \du internal slash option:
SYSTEM(ADMIN)=> \du
This will return a list of all database users:
List of Users
 USERNAME  | VALIDUNTIL | ROWLIMIT | SESSIONTIMEOUT | QUERYTIMEOUT | DEF_PRIORITY | MAX_PRIORITY | USERESOURCEGRPID | USERESOURCEGRPNAME | CROSS_JOINS_ALLOWED
-----------+------------+----------+----------------+--------------+--------------+--------------+------------------+--------------------+---------------------
 ADMIN     |            |          | 0              | 0            | NONE         | NONE         |                  | _ADMIN_            | NULL
 DBUSER    |            | 0        | 0              | 0            | NONE         | NONE         |                  | PUBLIC             | NULL
 LABADMIN  |            | 0        | 0              | 0            | NONE         | NONE         |                  | PUBLIC             | NULL
 LABUSER   |            | 0        | 0              | 0            | NONE         | NONE         |                  | PUBLIC             | NULL
(4 rows)
Additional columns such as USERESOURCEGRPNAME are intended for resource management, which is covered later in the
WLM presentation.
2.2 Creating New PureData System Groups
PureData System database user groups are useful for organizing and managing PureData System database users. By default
PureData System contains one group with the name PUBLIC. All users become members of the PUBLIC group when they are
created. Users can be members of other groups as well. In this section we will create two new PureData System
database user groups. They will initially be created by the ADMIN user.
We will create an administrative group, LAGRP, which is short for Lab Admin Group. This group will contain the LABADMIN user.
The second group we create will be LUGRP, or Lab User Group. This group will contain the users LABUSER and DBUSER.
Two different methods will be used to add the existing users to the newly created groups. Alternatively, the groups could be
created first and then the users. The basic command to create a group is:
CREATE GROUP groupname;
1. As the PureData System database super-user, ADMIN, you will now create the first group, LAGRP, which will be the
administrative group for the LABADMIN user:
SYSTEM(ADMIN)=> create group lagrp;
2. After the LAGRP group is created, you will add the LABADMIN user to this group. This is accomplished with the
ALTER statement. You can ALTER either the user or the group. For this task you will ALTER the group to add the
LABADMIN user to the LAGRP group:
SYSTEM(ADMIN)=> alter group lagrp with add user labadmin;
To ALTER the user instead, you would use the following command:
alter user labadmin with group lagrp;
3. Now you will create the second group, LUGRP, which will be the user group for both the LABUSER and DBUSER
users. You can specify the users to be included in the group when creating the group:
SYSTEM(ADMIN)=> create group lugrp with add user labuser, dbuser;
If you had created the group before creating the user, you could add the user to the group when creating the user. To create
the LABUSER user and add it to an existing group LUGRP, you would use the following command:
create user LABUSER with in group LUGRP;
4. To list the existing PureData System groups in the environment, use the \dg internal slash option:
SYSTEM(ADMIN)=> \dg
This will return a list of all groups in the system. In our test system these are the default group PUBLIC and the two groups you
have just created:
List of Groups
 GROUPNAME | ROWLIMIT | SESSIONTIMEOUT | QUERYTIMEOUT | DEF_PRIORITY | MAX_PRIORITY | GRORSGPERCENT | RSGMAXPERCENT | JOBMAX | CROSS_JOINS_ALLOWED
-----------+----------+----------------+--------------+--------------+--------------+---------------+---------------+--------+-------------------
 LAGRP     | 0        | 0              | 0            | NONE         | NONE         | 0             | 100           | 0      | NULL
 LUGRP     | 0        | 0              | 0            | NONE         | NONE         | 0             | 100           | 0      | NULL
 PUBLIC    | 0        | 0              | 0            | NONE         | NONE         | 20            | 100           | 0      | NULL
(3 rows)
The other columns are explained in the WLM presentation.
5. To list the users in a group you can use one of two internal slash options, \dG or \dU. The internal slash option \dG
lists the groups with their associated users:
SYSTEM(ADMIN)=> \dG
This returns a list of all groups and the users they contain:
List of Users in a Group
 GROUPNAME | USERNAME
-----------+-----------
 LAGRP     | LABADMIN
 LUGRP     | DBUSER
 LUGRP     | LABUSER
 PUBLIC    | DBUSER
 PUBLIC    | LABADMIN
 PUBLIC    | LABUSER
(6 rows)
The internal slash option \dU lists the users with their associated groups:
SYSTEM(ADMIN)=> \dU
In this case the output is ordered by user:
List of Groups a User is a member
 USERNAME  | GROUPNAME
-----------+-----------
 DBUSER    | LUGRP
 DBUSER    | PUBLIC
 LABADMIN  | LAGRP
 LABADMIN  | PUBLIC
 LABUSER   | LUGRP
 LABUSER   | PUBLIC
(6 rows)
3 Creating a PureData System Database
The next step after the PureData System database users and user groups have been created is to create the lab database. You
will continue to use the ADMIN user to create the lab database and then assign the appropriate authorities and privileges to the users
created in the previous sections. The ADMIN user can also be used to create tables within the new database. However, the
ADMIN user should only be used to perform administrative tasks. After the appropriate privileges have been assigned by the
ADMIN user, the database can be handed over to the end users to start creating and populating the tables in the database.
3.1 Creating a Database and Transferring Ownership
The lab database will be named LABDB. It will initially be created by the ADMIN user, and then ownership of
the database will be transferred to the LABADMIN user. The LABADMIN user will have full administrative privileges against the
LABDB database. The basic syntax to create a database is:
CREATE DATABASE database_name;
1. As the PureData System database super-user, ADMIN, you will create the first database, LABDB, using the CREATE
DATABASE command:
SYSTEM(ADMIN)=> create database labdb;
The database LABDB has been created.
2. To view the existing databases, use the internal slash option \l:
SYSTEM(ADMIN)=> \l
This will return the following list:
List of databases
 DATABASE  | OWNER
-----------+-----------
 LABDB     | ADMIN
 MASTER_DB | ADMIN
 SYSTEM    | ADMIN
(3 rows)
The owner of the newly created LABDB database is the ADMIN user. The other databases are the default database SYSTEM
and the template database MASTER_DB.
3. At this point you could continue by creating new tables as the ADMIN user. However, the ADMIN user should only be
used to create users, groups, and databases, and to assign authorities and privileges. Therefore we will transfer
ownership of the LABDB database from the ADMIN user to the LABADMIN user we created previously. The ALTER
DATABASE command is used to transfer ownership of an existing database:
SYSTEM(ADMIN)=> alter database labdb owner to labadmin;
This is the only method to transfer ownership of a database to an existing user. The CREATE DATABASE command does
not include this option.
4. Check that the owner of the LABDB database is now the LABADMIN user:
SYSTEM(ADMIN)=> \l
List of databases
 DATABASE  | OWNER
-----------+-----------
 LABDB     | LABADMIN
 MASTER_DB | ADMIN
 SYSTEM    | ADMIN
(3 rows)
The owner of the LABDB database is now the LABADMIN user.
IBM Software
Information Management
IBM PureData System for Analytics
© Copyright IBM Corp. 2012. All rights reserved Page 9 of 18
The LABDB database is now created and the LABADMIN user has full privileges on the LABDB database. The user can create
and alter objects within the database. You could now continue and start creating tables as the LABADMIN user. However, we will
first finish assigning privileges to the two remaining database users that were created in the previous section.
3.2 Assigning Authorities and Privileges
One last task for the ADMIN user is to assign privileges to the two users we created earlier, LABUSER and DBUSER. The LABUSER
user will have full DML rights against all tables in the LABDB database, but it will not be allowed to create or alter tables within the
database. The DBUSER user will have more restricted access and will only be allowed to read data from the tables in
the database. The privileges will be controlled by a combination of grants at the group level and at the user level.
The LUGRP user group will be granted LIST and SELECT privileges against the database and the tables within the database, so every
member of LUGRP will have these privileges. The full data manipulation privileges will be granted individually to the LABUSER
user.
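The combination of group-level and user-level grants can be pictured with a small Python model. This is only a conceptual sketch and not how the appliance stores privileges: it assumes that privileges are additive, so a user's effective privileges are its own grants plus the grants of every group it belongs to. The dictionaries and function names are hypothetical; the users, groups, and privileges are the ones used in this lab, and the ownership rights of LABADMIN are not modeled.

```python
# Conceptual model only (assumption: privileges are additive across
# user-level and group-level grants). Names mirror this lab.
GROUP_MEMBERS = {
    "LAGRP": {"LABADMIN"},
    "LUGRP": {"LABUSER", "DBUSER"},
}

# Table privileges granted to groups in the LABDB database.
GROUP_GRANTS = {
    "LUGRP": {"LIST", "SELECT"},
}

# Table privileges granted to individual users in the LABDB database.
USER_GRANTS = {
    "LABUSER": {"SELECT", "INSERT", "UPDATE", "DELETE", "LIST", "TRUNCATE"},
}


def effective_privileges(user):
    """Union of the user's own grants and the grants of its groups."""
    privileges = set(USER_GRANTS.get(user, set()))
    for group, members in GROUP_MEMBERS.items():
        if user in members:
            privileges |= GROUP_GRANTS.get(group, set())
    return privileges


def can(user, privilege):
    """True if the user holds the privilege directly or through a group."""
    return privilege in effective_privileges(user)
```

In this model `can("DBUSER", "SELECT")` is true through the LUGRP group, while `can("DBUSER", "INSERT")` is false because INSERT is granted only to LABUSER individually.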
The GRANT command that is used to assign object privileges has the following syntax:
GRANT <object_privilege> ON <object> TO { PUBLIC | GROUP <group> | <username> }
1. As the PureData System database super-user, ADMIN, connect to the LABDB database using the internal slash option
\c:
SYSTEM(ADMIN)=> \c labdb admin password
You should see that you have successfully connected to the database:
SYSTEM(ADMIN)=> \c labdb admin password
You are now connected to database LABDB as user admin.
LABDB(ADMIN)=>
You will notice that the database name in the command prompt has changed from SYSTEM to LABDB.
2. First you will grant the LIST privilege on the LABDB database to the group LUGRP. This will allow members of LUGRP to
view and connect to the LABDB database:
LABDB(ADMIN)=> grant list on labdb to lugrp;
3. To list the object permissions for a group, use the internal slash option \dpg:
LABDB(ADMIN)=> \dpg lugrp
You will see the following output:
Group object permissions for group 'LUGRP'
Database Name  | Object Name | L S I U D T L A D B L G O E C R X A | D G U T E X Q Y V M I B R C S H F A L P N S
---------------+-------------+-------------------------------------+---------------------------------------------
GLOBAL         | LABDB       | X                                   |
(1 rows)
Object Privileges
(L)ist (S)elect (I)nsert (U)pdate (D)elete (T)runcate (L)ock
(A)lter (D)rop a(B)ort (L)oad (G)enstats Gr(O)om (E)xecute
Label-A(C)cess Label-(R)estrict Label-E(X)pand Execute-(A)s
The X in the L column of the list denotes that the LUGRP group has LIST object privileges on the LABDB global object.
4. With the current privileges, LABUSER and DBUSER can view and connect to the LABDB database
as members of the LUGRP group. However, these two users have no privileges to access any of the objects within the
database. So you will grant the LIST and SELECT privileges on the tables within the LABDB database to the members of
LUGRP:
LABDB(ADMIN)=> grant list, select on table to lugrp;
5. View the object permissions for the LUGRP group:
LABDB(ADMIN)=> \dpg lugrp
This will produce the following results:
Group object permissions for group 'LUGRP'
Database Name  | Object Name | L S I U D T L A D B L G O E C R X A | D G U T E X Q Y V M I B R C S H F A L P N S
---------------+-------------+-------------------------------------+---------------------------------------------
GLOBAL         | LABDB       | X                                   |
LABDB          | TABLE       | X X                                 |
(2 rows)
Object Privileges
(L)ist (S)elect (I)nsert (U)pdate (D)elete (T)runcate (L)ock
(A)lter (D)rop a(B)ort (L)oad (G)enstats Gr(O)om (E)xecute
Label-A(C)cess Label-(R)estrict Label-E(X)pand Execute-(A)s
The X in the L and S columns denotes that the LUGRP group has both LIST and SELECT privileges on all of the tables in the
LABDB database. (The LIST privilege allows users to view the tables using the internal slash option \d.)
6. The current privileges satisfy the requirements for the DBUSER user, which are access to the LABDB database and
SELECT access to all the tables in the database. But these privileges do not satisfy the requirements for the LABUSER
user, who is to have full DML access to all the tables in the database. So you will grant SELECT, INSERT,
UPDATE, DELETE, LIST, and TRUNCATE privileges on tables in the LABDB database to the LABUSER user:
LABDB(ADMIN)=> grant select, insert, update, delete, list, truncate on table to labuser;
7. To list the object permissions for a user, use the internal slash option \dpu <user name>:
LABDB(ADMIN)=> \dpu labuser
This will return the following:
User object permissions for user 'LABUSER'
Database Name  | Object Name | L S I U D T L A D B L G O E C R X A | D G U T E X Q Y V M I B R C S H F A L P N S
---------------+-------------+-------------------------------------+---------------------------------------------
LABDB          | TABLE       | X X X X X X                         |
(1 rows)
Object Privileges
(L)ist (S)elect (I)nsert (U)pdate (D)elete (T)runcate (L)ock
(A)lter (D)rop a(B)ort (L)oad (G)enstats Gr(O)om (E)xecute
Label-A(C)cess Label-(R)estrict Label-E(X)pand Execute-(A)s
The X under the L, S, I, U, D, T columns indicates that the LABUSER user has LIST, SELECT, INSERT, UPDATE, DELETE,
and TRUNCATE privileges on all of the tables in the LABDB database.
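The permission listings for groups and users can also be read mechanically. The following Python sketch is a hypothetical helper and not part of nzsql. It assumes that the object privilege column keeps its terminal alignment, with each X two characters apart and directly under its header letter, and it maps each position to the privilege names given in the Object Privileges legend.

```python
# Privilege names in header order, taken from the "Object Privileges" legend.
OBJECT_PRIVILEGES = [
    "List", "Select", "Insert", "Update", "Delete", "Truncate", "Lock",
    "Alter", "Drop", "Abort", "Load", "Genstats", "Groom", "Execute",
    "Label-Access", "Label-Restrict", "Label-Expand", "Execute-As",
]


def decode_object_privileges(column_text):
    """Return the privilege names flagged with X in one listing row.

    column_text is the object privilege column of a row, starting at the
    position of the first header letter (assumption: fixed-width output
    with a single space between flags).
    """
    granted = []
    for index, name in enumerate(OBJECT_PRIVILEGES):
        position = 2 * index  # header letters are separated by one space
        if position < len(column_text) and column_text[position] == "X":
            granted.append(name)
    return granted
```

For example, the LUGRP row for tables, `"X X"`, decodes to List and Select, and the LABUSER row, `"X X X X X X"`, decodes to the six privileges from List through Truncate.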
Now that all of the privileges have been set by the ADMIN user, the LABDB database can be handed over to the end users. The
end users can use the LABADMIN user to create objects, including tables, in the database and to maintain the database.
3.3 Creating PureData System Tables
The LABADMIN user will be used to create tables in the LABDB database instead of the ADMIN user. Two tables will be created
in this lab. The remaining tables for the LABDB database schema will be created in the Data Distribution lab. Data distribution is
an important aspect that should be considered when creating tables. This concept is not covered in this lab since it is discussed
separately in the Data Distribution presentation. The two tables that will be created are the REGION and NATION tables. These
two tables will be populated with data in the next section using the LABUSER user. Two methods will be used to create these
tables. The basic syntax to create a table is:
CREATE TABLE table_name
(
column_name type [ [ constraint_name ] column_constraint
[ constraint_characteristics ] ] [, ... ]
[ [ constraint_name ] table_constraint [ constraint_characteristics ] ] [, ... ]
)
[ DISTRIBUTE ON ( column [, ...] ) ]
1. Connect to the LABDB database as the LABADMIN user using the internal slash option \c:
LABDB(ADMIN)=> \c labdb labadmin password
You will see the following results:
LABDB(ADMIN)=> \c LABDB labadmin password
You are now connected to database LABDB as user labadmin.
LABDB(LABADMIN)=>
You will notice that the user name in the command prompt has changed from ADMIN to LABADMIN. Since you already had
an open session, you could use the internal slash option \c to connect to the database. However, if you had handed over
this environment to the end user, they would need to initiate a new connection using the nzsql interface.
To use the nzsql interface to connect to the LABDB database as the LABADMIN user you could use the following options:
nzsql -d labdb -u labadmin -pw password
or the short form, omitting the option names:
nzsql labdb labadmin password
or you could set the environment variables to the following values and issue nzsql without options:
NZ_DATABASE=LABDB
NZ_USER=LABADMIN
NZ_PASSWORD=password
In further labs we will often leave out the password parameter, since it has been set to the same value, “password”, for all
users.
2. Now you can create the first table in the LABDB database. The first table you will create is the REGION table with the
following columns and data types:
Column Name     Data Type
R_REGIONKEY     INTEGER
R_NAME          CHAR(25)
R_COMMENT       VARCHAR(152)
To create the table, execute the following command:
LABDB(LABADMIN)=> create table region (r_regionkey integer, r_name char(25),
r_comment varchar(152));
3. To list the tables in the LABDB database, use the \dt internal slash option:
LABDB(LABADMIN)=> \dt
This will show the table you just created:
List of relations
Name | Type | Owner
--------+-------+----------
REGION | TABLE | LABADMIN
(1 row)
4. To describe a table, use the internal slash option \d <table name>:
LABDB(LABADMIN)=> \d region
This shows a description of the created table:
Table "REGION"
Attribute | Type | Modifier | Default Value
-------------+------------------------+----------+---------------
R_REGIONKEY | INTEGER | |
R_NAME | CHARACTER(25) | |
R_COMMENT | CHARACTER VARYING(152) | |
Distributed on hash: "R_REGIONKEY"
The "Distributed on hash" line shows the distribution method used by the table. If you do not explicitly specify a distribution
method, a default distribution is used. In our system this is a hash distribution on the first column, R_REGIONKEY. This
concept is discussed in the Data Distribution presentation and lab.
5. Instead of typing out the entire create table statement at the nzsql command line, you can read and execute
commands from a file. You will use this method to create the NATION table in the LABDB database with the following
columns and data types:
Column Name     Data Type       Constraint
N_NATIONKEY     INTEGER         NOT NULL
N_NAME          CHAR(25)        NOT NULL
N_REGIONKEY     INTEGER         NOT NULL
N_COMMENT       VARCHAR(152)    ---
The full create table statement for the NATION table:
create table nation
(
n_nationkey integer not null,
n_name char(25) not null,
n_regionkey integer not null,
n_comment varchar(152)
)
distribute on random;
6. The statement can be found in the nation.ddl file under the /labs/databaseAdministration directory. To read and
execute commands from a file, use the \i <file> internal slash option:
LABDB(LABADMIN)=> \i /labs/databaseAdministration/nation.ddl
7. List all the tables in the LABDB database:
LABDB(LABADMIN)=> \dt
You will now see a list containing the two tables you created:
List of relations
Name | Type | Owner
--------+-------+----------
NATION | TABLE | LABADMIN
REGION | TABLE | LABADMIN
(2 rows)
8. Describe the NATION table:
LABDB(LABADMIN)=> \d nation
This will show the following results:
Table "NATION"
Attribute | Type | Modifier | Default Value
-------------+------------------------+----------+---------------
N_NATIONKEY | INTEGER | NOT NULL |
N_NAME | CHARACTER(25) | NOT NULL |
N_REGIONKEY | INTEGER | NOT NULL |
N_COMMENT | CHARACTER VARYING(152) | |
Distributed on random: (round-robin)
The "Distributed on random" line shows the distribution method used; in this case the rows in the NATION table are distributed in
round-robin fashion. This concept will be discussed separately in the Data Distribution presentation and lab.
It is possible to continue to use the LABADMIN user to perform DML queries, since it is the owner of the database and holds all
privileges on all of the objects in the database. However, the LABUSER and DBUSER users will be used to perform DML queries
against the tables in the database.
3.4 Using DML Queries
We will now use the LABUSER user to populate both the REGION and NATION tables with data. This user has full data
manipulation language (DML) privileges in the database, but no data definition language (DDL) privileges. Only the LABADMIN
user has full DDL privileges in the database. Later in this course, more efficient methods to populate tables with data are discussed.
The DBUSER user will also be used to read data from the tables, but it cannot insert data into the tables since it has limited DML
privileges in the database.
1. Connect to the LABDB database as the LABUSER user using the internal slash option \c:
LABDB(LABADMIN)=> \c labdb labuser password
You will see the following result:
LABDB(LABADMIN)=> \c labdb labuser password
You are now connected to database LABDB as user labuser.
LABDB(LABUSER)=>
You will notice that the user name in the command prompt has changed from LABADMIN to LABUSER.
2. First check which tables exist in the LABDB database using the \dt internal slash option:
LABDB(LABUSER)=> \dt
You should see the following list:
List of relations
Name | Type | Owner
--------+-------+----------
NATION | TABLE | LABADMIN
REGION | TABLE | LABADMIN
(2 rows)
Remember that the LABUSER user is a member of the LUGRP group, which was granted the LIST privilege on the tables in the
LABDB database. This is the reason why it can list and view the tables in the LABDB database. If it did not have this privilege,
it would not be able to see any of the tables in the LABDB database.
3. The LABUSER user was created to perform DML operations against the tables in the LABDB database. However, it was
restricted from performing DDL operations against the database. Let’s see what happens when you try to create a new table,
t1, with one column, C1, using the INTEGER data type:
LABDB(LABUSER)=> create table t1 (c1 integer);
You will see the following error message:
LABDB(LABUSER)=> create table t1 (c1 integer);
ERROR: CREATE TABLE: permission denied.
As expected, the create table statement is not allowed since the LABUSER user does not have the privilege to create tables in
the LABDB database.
4. Let’s continue by performing DML operations that the LABUSER user is allowed to perform against the tables in the
LABDB database. Insert a new row into the REGION table:
LABDB(LABUSER)=> insert into region values (1, 'NA', 'north america');
You will see the following results:
LABDB(LABUSER)=> insert into region values (1, 'NA', 'north america');
INSERT 0 1
As expected, this operation is successful. The output of the INSERT gives feedback about the number of successfully
inserted rows.
5. Issue a SELECT statement against the REGION table to check the new row you just added to the table:
LABDB(LABUSER)=> select * from region;
This should return the row you just inserted:
R_REGIONKEY | R_NAME | R_COMMENT
-------------+---------------------------+-----------------------------
1 | NA | north america
(1 rows)
6. Instead of typing DML statements at the nzsql command line, you can read and execute statements from a file. You
will use this method to add the following three rows to the REGION table:
R_REGIONKEY     R_NAME     R_COMMENT
2               SA         South America
3               EMEA       Europe, Middle East, Africa
4               AP         Asia Pacific
This is done with a SQL script containing the following commands:
insert into region values (2, 'sa', 'south america');
insert into region values (3, 'emea', 'europe, middle east, africa');
insert into region values (4, 'ap', 'asia pacific');
It can be found in the region.dml file under the /labs/databaseAdministration directory. To read and execute commands from
a file, use the \i <file> internal slash option:
LABDB(LABUSER)=> \i /labs/databaseAdministration/region.dml
You will see the following result. You can see from the output that the SQL script contained three INSERT statements.
LABDB(LABUSER)=> \i /labs/databaseAdministration/region.dml
INSERT 0 1
INSERT 0 1
INSERT 0 1
7. You will load data into the NATION table using an external table with the following command:
LABDB(LABUSER)=> insert into nation select * from external
'/labs/databaseAdministration/nation.del';
You will see that 14 rows are inserted into the table:
LABDB(LABUSER)=> insert into nation select * from external
'/labs/databaseAdministration/nation.del';
INSERT 0 14
Loading data into a table is covered in the Loading and Unloading Data presentation and lab.
8. Now you will switch over to the DBUSER user, who only has the SELECT privilege on the tables in the LABDB database. This
privilege is granted to this user because it is a member of the LUGRP group. Use the internal slash option \c <database
name> <user> <password> to connect to the LABDB database as the DBUSER user:
LABDB(LABUSER)=> \c labdb dbuser password
You will see the following:
LABDB(LABUSER)=> \c LABDB dbuser password
You are now connected to database LABDB as user dbuser.
LABDB(DBUSER)=>
You will notice that the user name in the command prompt changes from LABUSER to DBUSER.
9. Before trying to view rows from tables in the LABDB database, try to add a new row to the REGION table:
LABDB(DBUSER)=> insert into region values (5, 'NP', 'north pole');
You should see the following error message:
LABDB(DBUSER)=> insert into region values (5, 'NP', 'north pole');
ERROR: Permission denied on "REGION".
As expected, the INSERT statement is not allowed since the DBUSER user does not have the privilege to add rows to any tables in
the LABDB database.
10. Now select all rows from the REGION table:
LABDB(DBUSER)=> select * from region;
You should get the following output:
R_REGIONKEY | R_NAME | R_COMMENT
-------------+---------------------------+-----------------------------
1 | na | north america
2 | sa | south america
3 | emea | europe, middle east, africa
4 | ap | asia pacific
(4 rows)
11. Finally, let's run a slightly more complex query. We want to return all nation names in Asia Pacific, together with their
region name. To do this you need to execute a simple join using the NATION and REGION tables. The join key will be
the region key, and to restrict the results to the AP region you need to add a WHERE condition:
LABDB(DBUSER)=> select n_name, r_name from nation, region where
n_regionkey = r_regionkey and r_name = 'ap';
This should return the following results, containing all countries from the AP region:
N_NAME | R_NAME
---------------------------+---------------------------
macau | ap
new zealand | ap
australia | ap
japan | ap
hong kong | ap
(5 rows)
Congratulations, you have completed the lab. You have successfully created the lab database, two tables, and database users
and user groups with various privileges. You also ran a couple of simple queries. In further labs you will continue to use this
database by creating the full schema.
© Copyright IBM Corporation 2011
All Rights Reserved.
IBM Canada
8200 Warden Avenue
Markham, ON
L6G 1C7
Canada
IBM, the IBM logo, ibm.com and Tivoli are trademarks or registered
trademarks of International Business Machines Corporation in the
United States, other countries, or both. If these and other
IBM trademarked terms are marked on their first occurrence in this
information with a trademark symbol (® or ™), these symbols indicate
U.S. registered or common law trademarks owned by IBM at the time
this information was published. Such trademarks may also be
registered or common law trademarks in other countries. A current list
of IBM trademarks is available on the Web at “Copyright and
trademark information” at ibm.com/legal/copytrade.shtml
Other company, product and service names may be trademarks or
service marks of others.
References in this publication to IBM products and services do not
imply that IBM intends to make them available in all countries in which
IBM operates.
No part of this document may be reproduced or transmitted in any form
without written permission from IBM Corporation.
Product data has been reviewed for accuracy as of the date of initial
publication. Product data is subject to change without notice. Any
statements regarding IBM’s future direction and intent are subject to
change or withdrawal without notice, and represent goals and
objectives only.
THE INFORMATION PROVIDED IN THIS DOCUMENT IS
DISTRIBUTED “AS IS” WITHOUT ANY WARRANTY, EITHER
EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE OR NON-INFRINGEMENT.
IBM products are warranted according to the terms and conditions of
the agreements (e.g. IBM Customer Agreement, Statement of Limited
Warranty, International Program License Agreement, etc.) under which
they are provided.
Data Distribution
Hands-On Lab
IBM PureData System for Analytics … Powered by Netezza Technology
Table of Contents
1 Introduction
1.1 Objectives
2 Skew
2.1 Data Skew
2.2 Processing Skew
3 Co-Location
3.1 Investigation
3.2 Co-Located Joins
4 Schema Creation
4.1 Investigation
4.2 Solution
1 Introduction
IBM PureData System is a family of data-warehousing appliances that combine high performance with low administrative effort.
Due to the unique data-warehousing-centric architecture of PureData System, most performance tuning tasks are either not
necessary or automated. Unlike normal data warehousing solutions, no tablespaces need to be created or tuned, and there are
no indexes, buffer pools, or partitions.
Since PureData System is built on a massively parallel architecture that distributes data and workloads over a large number of
processing and data nodes, the single most important tuning factor is picking the right distribution key. The distribution key
governs which data rows of a table are distributed to which data slice, and it is very important to pick an optimal distribution key
to avoid data skew and processing skew and to make joins co-located whenever possible.
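The role of the distribution key can be illustrated with a short Python sketch. It is a conceptual model under stated assumptions, not the appliance's implementation: `zlib.crc32` stands in for the internal hash function, four data slices are assumed, and the two miniature tables and all function names are hypothetical. The point it shows is that when two tables are distributed on the same join key, matching rows always land on the same data slice, which is what makes a join co-located.

```python
import zlib

NUM_DATA_SLICES = 4  # assumption: four data slices


def data_slice_for(key):
    """Map a distribution-key value to a data slice (stand-in hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_DATA_SLICES


def distribute(rows, key_column):
    """Group rows by the data slice their distribution key hashes to."""
    slices = {slice_id: [] for slice_id in range(NUM_DATA_SLICES)}
    for row in rows:
        slices[data_slice_for(row[key_column])].append(row)
    return slices


# Hypothetical miniature tables, both distributed on the region key.
region = [{"r_regionkey": key} for key in range(1, 5)]
nation = [{"n_nationkey": key, "n_regionkey": key % 4 + 1} for key in range(14)]

region_slices = distribute(region, "r_regionkey")
nation_slices = distribute(nation, "n_regionkey")


def join_is_colocated():
    """True if every nation row shares a slice with its matching region row."""
    for slice_id, nation_rows in nation_slices.items():
        local_region_keys = {row["r_regionkey"] for row in region_slices[slice_id]}
        for row in nation_rows:
            if row["n_regionkey"] not in local_region_keys:
                return False
    return True
```

Because the same key value always hashes to the same slice, `join_is_colocated()` holds here regardless of which hash function is used. Distributing one of the tables on a different column would generally break this property.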
1.1 Objectives
In this lab we will cover a typical scenario in a POC or customer engagement which involves an existing data warehouse for
customer transactions.
Figure 1 LABDB database
Figure 1 shows a visualization of the tables in the data warehouse and the relationships between the tables. The warehouse
contains the customers of the company, their orders, and the line items that are part of the order. The warehouse also has a list
of suppliers, providing the parts that are part of the shipped line items.
For this lab we already have the DDLs for creation of the tables and load files containing the warehouse data. Both have already
been transformed in a format usable by PureData System. In this lab we will define the distribution keys for these tables.
In addition to the data and the DDLs, we have also received a couple of queries from the customer that are usually run against
the warehouse. These are important input as well for picking optimal distribution keys.
2 Skew
Tables in PureData System are distributed across data slices based on the distribution method and key. If a bad data distribution
method has been picked, it will result in skewed tables or processing skew. Data skew occurs when the distribution method puts
significantly more records of a table on one data slice than on other data slices. Apart from bad performance, this also means
that the PureData System can hold significantly less data than expected.
Processing skew occurs if the processing of queries mainly takes place on some data slices, for example because queries only
apply to data on those data slices. Both types of skew result in suboptimal performance, since in a parallel system the slowest
node defines the total execution time.
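A back-of-the-envelope Python sketch makes the effect of skew concrete. It rests on simplifying assumptions: scan time is taken to be proportional to the number of rows on a slice, all slices work in parallel, and the row counts are hypothetical.

```python
def skew_ratio(rows_per_slice):
    """Largest slice divided by the average slice size; 1.0 is perfectly even."""
    average = sum(rows_per_slice) / len(rows_per_slice)
    return max(rows_per_slice) / average


def relative_scan_time(rows_per_slice):
    """Slices work in parallel, so the fullest slice determines elapsed time."""
    return max(rows_per_slice)


# Hypothetical distributions of 6 million rows over four data slices.
even = [1_500_000, 1_500_000, 1_500_000, 1_500_000]
skewed = [3_000_000, 3_000_000, 0, 0]
```

With these numbers the skewed layout has a skew ratio of 2.0 and, under the stated assumptions, a full scan takes twice as long as on the even layout, because two of the four slices sit idle while the other two do all the work.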
2.1 Data Skew
The first table we will create is LINEITEM, the main fact table of the schema. It contains roughly 6 million rows.
1. Connect to the Netezza image using PuTTY. Log in to 192.168.239.2 as user “nz” with password “nz”. 192.168.239.2 is
the default IP address for a local VM, which is used for most bootcamp environments. In cases where the images
are hosted remotely, the instructors will provide the host IPs, which will vary between machines.
2. If you are continuing from the previous lab and are already connected to nzsql, quit the nzsql console with the \q
command.
3. To create the LINEITEM table, switch to the lab directory /labs/dataDistribution. To do this use the following command
(notice that you can use bash auto-completion by pressing the Tab key to complete folder and file names):
[nz@netezza ~]$ cd /labs/dataDistribution/
4. Create the LINEITEM table by using the following script. Since the fact table is quite large, this can take a couple of
minutes.
[nz@netezza dataDistribution]$ ./create_lineitem_1.sh
You should see a result similar to the following. The error message at the beginning is expected since the script tries to
clean up existing LINEITEM tables:
[nz@netezza dataDistribution]$ ./create_lineitem_1.sh
ERROR: Table 'LINEITEM' does not exist
CREATE TABLE
Load session of table 'LINEITEM' completed successfully
5. Now let's have a look at the created table. Open the nzsql console by entering the command nzsql:
[nz@netezza dataDistribution]$ nzsql
Welcome to nzsql, the Netezza SQL interactive terminal.
Type: \h for help with SQL commands
      \? for help on internal slash commands
      \g or terminate with semicolon to execute query
      \q to quit
SYSTEM(ADMIN)=>
6. Connect to the database LABDB as user LABADMIN by typing the following command:
SYSTEM(ADMIN)=> \c LABDB LABADMIN
You should now be connected to the LABDB database as the LABADMIN user:
SYSTEM(ADMIN)=> \c LABDB LABADMIN
You are now connected to database LABDB as user LABADMIN.
LABDB(LABADMIN)=>
7. Let’s have a look at the table we just created. First we want to see a description of its columns and distribution key. Use
the nzsql describe command \d LINEITEM to get a description of the table. This should have the following result:
LABDB(LABADMIN)=> \d LINEITEM
Table "LINEITEM"
Attribute | Type | Modifier | Default Value
-----------------+-----------------------+----------+---------------
L_ORDERKEY | INTEGER | NOT NULL |
L_PARTKEY | INTEGER | NOT NULL |
L_SUPPKEY | INTEGER | NOT NULL |
L_LINENUMBER | INTEGER | NOT NULL |
L_QUANTITY | NUMERIC(15,2) | NOT NULL |
L_EXTENDEDPRICE | NUMERIC(15,2) | NOT NULL |
L_DISCOUNT | NUMERIC(15,2) | NOT NULL |
L_TAX | NUMERIC(15,2) | NOT NULL |
L_RETURNFLAG | CHARACTER(1) | NOT NULL |
L_LINESTATUS | CHARACTER(1) | NOT NULL |
L_SHIPDATE | DATE | NOT NULL |
L_COMMITDATE | DATE | NOT NULL |
L_RECEIPTDATE | DATE | NOT NULL |
L_SHIPINSTRUCT | CHARACTER(25) | NOT NULL |
L_SHIPMODE | CHARACTER(10) | NOT NULL |
L_COMMENT | CHARACTER VARYING(44) | NOT NULL |
Distributed on hash: "L_LINESTATUS"
We can see that the LINEITEM table has 16 columns with different data types. Some of the columns have a “key” suffix and
substrings containing the names of other tables, and are most likely foreign keys of dimension tables. The distribution key is
L_LINESTATUS, which is of a CHAR(1) data type.
8. Now let’s have a look at the data in the table. To return a limited number of rows you can use the limit keyword in your
select queries. Execute the following select command to return 10 rows of the LINEITEM table. For readability we only
select a few columns, including the order key, the ship date, and the L_LINESTATUS distribution key:
LABDB(LABADMIN)=> SELECT L_ORDERKEY, L_QUANTITY, L_SHIPDATE, L_LINESTATUS FROM
LINEITEM LIMIT 10;
You will see the following results:
LABDB(LABADMIN)=> SELECT L_ORDERKEY, L_QUANTITY, L_SHIPDATE, L_LINESTATUS FROM
LINEITEM LIMIT 10;
L_ORDERKEY | L_QUANTITY | L_SHIPDATE | L_LINESTATUS
------------+------------+------------+--------------
2 | 38.00 | 1997-01-28 | O
6 | 37.00 | 1992-04-27 | F
34 | 13.00 | 1998-10-23 | O
34 | 22.00 | 1998-10-09 | O
34 | 6.00 | 1998-10-30 | O
38 | 44.00 | 1996-09-29 | O
66 | 31.00 | 1994-02-19 | F
66 | 41.00 | 1994-02-21 | F
70 | 8.00 | 1994-01-12 | F
70 | 13.00 | 1994-03-03 | F
(10 rows)
From this limited sample we cannot make any definite judgments, but we can make a couple of assumptions. While the
L_ORDERKEY column is not unique, it seems to have a lot of distinct values. The L_SHIPDATE column also appears to
have a lot of distinct shipping date values. Our current distribution key L_LINESTATUS, on the other hand, shows only two
values, which may make it a bad distribution key. You may see different rows: since a database table is
an unordered set, your sample could, for example, contain only “O” or only “F” values in the L_LINESTATUS
column.
9. We will now verify the number of distinct values in the L_LINESTATUS column with a “SELECT DISTINCT …” call. To
return a list of all values that are in the L_LINESTATUS column, execute the following SQL command:
LABDB(LABADMIN)=> SELECT DISTINCT L_LINESTATUS FROM LINEITEM;
You should see the following results:
LABDB(LABADMIN)=> SELECT DISTINCT L_LINESTATUS FROM LINEITEM;
L_LINESTATUS
--------------
O
F
(2 rows)
We can see that the L_LINESTATUS column only contains two distinct values. As a distribution key, this will result in a table
that is only distributed to two of the available dataslices.
10. We verify this by executing the following SQL call, which will return a list of all dataslices that contain rows of the
LINEITEM table, and the corresponding number of rows stored in them:
LABDB(LABADMIN)=> SELECT DATASLICEID, COUNT(*) FROM LINEITEM GROUP BY DATASLICEID;
This will result in a similar output to the following:
Every PureData System table has a hidden column DATASLICEID, which contains the id of the dataslice the selected row is
stored in. By executing a SQL query that does a GROUP BY on this column and counts the number of rows for each
dataslice id, data skew can be detected.
In this case the table has been, as we already expected, distributed to only two of the available four dataslices. This means
that we only use half of the available space, and it will also result in low performance during most query executions. In
general, a good distribution key should have a large number of distinct values with a good value distribution. Columns with a
low number of distinct values, especially boolean columns, should not be considered as distribution keys.
2.2 Processing Skew
Even in tables that are distributed evenly across dataslices, data processing for queries can be concentrated or skewed to a
limited number of dataslices. This can happen because PureData System is able to ignore data extents (sets of data pages) that
do not fit to a given WHERE condition. We will cover the mechanism behind that in the zone map chapter.
1. First we will pick a new distribution key. As we have seen it should have a big number of distinct values. One of the
columns that did fit this description was the L_SHIPDATE column. Check the number of distinct values in the shipdate
column with the COUNT(DISTINCT … ) statement:
You will get a result similar to the following:
The column has over 2500 distinct values and has therefore more than enough distinct values to guarantee a good data
distribution on 4 dataslices. Of course this is under the assumption that the value distribution is good as well.
2. Now let’s reload the LINEITEM table with the new distribution key. For this we need to change the SQL of the loading
script we executed at the beginning of the lab. Exit the nzsql console by entering: q
3. You should now be in the lab directory /labs/dataDistribution. The table creation statement is situated in the lineitem.sql
file. We will need to make changes to the file with a text editor. Open the file with the default linux text editor vi. To do
this enter the following command:
vi lineitem.sql
4. The vi editor has two modes, a command mode used to save files, quit the editor etc. and an insert mode. Initially you
will be in the command mode. To change the file you need to switch into the insert mode by pressing “i”. The editor will
show an – INSERT – at the bottom of the screen.
5. You can now use the cursor keys to navigate to the DISTRIBUTE ON clause at the bottom of the create command.
Change the distribution key to “l_shipdate”. The editor should now look like the following:
LABDB(LABADMIN)=> SELECT COUNT(DISTINCT L_SHIPDATE) FROM LINEITEM;
COUNT
-------
2526
(1 row)
LABDB(LABADMIN)=> SELECT DATASLICEID, COUNT(*) FROM LINEITEM GROUP BY DATASLICEID;
DATASLICEID | COUNT
-------------+---------
1 | 3004998
4 | 2996217
(2 rows)
LABDB(LABADMIN)=> SELECT COUNT(DISTINCT L_SHIPDATE) FROM LINEITEM;
IBM Software
Information Management
IBM PureData System for Analytics
© Copyright IBM Corp. 2012. All rights reserved Page 8 of 19
create table lineitem
(
l_orderkey integer not null ,
l_partkey integer not null ,
l_suppkey integer not null ,
l_linenumber integer not null ,
l_quantity decimal(15,2) not null ,
l_extendedprice decimal(15,2) not null ,
l_discount decimal(15,2) not null ,
l_tax decimal(15,2) not null ,
l_returnflag char(1) not null ,
l_linestatus char(1) not null ,
l_shipdate date not null ,
l_commitdate date not null ,
l_receiptdate date not null ,
l_shipinstruct char(25) not null ,
l_shipmode char(10) not null ,
l_comment varchar(44) not null
)
DISTRIBUTE ON (l_shipdate);
~
~
~
-- INSERT --

6. We will now save our changes. Press “Esc” to switch back into command mode. You should see that the “-- INSERT --” string at the bottom of the screen vanishes. Enter :wq! and press Enter to write the file and quit the editor without any prompts. If you made a mistake while editing and would like to undo it, press “Esc”, enter :q! and go back to step 3.

7. Now repeat steps 3-5 of section 2.1 Data Skew:
a. Recreate and load the LINEITEM table with your new distribution key by executing the ./create_lineitem_1.sh command
b. Use the nzsql command to enter the command console
c. Switch to the LABDB database by using the \c LABDB LABADMIN command.

8. Now we verify that the new distribution key results in a good data distribution. For this we repeat the query that returns the number of rows for each dataslice ID of the LINEITEM table. Execute the following command:

LABDB(LABADMIN)=> SELECT DATASLICEID, COUNT(*) FROM LINEITEM GROUP BY DATASLICEID;

Your results should look similar to the following:
LABDB(LABADMIN)=> SELECT DATASLICEID, COUNT(*) FROM LINEITEM GROUP BY DATASLICEID;
 DATASLICEID |  COUNT
-------------+---------
           2 | 1497649
           3 | 1501760
           4 | 1501816
           1 | 1499990
(4 rows)

We can see that the data distribution is much better now. All four dataslices hold a roughly equal number of rows.

9. Now that we have a database table with a good data distribution, let’s look at a couple of queries we have received from the customer. The following query is executed regularly by the customer. It returns the average quantity shipped on a given day, grouped by the shipping mode. Execute the following query:

LABDB(LABADMIN)=> SELECT AVG(L_QUANTITY) AS AVG_Q, L_SHIPMODE FROM LINEITEM WHERE L_SHIPDATE = '1996-03-29' GROUP BY L_SHIPMODE;

Your results should look like the following:

LABDB(LABADMIN)=> SELECT AVG(L_QUANTITY) AS AVG_Q, L_SHIPMODE FROM LINEITEM WHERE L_SHIPDATE = '1996-03-29' GROUP BY L_SHIPMODE;
   AVG_Q   | L_SHIPMODE
-----------+------------
 26.045455 | MAIL
 27.147826 | TRUCK
 26.038567 | FOB
 24.780282 | RAIL
 25.708556 | AIR
 24.494186 | REG AIR
 25.562500 | SHIP
(7 rows)

This query takes all rows from 29 March 1996 and computes the average value of the L_QUANTITY column for each L_SHIPMODE value. It is a typical warehousing query insofar as a date column is used to restrict the row set that is taken as input for the computation.

In this example most rows of the LINEITEM table are filtered out. Only rows that have the specified date are used as input for the AVG aggregation.

10. Execute the following SQL statement to see on which dataslice we can find the rows from 29 March 1996:

LABDB(LABADMIN)=> SELECT COUNT(*), DATASLICEID FROM LINEITEM WHERE L_SHIPDATE = '1996-03-29' GROUP BY DATASLICEID;

You should see the following:
LABDB(LABADMIN)=> SELECT COUNT(*), DATASLICEID FROM LINEITEM WHERE L_SHIPDATE = '1996-03-29' GROUP BY DATASLICEID;
 COUNT | DATASLICEID
-------+-------------
  2501 |           2
(1 row)

Since we used the shipping date column as the distribution key, all rows for a specific date are stored on one dataslice and therefore on one SPU. For our previous query this means that the rows on all other dataslices are dismissed and the computation takes place on only one dataslice and SPU. This is known as processing skew. While this one SPU is working, the other SPUs are idle.

Columns that are often used in WHERE conditions shouldn’t be used as distribution keys, since this can easily result in processing skew. In warehousing environments this is especially true for date columns.

Good distribution keys are key columns; they have lots of distinct values and very rarely result in processing skew. In our example we have a couple of candidate distribution keys to choose from: L_SUPPKEY, L_ORDERKEY, L_PARTKEY. All of them have a large number of distinct values.

3 Co-Location

The most basic warehouse schema consists of a fact table containing a list of all business transactions and a set of dimension tables that contain the different actors, objects, locations and time points that have taken part in these transactions. This means that most queries will not only access one database table but will require joins between tables.

In a PureData System database, tables are distributed over a potentially large number of dataslices on different SPUs. This means that during a join of two tables there are two possibilities:
• Rows of the two tables that belong together are situated on the same dataslice, which means that they are co-located and can be joined locally.
• Rows that belong together are situated on different dataslices, which means that the tables need to be redistributed.

3.1 Investigation

Co-location has big performance advantages. In the following section we will demonstrate that by introducing a second table, ORDERS.

1. Switch to the Linux command line if you are in the nzsql console. Do this with the \q command.
2. Switch to the data distribution lab directory with the command cd /labs/dataDistribution
3. Create and load the ORDERS table by executing the following command: ./create_orders_1.sh
4. Enter the nzsql console with the nzsql labdb labadmin command
5. Let’s take a look at the ORDERS table with the \d orders command. You should see the following results:
LABDB(LABADMIN)=> \d orders
                        Table "ORDERS"
    Attribute    |         Type          | Modifier | Default Value
-----------------+-----------------------+----------+---------------
 O_ORDERKEY      | INTEGER               | NOT NULL |
 O_CUSTKEY       | INTEGER               | NOT NULL |
 O_ORDERSTATUS   | CHARACTER(1)          | NOT NULL |
 O_TOTALPRICE    | NUMERIC(15,2)         | NOT NULL |
 O_ORDERDATE     | DATE                  | NOT NULL |
 O_ORDERPRIORITY | CHARACTER(15)         | NOT NULL |
 O_CLERK         | CHARACTER(15)         | NOT NULL |
 O_SHIPPRIORITY  | INTEGER               | NOT NULL |
 O_COMMENT       | CHARACTER VARYING(79) | NOT NULL |
Distributed on random: (round-robin)

The ORDERS table has a key column O_ORDERKEY that is most likely the primary key of the table. It contains information on the order value, priority and date and has been distributed randomly. This means that PureData System doesn’t use a hash-based algorithm to distribute the data. Instead, rows are distributed randomly across the available dataslices.

You can check the data distribution of the table using the methods we used before for the LINEITEM table. The data distribution will be perfect. There will also not be any processing skew for queries on the single table, since in a random distribution there can be no correlation between any WHERE condition and the distribution key.

6. We have received another typical query from our customer. It returns the average total price and item quantity of all orders, grouped by the shipping priority. This query has to join the LINEITEM and ORDERS tables to get the total order cost from the ORDERS table and the quantity for each shipped item from the LINEITEM table. The tables are joined with an inner join on the L_ORDERKEY column. Execute the following query and note the approximate execution time:

LABDB(LABADMIN)=> SELECT AVG(O.O_TOTALPRICE) AS PRICE, AVG(L.L_QUANTITY) AS QUANTITY, O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;

You should see the following results:

LABDB(LABADMIN)=> SELECT AVG(O.O_TOTALPRICE) AS PRICE, AVG(L.L_QUANTITY) AS QUANTITY, O.O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;
     PRICE     | QUANTITY  | O_ORDERPRIORITY
---------------+-----------+-----------------
 189285.029553 | 25.526186 | 2-HIGH
 189219.594349 | 25.532474 | 5-LOW
 189093.608965 | 25.513563 | 1-URGENT
 189026.093657 | 25.494518 | 3-MEDIUM
 188546.457203 | 25.472923 | 4-NOT SPECIFIED
(5 rows)

Notice that the query takes about a minute to complete on our machine. The actual execution time on your machine will be different.

7. Remember that the ORDERS table was distributed randomly and the LINEITEM table is still distributed by the L_SHIPDATE column. The join, on the other hand, takes place on the L_ORDERKEY and O_ORDERKEY columns. We will now have a quick look at what is happening inside PureData System in this scenario. To do this we use the PureData System EXPLAIN function. This will be covered more thoroughly in the Optimization lab.
Execute the following command:

LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT AVG(O.O_TOTALPRICE) AS PRICE, AVG(L.L_QUANTITY) AS QUANTITY, O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;

You will get a long output. Scroll up until you see your command in the text window. The start of the EXPLAIN output should look like the following:

EXPLAIN VERBOSE SELECT AVG(O.O_TOTALPRICE) AS PRICE,AVG(L.L_QUANTITY) AS QUANTITY, O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan table "ORDERS" as "O" {}]
-- Estimated Rows = 1500000, Width = 27, Cost = 0.0 .. 578.6, Conf = 100.0
Projections:
1:O.O_TOTALPRICE 2:O.O_ORDERPRIORITY 3:O.O_ORDERKEY
[SPU Distribute on {(O.O_ORDERKEY)}]
[HashIt for Join]
Node 2.
[SPU Sequential Scan table "LINEITEM" as "L" {(L.L_SHIPDATE)}]
-- Estimated Rows = 6001215, Width = 12, Cost = 0.0 .. 2417.5, Conf = 100.0
Projections:
1:L.L_QUANTITY 2:L.L_ORDERKEY
[SPU Distribute on {(L.L_ORDERKEY)}]
...

The EXPLAIN functionality will be covered in detail in a following chapter, but it is easy to see what is happening here. The [SPU Distribute on ...] entries in Node 1 and Node 2 show that the system is redistributing both the ORDERS and LINEITEM tables. This is very bad because both tables are of significant size, so there is considerable overhead. The redistribution occurs because neither table is distributed on the join column. In the next section we will fix this.

3.2 Co-Located Joins

In the last section we have seen that a query using joins can result in costly data redistribution during join execution when the joined tables are not distributed on the join key. In this section we will reload the tables based on the mutual join key to enhance performance during joins.

1. Exit the nzsql console with the \q command.
2. Switch to the dataDistribution directory with the cd /labs/dataDistribution command
3. Change the distribution key in the lineitem.sql file to L_ORDERKEY:
a. Open the file with the vi editor by executing the command: vi lineitem.sql
b. Switch to INSERT mode by pressing “i”
c. Navigate with the cursor keys to the DISTRIBUTE ON clause and change it to DISTRIBUTE ON (L_ORDERKEY)
d. Exit the INSERT mode by pressing ESC
e. Enter :wq! in the command line of the vi editor and press Enter. Before pressing Enter your screen should look like the following:

create table lineitem
(
l_orderkey integer not null ,
l_partkey integer not null ,
l_suppkey integer not null ,
l_linenumber integer not null ,
l_quantity decimal(15,2) not null ,
l_extendedprice decimal(15,2) not null ,
l_discount decimal(15,2) not null ,
l_tax decimal(15,2) not null ,
l_returnflag char(1) not null ,
l_linestatus char(1) not null ,
l_shipdate date not null ,
l_commitdate date not null ,
l_receiptdate date not null ,
l_shipinstruct char(25) not null ,
l_shipmode char(10) not null ,
l_comment varchar(44) not null
)
DISTRIBUTE ON (l_orderkey);
~
:wq!

4. Change the distribution key in the orders.sql file to O_ORDERKEY:
a. Open the file with the vi editor by executing the command: vi orders.sql
b. Switch to INSERT mode by pressing “i”
c. Navigate with the cursor keys to the DISTRIBUTE ON clause and change it to DISTRIBUTE ON (O_ORDERKEY)
d. Exit the INSERT mode by pressing ESC
e. Enter :wq! in the command line of the vi editor and press Enter. Before pressing Enter your screen should look like the following:

create table orders
(
o_orderkey integer not null ,
o_custkey integer not null ,
o_orderstatus char(1) not null ,
o_totalprice decimal(15,2) not null ,
o_orderdate date not null ,
o_orderpriority char(15) not null ,
o_clerk char(15) not null ,
o_shippriority integer not null ,
o_comment varchar(79) not null
)
DISTRIBUTE ON (o_orderkey);
~
:wq!

5. Recreate and load the LINEITEM table with the distribution key L_ORDERKEY by executing the command: ./create_lineitem_1.sh
6. Recreate and load the ORDERS table with the distribution key O_ORDERKEY by executing the command: ./create_orders_1.sh
7. Enter the nzsql console by executing the following command: nzsql labdb labadmin
8. Repeat the EXPLAIN of our join query from the previous section by executing the following command:

LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT AVG(O.O_TOTALPRICE) AS PRICE, AVG(L.L_QUANTITY) AS QUANTITY, O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;

The query itself has not been changed. The only changes are in the distribution keys of the involved tables. You will again see a long output. Scroll up to the start of the output, directly after your query. You should see output similar to the following:

EXPLAIN VERBOSE SELECT AVG(O.O_TOTALPRICE) AS PRICE, AVG(L.L_QUANTITY) AS QUANTITY, O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
-- Estimated Rows = 1500000, Width = 27, Cost = 0.0 .. 578.6, Conf = 100.0
Projections:
1:O.O_TOTALPRICE 2:O.O_ORDERPRIORITY 3:O.O_ORDERKEY
[HashIt for Join]
Node 2.
[SPU Sequential Scan table "LINEITEM" as "L" {(L.L_ORDERKEY)}]
-- Estimated Rows = 6001215, Width = 12, Cost = 0.0 .. 2417.5, Conf = 100.0
Projections:
1:L.L_QUANTITY 2:L.L_ORDERKEY
...

Again we do not want to make a complete analysis of the EXPLAIN output. We will cover that in more detail in later chapters. But if you compare the output with the output of the last section, you will see that the [SPU Distribute on {(O.O_ORDERKEY)}] and [SPU Distribute on {(L.L_ORDERKEY)}] steps have vanished. The reason is that the join is now co-located, because both tables are distributed on the join key.

You may see a distribution node further below for the execution of the GROUP BY clause, but it is forecast to distribute only a hundred rows, which has no negative performance influence.

9. Finally execute the join query again:

LABDB(LABADMIN)=> SELECT AVG(O.O_TOTALPRICE) AS PRICE, AVG(L.L_QUANTITY) AS QUANTITY, O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;

The query should return the same results as in the previous section but run much faster, even in the VMware environment. In a real PureData System appliance with 6, 12 or more SPUs the difference would be much more significant.

You have now loaded the LINEITEM and ORDERS tables into your PureData System appliance using the distribution key that is optimal for these tables in most situations:
a. Both tables are distributed evenly across dataslices, so there is no data skew.
b. The distribution key is highly unlikely to result in processing skew, since most WHERE conditions will restrict a key column evenly.
c. Since ORDERS is a parent table of LINEITEM, with a foreign key relationship between them, most queries joining them will use the join key. These queries will be co-located.

Finally, we will pick the distribution keys of the full schema.
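
The three properties listed above can be illustrated with a small simulation. The following Python sketch is not part of the lab and does not talk to the appliance. Its slice_of function is a toy stand-in for the real hash function, and the 396 synthetic LINEITEM rows (order keys, ship dates and status flags) are made up for illustration. It assigns the rows to four dataslices under three candidate distribution keys and reports data skew, the dataslices hit by a single-date filter, and the share of rows that sit on the same dataslice as their ORDERS row when ORDERS is distributed on the order key.

```python
from collections import Counter
from datetime import date, timedelta

N_SLICES = 4  # the lab VM has four dataslices


def slice_of(key):
    """Toy stand-in for the appliance's hash function (an assumption)."""
    if isinstance(key, date):
        number = key.toordinal()
    elif isinstance(key, str):
        number = sum(key.encode())
    else:
        number = int(key)
    return number % N_SLICES + 1


# Synthetic rows: (orderkey, shipdate, linestatus), one line per order.
BASE = date(1996, 1, 1)
LINEITEM = [
    (k, BASE + timedelta(days=k % 99), "O" if k % 2 else "F")
    for k in range(1, 397)
]
COLUMN = {"l_orderkey": 0, "l_shipdate": 1, "l_linestatus": 2}


def rows_per_slice(dist_key, rows=LINEITEM):
    """Count rows per dataslice when distributing on dist_key (data skew)."""
    col = COLUMN[dist_key]
    return Counter(slice_of(row[col]) for row in rows)


def slices_hit_by_date(dist_key, day):
    """Dataslices holding rows for one ship date (processing skew)."""
    matching = [row for row in LINEITEM if row[1] == day]
    return sorted(rows_per_slice(dist_key, matching))


def colocated_fraction(dist_key):
    """Share of rows stored on the dataslice of their ORDERS row,
    assuming ORDERS is distributed on its order key."""
    col = COLUMN[dist_key]
    local = sum(slice_of(row[col]) == slice_of(row[0]) for row in LINEITEM)
    return local / len(LINEITEM)


if __name__ == "__main__":
    day = BASE + timedelta(days=29)
    for key in COLUMN:
        print(key, dict(rows_per_slice(key)),
              slices_hit_by_date(key, day), colocated_fraction(key))
```

With these made-up rows, distributing on the status flag uses only two dataslices, distributing on the ship date sends a single-date query to one dataslice, and only the order key keeps every row next to its ORDERS row.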
4 Schema Creation
Now that we have created the ORDERS and LINEITEM tables we need to pick the distribution keys for the remaining tables as
well.
4.1 Investigation
Figure 2 LABDB database
You will notice that it is much harder to find optimal distribution keys in a more complicated schema like this. In many situations
you will be forced to choose between enabling co-located joins between one set of tables or another one.
The following provides some details on the remaining tables:
Table       Number of Rows   Primary Key
REGION      5                R_REGIONKEY
NATION      25               N_NATIONKEY
CUSTOMER    150000           C_CUSTKEY
ORDERS      1500000          O_ORDERKEY
SUPPLIER    10000            S_SUPPKEY
PART        200000           P_PARTKEY
PARTSUPP    800000           ---
LINEITEM    6000000          ---
And on the involved relationships:
Parent Table   Child Table   Parent Table Join Column   Child Table Join Column
REGION         NATION        R_REGIONKEY                N_REGIONKEY
NATION         CUSTOMER      N_NATIONKEY                C_NATIONKEY
NATION         SUPPLIER      N_NATIONKEY                S_NATIONKEY
CUSTOMER       ORDERS        C_CUSTKEY                  O_CUSTKEY
ORDERS         LINEITEM      O_ORDERKEY                 L_ORDERKEY
SUPPLIER       LINEITEM      S_SUPPKEY                  L_SUPPKEY
SUPPLIER       PARTSUPP      S_SUPPKEY                  PS_SUPPKEY
PART           LINEITEM      P_PARTKEY                  L_PARTKEY
PART           PARTSUPP      P_PARTKEY                  PS_PARTKEY
Given all that you heard in the presentation and lab, try to fill in the distribution keys in the chart below. Let’s assume that we will
not change the distribution keys for LINEITEM and ORDERS anymore.
Table       Distribution Key (up to 4 columns) or Random
REGION
NATION
CUSTOMER
SUPPLIER
PART
PARTSUPP
ORDERS      O_ORDERKEY
LINEITEM    L_ORDERKEY
4.2 Solution
It is important to note that there is no single optimal way to pick distribution keys. It always depends on the queries that run against the database. Without these queries it is only possible to follow some general rules:
• Co-location between big tables (especially if a fact table is involved) is more important than between small tables.
• Very small tables can be broadcast by the system with little performance penalty. If one table of a join is broadcast, the other does not need to be redistributed.
• If you suspect that there will be lots of queries joining two big tables but you cannot distribute both of them on the expected join key, distributing one table on the join key is better than nothing, since it leads to a single redistribute instead of a double redistribute.
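
To get a feeling for these rules, the data movement of each join strategy can be estimated with a deliberately crude model. The Python sketch below is an illustration only and not how the optimizer computes its costs: it assumes four dataslices (as in the lab VM), assumes rows are hashed uniformly so that a redistributed row changes dataslice with probability 3/4, counts only rows moved, and ignores row width. The function names are hypothetical. The row counts are taken from the table in section 4.1.

```python
def redistributed_rows(rows, n_slices):
    """Expected rows that change dataslice when a table is re-hashed."""
    return rows * (n_slices - 1) // n_slices


def broadcast_rows(rows, n_slices):
    """Rows sent when every other dataslice receives a copy of each row."""
    return rows * (n_slices - 1)


def join_plans(left_rows, right_rows, n_slices=4):
    """Rows moved by each strategy for joining two tables (crude model)."""
    left = redistributed_rows(left_rows, n_slices)
    right = redistributed_rows(right_rows, n_slices)
    return {
        "double redistribute": left + right,
        "single redistribute (left)": left,
        "single redistribute (right)": right,
        "broadcast smaller table": broadcast_rows(
            min(left_rows, right_rows), n_slices),
        "co-located": 0,
    }


if __name__ == "__main__":
    # CUSTOMER (150000 rows) joined with NATION (25 rows)
    print(join_plans(150000, 25))
    # CUSTOMER (150000 rows) joined with ORDERS (1500000 rows)
    print(join_plans(150000, 1500000))
```

Under these assumptions, broadcasting NATION moves 75 rows, while redistributing both CUSTOMER and NATION moves more than 100000. For CUSTOMER and ORDERS, redistributing only ORDERS moves fewer rows than redistributing both tables.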
If we break down the problem, we can see that PART and PARTSUPP are the two biggest of the remaining tables. Based on the available customer queries we have already distributed the LINEITEM table on the order key, so it makes sense to distribute PART and PARTSUPP on their join keys.

CUSTOMER is big as well and has two relationships. The first relationship is with the very small NATION table, which is easily broadcast by the system. The second relationship is with the ORDERS table, which is big as well but already distributed by the order key. As mentioned above, a single redistribute is better than a double redistribute. Therefore it makes sense to distribute the CUSTOMER table on the customer key, which is also the join key of this relationship.

The situation is very similar for the SUPPLIER table. It has two very large child tables, PARTSUPP and LINEITEM, which are both related to it through the supplier key, so it should be distributed on this key.

NATION and REGION are both very small and will most likely be broadcast by the optimizer. You could distribute those tables randomly, on their primary keys, or on their join keys. In this case we have decided to distribute both on their primary keys, but there is no definite right or wrong approach. One possible solution for the distribution keys is the following:

Table       Distribution Key (up to 4 columns) or Random
REGION      R_REGIONKEY
NATION      N_NATIONKEY
CUSTOMER    C_CUSTKEY
SUPPLIER    S_SUPPKEY
PART        P_PARTKEY
PARTSUPP    PS_PARTKEY
ORDERS      O_ORDERKEY
LINEITEM    L_ORDERKEY
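
The effect of this solution on each relationship can be checked mechanically. The following Python sketch is an illustration, not a tool from the lab: it encodes the distribution keys from the table above and the relationships from section 4.1, then classifies each join by how many of the two tables are already distributed on their join column. The classify function is a hypothetical helper. It does not model the optimizer's option to broadcast a small table such as NATION or REGION instead of redistributing.

```python
from collections import Counter

# Distribution keys of the proposed solution
DIST_KEY = {
    "REGION": "R_REGIONKEY", "NATION": "N_NATIONKEY",
    "CUSTOMER": "C_CUSTKEY", "SUPPLIER": "S_SUPPKEY",
    "PART": "P_PARTKEY", "PARTSUPP": "PS_PARTKEY",
    "ORDERS": "O_ORDERKEY", "LINEITEM": "L_ORDERKEY",
}

# (parent table, child table, parent join column, child join column)
RELATIONSHIPS = [
    ("REGION", "NATION", "R_REGIONKEY", "N_REGIONKEY"),
    ("NATION", "CUSTOMER", "N_NATIONKEY", "C_NATIONKEY"),
    ("NATION", "SUPPLIER", "N_NATIONKEY", "S_NATIONKEY"),
    ("CUSTOMER", "ORDERS", "C_CUSTKEY", "O_CUSTKEY"),
    ("ORDERS", "LINEITEM", "O_ORDERKEY", "L_ORDERKEY"),
    ("SUPPLIER", "LINEITEM", "S_SUPPKEY", "L_SUPPKEY"),
    ("SUPPLIER", "PARTSUPP", "S_SUPPKEY", "PS_SUPPKEY"),
    ("PART", "LINEITEM", "P_PARTKEY", "L_PARTKEY"),
    ("PART", "PARTSUPP", "P_PARTKEY", "PS_PARTKEY"),
]


def classify(parent, child, parent_col, child_col, dist_key=DIST_KEY):
    """Classify a join by how many sides are distributed on the join column."""
    on_join_key = (dist_key[parent] == parent_col) + (dist_key[child] == child_col)
    labels = ["double redistribute", "single redistribute", "co-located"]
    return labels[on_join_key]


if __name__ == "__main__":
    for rel in RELATIONSHIPS:
        print(f"{rel[0]:9} {rel[1]:9} {classify(*rel)}")
```

For the keys above, the ORDERS-LINEITEM and PART-PARTSUPP joins come out as co-located, and every other relationship needs at most a single redistribute.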
Finally we will load the remaining tables.

1. You should still be connected to the LABDB database. We now need to recreate the NATION and REGION tables with a new distribution key. To drop the old versions execute the following command:

LABDB(LABADMIN)=> DROP TABLE NATION, REGION;

2. Quit the nzsql console with the \q command.
3. Navigate to the lab folder by executing the following command: cd /labs/dataDistribution
4. Review the SQL script that creates the remaining 6 tables with the command: more remaining_tables.sql
You will see the SQL script used for creating the remaining tables with the distribution keys mentioned above. Press the Enter key to scroll down until you reach the end of the file.
5. Create the remaining tables and load the data into them with the following command: ./create_remaining.sh
You should see the following results. The error message at the top is expected, since the script tries to clean up any old tables of the same name in case a reload is necessary.

[nz@netezza dataDistribution]$ ./create_remaining.sh
ERROR: Table 'NATION' does not exist
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE
Load session of table 'NATION' completed successfully
Load session of table 'REGION' completed successfully
Load session of table 'CUSTOMER' completed successfully
Load session of table 'SUPPLIER' completed successfully
Load session of table 'PART' completed successfully
Load session of table 'PARTSUPP' completed successfully

Congratulations! You have just defined data distribution keys for a customer data schema in PureData System. You can have a look at the created tables and their definitions with the commands you used in the previous chapters. We will continue to use the tables we created in the following labs.
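
The DATASLICEID query used throughout this lab can be condensed into a single skew figure. The Python sketch below is an optional illustration and not an appliance tool: parse_slice_counts and skew_ratio are hypothetical helper names, the default of four dataslices is the lab VM's configuration, and the sample text is the query output shown earlier in this lab. A ratio of 1.0 means a perfectly even distribution; a ratio of 2.0 means the fullest dataslice holds twice its fair share.

```python
def parse_slice_counts(nzsql_output):
    """Read 'DATASLICEID | COUNT' rows from pasted nzsql output text."""
    counts = {}
    for line in nzsql_output.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Keep only data rows: two fields, both purely numeric.
        if len(parts) == 2 and all(p.isdigit() for p in parts):
            counts[int(parts[0])] = int(parts[1])
    return counts


def skew_ratio(counts, n_slices=4):
    """Largest dataslice relative to a perfectly even split."""
    total = sum(counts.values())
    return max(counts.values()) / (total / n_slices)


SKEWED = """
 DATASLICEID |  COUNT
-------------+---------
           1 | 3004998
           4 | 2996217
(2 rows)
"""

EVEN = """
 DATASLICEID |  COUNT
-------------+---------
           2 | 1497649
           3 | 1501760
           4 | 1501816
           1 | 1499990
(4 rows)
"""

if __name__ == "__main__":
    for name, text in (("L_LINESTATUS", SKEWED), ("L_SHIPDATE", EVEN)):
        print(name, round(skew_ratio(parse_slice_counts(text)), 3))
```

Dataslices that hold no rows do not appear in the query output, which is why the ratio is computed against the total number of dataslices rather than the number of rows returned.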
© Copyright IBM Corporation 2011
All Rights Reserved.
IBM Canada
8200 Warden Avenue
Markham, ON
L6G 1C7
Canada
IBM, the IBM logo, ibm.com and Tivoli are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml

Other company, product and service names may be trademarks or service marks of others.

References in this publication to IBM products and services do not imply that IBM intends to make them available in all countries in which IBM operates.

No part of this document may be reproduced or transmitted in any form without written permission from IBM Corporation.

Product data has been reviewed for accuracy as of the date of initial publication. Product data is subject to change without notice. Any statements regarding IBM’s future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

THE INFORMATION PROVIDED IN THIS DOCUMENT IS DISTRIBUTED “AS IS” WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT.

IBM products are warranted according to the terms and conditions of the agreements (e.g. IBM Customer Agreement, Statement of Limited Warranty, International Program License Agreement, etc.) under which they are provided.
IBM Software
Information Management
IBM PureData System Administrator
Hands-On Lab
IBM PureData System for Analytics … Powered by Netezza Technology
Table of Contents
1 Introduction
2 Installing NzAdmin
3 The System Tab
4 The Database Tab
5 Tools
5.1 Workload Management
5.2 Table Storage
5.3 Table Skew
1 Introduction
In this lab we will explore the features of the IBM PureData System Administrator GUI tool, NzAdmin. NzAdmin is a Windows-based application that allows users to manage the system, obtain hardware information and status, and manage various aspects of user databases, tables, and objects. NzAdmin consists of two distinct environments: the System tab and the Database tab. We will look at both. When you click either tab, the system displays the tree view on the left side and the data view on the right side.

The VMware image we are using in the labs differs significantly from a normal PureData System appliance. There is only one virtualized SPA and SPU, there are only 4 dataslices, and there is no dataslice mirroring. In addition, some NzAdmin functions do not work with the VM. For example, the SPU and SPA sections are blank and the data distribution of a table cannot be displayed. Nevertheless most functionality works and should provide a good overview.
2 Installing NzAdmin
NzAdmin is part of the PureData System client tools for Windows. It can be installed with a standard Windows installer and
doesn’t require the JDBC or ODBC drivers to be installed, since it contains its own connection libraries.
1. The installation package is in “C:\Bootcamp\Netezza_Bootcamp_VMs\Tools\nzsetup.exe”
(The base directory C:\Bootcamp may differ in your environment. There should be a shortcut on your Desktop as well. If you cannot find it, ask the instructor for help.)
2. Install the NzAdmin client by double-clicking on the Installer and accepting all standard settings.
3. You can start NzAdmin from the Windows Start Menu. Programs->IBM PureData System -> IBM PureData System
Administrator
4. Connect to your PureData System host with the IP address taken from the VM where the PureData System host is running (you can use ifconfig eth0 in the Linux terminal window). In our lab the IP address is 192.168.239.2, the username is “admin”, and the password is “password”.
5. You should see the following:
The Admin client has a navigation menu on the left with two views System and Database.
The System view contains information about the general status of the PureData System hardware and the PureData
System Performance Server software. It displays system information and provides information about possible system
problems like a hard disc failure. It also contains statistics like the user space usage.
The database view contains information about the user databases in the system. It displays information about all database
objects like tables, views, sequences, synonyms, user defined functions, procedures etc. It also provides the user with the
tools necessary to manage groups and access privileges. You can also view the current active database sessions and their
queries and a recent history of all queries that have been run on the system. Finally you can see the backup history for
each database.
The menu bar contains common actions like refresh or connect. In addition to that it provides access to some
administration tools like Workload Management, a tool for the identification of table skew etc.
3 The System Tab
In this section we will inspect the hardware components that make up a PureData System Appliance system using NzAdmin,
including the SPUs and the data slices.
1. The default view is the main hardware view, which shows a dashboard representing the general health of the system. Unfortunately the hardware information cannot be gathered for our VM, but we can see the disc usage at the bottom. Note that the most important measure is the Maximum storage utilization: if one disc runs full, no new data can be added to the system.
2. The SPA and SPU sections are empty for our VM system. Normally we could see health information for all snippet processing arrays, snippet processing units and their data slices and hard discs. The next available section is data slices. When you select it, you can see that our VM has 4 dataslices (1-4) on four hard discs (1002-1005). Normally we would also see which disc contains the mirrors of these discs, but our VM system doesn’t mirror its data slices.
3. Under the data slice section you can see the currently active event rules. Event rules monitor the system for noteworthy occurrences and act accordingly, for example by sending an email to an administrator or raising an alert in case of a hardware failure. Unlike on a real PureData System appliance, only a very small set of event rules is enabled. You could use the New Event Rule wizard to add new events or generate test events.
4 The Database Tab
In this section we will learn how NzAdmin can be used to view and manage user database objects including tables, users,
groups and sessions.
1. Switch to the Database tab. This is the area where database tables, users, groups and sessions can be viewed
and managed.
You may not have some of the database objects displayed in the image; this does not affect the lab in any way.
2. In the Database tab, expand Databases and click on LABDB. NzAdmin can display all objects of the following types:
tables, views, sequences, synonyms, functions, aggregates, procedures, and libraries. You can also create objects of
the following types: tables, views, sequences, and synonyms. Furthermore, many of these object types can be
managed in NzAdmin, for example tables. Finally, the Groom Status section shows the currently running Groom
processes.
3. Click on Tables in the tree or data view. For each table in the LABDB database you can view information such as the
owner, creation date, number of columns, size of table on disk, data skew, row count, and percentage organized if
enabled.
4. If you right-click on a table, you can select from several ways to manage it, including changing the name, owner,
columns, organizing keys, or default values, generating or viewing statistics, viewing the record distribution, and
truncating or dropping the table.
Unfortunately, one of the most important menu entries, “Record Distribution”, which gives you a graphical view of how
the table’s data is distributed, doesn’t work in our VMware environment.
5. To look at information about the columns, distribution key, organizing keys, DDL, and users/groups with privileges for a
table, double-click on the table entry to bring up the details:
This view shows the columns of the table and their constraints. It also shows whether the columns are Zone Map
enabled. Zone Maps are an important performance feature in PureData System and will be discussed later in this course.
You can set access privileges for the table with the Privileges button. The DDL button returns the command that creates
the table and is a convenient feature for administrators.
6. Close the Table Properties view and click on the Users entry in the left navigation tree. Here you can create and
manage users.
Users can be created either from the context menu of the Users folder or from the Options menu at the top of the screen.
To manage a user, right-click on it and use the context menu that is displayed.
NzAdmin allows you to rename or drop users and to change their privileges, workload management settings, and so on.
7. You can manage groups in the same way in the Groups section of the Database tab.
8. Click on Sessions in the Database tab. Here you can see who is currently logged into the PureData System and the
commands they have issued. You can also abort sessions or change their priority in a workload management
environment (this has to be set up before you can change the priority).
9. To see and manage active queries, expand Sessions in the Database tab and click on Active Queries. However,
there are no queries running at this time.
10. Comprehensive query history information can be seen by clicking on the Query History section in the Database tab.
PureData System keeps a query history of the last 2000 queries. For a full audit log you would need to use the Query
History database. Select the View Query Entry menu item from the context menu to get a more structured view of the
values for a specific query:
11. Another window opens, showing the fields of the query history table in a more structured way:
For each recent query, PureData System saves a significant amount of information that can help you identify
queries that behave badly. Important values are the estimated and actual elapsed seconds, the number of result rows,
and of course the actual SQL statement.
12. It is also possible to get information about the actual execution plan of the query. We will discuss this in more detail in
future modules. To see a graphical representation of the Explain output, right-click on a query and select Explain->Graph:
13. You should see a window similar to the following; the actual graph may differ depending on the query you pick:
This graph shows how the PureData System appliance plans to execute the query in question. It is an important tool for
troubleshooting the occasional misbehaving query. We will discuss this in more detail in the Query Optimization
module. You can also get a textual view by selecting Explain->Verbose.
14. Close the graph and display the plan file by choosing the “View Plan File” entry in the context menu. You should see a
window similar to the following. Scroll down to the bottom:
Plan files look similar to Verbose Explain output, but there is a significant difference. Explain output tells you
how the system plans to execute a query. Plan files add information on how the query was actually executed,
including actual execution times, and are invaluable for debugging queries that failed or took longer than
expected.
15. Finally, select the Backup->History entry in the navigator.
PureData System logs all backup and restore sessions in the system. You will see an empty list now, but if you return to
this view after the Backup and Restore lab you will see the backup and restore processes you started. The backup history
allows PureData System to provide easy incremental or cumulative backups and to synchronize backups with the
groom process. We will discuss this further in a later section.
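The query history fields described in this section (estimated and actual elapsed seconds, result rows, SQL text) lend themselves to simple offline checks. The following Python sketch flags queries whose actual run time greatly exceeds the estimate. The record layout, field names, sample values, and thresholds are assumptions made for illustration; they are not the NzAdmin or Query History database format.

```python
from dataclasses import dataclass


@dataclass
class QueryEntry:
    # Hypothetical record; field names are illustrative, not NzAdmin's.
    sql: str
    estimated_seconds: float
    elapsed_seconds: float
    result_rows: int


def slow_queries(entries, factor=3.0, min_elapsed=1.0):
    """Return entries that ran at least `factor` times longer than estimated."""
    flagged = []
    for e in entries:
        if e.elapsed_seconds < min_elapsed:
            continue  # ignore very short queries, where ratios are noisy
        # Guard against a zero estimate so the ratio stays defined.
        baseline = max(e.estimated_seconds, 0.001)
        if e.elapsed_seconds / baseline >= factor:
            flagged.append(e)
    # Worst offenders first.
    return sorted(flagged, key=lambda e: e.elapsed_seconds, reverse=True)


# Invented sample values, for illustration only.
history = [
    QueryEntry("select count(*) from region", 1.0, 0.2, 1),
    QueryEntry("select * from nation, region", 2.0, 9.5, 56),
    QueryEntry("select * from nation", 1.0, 1.5, 14),
]

for entry in slow_queries(history):
    print(entry.elapsed_seconds, entry.sql)
```

Queries flagged this way are natural candidates for a closer look at their Explain output and plan file.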
5 Tools
In this section we will learn how to use NzAdmin to set workload management system settings, search for data skew,
and view disk usage by database or user.
5.1 Workload Management
1. From the menu bar at the top of NzAdmin click on Tools -> Workload Management -> System Settings. Using this tool
we can limit the maximum number of rows allowed in a table and set query timeouts, session idle timeouts, and the
default session priority.
2. From the Workload Management menu option, click into Performance Summary.
3. From the Summary pane, we can look at an aggregate view of the activities that happened in the last hour. Keep this in
mind for the Workload Management module.
5.2 Table Storage
1. From the menu bar at the top of NzAdmin click on Tools -> Table Storage. This tool shows the total size
in MB of each database, or the total size of all the databases a user owns.
5.3 Table Skew
1. From the menu bar at the top of NzAdmin click on Tools -> Table Skew. This tool displays tables that meet or exceed a
specified data skew threshold between data slices.
Once an administrator has seen in the main overview that the maximum storage utilization differs significantly from the
average storage utilization, they can use this tool to find the skewed tables. They can then fix them, for example by
redistributing them with a CTAS (CREATE TABLE AS SELECT) statement. Skewed tables not only limit the available
storage but also significantly lower performance.
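To make the idea of skew concrete, the following Python sketch compares the fullest data slice with the average across slices. The per-slice row counts are invented, and the ratio used here is only one plausible measure chosen for illustration; NzAdmin's own skew calculation and threshold units may differ.

```python
def skew_ratio(rows_per_slice):
    """Return max/average rows per data slice; 1.0 means perfectly even."""
    if not rows_per_slice:
        raise ValueError("need at least one data slice")
    average = sum(rows_per_slice) / len(rows_per_slice)
    if average == 0:
        return 1.0  # an empty table is trivially even
    return max(rows_per_slice) / average


def skewed_tables(tables, threshold=1.5):
    """Return names of tables whose skew ratio meets or exceeds the threshold."""
    return [name for name, slices in tables.items()
            if skew_ratio(slices) >= threshold]


# Hypothetical row counts on the four data slices of the lab VM.
tables = {
    "EVEN_TABLE": [250, 250, 250, 250],
    "SKEWED_TABLE": [970, 10, 10, 10],
}

print(skew_ratio(tables["EVEN_TABLE"]))
print(skewed_tables(tables))
```

A ratio well above 1 means one data slice holds far more than its share, so that slice fills up first and finishes its part of each query last.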
Loading and Unloading Data
Hands-On Lab
IBM PureData System for Analytics … Powered by Netezza Technology
Table of Contents
1 Introduction
1.1 Objectives
2 External Tables
2.1 Unloading Data using External Tables
2.2 Dropping External Tables
2.3 Loading Data using External Tables
3 Loading Data using the nzload Utility
3.1 Using the nzload Utility with Command Line Options
3.2 Using the nzload Utility with a Control File
3.3 (Optional) Using nzload with Bad Records
1 Introduction
In every data warehouse environment there is a need to load new data into the database. Loading data is not a
one-time operation but a continuous one that can occur hourly, daily, weekly, or even monthly. It is therefore a vital
operation that needs to be supported by the data warehouse system. IBM PureData System provides a framework
that supports not only loading data into the PureData System database environment but also unloading data from it.
This framework contains several components, including:
• External Tables – These are tables stored as flat files on the host or client systems and registered like tables in the
PureData System catalog. They can be used to load data into the PureData System appliance or unload data to the file
system.
• nzload – This is a command line tool wrapped around external tables that provides an easy method of loading data
into the PureData System appliance.
• Format Options – These are options for formatting the data loaded from and unloaded to external tables.
1.1 Objectives
This lab will help you explore the IBM PureData System framework components for loading and unloading data from the
database. You will use various commands to create external tables to unload and load data. Then you will get a basic
understanding of the nzload utility. In this lab the REGION and NATION tables in the LABDB database are used to illustrate the
use of external tables and the nzload utility. After this lab you will have a good understanding of how to load and unload data
from a PureData System database environment.
• The first part of this lab will explore using External Tables to unload and load data.
• The second part of this lab will discuss using the nzload utility to load records into tables.
2 External Tables
An external table allows PureData System to treat an external file as a database table. An external table has a definition (a table
schema) in the PureData System catalog, but the actual data exists outside of the PureData System appliance database.
This is referred to as a datasource file. External tables can be used to access files which are stored on the file system. After you
have created the external table definition, you can use INSERT INTO statements to load data from the external file into a
database table, or SELECT FROM statements to query the external table. This lab describes different methods to create and use
external tables using the nzsql interface. The external datasource files for the external tables are examined along the way, so a
second session will be used to view these files.
I. Connect to your PureData System image using PuTTY. Log in to 192.168.239.2 as user “nz” with password “nz”.
(192.168.239.2 is the default IP address for a local VM; the IP may be different for your Bootcamp.)
II. Change to the lab working directory /labs/movingData with the following command:
cd /labs/movingData
III. Connect to the LABDB database as the database owner, LABADMIN, using the nzsql interface:
[nz@netezza ~]$ nzsql -d LABDB -u labadmin -pw password
You should see the following results:
Welcome to nzsql, the Netezza SQL interactive terminal.
Type: \h for help with SQL commands
\? for help on internal slash commands
\g or terminate with semicolon to execute query
\q to quit
LABDB(LABADMIN)=>
IV. In this lab we will need to alternately execute SQL commands and operating system commands. To make this easier,
we will open a second PuTTY session for executing operating system commands such as nzload, viewing generated
external files, and so on. It will be referred to as session 2 throughout the lab.
The picture above shows the two PuTTY windows that you will need. Session 1 will be used for SQL commands and
session 2 for operating system prompt commands.
V. Open another session using PuTTY. Log in to 192.168.239.2 as user “nz” with password “nz”. (192.168.239.2 is the
default IP address for a local VM; the IP may be different for your Bootcamp.)
Also make sure that you change to the correct directory, /labs/movingData:
[nz@netezza ~]$ cd /labs/movingData
2.1 Unloading Data using External Tables
External tables will be used to unload rows from the LABDB database as records into an external datasource file. Various
methods to create and use external tables will be explored by unloading rows from either the REGION or the NATION table.
Five basic use cases are presented for you to follow so that you can gain a better understanding of how to use external
tables to unload data from a database.
2.1.1 Unloading data with an External Table created with the SAMEAS clause
The first external table will be used to unload data from the REGION table into an ASCII delimited text file. This external table will
be named ET1_REGION and will use the same column definition as the REGION table. After the ET1_REGION external table is
created you will use it to unload all the rows from the REGION table. The records for the ET1_REGION external table will be in the
external datasource file, et1_region_flat_file. The basic syntax to create this type of external table is:
CREATE EXTERNAL TABLE table_name
SAMEAS table_name
USING external_table_options
The SAMEAS clause allows the external table to be created with the same column definition as the referenced table. This is
referred to as implicit schema definition.
1. As the LABDB database owner, LABADMIN, you will create the first basic external table using the same column
definitions as the REGION table:
LABDB(LABADMIN)=> create external table et1_region sameas region
using (dataobject ('/labs/movingData/et1_region_flat_file'));
2. To list the external tables in the LABDB database, use the internal slash option \dx:
LABDB(LABADMIN)=> \dx
Which will list the external table you just created:
List of relations
Name | Type | Owner
------------+----------------+----------
ET1_REGION | EXTERNAL TABLE | LABADMIN
(1 rows)
3. You can also list the properties of the external table using the internal slash option to describe a table,
\d <external table name>:
LABDB(LABADMIN)=> \d et1_region
Which will list the properties of the ET1_REGION external table:
External Table "ET1_REGION"
Attribute | Type | Modifier
-------------+------------------------+----------
R_REGIONKEY | INTEGER |
R_NAME | CHARACTER(25) |
R_COMMENT | CHARACTER VARYING(152) |
DataObject - '/labs/movingData/et1_region_flat_file'
adjustdistzeroint - f
bool style - 1_0
code set -
compress - FALSE
cr in string - f
ctrl chars - f
date delim - -
date style - YMD
delim - |
encoding - INTERNAL
escape -
fill record - f
format - TEXT
ignore zero - f
log dir - /tmp
max errors - 1
max rows - 0
null value - NULL
quoted value - NO
remote source -
require quotes - f
skip rows - 0
socket buf size - 8388608
timedelim - :
time round nanos - f
time style - 24HOUR
trunc string - f
y2base - 0
includezeroseconds - f
record length -
record delimiter -
nullindicator bytes -
layout -
decimaldelim -
This output includes the columns and associated data types in the external table. You will notice that this is similar to the
REGION table since the external table was created using the SAMEAS clause in the CREATE EXTERNAL TABLE command.
The output also includes the properties of the external table. The most notable property is the DataObject property that
shows the location and the name of the external datasource file used for the external table. We will examine some of the
other properties in this lab.
4. Now that the external table is created you can use it to unload data from the REGION table using an INSERT statement:
LABDB(LABADMIN)=> insert into et1_region select * from region;
5. You can use this external table like a regular table by issuing SQL statements. Try issuing a simple SELECT FROM
statement against the ET1_REGION external table:
LABDB(LABADMIN)=> select * from et1_region;
Which will return all the rows in the ET1_REGION external table:
R_REGIONKEY | R_NAME | R_COMMENT
------------+---------------------------+-----------------------------
2 | sa | south america
1 | na | north america
4 | ap | asia pacific
3 | emea | europe, middle east, africa
(4 rows)
You will notice that this is the same data that is in the REGION table. But the data retrieved for this SELECT statement came
from the datasource file of this external table and not from the data within the database.
6. The main reason for creating an external table is to unload data from a table to a file. Using the second PuTTY session,
review the file that was created, et1_region_flat_file, in the /labs/movingData directory:
[nz@netezza movingData]$ more et1_region_flat_file
The file should look similar to the following:
2|sa|south america
1|na|north america
4|ap|asia pacific
3|emea|europe, middle east, africa
This is an ASCII delimited flat file containing the data from the REGION table. The column delimiter used in this file is the
default character, ‘|’.
2.1.2 Unloading data with an External Table using the AS SELECT clause
The second external table will also be used to unload data from the REGION table into an ASCII delimited text file, but using a
different method. The external table will be created and the data will be unloaded in the same create statement, so a separate
step is not required to unload the data. The external table will be named ET2_REGION and the external datasource file will be
named et2_region_flat_file. The basic syntax to create this type of external table is:
CREATE EXTERNAL TABLE table_name 'filename'
AS select_statement;
The AS clause allows the external table to be created with the same columns returned by the SELECT FROM statement, which is
referred to as implicit table schema definition. This also unloads the rows at the same time the external table is created.
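Because the datasource files written by these external tables are plain delimited text, they can also be inspected outside the database. The following Python sketch splits records like those in et1_region_flat_file into fields. It assumes the default ‘|’ delimiter, the NULL marker shown in the table properties, and no quoting or escape characters, which matches the simple lab data but not every external table configuration. The function name parse_flat_file is made up for this example.

```python
def parse_flat_file(lines, delimiter="|", null_value="NULL"):
    """Split delimited records into lists of fields, mapping null_value to None."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = [None if f == null_value else f for f in line.split(delimiter)]
        records.append(fields)
    return records


# Sample content copied from the lab's unloaded REGION file.
sample = [
    "2|sa|south america\n",
    "1|na|north america\n",
    "4|ap|asia pacific\n",
    "3|emea|europe, middle east, africa\n",
]

for key, name, comment in parse_flat_file(sample):
    print(int(key), name, comment)
```

To read the real file, the same function can be given an open file object, for example `parse_flat_file(open("/labs/movingData/et1_region_flat_file"))`.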
1. The first method used to create an external table required the data to be unloaded in a second step using an INSERT
statement. Now you will create an external table and unload the data in a single step:
LABDB(LABADMIN)=> create external table et2_region
'/labs/movingData/et2_region_flat_file' as select * from region;
This command created the external table ET2_REGION using the same definition as the REGION table and also unloaded
the data to the et2_region_flat_file.
2. List the external tables in the LABDB database:
LABDB(LABADMIN)=> \dx
Which will list all the external tables in the LABDB database:
List of relations
Name | Type | Owner
------------+----------------+----------
ET1_REGION | EXTERNAL TABLE | LABADMIN
ET2_REGION | EXTERNAL TABLE | LABADMIN
(2 rows)
You will notice that there are now two external tables. You can also list the properties of the external table, but the output
will be similar to the output in the last section, except for the filename.
3. Using the second session, review the file that was created, et2_region_flat_file, in the /labs/movingData directory:
[nz@netezza movingData]$ more et2_region_flat_file
The file should look similar to the following:
2|sa|south america
1|na|north america
4|ap|asia pacific
3|emea|europe, middle east, africa
This file is exactly the same as the file you reviewed in the last section. The difference this time is that we didn’t need to
unload it explicitly.
2.1.3 Unloading data with an external table using defined columns
The first two external tables that you created used exactly the same columns as the REGION table, using an implicit table
schema. You can also create an external table by explicitly specifying the columns. This is referred to as explicit table schema.
The third external table that you create will still be used to unload data from the REGION table, but only from the R_NAME and
R_COMMENT columns. The ET3_REGION external table will be created in one step, and then the data will be unloaded in a
second step into the et3_region_flat_file ASCII delimited text file, using a different delimiter string. The basic syntax to create
this type of external table is:
CREATE EXTERNAL TABLE table_name
({column_name type} [, ... ])
[USING external_table_options]
1. You will create a new external table that includes only the R_NAME and R_COMMENT columns and excludes the
R_REGIONKEY column of the REGION table. Along with this you will change the delimiter string from the default ‘|’ to
‘=’:
LABDB(LABADMIN)=> create external table et3_region (r_name char(25),
r_comment varchar(152)) USING (dataobject
('/labs/movingData/et3_region_flat_file') DELIMITER '=');
2. List the properties of the ET3_REGION external table:
LABDB(LABADMIN)=> \d et3_region
Which will list the properties of the ET3_REGION external table:
External Table "ET3_REGION"
Attribute | Type | Modifier
-------------+------------------------+----------
R_NAME | CHARACTER(25) |
R_COMMENT | CHARACTER VARYING(152) |
DataObject - '/labs/movingData/et3_region_flat_file'
adjustdistzeroint - f
bool style - 1_0
code set -
compress - FALSE
cr in string - f
ctrl chars - f
date delim - -
date style - YMD
delim - =
encoding - INTERNAL
escape -
fill record - f
format - TEXT
ignore zero - f
log dir - /tmp
max errors - 1
max rows - 0
null value - NULL
quoted value - NO
remote source -
require quotes - f
skip rows - 0
socket buf size - 8388608
timedelim - :
time round nanos - f
time style - 24HOUR
trunc string - f
y2base - 0
includezeroseconds - f
record length -
record delimiter -
nullindicator bytes -
layout -
decimaldelim -
You will notice that there are only two columns for this external table since you only specified two columns when creating
the external table. The rest of the output is very similar to the properties of the other two external tables that you created,
with two main exceptions. The first difference is the DataObject field, since the filename is different. The other
difference is the string used for the delimiter, which is now ‘=’ instead of the default, ‘|’.
3. Now you will unload the data from the REGION table, but only the data from columns R_NAME and R_COMMENT:
LABDB(LABADMIN)=> insert into et3_region select r_name, r_comment from region;
Alternatively, you could create the external table and unload the data in one step using the following command:
create external table et4_test '/labs/movingData/et4_region_flat_file'
using (delimiter '=') as select r_name, r_comment from region;
4. Using the second session, review the file that was created, et3_region_flat_file, in the /labs/movingData directory:
[nz@netezza movingData]$ more et3_region_flat_file
The file should look similar to the following:
sa=south america
na=north america
ap=asia pacific
emea=europe, middle east, africa
You will notice that only two columns are present in the flat file, using the ‘=’ string as a delimiter.
2.1.4 (Optional) Unloading data with an External Table from two tables
The first three external tables unloaded data from one table. The next external table you will create will be based on a
table join between the REGION and NATION tables. The two tables will be joined on the REGIONKEY and only the N_NAME and
R_NAME columns will be defined for the external table. This exercise illustrates how data can be unloaded using SQL
statements other than a simple SELECT FROM statement. The external table will be named ET_NATION_REGION and will use
another ASCII delimited text file named et_nation_region_flat_file.
1. For the next external table you will unload data from both the REGION and NATION tables, joined on the REGIONKEY
column, to list all of the countries and their associated regions. Instead of specifying the columns in the create external
table statement you will use the AS SELECT option:
LABDB(LABADMIN)=> create external table et_nation_region
'/labs/movingData/et_nation_region_flat_file' as select n_name, r_name from
nation, region where n_regionkey=r_regionkey;
2. List the properties of the ET_NATION_REGION external table:
LABDB(LABADMIN)=> \d et_nation_region
Which will show the properties of the ET_NATION_REGION table:
External Table "ET_NATION_REGION"
Attribute | Type | Modifier
-----------+---------------+----------
N_NAME | CHARACTER(25) | NOT NULL
R_NAME | CHARACTER(25) |
DataObject - '/labs/movingData/et_nation_region_flat_file'
adjustdistzeroint - f
bool style - 1_0
code set -
compress - FALSE
cr in string - f
ctrl chars - f
date delim - -
date style - YMD
delim - |
encoding - INTERNAL
escape -
fill record - f
format - TEXT
ignore zero - f
log dir - /tmp
max errors - 1
max rows - 0
null value - NULL
quoted value - NO
remote source -
require quotes - f
skip rows - 0
socket buf size - 8388608
timedelim - :
time round nanos - f
time style - 24HOUR
trunc string - f
y2base - 0
includezeroseconds - f
record length -
record delimiter -
nullindicator bytes -
layout -
decimaldelim -
You will notice that the external table was created using the two columns specified in the SELECT clause: N_NAME and
R_NAME.
3. View the data of the ET_NATION_REGION external table:
LABDB(LABADMIN)=> select * from et_nation_region;
Which will show all the rows from the ET_NATION_REGION table:
N_NAME | R_NAME
---------------------------+---------------------------
brazil | sa
guyana | sa
venezuela | sa
portugal | emea
australia | ap
united kingdom | emea
united arab emirates | emea
south africa | emea
hong kong | ap
new zealand | ap
japan | ap
macau | ap
canada | na
united states | na
(14 rows)
This is the result of joining the NATION and REGION tables on the REGIONKEY column to return just the N_NAME and
R_NAME columns.
4. Now, using the second session, review the file that was created, et_nation_region_flat_file, in the /labs/movingData
directory:
[nz@netezza movingData]$ more et_nation_region_flat_file
Which should look similar to the following:
brazil|sa
guyana|sa
venezuela|sa
portugal|emea
australia|ap
united kingdom|emea
united arab emirates|emea
south africa|emea
hong kong|ap
new zealand|ap
japan|ap
macau|ap
canada|na
united states|na
You can see that we created a delimited flat file from a more complex SQL statement. External tables are a very flexible and
powerful way to load, unload and transfer data.
2.1.5 (Optional) Unloading data with an External Table using the compress format
The previous external tables that you created used the default ASCII delimited text format. This last external table will be similar
to the second external table that you created, but instead of using an ASCII delimited text format you will use the
compressed binary format. The name of the external table will be ET4_REGION and the datasource file name will be
et4_region_compress. The basic syntax to create this type of external table is:
The external table options COMPRESS and FORMAT must be specified to use the compressed binary format.
1. You will now create one last external table using a method similar to the one you used to create the second external table in
section 2.1.2. But instead of using an ASCII delimited text format, the datasource will be compressed. This is achieved
by using the COMPRESS and FORMAT external table options:
As a reminder the external table is created and the data is unloaded in the same operation using the AS SELECT clause.
2. List the properties of the ET4_REGION external table:
Which will list the properties of the ET4_REGION table:
CREATE EXTERNAL TABLE table_name 'filename'
USING (COMPRESS true FORMAT 'internal')
AS select_statement;
LABDB(LABADMIN)=> \d et4_region
LABDB(LABADMIN)=> create external table et4_region
'/labs/movingData/et4_region_compress' using (compress true format 'internal') as
select * from region;
You will notice that the COMPRESS option has changed from FALSE to TRUE, indicating that the datasource file is
compressed, and that FORMAT has changed from TEXT to INTERNAL, which is required for compressed files.
2.2 Dropping External Tables
Dropping external tables is similar to dropping a regular PureData System table. The column definition for the external table is
removed from the PureData System catalog. Keep in mind that dropping the table doesn’t delete the external datasource file so
External Table "ET4_REGION"
Attribute | Type | Modifier
-------------+------------------------+----------
R_REGIONKEY | INTEGER |
R_NAME | CHARACTER(25) |
R_COMMENT | CHARACTER VARYING(152) |
DataObject - '/labs/movingData/et4_region_compress'
adjustdistzeroint -
bool style -
code set -
compress - TRUE
cr in string -
ctrl chars -
date delim -
date style -
delim -
encoding -
escape -
fill record -
format - INTERNAL
ignore zero -
log dir -
max errors -
max rows -
null value -
quoted value -
remote source -
require quotes -
skip rows -
socket buf size - 8388608
timedelim -
time round nanos -
time style -
trunc string -
y2base -
includezeroseconds -
record length -
record delimiter -
nullindicator bytes -
layout -
decimaldelim -
this also has to be maintained. So the external datasource file can still be used for loading data into a different table. In this
chapter you will drop the ET1_REGION table, but you will not delete the associated external datasource file, et1_region_flat_file.
This datasource file will be used later in this lab to load data into the REGION table.
1. Drop the first external table that you created, ET1_REGION, using the DROP TABLE command:
The same drop command for tables is used for external tables, so there is no separate DROP EXTERNAL TABLE
command.
2. Verify that the external table has been dropped using the internal slash option, \dx:
Which will list all the external tables in the LABDB database:
In this list the four remaining external tables that you created still exist.
3. Even though the external table definition no longer exists within the LABDB database, the flat file named
et1_region_flat_file still exists in the /labs/movingData directory. Verify this by using the second PuTTY session:
Which will list all of the files in the /labs/movingData directory:
You can see that the file et1_region_flat_file still exists. This file can still be used to load data into another similar table.
2.3 Loading Data using External Tables
External tables can also be used to load data into tables in the database. In this chapter data will be loaded into the REGION
table, so you will first have to remove the existing rows from the REGION table. The method to load data from external tables into
a table is similar to using the DML INSERT INTO and SELECT FROM statements. You will use two different methods to load
data into the REGION table, one using an external table and the other using the external datasource file directly. Loading data
into a table from any external table will have an associated log file with a default name of <table_name>.<database_name>.nzlog.
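The two load methods differ only in the FROM clause of the INSERT statement. The sketch below prints the statement for either case so the difference is easy to compare; load_sql is a hypothetical helper name, and the rule that a source starting with "/" is a datasource file path is an assumption made for this illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the INSERT statement that loads <table> from
# <source>. A source starting with "/" is treated as a datasource file path;
# anything else is treated as the name of an existing external table.
load_sql() {
    local table="$1" source="$2"
    case "$source" in
        /*) printf "insert into %s select * from external '%s';\n" "$table" "$source" ;;
        *)  printf 'insert into %s select * from %s;\n' "$table" "$source" ;;
    esac
}

load_sql region et2_region
load_sql region /labs/movingData/et1_region_flat_file
```

The printed statements can be entered at the nzsql prompt in the first session.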
1. Before loading the data into the REGION table, delete the rows from the table using the TRUNCATE TABLE command:
et1_region_flat_file et2_region_flat_file et4_region_compress et3_region_flat_file et_nation_region
[nz@netezza movingData]$ ls
List of relations
Name | Type | Owner
------------------+----------------+----------
ET2_REGION | EXTERNAL TABLE | LABADMIN
ET3_REGION | EXTERNAL TABLE | LABADMIN
ET4_REGION | EXTERNAL TABLE | LABADMIN
ET_NATION_REGION | EXTERNAL TABLE | LABADMIN
(4 rows)
LABDB(LABADMIN)=> \dx
LABDB(LABADMIN)=> drop table et1_region;
2. Check that the table is empty with the SELECT * command:
You should see that the table contains no data.
3. You will load data into the REGION table from the ET2_REGION external table using an INSERT statement:
4. Check to ensure that the table contains the four rows using the SELECT * statement.
You should see that the table now contains 4 rows.
5. Again delete the rows in the REGION table:
6. Check to ensure that the table is empty using the SELECT * statement.
You should see that the table contains no rows.
7. You will load data into the REGION table using the ASCII delimited file that was created for external table ET1_REGION.
Remember that the definition of the external table was removed from that database, but the external data source file,
et1_region_flat_file, still exists:
8. Check to ensure that the table contains the four rows using the SELECT * statement.
You should see that the table now contains four rows.
9. Since this is a load operation there is always an associated log file, <table>.<database>.nzlog created for each load
performed. By default this log file is created in the /tmp directory. In the second session review this file:
LABDB(LABADMIN)=> select * from region;
LABDB(LABADMIN)=> select * from region;
LABDB(LABADMIN)=> select * from region;
LABDB(LABADMIN)=> select * from region;
[nz@netezza movingData]$ more /tmp/REGION.LABDB.nzlog
LABDB(LABADMIN)=> insert into region select * from external
'/labs/movingData/et1_region_flat_file';
LABDB(LABADMIN)=> truncate table region;
LABDB(LABADMIN)=> insert into region select * from et2_region;
LABDB(LABADMIN)=> truncate table region;
The log file should look similar to the following:
You will notice that the log file contains the Load Options and the Statistics of the load, along with environment information
to identify the table.
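When many loads are run, reading each log by eye becomes tedious. The following is a minimal sketch, assuming the log keeps the "label: value" layout shown in this lab, of pulling a single counter out of the Statistics section; nzlog_stat is a hypothetical helper name and not an nzload feature.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of one Statistics counter from an
# nzload log file, matching lines such as "number of records loaded: 4".
# Usage: nzlog_stat "<label>" <logfile>
nzlog_stat() {
    local label="$1" file="$2"
    awk -F': *' -v label="$label" '
        {
            key = $1
            sub(/^[ \t]+/, "", key)    # ignore leading indentation in the log
        }
        key == label { print $2; exit }
    ' "$file"
}
```

For example, nzlog_stat "number of bad records" /tmp/REGION.LABDB.nzlog prints the bad record count, which a script could compare against zero.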
3 Loading Data using the nzload Utility
The nzload command is a SQL CLI client application that allows you to load data from the local host or a remote client, on all
the supported client platforms. The nzload command processes command-line load options to send queries to the host to
create an external table definition, run the insert/select query to load data, and when the load completes, drop the external table.
The nzload command is a command-line program that accepts options from multiple sources, where some of the sources can
be from the:
• Command line
Load started at:01-Jan-11 12:34:56 EDT
Database: LABDB
Tablename: REGION
Datafile: /labs/movingData/et1_region_flat_file
Host: netezza
Load Options
Field delimiter: '|' NULL value: NULL
File Buffer Size (MB): 8 Load Replay Region (MB): 0
Encoding: INTERNAL Max errors: 1
Skip records: 0 Max rows: 0
FillRecord: No Truncate String: No
Escape Char: None Accept Control Chars: No
Allow CR in string: No Ignore Zero: No
Quoted data: NO Require Quotes: No
BoolStyle: 1_0 Decimal Delimiter: '.'
Date Style: YMD Date Delim: '-'
Time Style: 24HOUR Time Delim: ':'
Time extra zeros: No
Statistics
number of records read: 4
number of bad records: 0
-------------------------------------------------
number of records loaded: 4
Elapsed Time (sec): 3.0
-----------------------------------------------------------------------------
Load completed at: 01-Jan-11 12:34:59 EDT
=============================================================================
• Control file
• NZ Environment Variables
Without a control file, you can only do one load at a time. Using a control file allows multiple loads. The nzload command
connects to a database with a user name and password, just like any other PureData System appliance client application. The
user name specifies an account with a particular set of privileges, and the system uses this account to verify access.
For this section of the lab you will continue to use the LABADMIN user to load data into the LABDB database. The nzload utility
will be used to load records from an external datasource file into the REGION table. Along with this the nzload log files will be
reviewed to examine the nzload options. Since you will be loading data into a populated REGION table, you will use the
TRUNCATE TABLE command to remove the rows from the table.
We will continue to use the two PuTTY sessions from the external table lab.
• Session One, which is connected to the NZSQL console to execute SQL commands, for example to review tables after
load operations
• Session Two, which will be used for operating system commands, to execute nzload commands, view data files, …
3.1 Using the nzload Utility with Command Line Options
The first method for using the nzload utility to load data in the REGION table will specify options at the command line. You will
only need to specify the datasource file. We will use default options for the rest. The datasource file will be the
et1_region_flat_file that you created in the External Tables section. The basic syntax for this type of command is:
1. As the LABDB database owner, LABADMIN, first remove the rows in the REGION table:
2. Check to ensure that the rows have been removed from the table using the SELECT * statement:
The REGION table should return no rows.
3. Using the second session at the OS command line you will use the nzload utility to load data from the et1_region_flat_file
file into the REGION table using the following command line options: -db <database name>, -u <user>, -pw
<password>, -t <table name>, -df <data file>, and -delimiter <string>:
Note: The file name is et1_region_flat_file, with the digit 1. Where the command appears as et"L"_region_flat_file, this is an
inconsistency in the image that will be fixed in its next iteration.
Which will return the following status message:
Load session of table 'REGION' completed successfully
nzload -db <database> -u <username> -pw <password> -df <datasource filename>
LABDB(LABADMIN)=> select * from region;
[nz@netezza movingData]$ nzload -db labdb -u labadmin -pw password -t
region -df et1_region_flat_file -delimiter '|'
LABDB(LABADMIN)=> truncate table region;
4. Check to ensure that the rows have been loaded into the table using the SELECT * statement:
Which will return all of the rows in the REGION table:
These rows were loaded from the records in the et1_region_flat_file file.
5. For every load task performed there is always an associated log file, <table>.<db>.nzlog created. By default this log file
is created in the current working directory, which is the /labs/movingData directory. In the second session review this file:
[nz@netezza movingData]$ more REGION.LABDB.nzlog
R_REGIONKEY | R_NAME | R_COMMENT
-------------+---------------------------+-----------------------------
1 | na | north america
4 | ap | asia pacific
2 | sa | south america
3 | emea | europe, middle east, africa
(4 rows)
LABDB(LABADMIN)=> select * from region;
You will notice that the log file contains the Load Options and the Statistics of the load, along with environment information
to identify the database and table.
The -db, -u, and -pw options specify the database name, the user, and the password, respectively. Alternatively, you could
omit these options if the NZ environment variables are set to the appropriate database, username and password values. Since
the NZ environment variables NZ_DATABASE, NZ_USER, and NZ_PASSWORD are set to system, admin, and password, you
need to use these options so the load will be against the LABDB database using the LABADMIN user.
The other options:
-t specifies the target table name in the database
-df specifies the datasource file to be loaded
-delimiter specifies the string to use as the delimiter in an ASCII delimited text file.
There are other options that you can use with the nzload utility. These options were not specified here since the default values
were sufficient for this load task.
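The relationship between explicit options and the NZ environment variables can be illustrated with a small dry-run helper. This is only a sketch: nzload_cmd is a hypothetical name, it prints the command instead of running it, and it assumes the password is supplied through NZ_PASSWORD so that it never appears in the output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (without running) an nzload command line.
# When the database or user argument is empty, the matching option is left
# out, so nzload would fall back to NZ_DATABASE or NZ_USER.
# The password is assumed to come from NZ_PASSWORD and is never printed.
# Usage: nzload_cmd <table> <datafile> [database] [user]
nzload_cmd() {
    local table="$1" datafile="$2" db="${3:-}" user="${4:-}"
    local cmd="nzload"
    if [ -n "$db" ]; then cmd="$cmd -db $db"; fi
    if [ -n "$user" ]; then cmd="$cmd -u $user"; fi
    printf "%s -t %s -df %s -delimiter '|'\n" "$cmd" "$table" "$datafile"
}

nzload_cmd region et1_region_flat_file labdb labadmin
nzload_cmd region et1_region_flat_file
```

The first call prints a command with explicit -db and -u options. The second prints one that depends entirely on the environment variables, which in this lab would point at the system database and not at LABDB.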
Load started at:01-Jan-11 12:34:56 EDT
Database: LABDB
Tablename: REGION
Datafile: /labs/movingData/et1_region_flat_file
Host: netezza
Load Options
Field delimiter: '|' NULL value: NULL
File Buffer Size (MB): 8 Load Replay Region (MB): 0
Encoding: INTERNAL Max errors: 1
Skip records: 0 Max rows: 0
FillRecord: No Truncate String: No
Escape Char: None Accept Control Chars: No
Allow CR in string: No Ignore Zero: No
Quoted data: NO Require Quotes: No
BoolStyle: 1_0 Decimal Delimiter: '.'
Date Style: YMD Date Delim: '-'
Time Style: 24HOUR Time Delim: ':'
Time extra zeros: No
Statistics
number of records read: 4
number of bad records: 0
-------------------------------------------------
number of records loaded: 4
Elapsed Time (sec): 3.0
-----------------------------------------------------------------------------
Load completed at: 01-Jan-11 12:34:59 EDT
=============================================================================
The following command is equivalent to the nzload command we used above. It demonstrates some of the options
that can be used with the nzload command but can be omitted when default values are used. It is for demonstration
purposes only:
The -lf, -bf, and -maxErrors options are explained in the next exercise. The -compress and -format options indicate
that the datasource file is an ASCII delimited text file. For a compressed binary datasource file the following options would be
used: -compress true -format internal.
3.2 Using the nzload Utility with a Control File
As demonstrated in section 3.1 you can run the nzload command by specifying the command line options, or you can
specify the options in a file, which is referred to as a control file. This is useful because loading data into a data warehouse
is a continuous operation, and the file can be updated and modified over time.
An nzload control file has the following basic structure:
And the -cf option is used at the nzload command line to use a control file:
The -u and -pw options are optional if the NZ_USER and NZ_PASSWORD environment variables are set to the appropriate user
and password. Using the -u and -pw options overrides the values in the NZ environment variables.
In this session you will again load rows into an empty REGION table using the nzload utility with a control file. The control file
will set the following options: delimiter, logDir, logFile, and badFile, along with the database, and tablename. The
datasource file to be used in this session is the region.del file.
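A control file with these options can also be generated from a script, which helps when the same load is repeated for several tables. The sketch below is an illustration under stated assumptions: write_control_file is a hypothetical helper name, and deriving the log and bad file names from the table name is a choice made here, not a requirement of nzload.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write an nzload control file that sets the options
# named in this section. The log and bad file names are derived from the
# table name purely for illustration.
# Usage: write_control_file <output file> <datafile> <database> <table> <log dir>
write_control_file() {
    local out="$1" datafile="$2" db="$3" table="$4" logdir="$5"
    cat > "$out" <<EOF
DATAFILE $datafile
{
Database $db
Tablename $table
Delimiter '|'
LogDir $logdir
LogFile $table.log
BadFile $table.bad
}
EOF
}
```

Calling write_control_file control_file /labs/movingData/region.del labdb region /labs/movingData writes a file with the same options as the control file used in this section.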
1. As the LABDB database owner, LABADMIN, first remove the rows in the REGION table:
Check to ensure that the rows have been removed from the table using the SELECT * statement. The table should contain
no rows.
2. Using the second session at the OS command line you will create the control file to be used with the nzload utility to
load data into the REGION table using the region.del data file. The control file will include the following options:
Parameter Value
Database Database name
nzload -u <username> -pw <password> -cf <control file>
DATAFILE <filename>
{
[<option name> <option value>]
}
nzload -db labdb -u labadmin -pw password -t region
-df et1_region_flat_file -delimiter '|'
-outputDir '<current directory>'
-lf <table>.<database>.nzlog -bf <table>.<database>.nzbad
-compress false -format text
-maxErrors 1
LABDB(LABADMIN)=> truncate table region;
Tablename Table name
Delimiter Delimiter string
LogDir Log directory
LogFile Log file name
BadFile Bad record log file name
And the data file will be the region.del file instead of the et1_region_flat_file that you used in section 3.1.
We already created the control file in the lab directory. Review it in the second putty session with the following command:
The control file looks like the following:
3. Still in the second session you will load the data using the nzload utility with the control file you created, using the
following command line options: -u <user>, -pw <password>, -cf <control file>:
Which will return the following status message:
4. Check the nzload log, which was renamed from the default to region.log and is located in the /labs/movingData
directory:
You should see a successful load.
5. Check to ensure that the rows are in the REGION table in the first putty session with the nzsql console:
You should see the added rows.
[nz@netezza movingData]$ more region.log
LABDB(LABADMIN)=> select * from region;
DATAFILE /labs/movingData/region.del
{
Database labdb
Tablename region
Delimiter '|'
LogDir /labs/movingData
LogFile region.log
BadFile region.bad
}
[nz@netezza movingData]$ more control_file
Load session of table 'REGION' completed successfully
[nz@netezza movingData]$ nzload -u labadmin -pw password -cf control_file
3.3 (Optional) Using nzload with Bad Records
The first two methods illustrated how to use the nzload utility to load data into an empty table using command line options or a
control file. In a data warehousing environment you will usually add data incrementally to a table that already contains
some rows.
There will be instances where records from a datasource might not match the datatypes in the table. When this occurs the load
will abort when the first bad record is encountered. This is the default behaviour and is controlled by the maxErrors option,
which is set to a default value of 1.
For this exercise you will add additional rows to the NATION table. Since you will be adding rows to the NATION table there will
be no need to truncate the table. The datasource file you will be using is the nation.del file, which unfortunately has a bad record.
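A datatype mismatch of this kind can sometimes be spotted before a load is attempted. The following is a rough pre-check, not an nzload feature: find_non_integer is a hypothetical helper that lists records whose chosen field is not a plain integer. It does not check the INT4 value range, so it is only a first filter.

```shell
#!/usr/bin/env bash
# Hypothetical pre-check (not an nzload feature): list the records of a
# delimited file whose given field is not a plain integer, which an integer
# column would reject. Each hit is printed as "<line number>: <record>".
# Usage: find_non_integer <file> [field number] [delimiter]
find_non_integer() {
    local file="$1" field="${2:-1}" delim="${3:-|}"
    awk -F"$delim" -v f="$field" '
        $f !~ /^-?[0-9]+$/ { printf "%d: %s\n", NR, $0 }
    ' "$file"
}
```

Run as find_non_integer nation.del 1, it would list any record whose first field, which maps to the N_NATIONKEY column, is not an integer.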
1. First check the NATION table by listing all of the rows in the table using the SELECT * statement in the first putty
session:
Which will list all the rows in the NATION table:
2. Using the second session at the OS command line you will use the nzload utility to load data from the nation.del file
into the NATION table using the following command line options: -db <database name>, -u <user>, -pw
<password>, -t <table name>, -df <data file>, and -delimiter <string>:
Which will return the following status message:
This is an indication that the load has failed due to a bad record in the datasource file.
Error: External Table : count of bad input rows reached maxerrors limit
See /labs/movingData/NATION.LABDB.nzlog file
Error: Load Failed, records not inserted.
N_NATIONKEY | N_NAME | N_REGIONKEY | N_COMMENT
-------------+---------------------------+-------------+----------------------------------
1 | canada | 1 | canada
2 | united states | 1 | united states of america
10 | australia | 4 | australia
5 | venezuela | 2 | venezuela
8 | united arab emirates | 3 | al imarat al arabiyah multahidah
9 | south africa | 3 | south africa
3 | brazil | 2 | brasil
11 | japan | 4 | nippon
12 | macau | 4 | aomen
14 | new zealand | 4 | new zealand
4 | guyana | 2 | guyana
6 | united kingdom | 3 | united kingdom
7 | portugal | 3 | portugal
13 | hong kong | 4 | xianggang
(14 rows)
LABDB(LABADMIN)=> select * from nation;
[nz@netezza movingData]$ nzload -db LABDB -u labadmin -pw password -t nation
-df nation.del -delimiter '|'
3. Since the load has failed no rows were loaded into the NATION table, which you can confirm by using the SELECT *
statement (in the first session):
Which will return the rows in the NATION table:
4. In the second session you can check the log file, NATION.LABDB.nzlog, to determine the problem:
[nz@netezza movingData]$ more NATION.LABDB.nzlog
N_NATIONKEY | N_NAME | N_REGIONKEY | N_COMMENT
-------------+---------------------------+-------------+----------------------------------
1 | canada | 1 | canada
2 | united states | 1 | united states of america
10 | australia | 4 | australia
5 | venezuela | 2 | venezuela
8 | united arab emirates | 3 | al imarat al arabiyah multahidah
9 | south africa | 3 | south africa
3 | brazil | 2 | brasil
11 | japan | 4 | nippon
12 | macau | 4 | aomen
14 | new zealand | 4 | new zealand
4 | guyana | 2 | guyana
6 | united kingdom | 3 | united kingdom
7 | portugal | 3 | portugal
13 | hong kong | 4 | xianggang
(14 rows)
LABDB(LABADMIN)=> select * from nation;
The Statistics section indicates that 10 records had been read when the bad record was encountered during the load process.
As expected no rows were inserted into the table since the default is to abort the load when one bad record is encountered.
The log file also provides information about the bad record:
10(1) [1, INT4] expected field delimiter or end of record, "2"[t]
10(1) indicates the input record number (10) within the file and the offset (1) within the row where a problem was
encountered. [1, INT4] indicates the column number (1) within the row and the data type (INT4) for the column.
"2"[t] indicates the text consumed ("2") and the last character examined (t), which caused the problem. So putting this all together, the problem is that ‘2t’ was in a field
for an INT4 column, which is the N_NATIONKEY in the NATION table. ‘2t’ is not a valid integer, so the load
marked this as a bad record.
5. You can confirm that this observation is correct by examining the nation.del datasource file that was used for the load.
In the second session execute the following command:
Which will display the nation.del file with the following text:
[nz@netezza movingData]$ more nation.del
Load started at:01-Jan-11 12:34:56 EDT
Database: LABDB
Tablename: NATION
Datafile: /home/nz/movingData/nation.del
Host: netezza
Load Options
Field delimiter: '|' NULL value: NULL
File Buffer Size (MB): 8 Load Replay Region (MB): 0
Encoding: INTERNAL Max errors: 1
Skip records: 0 Max rows: 0
FillRecord: No Truncate String: No
Escape Char: None Accept Control Chars: No
Allow CR in string: No Ignore Zero: No
Quoted data: NO Require Quotes: No
BoolStyle: 1_0 Decimal Delimiter: '.'
Date Style: YMD Date Delim: '-'
Time Style: 24HOUR Time Delim: ':'
Time extra zeros: No
Found bad records
bad #: input row #(byte offset to last char examined) [field #, declaration] diagnostic, "text consumed"[last char examined]
----------------------------------------------------------------------------------------------------------------------------
1: 10(1) [1, INT4] expected field delimiter or end of record, "2"[t]
Statistics
number of records read: 10
number of bad records: 1
-------------------------------------------------
number of records loaded: 0
Elapsed Time (sec): 1.0
-----------------------------------------------------------------------------
Load completed at: 01-Jan-11 12:34:57 EDT
=============================================================================
You will notice the following record on the 10th line:
2t|denmark|3|denmark
So we made the correct assumption that the ‘2t’ is causing the problem. From this list you can assume that the correct
value should be 24.
6. Alternatively you could instead examine the nzload bad log file NATION.LABDB.nzbad, which will contain all bad
records that are processed during a load. In the second session execute the following command:
Which will display the NATION.LABDB.nzbad file text:
This is the same row identified in the nation.del file using the log file to locate the record within the file. Since the default is to
stop the load after the first bad record is processed there is only one row. If you were to change the default behaviour to
allow more bad records to be processed this file could potentially contain more records. It provides a convenient overview
of all the records that created exceptions during the load.
7. We have the option of changing the nation.del file to change ‘2t’ to ‘24’, and then rerunning the same nzload command
as in step 2. Instead you will rerun a similar load but you will allow 10 bad records to be encountered during the load
process. To change the default behaviour you need to use the command option -maxErrors. You will also change the
name of the nzbad file using the -bf command option and the log filename using the -lf command option:
2t|denmark|3|denmark
[nz@netezza movingData]$ more NATION.LABDB.nzbad
15|andorra|2|andorra
16|ascension islan|3|ascension
17|austria|3|osterreich
18|bahamas|2|bahamas
19|barbados|2|barbados
20|belgium|3|belqique
21|chile|2|chile
22|cuba|2|cuba
23|cook islands|4|cook islands
2t|denmark|3|denmark
25|ecuador|2|ecuador
26|falkland islands|3|islas malinas
27|fiji|4|fiji
28|finland|3|suomen tasavalta
29|greenland|1|kalaallit nunaat
30|great britain|3|great britian
31|gibraltar|3|gibraltar
32|hungary|3|magyarorszag
33|iceland|3|lyoveldio island
34|ireland|3|eire
35|isle of man|3|isle of man
36|jamaica|2|jamaica
37|korea|4|han-guk
38|luxembourg|3|luxembourg
39|monaco|3|Monaco
Which will return the following status message:
Now the load is successful.
8. Check to ensure that the new loaded rows are in the NATION table:
Which will list all of the rows in the NATION table:
So now all of the new records were loaded except for the one bad row with nation key 24.
Load session of table 'NATION' completed successfully
N_NATIONKEY | N_NAME | N_REGIONKEY | N_COMMENT
-------------+---------------------------+-------------+----------------------------------
2 | united states | 1 | united states of america
11 | japan | 4 | nippon
18 | bahamas | 2 | bahamas
19 | barbados | 2 | barbados
20 | belgium | 3 | belqique
25 | ecuador | 2 | ecuador
33 | iceland | 3 | lyoveldio island
34 | ireland | 3 | eire
39 | monaco | 3 | monaco
3 | brazil | 2 | brasil
4 | guyana | 2 | guyana
5 | venezuela | 2 | venezuela
9 | south africa | 3 | south africa
13 | hong kong | 4 | xianggang
15 | andorra | 2 | andorra
27 | fiji | 4 | fiji
28 | finland | 3 | suomen tasavalta
30 | great britain | 3 | great britian
36 | jamaica | 2 | jamaica
37 | korea | 4 | han-guk
38 | luxembourg | 3 | luxembourg
6 | united kingdom | 3 | united kingdom
7 | portugal | 3 | portugal
10 | australia | 4 | australia
12 | macau | 4 | aomen
14 | new zealand | 4 | new zealand
26 | falkland islands | 3 | islas malinas
29 | greenland | 1 | kalaallit nunaat
31 | gibraltar | 3 | gibraltar
32 | hungary | 3 | magyarorszag
1 | canada | 1 | canada
8 | united arab emirates | 3 | al imarat al arabiyah multahidah
16 | ascension islan | 3 | ascension
17 | austria | 3 | osterreich
21 | chile | 2 | chile
22 | cuba | 2 | cuba
23 | cook islands | 4 | cook islands
35 | isle of man | 3 | isle of man
(38 rows)
LABDB(LABADMIN)=> select * from nation;
[nz@netezza movingData]$ nzload -db labdb -u labadmin -pw password -t nation
-df nation.del -delimiter '|' -maxerrors 10 -bf nation.bad -lf nation.log
9. Even though the nzload command returned a success message, it is good practice to review the nzload log file for
any problems, for example bad rows that are under the maxErrors threshold. In the second PuTTY session execute the
following command:
The log file should be similar to the following:
The main difference from before is that all of the data records in the data source file were processed (25). 24 records were
loaded because there was one bad record in the data source file.
10. Now you will correct the bad row and load it into the NATION table. There are a couple of options you could use. One
option is to extract the bad row from the original data source file and create a new data source file with the correct
record. However, this task could be tedious when dealing with large data source files and potentially many bad records.
The other option, which is more appropriate, is to use the bad log file. All bad records that cannot be loaded into the
table are placed in the bad log file. So in the second session use vi to open and edit the nation.bad file and change the
‘2t’ to ‘24’ in the first field.
[nz@netezza movingData]$ vi nation.bad
Load started at:01-Jan-11 12:34:56 EDT
Database: LABDB
Tablename: NATION
Datafile: /home/nz/movingData/nation.del
Host: netezza
Load Options
Field delimiter: '|' NULL value: NULL
File Buffer Size (MB): 8 Load Replay Region (MB): 0
Encoding: INTERNAL Max errors: 1
Skip records: 0 Max rows: 0
FillRecord: No Truncate String: No
Escape Char: None Accept Control Chars: No
Allow CR in string: No Ignore Zero: No
Quoted data: NO Require Quotes: No
BoolStyle: 1_0 Decimal Delimiter: '.'
Date Style: YMD Date Delim: '-'
Time Style: 24HOUR Time Delim: ':'
Time extra zeros: No
Found bad records
bad #: input row #(byte offset to last char examined) [field #, declaration] diagnostic, "text consumed"[last char examined]
----------------------------------------------------------------------------------------------------------------------------
1: 10(1) [1, INT4] expected field delimiter or end of record, "2"[t]
Statistics
number of records read: 25
number of bad records: 1
-------------------------------------------------
number of records loaded: 24
Elapsed Time (sec): 3.0
-----------------------------------------------------------------------------
Load completed at: 01-Jan-11 12:34:59 EDT
=============================================================================
[nz@netezza movingData]$ more nation.log
11. The vi editor has two modes, a command mode used to save files, quit the editor etc. and an insert mode. Initially you
will be in the command mode. To change the file you need to switch into the insert mode by pressing “i”. The editor will
show an – INSERT – at the bottom of the screen.
12. You can now use the cursor keys to navigate. Change the first two chars of the bad row from 2t to 24. Your screen
should look like the following:
13. We will now save our changes. Press “Esc” to switch back into command mode. You should see that the “—INSERT—
“ string at the bottom of the screen vanishes. Enter :wq! and press enter to write the file, and quit the editor without any
questions.
14. After the nation.bad file has modified to correct the record issue a nzload to load the modified nation.bad file:
Which will return the following status message:
15. And now check the new row has been loaded into the table in session one:
Which will return all rows in the NATION table:
Load session of table 'NATION' completed successfully
LABDB(LABADMIN)=> select * from nation;
[nz@netezza movingData]$ nzload -db labdb -u labadmin -pw password -t nation
-df nation.bad -delimiter '|'
24|denmark|3|denmark
~
~
~
~
~
~
~
~
~
~
-- INSERT --
IBM Software
Information Management
IBM PureData System for Analytics
© Copyright IBM Corp. 2012. All rights reserved Page 31 of 32
The row in bold denotes the new row that was added to the table, which was the bad record you corrected.
N_NATIONKEY | N_NAME | N_REGIONKEY | N_COMMENT
-------------+---------------------------+-------------+----------------------------------
1 | canada | 1 | canada
8 | united arab emirates | 3 | al imarat al arabiyah multahidah
16 | ascension islan | 3 | ascension
17 | austria | 3 | osterreich
21 | chile | 2 | chile
22 | cuba | 2 | cuba
23 | cook islands | 4 | cook islands
35 | isle of man | 3 | isle of man
24 | denmark | 3 | denmark
2 | united states | 1 | united states of america
11 | japan | 4 | nippon
18 | bahamas | 2 | bahamas
19 | barbados | 2 | barbados
20 | belgium | 3 | belqique
25 | ecuador | 2 | ecuador
33 | iceland | 3 | lyoveldio island
34 | ireland | 3 | eire
39 | monaco | 3 | monaco
3 | brazil | 2 | brasil
4 | guyana | 2 | guyana
5 | venezuela | 2 | venezuela
9 | south africa | 3 | south africa
13 | hong kong | 4 | xianggang
15 | andorra | 2 | andorra
27 | fiji | 4 | fiji
28 | finland | 3 | suomen tasavalta
30 | great britain | 3 | great britian
36 | jamaica | 2 | jamaica
37 | korea | 4 | han-guk
38 | luxembourg | 3 | luxembourg
6 | united kingdom | 3 | united kingdom
7 | portugal | 3 | portugal
10 | australia | 4 | australia
12 | macau | 4 | aomen
14 | new zealand | 4 | new zealand
26 | falkland islands | 3 | islas malinas
29 | greenland | 1 | kalaallit nunaat
31 | gibraltar | 3 | gibraltar
32 | hungary | 3 | magyarorszag
(39 rows)
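The bad record corrected above was rejected because its first field, declared as INT4, contained the text 2t. A check of this kind can be run on a delimited file before loading. The following is a minimal Python sketch, not part of the lab and not a reimplementation of the nzload parser: the function name and the sample rows are made up for illustration, and the check simply relies on what Python's int() accepts.

```python
def find_bad_integer_fields(lines, field_index=0, delimiter="|"):
    """Return (row_number, value) pairs for fields that int() rejects.

    Row numbers start at 1. This is an illustrative pre-check only and
    does not reproduce the rules nzload applies to INT4 columns.
    """
    bad = []
    for row_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(delimiter)
        value = fields[field_index] if field_index < len(fields) else ""
        try:
            int(value)
        except ValueError:
            bad.append((row_number, value))
    return bad


# Hypothetical sample rows in the same '|' delimited layout as nation.del.
SAMPLE = [
    "23|cook islands|4|cook islands\n",
    "2t|denmark|3|denmark\n",
    "25|ecuador|2|ecuador\n",
]

if __name__ == "__main__":
    for row_number, value in find_bad_integer_fields(SAMPLE):
        print(f"row {row_number}: first field {value!r} is not an integer")
```

Running such a check first does not replace the bad log file, but it can show how many rows would need correcting before a large load is started.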
Backup & Restore
Hands-On Lab
Table of Contents
1 Introduction
1.1 Objectives
2 Creating a QA Database
3 Creating the Test Database
4 Backing up and Restoring a Database
4.1 Backing up the Database
4.2 Verifying the Backups
4.3 Restoring the Database
4.4 Single Table Restore
5 Backing up User Data and Host Data
5.1 User Data Backup
5.2 Host Data Backup
1 Introduction
PureData System appliances are 99.99% reliable and all internal components are redundant. Nevertheless, regular
backups should be part of any data warehouse. The first reason for this is disaster recovery, for example in case of a fire
in the data center. The second reason is to undo changes such as accidental deletes.
For disaster recovery, backups should be stored in a different location than the data center that hosts the data warehouse.
For most large companies this will be a backup server running a version of Veritas NetBackup, Tivoli Storage
Manager, or similar software. Backing up to a file server is also possible.
1.1 Objectives
In the previous labs we created our LABDB database and loaded data into it. In this lab we will first set up a QA
database that contains a subset of the tables and data of the full database. To create the tables we will use cross-database
access from our QA database to the LABDB production database.
We will then use the schema-only function of nzbackup to create a test database that contains the same tables and data
objects as the QA database but no data. Test data will later be added specifically for testing needs. After that we will do a
multistep backup of our QA database and test the restore functionality. Testing backups by restoring them is generally a
good idea and should be done during the development phase and also at regular intervals. After all, you cannot be fully sure
what a backup contains until you restore it.
Finally we will back up the system user data and the host data. While a database backup saves all users and groups that
are involved in that database, a full user backup may be needed to get the full picture, for example to archive users and
groups that are not used in any database. Host data should also be backed up regularly. In case of a host failure that
leaves the user data on the S-Blades intact, having a recent host backup will make the recovery of the appliance much
faster and more straightforward.
Figure 1 LABDB database
2 Creating a QA Database
In this chapter we will create a QA database called LABDBQA, which contains a subset of the tables. It will contain the full
NATION and REGION tables and the CUSTOMER table with a subset of the data. We will first create our QA database, then
connect to it and use CREATE TABLE AS (CTAS) statements to create the table copies. We will use cross-database access to create our
CTAS tables from the foreign LABDB database. This is possible because PureData System allows read-only cross-database
access if fully qualified names are used.
In this lab we will switch regularly between the operating system prompt and the nzsql console. The operating system
prompt will be used to execute the backup and restore commands and review the created files. The nzsql console will be
used to create the tables and to review the changes that the restore commands make to the user data.
To make this easier you should open two PuTTY sessions. The first one will be used to execute the operating system
commands and will be referred to as session 1 or the OS session. In the second session we will start the nzsql console;
it will be referred to as session 2 or the nzsql session. You can also see which session to use from the command prompt
in the screenshots.
Figure 2 The two PuTTY sessions for this lab: OS session 1 on the left, NZSQL session 2 on the right
1. Open the first PuTTY session. Log in to 192.168.239.2 as user “nz” with password “nz”. (192.168.239.2 is the default
IP address for a local VM; the IP may be different for your Bootcamp.)
2. Access the lab directory for this lab with the following command:
[nz@netezza ~]$ cd /labs/backupRestore/
3. Open the second PuTTY session. Log in to 192.168.239.2 as user “nz” with password “nz”. (192.168.239.2 is the
default IP address for a local VM; the IP may be different for your Bootcamp.)
4. Access the lab directory for this lab with the same command as before:
[nz@netezza ~]$ cd /labs/backupRestore/
5. Start the NZSQL console with the following command: nzsql
This will connect you to the SYSTEM database as the ADMIN user. These are the default settings stored in
the environment variables of the nz user.
6. Create our empty QA database with the following command:
SYSTEM(ADMIN)=> create database LABDBQA;
7. Connect to the QA database with the following command:
SYSTEM(ADMIN)=> \c LABDBQA
8. Create a full copy of the REGION table from the LABDB database:
LABDBQA(ADMIN)=> create table region as select * from labdb..region;
With this statement we create a local REGION table in the currently connected QA database that has the same
definition and content as the REGION table from the LABDB database. The CREATE TABLE AS statement is one of
the most flexible administrative tools for a PureData System administrator.
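The CREATE TABLE AS pattern from step 8 can be tried without an appliance. The following is a minimal Python sketch that uses SQLite purely as a stand-in, so several details are assumptions rather than PureData System behavior: SQLite addresses an attached database as labdb.region (one dot) instead of labdb..region, it does not enforce read-only access to the attached database, and region_emea is a hypothetical table name chosen for the filtered copy.

```python
import sqlite3


def copy_region_tables():
    """Copy a table from an attached database with CREATE TABLE AS.

    Returns the row counts of the full copy and of a filtered copy.
    """
    conn = sqlite3.connect(":memory:")  # stands in for LABDBQA
    conn.execute("ATTACH DATABASE ':memory:' AS labdb")  # stands in for LABDB
    conn.execute(
        "CREATE TABLE labdb.region (r_regionkey INTEGER, r_name TEXT, r_comment TEXT)"
    )
    conn.executemany(
        "INSERT INTO labdb.region VALUES (?, ?, ?)",
        [
            (1, "na", "north america"),
            (2, "sa", "south america"),
            (3, "emea", "europe, middle east, africa"),
            (4, "ap", "asia pacific"),
        ],
    )
    # Full copy: same columns and rows as the source table.
    conn.execute("CREATE TABLE region AS SELECT * FROM labdb.region")
    # Filtered copy: only the rows that match the WHERE clause are copied.
    conn.execute(
        "CREATE TABLE region_emea AS SELECT * FROM labdb.region WHERE r_name = 'emea'"
    )
    full_rows = conn.execute("SELECT count(*) FROM main.region").fetchone()[0]
    subset_rows = conn.execute("SELECT count(*) FROM region_emea").fetchone()[0]
    conn.close()
    return full_rows, subset_rows


if __name__ == "__main__":
    print(copy_region_tables())
```

The filtered copy follows the same idea as the CUSTOMER subset created later in this chapter: the SELECT decides both the structure and the content of the new table.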
We can easily access tables of databases we are currently not connected to, but only for read operations. We cannot
insert data into a database we are not connected to.
9. Let’s verify that the content has been copied over correctly. First let’s look at the original data in the LABDB
database:
LABDBQA(ADMIN)=> select * from labdb..region;
You should see four rows in the result set:
R_REGIONKEY | R_NAME | R_COMMENT
-------------+---------------------------+-----------------------------
2 | sa | south america
3 | emea | europe, middle east, africa
1 | na | north america
4 | ap | asia pacific
(4 rows)
To access a table from a foreign database we need to use the fully qualified name. Notice that we leave out the
schema name between the two dots. Schemas are not fully supported in PureData System and, since each table
name needs to be unique in a given database, the schema name can be omitted.
10. Now let’s compare that to our local REGION table:
LABDBQA(ADMIN)=> select * from region;
You should see the same rows as before, although they can be in a different order:
R_REGIONKEY | R_NAME | R_COMMENT
-------------+---------------------------+-----------------------------
1 | na | north america
3 | emea | europe, middle east, africa
4 | ap | asia pacific
2 | sa | south america
(4 rows)
11. Now we copy over the NATION table as well:
LABDBQA(ADMIN)=> create table nation as select * from labdb..nation;
12. Finally we will copy over a subset of our CUSTOMER table. We will only use the rows from the automobile
market segment for the QA database:
LABDBQA(ADMIN)=> create table customer as select * from labdb..customer where
c_mktsegment = 'AUTOMOBILE';
You will see that this inserts almost 30000 rows into the QA customer table, which is roughly a fifth of the original table:
INSERT 0 29752
13. We will now create a view NATIONSBYREGIONS which returns a list of nation names with their corresponding
region names. This is used in a couple of applications:
LABDBQA(ADMIN)=> create view nationsbyregions as select r_name, n_name from nation,
region where r_regionkey = n_regionkey;
14. Let’s have a look at what the view returns:
LABDBQA(ADMIN)=> select * from nationsbyregions;
You should get a list of all nations and their corresponding region name:
R_NAME | N_NAME
---------------------------+---------------------------
sa | guyana
emea | united arab emirates
ap | macau
sa | brazil
emea | portugal
ap | japan
na | canada
sa | venezuela
emea | south africa
ap | hong kong
na | united states
emea | united kingdom
ap | australia
ap | new zealand
(14 rows)
Views are a very convenient way to hide SQL complexity. They can also be used to implement column-level security
by creating views of tables that only contain a subset of columns. They are fully supported by PureData System.
15. Verify the created objects with the following command: \d
You will see that our QA database only contains the three tables and the view we just created:
LABDBQA(ADMIN)=> \d
List of relations
Name | Type | Owner
------------------+-------+-------
CUSTOMER | TABLE | ADMIN
NATION | TABLE | ADMIN
NATIONSBYREGIONS | VIEW | ADMIN
REGION | TABLE | ADMIN
(4 rows)
16. Finally we will create a QA user and make him the owner of the database. Create the user with:
LABDBQA(ADMIN)=> create user qauser;
17. Make him the owner of the QA database:
LABDBQA(ADMIN)=> alter database labdbqa owner to qauser;
We have successfully created our QA database using cross-database CTAS statements. Our QA database contains
three tables and a view, and we have a user that is the owner of this database. In the next chapter we will use backup and
restore to create an empty copy of the QA database for the test database.
3 Creating the Test Database
In this chapter we will use schema-only backup and restore to create an empty copy of the QA database as the test database.
It will not contain any data, since the developers will fill it with test-specific data. Schema-only backup is a convenient
way to recreate databases without the contained user data.
1. Switch to the OS session and create the schema-only backup of our QA database:
[nz@netezza backupRestore]$ nzbackup -schema-only -db labdbqa -dir /tmp/bkschema
To do this we need to specify three parameters: the database we want to back up, the file system location where
the backup files are saved, and the -schema-only parameter to specify that user data should not be backed up.
Normally backups should not be saved on the host hard disks but on a remote network file server. Not only is
this essential for disaster recovery, but the host hard disks are small, optimized for speed and not intended to
hold large amounts of data. They are strictly intended for PureData System software and operational data.
Later we will have a deeper look at the created files and the logs, but for the moment we will not go into that.
2. Now we will restore the test database from this backup:
[nz@netezza backupRestore]$ nzrestore -dir /tmp/bkschema -db labdbtest -sourcedb labdbqa -schema-only
We can restore a database to a different database name. We simply need to specify the new name in the
-db parameter and the old name in the -sourcedb parameter.
3. In the nzsql session we will verify that we successfully created an empty copy of our database. List all available
databases with the following command: \l
LABDBTEST(ADMIN)=> \l
List of databases
DATABASE | OWNER
-----------+----------
INZA | ADMIN
LABDB | LABADMIN
LABDBQA | QAUSER
LABDBTEST | QAUSER
MASTER_DB | ADMIN
NZA | ADMIN
NZM | ADMIN
NZR | ADMIN
SYSTEM | ADMIN
(9 rows)
You can see that the LABDBTEST database was successfully created and that its privilege information has been
copied as well: the owner is QAUSER, as in the LABDBQA database.
4. We do not want the QA user to be the owner of the test database. Change the owner to ADMIN for now:
LABDBTEST(ADMIN)=> alter database labdbtest owner to admin;
5. Now let’s check the contents of our test database. Connect to it with: \c labdbtest
6. Check that our test database contains all the objects of the QA database: \d
You will see the three tables and the view we created:
LABDBTEST(ADMIN)=> \d
List of relations
Name | Type | Owner
------------------+-------+-------
CUSTOMER | TABLE | ADMIN
NATION | TABLE | ADMIN
NATIONSBYREGIONS | VIEW | ADMIN
REGION | TABLE | ADMIN
(4 rows)
PureData System backup saves all database objects, including views, stored procedures and so on. All users,
groups and privileges that refer to the backed-up database are saved as well.
7. Since we used the -schema-only option we have not copied any data. Verify this for the NATION table with the
following command:
LABDBTEST(ADMIN)=> select * from nation;
You will see an empty result set, as expected. The -schema-only backup option is a convenient way to save your
database schema and to create empty copies of your database. Apart from the missing user data it will create a full
1:1 copy of the original database. You could also restore the database to a different PureData System appliance. This
would only require that the backup server location is accessible from both PureData System appliances. It could even
be a differently sized appliance, and the target appliance can have a higher version number of the NPS software than
the source. It cannot be lower, though.
4 Backing up and Restoring a Database
PureData System’s user data backup creates a backup of the complete database, including all database objects and
user data. Even global objects such as users and privileges that are used in the database are backed up. Backup and restore
is therefore a very easy and straightforward process.
Since PureData System has no transaction log, point-in-time restore is not possible. Therefore frequent backups are
advisable. NPS supports full, differential and cumulative backups that allow easy and fast regular data backups. An
example backup strategy would be monthly full backups, weekly cumulative backups and daily differential backups.
Since PureData System is neither intended nor designed to be used as an OLTP database, this should provide
enough backup flexibility for most situations. For example, differential backups can be run after the daily ETL processes that
feed the warehouse.
Figure 3 A typical backup strategy
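The example strategy of monthly full, weekly cumulative and daily differential backups can be turned into a simple scheduling rule. The following Python sketch is one possible reading of it: the choice of the first day of the month for full backups and of Sunday for cumulative backups is an assumption made for illustration, not something the lab prescribes.

```python
from datetime import date


def backup_type_for(day: date) -> str:
    """Pick the backup type for a calendar day under the example strategy.

    Assumptions (not from the lab): full backups run on the first day of
    the month and cumulative backups run on Sundays. Every other day gets
    a differential backup.
    """
    if day.day == 1:
        return "full"
    if day.weekday() == 6:  # Monday is 0, Sunday is 6
        return "cumulative"
    return "differential"


if __name__ == "__main__":
    for day in (date(2011, 12, 1), date(2011, 12, 4), date(2011, 12, 14)):
        print(day.isoformat(), backup_type_for(day))
```

A scheduler could call such a function once a day, after the ETL run, and start the matching backup.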
In this chapter we will create a backup of our QA database. We will then do a differential backup and then do a restore.
Our VMware environment has some specific restrictions that only allow the restoration of up to 2 increments. The labs
will work correctly, but don’t be surprised by errors during restore operations of more than 2 increments.
4.1 Backing up the Database
PureData System’s backup is organized in so-called backup sets. Every new full backup creates a new backup set. Differential
and cumulative backups are by default added to the last backup set, but they can be added to a different backup set as well. In
this section we will switch between the two PuTTY sessions.
1. In the OS session execute the following command to create a full backup of the QA database:
[nz@netezza backupRestore]$ nzbackup -db labdbqa -dir /tmp/bk1 /tmp/bk2
You should get the following result:
Backup of database labdbqa to backupset 20111214173551 completed successfully.
This command will create a full user data backup of the LABDBQA database.
Each backup set has a unique id that can later be used to access it. By default the last active backup set is used for
restore and differential backups.
In this lab we split up the backup between two file system locations. You can specify up to 16 file system
locations after the -dir parameter. Alternatively you could use a directory list file with the -dirfile
option. Splitting up the backup between different file servers will result in higher backup performance.
2. In the NZSQL session we will now add a new row to the REGION table. First connect to the QA database:
LABDBTEST(ADMIN)=> \c labdbqa
3. Now add a new entry for the north pole to the REGION table:
LABDBQA(ADMIN)=> insert into region values (5, 'np', 'north pole');
4. In the OS session create a differential backup:
[nz@netezza backupRestore]$ nzbackup -db labdbqa -dir /tmp/bk1 /tmp/bk2 -differential
We now create a differential backup with the -differential option. This adds a new entry to the backup set
we created previously, containing only the differences since the full backup. You can see that the backup set id has not
changed.
5. In the NZSQL session add the south pole to the REGION table:
LABDBQA(ADMIN)=> insert into region values (6, 'sp', 'south pole');
You now have one full backup with the original 4 rows in the REGION table, a differential backup that
additionally contains the north pole entry, and a current state that also contains the south pole region.
4.2 Verifying the Backups
In this subchapter we will have a closer look at the files and logs that are created during the PureData System backup
process.
1. In the OS session display the backup history of your appliance:
[nz@netezza backupRestore]$ nzbackup -history
You should get the following result:
Database Backupset      Seq # OpType Status    Date                Log File
-------- -------------- ----- ------ --------- ------------------- ------------------------------
LABDBQA  20111213225029 1     SCHEMA COMPLETED 2011-12-13 17:50:29 backupsvr.10724.2011-12-13.log
LABDBQA  20111214173551 1     FULL   COMPLETED 2011-12-14 12:35:51 backupsvr.21406.2011-12-14.log
LABDBQA  20111214173551 2     DIFF   COMPLETED 2011-12-14 12:44:53 backupsvr.21594.2011-12-14.log
PureData System keeps track of all backups and saves them in the system catalog. This is used for differential
backups and it is also integrated with the Groom process. Since PureData System doesn’t use transaction logs, it
needs logically deleted rows for differential backups. By default Groom doesn’t remove a logically deleted row that
has not been backed up yet. Therefore the Groom process is integrated with the backup history. We will explain this in
more detail in the Transaction and Groom modules.
On our machine we have done three backups: one backup set containing the schema-only backup, and two backups in
the second backup set, one full and one differential. Let’s have a closer look at the log that has been generated for the
last differential backup.
2. In the OS session, switch to the log directory of the backupsvr process, which is the process responsible for
backing up data:
[nz@netezza backupRestore]$ cd /nz/kit/log/backupsvr/
The /nz/kit/log directory contains the log directories for all PureData System processes.
3. Display the end of the log for the last differential backup process. You will need to replace the x values with
the actual values of your log. You can cut and paste the log name from the history output above. We are
interested in the last differential backup process:
[nz@netezza backupsvr]$ tail backupsvr.xxxxx.xxxx-xx-xx.log
You will see the following result:
[nz@netezza backupsvr]$ tail backupsvr.21594.2011-12-14.log
2011-12-14 12:44:59.445051 EST Info: [21604] Postgres client pid: 21606, session: 19206
2011-12-14 12:45:00.461034 EST Info: Capturing deleted rows
2011-12-14 12:45:03.971731 EST Info: Backing up table REGION
2011-12-14 12:45:04.675441 EST Info: Backing up table NATION
2011-12-14 12:45:06.077822 EST Info: Backing up table CUSTOMER
2011-12-14 12:45:08.673602 EST Info: Operation committed
2011-12-14 12:45:08.673636 EST Info: Wrote 264 bytes in less than one second to location 1
2011-12-14 12:45:08.673643 EST Info: Wrote 385 bytes in less than one second to location 2
2011-12-14 12:45:08.682316 EST Info: Backup of database labdbqa to backupset 20111214173551 completed successfully.
2011-12-14 12:45:08.767215 EST Info: NZ-00023: --- program 'backupsvr' (21594) exiting on host 'netezza' ... ---
You can see that the process backed up the three tables REGION, NATION and CUSTOMER and wrote the result to
two different locations. You also see the amount of data written to these locations. Since we only added a single row,
the amount of data is tiny. If you look at the log of the full backup you will see a lot more data being written.
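The amounts written per backup location can also be pulled out of the log with a few lines of code. The following Python sketch is based only on the two "Wrote ... bytes ... to location" lines shown in this lab, so the message format is an assumption that may not hold for other releases, and the function name is made up for illustration.

```python
import re

# Matches messages such as
# "Info: Wrote 264 bytes in less than one second to location 1".
# \s+ before the location number also tolerates a line wrap at that point.
WROTE = re.compile(r"Wrote (\d+) bytes .*? to location\s+(\d+)")


def bytes_per_location(log_text):
    """Sum the bytes written to each backup location found in log_text."""
    totals = {}
    for size, location in WROTE.findall(log_text):
        totals[int(location)] = totals.get(int(location), 0) + int(size)
    return totals


# Excerpt from the lab log; the second message is wrapped on purpose.
SAMPLE = """\
2011-12-14 12:45:08.673602 EST Info: Operation committed
2011-12-14 12:45:08.673636 EST Info: Wrote 264 bytes in less than one second to location 1
2011-12-14 12:45:08.673643 EST Info: Wrote 385 bytes in less than one second to location
2
"""

if __name__ == "__main__":
    for location, size in sorted(bytes_per_location(SAMPLE).items()):
        print(f"location {location}: {size} bytes")
```

Applied to the log of the full backup, the same function would make the difference in volume between the full and the differential backup directly comparable.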
4. Now let’s have a look at the files that are created during the backup process. Enter the first backup location:
[nz@netezza backupsvr]$ cd /tmp/bk1
5. Display the contents with ls
You will see the following result:
[nz@netezza bk1]$ ls
Netezza
The Netezza folder contains all backup sets for all appliances that use this backup
location. If you need to move the backup you always have to move the complete folder.
6. Enter the Netezza folder with cd Netezza and display the contents with ls :
You will see the following result:
[nz@netezza Netezza]$ ls
netezza
Under the main Netezza folder you will find subfolders for each Netezza host that is backed up to this location. In our
case we only have one Netezza host, called “netezza”. If your company had multiple Netezza hosts you would
find them here.
7. Enter the host folder with cd netezza and display the contents with ls :
[nz@netezza netezza]$ ls
LABDBQA
Below the host you will find all the databases of the host that have been backed up to this location, in our case the QA
database.
8. Enter the LABDBQA folder with cd LABDBQA and display the contents with ls :
[nz@netezza LABDBQA]$ ls
20111214173551
In this folder you can see all the backup sets that have been saved for this database. Each backup set corresponds to
one full backup plus an optional set of differential and cumulative backups. Note that we backed up the schema to a
different location, so we only have one backup set in here.
9. Enter the backup set folder with cd <your backupset id> and display the contents with ls :
[nz@netezza 20111214173551]$ ls
1 2
Under the backup set are folders for each backup that has been added to that backup set. “1” is always the full
backup, followed by additional differential or cumulative backups. We will later use these numbers to restore our
database to a specific backup of the backup set.
10. Enter the full backup with cd 1 and display the contents with ls :
[nz@netezza 1]$ ls
FULL
As expected it’s a full backup.
11. Enter the FULL folder with cd FULL and display the contents with ls :
[nz@netezza FULL]$ ls
data md
The data folder contains the user data; the md folder contains metadata including the schema definition of the
database.
12. Enter the data folder with cd data and display detailed information with ll :
[nz@netezza data]$ ll
total 1120
-rw------- 1 nz nz 338 Dec 14 12:36 206208.full.2.1
-rw------- 1 nz nz 451 Dec 14 12:36 206222.full.2.1
-rw------- 1 nz nz 1132410 Dec 14 12:36 206238.full.1.1
You can see that there are three data files: two small files for the REGION and NATION tables and a big file for the
CUSTOMER table.
13. Now switch to the md folder with cd ../md and display the contents with ls :
[nz@netezza md]$ ls
contents.txt loc1 schema.xml stream.0.1 stream.1.1.1.1
This folder contains information about the files that contribute to the backup and the schema definition of the database
in the schema.xml file.
14. Let’s have a quick look inside the schema.xml file:
[nz@netezza md]$ more schema.xml
You should see the following result:
As you see this file contains a full XML description of your database, including table definition, views, users etc.
15. Switch back to the lab folder with :
You should now have a pretty good understanding of the PureData System Backup process, in the next subchapter
we will demonstrate the restore functionality.
4.3 Restoring the Database
In this subchapter we will restore our database first to the first increment and then we will upgrade our database to the
next increment.
PureData System allows you to return a database to a specific increment in your backup set. If you want to do an
incremental restore the database must be locked. Tables can be queried but not changed until the database is in the
desired state and unlocked.
1. In the NZSQL session we will now drop the QA database and the QA user, first connect to the SYSTEM database:
2. Now drop the QA database:
3. Now drop the QA User:
4. Let’s verify that the QA database really has been deleted with l
You will see that the LABDBQA database has been removed:
more schema.xml
<ARCHIVE archive_major="4" archive_minor="0" product_ver="Release 6.1, Dev 2 [Bu
ild 16340]" catalog_ver="3.976" hostname="netezza" dataslices="4" createtime="20
11-12-14 17:35:57" lowercase="f" hpfrel="4.10" model="WMware" family="vmware" pl
atform="xs">
<OPERATION backupset="20111214173551" increment="1" predecessor="0" optype="0" d
bname="LABDBQA"/>
<DATABASE name="LABDBQA" owner="QAUSER" oid="206144" delimited="f" odelim="f" ch
arset="LATIN9" collation="BINARY" collecthist="t">
<STATISTICS column_count="15"/>
<TABLE ver="2" name="REGION" owner="ADMIN" oid="206208" delimited="f" odelim="f"
rowsecurity="f" origoid="206208">
<COLUMN name="R_REGIONKEY" owner="" oid="206209" delimited="f" odelim="t" seq="1
" type="INTEGER" typeno="23" typemod="-1" notnull="t"/>
...
[nz@netezza md]$ cd /labs/backupRestore/
LABDBQA(ADMIN)=> c SYSTEM
SYSTEM(ADMIN)=> DROP DATABASE LABDBQA;
SYSTEM(ADMIN)=> DROP USER QAUSER;
SYSTEM(ADMIN)=> \l
     List of databases
 DATABASE  |  OWNER
-----------+----------
 INZA      | ADMIN
 LABDB     | LABADMIN
 LABDBTEST | ADMIN
 MASTER_DB | ADMIN
 NZA       | ADMIN
 NZM       | ADMIN
 NZR       | ADMIN
 SYSTEM    | ADMIN
(8 rows)
5. In the OS session we will now restore the database to the first increment:
[nz@netezza md]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -increment 1 -lockdb true
Notice that we have specified the increment with the -increment option. In our case this is the first full backup in our
backup set.
We didn’t need to specify a backup set; by default the most recent one is used. Since we are not yet sure to which
increment we want to restore the database, we have to lock the database with the -lockdb option. This allows only
read-only access until the desired increment has been restored.
6. In the NZSQL session verify that the database has been recreated with \l
You will see the LABDBQA database and you can also see that the owner QAUSER has been recreated and is again
the database owner:
SYSTEM(ADMIN)=> \l
     List of databases
 DATABASE  |  OWNER
-----------+----------
 INZA      | ADMIN
 LABDB     | LABADMIN
 LABDBQA   | QAUSER
 LABDBTEST | ADMIN
 MASTER_DB | ADMIN
 NZA       | ADMIN
 NZM       | ADMIN
 NZR       | ADMIN
 SYSTEM    | ADMIN
(9 rows)
7. Connect to the LABDBQA database with
SYSTEM(ADMIN)=> \c labdbqa
You will see that the LABDBQA database is currently in read-only mode:
SYSTEM(ADMIN)=> \c labdbqa
NOTICE: Database 'LABDBQA' is available for read-only
You are now connected to database labdbqa.
8. Verify the contents of the REGION table from the LABDBQA database:
LABDBQA(ADMIN)=> select * from region;
You can see that we have returned the database to the point in time of the first full backup. There is no north or
south pole in the comments column:
LABDBQA(ADMIN)=> select * from region;
 R_REGIONKEY |          R_NAME           |          R_COMMENT
-------------+---------------------------+-----------------------------
           2 | sa                        | south america
           1 | na                        | north america
           3 | emea                      | europe, middle east, africa
           4 | ap                        | asia pacific
(4 rows)
9. Try to insert a row to verify the read-only mode:
LABDBQA(ADMIN)=> insert into region values (5, 'np', 'north pole');
As expected this is prohibited until we unlock the database:
LABDBQA(ADMIN)=> insert into region values (5, 'np', 'north pole');
ERROR: Database 'LABDBQA' is available for read-only (command ignored)
10. In the OS session we will now apply the next increment to the database:
[nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -increment next -lockdb true
You will see that we now apply the second increment to the database:
[nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -increment next -lockdb true
Restore of increment 2 from backupset 20111214173551 to database 'labdbqa' committed.
11. Since we do not need to load any more increments, we can now unlock the database:
[nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -unlockdb
After the database unlock we cannot apply any further increments to this database. To jump to a different increment
we would need to start from the beginning.
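The lock, apply, unlock sequence used in steps 5 to 11 can be written down as a small helper. The following Python sketch only assembles the command lines and does not run them. It uses the nzrestore options shown in this lab (-db, -dir, -increment, -lockdb, -unlockdb); the function name and its parameters are hypothetical and not part of any PureData System tooling.

```python
import shlex


def staged_restore_commands(db, dirs, extra_increments):
    """Build the nzrestore command lines for a staged incremental restore.

    Hypothetical helper: restore increment 1 with the database locked,
    apply `extra_increments` further increments, then unlock.
    """
    base = ["nzrestore", "-db", db, "-dir", *dirs]
    # First increment (the full backup), database stays read-only.
    commands = [base + ["-increment", "1", "-lockdb", "true"]]
    # Each following increment is applied on top while still locked.
    for _ in range(extra_increments):
        commands.append(base + ["-increment", "next", "-lockdb", "true"])
    # Unlocking ends the sequence; no further increments can be applied.
    commands.append(base + ["-unlockdb"])
    return [shlex.join(cmd) for cmd in commands]


if __name__ == "__main__":
    for line in staged_restore_commands("labdbqa", ["/tmp/bk1", "/tmp/bk2"], 1):
        print(line)
```

With one extra increment the helper prints the three commands used in this lab. Keeping the unlock as a separate final step mirrors the rule that an unlocked database cannot receive further increments.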
12. In the NZSQL session we have a look at the REGION table again:
LABDBQA(ADMIN)=> select * from region;
You can see that the north pole region, which was created before the first differential backup, is back:
LABDBQA(ADMIN)=> select * from region;
 R_REGIONKEY |          R_NAME           |          R_COMMENT
-------------+---------------------------+-----------------------------
           2 | sa                        | south america
           3 | emea                      | europe, middle east, africa
           4 | ap                        | asia pacific
           1 | na                        | north america
           5 | np                        | north pole
(5 rows)
13. Verify that the database is unlocked and ready for use again by adding a new set of customers to the
CUSTOMER table. In addition to the Automobile users we want to add the machinery users from the main
database:
LABDBQA(ADMIN)=> insert into customer select * from labdb..customer where c_mktsegment = 'MACHINERY';
You will see that we can now use the database in a normal fashion again.
14. We had around 30000 customers before. Verify that the new user set has been added successfully:
LABDBQA(ADMIN)=> select count(*) from customer;
You will see that we now have around 60000 rows in the CUSTOMER table.
You have now done a full restore cycle for the database and applied a full and an incremental backup to your database.
In the next chapter we will demonstrate single table restore and the ability to restore from any backup set.
4.4 Single Table Restore
In this chapter we will demonstrate the targeted restore of a subset of tables from a backup set. We will also demonstrate
how to restore from a specific older backup set.
1. First we will create a second backup set with the new customer data. In the OS session execute the following
command:
[nz@netezza backupRestore]$ nzbackup -db labdbqa -dir /tmp/bk1 /tmp/bk2
Since we didn’t specify anything else this is a full database backup. In this case PureData System automatically
creates a new backup set.
2. We want to return the CUSTOMER table to its previous condition, but we do not want to change the REGION or the
NATION tables. To do this we need to know the backup set ID of the previous backup set. Execute the
history command again:
[nz@netezza backupRestore]$ nzbackup -history
Database  Backupset      Seq # OpType Status    Date                Log File
--------- -------------- ----- ------ --------- ------------------- ------------------------------
(LABDBQA) 20111213225029 1     SCHEMA COMPLETED 2011-12-13 17:50:29 backupsvr.10724.2011-12-13.log
(LABDBQA) 20111214173551 1     FULL   COMPLETED 2011-12-14 12:35:51 backupsvr.21406.2011-12-14.log
(LABDBQA) 20111214173551 2     DIFF   COMPLETED 2011-12-14 12:44:53 backupsvr.21594.2011-12-14.log
LABDBQA   20111214205536 1     FULL   COMPLETED 2011-12-14 15:55:36 backupsvr.23621.2011-12-14.log
We now see three different backup sets: the schema-only backup, the two-step backup set and the new full backup set.
Remember the backup set ID of the two-step backup set.
3. To return only the CUSTOMER table to its condition in the second backup set we can do a table-level restore with
the following command:
[nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -backupset <your_backup_set_id> -tables CUSTOMER
This command will only restore the tables listed after the -tables option. If you want to restore multiple tables you can simply
write them in a list after the option.
We use the -backupset option to specify a specific backup set. Remember to replace the ID with the value you
retrieved with the history command.
Notice that the table name is case sensitive, in contrast to the database name.
You will get the following error:
[nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -backupset 20111214173551 -tables CUSTOMER
Error: Specify -droptables to force drop of tables in the -tables list.
PureData System cannot restore a table that already exists in the target database. You can either drop the table before
restoring it, or use the -droptables option.
4. Repeat the previous command with the added -droptables option:
[nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -backupset <your_backup_set_id> -tables CUSTOMER -droptables
You will get the following result:
[nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -backupset 20111214173551 -tables CUSTOMER -droptables
[Restore Server] : Dropping TABLE 'CUSTOMER'
Restore of increment 1 from backupset 20111214173551 to database 'labdbqa' committed.
Restore of increment 2 from backupset 20111214173551 to database 'labdbqa' committed.
You can see that the target table was dropped before the restore happened and that the specified backup set was used. Since
we didn’t specify an increment, the full backup set has been applied with all increments. The table is also
automatically unlocked after the restore process finishes.
5. Finally let’s verify that the restore worked as expected. In the NZSQL console count the rows of the customer table
again:
LABDBQA(ADMIN)=> select count(*) from customer;
You will see that we are back to around 30000 rows. This means that we have reverted the most recent changes:
LABDBQA(ADMIN)=> select count(*) from customer;
 COUNT
-------
 29752
(1 row)
In this chapter you have executed a single table restore and restored from a specific backup set.
5 Backing up User Data and Host Data
In the previous chapters you have learned to back up PureData System databases. This backs up all the database objects
that are used in the database and the user data from the S-Blades. These are the most critical components to back up in
a PureData System appliance. They allow you to recreate your databases even if you need to switch to a
completely new appliance.
But there are two other things that should be backed up:
• The global user information.
• The host data
In this chapter you will back up these components, so that you are able to revert your appliance to the exact
condition it was in before the backup.
5.1 User Data Backup
Users, groups, and privileges that are not used in databases are not backed up by the database backups of the previous
chapters. To be able to revert a PureData System appliance completely to its original condition you need a backup of the
global user information as well, to capture for example administrative users that are not part of any database.
This is done with the -users option of the nzbackup command:
1. In the OS session execute the following command:
[nz@netezza backupRestore]$ nzbackup -dir /tmp/bk1 /tmp/bk2 -users
You will see the following results:
[nz@netezza backupRestore]$ nzbackup -dir /tmp/bk1 /tmp/bk2 -users
Backup of users, groups, and global permissions completed successfully.
This will create a backup of all users, groups and privileges. Restoring it will not delete any users; it will only add
missing users, groups and privileges, so it doesn’t need to be fully synchronized with the database backups. You can
even restore an older user backup without fear of destroying information.
5.2 Host Data Backup
Until now we have always backed up database content. This is essentially catalog and user data that can be applied to a
new PureData System appliance. PureData System also provides the functionality to back up and restore host data. This
is essentially the data in the /nz/data and /export/nz directories of the host server.
There are two reasons for regularly backing up host data. The first is a host crash. If the S-Blades of your appliance are
intact but the host file system has been destroyed you could recreate all databases from the user backup. But in very
large systems this might take a long time. It is much easier to restore only the host information and reconnect to the
undamaged user tables on the S-Blades.
The second reason is that the host data contains configuration information, log and plan files etc. that are not saved by
the user backup. If you for example changed the system configuration, that information would be lost.
Therefore it is advisable to back up host data regularly.
1. To back up the host data execute the following command in the OS session:
[nz@netezza backupRestore]$ nzhostbackup /tmp/hostbackup
This will pause your system and copy the host files into the specified file name:
[nz@netezza backupRestore]$ nzhostbackup /tmp/hostbackup
Starting host backup. System state is 'online'.
Pausing the system ...
Checkpointing host catalog ...
Archiving system catalog ...
Resuming the system ...
Host backup completed successfully. System state is 'online'.
As you can see the system has been paused for the duration of the host backup but is automatically resumed after
the backup is successful.
Also notice that the host backup is done with the nzhostbackup command instead of the standard nzbackup
command.
2. Let’s have a look at the created file:
[nz@netezza backupRestore]$ ll /tmp
You will see the following results:
[nz@netezza backupRestore]$ ll /tmp
total 66160
drwxrwxrwx 3 nz   nz       4096 Dec 14 12:35 bk1
drwxrwxrwx 3 nz   nz       4096 Dec 14 12:35 bk2
drwxrwxrwx 3 nz   nz       4096 Dec 13 17:50 bkschema
-rw------- 1 nz   nz   67628809 Dec 14 16:37 hostbackup
drwxrwxr-x 2 nz   nz       4096 Dec 12 14:55 inza1.1.2
drwxrwxrwx 2 root root    16384 Jan 20  2011 lost+found
srwxrwxrwx 1 nz   nz          0 Dec 12 15:04 nzaeus__nzmpirun___
-rw-rw-r-- 1 nz   nz         33 Dec 12 15:04 nzaeus__nzmpirun_____Process
-rw-rw-r-- 1 nz   nz          0 Dec 12 15:05 nzcm.lock
drwx------ 2 nz   nz       4096 Dec 12 14:46 nzcm-temp_18uEeq
drwx------ 2 nz   nz       4096 Dec 12 12:55 nzcm-temp_rvAZXR
You can see that a backup file has been created. It’s a compressed file containing the system catalog and PureData
System host information. If possible, host backups should be done regularly. If an old host backup is
restored, there may be so-called orphaned tables. These are tables that were created after the host backup
and exist on the S-Blades but are no longer registered in the system catalog. During host restore PureData
System will create a script to clean up these orphaned tables, so they do not take up any disk space.
Congratulations, you have finished the Backup & Restore lab and have had a chance to see all components of a
successful PureData System backup strategy. The one missing component is that we only used file system backup. In
a real environment you would more likely use a Veritas or TSM backup server. For further information regarding the setup
steps please refer to the system administration guide.
© Copyright IBM Corporation 2011
All Rights Reserved.
IBM Canada
8200 Warden Avenue
Markham, ON
L6G 1C7
Canada
IBM, the IBM logo, ibm.com and Tivoli are trademarks or registered
trademarks of International Business Machines Corporation in the
United States, other countries, or both. If these and other
IBM trademarked terms are marked on their first occurrence in this
information with a trademark symbol (® or ™), these symbols indicate
U.S. registered or common law trademarks owned by IBM at the time
this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current
list of IBM trademarks is available on the Web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml
Other company, product and service names may be trademarks or service marks of others.
References in this publication to IBM products and services do not imply that IBM intends to make them available in all countries
in which
IBM operates.
No part of this document may be reproduced or transmitted in any form
without written permission from IBM Corporation.
Product data has been reviewed for accuracy as of the date of initial
publication. Product data is subject to change without notice. Any
statements regarding IBM’s future direction and intent are subject to
change or withdrawal without notice, and represent goals and objectives only.
THE INFORMATION PROVIDED IN THIS DOCUMENT IS
DISTRIBUTED “AS IS” WITHOUT ANY WARRANTY, EITHER
EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE OR NON-INFRINGEMENT.
IBM products are warranted according to the terms and conditions of
the agreements (e.g. IBM Customer Agreement, Statement of Limited
Warranty, International Program License Agreement, etc.) under which
they are provided.
IBM Software
Information Management
Query Optimization
Hands-On Lab
IBM PureData System for Analytics … Powered by Netezza Technology
Table of Contents
1 Introduction
1.1 Objectives
2 Generate Statistics
3 Identifying Join Problems
4 HTML Explain
1 Introduction
PureData System uses a cost-based optimizer to determine the best method for scan and join operations, join order, and data
movement between SPUs (redistribute or broadcast operations if necessary). For example the planner tries to avoid
redistributing large tables because of the performance impact. The optimizer can also dynamically rewrite queries to improve
query performance.
The optimizer takes a SQL query as input and creates a detailed execution or query plan for the database system. For the
optimizer to create the execution plan that results in the best performance, it must have up-to-date statistics.
You can use EXPLAIN, HTML (also known as bubble), and text plans to analyze how PureData System
executes a query.
Explain is a very useful tool to spot and identify performance problems, bad distribution keys, badly written SQL queries
and out-of-date statistics.
1.1 Objectives
During our POC we have identified a couple of very long running customer queries that have significantly worse performance
than the number of rows involved would suggest. In this lab we will use Explain functionality to identify the concrete bottlenecks
and if possible fix them to improve query performance.
2 Generate Statistics
Our first long running customer query returns the average order price by customer segment for a given year and order priority. It
joins the customer table for the market segment and the orders table for the total price of the order. Due to restrictive join
conditions it shouldn’t require too much processing time. But on our test systems it runs for a very long time. In this chapter we will
use PureData System Explain functionality to find out why this is the case.
The customer query in question:
SELECT c.c_mktsegment, AVG(o.o_totalprice)
FROM orders AS o, CUSTOMER as c
WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996 AND o.o_orderpriority = '1-URGENT'
GROUP BY c.c_mktsegment;
1. Connect to your PureData System image using PuTTY. Log in to 192.168.239.2 as user “nz” with password “nz”.
(192.168.239.2 is the default IP address for a local VM, the IP may be different for your Bootcamp)
2. First we will make sure that the system doesn’t run a different workload that could influence our tests. Use the following
nzsession command to verify that the system is free:
[nz@netezza ~]$ nzsession show
You should get a similar result to the following:
[nz@netezza ~]$ nzsession show
ID    Type User  Start Time              PID  Database State  Priority Name Client IP Client PID Command
----- ---- ----- ----------------------- ---- -------- ------ ------------- --------- ---------- ------------------------
16023 sql  ADMIN 29-Apr-11, 09:18:13 EDT 4795 SYSTEM   active normal        127.0.0.1 4794       SELECT session_id, clien
This result shows that there is currently only one session connected to the database, which is the nzsession command itself.
By default the database user in your VMware image is ADMIN. Executing this command before doing any performance
measurements ensures that other workloads are not influencing the performance of the system. You can also use the nzsession
command to abort bad or locked sessions.
3. After we have verified that the system is free we can start analyzing the query. Connect to the lab database with the following
command:
[nz@netezza ~]$ nzsql labdb labadmin
4. Let’s first have a look at the two tables and the WHERE conditions to get an idea of the row numbers involved. Our query
joins the CUSTOMER table without any WHERE condition applied to it and the ORDERS table that has two WHERE
conditions restricting it on the date and order priority. From the data distribution lab we know that the CUSTOMER table
has 150000 rows. To get the number of rows involved from the ORDERS table execute the following COUNT(*) command:
LABDB(LABADMIN)=> SELECT COUNT(*) FROM orders WHERE EXTRACT(YEAR FROM o_orderdate) = 1996 AND o_orderpriority = '1-URGENT';
You should get the following results:
LABDB(LABADMIN)=> SELECT COUNT(*) FROM orders WHERE EXTRACT(YEAR FROM o_orderdate) = 1996 AND o_orderpriority = '1-URGENT';
 COUNT
-------
 46014
(1 row)
So the ORDERS table has 46014 rows that fit the WHERE condition. We will use EXPLAIN functionality to check if the
available statistics allow the PureData System optimizer to estimate this correctly for its plan creation.
5. The PureData System optimizer uses statistics about the data in the system to estimate the number of rows that result
from WHERE conditions, joins, etc. Wrong approximations can lead to bad execution plans. For example a huge
result set could be broadcast for a join instead of doing a double redistribution. To see the estimated rows for the
WHERE conditions in our query run the following EXPLAIN command:
LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT COUNT(*) FROM orders WHERE EXTRACT(YEAR FROM o_orderdate) = 1996 AND o_orderpriority = '1-URGENT';
You will see a long output. Scroll up to your command and you should see the following:
explain verbose select count(*) from orders as o where EXTRACT(YEAR FROM o.o_orderdate) =
1996 and o.o_orderpriority = '1-URGENT';
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
-- Estimated Rows = 150, Width = 0, Cost = 0.0 .. 578.6, Conf = 64.0
Restrictions:
((O.O_ORDERPRIORITY = '1-URGENT'::BPCHAR) AND (DATE_PART('YEAR'::"VARCHAR",
O.O_ORDERDATE) = 1996))
Projections:
Node 2.
[SPU Aggregate]
...
The execution plan of this query consists of two nodes or snippets. First the table is scanned and the WHERE conditions
are applied, which can be seen in the Restrictions sub node. Since we use a COUNT(*) the Projections node is empty.
Then an Aggregation node is applied to count the rows that are returned by node 1.
When we look at the estimated number of rows we can see that it is way off the mark. The PureData System optimizer
estimates from its available statistics that only 150 rows are returned by the WHERE conditions. We have seen before that
in reality it’s 46014, or roughly 300 times as many.
6. One way to help the optimizer in its estimates is the collection of detailed statistics about the involved tables. Execute
the following command to generate detailed statistics about the ORDERS table:
LABDB(LABADMIN)=> generate statistics on orders;
Since generating full statistics involves a table scan this command may take some time to execute.
7. We will now check if generating statistics has improved the estimates. Execute the EXPLAIN command again:
LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT COUNT(*) FROM orders WHERE EXTRACT(YEAR FROM o_orderdate) = 1996 AND o_orderpriority = '1-URGENT';
Scroll up to your command and you should now see the following:
explain verbose select count(*) from orders as o where EXTRACT(YEAR FROM o.o_orderdate) =
1996 and o.o_orderpriority = '1-URGENT';
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
-- Estimated Rows = 3000, Width = 0, Cost = 0.0 .. 578.6, Conf = 64.0
Restrictions:
((O.O_ORDERPRIORITY = '1-URGENT'::BPCHAR) AND (DATE_PART('YEAR'::"VARCHAR",
O.O_ORDERDATE) = 1996))
Projections:
Node 2.
[SPU Aggregate]
...
As we can see the estimated rows of the SELECT query have improved drastically. The optimizer now assumes this WHERE
condition will apply to 3000 rows of the orders table. This is still significantly off the true number of about 46000, but it is a
factor of 20 better than the original estimate of 150.
Estimates are difficult to make, since the optimizer cannot perform the actual computation during planning. It relies on
current statistics about the involved columns. Statistics include min/max values, distinct values, numbers of null values etc.
Some of these statistics are collected on the fly but the most detailed statistics can be generated manually with the GENERATE
STATISTICS command. Generating full statistics after loading a table or changing its content significantly is one of the most
important administration tasks in PureData System. The PureData System appliance will automatically generate express
statistics after many tasks like load operations, and just-in-time statistics during planning. Nevertheless full statistics should be
generated on a regular basis.
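The improvement from generating statistics can be put into numbers. The following Python sketch only redoes the arithmetic with the row counts observed in this chapter (46014 actual rows, estimates of 150 and 3000); the function name is a hypothetical helper and says nothing about how the optimizer derives its estimates.

```python
def underestimate_factor(actual_rows, estimated_rows):
    """How many times larger the real result is than the optimizer's estimate."""
    return actual_rows / estimated_rows


ACTUAL_ROWS = 46014        # result of the SELECT COUNT(*) in step 4
ESTIMATE_BEFORE = 150      # Estimated Rows before GENERATE STATISTICS
ESTIMATE_AFTER = 3000      # Estimated Rows after GENERATE STATISTICS

# Before: the estimate is off by a factor of roughly 300.
print(round(underestimate_factor(ACTUAL_ROWS, ESTIMATE_BEFORE)))    # 307
# After: the estimate is still low, but only by a factor of about 15.
print(round(underestimate_factor(ACTUAL_ROWS, ESTIMATE_AFTER), 1))  # 15.3
# The estimate itself improved by a factor of 20.
print(ESTIMATE_AFTER / ESTIMATE_BEFORE)                             # 20.0
```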
3 Identifying Join Problems
In the last chapter we have taken a first look at the tables involved in our join query and have improved optimizer estimates by
generating statistics on the involved tables. Now we will have a look at the complete execution plan and we will have a specific
look at the distribution and involved join.
In our example we have a query that doesn’t finish in a reasonable amount of time. It takes much longer than you would
expect from the involved data sizes. We will now analyze why this is the case.
1. Let’s analyze the execution plan for this query using the EXPLAIN VERBOSE command:
LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM
orders AS o, CUSTOMER as c WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996 AND
o.o_orderpriority = '1-URGENT' GROUP BY c.c_mktsegment;
You should see the following results (scroll up to your query):
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan table "CUSTOMER" as "C" {(C.C_CUSTKEY)}]
-- Estimated Rows = 150000, Width = 10, Cost = 0.0 .. 90.5, Conf = 100.0
Projections:
1:C.C_MKTSEGMENT
[SPU Broadcast]
Node 2.
[SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
-- Estimated Rows = 3000, Width = 8, Cost = 0.0 .. 578.6, Conf = 64.0
Restrictions:
((O.O_ORDERPRIORITY = '1-URGENT'::BPCHAR) AND (DATE_PART('YEAR'::"VARCHAR", O.O_ORDERDATE) =
1996))
Projections:
1:O.O_TOTALPRICE
Node 3.
[SPU Nested Loop Stream "Node 2" with Temp "Node 1" {(O.O_ORDERKEY)}]
-- Estimated Rows = 450000007, Width = 18, Cost = 1048040.0 .. 7676127.0, Conf = 64.0
Restrictions:
't'::BOOL
Projections:
1:C.C_MKTSEGMENT 2:O.O_TOTALPRICE
Node 4.
[SPU Group {(C.C_MKTSEGMENT)}]
-- Estimated Rows = 100, Width = 18, Cost = 1048040.0 .. 7732377.0, Conf = 0.0
Projections:
1:C.C_MKTSEGMENT 2:O.O_TOTALPRICE
[SPU Distribute on {(C.C_MKTSEGMENT)}]
[SPU Merge Group]
Node 5.
[SPU Aggregate {(C.C_MKTSEGMENT)}]
-- Estimated Rows = 100, Width = 26, Cost = 1048040.0 .. 7732377.0, Conf = 0.0
Projections:
1:C.C_MKTSEGMENT 2:(SUM(O.O_TOTALPRICE) / "NUMERIC"(COUNT(O.O_TOTALPRICE)))
[SPU Return]
[Host Return]
... Removed Plan Text ...
2. First try to answer the following questions through the execution plan yourself. Take your time. We will walk through the
answers after that.
Question Answer
a. Which columns of table CUSTOMER are used in further computations?
b. Is table CUSTOMER redistributed, broadcast or can it be joined locally?
c. Is table ORDERS redistributed, broadcast or can it be joined locally?
d. In which node are the WHERE conditions applied and how many rows does PureData System expect to fulfill the WHERE condition?
e. What kind of join takes place and in which node?
f. What is the number of estimated rows for the join?
g. What is the most expensive node and why?
Hint: a stream operation in PureData System Explain is a join whose output isn’t persisted on disk but streamed to further
computation nodes or snippets.
3. So let’s walk through the questions:
Node 1.
[SPU Sequential Scan table "CUSTOMER" as "C" {}]
-- Estimated Rows = 150000, Width = 10, Cost = 0.0 .. 90.5, Conf = 100.0
Projections:
1:C.C_MKTSEGMENT
[SPU Broadcast]
a. Which columns of table CUSTOMER are used in further computations?
The first node in the execution plan does a sequential scan of the CUSTOMER table on the SPUs. It estimates that 150000
rows are returned, which we know is the number of rows in the CUSTOMER table.
The statement that tells us which columns are used in further computations is the “Projections:” clause. We can see that
only the C_MKTSEGMENT column is carried on from the CUSTOMER table. All other columns are thrown away. Since
C_MKTSEGMENT is a CHAR(10) column the returned result set has a width of 10.
b. Is table CUSTOMER redistributed, broadcast or can it be joined locally?
During the scan the table is broadcast to the other SPUs. This means that a complete CUSTOMER table is assembled on the
host and broadcast to each SPU for further computation of the query. This may seem surprising at first since we have a
substantial number of rows. But since the width of the result set is only 10 we are talking about 150000 rows * 10 bytes =
1.5 MB. This is almost nothing for a warehousing system.
Node 2.
[SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
-- Estimated Rows = 3000, Width = 8, Cost = 0.0 .. 578.6, Conf = 64.0
Restrictions:
((O.O_ORDERPRIORITY = '1-URGENT'::BPCHAR) AND (DATE_PART('YEAR'::"VARCHAR",
O.O_ORDERDATE) = 1996))
Projections:
1:O.O_TOTALPRICE
c. Is Table Orders redistributed, broadcast or can it be joined locally?
The second node of the execution plan does a scan of the ORDERS table. One column O_TOTALPRICE is projected and
used in further computations. We cannot see any distribution or broadcast clauses so this table can be joined locally. This is
true because the CUSTOMER table is broadcast to all SPUs. If one table of a join is broadcast the other table doesn’t need
any redistribution.
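The reasoning in b and c can be illustrated with a small sketch. The Python code below is a toy model, not PureData System code: the lists standing in for data slices, the function names and the sample rows are all hypothetical. It redoes the size arithmetic from the plan (150000 rows with a width of 10 bytes) and shows that once the small table is copied to every slice, each slice can join its own ORDERS rows without moving them.

```python
def broadcast_bytes(rows, width_bytes):
    """Approximate volume of the projected rows that gets broadcast."""
    return rows * width_bytes


def join_after_broadcast(order_slices, customers, matches):
    """Join each slice's ORDERS rows against a full local CUSTOMER copy."""
    joined = []
    for slice_rows in order_slices:        # ORDERS rows never leave their slice
        broadcast_copy = list(customers)   # every slice receives the whole table
        joined.extend(
            (o, c) for o in slice_rows for c in broadcast_copy if matches(o, c)
        )
    return joined


# 150000 rows * 10 bytes, as estimated for Node 1 of the plan.
print(broadcast_bytes(150_000, 10) / 1_000_000, "MB")   # 1.5 MB

# Toy data: (order name, custkey) spread over two slices, (custkey, segment).
slices = [[("o1", 1)], [("o2", 2), ("o3", 1)]]
customers = [(1, "AUTOMOBILE"), (2, "MACHINERY")]
print(join_after_broadcast(slices, customers, lambda o, c: o[1] == c[0]))
```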
d. In which node are the WHERE conditions applied and how many rows does PureData System expect to fulfill the where
condition?
We can see in the “Restrictions” clause that the WHERE conditions of our query are applied during the second node as well.
This should be clear since both of the WHERE conditions are applied to the ORDERS table and they can be executed
during the scan of the ORDERS table. As we can see in the “Estimated Rows” clause, the optimizer estimates a returned
set of 3000 rows which we know is not perfectly true since in reality 46014 rows are returned from this table.
e. What kind of join takes place and in which node?
The third node of our execution plan contains the join between the two tables. It is a Nested Loop Join, which means that
every row of the first join set is compared to each row of the second join set. If the join condition holds true the joined row is
added to the result set. This can be a very efficient join for small tables, but for large tables its cost grows with the product of
the two row counts, so it is in general slower than for example a Hash Join. The Hash Join, however, cannot be used for inequality
join conditions, floating point join keys etc.
f. What is the number of estimated rows for the join?
We can see in the Estimated Rows clause that the optimizer estimates this join node to return roughly 450 million rows, which is
the number of rows from the first node (150000) times the number of rows from the second node (3000).
g. What is the most expensive node and why?
As we can see from the Cost clause, the optimizer estimates that the join has a cost in the range 1048040.0 .. 7676127.0.
This is roughly 2000 to 14000 times higher than the cost that was estimated for Node 1 and Node 2. Nodes 4 and 5, which
group and aggregate the result set, do not add much cost either. So our performance problems clearly originate in the join in
node 3.
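The difference between the two join strategies discussed in answer e can be made concrete with a small Python sketch. This is only an illustration, not appliance code: the function names, the toy tables and the work counters are assumptions made up for this example.

```python
# Toy comparison of a nested loop join and a hash join.
# All names and data are illustrative; this is not how the appliance is implemented.

def nested_loop_join(left, right, predicate):
    """Compare every left row with every right row."""
    comparisons = 0
    matches = []
    for l_row in left:
        for r_row in right:
            comparisons += 1
            if predicate(l_row, r_row):
                matches.append((l_row, r_row))
    return matches, comparisons


def hash_join(left, right, left_key, right_key):
    """Build a hash table on the right input, then probe it once per left row."""
    table = {}
    for r_row in right:
        table.setdefault(right_key(r_row), []).append(r_row)
    probes = 0
    matches = []
    for l_row in left:
        probes += 1
        for r_row in table.get(left_key(l_row), []):
            matches.append((l_row, r_row))
    return matches, probes


customers = [(custkey, "SEG%d" % (custkey % 5)) for custkey in range(100)]
orders = [(orderkey, orderkey % 100) for orderkey in range(300)]  # (orderkey, custkey)

nl_matches, nl_work = nested_loop_join(customers, orders, lambda c, o: c[0] == o[1])
hj_matches, hj_work = hash_join(customers, orders, lambda c: c[0], lambda o: o[1])

print(len(nl_matches), nl_work)  # 300 matches after 30000 comparisons
print(len(hj_matches), hj_work)  # 300 matches after 100 probes
```

Both functions return the same matches, but the nested loop needs one comparison per pair of rows, while the hash join only needs one probe per row of the streamed input. The hash join relies on an equality condition, which is why it is not available for the join without a join key.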
So what is happening here? If we take a look at the query, we can assume that it is intended to compute the average order
cost per market segment. This means we should join all customers to their corresponding order rows. But for this to happen
we would need a join condition that joins the customer table and the orders table on the customer key. Instead the query
performs a Cartesian Join, joining each customer row to each orders row. This is a very work-intensive query that results in
the behavior we have seen. The joined result set becomes huge, and it even returns results that cannot have been
intended by the author of the query.
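The effect of the missing join condition can be reproduced on a tiny scale. The following sketch uses Python's built-in sqlite3 module as a stand-in for PureData System; the three customers and four orders are invented sample data, and only the column names are borrowed from the lab schema.

```python
import sqlite3

# SQLite is used here only as a convenient stand-in database; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (c_custkey INTEGER, c_mktsegment TEXT);
CREATE TABLE orders (o_orderkey INTEGER, o_custkey INTEGER, o_totalprice REAL);
INSERT INTO customer VALUES (1, 'BUILDING'), (2, 'MACHINERY'), (3, 'FURNITURE');
INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 300.0), (12, 2, 50.0), (13, 3, 70.0);
""")

# Without a join condition every order is paired with every customer.
cartesian_rows = conn.execute(
    "SELECT COUNT(*) FROM orders AS o, customer AS c").fetchone()[0]
cartesian_avg = dict(conn.execute(
    "SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM orders AS o, customer AS c "
    "GROUP BY c.c_mktsegment"))

# With the join condition each order is paired only with its own customer.
joined_rows = conn.execute(
    "SELECT COUNT(*) FROM orders AS o, customer AS c "
    "WHERE o.o_custkey = c.c_custkey").fetchone()[0]
joined_avg = dict(conn.execute(
    "SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM orders AS o, customer AS c "
    "WHERE o.o_custkey = c.c_custkey GROUP BY c.c_mktsegment"))

print(cartesian_rows, joined_rows)  # 12 4
print(cartesian_avg["BUILDING"], joined_avg["BUILDING"])  # 130.0 200.0
```

In the Cartesian variant every market segment reports the same value, the average over all orders, which is the kind of unintended result described above.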
4. So how do we fix this? By adding a join condition to the query that makes sure that customers are only joined to their
orders. This additional join condition is O.O_CUSTKEY=C.C_CUSTKEY. Execute the following EXPLAIN command for
the modified query.
Node 3.
[SPU Nested Loop Stream "Node 2" with Temp "Node 1" {(O.O_ORDERKEY)}]
-- Estimated Rows = 450000007, Width = 18, Cost = 1048040.0 .. 7676127.0, Conf = 64.0
Restrictions:
't'::BOOL
Projections:
1:C.C_MKTSEGMENT 2:O.O_TOTALPRICE
You should see the following results. Scroll up to your query to see the scan and join nodes.
As you can see, there have been some changes to the execution plan. The ORDERS table is now scanned first and
distributed on the customer key. The CUSTOMER table is already distributed on the customer key, so no redistribution is
needed for it. Both tables are then joined in node 3 through a Hash Join on the customer key.
The estimated number of rows is now 150000, the same as the number of customers. Since we have a 1:n relationship
between customers and orders, this is as we would expect. Also the estimated cost of node 3 has come down significantly to
578.6 .. 746.7.
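Why redistributing ORDERS on the customer key makes a local join possible can be sketched in a few lines of Python. The number of data slices, the modulo function used as a hash, and the sample keys are all assumptions for illustration; the appliance uses its own hash function.

```python
N_SLICES = 4  # hypothetical number of data slices


def slice_for(key):
    # Stand-in for the appliance's hash function: same key, same slice.
    return key % N_SLICES


customers = [{"c_custkey": k} for k in range(1, 21)]
orders = [{"o_orderkey": 1000 + i, "o_custkey": (i * 7) % 20 + 1} for i in range(50)]

# CUSTOMER is already distributed on C_CUSTKEY.
customer_slices = {s: [] for s in range(N_SLICES)}
for c in customers:
    customer_slices[slice_for(c["c_custkey"])].append(c)

# ORDERS is redistributed on O_CUSTKEY, mirroring "SPU Distribute on {(O.O_CUSTKEY)}".
order_slices = {s: [] for s in range(N_SLICES)}
for o in orders:
    order_slices[slice_for(o["o_custkey"])].append(o)

# Every order now finds its customer on the same slice, so each slice can join locally.
local_matches = 0
for s in range(N_SLICES):
    keys_here = {c["c_custkey"] for c in customer_slices[s]}
    local_matches += sum(1 for o in order_slices[s] if o["o_custkey"] in keys_here)

print(local_matches, len(orders))  # 50 50
```

Because both tables are placed by the same function of the same key, no customer row has to be sent to another slice for the join.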
5. Let’s make sure that the query performance has indeed improved. Switch on the display of elapsed query time with the
following command:
If you want you can later switch off the elapsed time display by executing the same command again. It is a toggle.
6. Now execute our modified query:
You should see the following results:
LABDB(LABADMIN)=> \time
LABDB(LABADMIN)=> SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM orders AS o,
CUSTOMER as c WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996 AND o.o_orderpriority =
'1-URGENT' AND o.o_custkey = c.c_custkey GROUP BY c.c_mktsegment;
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
-- Estimated Rows = 3000, Width = 12, Cost = 0.0 .. 578.6, Conf = 64.0
Restrictions:
((O.O_ORDERPRIORITY = '1-URGENT'::BPCHAR) AND (DATE_PART('YEAR'::"VARCHAR", O.O_ORDERDATE) =
1996))
Projections:
1:O.O_TOTALPRICE 2:O.O_CUSTKEY
Cardinality:
O.O_CUSTKEY 3.0K (Adjusted)
[SPU Distribute on {(O.O_CUSTKEY)}]
[HashIt for Join]
Node 2.
[SPU Sequential Scan table "CUSTOMER" as "C" {(C.C_CUSTKEY)}]
-- Estimated Rows = 150000, Width = 14, Cost = 0.0 .. 90.5, Conf = 100.0
Projections:
1:C.C_MKTSEGMENT 2:C.C_CUSTKEY
Node 3.
[SPU Hash Join Stream "Node 2" with Temp "Node 1" {(C.C_CUSTKEY,O.O_CUSTKEY)}]
-- Estimated Rows = 150000, Width = 18, Cost = 578.6 .. 746.7, Conf = 51.2
Restrictions:
(C.C_CUSTKEY = O.O_CUSTKEY)
Projections:
1:C.C_MKTSEGMENT 2:O.O_TOTALPRICE
Cardinality:
O.O_CUSTKEY 100 (Adjusted)
LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM
orders AS o, CUSTOMER as c WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996 AND
o.o_orderpriority = '1-URGENT' AND o.o_custkey = c.c_custkey GROUP BY
c.c_mktsegment;
Before we made our changes the query took so long that we couldn’t wait for it to finish. After our changes the execution
time has improved to slightly more than a second. In this relatively simple case we might have been able to pinpoint the
problem through analyzing the SQL on its own. But this can be almost impossible for complicated multi-join queries that are
often used in warehousing. Reporting and BI tools tend to create very complicated portable SQL as well. In these cases
EXPLAIN can be a valuable tool to pinpoint the problem.
4 HTML Explain
In this section we will look at the HTML plangraph for the customer query that we just fixed. Besides the text descriptions of the
execution plan we used in the previous chapter, PureData System provides the ability to generate a graphical query tree as well.
This is done with the help of HTML, so plangraph files can be created and viewed in your internet browser. PureData System
can be configured to save an HTML plangraph or plantext file for every executed SQL query. But in this chapter we will use the
basic EXPLAIN PLANGRAPH command and use copy and paste to export the file to your host computer.
1. Enter the query with the keyword explain plangraph to generate the HTML plangraph:
You will get a long print output of the HTML file content on your screen:
LABDB(LABADMIN)=> SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM orders AS o, CUSTOMER as c
WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996 AND o.o_orderpriority = '1-URGENT' AND
o.o_custkey = c.c_custkey GROUP BY c.c_mktsegment;
C_MKTSEGMENT | AVG
--------------+---------------
HOUSEHOLD | 150196.009267
BUILDING | 151275.977882
AUTOMOBILE | 151488.825830
MACHINERY | 151348.971079
FURNITURE | 150998.129771
(5 rows)
Elapsed time: 0m1.129s
LABDB(ADMIN)=> EXPLAIN PLANGRAPH SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM
orders AS o, CUSTOMER as c WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996 AND
o.o_orderpriority = '1-URGENT' AND o.o_custkey = c.c_custkey GROUP BY
c.c_mktsegment;
Next open your host computer’s text editor. If your workstation runs Windows, open Notepad; if you use a Linux desktop, use the
default text editor, such as KEdit or gedit. Copy the output of the EXPLAIN PLANGRAPH command from your PuTTY window into
the editor. Make sure that you only copy the HTML file from the <html… start tag to the </html> end tag.
2. Save the file as “explain.html” on your desktop.
LABDB(LABADMIN)=> EXPLAIN PLANGRAPH SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM orders AS o, CUSTOMER as c WHERE
EXTRACT(YEAR FROM o.o_orderdate) = 1996 AND o.o_orderpriority = '1-URGENT' AND o.o_custkey = c.c_custkey GROUP BY
c.c_mktsegment;
NOTICE: QUERY PLAN:
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Generator" content="Netezza Performance Server">
<meta http-equiv="Author" content="Babu Tammisetti <btammisetti@netezza.com>">
<style>
v:* {behavior:url(#default#VML);}
</style>
</head>
<body lang="en-US">
<pre style="font:normal 68% verdana,arial,helvetica;background:#EEEEEE;margin-top:1em;margin-bottom:1em;margin-left:0px;padding:5pt;">
EXPLAIN PLANGRAPH SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM orders AS o, CUSTOMER as c WHERE EXTRACT(YEAR FROM
o.o_orderdate) = 1996 AND o.o_orderpriority = '1-URGENT' AND o.o_custkey = c.c_custkey GROUP BY c.c_mktsegment;
</pre>
<v:textbox style="position:absolute;margin-left:230pt;margin-top:19pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">AGG<br/>r=100 w=26 s=2.5KB</p></v:textbox>
<v:oval style="position:absolute;margin-left:231pt;margin-top:15pt;width:78pt;height:25pt;z-index:9;"></v:oval>
<v:textbox style="position:absolute;margin-left:230pt;margin-top:0pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">snd,ret</p></v:textbox>
<v:textbox style="position:absolute;margin-left:230pt;margin-top:54pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">GROUP<br/>r=100 w=18 s=1.8KB</p></v:textbox>
<v:oval style="position:absolute;margin-left:231pt;margin-top:50pt;width:78pt;height:25pt;z-index:9;"></v:oval>
<v:line style="position:absolute;z-index:8;" from="270pt,27pt" to="270pt,62pt"/>
<v:textbox style="position:absolute;margin-left:233pt;margin-top:42pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">dst,m-grp</p></v:textbox>
<v:textbox style="position:absolute;margin-left:230pt;margin-top:89pt;width:80pt;height:31pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">HASHJOIN<br/>r=150.0K w=18 s=2.6MB<br/>(C_CUSTKEY =
O_CUSTKEY)</p></v:textbox>
<v:oval style="position:absolute;margin-left:231pt;margin-top:85pt;width:78pt;height:31pt;z-index:9;"></v:oval>
<v:line style="position:absolute;z-index:8;" from="270pt,62pt" to="270pt,100pt"/>
<v:textbox style="position:absolute;margin-left:190pt;margin-top:124pt;width:80pt;height:31pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">SEQSCAN<br/>r=150.0K w=14 s=2.0MB<br/>C</p></v:textbox>
<v:oval style="position:absolute;margin-left:191pt;margin-top:120pt;width:78pt;height:31pt;z-index:9;"></v:oval>
<v:line style="position:absolute;z-index:8;" from="270pt,97pt" to="230pt,135pt"/>
<v:textbox style="position:absolute;margin-left:270pt;margin-top:124pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">HASH<br/>r=3.0K w=12 s=35.2KB</p></v:textbox>
<v:oval style="position:absolute;margin-left:271pt;margin-top:120pt;width:78pt;height:25pt;z-index:9;"></v:oval>
<v:line style="position:absolute;z-index:8;" from="270pt,97pt" to="310pt,132pt"/>
<v:textbox style="position:absolute;margin-left:253pt;margin-top:112pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">dst{(O_CUSTKEY)}</p></v:textbox>
<v:textbox style="position:absolute;margin-left:270pt;margin-top:159pt;width:80pt;height:31pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">SEQSCAN<br/>r=3.0K w=12 s=35.2KB<br/>O</p></v:textbox>
<v:oval style="position:absolute;margin-left:271pt;margin-top:155pt;width:78pt;height:31pt;z-index:9;"></v:oval>
<v:line style="position:absolute;z-index:8;" from="310pt,132pt" to="310pt,170pt"/>
</body>
</html>
EXPLAIN
3. Now double-click “explain.html” on your desktop. On Windows, make sure to open it with Internet Explorer, since this
will result in the best output (the plangraph is drawn with VML markup).
You can see a graphical representation of the query we analyzed before. The left leg of the tree is the scan node of the
CUSTOMER table C; the right leg contains a scan of the ORDERS table O and a node hashing the result set from ORDERS in
preparation for the HASHJOIN node, which joins the result sets of the two table scans on the customer key. After the join the
result is fed into a GROUP node and an aggregation node that computes the average total price, before being returned to the
caller.
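If no suitable browser is at hand, the node labels can also be pulled out of the saved HTML as text. The following Python sketch does this with a regular expression over the label lines copied from the plangraph output in this section. Reading r, w and s as estimated rows, row width and data size is an assumption; it is consistent with the text plan (for example r=150.0K and w=18 for the hash join), but it is not taken from product documentation.

```python
import re

# Label lines copied from the plangraph output; the parsing approach is illustrative only.
plangraph_html = '''
<p style="text-align:center;font-size:6pt;">AGG<br/>r=100 w=26 s=2.5KB</p>
<p style="text-align:center;font-size:6pt;">GROUP<br/>r=100 w=18 s=1.8KB</p>
<p style="text-align:center;font-size:6pt;">HASHJOIN<br/>r=150.0K w=18 s=2.6MB<br/>(C_CUSTKEY = O_CUSTKEY)</p>
<p style="text-align:center;font-size:6pt;">SEQSCAN<br/>r=150.0K w=14 s=2.0MB<br/>C</p>
<p style="text-align:center;font-size:6pt;">HASH<br/>r=3.0K w=12 s=35.2KB</p>
<p style="text-align:center;font-size:6pt;">SEQSCAN<br/>r=3.0K w=12 s=35.2KB<br/>O</p>
'''

# Node name, then r=<rows> w=<width> s=<size>, terminated by <br/> or </p>.
NODE = re.compile(r">([A-Z]+)<br/>r=(\S+) w=(\d+) s=(\S+?)(?:<br/>|</p>)")

nodes = [
    (m.group(1), m.group(2), int(m.group(3)), m.group(4))
    for m in NODE.finditer(plangraph_html)
]

for name, rows, width, size in nodes:
    print(name, rows, width, size)
```

The list is printed top-down, from the final aggregation to the two table scans.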
A graphical representation of the execution plan can be valuable for complicated multi-join queries to get an overview of the join.
Congratulations, in this lab you have used the PureData System EXPLAIN functionality to analyze a query.
© Copyright IBM Corporation 2011
All Rights Reserved.
IBM Canada
8200 Warden Avenue
Markham, ON
L6G 1C7
Canada
IBM, the IBM logo, ibm.com and Tivoli are trademarks or registered
trademarks of International Business Machines Corporation in the
United States, other countries, or both. If these and other
IBM trademarked terms are marked on their first occurrence in this
information with a trademark symbol (® or ™), these symbols indicate
U.S. registered or common law trademarks owned by IBM at the time
this information was published. Such trademarks may also be
registered or common law trademarks in other countries. A current list
of IBM trademarks is available on the Web at “Copyright and
trademark information” at ibm.com/legal/copytrade.shtml
Other company, product and service names may be trademarks or
service marks of others.
References in this publication to IBM products and services do not
imply that IBM intends to make them available in all countries in which
IBM operates.
No part of this document may be reproduced or transmitted in any form
without written permission from IBM Corporation.
Product data has been reviewed for accuracy as of the date of initial
publication. Product data is subject to change without notice. Any
statements regarding IBM’s future direction and intent are subject to
change or withdrawal without notice, and represent goals and
objectives only.
THE INFORMATION PROVIDED IN THIS DOCUMENT IS
DISTRIBUTED “AS IS” WITHOUT ANY WARRANTY, EITHER
EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE OR NON-INFRINGEMENT.
IBM products are warranted according to the terms and conditions of
the agreements (e.g. IBM Customer Agreement, Statement of Limited
Warranty, International Program License Agreement, etc.) under which
they are provided.
Optimization Objects
Hands-On Lab
IBM PureData System for Analytics … Powered by Netezza Technology
Table of Contents
1 Introduction
1.1 Objectives
2 Materialized Views
2.1 Wide Tables
2.2 Lookup of small set of rows
3 Cluster Based Tables (CBT)
3.1 Cluster Based Table Usage
3.2 Cluster Based Table Maintenance
1 Introduction
A PureData System appliance is designed to provide excellent performance in most cases without any specific tuning or index
creation. One of the key technologies used to achieve this is zone maps: automatically computed and maintained records of
the minimum and maximum column values stored in each extent of a database table.
In general, data is loaded into data warehouses ordered by the time dimension; therefore zone maps have the biggest
performance impact on queries that restrict the time dimension.
This approach works well for most situations, but PureData System provides additional functionality to enhance specific
workloads, which we will use in this chapter.
We will first use materialized views to enhance the performance of queries against wide tables and of queries that only
look up small subsets of rows.
Then we will use Cluster Based Tables to enhance the performance of queries that use multiple lookup dimensions.
1.1 Objectives
In the last couple of labs we have recreated a customer database in our PureData System. We have picked distribution
keys, loaded the data and made some first performance investigations. In this lab we will take a deeper look at some customer
queries and try to enhance their performance by tuning the system.
Figure 1 LABDB database
2 Materialized Views
A materialized view is a view of a database table that projects a subset of the base table’s columns and can be sorted on a
specific set of the projected columns. When a materialized view is created, the sorted projection of the base table’s data is
stored in a materialized table on disk.
Materialized views reduce the width of data being scanned in a base table. They are beneficial for wide tables that contain many
columns (e.g. 50-500 columns) where typical queries only reference a small subset of the columns.
Materialized views also provide fast, single or few record lookup operations. The thin materialized view is automatically
substituted by the optimizer for the base table, allowing faster response, particularly for shorter tactical queries that examine only
a small segment of the overall database table.
2.1 Wide Tables
In our customer scenario we have a couple of queries that do some basic computations on the LINEITEM table but only touch a
small number of columns of the table.
1. Connect to your Netezza image using PuTTY. Log in to 192.168.239.2 as user “nz” with password “nz”. (192.168.239.2 is
the default IP address for a local VM; the IP may be different for your Bootcamp.)
2. Start nzsql and connect to LABDB as user LABADMIN.
3. The first thing we need to do is make sure table statistics have been generated, so that the EXPLAIN commands we
will be looking at report more accurate estimated query costs. Please generate statistics for the
ORDERS and LINEITEM tables using the following commands.
4. The following query computes the total quantity of items shipped and their average tax rate for a given month, in this
case the fourth month, April. Execute the following query:
Your results should look similar to the following:
Notice the EXTRACT(MONTH FROM L_SHIPDATE) expression. The EXTRACT function can be used to retrieve parts of a
date or time column such as YEAR, MONTH or DAY.
5. Now let’s have a look at the cost of this query. To get the projected cost from the Optimizer we use the following
EXPLAIN VERBOSE command:
You will see a long output on the screen. Scroll up till you reach the command you just executed. You should see something
similar to the following:
LABDB(LABADMIN)=> GENERATE STATISTICS ON ORDERS;
LABDB(LABADMIN)=> GENERATE STATISTICS ON LINEITEM;
[nz@netezza labs]$ nzsql LABDB LABADMIN
LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT SUM(L_QUANTITY), AVG(L_TAX) FROM LINEITEM WHERE
EXTRACT(MONTH FROM L_SHIPDATE) = 4;
LABDB(LABADMIN)=> SELECT SUM(L_QUANTITY), AVG(L_TAX) FROM LINEITEM WHERE EXTRACT(MONTH
FROM L_SHIPDATE) = 4;
SUM | AVG
-------------+----------
13136228.00 | 0.039974
(1 row)
LABDB(LABADMIN)=> SELECT SUM(L_QUANTITY), AVG(L_TAX) FROM LINEITEM WHERE
EXTRACT(MONTH FROM L_SHIPDATE) = 4;
Notice the highlighted cost associated with the table scan. In our example it is a value of over 2400.
6. Since this query is run very frequently, we want to enhance the scanning performance. And since it only uses 3 of the 16
LINEITEM columns, we have decided to create a materialized view covering these three columns. This should
significantly increase scan speed since only a small subset of the data needs to be scanned. To create the materialized
view THINLINEITEM execute the following command:
This command can take several minutes since we effectively create a copy of the three columns of the table.
7. Repeat the EXPLAIN call from step 5. Execute the following command:
Again scroll up till you reach your command. The results should now look like the following:
LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT SUM(L_QUANTITY), AVG(L_TAX) FROM LINEITEM WHERE
EXTRACT(MONTH FROM L_SHIPDATE) = 4;
LABDB(LABADMIN)=> CREATE MATERIALIZED VIEW THINLINEITEM AS SELECT L_QUANTITY, L_TAX,
L_SHIPDATE FROM LINEITEM;
QUERY SQL:
EXPLAIN VERBOSE SELECT SUM(L_QUANTITY), AVG(L_TAX) FROM LINEITEM WHERE EXTRACT(MONTH
FROM L_SHIPDATE) = 4;
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan table "LINEITEM" {(LINEITEM.L_ORDERKEY)}]
-- Estimated Rows = 60012, Width = 16, Cost = 0.0 .. 2417.5, Conf = 80.0
Restrictions:
(DATE_PART('MONTH'::"VARCHAR", LINEITEM.L_SHIPDATE) = 4)
Projections:
1:LINEITEM.L_QUANTITY 2:LINEITEM.L_TAX
Node 2.
[SPU Aggregate]
-- Estimated Rows = 1, Width = 32, Cost = 2440.0 .. 2440.0, Conf = 0.0
Projections:
1:SUM(LINEITEM.L_QUANTITY)
2:(SUM(LINEITEM.L_TAX) / "NUMERIC"(COUNT(LINEITEM.L_TAX)))
[SPU Return]
[HOST Merge Aggs]
[Host Return]
...
Notice that the PureData System Optimizer has automatically replaced the LINEITEM table with the view THINLINEITEM.
We didn’t need to make any changes to the query. Also notice that the expected cost of the scan has been reduced to about
174, which is less than 10% of the original.
As you have seen, in cases where you have wide database tables with queries that only touch a subset of the columns, a
materialized view of the hot columns can significantly increase performance for these queries, without any changes to the
executed queries.
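The size of the improvement can be checked with a little arithmetic on the figures from the two plans in this section (scan costs 2417.5 and 174.1, and the TotalPages values reported for the base table and the materialized view). The variable names below are only illustrative.

```python
# Figures taken from the EXPLAIN VERBOSE outputs in this section.
base_scan_cost = 2417.5    # sequential scan of LINEITEM
mview_scan_cost = 174.1    # sequential scan of _MTHINLINEITEM
base_total_pages = 2193    # [BT: ... TotalPages=2193]
mview_total_pages = 544    # [MV: ... TotalPages=544]

cost_ratio = mview_scan_cost / base_scan_cost
page_ratio = mview_total_pages / base_total_pages

print(f"scan cost: {cost_ratio:.1%} of the original")  # about 7.2%
print(f"pages:     {page_ratio:.1%} of the base table")  # about 24.8%
```

The materialized view holds roughly a quarter of the pages of the base table, and the estimated scan cost drops to roughly 7% of the original.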
2.2 Lookup of small set of rows
Materialized views not only reduce the width of tables, they can also be used in a similar way to indexes to increase the speed of
queries that only access a very limited set of rows.
1. First we drop the view we used in the last chapter with the following command:
2. The following command returns the number of returned shipments vs. total shipments for a specific shipping day.
Execute the following command:
You should have a similar result to the following:
LABDB(LABADMIN)=> DROP VIEW THINLINEITEM;
LABDB(LABADMIN)=> SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET,
COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
QUERY SQL:
EXPLAIN VERBOSE SELECT SUM(L_QUANTITY), AVG(L_TAX) FROM LINEITEM WHERE EXTRACT(MONTH
FROM L_SHIPDATE) = 4;
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan mview "_MTHINLINEITEM" {(LINEITEM.L_ORDERKEY)}]
-- Estimated Rows = 511888, Width = 16, Cost = 0.0 .. 174.1, Conf = 90.0 [MV:
MaxPages=136 TotalPages=544] [BT: MaxPages=549 TotalPages=2193] (JIT-Stats)
Restrictions:
(DATE_PART('MONTH'::"VARCHAR", LINEITEM.L_SHIPDATE) = 4)
Projections:
1:LINEITEM.L_QUANTITY 2:LINEITEM.L_TAX
Node 2.
[SPU Aggregate]
-- Estimated Rows = 1, Width = 32, Cost = 366.0 .. 366.0, Conf = 0.0
Projections:
1:SUM(LINEITEM.L_QUANTITY)
2:(SUM(LINEITEM.L_TAX) / "NUMERIC"(COUNT(LINEITEM.L_TAX)))
[SPU Return]
[HOST Merge Aggs]
[Host Return]
...
You can see that on June 15, 1995 there were 176 returned shipments out of a total of 2550. Notice the use of
the CASE expression to turn the L_RETURNFLAG column into a 0/1 value, which is easily countable.
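The counting technique can be tried out on a handful of rows. The sketch below again uses Python's built-in sqlite3 module as a stand-in database; the five rows are invented and only the column names come from the lab.

```python
import sqlite3

# SQLite serves as a stand-in; the sample rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineitem (l_returnflag TEXT, l_shipdate TEXT)")
conn.executemany(
    "INSERT INTO lineitem VALUES (?, ?)",
    [
        ("N", "1995-06-15"),
        ("R", "1995-06-15"),
        ("A", "1995-06-15"),
        ("N", "1995-06-15"),
        ("R", "1995-06-16"),  # different ship date, filtered out by the WHERE clause
    ],
)

# CASE maps every row to 1 (flag is not 'N') or 0, so SUM counts the flagged rows.
ret, total = conn.execute(
    "SELECT SUM(CASE WHEN l_returnflag <> 'N' THEN 1 ELSE 0 END) AS ret, "
    "COUNT(*) AS total FROM lineitem WHERE l_shipdate = '1995-06-15'"
).fetchone()

print(ret, total)  # 2 4
```

A single scan produces both numbers, which is why the lab query does not need a second query or a subselect for the total.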
3. We will now take a look at the underlying data distribution of the LINEITEM table and its zone map values. To do this,
exit the nzsql console by executing the \q command.
4. In our demo image we have installed the PureData System support tools. You can normally find them as an installation
package in /nz on your PureData System appliances or you can retrieve them from IBM support. One of these tools is
the nz_zonemap tool that returns detailed information about the zone map values associated with a given database
table. First let’s have a look at the zone mappable columns of the LINEITEM table. Execute the following command:
You should get the following result:
This command returns an overview of the zonemappable columns of the LINEITEM table in the LABDB database. Seven of
the sixteen columns have zone maps created for them. Zonemappable columns include integer and date data types. We
see that the L_SHIPDATE column we have in the WHERE condition of the customer query is zonemappable.
5. Now we will have a look at the zone map values for the L_SHIPDATE column. Execute the following command:
This command returns a list of all extents that make up the LINEITEM table and the minimum and maximum values of the
data in the L_SHIPDATE column for each extent. Your results should look like the following:
[nz@netezza ~]$ nz_zonemap LABDB LINEITEM L_SHIPDATE
[nz@netezza ~]$ nz_zonemap LABDB LINEITEM
Database: LABDB
Object Name: LINEITEM
Object Type: TABLE
Object ID : 243252
The zonemappable columns are:
Column # | Column Name | Data Type
----------+---------------+-----------
1 | L_ORDERKEY | INTEGER
2 | L_PARTKEY | INTEGER
3 | L_SUPPKEY | INTEGER
4 | L_LINENUMBER | INTEGER
11 | L_SHIPDATE | DATE
12 | L_COMMITDATE | DATE
13 | L_RECEIPTDATE | DATE
(7 rows)
[nz@netezza ~]$ nz_zonemap LABDB LINEITEM
LABDB(LABADMIN)=> SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET,
COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
RET | TOTAL
-----+-------
176 | 2550
(1 row)
You can see that the LINEITEM table consists of 23 extents of data (3MB chunks on each dataslice). We can also see the
minimum and maximum values for the L_SHIPDATE column in each extent. These values are stored in the zone map and
automatically updated when rows are inserted, updated or deleted. If a query has a where condition on the L_SHIPDATE
column that falls outside of the data range of an extent, the whole extent can be discarded by PureData System without
scanning it.
In this case the ship dates are spread evenly across all extents: every extent covers almost the full date range of the table.
This means that our query, which has a WHERE condition on June 15, 1995, doesn’t profit from the zone maps and requires
a full table scan. Not a single extent could be safely ruled out.
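The pruning decision described here boils down to a range check per extent. The following Python sketch applies that check to a few of the minimum and maximum values from the nz_zonemap output for LINEITEM in this section. The function name and data structure are assumptions for illustration, not the appliance's internal representation.

```python
from datetime import date


def extents_to_scan(zone_map, wanted):
    """Return the extent numbers whose [min, max] range could contain `wanted`."""
    return [
        extent
        for extent, (low, high) in sorted(zone_map.items())
        if low <= wanted <= high
    ]


# A few (min, max) pairs copied from the nz_zonemap output for LINEITEM.
lineitem_zone_map = {
    1: (date(1992, 1, 4), date(1998, 11, 29)),
    2: (date(1992, 1, 6), date(1998, 11, 30)),
    9: (date(1992, 1, 7), date(1998, 12, 1)),
    23: (date(1992, 1, 2), date(1998, 11, 26)),
}

in_range = extents_to_scan(lineitem_zone_map, date(1995, 6, 15))
out_of_range = extents_to_scan(lineitem_zone_map, date(1999, 1, 1))

print(in_range)      # [1, 2, 9, 23]: nothing can be skipped
print(out_of_range)  # []: every extent can be skipped
```

Zone maps only help when the ranges of the extents differ from each other; here they are nearly identical, so a date inside the overall range hits all of them.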
6. Enter the NZSQL console again by entering the nzsql labdb labadmin command.
7. We will now create a materialized view that is ordered on the L_SHIPDATE column. Execute the following command:
Note that our customer query has a WHERE condition on the L_SHIPDATE column but aggregates the L_RETURNFLAG
column. Nevertheless we didn’t add the L_RETURNFLAG column to the materialized view. We could have done it to
enhance the performance of our specific query even more. But in this case we assume that there are lots of customer
queries which are restricted on the ship date and access different columns of the LINEITEM table. A materialized view
LABDB(LABADMIN)=> CREATE MATERIALIZED VIEW SHIPLINEITEM AS SELECT L_SHIPDATE FROM
LINEITEM ORDER BY L_SHIPDATE;
[nz@netezza ~]$ nz_zonemap LABDB LINEITEM L_SHIPDATE
Database: LABDB
Object Name: LINEITEM
Object Type: TABLE
Object ID : 243252
Data Slice: 1
Column 1: L_SHIPDATE (DATE)
Extent # | L_SHIPDATE (Min) | L_SHIPDATE (Max) | ORDER'ed
----------+------------------+------------------+----------
1 | 1992-01-04 | 1998-11-29 |
2 | 1992-01-06 | 1998-11-30 |
3 | 1992-01-03 | 1998-11-28 |
4 | 1992-01-02 | 1998-11-29 |
5 | 1992-01-04 | 1998-11-29 |
6 | 1992-01-03 | 1998-11-28 |
7 | 1992-01-04 | 1998-11-29 |
8 | 1992-01-04 | 1998-11-30 |
9 | 1992-01-07 | 1998-12-01 |
10 | 1992-01-03 | 1998-11-28 |
11 | 1992-01-05 | 1998-11-27 |
12 | 1992-01-03 | 1998-12-01 |
13 | 1992-01-03 | 1998-11-30 |
14 | 1992-01-04 | 1998-11-30 |
15 | 1992-01-06 | 1998-11-27 |
16 | 1992-01-03 | 1998-11-30 |
17 | 1992-01-02 | 1998-11-29 |
18 | 1992-01-07 | 1998-11-29 |
19 | 1992-01-04 | 1998-11-30 |
20 | 1992-01-04 | 1998-11-30 |
21 | 1992-01-03 | 1998-11-30 |
22 | 1992-01-04 | 1998-11-29 |
23 | 1992-01-02 | 1998-11-26 |
(23 rows)
retains the information about the location of a parent row in the base table and can be used for lookups even if columns of
the parent table are accessed in the SELECT clause.
You can specify more than one order column. In that case they are ordered first by the first column; in case this column has
equal values the next column is used to order rows with the same value in column one etc. In general only the first order
column provides a significant impact on performance.
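A small Python sketch can show why mainly the first order column matters. The four rows and the split into two "extents" of two rows each are invented for illustration.

```python
# Hypothetical (ship date, return flag) rows; dates are ISO strings so they sort correctly.
rows = [
    ("1995-06-15", "R"),
    ("1995-06-14", "N"),
    ("1995-06-15", "A"),
    ("1995-06-14", "A"),
]

# Order by the first column, then by the second column for equal first values.
ordered = sorted(rows, key=lambda r: (r[0], r[1]))

# Pretend each extent holds two rows and record min/max per column, like a zone map.
extents = [ordered[0:2], ordered[2:4]]


def value_range(extent, column):
    values = [row[column] for row in extent]
    return (min(values), max(values))


date_ranges = [value_range(e, 0) for e in extents]
flag_ranges = [value_range(e, 1) for e in extents]

print(ordered)
print(date_ranges)  # disjoint ranges: a date lookup touches one extent
print(flag_ranges)  # overlapping ranges: a flag lookup touches both extents
```

The ranges of the first column are disjoint across extents, while the ranges of the second column overlap, so a restriction on the second column alone cannot rule out any extent in this example.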
8. Let’s have a look at the zone map of the newly created view. Leave the nzsql console again with the \q command.
9. Display the zone map values of the materialized view SHIPLINEITEM with the following command:
The results should look like the following:
We can make a couple of observations here. First, the materialized view is significantly smaller than the base table, since it
only contains one column. We can also see that the data values in the extents are ordered on the L_SHIPDATE column. This
means that for our query, which is accessing data from June 15, 1995, only extent 3 needs to be accessed at all,
since only this extent has a data range that contains this date value.
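The effect of the ORDER BY clause on the zone map can be simulated. The sketch below builds a per-extent minimum and maximum for the same synthetic set of ship dates, once in scrambled order and once sorted. The extent size, the date generator and the function names are assumptions chosen so that the result is easy to verify by hand.

```python
from datetime import date, timedelta

EXTENT_ROWS = 400          # hypothetical number of rows per extent
START = date(1992, 1, 2)


def build_zone_map(values, extent_rows):
    """Record (min, max) for every consecutive chunk of `extent_rows` values."""
    zone_map = []
    for i in range(0, len(values), extent_rows):
        chunk = values[i:i + extent_rows]
        zone_map.append((min(chunk), max(chunk)))
    return zone_map


def extents_hit(zone_map, wanted):
    """Count the extents whose range could contain `wanted`."""
    return sum(1 for low, high in zone_map if low <= wanted <= high)


# 2400 distinct ship dates in a scrambled but deterministic order
# (7 and 2400 are coprime, so every day offset appears exactly once).
ship_dates = [START + timedelta(days=(i * 7) % 2400) for i in range(2400)]
wanted = date(1995, 6, 15)

unsorted_hits = extents_hit(build_zone_map(ship_dates, EXTENT_ROWS), wanted)
sorted_hits = extents_hit(build_zone_map(sorted(ship_dates), EXTENT_ROWS), wanted)

print(unsorted_hits, sorted_hits)  # 6 1
```

With scrambled data all six extents have to be read; after sorting, the requested date falls into the range of exactly one extent, mirroring the difference between the LINEITEM and SHIPLINEITEM zone maps.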
10. Now let’s verify that our materialized view is indeed used for this query. Enter the NZSQL console by entering the
following command: nzsql labdb labadmin
11. Use the Explain command again to verify that our materialized view is used by the Optimizer:
You will see a long text output, scroll up till you find the command you just executed. Your result should look like the
following:
LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE
0 END) AS RET, COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
[nz@netezza ~]$ nz_zonemap LABDB SHIPLINEITEM L_SHIPDATE
Database: LABDB
Object Name: SHIPLINEITEM
Object Type: MATERIALIZED VIEW
Object ID : 252077
Data Slice: 1
Column 1: L_SHIPDATE (DATE)
Extent # | L_SHIPDATE (Min) | L_SHIPDATE (Max) | ORDER'ed
----------+------------------+------------------+----------
1 | 1992-01-02 | 1993-04-11 |
2 | 1993-04-11 | 1994-05-24 | TRUE
3 | 1994-05-24 | 1995-07-03 | TRUE
4 | 1995-07-03 | 1996-08-14 | TRUE
5 | 1996-08-14 | 1997-09-24 | TRUE
6 | 1997-09-24 | 1998-12-01 | TRUE
(6 rows)
[nz@netezza ~]$ nz_zonemap LABDB SHIPLINEITEM L_SHIPDATE
Notice that the Optimizer has automatically changed the table scan to a scan of the view SHIPLINEITEM we just created.
This is possible even though the projection is taking place on column L_RETURNFLAG of the base table.
12. In some cases you might want to disable or suspend an associated materialized view, for example for troubleshooting or
administrative tasks on the base table. For these cases use the following command to suspend the view:
13. We want to make sure that the view is not used anymore during query execution. Execute the EXPLAIN command for
our query again:
Scroll up till you see your explain query. With the view suspended we can see that the optimizer again scans the original
table LINEITEM.
EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET,
COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan table "LINEITEM" {(LINEITEM.L_ORDERKEY)}]
-- Estimated Rows = 60012, Width = 1, Cost = 0.0 .. 2417.5, Conf = 80.0
Restrictions:
...
LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE
0 END) AS RET, COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
LABDB(LABADMIN)=> ALTER VIEW SHIPLINEITEM MATERIALIZE SUSPEND;
EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET,
COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan mview index "_MSHIPLINEITEM" {(LINEITEM.L_ORDERKEY)}]
-- Estimated Rows = 2193, Width = 1, Cost = 0.0 .. 61.7, Conf = 90.0 [MV:
MaxPages=24 TotalPages=24] [BT: MaxPages=549 TotalPages=2193] (JIT-Stats)
Restrictions:
(LINEITEM.L_SHIPDATE = '1995-06-15'::DATE)
Projections:
1:LINEITEM.L_RETURNFLAG
Node 2.
[SPU Aggregate]
-- Estimated Rows = 1, Width = 24, Cost = 62.2 .. 62.2, Conf = 0.0
Projections:
1:SUM(CASE WHEN (LINEITEM.L_RETURNFLAG <> 'N'::BPCHAR) THEN 1 ELSE 0 END)
2:COUNT(*)
[SPU Return]
[HOST Merge Aggs]
[Host Return]
...
IBM Software
Information Management
IBM PureData System for Analytics
© Copyright IBM Corp. 20112 All rights reserved Page 11 of 17
14. Note that we have only suspended our view, not dropped it. We will now reactivate it with the following refresh command:
LABDB(LABADMIN)=> ALTER VIEW SHIPLINEITEM MATERIALIZE REFRESH;
This command can also be used to reorder materialized views in case the base table has been changed. While INSERTs,
UPDATEs and DELETEs on the base table are automatically reflected in associated materialized views, the view is not
reordered for every change. Therefore it is advisable to refresh them periodically, especially after major changes to the base table.
15. To check that the Optimizer again uses the materialized view for query execution, execute the following command:
LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE
0 END) AS RET, COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
Make sure that the Optimizer again uses the materialized view for its first scan operation. The output should look like it did
before you suspended the view:
EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET,
COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan mview index "_MSHIPLINEITEM" {(LINEITEM.L_ORDERKEY)}]
-- Estimated Rows = 2193, Width = 1, Cost = 0.0 .. 61.7, Conf = 90.0 [MV:
MaxPages=24 TotalPages=24] [BT: MaxPages=549 TotalPages=2193] (JIT-Stats)
Restrictions:
(LINEITEM.L_SHIPDATE = '1995-06-15'::DATE)
Projections:
1:LINEITEM.L_RETURNFLAG
...
16. If you execute the query again you should get the same results as you got before creating the materialized view.
Execute the query again:
LABDB(LABADMIN)=> SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET,
COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
You should see the following output:
RET | TOTAL
-----+-------
176 | 2550
(1 row)
There is a defect in our VMware image which in some cases returns only the rows from one dataslice instead of all
four when a materialized view is used. This means that instead of seeing a TOTAL of 2550 you will see a total of
623 (or similar numbers depending on your data distribution and which dataslice is returned). You can solve this
problem by restarting your PureData System database. It will not occur on a real PureData System appliance.
You have just created a materialized view to speed up queries that look up small numbers of rows. A materialized view can
provide a significant performance improvement and is transparent to end users and applications accessing the database.
But it also creates additional overhead during INSERTs, UPDATEs and DELETEs, requires additional hard disc space, and
may require regular maintenance.
Therefore materialized views should be used sparingly. In the next chapter we will discuss an alternative approach to speed
up scan speeds on a database table.
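Why a sorted materialized view helps a lookup on L_SHIPDATE can be illustrated with a toy model of zone maps. The sketch below is not PureData System code: the extent size, the day numbers, and the helper names are all hypothetical, and it only assumes what the lab shows, namely that a zone map keeps a minimum and maximum value per extent and that extents whose range cannot contain the searched value are skipped.

```python
def build_zone_map(values, rows_per_extent):
    """Return one (min, max) pair per extent of consecutive rows."""
    zone_map = []
    for start in range(0, len(values), rows_per_extent):
        extent = values[start:start + rows_per_extent]
        zone_map.append((min(extent), max(extent)))
    return zone_map


def extents_to_scan(zone_map, wanted):
    """Count extents whose [min, max] range could contain the wanted value."""
    return sum(1 for low, high in zone_map if low <= wanted <= high)


# 60,000 hypothetical ship dates encoded as day numbers 0..1999.
# The multiplier scatters the values so that every extent holds every day,
# which mimics a base table loaded in no particular date order.
load_order = [(i * 7919) % 2000 for i in range(60000)]
ROWS_PER_EXTENT = 10000  # hypothetical extent size in rows
WANTED_DAY = 1000

base_table_scan = extents_to_scan(
    build_zone_map(load_order, ROWS_PER_EXTENT), WANTED_DAY)
sorted_view_scan = extents_to_scan(
    build_zone_map(sorted(load_order), ROWS_PER_EXTENT), WANTED_DAY)

print(base_table_scan, sorted_view_scan)  # prints: 6 1
```

In this model the unsorted data forces a scan of all six extents, while the sorted copy narrows the lookup to a single extent, which is the effect the ORDER'ed column of the nz_zonemap output hints at.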
3 Cluster Based Tables (CBT)
We have received a set of new customer queries on the ORDERS table that not only restrict the table by order date but also
access only orders in a given price range. These queries make up a significant part of the system workload and we will look
into ways to increase performance for them. The following query is a template for the queries in question. It returns the
aggregated total price of all orders by order priority for a given year (in this case 1996) and price range (in this case between
150000 and 180000).
SELECT O_ORDERPRIORITY, SUM(O_TOTALPRICE) FROM ORDERS
WHERE EXTRACT(YEAR FROM O_ORDERDATE) = 1996 AND
O_TOTALPRICE > 150000 AND
O_TOTALPRICE <= 180000
GROUP BY O_ORDERPRIORITY;
In this example we have a very restrictive WHERE condition on two columns, O_ORDERDATE and O_TOTALPRICE, which can
help us to increase performance. The ORDERS table has around 220,000 rows with an order date in 1996 and 160,000 rows
within the given price range. But it only has 20,000 rows that satisfy both conditions.
Materialized views provide their main performance improvements on one column. Also, INSERTs into the ORDERS table are
frequent and time critical; therefore we would prefer not to use materialized views and will in this chapter investigate the use of
cluster based tables.
Cluster based tables are PureData System tables that are created with an ORGANIZE ON keyword. They use a special space
filling algorithm to organize a table by up to 4 columns. Zone maps for a cluster based table will provide approximately the same
performance increases for all organization columns. This is useful if your query restricts a table on more than one column or if
your workload consists of multiple queries hitting the same table using different columns in WHERE conditions. In contrast to
materialized views no additional disc space is needed, since the base table itself is reordered.
3.1 Cluster Based Table Usage
Cluster based tables are created like normal PureData System database tables. They need to be flagged as a CBT during table
creation by specifying up to four organization columns. A PureData System table can be altered at any time to become a cluster
based table as well.
1. We are going to change the create table command for ORDERS to create a cluster based table. We will create a new
cluster based table called ORDERS_CBT. Exit the NZSQL console by executing the \q command.
2. Switch to the optimization lab directory by executing the following command: cd /labs/optimizationObjects
3. We have supplied the script for the creation of the ORDERS_CBT table, but we need to add the ORGANIZE
ON(O_ORDERDATE, O_TOTALPRICE) clause to create the table as a cluster based table organized on the
O_ORDERDATE and O_TOTALPRICE columns. To change the CREATE statement open the orders_cbt.sql script in
the vi editor with the following command:
vi orders_cbt.sql
4. Enter insert mode by pressing “i”. The editor should now show “-- INSERT --” in the bottom line.
5. Move the cursor onto the semicolon that ends the statement and press Enter to move it onto a new line. Enter the line
“organize on (o_orderdate, o_totalprice)” before it. Your screen should now look like the following:
create table orders_cbt
(
o_orderkey integer not null ,
o_custkey integer not null ,
o_orderstatus char(1) not null ,
o_totalprice decimal(15,2) not null ,
o_orderdate date not null ,
o_orderpriority char(15) not null ,
o_clerk char(15) not null ,
o_shippriority integer not null ,
o_comment varchar(79) not null
)
distribute on (o_orderkey)
organize on (o_orderdate, o_totalprice);
~
-- INSERT --
6. Exit the insert mode by pressing Esc.
7. Enter :wq! in the command line and press Enter to save and exit without questions.
8. Create and load the orders_cbt table by executing the following script: ./create_orders_test.sh
9. This may take a couple of minutes because of our virtualized environment. You may see an error message that the table
orders_cbt does not exist. This is expected since the script first tries to clean up an existing orders_cbt table.
10. We will now have a look at how Netezza has organized the data in this table. For this we use the nz_zonemap utility
again. Execute the following command:
[nz@netezza optimizationObjects]$ nz_zonemap labdb orders_cbt
You will get the following result:
Database: LABDB
Object Name: ORDERS_CBT
Object Type: TABLE
Object ID : 264428
The zonemappable columns are:
Column # | Column Name | Data Type
----------+----------------+---------------
1 | O_ORDERKEY | INTEGER
2 | O_CUSTKEY | INTEGER
4 | O_TOTALPRICE | NUMERIC(15,2)
5 | O_ORDERDATE | DATE
8 | O_SHIPPRIORITY | INTEGER
(5 rows)
This command shows you the zone mappable columns of the ORDERS_CBT table. If you compare it with the output of the
nz_zonemap tool for the ORDERS table, you will see that it contains the additional column O_TOTALPRICE. Numeric
columns are not zone mapped by default for performance reasons, but zone maps are created for them if they are part of
the organization columns.
11. Execute the following command to see the zone map values of the O_ORDERDATE column:
[nz@netezza optimizationObjects]$ nz_zonemap labdb orders_cbt o_orderdate
You will get the following results:
Database: LABDB
Object Name: ORDERS_CBT
Object Type: TABLE
Object ID : 264428
Data Slice: 1
Column 1: O_ORDERDATE (DATE)
Extent # | O_ORDERDATE (Min) | O_ORDERDATE (Max) | ORDER'ed
----------+-------------------+-------------------+----------
1 | 1992-01-01 | 1998-08-02 |
2 | 1992-01-01 | 1998-08-02 |
3 | 1992-01-01 | 1998-08-02 |
4 | 1992-01-01 | 1998-08-02 |
5 | 1992-01-01 | 1998-08-02 |
6 | 1992-01-01 | 1998-08-02 |
7 | 1992-01-01 | 1998-08-02 |
(7 rows)
This is unexpected. Since we used O_ORDERDATE as an organization column we would have expected some kind of
order in the data values, but they are again distributed equally over all extents.
The reason for this is that the organization process takes place during a command called groom. Instead of creating a new
table we could also have altered the existing ORDERS table to become a cluster based table. Creating or altering a table to
become a cluster based table doesn’t actually change the physical table layout until the groom command has been used.
This command will be covered in detail in the following presentation and lab. But we will use it in the next chapter to
reorganize the table.
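To build some intuition for what organizing on two columns can achieve once groom has run, the sketch below orders a small grid of hypothetical (date bucket, price bucket) pairs along a Z-order curve and compares the resulting per-extent ranges with a plain sort on the date alone. The lab only says that a "special space filling algorithm" is used; Z-order is chosen here purely because it is short to write down and is an assumption, not a statement about the algorithm PureData System actually applies. All names and sizes are made up for the illustration.

```python
def z_order_key(date_bucket, price_bucket, bits=4):
    """Interleave the bits of both values into a single Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((date_bucket >> i) & 1) << (2 * i)
        key |= ((price_bucket >> i) & 1) << (2 * i + 1)
    return key


def extent_ranges(rows, rows_per_extent):
    """Return (min date, max date, min price, max price) for each extent."""
    ranges = []
    for start in range(0, len(rows), rows_per_extent):
        extent = rows[start:start + rows_per_extent]
        dates = [d for d, _ in extent]
        prices = [p for _, p in extent]
        ranges.append((min(dates), max(dates), min(prices), max(prices)))
    return ranges


# 16 x 16 hypothetical buckets, stored in 4 extents of 64 rows each.
rows = [(d, p) for d in range(16) for p in range(16)]

by_date_only = extent_ranges(sorted(rows), 64)
by_z_order = extent_ranges(sorted(rows, key=lambda r: z_order_key(*r)), 64)

for label, ranges in (("date sort", by_date_only), ("z-order", by_z_order)):
    print(label, ranges)
```

Sorting on the date alone gives narrow date ranges but every extent still spans all prices. The interleaved ordering gives each extent a restricted range on both columns, which is the property that lets zone maps help on either organization column.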
3.2 Cluster Based Table Maintenance
When a table is created as a cluster based table in Netezza the data isn’t actually organized during load time. Also similar to
ordered materialized views a cluster based table can become partially unordered due to INSERTS, UPDATES and DELETES. A
threshold is defined for reorganization and the groom command can be used at any time to reorganize a cluster based table,
based on its organization keys.
1. To organize the table you created in the last chapter you need to switch to the NZSQL console again. Execute the
following command: nzsql labdb labadmin
2. Execute the following command to groom your cluster based table:
LABDB(LABADMIN)=> groom table orders_cbt;
This command does a variety of things which will be covered in a further presentation and lab. In this case it organizes the
cluster based table based on its organization keys.
This command requires a lot of RAM on the SPUs to operate. Our VMware systems have been tuned so the
command should be able to finish. Since the whole table is reordered it may take a couple of minutes to finish, but
should you get the impression that the system is stuck please inform the lecturer.
3. Let’s have a look at the data organization in the table. To do this quit the NZSQL console with the \q command.
4. Review the zone maps of the two organization columns by executing the following command:
[nz@netezza optimizationObjects]$ nz_zonemap labdb orders_cbt o_orderdate o_totalprice
Your results should look like the following (we removed the ORDER'ed columns from the results to make them more readable):
Database: LABDB
Object Name: ORDERS_CBT
Object Type: TABLE
Object ID : 264428
Data Slice: 1
Column 1: O_ORDERDATE (DATE)
Column 2: O_TOTALPRICE (NUMERIC(15,2))
Extent # | O_ORDERDATE (Min) | O_ORDERDATE (Max) | O_TOTALPRICE (Min) | O_TOTALPRICE (Max)
----------+-------------------+-------------------+--------------------+--------------------
1 | 1992-01-01 | 1994-06-22 | 912.10 | 144450.63
2 | 1993-08-27 | 1996-12-08 | 875.52 | 144451.22
3 | 1996-02-13 | 1998-08-02 | 884.52 | 144446.76
4 | 1995-04-18 | 1998-08-02 | 78002.23 | 215555.39
5 | 1993-08-27 | 1998-08-02 | 196595.73 | 530604.44
6 | 1992-01-01 | 1995-04-18 | 144451.94 | 296228.30
7 | 1992-01-01 | 1993-08-27 | 196591.22 | 555285.16
(7 rows)
You can see that both columns have some form of order now. Our query restricts rows in two ranges:
Condition 1: year of O_ORDERDATE = 1996
AND
Condition 2: 150000 < O_TOTALPRICE <= 180000
Below we enter the minimum and maximum values of the extents in a table and add a column to mark (with an X) if the
contained values of an extent overlap with the above conditions.
Min(Date)  | Max(Date)  | Min(Price) | Max(Price) | Cond 1 | Cond 2 | Both Cond
-----------+------------+------------+------------+--------+--------+----------
1992-01-01 | 1994-06-22 |     912.10 |  144450.63 |        |        |
1993-08-27 | 1996-12-08 |     875.52 |  144451.22 | X      |        |
1996-02-13 | 1998-08-02 |     884.52 |  144446.76 | X      |        |
1995-04-18 | 1998-08-02 |   78002.23 |  215555.39 | X      | X      | X
1993-08-27 | 1998-08-02 |  196595.73 |  530604.44 | X      |        |
1992-01-01 | 1995-04-18 |  144451.94 |  296228.30 |        | X      |
1992-01-01 | 1993-08-27 |  196591.22 |  555285.16 |        |        |
As you can see there are now 4 extents that have rows from 1996 in them and 2 extents that contain rows in the price range
from 150000 to 180000. But we have only one extent that contains rows that satisfy both conditions and needs to be scanned
during query execution.
In this scenario we probably would have been able to get similar results with one organization column or a materialized view,
but with bigger tables and more extents cluster based tables gain a performance advantage.
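The extent counts in the table above can be reproduced with a few lines of Python. This is only a check of the arithmetic under the assumption that an extent must be scanned whenever its minimum/maximum range overlaps the query range; the function names are hypothetical and the values are copied from the nz_zonemap output for data slice 1.

```python
from datetime import date

# (min date, max date, min price, max price) per extent, in extent order.
ZONE_MAP = [
    (date(1992, 1, 1), date(1994, 6, 22), 912.10, 144450.63),
    (date(1993, 8, 27), date(1996, 12, 8), 875.52, 144451.22),
    (date(1996, 2, 13), date(1998, 8, 2), 884.52, 144446.76),
    (date(1995, 4, 18), date(1998, 8, 2), 78002.23, 215555.39),
    (date(1993, 8, 27), date(1998, 8, 2), 196595.73, 530604.44),
    (date(1992, 1, 1), date(1995, 4, 18), 144451.94, 296228.30),
    (date(1992, 1, 1), date(1993, 8, 27), 196591.22, 555285.16),
]


def may_contain_year(extent, year):
    """Condition 1: the extent's date range includes some day of the year."""
    min_date, max_date = extent[0], extent[1]
    return min_date.year <= year <= max_date.year


def may_contain_price(extent, low, high):
    """Condition 2: the extent's price range overlaps low < price <= high."""
    min_price, max_price = extent[2], extent[3]
    return max_price > low and min_price <= high


cond1 = [n for n, e in enumerate(ZONE_MAP, start=1) if may_contain_year(e, 1996)]
cond2 = [n for n, e in enumerate(ZONE_MAP, start=1)
         if may_contain_price(e, 150000, 180000)]
both = [n for n in cond1 if n in cond2]

print(cond1, cond2, both)  # prints: [2, 3, 4, 5] [4, 6] [4]
```

Only extent 4 survives both checks, matching the single X in the Both Cond column.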
Congratulations, you have finished the Optimization Objects lab. In this lab you have created materialized views to speed up
scans of wide tables and queries that only look up small numbers of rows. Finally you created a cluster based table and
used the groom command to organize it. Throughout the lab you have used the nz_zonemap tool to see zone maps and get
a better idea of how data is stored in the Netezza appliance.
Groom
Hands-On Lab
IBM PureData System for Analytics … Powered by Netezza Technology
Table of Contents
1 Introduction
1.1 Objectives
2 Transactions
2.1 Insert Transaction
2.2 Update and Delete Transactions
2.3 Aborting Transactions
2.4 Cleaning up
3 Grooming Logically Deleted Rows
4 Performance Benefits of GROOM
5 Changing the Data Type of a Column
1 Introduction
As part of your routine database maintenance activities, you should plan to recover disk space occupied by outdated or deleted
rows. In normal PureData System operation, an UPDATE or DELETE of a table row does not remove the physical row on the
hard disc. Instead the old row is marked as deleted together with a transaction id of the deleting transaction and in case of
update a new row is created. This approach is called multiversioning. Rows that could potentially be visible to other transactions
with an older transaction id are still accessible. Over time however, the outdated or deleted rows are of no interest to any
transaction anymore and need to be removed to free up hard disc space and improve performance. After the rows have been
captured in a backup, you can reclaim the space they occupy using the SQL GROOM TABLE command. The GROOM TABLE
command does not lock a table while it is running; you can continue to SELECT, UPDATE, and INSERT into the table while the
table is being groomed.
1.1 Objectives
In this lab we will use the GROOM command to prepare our tables for the customer. During the course of the POC we have
deleted and updated a number of rows. At the end of a POC it is sensible to clean up the system: use Groom on the created
tables, generate statistics, and perform other cleanup tasks.
2 Transactions
In this section we will show how transactions can leave logically deleted rows in a table which later as an administrative task
need to be removed with the groom command. We will go through the different transaction types and show you what happens
under the covers in a PureData System Appliance.
2.1 Insert Transaction
In this chapter we will add a new row to the REGION table and review the hidden fields that are saved in the database. As you
remember from the Transactions presentation, PureData System uses a concept called multi-versioning for transactions. Each
transaction has its own image of the table and doesn’t influence other transactions. This is done by adding a number of hidden
fields to the PureData System table. The most important ones are the CREATEXID and the DELETEXID. Each PureData System
transaction has a unique transaction id that is increasing with each new transaction.
In this subsection we will add a new row to the REGION table.
1. Connect to your Netezza image using PuTTY. Log in to 192.168.239.2 as user “nz” with password “nz”. (192.168.239.2 is
the default IP address for a local VM; the IP may be different for your Bootcamp.)
2. Start NZSQL with: nzsql
3. Connect to the database LABDB as user LABADMIN by typing the following command:
SYSTEM(ADMIN)=> \c LABDB LABADMIN
4. Select all rows from the REGION table:
LABDB(LABADMIN)=> SELECT * FROM REGION;
You should see the following output with 4 existing regions:
R_REGIONKEY | R_NAME | R_COMMENT
-------------+---------------------------+-----------------------------
2 | sa | south america
4 | ap | asia pacific
3 | emea | europe, middle east, africa
1 | na | north america
(4 rows)
5. Insert a new row into the REGION table for the region Australia with the following SQL command:
LABDB(LABADMIN)=> INSERT INTO REGION VALUES (5, 'as', 'australia');
6. Now we will again do a select on the REGION table. But this time we will also query the hidden fields CREATEXID,
DELETEXID and ROWID:
LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID,* FROM REGION;
You should see the following results:
CREATEXID | DELETEXID | ROWID | R_REGIONKEY | R_NAME | R_COMMENT
-----------+-----------+-----------+-------------+---------------------------+-----------------------------
365584 | 0 | 163100000 | 5 | as | australia
357480 | 0 | 161271001 | 1 | na | north america
357480 | 0 | 161271002 | 2 | sa | south america
357480 | 0 | 161271000 | 3 | emea | europe, ...
357480 | 0 | 161271003 | 4 | ap | asia pacific
(5 rows)
As you can see we now have five rows in the REGION table. The new row for Australia has the id of the last transaction as
CREATEXID and “0” as DELETEXID since it has not yet been deleted. Other transactions with a lower transaction id that
might still be running will not be able to see this new row. Note also that each row has a unique rowid. Rowids do not need
to be consecutive but they are unique across all dataslices for one table.
2.2 Update and Delete Transactions
Delete transactions in PureData System do not physically remove rows but update the DELETEXID field of a row to mark it as
logically deleted. These logically deleted rows need to be removed regularly with the administrative Groom command.
Update transactions in PureData System consist of a logical delete of the old row and an insert of a new row with the updated
fields. To show this effectively we will need to change a system parameter in PureData System that allows us to switch off the
invisibility lists in PureData System. Note that the parameter we will be using is dangerous and shouldn’t be used in a real
PureData System environment. There is also a safer environment variable but this has some restrictions.
1. First we will change the system variable that allows us to see deleted rows in the system.
To do this exit the console with \q
2. Stop the PureData System database with nzstop
3. Navigate to the system config directory with the following command:
[nz@netezza ~]$ cd /nz/data/config
4. Open the system.cfg file that contains the PureData System configuration with vi:
[nz@netezza config]$ vi system.cfg
5. Enter insert mode by pressing “i”. The editor should now show “-- INSERT --” in the bottom line.
6. Navigate the cursor to the end of the last line. Press Enter to create a new line. Enter the line
host.fpgaAllowXIDOverride=yes. Your screen should now look like the following:
system.enableCompressedTables=false
system.realFpga=no
system.useFpgaPrep=yes
system.enableCompressedTables=yes
system.enclosurePollInterval=0
system.envPollInterval=0
system.esmPollInterval=0
system.hbaPollInterval=0
system.diskPollInterval=0
system.enableCTA2=1
system.enableCTAColumns=1
sysmgr.coreCountWarning=1
sysmgr.coreCountFailover=1
system.emulatorMode=64
system.emulatorThreads=4
host.fpgaAllowXIDOverride=yes
~
~
-- INSERT --
7. Exit the insert mode by pressing Esc.
8. Enter :wq! in the command line and press Enter to save and exit without questions.
9. Start the system again with the nzstart command. Note that in a real PureData System environment changing system
configuration parameters can be very dangerous and is normally not advisable without PureData System service
support.
10. Enter the NZSQL console again with the following command:
[nz@netezza config]$ nzsql labdb labadmin
11. Now we will update the row we inserted into the REGION table in the last chapter:
LABDB(LABADMIN)=> UPDATE REGION SET R_COMMENT='Australia' WHERE R_REGIONKEY=5;
12. Do a SELECT on the REGION table again:
LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID,* FROM REGION;
You should see the following output:
CREATEXID | DELETEXID | ROWID | R_REGIONKEY | R_NAME | R_COMMENT
-----------+-----------+-----------+-------------+---------------------------+-----------------------------
357480 | 0 | 161271003 | 4 | ap | asia pacific
357480 | 0 | 161271000 | 3 | emea | europe, ...
357480 | 0 | 161271002 | 2 | sa | south america
365584 | 369666 | 163100000 | 5 | as | australia
369666 | 0 | 163100000 | 5 | as | Australia
357480 | 0 | 161271001 | 1 | na | north america
(6 rows)
Normally you would now see 5 rows with the updated value. But since we disabled the invisibility lists you now see 6 rows in
the REGION table. Our transaction that updated the row had the transaction id 369666. You can see that the original row
with the lowercase “australia” in the comment column is still there and now has a DELETEXID field that contains the
transaction id of the transaction that deleted it. Transactions with a higher transaction id will not see a row whose DELETEXID
indicates that it was logically deleted before the transaction was run.
We also see a newly inserted row with the new comment value ‘Australia’. It has the same rowid as the deleted row, and its
CREATEXID is the id of the transaction that performed the update.
13. Finally let’s clean up the table again by deleting the Australia row:
LABDB(LABADMIN)=> DELETE FROM REGION WHERE R_REGIONKEY=5;
14. Do a SELECT on the REGION table again:
LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID,* FROM REGION;
You should see the following output:
CREATEXID | DELETEXID | ROWID | R_REGIONKEY | R_NAME | R_COMMENT
-----------+-----------+-----------+-------------+---------------------------+-----------------------------
357480 | 0 | 161271000 | 3 | emea | europe, ...
365584 | 369666 | 163100000 | 5 | as | australia
369666 | 369670 | 163100000 | 5 | as | Australia
357480 | 0 | 161271001 | 1 | na | north america
357480 | 0 | 161271003 | 4 | ap | asia pacific
357480 | 0 | 161271002 | 2 | sa | south america
(6 rows)
We can now see that we have logically deleted our updated row as well. It has now a DELETEXID field with the value of the
new transaction. New transactions will see the original table from the start of this lab again. Normally the logically deleted
rows are filtered out automatically by the FPGA.
If you do a SELECT the FPGA will remove all rows that:
• have a CREATEXID which is bigger than the current transaction id.
• have a CREATEXID of an uncommitted transaction.
• have a DELETEXID which is smaller than the current transaction id, but only if the transaction of the DELETEXID
field is committed.
• have a DELETEXID of “1” which means that the insert has been aborted.
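The four rules above can be written down as a small predicate. The sketch below is a simplified model, not the FPGA logic: the function name, the set of committed transaction ids, and the reader ids 9300 and 9220 are hypothetical, and a transaction's view of its own uncommitted changes is deliberately not modeled. The row values are taken from a REGION listing that appears later in this lab, after an update has been rolled back.

```python
ABORTED = 1  # DELETEXID value that marks an aborted insert


def is_visible(row, current_xid, committed):
    """Apply the four filter rules to one (createxid, deletexid, comment) row."""
    createxid, deletexid, _ = row
    if createxid > current_xid:
        return False  # created by a later transaction
    if createxid not in committed:
        return False  # created by an uncommitted (or aborted) transaction
    if deletexid == ABORTED:
        return False  # the insert was rolled back
    if deletexid != 0 and deletexid < current_xid and deletexid in committed:
        return False  # logically deleted by an earlier, committed transaction
    return True


ROWS = [
    (5160, 0, "south america"),
    (5160, 0, "europe"),
    (5160, 0, "asia pacific"),
    (9226, 1, "AP"),
    (5160, 0, "north america"),
    (5172, 9218, "australia"),
    (9218, 9222, "Australia"),
]
COMMITTED = {5160, 5172, 9218, 9222}  # assumed: 9226 was rolled back


def visible_comments(current_xid):
    return [r[2] for r in ROWS if is_visible(r, current_xid, COMMITTED)]


print(visible_comments(9300))  # a reader that starts after all changes
print(visible_comments(9220))  # a reader between the update and the delete
```

A reader with id 9300 sees only the four original regions, while a reader with id 9220 still sees the "Australia" row because the delete by transaction 9222 lies in its future.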
2.3 Aborting Transactions
PureData System never deletes a row during transactions even if transactions are rolled back. In this section we will show what
happens if a transaction is rolled back. Since an update transaction consists of a delete and insert transaction, we will
demonstrate the behavior for all three transaction types with this.
1. To start a transaction that we can later roll back we need to use the BEGIN keyword:
LABDB(LABADMIN)=> BEGIN;
By default all SQL statements entered into the NZSQL console are auto-committed. To start a multi-command transaction
the BEGIN keyword needs to be used. All SQL statements that are executed after it will belong to a single transaction. To
end the transaction two keywords can be used: COMMIT to commit the transaction, or ROLLBACK to roll back the transaction
and all changes made since the BEGIN statement was executed.
2. Update the row for the AP region:
LABDB(LABADMIN)=> UPDATE REGION SET R_COMMENT='AP' WHERE R_REGIONKEY=4;
3. Do a SELECT on the REGION table again:
LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID,* FROM REGION;
You should see the following output:
CREATEXID | DELETEXID | ROWID | R_REGIONKEY | R_NAME | R_COMMENT
-----------+-----------+----------+-------------+---------------------------+-----------------------------
5160 | 0 | 37801002 | 2 | sa | south america
5160 | 0 | 37801001 | 1 | na | north america
5172 | 9218 | 38962000 | 5 | as | australia
9218 | 9222 | 38962000 | 5 | as | Australia
5160 | 0 | 37801000 | 3 | emea | europe, middle east, africa
5160 | 9226 | 37801003 | 4 | ap | asia pacific
9226 | 0 | 37801003 | 4 | ap | AP
(7 rows)
Note that we have the same results as in the last chapter, the original row for the AP region was logically deleted by
updating its DELETEXID field, and a new row with the updated comment and new rowid has been added. Note that its
CREATEXID is the same as the DELETEXID of the old row, since they were updated by the same transaction.
4. Now lets rollback the transaction:
5. Do a SELECT on the REGION table again:
You should see the following output:
We can see that the transaction has been rolled back. The DELETEXID of the old version of the row has been reset to “0” ,
which means that it is a valid row that can be seen by other transactions, and the DELETEXID of the new row has been set
to “1” which marks it as aborted.
2.4 Cleaning up
In this section we will use the Groom command to remove the logically deleted rows we have entered and we will remove the
system parameter from the configuration file. The Groom command will be used in more detail in the next chapter. It is the main
maintenance command in PureData System and we have already used it in the Cluster Based Table labs to reorder a CBT. It
also removes all logically deleted rows from a table and frees up the space on the machine again.
1. Execute the Groom command on the REGION table:
You should see the following result:
You can see that the groom command purged 3 rows, exactly the number of aborted and logically deleted rows we have
generated in the previous chapter.
LABDB(LABADMIN)=> ROLLBACK;
LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID,* FROM REGION;
CREATEXID | DELETEXID | ROWID | R_REGIONKEY | R_NAME | R_COMMENT
-----------+-----------+----------+-------------+---------------------+----------------------
5160 | 0 | 37801002 | 2 | sa | south america
5160 | 0 | 37801000 | 3 | emea | Europe …
5160 | 0 | 37801003 | 4 | ap | asia pacific
9226 | 1 | 37801003 | 4 | ap | AP
5160 | 0 | 37801001 | 1 | na | north america
5172 | 9218 | 38962000 | 5 | as | australia
9218 | 9222 | 38962000 | 5 | as | Australia
(7 rows)
LABDB(LABADMIN)=> groom table region;
NOTICE: Groom processed 4 pages; purged 3 records; scan size unchanged; table size unchanged.
GROOM RECORDS ALL
2. Now select the rows from the REGION table again.
You should see the following result:
You can see that the groom command has removed all logically deleted rows from the table. Remember that we still have
the parameter switched on that allows us to see any logically deleted rows. Especially in tables that are heavily changed
with lots of updates and deletes, running the groom command will free up disk space and improve performance.
3. Finally we will remove the system parameter again. Quit the nzsql console with the \q command.
4. Stop the PureData System database with nzstop.
5. Navigate to the system config directory with the following command:
6. Open the system.cfg file, which contains the PureData System configuration, with vi:
7. Navigate the cursor to the last line and delete it by pressing “d” twice. Your screen should look like the following:
[nz@netezza ~]$ cd /nz/data/config
[nz@netezza config]$ vi system.cfg
LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID,* FROM REGION;
CREATEXID | DELETEXID | ROWID | R_REGIONKEY | R_NAME | R_COMMENT
-----------+-----------+-----------+-------------+---------------------------+-----------------------------
357480 | 0 | 161271002 | 2 | sa | south america
357480 | 0 | 161271000 | 3 | emea | europe,...
369682 | 0 | 164100000 | 1 | na | north america
357480 | 0 | 161271003 | 4 | ap | asia pacific
(4 rows)
LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID,* FROM REGION;
8. Enter :wq! in the command line and press Enter to save and exit without questions.
9. Start the system again with the nzstart command. We have now returned the system to its original state. Logically
deleted rows will again be hidden by the database.
3 Grooming Logically Deleted Rows
In this section we will delete rows and determine that they have not really been deleted from the disk. Then using groom we will
physically delete the rows.
1. First determine the physical size on disk of the ORDERS table using the following command:
You should see the following results:
Notice that the ORDERS table is 75 MB in size.
2. Now we are going to delete some rows from the ORDERS table. Delete all rows where the order status is marked as ‘F’ for
finished using the following command:
system.enableCompressedTables=false
system.realFpga=no
system.useFpgaPrep=yes
system.enableCompressedTables=yes
system.enclosurePollInterval=0
system.envPollInterval=0
system.esmPollInterval=0
system.hbaPollInterval=0
system.diskPollInterval=0
system.enableCTA2=1
system.enableCTAColumns=1
sysmgr.coreCountWarning=1
sysmgr.coreCountFailover=1
system.emulatorMode=64
system.emulatorThreads=4
~~
"system.cfg" 16L, 421C
[nz@netezza ~]$ nz_db_size LABDB
Object | Name | Bytes | KB | MB | GB | TB
-----------+------------------+---------------+-------------+-------------+------------+-------
Appliance | netezza | 769,785,856 | 751,744 | 734 | .7 | .0
Database | LABDB | 761,921,536 | 744,064 | 727 | .7 | .0
Table | CUSTOMER | 13,631,488 | 13,312 | 13 | .0 | .0
Table | LINEITEM | 588,644,352 | 574,848 | 561 | .5 | .0
Table | NATION | 524,288 | 512 | 1 | .0 | .0
Table | ORDERS | 78,118,912 | 76,288 | 75 | .1 | .0
Table | PART | 12,058,624 | 11,776 | 12 | .0 | .0
Table | PARTSUPP | 67,502,080 | 65,920 | 64 | .1 | .0
Table | REGION | 393,216 | 384 | 0 | .0 | .0
Table | SUPPLIER | 1,048,576 | 1,024 | 1 | .0 | .0
The output should be:
3. Now check the physical table size for ORDERS and see if the size decreased, using the same command as before. You
must first exit nzsql to the shell using \q.
The output should be the same as above, showing that the ORDERS table did not change in size and is still 75 MB. This is
because the deleted rows were logically deleted but are still left on disk. The rows are still accessible to transactions that started
before the DELETE statement we just executed (i.e. transactions that have a lower transaction ID).
4. Next let’s physically delete what we just logically deleted using the GROOM TABLE command and specifying table
ORDERS. When you run the GROOM TABLE command, it removes outdated and deleted records from tables.
The output should be:
You can see that 729413 rows were removed from disk resulting in the table size shrinking by 12 extents. Notice that this is
the same number of rows we deleted in the previous step.
5. Check if the ORDERS table size on disk has shrunk using the nz_db_size command. You must first exit nzsql to the
shell using \q.
The output is shown below. Note the reduced size of the ORDERS table:
[nz@netezza labs]$ nzsql LABDB LABADMIN
LABDB(LABADMIN)=> DELETE FROM ORDERS WHERE O_ORDERSTATUS='F';
DELETE 729413
LABDB(LABADMIN)=> \q
[nz@netezza ~]$ nz_db_size LABDB
[nz@netezza labs]$ nzsql LABDB LABADMIN
LABDB(LABADMIN)=> GROOM TABLE ORDERS;
NOTICE: Groom processed 596 pages; purged 729413 records; scan size shrunk by 288 pages; table size shrunk by 12 extents.
GROOM RECORDS ALL
LABDB(LABADMIN)=> \q
[nz@netezza ~]$ nz_db_size labdb
Object | Name | Bytes | KB | MB | GB | TB
-----------+-------------+--------------+------------+------------+-----------+-------
Appliance | netezza | 430,833,664 | 420,736 | 411 | .4 | .0
Database | LABDB | 422,969,344 | 413,056 | 403 | .4 | .0
Table | CUSTOMER | 13,631,488 | 13,312 | 13 | .0 | .0
Table | LINEITEM | 294,256,640 | 287,360 | 281 | .3 | .0
Table | NATION | 524,288 | 512 | 1 | .0 | .0
Table | ORDERS | 40,370,176 | 39,424 | 39 | .0 | .0
Table | PART | 5,242,880 | 5,120 | 5 | .0 | .0
Table | PARTSUPP | 67,502,080 | 65,920 | 64 | .1 | .0
Table | REGION | 393,216 | 384 | 0 | .0 | .0
Table | SUPPLIER | 1,048,576 | 1,024 | 1 | .0 | .0
We can see that GROOM did purge the deleted rows from disk. GROOM reported that the table size was reduced by 12 extents, and
we can confirm this because the size of the table was reduced by 36 MB (from 75 MB to 39 MB), which is the correct size for
12 extents (one extent is 3 MB).
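The arithmetic behind this check can be written out as a short Python calculation. It uses the byte counts for ORDERS from the two nz_db_size outputs and the 3 MB extent size stated in the text; the variable names are illustrative only.

```python
EXTENT_MB = 3                  # extent size as stated in the lab text
BYTES_PER_MB = 1024 * 1024

before_bytes = 78_118_912      # ORDERS before GROOM (nz_db_size output)
after_bytes = 40_370_176       # ORDERS after GROOM (nz_db_size output)
extents_freed = 12             # reported in the GROOM notice

# Space actually freed according to nz_db_size, in MB
freed_mb = (before_bytes - after_bytes) / BYTES_PER_MB
# Space we expect to be freed according to the GROOM notice, in MB
expected_mb = extents_freed * EXTENT_MB

print(freed_mb, expected_mb)
```

Both values come out at 36 MB, so the size difference reported by nz_db_size matches the number of extents reported by GROOM.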
4 Performance Benefits of GROOM
In this section we will show that grooming a table can also result in a performance benefit because the amount of data that
needs to be scanned is smaller. Outdated rows are still present on the hard disk. They can be dismissed by the FPGA chip, but
the system still needs to read them from disk. In this example we need, for accounting reasons, to increase the order price of all
orders. This means that we need to update every row in the ORDERS table. We will measure query performance before and
after grooming the table.
1. Update the ORDERS table so that the price of everything is increased by $1. Do this using the following command:
Output:
All rows will be affected by the update, resulting in a doubled number of physical rows in the table. This is because the update
operation leaves a copy of the rows as they were before the update, in case a transaction is still operating on them. New rows
are created and the results of the UPDATE are put in these rows. The old rows that are left on disk are marked as logically
deleted.
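The difference between physical and logical row counts after such an UPDATE can be sketched with a toy model in Python. The update_all and groom functions, the dictionary layout and the transaction IDs are assumptions made for illustration; they only mimic the behaviour described in the text on three sample rows.

```python
def update_all(physical_rows, xid):
    """Mark every current row as deleted by `xid` and append an updated copy."""
    new_versions = []
    for row in physical_rows:
        if row["deletexid"] == 0:
            row["deletexid"] = xid  # old version is logically deleted
            new_versions.append(
                {"createxid": xid, "deletexid": 0, "price": row["price"] + 1}
            )
    physical_rows.extend(new_versions)


def groom(physical_rows):
    """Keep only rows that are not logically deleted (what GROOM purges is dropped)."""
    return [row for row in physical_rows if row["deletexid"] == 0]


# Three sample orders, all created by a hypothetical transaction 100
table = [{"createxid": 100, "deletexid": 0, "price": p} for p in (10.0, 20.0, 30.0)]

update_all(table, xid=200)
physical_after_update = len(table)
logical_after_update = sum(1 for row in table if row["deletexid"] == 0)

table = groom(table)
physical_after_groom = len(table)

print(physical_after_update, logical_after_update, physical_after_groom)
```

In this model the update leaves six physical rows for three logical rows, and grooming brings the physical count back to three, which is the effect the following steps measure on the ORDERS table.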
2. To measure the performance of our test query, we can configure the NZSQL console to show the elapsed execution
time using the following command:
Output:
3. Run our given test query and note the performance:
Output:
4. Please rerun the query once or twice more to see roughly what a consistent query time is on your machine.
5. Now run the GROOM TABLE command on the ORDERS table again:
[nz@netezza labs]$ nzsql LABDB LABADMIN
LABDB(LABADMIN)=> UPDATE ORDERS SET O_TOTALPRICE = O_TOTALPRICE+1;
UPDATE 770587
LABDB(LABADMIN)=> \time
Query time printout on
LABDB(LABADMIN)=> SELECT COUNT(*) FROM ORDERS;
COUNT
--------
770587
(1 row)
Elapsed time: 0m0.502s
LABDB(LABADMIN)=> GROOM TABLE ORDERS;
The output should be:
Can you tell how much disk space this saved? (It’s the number of extents times 3MB)
6. Now run our chosen test query again and you should see a difference in performance:
Output:
You should see that the query ran faster than before. This is because GROOM reduced the number of rows that must be scanned
to complete the query. The COUNT(*) command on the table will return the same number of rows before and after the GROOM
command was run, since it can only see the current version of the table, which means all rows that have not been deleted by a
lower transaction ID. Since our UPDATE command hasn’t changed the number of logical rows, this will not change.
Nevertheless the outdated rows, which have been logically deleted by our UPDATE command, are still present on disk. The
COUNT(*) query cannot access these rows, but they do take up space on disk and need to be scanned. GROOM is used to purge
these logically deleted rows from disk, since they increase disk usage and scan time. You should GROOM tables that receive
frequent updates or deletes more often than tables that are seldom updated. You might want to schedule tasks that routinely
GROOM the frequently updated tables or run a GROOM command as part of your ETL process.
5 Changing the Data Type of a Column
In some situations you will realize that the initially chosen data types are not suitable for long-term use, for example because new
entries exceed the range of an initially picked integer type. You cannot directly change the data type by using the ALTER
statement, but there are two approaches that allow you to do it without unloading and reloading the data.
The first approach is to:
• Create a CTAS table from the old table with a CAST to the new datatype for the column you want to change
• Drop the old table
• Rename the new table to the name of the old table
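The three steps of this first approach can be tried out in a small, self-contained sketch. Note the assumptions: it uses Python's built-in sqlite3 module as a stand-in database, not PureData System, the table is reduced to two sample rows, and the R_COMMENT column is declared as plain TEXT. SQLite does not enforce CHAR lengths, so the sketch only demonstrates the order of the statements and that the column order is preserved.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in database, not a PureData System appliance
con.execute("CREATE TABLE region (r_regionkey INTEGER, r_name CHAR(25), r_comment TEXT)")
con.executemany(
    "INSERT INTO region VALUES (?, ?, ?)",
    [(1, "na", "north america"), (2, "sa", "south america")],
)

# Step 1: CTAS from the old table with a CAST to the new data type
con.execute(
    "CREATE TABLE region_new AS "
    "SELECT r_regionkey, CAST(r_name AS CHAR(40)) AS r_name, r_comment FROM region"
)
# Step 2: drop the old table
con.execute("DROP TABLE region")
# Step 3: rename the new table to the name of the old table
con.execute("ALTER TABLE region_new RENAME TO region")

columns = [info[1] for info in con.execute("PRAGMA table_info(region)")]
rows = con.execute("SELECT * FROM region ORDER BY r_regionkey").fetchall()
print(columns)
print(rows)
```

The column list of the renamed table is in the same order as in the original table, which is the property the lab highlights for this approach.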
In general this is a good approach because it lets you keep the order of the columns. But in this example we will use a second
approach to highlight the groom command and its role during ADD and DROP column commands. Its disadvantage is that the
order of the columns will change, which may cause difficulties for third-party applications that access columns by their position.
In this chapter we will:
• Add a new column to the table with the new datatype
• Copy over all values from the old column to the new one with an UPDATE command
• Drop the old column
LABDB(ADMIN)=> GROOM TABLE ORDERS;
NOTICE: Groom processed 616 pages; purged 770587 records; scan size shrunk by 308 pages; table size shrunk by 16 extents.
GROOM RECORDS ALL
LABDB(LABADMIN)=> SELECT COUNT(*) FROM ORDERS;
COUNT
--------
770587
(1 row)
Elapsed time: 0m0.315s
• Rename the new column to the name of the old one
• Use the groom command to materialize the results of our table changes
In our example we have a new region that we want to add to the REGION table, “Australia, New Zealand, and Tasmania”, whose
name exceeds the limit of the CHAR(25) field R_NAME. We therefore decide to change the R_NAME
field to a CHAR(40) field.
1. Add a new column to the REGION table with the name R_NAME_TEMP and data type CHAR(40):
Notice that the ALTER command is practically instantaneous. This even holds true for huge tables. Under the covers the
system will create a new empty version of the table. It will not lock and change the whole table.
2. Let’s insert a row into the table using the new name column:
3. Now do a select on the table:
You should get the following results:
You can see that the results are exactly as you would expect them to be, but how does the system actually achieve this?
Remember that inside the PureData System appliance we now have two versions of the table, one containing the old columns and
rows and one containing the new column and the new row.
4. Let’s do an EXPLAIN on the SELECT query:
You should get the following results:
LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT * FROM REGION;
LABDB(LABADMIN)=> INSERT INTO REGION VALUES (5,'', 'South Pacific Region',
'Australia, New Zealand, and Tasmania');
DELETE 39099
LABDB(LABADMIN)=> SELECT * FROM REGION;
R_REGIONKEY | R_NAME | R_COMMENT | R_NAME_TEMP
-------------+---------------------------+-----------------------+----------------------------------
1 | na | north america |
5 | | South Pacific Region | Australia, New Zealand, and Tasmania
4 | ap | asia pacific |
2 | sa | south america |
3 | emea | europe, |
(5 rows)
LABDB(LABADMIN)=> SELECT * FROM REGION;
LABDB(LABADMIN)=> ALTER TABLE REGION ADD COLUMN R_NAME_TEMP CHAR(40);
Normally the query would result in a single table scan node. But now we see a more complicated query plan. The Optimizer
automatically translates the simple SELECT into a UNION of two tables. The two tables are internal and are called
_TV_315893_1, which is the old version of the table before the ALTER statement, and _TV_315893_2, which is the new
version of the table after the ALTER statement, containing the new column R_NAME_TEMP.
Notice that in the old table a 4th column of CHAR(40) with default value NULL is added. This is necessary for the UNION to
succeed. The merger of those tables is done in Node 5, which takes both result sets and appends them.
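The way the two table versions are combined can be pictured with a few lines of Python. This is a simplified illustration, not the appliance's implementation: the scan_all_versions function and the sample tuples are hypothetical, and None stands in for the NULL that is projected as the fourth column of the old version.

```python
# Rows stored before ALTER TABLE ... ADD COLUMN (three columns)
old_version = [
    (1, "na", "north america"),
    (2, "sa", "south america"),
]
# Rows stored after the ALTER (four columns, including R_NAME_TEMP)
new_version = [
    (5, "", "South Pacific Region", "Australia, New Zealand, and Tasmania"),
]


def scan_all_versions(old_rows, new_rows):
    """Pad old rows with NULL for the missing column, then append both result sets."""
    padded_old = [row + (None,) for row in old_rows]
    return new_rows + padded_old


result = scan_all_versions(old_version, new_version)
for row in result:
    print(row)
```

Every row in the combined result has four columns, which is why the padding step is needed before the two result sets can be appended.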
But let’s proceed with our data type change operation.
5. First let’s remove the new row again:
6. Now we will move all values of the R_NAME column to the R_NAME_TEMP column by updating them:
7. Let’s have a look at the table again:
LABDB(LABADMIN)=> DELETE FROM REGION WHERE R_REGIONKEY > 4;
LABDB(LABADMIN)=> UPDATE REGION SET R_NAME_TEMP = R_NAME;
LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT * FROM REGION;
NOTICE: QUERY PLAN:
QUERY SQL:
EXPLAIN VERBOSE SELECT * FROM REGION;
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan table ""_TV_315893_2"" {("_TV_315893_2".R_REGIONKEY)}]
-- Estimated Rows = 1, Width = 221, Cost = 0.0 .. 0.0, Conf = 100.0
User table: REGION version 2
Projections:
1:"_TV_315893_2".R_REGIONKEY 2:"_TV_315893_2".R_NAME
3:"_TV_315893_2".R_COMMENT 4:"_TV_315893_2".R_NAME_TEMP
Node 2.
[SPU Sub-query Scan table "*SELECT* 1" Node "1" {(0."1")}]
-- Estimated Rows = 1, Width = 221, Cost = 0.0 .. 0.0, Conf = 0.0
Projections:
1:0."1" 2:0."2" 3:0."3" 4:0."4"
Node 3.
[SPU Sequential Scan table ""_TV_315893_1"" {("_TV_315893_1".R_REGIONKEY)}]
-- Estimated Rows = 8, Width = 221, Cost = 0.0 .. 0.0, Conf = 100.0
User table: REGION version 1
Projections:
1:"_TV_315893_1".R_REGIONKEY 2:"_TV_315893_1".R_NAME
3:"_TV_315893_1".R_COMMENT 4:(NULL::BPCHAR)::CHAR(40)
Node 4.
[SPU Sub-query Scan table "*SELECT* 2" Node "3" {(0."1")}]
-- Estimated Rows = 8, Width = 221, Cost = 0.0 .. 0.0, Conf = 0.0
Projections:
1:0."1" 2:0."2" 3:0."3" 4:0."4"
Node 5.
[SPU Append Nodes: , "2", "4 (stream)" {(0."1")}]
-- Estimated Rows = 9, Width = 221, Cost = 0.0 .. 0.0, Conf = 0.0
Projections:
1:0."1" 2:0."2" 3:0."3" 4:0."4"
Node 6.
[SPU Sub-query Scan table "_BV_315893" Node "5" {("_BV_315893".R_REGIONKEY)}]
-- Estimated Rows = 9, Width = 221, Cost = 0.0 .. 0.0, Conf = 100.0
Projections:
1:"_BV_315893".R_REGIONKEY 2:"_BV_315893".R_NAME 3:"_BV_315893".R_COMMENT
4:"_BV_315893".R_NAME_TEMP
[SPU Return]
[Host Return]
…
You should get the following results:
8. Now let’s remove the old column:
9. And rename the new column:
10. Let’s have a look at the table again:
You should get the following results:
We have succeeded in changing the data type of the R_NAME column. The column order has changed, but our R_NAME
column has the same values as before and now supports longer region names.
But we have one last step to do. Under the covers the system now has three different versions of the table, which are merged
for each call against the REGION table. This not only uses up space, it is also bad for query performance. So we have to
materialize these table changes with the groom command.
11. Groom the REGION table with the VERSIONS keyword to merge table versions:
You should get the following results:
LABDB(LABADMIN)=> SELECT * FROM REGION;
R_REGIONKEY | R_NAME | R_COMMENT | R_NAME_TEMP
-------------+---------------------------+-----------------------------+------------------------------------------
3 | emea | europe, middle east, africa | emea
1 | na | north america | na
2 | sa | south america | sa
4 | ap | asia pacific | ap
(4 rows)
LABDB(LABADMIN)=> ALTER TABLE REGION DROP COLUMN R_NAME RESTRICT;
LABDB(LABADMIN)=> ALTER TABLE REGION RENAME COLUMN R_NAME_TEMP TO R_NAME;
LABDB(LABADMIN)=> SELECT * FROM REGION;
R_REGIONKEY | R_COMMENT | R_NAME
-------------+-----------------------------+------------------------------------------
4 | asia pacific | ap
3 | europe, middle east, africa | emea
2 | south america | sa
1 | north america | na
(4 rows)
LABDB(LABADMIN)=> GROOM TABLE REGION VERSIONS;
NOTICE: Groom processed 8 pages; purged 5 records; scan size shrunk by 4 pages; table size shrunk by 4 extents.
GROOM VERSIONS
12. And finally we will look at the EXPLAIN output again:
You should get the following results:
Now this is much nicer. As we would expect we only have a single table scan snippet in the query plan and a single version
of the REGION table.
13. Finally we will return the REGION table to the old column ordering so that it does not interfere with future labs. To do this we
will use a CTAS statement:
14. Now drop the REGION table:
15. And finally rename the REGION_NEW table to make the transformation complete:
If a table can be inaccessible for a short period of time, using CTAS tables can be a better solution for changing data types
than using an ALTER TABLE statement.
In this lab you have looked behind the scenes of the PureData System appliances. You have seen how transactions are
implemented and we have shown different reasons for using the groom command. It not only removes rows that were logically
deleted by DELETE and UPDATE operations, as well as rows left by aborted INSERTs and loads, it also materializes table
changes and reorders clustered base tables.
LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT * FROM REGION;
NOTICE: QUERY PLAN:
QUERY SQL:
EXPLAIN VERBOSE SELECT * FROM REGION;
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan table "REGION" {(REGION.R_REGIONKEY)}]
-- Estimated Rows = 4, Width = 73, Cost = 0.0 .. 0.0, Conf = 100.0
Projections:
1:REGION.R_REGIONKEY 2:REGION.R_COMMENT 3:REGION.R_NAME
[SPU Return]
[Host Return]
…
LABDB(LABADMIN)=> CREATE TABLE REGION_NEW AS SELECT R.R_REGIONKEY, R.R_NAME,
R.R_COMMENT FROM REGION R;
LABDB(LABADMIN)=> DROP TABLE REGION;
LABDB(LABADMIN)=> ALTER TABLE REGION_NEW RENAME TO REGION;
© Copyright IBM Corporation 2011
All Rights Reserved.
IBM Canada
8200 Warden Avenue
Markham, ON
L6G 1C7
Canada
IBM, the IBM logo, ibm.com and Tivoli are trademarks or registered
trademarks of International Business Machines Corporation in the
United States, other countries, or both. If these and other
IBM trademarked terms are marked on their first occurrence in this
information with a trademark symbol (® or ™), these symbols indicate
U.S. registered or common law trademarks owned by IBM at the time
this information was published. Such trademarks may also be
registered or common law trademarks in other countries. A current list
of IBM trademarks is available on the Web at “Copyright and
trademark information” at ibm.com/legal/copytrade.shtml
Other company, product and service names may be trademarks or
service marks of others.
References in this publication to IBM products and services do not
imply that IBM intends to make them available in all countries in which
IBM operates.
No part of this document may be reproduced or transmitted in any form
without written permission from IBM Corporation.
Product data has been reviewed for accuracy as of the date of initial
publication. Product data is subject to change without notice. Any
statements regarding IBM’s future direction and intent are subject to
change or withdrawal without notice, and represent goals and
objectives only.
THE INFORMATION PROVIDED IN THIS DOCUMENT IS
DISTRIBUTED “AS IS” WITHOUT ANY WARRANTY, EITHER
EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE OR NON-INFRINGEMENT.
IBM products are warranted according to the terms and conditions of
the agreements (e.g. IBM Customer Agreement, Statement of Limited
Warranty, International Program License Agreement, etc.) under which
they are provided.
Stored Procedures
Hands-On Lab
IBM PureData System for Analytics … Powered by Netezza Technology
Table of Contents
1 Introduction
1.1 Objectives
2 Implementing the addCustomer stored procedure
2.1 Create Insert Stored Procedure
2.2 Adding integrity checks
2.3 Managing your stored procedure
3 Implementing the checkRegions stored procedure
1 Introduction
Stored Procedures are subroutines that are saved in PureData System. They are executed inside the database server and are
only available by accessing the NPS system. They combine the capabilities of SQL to query and manipulate database
information with capabilities of procedural programming languages, like branching and iterations. This makes them an ideal
solution for tasks like data validation, writing event logs or encrypting data. They are especially suited for repetitive tasks that
can be easily encapsulated in a sub-routine.
1.1 Objectives
In the last labs we have created our database, loaded the data and done some optimization and administration tasks.
In this lab we will enhance the database with a couple of stored procedures. As we mentioned in a previous chapter, PureData
System doesn’t check referential or unique constraints. This is normally not critical since data loading in a data warehousing
environment is a controlled task. In our PureData System implementation we have a requirement to allow some
non-administrative database users to add new customers to the customer table. This happens rarely, so there are no performance
requirements, and we have decided to implement this with a stored procedure that is accessible to these users and checks the
input values and referential constraints.
In the second part we will implement a business logic function as a stored procedure that returns a result set.
Figure 1 LABDB database
2 Implementing the addCustomer stored procedure
In this chapter we will create the stored procedure to insert data into the customer table. The information that is added for a new
customer will be the customer key, name, phone number and nation; the rest of the information is updated through other
processes.
2.1 Create Insert Stored Procedure
First we will review the customer table and define the interface of the insert stored procedure.
1. Connect to your Netezza image using PuTTY. Log in to 192.168.239.2 as user “nz” with password “nz”. (192.168.239.2 is
the default IP address for a local VM; the IP may be different for your Bootcamp.)
2. Access the lab directory for this lab with the following command. This folder already contains empty files for the stored
procedure scripts we will create later. If you want, review them with the ls command:
3. Enter nzsql and connect to LABDB as user LABADMIN.
4. Describe the customer table with the following command: \d customer
You should see the following:
We will now create a stored procedure that adds a new customer entry and sets the 4 fields C_CUSTKEY, C_NAME,
C_NATIONKEY, and C_PHONE. All other fields will be set to an empty value or 0, since the fields are flagged as NOT
NULL.
5. To create a stored procedure we will use the internal vi editor of the nzsql console. Open the already existing empty file
addCustomer.sql with the following command (note that you can tab-complete the filename):
6. You are now in the familiar vi interface and you can edit the file. Switch to INSERT mode by pressing “i”.
7. We will now create the interface of the stored procedure so we can test creating it. We need the 4 input fields mentioned
above and will return an integer return code. Enter the text as seen in the following, then exit the insert mode by
pressing ESC, type :wq! and press Enter to save the file and quit vi.
[nz@netezza ~]$ cd /labs/storedProcedure/
[nz@netezza labs]$ nzsql LABDB LABADMIN
LABDB(ADMIN)=> \d customer
Table "CUSTOMER"
Attribute | Type | Modifier | Default Value
--------------+------------------------+----------+---------------
C_CUSTKEY | INTEGER | NOT NULL |
C_NAME | CHARACTER VARYING(25) | NOT NULL |
C_ADDRESS | CHARACTER VARYING(40) | NOT NULL |
C_NATIONKEY | INTEGER | NOT NULL |
C_PHONE | CHARACTER(15) | NOT NULL |
C_ACCTBAL | NUMERIC(15,2) | NOT NULL |
C_MKTSEGMENT | CHARACTER(10) | NOT NULL |
C_COMMENT | CHARACTER VARYING(117) | NOT NULL |
Distributed on hash: "C_CUSTKEY"
LABDB(ADMIN)=> \e addCustomer.sql
The minimal stored procedure we create here doesn’t yet do anything, since it has an empty body. We simply create the
signature with the input and output variables. We use the command CREATE OR REPLACE so we can later execute the
same command multiple times to update the stored procedure with more code.
The input variables cannot be given names so we only add the datatypes for our input parameters key, name, nation and
phone. We also return an integer return code.
Note that we have to specify the procedure language even though NZPLSQL is the only available option in PureData
System.
8. Back in the nzsql command line execute the script we just created with \i addCustomer.sql
You should see that the procedure has been created successfully.
9. Display all stored procedures in the LABDB database with the following command:
You will see the following result:
You can see the procedure ADDCUSTOMER with the arguments we specified.
10. Execute the stored procedure with the following dummy input parameters:
CREATE OR REPLACE PROCEDURE addCustomer(integer, varchar(25), integer, varchar(15))
LANGUAGE NZPLSQL RETURNS INT4 AS
BEGIN_PROC
END_PROC;
~
~
~
~
~
:wq!
LABDB(ADMIN)=> \i addCustomer.sql
CREATE PROCEDURE
LABDB(ADMIN)=> show procedure;
RESULT | PROCEDURE | BUILTIN | ARGUMENTS
---------+-------------+---------+------------------------------------------------------------------
INTEGER | ADDCUSTOMER | f | (INTEGER, CHARACTER VARYING(25), INTEGER, CHARACTER VARYING(15))
(1 row)
LABDB(LABADMIN)=> call addcustomer(1,'test', 2, 'test');
You should see the following:
The result shows that we have a syntax error in our stored procedure. Every stored procedure needs at least one BEGIN ..
END block that encapsulates the code that is to be executed. Stored procedures are compiled when they are first executed,
not when they are created; therefore errors in the code can only be seen during execution.
11. Switch back to the vi view with the following command:
12. Switch to insert mode by pressing “i”.
13. We will now create a simple stored procedure that inserts the new entry into the customer table. But first we will add
some variables that alias the input variables $1, $2 etc. After the BEGIN_PROC statement enter the following lines:
Each BEGIN..END block in the stored procedure can have its own DECLARE section. Variables are valid in the block they
belong to. It is good practice to alias the input parameters with readable variable names to make the stored
procedure code maintainable. We will later add some additional parameters to our procedures as well.
Be careful not to use variable names that are reserved by PureData System, for example NAME.
14. Next we will add the BEGIN..END block with the INSERT statement.
This statement will add a new row to the customer table using the input variables. It will replace the remaining fields like
account balance with default values that can be later filled. It is also possible to execute dynamic SQL queries which we will
do in a later chapter.
Your complete stored procedure should now look like the following:
LABDB(LABADMIN)=> call addcustomer(1,'test', 2, 'test');
NOTICE: plpgsql: ERROR during compile of ADDCUSTOMER near line 1
ERROR: syntax error, unexpected <EOF>, expecting BEGIN at or near ""
LABDB(LABADMIN)=> \e addCustomer.sql
DECLARE
    C_KEY ALIAS FOR $1;
    C_NAME ALIAS FOR $2;
    N_KEY ALIAS FOR $3;
    PHONE ALIAS FOR $4;
BEGIN
    INSERT INTO CUSTOMER VALUES (C_KEY, C_NAME, '', N_KEY, PHONE, 0, '', '');
END;
15. Save and exit vi again by pressing ESC to enter the command mode, entering “:wq!” and pressing Enter. This will
bring you back to the nzsql console.
16. Execute the stored procedure script with the following command: \i addCustomer.sql
17. Now let’s try our stored procedure. Let’s add a new customer John Smith with customer key 999999, phone number 555-5555
and nation 2 (which is the key for the United States in our nation table). You can also check first that the customer
doesn’t yet exist if you want.
You should get the following results:
18. Let’s check if the insert was successful:
You should get the following results:
We can see that our insert was successful. Congratulations, you have built your first PureData System stored procedure.
LABDB(LABADMIN)=> SELECT * FROM CUSTOMER WHERE C_CUSTKEY = 999999;
C_CUSTKEY | C_NAME | C_ADDRESS | C_NATIONKEY | C_PHONE | C_ACCTBAL
| C_MKTSEGMENT | C_COMMENT
-----------+------------+-----------+-------------+-----------------+-----------
+--------------+-----------
999999 | John Smith | | 2 | 555-5555 | 0.00
| |
(1 row)
LABDB(LABADMIN)=> CALL addcustomer(999999,'John Smith', 2, '555-5555');
ADDCUSTOMER
-------------
(1 row)
LABDB(LABADMIN)=> SELECT * FROM CUSTOMER WHERE C_CUSTKEY = 999999;
LABDB(LABADMIN)=> CALL addCustomer(999999,'John Smith', 2, '555-5555');
CREATE OR REPLACE PROCEDURE addCustomer(integer, varchar(25), integer, varchar(15))
LANGUAGE NZPLSQL RETURNS INT4 AS
BEGIN_PROC
DECLARE
C_KEY ALIAS for $1;
C_NAME ALIAS for $2;
N_KEY ALIAS for $3;
PHONE ALIAS for $4;
BEGIN
INSERT INTO CUSTOMER VALUES (C_KEY, C_NAME, '', N_KEY, PHONE, 0, '', '');
END;
END_PROC;
2.2 Adding integrity checks
In this chapter we will add integrity checks to the stored procedure we just created. We will make sure that no duplicate
customer is entered into the CUSTOMER table by querying it before the insert. We will then check with an IF condition whether the
value has already been inserted into the CUSTOMER table and abort the insert in that case. We will also check the foreign key
relationship to the NATION table and make sure that no customer is inserted for a nation that doesn’t exist. If any of these
conditions isn’t met the procedure will abort and display an error message.
1. Switch back to the VI view of the procedure with the following command. In case of a message warning about duplicate
files press Enter.

    LABDB(LABADMIN)=> \e addCustomer.sql

2. Switch to insert mode by pressing “i”.
3. Add a new variable REC with the type RECORD in the DECLARE section:

    REC RECORD;

A RECORD is a row set with dynamic fields. It can refer to any row that is selected in a SELECT INTO statement. You can
later refer to its fields with, for example, REC.C_PHONE.
4. Add the following statement before the INSERT statement:

    SELECT * INTO REC FROM CUSTOMER WHERE C_CUSTKEY = C_KEY;

This statement fills the REC variable with the results of the query. If there are already one or more customers
with the specified key it will contain the first. Otherwise the variable will be null.
5. Now we add the IF condition to abort the stored procedure in case a record already exists. After the newly added
SELECT statement add the following lines:

    IF FOUND REC THEN
        RAISE EXCEPTION 'Customer with key % already exists', C_KEY;
    END IF;

In this case we use an IF condition to check if a customer record with the key already exists and has been selected by the
previous SELECT statement. We could do an explicit check on the record or any of its fields and see if it compares to the null
value, but PureData System provides a number of special variables that make this more convenient.
• FOUND specifies if the last SELECT INTO statement has returned any records
• ROW_COUNT contains the number of rows found by the last SELECT INTO statement
• LAST_OID is the object id of the last inserted row; this variable is not very useful unless used for catalog tables.
Finally we use a RAISE EXCEPTION statement to throw an error and abort the stored procedure. To add variable values to
the message string use the % symbol anywhere in the string. This is similar to the approach used, for example, by the C printf
function.
6. We will also check the foreign key relationship to NATION. Add the following lines after the last ones:

    SELECT * INTO REC FROM NATION WHERE N_NATIONKEY = N_KEY;
    IF NOT FOUND REC THEN
        RAISE EXCEPTION 'No Nation with nation key %', N_KEY;
    END IF;

This is very similar to the last check, only that this time we check if a record was NOT found. Notice that we can reuse the
REC record since it is not typed to a particular table.
Your stored procedure should now look like the following:

    CREATE OR REPLACE PROCEDURE addCustomer(integer, varchar(25), integer, varchar(15))
    LANGUAGE NZPLSQL RETURNS INT4 AS
    BEGIN_PROC
    DECLARE
        C_KEY ALIAS for $1;
        C_NAME ALIAS for $2;
        N_KEY ALIAS for $3;
        PHONE ALIAS for $4;
        REC RECORD;
    BEGIN
        SELECT * INTO REC FROM CUSTOMER WHERE C_CUSTKEY = C_KEY;
        IF FOUND REC THEN
            RAISE EXCEPTION 'Customer with key % already exists', C_KEY;
        END IF;
        SELECT * INTO REC FROM NATION WHERE N_NATIONKEY = N_KEY;
        IF NOT FOUND REC THEN
            RAISE EXCEPTION 'No Nation with nation key %', N_KEY;
        END IF;
        INSERT INTO CUSTOMER VALUES (C_KEY, C_NAME, '', N_KEY, PHONE, 0, '', '');
    END;
    END_PROC;

7. Save the stored procedure by pressing ESC, and then entering “:wq!” and pressing Enter.
8. In nzsql create the stored procedure from the script by executing the following command (remember that you can
cycle through previous commands by pressing the UP key):

    LABDB(LABADMIN)=> \i addCustomer.sql

9. Now let’s test the check for duplicate customer ids by repeating our last CALL statement. We already know that a
customer record with the id 999999 exists:

    LABDB(LABADMIN)=> CALL addCustomer(999999,'John Smith', 2, '555-5555');

You should get the following results:

    LABDB(LABADMIN)=> CALL addCustomer(999999,'John Smith', 2, '555-5555');
    ERROR: Customer with key 999999 already exists
    LABDB(LABADMIN)=>

This is what we expected: the key value already exists and our first error condition is thrown.
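The two checks can also be tried outside the appliance. The following is a minimal sketch in Python that uses an in-memory SQLite database as a stand-in for PureData System; the helper names `make_demo_db` and `add_customer`, the reduced NATION table and its single row are assumptions made for illustration and are not part of the lab. It follows the same order as the procedure: duplicate check, nation check, then the insert with placeholder defaults.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory stand-in for the lab tables (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nation (n_nationkey INTEGER, n_name TEXT)")
    conn.execute(
        "CREATE TABLE customer (c_custkey INTEGER, c_name TEXT, c_address TEXT,"
        " c_nationkey INTEGER, c_phone TEXT, c_acctbal REAL,"
        " c_mktsegment TEXT, c_comment TEXT)"
    )
    conn.execute("INSERT INTO nation VALUES (2, 'UNITED STATES')")
    return conn


def add_customer(conn, c_key, c_name, n_key, phone):
    """Insert a customer only if the key is new and the nation exists."""
    cur = conn.cursor()
    # Duplicate check: plays the role of SELECT ... INTO REC plus the FOUND test.
    cur.execute("SELECT 1 FROM customer WHERE c_custkey = ?", (c_key,))
    if cur.fetchone() is not None:
        raise ValueError(f"Customer with key {c_key} already exists")
    # Foreign key check: the nation must be present.
    cur.execute("SELECT 1 FROM nation WHERE n_nationkey = ?", (n_key,))
    if cur.fetchone() is None:
        raise ValueError(f"No Nation with nation key {n_key}")
    # Remaining columns receive the same placeholder defaults as in the lab.
    cur.execute(
        "INSERT INTO customer VALUES (?, ?, '', ?, ?, 0, '', '')",
        (c_key, c_name, n_key, phone),
    )
    conn.commit()


if __name__ == "__main__":
    db = make_demo_db()
    add_customer(db, 999999, "John Smith", 2, "555-5555")
    try:
        add_customer(db, 999999, "John Smith", 2, "555-5555")
    except ValueError as err:
        print("ERROR:", err)
```

As in the stored procedure, the checks and the insert are separate statements, so this pattern suits occasional maintenance rather than bulk loads.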
10. Now let’s check the foreign key integrity by executing the following command with a customer id that does not yet exist
and a nation key that doesn’t exist in the NATION table either. You can double check this using SELECT statements if
you want:

    LABDB(LABADMIN)=> CALL addCustomer(999998,'James Brown', 99, '555-5555');

You should see the following output:

    LABDB(LABADMIN)=> CALL addCustomer(999998,'James Brown', 99, '555-5555');
    ERROR: No Nation with nation key 99
    LABDB(LABADMIN)=>

This is also as we expected. The customer key didn’t yet exist, so the first IF condition does not raise an error, but the check for
the nation key throws an error.
11. Finally let’s try a working example. Execute the following command with a customer id that doesn’t yet exist and the
nation key 2 for United States.

    LABDB(LABADMIN)=> CALL addCustomer(999998,'James Brown', 2, '555-5555');

You should see a successful execution.
12. Let’s check that the value was correctly inserted:

    LABDB(LABADMIN)=> SELECT C_CUSTKEY, C_NAME FROM CUSTOMER WHERE C_CUSTKEY = 999998;

This will give you the following results:

    LABDB(LABADMIN)=> SELECT C_CUSTKEY, C_NAME FROM CUSTOMER WHERE C_CUSTKEY = 999998;
     C_CUSTKEY |   C_NAME
    -----------+-------------
        999998 | James Brown
    (1 row)
    LABDB(LABADMIN)=>

We have successfully created a stored procedure that can be used to insert values into the CUSTOMER table and that checks
for unique and foreign key constraints. You should remember that PureData System isn’t optimized for lookup queries, so
this will be a pretty slow operation and shouldn’t be used for thousands of inserts. But for occasional management tasks it is a
perfectly valid solution to the problem of missing constraints in PureData System.
2.3 Managing your stored procedure
In the last chapters we have created a stored procedure that inserts values into the CUSTOMER table and checks constraints.
We will now give a user the right to execute this procedure, and we will use the management functions to make changes to the
stored procedure and verify them.
1. First we will create a user custadmin who will be responsible for adding customers. To do this we will need to switch to
the admin user, since users are global objects:

    LABDB(LABADMIN)=> \c labdb admin
2. Now we create the user:

    LABDB(ADMIN)=> create user custadmin with password 'password';

You can see that the user has the same password as the other users in our labs. We do this for simplicity, since it allows us to
omit the password during user switches. This would of course not be done in a production environment.
3. Now we will grant access to the labdb database, otherwise the user couldn’t log on:

    LABDB(ADMIN)=> grant list, select on labdb to custadmin;

4. Finally we will grant the user the right to select from the customer table, which is needed to verify any changes
made:

    LABDB(ADMIN)=> grant select on customer to custadmin;

5. Now let’s test this out. First switch to the user custadmin:

    LABDB(ADMIN)=> \c labdb custadmin

6. Now try to select something from the NATION table to verify that the user only has access to the customer table:

    LABDB(CUSTADMIN)=> select * from nation;

You should get the message that access is refused:

    LABDB(CUSTADMIN)=> select * from nation;
    ERROR: Permission denied on "NATION".
    LABDB(CUSTADMIN)=>

7. Now let’s select something from the CUSTOMER table:

    LABDB(CUSTADMIN)=> select c_custkey, c_name from customer where c_custkey = 999998;

The user should be able to select the row from the CUSTOMER table:

    LABDB(CUSTADMIN)=> select c_custkey, c_name from customer where c_custkey = 999998;
     C_CUSTKEY |   C_NAME
    -----------+-------------
        999998 | James Brown
    (1 row)
    LABDB(CUSTADMIN)=>

8. Finally let’s verify that the user doesn’t have INSERT rights on the table:

    LABDB(CUSTADMIN)=> INSERT INTO CUSTOMER VALUES (1, '','',1,'',1,'','');

The insert into the customer table will be refused:

    LABDB(CUSTADMIN)=> INSERT INTO CUSTOMER VALUES (1, '','',1,'',1,'','');
    ERROR: Permission denied on "CUSTOMER".
    LABDB(CUSTADMIN)=>

9. We now need to switch back to the admin user to give custadmin the rights to execute the stored procedure:

    LABDB(CUSTADMIN)=> \c labdb admin

10. To grant the right to execute a specific stored procedure we need to specify the full name including all input parameters.
The easiest way to get these in the correct syntax is to first list them with the SHOW PROCEDURE command:

    LABDB(ADMIN)=> show procedure all;

You should see the following screen. You can either cut and paste the arguments or copy them manually:

    LABDB(ADMIN)=> show procedure all;
     RESULT  |  PROCEDURE  | BUILTIN | ARGUMENTS
    ---------+-------------+---------+------------------------------------------------------------------
     INTEGER | ADDCUSTOMER | f       | (INTEGER, CHARACTER VARYING(25), INTEGER, CHARACTER VARYING(15))
    (1 row)

11. Now grant the right to execute this stored procedure to CUSTADMIN:

    LABDB(ADMIN)=> grant execute on addcustomer(INTEGER, CHARACTER VARYING(25), INTEGER,
    CHARACTER VARYING(15)) to custadmin;

12. Let’s check the rights of the custadmin user now with: \dpu custadmin
You should get the following results:

    LABDB(ADMIN)=> \dpu custadmin
    User object permissions for user 'CUSTADMIN'
     Database Name | Object Name | L S I U D T L A D B L G O E C R X A | D G U T E X Q Y V M I B R C S H F A L P N S
    ---------------+-------------+-------------------------------------+---------------------------------------------
     LABDB         | ADDCUSTOMER |                           X         |
     LABDB         | CUSTOMER    |   X                                 |
     GLOBAL        | LABDB       | X X                                 |
    (3 rows)
    Object Privileges
    (L)ist (S)elect (I)nsert (U)pdate (D)elete (T)runcate (L)ock
    (A)lter (D)rop a(B)ort (L)oad (G)enstats Gr(O)om (E)xecute
    Label-A(C)cess Label-(R)estrict Label-E(X)pand Execute-(A)s
    Administration Privilege
    (D)atabase (G)roup (U)ser (T)able T(E)mp E(X)ternal Se(Q)uence
    S(Y)nonym (V)iew (M)aterialized View (I)ndex (B)ackup (R)estore
    va(C)uum (S)ystem (H)ardware (F)unction (A)ggregate (L)ibrary
    (P)rocedure U(N)fence (S)ecurity

You can see that the user has only the rights we have granted: selecting data from the customer table and executing
our stored procedure. The user is not allowed to change the customer table directly or execute anything but the stored
procedure.
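Statements that address a stored procedure, such as the GRANT in step 11, need the exact signature reported by SHOW PROCEDURE. As a minimal sketch, the statement text can be assembled in Python from the procedure name and the argument types; the helper name `procedure_signature` is hypothetical, and only the GRANT form already used in this lab is generated.

```python
def procedure_signature(name, argument_types):
    """Return name(type, type, ...) in the form expected by GRANT."""
    return f"{name}({', '.join(argument_types)})"


# Argument types exactly as listed in the ARGUMENTS column of SHOW PROCEDURE.
ARGS = ["INTEGER", "CHARACTER VARYING(25)", "INTEGER", "CHARACTER VARYING(15)"]

signature = procedure_signature("addcustomer", ARGS)
grant_statement = f"grant execute on {signature} to custadmin;"
print(grant_statement)
```

The printed line can then be pasted into the nzsql console, which avoids retyping the argument list by hand.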
13. Let’s test this. Switch to the custadmin user with the following command: \c labdb custadmin
14. Add another customer to the customer table:

    LABDB(CUSTADMIN)=> CALL addCustomer(999997,'Jake Jones', 2, '555-5554');

The insert will have been successful and you will have another row in your table. You can check this with a SELECT query if
you want.
15. We will now make some changes to the stored procedure. To do this we need to switch back to the administrative
account:

    LABDB(CUSTADMIN)=> \c labdb admin

16. Before we modify the stored procedure, let’s have a detailed look at it:

    LABDB(ADMIN)=> show procedure addcustomer verbose;

You should see the following screen:

    LABDB(ADMIN)=> show procedure addcustomer verbose;
    RESULT | PROCEDURE | BUILTIN | ARGUMENTS | OWNER | EXECUTEDASOWNER | VARARGS | DESCRIPTION |
    PROCEDURESOURCE
    ------
    INTEGER | ADDCUSTOMER | f | (INTEGER, CHARACTER VARYING(25), INTEGER, CHARACTER VARYING(15)) | ADMIN | t |
    f | |
    DECLARE
        C_KEY ALIAS for $1;
        C_NAME ALIAS for $2;
        N_KEY ALIAS for $3;
        PHONE ALIAS for $4;
        REC RECORD;
    BEGIN
        SELECT * INTO REC FROM CUSTOMER WHERE C_CUSTKEY = C_KEY;
        IF FOUND REC THEN
            RAISE EXCEPTION 'Customer with key % already exists', C_KEY;
        END IF;
        SELECT * INTO REC FROM NATION WHERE N_NATIONKEY = N_KEY;
        IF NOT FOUND REC THEN
            RAISE EXCEPTION 'No Nation with nation key %', N_KEY;
        END IF;
        INSERT INTO CUSTOMER VALUES (C_KEY, C_NAME, '', N_KEY, PHONE, 0, '', '');
    END;

You can see the input and output arguments, procedure name, owner, whether it is executed as owner or caller, and other details.
Verbose also shows you the source code of the stored procedure. We see that the description field is still empty, so let’s add
a comment to the stored procedure. This is important if you have a large number of stored procedures in your system.
Note: nzadmin is a very convenient way to manage your stored procedures. It provides most of the management functionality
used in this lab in a graphical UI.
17. Add a description to the stored procedure:

    LABDB(ADMIN)=> comment on procedure addcustomer(INTEGER, CHARACTER VARYING(25),
    INTEGER, CHARACTER VARYING(15)) IS 'This procedure adds a new customer entry to the
    CUSTOMER table';

It is necessary to specify the exact stored procedure signature including the input arguments. These can be cut and pasted from
the output of the SHOW PROCEDURE command. The COMMENT ON command can be used to add descriptions to more or less
all database objects you own, from procedures and tables down to columns.
18. Verify that your description has been set:

    LABDB(ADMIN)=> show procedure addcustomer verbose;

The description field will now contain your comment:

    LABDB(ADMIN)=> show procedure addcustomer verbose;
    RESULT | PROCEDURE | BUILTIN | ARGUMENTS | OWNER | EXECUTEDASOWNER | VARARGS | DESCRIPTION |
    PROCEDURESOURCE
    ------
    INTEGER | ADDCUSTOMER | f | (INTEGER, CHARACTER VARYING(25), INTEGER, CHARACTER VARYING(15)) | ADMIN | t |
    f | This procedure adds a new customer entry to the CUSTOMER table |
    …
    …

19. We will now alter the stored procedure to be executed as the caller instead of the owner. This means that whoever
executes the stored procedure needs to have access rights to all the objects that are touched in the stored procedure,
otherwise it will fail. This should be the default for stored procedures that encapsulate business logic and do not do
extensive data checking:

    LABDB(ADMIN)=> alter procedure addcustomer(INTEGER, CHARACTER VARYING(25), INTEGER,
    CHARACTER VARYING(15)) execute as caller;

20. Since the admin user has access to the customer table, the admin user will be able to execute the stored procedure:

    LABDB(ADMIN)=> call addCustomer(999996,'Karl Schwarz', 2, '555-5553');

21. Now let’s switch to the custadmin user: \c labdb custadmin
22. Try to add another customer as custadmin:

    LABDB(CUSTADMIN)=> call addCustomer(999995, 'John Schwarz', 2, '555-5553');

You should see the following results:

    LABDB(CUSTADMIN)=> CALL addCustomer(999995,'John Schwarz', 2, '555-5552');
    NOTICE: Error occurred while executing PL/pgSQL function ADDCUSTOMER
    NOTICE: line 12 at select into variables
    ERROR: Permission denied on "NATION".

As expected the stored procedure fails now. The user custadmin has read access to the CUSTOMER table but no read
access to the NATION table, therefore this check results in an exception. While EXECUTE AS CALLER is more secure in
some circumstances, it doesn’t fit our use case, where we specifically want to expose some data modification ability to a user
who shouldn’t be able to modify a table otherwise. Therefore we will change the stored procedure back in the next two steps.
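The difference between EXECUTE AS OWNER and EXECUTE AS CALLER seen in steps 19 to 22 can be modelled in a few lines. This is a simplified sketch in Python, not a description of how PureData System evaluates privileges internally; the `GRANTS` dictionary, the access list and the function names are assumptions chosen to mirror the grants made in this lab, and ADMIN is treated as holding every privilege.

```python
# Privileges as granted in this lab; ADMIN is assumed to hold every privilege.
GRANTS = {
    "CUSTADMIN": {("CUSTOMER", "select"), ("ADDCUSTOMER", "execute")},
}

# Table accesses performed by the addCustomer body, in execution order.
ADDCUSTOMER_ACCESSES = [
    ("CUSTOMER", "select"),
    ("NATION", "select"),
    ("CUSTOMER", "insert"),
]


def has_privilege(user, obj, privilege):
    """True if the user may perform the given operation on the object."""
    return user == "ADMIN" or (obj, privilege) in GRANTS.get(user, set())


def call_addcustomer(caller, owner="ADMIN", execute_as="owner"):
    """Return 'ok' or the first permission error the call would run into."""
    if not has_privilege(caller, "ADDCUSTOMER", "execute"):
        return 'Permission denied on "ADDCUSTOMER".'
    # The body runs with the owner's rights or with the caller's rights.
    effective_user = owner if execute_as == "owner" else caller
    for obj, privilege in ADDCUSTOMER_ACCESSES:
        if not has_privilege(effective_user, obj, privilege):
            return f'Permission denied on "{obj}".'
    return "ok"


if __name__ == "__main__":
    print(call_addcustomer("CUSTADMIN", execute_as="owner"))
    print(call_addcustomer("CUSTADMIN", execute_as="caller"))
```

In this model custadmin succeeds when the body runs with the owner's rights and fails on NATION when it runs with the caller's rights, which is the behaviour observed in step 22.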
23. First switch back to the admin user: \c labdb admin
24. Change the stored procedure back to being executed as owner:

    LABDB(ADMIN)=> alter procedure addcustomer(INTEGER, CHARACTER VARYING(25), INTEGER,
    CHARACTER VARYING(15)) execute as owner;

In this chapter you have set up the permissions for the addCustomer stored procedure and the user custadmin who is
supposed to use it. You also added comments to the stored procedure.
3 Implementing the checkRegions stored procedure
In this chapter we will implement a stored procedure that performs a check on all rows of the REGION table. The call of the stored
procedure will be very simple and will not contain input arguments. The stored procedure is used to encapsulate a sanity check
of the REGION table that is executed regularly in the PureData System for administrative purposes.
Our stored procedure will check each row of the REGION table for three things:
1. If the region key is smaller than 1
2. If the name string is empty
3. If the description contains upper case characters; it must be lower case only for application reasons.
The procedure will return each row of the REGION table together with additional columns that describe if the above constraints are
broken. It will also return a notice with the number of faulty rows.
This chapter will teach you to use loops in a stored procedure and to return table results. You will also use dynamic query
execution to create queries on the fly.
You should be familiar with the use of VI for the development of stored procedures from the last chapter. Alternatives to using a
standard text editor for the creation of your stored procedure would be the use of a graphical development environment like
Aginity or the PureData System Eclipse plugins that can be downloaded from the PureData System Developer Network.
1. Open the already existing empty file checkRegion.sql with the following command (note that you can tab-complete the filename):

    LABDB(ADMIN)=> \e checkRegion.sql

2. You are now in the familiar VI interface and you can edit the file. Switch to INSERT mode by pressing “i”.
3. First we will define the stored procedure header, similar to the last procedure. It will be very simple since we will not use
any input arguments. Enter the following code in the editor:

    CREATE OR REPLACE PROCEDURE checkRegions() LANGUAGE NZPLSQL RETURNS REFTABLE(tb1) AS
    BEGIN_PROC
    END_PROC;
    ~
    ~
    ~
    ~
    -- INSERT --
Let’s have a detailed look at the RETURNS section. We want to return a result set but do not have to describe the column
names or datatypes of the table object that is returned. Instead we reference an existing table, which needs to exist at the
time the stored procedure is created. This means we will need to create the table TB1 before executing the CREATE
PROCEDURE command.
When the stored procedure is executed it creates, under the covers, an empty temporary table that has
the same definition as the referenced table. So the results will not actually be saved in the referenced table, which is only
used for the definition. This means that multiple stored procedures can be executed at the same time without influencing
each other. Since the created table is temporary it will be cleaned up once the connection to the database is closed.
Note: If the referenced table contains rows they will neither be changed nor copied over to the temporary table; the table is
strictly used for reference.
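The idea of a reference table that only supplies the column definition can be illustrated with a small sketch in Python, using an in-memory SQLite database as a stand-in. SQLite has no REFTABLE feature, so the helper `new_result_table` and the name `result_call_1` are assumptions that imitate the behaviour described above: a fresh, empty temporary table per call with the columns of the template.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Template table. The single row is only here to show that it is never copied.
conn.execute("CREATE TABLE tb1 (r_regionkey INTEGER, r_name TEXT, fieldempty INTEGER)")
conn.execute("INSERT INTO tb1 VALUES (1, 'na', 0)")


def new_result_table(conn, template, name):
    """Create an empty temporary table with the same columns as the template."""
    conn.execute(f"CREATE TEMP TABLE {name} AS SELECT * FROM {template} LIMIT 0")
    return name


# Each call would get its own result table; the name here is hypothetical.
result = new_result_table(conn, "tb1", "result_call_1")
conn.execute(f"INSERT INTO {result} VALUES (0, 'as', 1)")

# Column names are inherited from the template, the rows are not.
columns = [row[1] for row in conn.execute(f"PRAGMA table_info({result})")]
result_rows = conn.execute(f"SELECT * FROM {result}").fetchall()
template_rows = conn.execute("SELECT * FROM tb1").fetchall()
print(columns, result_rows, template_rows)
```

The template keeps its own row untouched while the result table only contains what the call inserted.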
4. For our stored procedure we need four variables. Add the following lines after the BEGIN_PROC statement:

    DECLARE
        rec RECORD;
        errorRows INTEGER;
        fieldEmpty BOOLEAN;
        descUpper BOOLEAN;

The four variables are needed for our stored procedure:
• rec is a RECORD structure. While we loop through the rows of the table we will use it to save and access the
values of each row and check them against our constraints
• errorRows will be used to contain the total number of rows that violate our constraints
• fieldEmpty will be used to store whether the row violates either the constraint that the name is empty or that the region key is
smaller than 1. This is appropriate since values of -1 or 0 in the region key are used to denote that it is empty
• descUpper will be true if a record violates the constraint that the description needs to be lowercase
5. We will now add the main BEGIN..END clause and initialize the errorRows variable. Add the following rows after the
DECLARE section:

    BEGIN
        RAISE NOTICE 'Start check of Region';
        errorRows := 0;
    END;

Each stored procedure must contain at least one BEGIN..END clause, which encapsulates the executed commands. We
also initially set the number of error rows to 0 and display a short sentence.
6. We will now add the main loop. It will iterate through all rows of the REGION table and store each row in the rec
variable. Add the following lines before the END statement:

    FOR rec IN SELECT * FROM REGION ORDER BY R_REGIONKEY LOOP
        fieldEmpty := false;
        descUpper := false;
    END LOOP;
    RAISE NOTICE ' % rows had an error see result set', errorRows;
The FOR rec IN expression LOOP .. END LOOP command is used to iterate through a result set, in our case a
SELECT * on the REGION table. The loop body is executed once for every row in the expression and the current row is saved
in the rec variable. The loop needs to be ended with the END LOOP keyword.
There are many other types of loops in NZPLSQL. For a complete list refer to the stored procedure guide.
For each iteration of the loop we initially set the value of fieldEmpty and descUpper to false. Variables can be
assigned with the ‘:=’ operator. Finally we will display a notice that shows the number of rows that either had an empty field
or an upper case description. This number will be saved in the errorRows variable.
7. Now it’s time to check the rows for our constraints and set our variables accordingly. Enter the following rows after the
variable initialization and before the END LOOP keyword:

    IF rec.R_NAME = '' OR rec.R_REGIONKEY < 1 THEN
        fieldEmpty := true;
    END IF;
    IF rec.R_COMMENT <> LOWER(rec.R_COMMENT) THEN
        descUpper := true;
    END IF;
    IF (fieldEmpty = true) OR (descUpper = true) THEN
        errorRows := errorRows + 1;
    END IF;

In this section we check our constraints for each row and set our three variables accordingly. First we check if the name field
of the row is the empty string or if the region key is smaller than one. In that case the fieldEmpty variable is set to true.
Note how we can access the fields by adding the field name to our loop record.
The second ‘IF’ statement checks if the comment field of the row is different from the lower case version of the comment
field. This would be the case if it contains uppercase characters.
Note that we can use the available PureData System functions like LOWER in the stored procedure, as if it were a SQL
statement.
Finally, if one of these variables has been set to true by the previous checks, we increase the value of the errorRows
variable by one. The final number will in the end be displayed by the RAISE NOTICE statement we already added to the
stored procedure.
8. Finally add the following lines after the lines you just added and before the END LOOP statement:

    EXECUTE IMMEDIATE 'INSERT INTO '|| REFTABLENAME ||' VALUES ('
        || rec.R_REGIONKEY ||','''
        || trim(rec.R_NAME) ||''','''
        || trim(rec.R_COMMENT) ||''','
        || fieldEmpty ||','
        || descUpper ||')';

These lines add the row of the REGION table to the result set of our stored procedure, adding two columns containing the
fieldEmpty and descUpper flags for this row. There are a couple of important points here:
For each call of a stored procedure with a result set as return value a temporary table is created that is later returned to the
caller. Since its name is unique for each call it needs to be referenced through a variable. This is the REFTABLENAME variable. Apart
from that, adding values to the result set is identical to other INSERT operations.
Since the name of the table is dynamic we need to execute the INSERT operations as a dynamic statement. This means
that the EXECUTE IMMEDIATE statement is used with a string that contains the query that is to be executed.
To add variable values to the string the concatenation operator || is used. Note that the values for R_NAME and R_COMMENT are
inserted as strings, which means they need to be surrounded by quotes. To add quotes to a string they need to be escaped
with a second quote character. This is the reason that R_NAME and R_COMMENT are surrounded by triple quotes. Apart from
that we trim them, so the inserted VARCHAR values are not padded with blank characters.
It can be tricky to construct a string like that, and you will see an error only once it is executed. For debugging it can be
useful to construct the string and display it with a RAISE NOTICE statement.
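The string that EXECUTE IMMEDIATE receives can be previewed outside the database. Below is a minimal sketch in Python that assembles the same kind of INSERT statement; the helper names `sql_literal` and `build_insert` and the table name `RESULT_TABLE` (standing in for the value of REFTABLENAME) are hypothetical. Two points go beyond the lab code and are assumptions of this sketch: booleans are rendered as true/false, and single quotes embedded in a value are doubled, which the lab statement does not do.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal for a dynamically built statement."""
    if isinstance(value, bool):  # test bool first, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    # Strings: trim padding, double embedded quotes, wrap in single quotes.
    return "'" + str(value).strip().replace("'", "''") + "'"


def build_insert(table_name, values):
    """Build the INSERT text for one result row."""
    return f"INSERT INTO {table_name} VALUES ({','.join(sql_literal(v) for v in values)})"


# RESULT_TABLE stands in for the per-call name held in REFTABLENAME.
statement = build_insert("RESULT_TABLE", [0, "as   ", "Australia", True, True])
print(statement)
```

Printing the statement before running it serves the same purpose as the RAISE NOTICE debugging tip: a missing or unbalanced quote is visible immediately.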
9. Your VI should now look like this, containing the complete stored procedure:

    CREATE OR REPLACE PROCEDURE checkRegions() LANGUAGE NZPLSQL RETURNS REFTABLE(tb1) AS
    BEGIN_PROC
    DECLARE
        rec RECORD;
        errorRows INTEGER;
        fieldEmpty BOOLEAN;
        descUpper BOOLEAN;
    BEGIN
        RAISE NOTICE 'Start check of Region';
        errorRows := 0;
        FOR rec IN SELECT * FROM REGION ORDER BY R_REGIONKEY LOOP
            fieldEmpty := false;
            descUpper := false;
            IF rec.R_NAME = '' OR rec.R_REGIONKEY < 1 THEN
                fieldEmpty := true;
            END IF;
            IF rec.R_COMMENT <> lower(rec.R_COMMENT) THEN
                descUpper := true;
            END IF;
            IF (fieldEmpty = true) OR (descUpper = true) THEN
                errorRows := errorRows + 1;
            END IF;
            EXECUTE IMMEDIATE 'INSERT INTO '|| REFTABLENAME ||' VALUES ('
                || rec.R_REGIONKEY ||','''
                || trim(rec.R_NAME) ||''','''
                || trim(rec.R_COMMENT) ||''','
                || fieldEmpty ||','
                || descUpper ||')';
        END LOOP;
        RAISE NOTICE ' % rows had an error see result set', errorRows;
    END;
    END_PROC;
    -- INSERT --

10. Save and exit VI. Press ESC to enter the command mode, enter “:wq!” to save and force quit and press Enter.
11. To create the stored procedure the reference table tb1 needs to exist. Create the table with the following
statement:

    LABDB(ADMIN)=> create table TB1 as select *, false AS FIELDEMPTY, false as DESCUPPER
    from region limit 0;
This command creates a table TB1 that has all the columns of the REGION table and two additional BOOLEAN fields
FIELDEMPTY and DESCUPPER. It will also be empty because we used the LIMIT 0 clause.
12. Describe the reference table with \d TB1
You should see the following result:

    LABDB(ADMIN)=> \d TB1
                        Table "TB1"
      Attribute  |          Type          | Modifier | Default Value
    -------------+------------------------+----------+---------------
     R_REGIONKEY | INTEGER                | NOT NULL |
     R_NAME      | CHARACTER(25)          | NOT NULL |
     R_COMMENT   | CHARACTER VARYING(152) |          |
     FIELDEMPTY  | BOOLEAN                |          |
     DESCUPPER   | BOOLEAN                |          |
    Distributed on hash: "R_REGIONKEY"

You can see the three columns of the REGION table and the two additional BOOLEAN fields that will record for each
row whether it violates the specified constraints.
Note that this table needs to exist before the procedure can be created.
13. Now create the stored procedure. Execute the script you just created with the following command:

    LABDB(ADMIN)=> \i checkRegion.sql

You should successfully create your stored procedure.
14. Now let’s have a look at our REGION table. Select all rows:

    LABDB(ADMIN)=> SELECT * FROM REGION;

You will get the following results:

    LABDB(ADMIN)=> SELECT * FROM REGION;
     R_REGIONKEY |          R_NAME           |          R_COMMENT
    -------------+---------------------------+-----------------------------
               2 | sa                        | south america
               1 | na                        | north america
               4 | ap                        | asia pacific
               3 | emea                      | europe, middle east, africa
    (4 rows)

We can see that none of the rows violates the constraints we defined, which would be pretty boring. So let’s test
our stored procedure by adding two rows that violate our constraints.
15. Add the two violating rows with the following commands:

    LABDB(ADMIN)=> INSERT INTO REGION VALUES (0, 'as', 'Australia');

This row violates the lower case constraint for the comment field and the empty field constraint for the region key.

    LABDB(ADMIN)=> INSERT INTO REGION VALUES (6, '', 'mongolia');

This row violates the empty field constraint for the region name.
16. Now finally let’s try our checkRegions stored procedure:

    LABDB(ADMIN)=> call checkRegions();

You should see the following output:

    LABDB(ADMIN)=> call checkRegions();
    NOTICE: Start check of Region
    NOTICE: 2 rows had an error see result set
     R_REGIONKEY |          R_NAME           |          R_COMMENT          | FIELDEMPTY | DESCUPPER
    -------------+---------------------------+-----------------------------+------------+-----------
               1 | na                        | north america               | f          | f
               3 | emea                      | europe, middle east, africa | f          | f
               0 | as                        | Australia                   | t          | t
               4 | ap                        | asia pacific                | f          | f
               2 | sa                        | south america               | f          | f
               6 |                           | mongolia                    | t          | f
    (6 rows)

You can see the expected results. Our stored procedure has found two rows that violate the constraints we check for.
In the FIELDEMPTY and DESCUPPER columns we can easily see that the row with the key 0 has both an empty field
and an uppercase comment. We can also see that row 6 only violates the empty field constraint.
Note that the TB1 table we created doesn’t contain any rows; it is only used as a template.
17. Finally let’s clean up our REGION table again:

    LABDB(ADMIN)=> DELETE FROM REGION WHERE R_REGIONKEY = 0 OR R_REGIONKEY = 6;

18. And let’s run our checkRegions procedure again:

    LABDB(ADMIN)=> call checkRegions();

You will see the following results:

    LABDB(ADMIN)=> call checkRegions();
    NOTICE: Start check of Region
    NOTICE: 0 rows had an error see result set
     R_REGIONKEY |          R_NAME           |          R_COMMENT          | FIELDEMPTY | DESCUPPER
    -------------+---------------------------+-----------------------------+------------+-----------
               3 | emea                      | europe, middle east, africa | f          | f
               4 | ap                        | asia pacific                | f          | f
               1 | na                        | north america               | f          | f
               2 | sa                        | south america               | f          | f
    (4 rows)

You can see that the table is now error free and all constraint violation fields are false.
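For reference, the per-row logic of checkRegions can be sketched outside the database. The following Python version is a simplified stand-in, not the stored procedure itself; the function name `check_regions` and the in-memory row list are assumptions, with the rows taken from the six-row state of the REGION table used in step 16. Names are stripped before the comparison to mimic how a padded CHARACTER(25) value compares to the empty string.

```python
# Rows of the REGION table after the two violating rows were added.
REGION_ROWS = [
    (1, "na", "north america"),
    (3, "emea", "europe, middle east, africa"),
    (0, "as", "Australia"),
    (4, "ap", "asia pacific"),
    (2, "sa", "south america"),
    (6, "", "mongolia"),
]


def check_regions(rows):
    """Return (result_rows, error_count) following the checkRegions logic."""
    result = []
    error_rows = 0
    for key, name, comment in sorted(rows):  # plays the role of ORDER BY R_REGIONKEY
        # Empty name or a region key below 1 counts as an empty field.
        field_empty = name.strip() == "" or key < 1
        # Any upper case character makes the comment differ from its lower case form.
        desc_upper = comment != comment.lower()
        if field_empty or desc_upper:
            error_rows += 1
        result.append((key, name, comment, field_empty, desc_upper))
    return result, error_rows


result, error_rows = check_regions(REGION_ROWS)
print(f"{error_rows} rows had an error see result set")
for row in result:
    print(row)
```

With the six rows above the sketch flags the rows with keys 0 and 6, matching the FIELDEMPTY and DESCUPPER columns returned by the procedure.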
Congratulations, you have finished the stored procedure lab and created two stored procedures that help you to
manage your database.
Netezza All labs

  • 1. IBM PureData System for Analytics © Copyright IBM Corp. 2012. All rights reserved IBM Software Information Management IBM PureData System for Analytics Hands-On Labs
  • 2. IBM PureData System for Analytics © Copyright IBM Corp. 2012. All rights reserved Table of Contents : Connecting to the Host and Database Database Administration Data Distribution NzAdmin Loading and Unloading Data Backup & Restore Query Optimization Optimization Objects Groom Stored Procedures
  • 3. IBM PureData System for Analytics … Powered by Netezza Technology IBM Software Information Management Connecting to the Host and Database Hands-On Lab
  • 4. IBM Software Information Management IBM PureData System for Analytics © Copyright IBM Corp. 2012. All rights reserved Page 2 of 10 Table of Contents 1 Introduction .......................................................................................3 1.1 VMware Basics....................................................................................3 1.2 Architecture of the Virtual Machines.....................................................3 1.3 Tips and Tricks on Using the PureData System Virtual Machines ........3 2 Connecting to PureData System Host.............................................4 2.1 Open the Virtual Machines in VMware .................................................4 2.2 Start the Virtual Machines....................................................................5 3 Connecting to System Database Using nzsql ................................6 3.1 Using PuTTY .......................................................................................6 3.2 Connect to the System Database Using nzsql .....................................7 3.3 Commonly Used Commands and SQL Statements..............................8 3.4 Exit nzsql.............................................................................................9
  • 5. IBM Software Information Management IBM PureData System for Analytics © Copyright IBM Corp. 2012. All rights reserved Page 3 of 10 1 Introduction 1.1 VMware Basics VMware® Player and VMware Workstation are the synonym for test beds and developer environments across the IT industry. While having many other functions for this specific purpose it allows the easy distribution of an “up and running” PureData System system right to anybody’s computer – be it a notebook, desktop, or server. The VMware image can be deployed for simple demos and educational purposes or it can be the base of your own development and experiments on top of the given environment. What is a VMware image? VMware is providing a virtual computer environment on top of existing operating systems on Intel® or AMD™ processor based systems. The virtual computer has all the usual components like a CPU, memory and disks as well as network, USB devices or sound. The CPU and memory are simply the existing resources provided by the underlying operating system (indicated as processes starting with “vmware”). On the other hand, virtual machine disks are a collection of files in the host operating system that can be copied between any system and platform. The virtual disk files make up the most part of the image while the actual description file of the virtual machine is small in size. 1.2 Architecture of the Virtual Machines For the hands-on lab portion of the bootcamp, we will be using 2 virtual machines (VM) to demonstrate the usability of PureData System systems. Because of the nature of virtualized environment and the host hardware, we will be limited in terms of performance. Please use these exercises as a guide to familiarize with the PureData System systems only. The virtual images are adaptations of an appliance for their portability and convenience. We will be running a virtual image to act as the host machine and the other image as a SPU that typically resides in a PureData System appliance. 
The Host image will be the main gateway where the Netezza Performance Server (NPS) code resides and will be accessed. The second image is the SPU where it contains 5 virtual hard drives of 20 GB each as well as a virtual FPGA. The hard disks here are not partitioned into primary, mirror and temp partitions as you would observe on a PureData System appliance. Instead, 4 of the disks only contain primary data partitions and the fifth disk is used for temporary data. Host SPU VMware NPS code temp FPGAFPGAFPGA Host Operating System PuTTY 1.3 Tips and Tricks on Using the PureData System Virtual Machines The PureData System appliance is designed and fine tuned for a specific set of hardware. In order to demonstrate the system in a virtualized environment, some adaptations were made on the virtual machines. To ensure the labs run smoothly, we have listed some pointers for using the VMs:
  • 6. IBM Software Information Management IBM PureData System for Analytics © Copyright IBM Corp. 2012. All rights reserved Page 4 of 10 Always boot up the Host image first before the SPU image When booting up the VMs, start the Host image first. Once it is fully booted, the SPU image can be started at which time the Host image would be listening for connections from the SPU machine. The connection then should be made automatically. After pausing the virtual machines, nz services need to be restarted In the case that the VMs were paused (the host operating system went into sleep or hibernation modes, the images were paused in VMware Workstation,,, etc). To continue using the VMs, run the following commands in the prompt of the Host image. When starting the SPU image for the first time, there will be a prompt for whether the image was copied or moved. The first time SPU image is booted, VMware Workstation will prompt with the question whether the image was copied or moved, the user should click on “I moved it”. This will insure that the SPU image will have the same MAC address as before. This is crucial for making sure the Host and SPU images can be connected. 2 Connecting to PureData System Host In most Bootcamp locations this chapter will already have been prepared by your bootcamp instructor. You can review it to learn how the images would be installed. But if your NPS system is already running, jump straight to chapter 3. 2.1 Open the Virtual Machines in VMware 2.1.1 Unpacking the Images The virtual machines for the PureData System Bootcamp are delivered in a self-extractable set of rar files. For easy handling the files are compressed to 700MB volumes. Download all the volumes to the same directory. Double click the executable file and select the destination folder to begin the extraction process. 
2.1.2 Open the HOST Virtual Machine There are 2 methods to start the VMware virtual machines: Option 1: Double click on the file “HOST.vmx” in your Windows Explorer or Linux file browser. Option 2: Select it through the File > Open… icon in the VMware console. This will bring up the dialog to browse to the folder where the VM image resides, select “HOST.vmx” and click on the “Open” button. Either option should bring up the VMware console [nz@netezza ~]$ nzstop [nz@netezza ~]$ nzstart
  • 7. IBM Software Information Management IBM PureData System for Analytics © Copyright IBM Corp. 2012. All rights reserved Page 5 of 10 2.1.3 Open the SPU Virtual Machine Repeat the steps you chose in the previous section to open the “SPU.vmx” file. The VMware console should be look similar to the following, with a tab for each image: 2.2 Start the Virtual Machines 2.2.1 Start and Login into the Host Virtual Machine To start using the virtual machines, first boot up the Host machine. Click on the “HOST” tab, then press the “Power On” button in the upper left corner (marked in a red circle above). You should see the RedHat operating system boot up screen, allow it to boot for a couple minutes until it runs to the PureData System login prompt. At the login prompt, login with the following credentials: Username: nz Password: nz Once logged in, we can check the state of the machine by issuing the following command: The system state should be in “Discovering” state which signifies that the host machine is ready for connection with the SPU’s: 2.2.2 Starting the SPU Virtual Machine Now we can start the SPU image. Similar to how we started the Host image, click on the SPU tab in the VMware console, then click on the “Power on” button. The first time the SPU image is booted, the following prompt will display to ask if the virtual machine was moved or copied: [nz@netezza ~]$ nzstate
  • 8. IBM Software Information Management IBM PureData System for Analytics © Copyright IBM Corp. 2012. All rights reserved Page 6 of 10 Choose the “I moved it” radio button, and click “OK”. This will ensure that the previously configured MAC address in the SPU image will remain the same. This is crucial for the communication between the Host and SPU virtual machines. After the SPU is fully booted, you should see the screen similar to the following. Note the bottom right corner where it displays that there are 5 virtual hard disks in healthy state. We can now go back to the Host image to check the status of the connection. Click on the “HOST” tab, and enter the following command in the prompt: The system state should display “Online” 3 Connecting to System Database Using nzsql Most Bootcamp locations will have a predefined PuTTY entry netezza that already contains the IP address; open it by double-clicking on the saved connection. 3.1 Using PuTTY Since we will not be using any graphical interface tools from the Host virtual machine, there is an alternative to using the PureData System prompts directly in VMware. We can connect to the Host via SSH using tools such as PuTTY. We will be using the PuTTY console for the rest of the labs since this better simulates the real life scenario of connecting to a remote PureData System system. First, locate the PuTTY executable in the folder where the VMs were extracted. Under the folder “Tools” you should be able to find the file “putty.exe”. Execute it by a double click. In the PuTTY interface, enter the IP of the Host image as 192.168.239.2 and select SSH as the connection type. Finally, click “Open” to start the session. [nz@netezza ~]$ nzstate
  • 9. IBM Software Information Management IBM PureData System for Analytics © Copyright IBM Corp. 2012. All rights reserved Page 7 of 10 Once the prompt window is open, log in with the following credentials: Username: nz Password: nz We are now ready for connection to the system database and execute commands in the PuTTY command prompt. 3.2 Connect to the System Database Using nzsql Since we have not created any user and databases yet, we will connect to the default database as the default user, with the following credentials: Database: system Username: admin Password: password When issuing the nzsql command, the user supplies the user account, password and the database to connect to using the syntax, below is an example of how this would be done. Do not try to execute that command it is just demonstrating the syntax: Alternatively, these values can be stored in the command shell and passed to the nzsql command when it is issued without any arguments. Let’s verify the current database, user and password values stored in the command shell by issuing the commands: The output should look similar to the following: Since the current values correspond to our desired values, no modification is required. [nz@netezza ~]$ printenv NZ_DATABASE [nz@netezza ~]$ printenv NZ_USER [nz@netezza ~]$ printenv NZ_PASSWORD nzsql –d [db_name] –u [user] –pw [password]
  • 10. IBM Software Information Management IBM PureData System for Analytics © Copyright IBM Corp. 2012. All rights reserved Page 8 of 10 Next, let’s take a look at what options are available to start nzsql. Type in the following command The -? option will list the usage and all options for the nzsql command. In this exercise, we will start nzsql without arguments. In the command prompt, issue the command: This will bring up the nzsql prompt below that shows a connection to the system database as user admin: 3.3 Commonly Used Commands and SQL Statements There are commonly used commands that start with “” which we will demonstrate in this section. First, we will run the 2 help commands to familiarize ourselves with these handy commands. The h command will list the available SQL commands, while the ? command is used to list the internal slash commands. Examine the output for both commands: From the output of the ? command, we found the l internal command we can use to find out all the databases: Let’s find out all the databases by entering Secondly, we will use “dSt” to find out the system tables within the system database. SYSTEM(ADMIN)=> h SYSTEM(ADMIN)=> ? [nz@netezza ~]$ nzsql SYSTEM(ADMIN)=> dSt List of relations Name | Type | Owner ------------------------------+--------------+------- _T_ACCESS_TIME | SYSTEM TABLE | ADMIN _T_ACL | SYSTEM TABLE | ADMIN _T_ACTIONFRAG | SYSTEM TABLE | ADMIN _T_AGGREGATE | SYSTEM TABLE | ADMIN _T_ALTBASE | SYSTEM TABLE | ADMIN _T_AM | SYSTEM TABLE | ADMIN _T_AMOP | SYSTEM TABLE | ADMIN _T_AMPROC | SYSTEM TABLE | ADMIN _T_ATTRDEF | SYSTEM TABLE | ADMIN . . . SYSTEM(ADMIN)=> l List of databases DATABASE | OWNER -----------+----------- MASTER_DB | ADMIN SYSTEM | ADMIN (2 rows) [nz@netezza ~]$ nzsql -?
  • 11. IBM Software Information Management IBM PureData System for Analytics © Copyright IBM Corp. 2012. All rights reserved Page 9 of 10 Note: press the space bar to scroll down the result set when you see “--More--“ on the screen. From the previous command, we can see that there is a user table called “_T_USER”. To find out what is stored in that table, we will use the describe command d: This will return all the columns of the _T_USER system table. Next, we want to know the existing users stored in the table. In case too many rows are returned at once, we will first calculate the number of rows it contains by enter the following query: The query above is essentially the same as “SELECT COUNT (*) FROM _T_USER;”, we have demonstrated the sub-select syntax in case there is a complex query that needed to have the result set evaluated. The result should show there is currently 1 entry in the user table. We can enter the following query to list the user names: 3.4 Exit nzsql To exit nzsql, use the command q to return to the PureData System system. SYSTEM(ADMIN)=> SELECT USENAME FROM _T_USER; SYSTEM(ADMIN)=>d _T_USER SYSTEM(ADMIN)=> SELECT COUNT(*) FROM (SELECT * FROM _T_USER) AS "Wrapper";
  • 12. IBM Software Information Management IBM PureData System for Analytics © Copyright IBM Corp. 2012. All rights reserved Page 10 of 10 © Copyright IBM Corporation 2011 All Rights Reserved. IBM Canada 8200 Warden Avenue Markham, ON L6G 1C7 Canada IBM, the IBM logo, ibm.com and Tivoli are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml Other company, product and service names may be trademarks or service marks of others. References in this publication to IBM products and services do not imply that IBM intends to make them available in all countries in which IBM operates. No part of this document may be reproduced or transmitted in any form without written permission from IBM Corporation. Product data has been reviewed for accuracy as of the date of initial publication. Product data is subject to change without notice. Any statements regarding IBM’s future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. THE INFORMATION PROVIDED IN THIS DOCUMENT IS DISTRIBUTED “AS IS” WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements (e.g. IBM Customer Agreement, Statement of Limited Warranty, International Program License Agreement, etc.) under which they are provided.
  • 13. IBM Software Information Management Database Administration Hands-On Lab IBM PureData System for Analytics … Powered by Netezza Technology
  • 14. IBM Software Information Management IBM PureData System for Analytics © Copyright IBM Corp. 2012. All rights reserved Page 2 of 18 Table of Contents 1 Introduction .....................................................................3 1.1 Objectives........................................................................3 2 Creating IBM PureData System Users and Groups......3 2.1 Creating New PureData System Users ............................4 2.2 Creating New PureData System Groups..........................5 3 Creating a PureData System Database .........................7 3.1 Creating a Database and Transferring Ownership ...........7 3.2 Assigning Authorities and Privileges ................................9 3.3 Creating PureData System Tables.................................11 3.4 Using DML Queries .......................................................14
  • 15. IBM Software Information Management IBM PureData System for Analytics © Copyright IBM Corp. 2012. All rights reserved Page 3 of 18 1 Introduction A factory-configured and installed IBM PureData System will include some of the following components: • An IBM PureData System warehouse appliance with pre-installed IBM PureData System software • A preconfigured Linux operating system (with PureData System modifications) • Several preconfigured Linux users and groups: o The nz user is the default PureData System system Administration account o The group is the default group • An IBM PureData System database user named ADMIN. The ADMIN user is the database super-user, and has full access to all system functions and objects • A preconfigured database group named PUBLIC. All database users are automatically placed in the group PUBLIC and therefore inherit all of its privileges The IBM PureData System warehouse appliance includes a highly optimized SQL dialect called PureData System Structured Query Language. You can use SQL commands to create and manage your PureData System databases, user access, and permissions for the databases, as well as to query and modify the contents of the databases On a new IBM PureData System system, there is typically one main database, SYSTEM, and a database template, MASTER_DB. IBM PureData System uses the MASTER_DB as a template for all other user databases that are created on the system. Initially, only the ADMIN user can create new databases, but the ADMIN user can grant other users permission to create databases as well. The ADMIN user can also make another user the owner of a database, which gives that user ADMIN-like control over that database and its contents. The database creator becomes the default owner of the database. The owner can remove the database and all its objects, even if other users own objects within the database. 
Within a database, permitted users can create tables and populate them with data and query its contents. 1.1 Objectives This lab will guide you through the typical steps to create and manage new IBM PureData System users and groups after an IBM PureData System has been delivered and configured. This will include creating a new database and assigning the appropriate privileges. The users and the database that you create in this lab will be used as a basis for the remaining labs in this bootcamp. After this lab you will have a basic understanding on how to plan and create an IBM PureData System database environment. • The first part of this lab will examine creating IBM PureData System users and groups • The second part of this lab will explore creating and using a database and tables. The table schema to be used within this bootcamp will be explained in the Data Distribution lab. 2 Creating IBM PureData System Users and Groups The initial task after an IBM PureData System has been set up is to create the database environment. This typically begins by creating a new set of database users and user groups before creating the database. You will use the ADMIN user to start creating additional database users and users groups. Then you will assign the appropriate authorities after the database has been created in the next section. The ADMIN user should only be used to perform administrative tasks within the IBM PureData System and is not recommended for regular use. Also, it is highly advisable to develop a security access model to control user access against the database and the database objects in an IBM PureData System. This will involve creating separate users to perform certain tasks. The security access model for this bootcamp environment will use three PureData System database users: o LABADMIN o LABUSER
  • 16. IBM Software Information Management IBM PureData System for Analytics © Copyright IBM Corp. 2012. All rights reserved Page 4 of 18 o DBUSER and two PureData System database user groups: o LAGRP o LUGRP 1. Connect to the Netezza image using PuTTy. Login to 192.168.239.2 as user “nz” with password “nz”. 192.168.239.2 is the default IP address for a local VM which is used for most bootcamp environments. In some cases where the images are hosted remotely, the instructors will provide the host IPs which will vary between machines 2. Connect to the system database as the PureData System database super-user, ADMIN, using the nzsql interface: or, There are different options you can use with the nzsql interface. Here we present two options, where the first option uses information set in the NZ environment variables, NZ_DATABASE, NZ_USER, and NZ_PASSWORD. By default the environment variables are set to the following values: NZ_DATATASE=system NZ_USER=admin NZ_PASSWORD=password So you do not need to specify the database name or the user. In the second option the information is explicitly stated using the –d, -u, and –pw options, which specifies the database name, the user, and the user’s password, respectively. This option is useful when you want to connect to a different database or use a different user than specified in the NZ environment variables. You will see the following: 2.1 Creating New PureData System Users The three new PureData System database users will be initially created using the ADMIN user. The LABADMIN user will be the full owner of the bootcamp database. The LABUSER user will be allowed to perform data manipulation language (DML) operations (INSERT, UPDATE, DELETE) against all of the tables in the database, but they will not be allowed to create new objects like tables in the database. 
And lastly, the DBUSER user will only be allowed to read tables in the database; that is, it will only have LIST and SELECT privileges against tables in the database.

Commands and output for the preceding connection steps:

    [nz@netezza ~]$ nzsql

    [nz@netezza ~]$ nzsql -d system -u admin -pw password

    Welcome to nzsql, the Netezza SQL interactive terminal.

    Type:  \h for help with SQL commands
           \? for help on internal slash commands
           \g or terminate with semicolon to execute query
           \q to quit

    SYSTEM(ADMIN)=>

The basic syntax to create a user is:

    CREATE USER username WITH PASSWORD 'string';

  • 17.
1. As the PureData System database super-user, ADMIN, you can now create the first user, LABADMIN, which will be the administrator of the database (note that user and group names are not case sensitive):

    SYSTEM(ADMIN)=> create user labadmin with password 'password';

Later in this lab you will assign administrative ownership of the lab database to this user.

2. Now you will create two additional PureData System database users that will have restricted access to the database. The first user, LABUSER, will have full DML access to the data in the tables, but will not be able to create or alter tables. For now you will just create the user. You will set the privileges after the database is created:

    SYSTEM(ADMIN)=> create user labuser with password 'password';

3. Finally, create the user DBUSER. This user will have even more limited access to the database since it will only be allowed to select data from the tables within the database. Again, you will set the privileges after the database is created:

    SYSTEM(ADMIN)=> create user dbuser with password 'password';

4. To list the existing PureData System database users in the environment, use the \du internal slash option:

    SYSTEM(ADMIN)=> \du

This will return a list of all database users:

                                                                        List of Users
     USERNAME | VALIDUNTIL | ROWLIMIT | SESSIONTIMEOUT | QUERYTIMEOUT | DEF_PRIORITY | MAX_PRIORITY | USERESOURCEGRPID | USERESOURCEGRPNAME | CROSS_JOINS_ALLOWED
    ----------+------------+----------+----------------+--------------+--------------+--------------+------------------+--------------------+---------------------
     ADMIN    |            |          | 0              | 0            | NONE         | NONE         |                  | _ADMIN_            | NULL
     DBUSER   |            | 0        | 0              | 0            | NONE         | NONE         |                  | PUBLIC             | NULL
     LABADMIN |            | 0        | 0              | 0            | NONE         | NONE         |                  | PUBLIC             | NULL
     LABUSER  |            | 0        | 0              | 0            | NONE         | NONE         |                  | PUBLIC             | NULL
    (4 rows)

The additional information, such as USERESOURCEGRPNAME, is intended for resource management, which is covered later in the WLM presentation.

2.2 Creating New PureData System Groups

PureData System database user groups are useful for organizing and managing PureData System database users. By default PureData System contains one group with the name PUBLIC. All users become members of the PUBLIC group when they are created. Users can be members of other groups as well.

In this section we will create two new PureData System database user groups. They will be initially created by the ADMIN user. We will create an administrative group, LAGRP, which is short for Lab Admin Group. This group will contain the LABADMIN user. The second group we create will be LUGRP, or Lab User Group. This group will contain the users LABUSER and DBUSER.
  • 18.
Two different methods will be used to add the existing users to the newly created groups. Alternatively, the groups could be created first and then the users. The basic command to create a group is:

    CREATE GROUP groupname;

1. As the PureData System database super-user, ADMIN, you will now create the first group, LAGRP, which will be the administrative group for the LABADMIN user:

    SYSTEM(ADMIN)=> create group lagrp;

2. After the LAGRP group is created, add the LABADMIN user to this group. This is accomplished with the ALTER statement. You can ALTER either the user or the group. For this task you will ALTER the group to add the LABADMIN user to the LAGRP group:

    SYSTEM(ADMIN)=> alter group lagrp with add user labadmin;

To ALTER the user instead, you would use the following command:

    alter user labadmin with group lagrp;

3. Now you will create the second group, LUGRP, which will be the user group for both the LABUSER and DBUSER users. You can specify the users to be included in the group when creating the group:

    SYSTEM(ADMIN)=> create group lugrp with add user labuser, dbuser;

If you had created the group before creating the user, you could add the user to the group when creating the user. To create the LABUSER user and add it to an existing group LUGRP, you would use the following command:

    create user LABUSER with in group LUGRP;

4. To list the existing PureData System groups in the environment, use the \dg internal slash option:

    SYSTEM(ADMIN)=> \dg

This will return a list of all groups in the system. In our test system these are the default group PUBLIC and the two groups you have just created:

                                                                   List of Groups
     GROUPNAME | ROWLIMIT | SESSIONTIMEOUT | QUERYTIMEOUT | DEF_PRIORITY | MAX_PRIORITY | GRORSGPERCENT | RSGMAXPERCENT | JOBMAX | CROSS_JOINS_ALLOWED
    -----------+----------+----------------+--------------+--------------+--------------+---------------+---------------+--------+---------------------
     LAGRP     | 0        | 0              | 0            | NONE         | NONE         | 0             | 100           | 0      | NULL
     LUGRP     | 0        | 0              | 0            | NONE         | NONE         | 0             | 100           | 0      | NULL
     PUBLIC    | 0        | 0              | 0            | NONE         | NONE         | 20            | 100           | 0      | NULL
    (3 rows)

The other columns are explained in the WLM presentation.
  • 19.
5. To list the users in a group you can use one of two internal slash options, \dG or \dU. The internal slash option \dG lists the groups with their associated users:

    SYSTEM(ADMIN)=> \dG

This returns a list of all groups and the users they contain:

     List of Users in a Group
     GROUPNAME | USERNAME
    -----------+----------
     LAGRP     | LABADMIN
     LUGRP     | DBUSER
     LUGRP     | LABUSER
     PUBLIC    | DBUSER
     PUBLIC    | LABADMIN
     PUBLIC    | LABUSER
    (6 rows)

The internal slash option \dU lists the users with their associated groups:

    SYSTEM(ADMIN)=> \dU

In this case the output is ordered by user:

     List of Groups a User is a member
     USERNAME | GROUPNAME
    ----------+-----------
     DBUSER   | LUGRP
     DBUSER   | PUBLIC
     LABADMIN | LAGRP
     LABADMIN | PUBLIC
     LABUSER  | LUGRP
     LABUSER  | PUBLIC
    (6 rows)

3 Creating a PureData System Database

After the PureData System database users and user groups have been created, the next step is to create the lab database. You will continue to use the ADMIN user to create the lab database and then assign the appropriate authorities and privileges to the users created in the previous sections. The ADMIN user can also be used to create tables within the new database. However, the ADMIN user should only be used to perform administrative tasks. After the appropriate privileges have been assigned by the ADMIN user, the database can be handed over to the end users to start creating and populating the tables in the database.

3.1 Creating a Database and Transferring Ownership

The lab database will be named LABDB. It will be initially created by the ADMIN user, and then ownership of the database will be transferred to the LABADMIN user. The LABADMIN user will have full administrative privileges against the LABDB database. The basic syntax to create a database is:
  • 20.
    CREATE DATABASE database_name;

1. As the PureData System database super-user, ADMIN, you will create the first database, LABDB, using the CREATE DATABASE command:

    SYSTEM(ADMIN)=> create database labdb;

The database LABDB has been created.

2. To view the existing databases, use the internal slash option \l:

    SYSTEM(ADMIN)=> \l

This will return the following list:

      List of databases
     DATABASE  | OWNER
    -----------+-------
     LABDB     | ADMIN
     MASTER_DB | ADMIN
     SYSTEM    | ADMIN
    (3 rows)

The owner of the newly created LABDB database is the ADMIN user. The other databases are the default database SYSTEM and the template database MASTER_DB.

3. At this point you could continue by creating new tables as the ADMIN user. However, the ADMIN user should only be used to create users, groups, and databases, and to assign authorities and privileges. Therefore we will transfer ownership of the LABDB database from the ADMIN user to the LABADMIN user we created previously. The ALTER DATABASE command is used to transfer ownership of an existing database:

    SYSTEM(ADMIN)=> alter database labdb owner to labadmin;

This is the only method to transfer ownership of a database to an existing user. The CREATE DATABASE command does not include this option.

4. Check that the owner of the LABDB database is now the LABADMIN user:

    SYSTEM(ADMIN)=> \l

      List of databases
     DATABASE  | OWNER
    -----------+----------
     LABDB     | LABADMIN
     MASTER_DB | ADMIN
     SYSTEM    | ADMIN
    (3 rows)

The owner of the LABDB database is now the LABADMIN user.
  • 21.
The LABDB database is now created and the LABADMIN user has full privileges on it. The user can create and alter objects within the database. You could now continue and start creating tables as the LABADMIN user. However, we will first finish assigning privileges to the two remaining database users that were created in the previous section.

3.2 Assigning Authorities and Privileges

One last task for the ADMIN user is to assign privileges to the two users created earlier, LABUSER and DBUSER. LABUSER will have full DML rights against all tables in the LABDB database. It will not be allowed to create or alter tables within the database. DBUSER will have more restricted access and will only be allowed to read data from the tables in the database.

The privileges will be controlled by a combination of grants at the group level and at the user level. The LUGRP user group will be granted LIST and SELECT privileges against the database and the tables within the database, so any member of LUGRP will have these privileges. The full data manipulation privileges will be granted individually to the LABUSER user.

The GRANT command that is used to assign object privileges has the following syntax:

    GRANT <object_privilege> ON <object> TO { PUBLIC | GROUP <group> | <username> }

1. As the PureData System database super-user, ADMIN, connect to the LABDB database using the internal slash option \c:

    SYSTEM(ADMIN)=> \c labdb admin password

You should see that you have successfully connected to the database:

    You are now connected to database LABDB as user admin.
    LABDB(ADMIN)=>

You will notice that the database name in the command prompt has changed from SYSTEM to LABDB.

2. First you will grant the LIST privilege on the LABDB database to the group LUGRP. This will allow members of LUGRP to view and connect to the LABDB database:

    LABDB(ADMIN)=> grant list on labdb to lugrp;

3. To list the object permissions for a group, use the internal slash option \dpg:

    LABDB(ADMIN)=> \dpg lugrp

You will see the following output:

    Group object permissions for group 'LUGRP'

     Database Name | Object Name | L S I U D T L A D B L G O E C R X A | D G U T E X Q Y V M I B R C S H F A L P N S
    ---------------+-------------+-------------------------------------+---------------------------------------------
     GLOBAL        | LABDB       | X                                   |
    (1 rows)

    Object Privileges
        (L)ist (S)elect (I)nsert (U)pdate (D)elete (T)runcate (L)ock
        (A)lter (D)rop a(B)ort (L)oad (G)enstats Gr(O)om (E)xecute
        Label-A(C)cess Label-(R)estrict Label-E(X)pand Execute-(A)s

The X in the L column of the list denotes that the LUGRP group has the LIST object privilege on the LABDB global object.
  • 22.
4. With the current privileges, LABUSER and DBUSER can view and connect to the LABDB database as members of the LUGRP group. But these two users have no privileges to access any of the objects within the database. So you will grant LIST and SELECT privileges on the tables within the LABDB database to the members of LUGRP:

    LABDB(ADMIN)=> grant list, select on table to lugrp;

5. View the object permissions for the LUGRP group:

    LABDB(ADMIN)=> \dpg lugrp

This will produce the following results:

    Group object permissions for group 'LUGRP'

     Database Name | Object Name | L S I U D T L A D B L G O E C R X A | D G U T E X Q Y V M I B R C S H F A L P N S
    ---------------+-------------+-------------------------------------+---------------------------------------------
     GLOBAL        | LABDB       | X                                   |
     LABDB         | TABLE       | X X                                 |
    (2 rows)

    Object Privileges
        (L)ist (S)elect (I)nsert (U)pdate (D)elete (T)runcate (L)ock
        (A)lter (D)rop a(B)ort (L)oad (G)enstats Gr(O)om (E)xecute
        Label-A(C)cess Label-(R)estrict Label-E(X)pand Execute-(A)s

The X in the L and S columns denotes that the LUGRP group has both LIST and SELECT privileges on all of the tables in the LABDB database. (The LIST privilege allows users to view the tables using the internal slash option \d.)

6. The current privileges satisfy the requirements for the DBUSER user, which are access to the LABDB database and SELECT access to all the tables in the database. But they do not satisfy the requirements for the LABUSER user, which is to have full DML access to all the tables in the database. So you will grant SELECT, INSERT, UPDATE, DELETE, LIST, and TRUNCATE privileges on tables in the LABDB database to the LABUSER user:

    LABDB(ADMIN)=> grant select, insert, update, delete, list, truncate on table to labuser;

7. To list the object permissions for a user, use the \dpu <user name> internal slash option:

    LABDB(ADMIN)=> \dpu labuser

This will return the following:

    User object permissions for user 'LABUSER'

     Database Name | Object Name | L S I U D T L A D B L G O E C R X A | D G U T E X Q Y V M I B R C S H F A L P N S
    ---------------+-------------+-------------------------------------+---------------------------------------------
     LABDB         | TABLE       | X X X X X X                         |
    (1 rows)

    Object Privileges
        (L)ist (S)elect (I)nsert (U)pdate (D)elete (T)runcate (L)ock
        (A)lter (D)rop a(B)ort (L)oad (G)enstats Gr(O)om (E)xecute
        Label-A(C)cess Label-(R)estrict Label-E(X)pand Execute-(A)s

The X under the L, S, I, U, D, and T columns indicates that the LABUSER user has LIST, SELECT, INSERT, UPDATE, DELETE, and TRUNCATE privileges on all of the tables in the LABDB database.
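The way group-level and user-level grants combine in this section can be pictured with a small model. The following is only an illustrative sketch in Python, not a Netezza API: the dictionaries and the function `effective_table_privileges` are hypothetical names, and the model assumes that a user's effective privileges are simply the union of its own grants and the grants of every group it belongs to. It ignores ownership, so LABADMIN (the owner of LABDB) is not represented correctly here.

```python
# Hypothetical in-memory model of the lab's privilege setup (not a Netezza API).
GROUP_MEMBERS = {
    "LAGRP": {"LABADMIN"},
    "LUGRP": {"LABUSER", "DBUSER"},
}

# Privileges granted on the TABLE object class in LABDB (steps 4 and 6 above).
GROUP_GRANTS = {"LUGRP": {"LIST", "SELECT"}}
USER_GRANTS = {
    "LABUSER": {"SELECT", "INSERT", "UPDATE", "DELETE", "LIST", "TRUNCATE"},
}


def effective_table_privileges(user):
    """Assumed rule: union of the user's own grants and its groups' grants."""
    privileges = set(USER_GRANTS.get(user, set()))
    for group, members in GROUP_MEMBERS.items():
        if user in members:
            privileges |= GROUP_GRANTS.get(group, set())
    return privileges


for name in ("LABUSER", "DBUSER"):
    print(name, sorted(effective_table_privileges(name)))
```

Under this model DBUSER ends up with LIST and SELECT only, while LABUSER holds the six DML-related privileges, which matches the \dpg and \dpu listings above. Neither user holds a privilege to create tables.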
  • 23.
Now that all of the privileges have been set by the ADMIN user, the LABDB database can be handed over to the end users. The end users can use the LABADMIN user to create objects, including tables, in the database and also to maintain the database.

3.3 Creating PureData System Tables

The LABADMIN user, rather than the ADMIN user, will be used to create tables in the LABDB database. Two tables will be created in this lab. The remaining tables for the LABDB database schema will be created in the Data Distribution lab. Data distribution is an important aspect that should be considered when creating tables. It is not covered in this lab since it is discussed separately in the Data Distribution presentation.

The two tables that will be created are the REGION and NATION tables. They will be populated with data in the next section using the LABUSER user. Two methods will be used to create these tables. The basic syntax to create a table is:

    CREATE TABLE table_name (
        column_name type [ [ constraint_name ] column_constraint [ constraint_characteristics ] ] [, ... ]
        [ [ constraint_name ] table_constraint [ constraint_characteristics ] ] [, ... ]
    )
    [ DISTRIBUTE ON ( column [, ...] ) ]

1. Connect to the LABDB database as the LABADMIN user using the internal slash option \c:

    LABDB(ADMIN)=> \c labdb labadmin password

You will see the following results:

    You are now connected to database LABDB as user labadmin.
    LABDB(LABADMIN)=>

You will notice that the user name in the command prompt has changed from ADMIN to LABADMIN.

Since you already had an open session, you could use the internal slash option \c to connect to the database. However, if you had handed over this environment to the end user, they would need to initiate a new connection using the nzsql interface. To use the nzsql interface to connect to the LABDB database as the LABADMIN user you could use the following options:

    nzsql -d labdb -u labadmin -pw password

or the short form, omitting the option names:

    nzsql labdb labadmin password

or you could set the environment variables to the following values and issue nzsql without options:
  • 24.
    NZ_DATABASE=LABDB
    NZ_USER=LABADMIN
    NZ_PASSWORD=password

In further labs we will often leave out the password parameter since it has been set to the same value, "password", for all users.

2. Now you can create the first table in the LABDB database. The first table you will create is the REGION table with the following columns and data types:

    Column Name   Data Type
    R_REGIONKEY   INTEGER
    R_NAME        CHAR(25)
    R_COMMENT     VARCHAR(152)

To create the table, execute the following command:

    LABDB(LABADMIN)=> create table region (r_regionkey integer, r_name char(25), r_comment varchar(152));

3. To list the tables in the LABDB database, use the \dt internal slash option:

    LABDB(LABADMIN)=> \dt

This will show the table you just created:

         List of relations
      Name  | Type  |  Owner
    --------+-------+----------
     REGION | TABLE | LABADMIN
    (1 row)

4. To describe a table, use the internal slash option \d <table name>:

    LABDB(LABADMIN)=> \d region

This shows a description of the created table:

                         Table "REGION"
      Attribute  |          Type          | Modifier | Default Value
    -------------+------------------------+----------+---------------
     R_REGIONKEY | INTEGER                |          |
     R_NAME      | CHARACTER(25)          |          |
     R_COMMENT   | CHARACTER VARYING(152) |          |
    Distributed on hash: "R_REGIONKEY"
  • 25.
The "Distributed on hash" clause shows the distribution method used by the table. If you do not explicitly specify a distribution method, a default distribution is used. In our system this is a hash distribution on the first column, R_REGIONKEY. This concept is discussed in the Data Distribution presentation and lab.

5. Instead of typing out the entire create table statement at the nzsql command line, you can read and execute commands from a file. You will use this method to create the NATION table in the LABDB database with the following columns and data types:

    Column Name   Data Type      Constraint
    N_NATIONKEY   INTEGER        NOT NULL
    N_NAME        CHAR(25)       NOT NULL
    N_REGIONKEY   INTEGER        NOT NULL
    N_COMMENT     VARCHAR(152)   ---

The full create table statement for the NATION table:

    create table nation (
        n_nationkey integer not null,
        n_name char(25) not null,
        n_regionkey integer not null,
        n_comment varchar(152)
    ) distribute on random;

6. The statement can be found in the nation.ddl file under the /labs/databaseAdministration directory. To read and execute commands from a file, use the \i <file> internal slash option:

    LABDB(LABADMIN)=> \i /labs/databaseAdministration/nation.ddl

7. List all the tables in the LABDB database:

    LABDB(LABADMIN)=> \dt

You will now see a list containing the two tables you created:

         List of relations
      Name  | Type  |  Owner
    --------+-------+----------
     NATION | TABLE | LABADMIN
     REGION | TABLE | LABADMIN
    (2 rows)

8. Describe the NATION table:
  • 26.
    LABDB(LABADMIN)=> \d nation

This will show the following results:

                         Table "NATION"
      Attribute  |          Type          | Modifier | Default Value
    -------------+------------------------+----------+---------------
     N_NATIONKEY | INTEGER                | NOT NULL |
     N_NAME      | CHARACTER(25)          | NOT NULL |
     N_REGIONKEY | INTEGER                | NOT NULL |
     N_COMMENT   | CHARACTER VARYING(152) |          |
    Distributed on random: (round-robin)

"Distributed on random" is the distribution method used; in this case the rows in the NATION table are distributed in round-robin fashion. This concept is discussed separately in the Data Distribution presentation and lab.

It is possible to continue to use the LABADMIN user to perform DML queries since it is the owner of the database and holds all privileges on all of the objects in the database. However, the LABUSER and DBUSER users will be used to perform DML queries against the tables in the database.

3.4 Using DML Queries

We will now use the LABUSER user to populate both the REGION and NATION tables. This user has full data manipulation language (DML) privileges in the database, but no data definition language (DDL) privileges. Only LABADMIN has full DDL privileges in the database. More efficient methods to populate tables with data are discussed later in this course. The DBUSER user will also be used to read data from the tables, but it cannot insert data into the tables since it has limited DML privileges in the database.

1. Connect to the LABDB database as the LABUSER user using the internal slash option \c:

    LABDB(LABADMIN)=> \c labdb labuser password

You will see the following result:

    You are now connected to database LABDB as user labuser.
    LABDB(LABUSER)=>

You will notice that the user name in the command prompt has changed from LABADMIN to LABUSER.

2. First check which tables exist in the LABDB database using the \dt internal slash option:

    LABDB(LABUSER)=> \dt

You should see the following list:
  • 27.
         List of relations
      Name  | Type  |  Owner
    --------+-------+----------
     NATION | TABLE | LABADMIN
     REGION | TABLE | LABADMIN
    (2 rows)

Remember that the LABUSER user is a member of the LUGRP group, which was granted the LIST privilege on the tables in the LABDB database. This is why it can list and view the tables in the LABDB database. Without this privilege it would not be able to see any of the tables in the LABDB database.

3. The LABUSER user was created to perform DML operations against the tables in the LABDB database. However, it is restricted from performing DDL operations against the database. Let's see what happens when you try to create a new table, T1, with one column, C1, using the INTEGER data type:

    LABDB(LABUSER)=> create table t1 (c1 integer);

You will see the following error message:

    ERROR: CREATE TABLE: permission denied.

As expected, the create table statement is not allowed since the LABUSER user does not have the privilege to create tables in the LABDB database.

4. Let's continue by performing DML operations that the LABUSER user is allowed to perform against the tables in the LABDB database. Insert a new row into the REGION table:

    LABDB(LABUSER)=> insert into region values (1, 'NA', 'north america');

You will see the following results:

    INSERT 0 1

As expected, this operation is successful. The output of the INSERT gives feedback about the number of successfully inserted rows.

5. Issue a SELECT statement against the REGION table to check the new row you just added to the table:

    LABDB(LABUSER)=> select * from region;

This should return the row you just inserted:

     R_REGIONKEY |          R_NAME           |          R_COMMENT
    -------------+---------------------------+-----------------------------
               1 | NA                        | north america
    (1 rows)
  • 28.
6. Instead of typing DML statements at the nzsql command line, you can read and execute statements from a file. You will use this method to add the following three rows to the REGION table:

    R_REGIONKEY   R_NAME   R_COMMENT
    2             SA       South America
    3             EMEA     Europe, Middle East, Africa
    4             AP       Asia Pacific

This is done with a SQL script containing the following commands:

    insert into region values (2, 'sa', 'south america');
    insert into region values (3, 'emea', 'europe, middle east, africa');
    insert into region values (4, 'ap', 'asia pacific');

It can be found in the region.dml file under the /labs/databaseAdministration directory. To read and execute commands from a file, use the \i <file> internal slash option:

    LABDB(LABUSER)=> \i /labs/databaseAdministration/region.dml

You will see the following result. You can see from the output that the SQL script contained three INSERT statements:

    INSERT 0 1
    INSERT 0 1
    INSERT 0 1

7. You will load data into the NATION table using an external table with the following command:

    LABDB(LABUSER)=> insert into nation select * from external '/labs/databaseAdministration/nation.del';

You will see that 14 rows are inserted into the table:

    INSERT 0 14

Loading data into a table is covered in the Loading and Unloading Data presentation and lab.

8. Now you will switch over to the DBUSER user, who only has the SELECT privilege on the tables in the LABDB database. This privilege is granted to this user since it is a member of the LUGRP group. Use the internal slash option \c <database name> <user> <password> to connect to the LABDB database as the DBUSER user:

    LABDB(LABUSER)=> \c labdb dbuser password

You will see the following:

    You are now connected to database LABDB as user dbuser.
    LABDB(DBUSER)=>
  • 29.
You will notice that the user name in the command prompt changes from LABUSER to DBUSER.

9. Before trying to view rows from tables in the LABDB database, try to add a new row to the REGION table:

    LABDB(DBUSER)=> insert into region values (5, 'np', 'north pole');

You should see the following error message:

    ERROR: Permission denied on "REGION".

As expected, the INSERT statement is not allowed since the DBUSER user does not have the privilege to add rows to any tables in the LABDB database.

10. Now select all rows from the REGION table:

    LABDB(DBUSER)=> select * from region;

You should get the following output:

     R_REGIONKEY |          R_NAME           |          R_COMMENT
    -------------+---------------------------+-----------------------------
               1 | na                        | north america
               2 | sa                        | south america
               3 | emea                      | europe, middle east, Africa
               4 | ap                        | asia pacific
    (4 rows)

11. Finally, let's run a slightly more complex query. We want to return all nation names in Asia Pacific, together with their region name. To do this you need to execute a simple join using the NATION and REGION tables. The join key will be the region key, and to restrict results to the AP region you need to add a WHERE condition:

    LABDB(DBUSER)=> select n_name, r_name from nation, region where n_regionkey = r_regionkey and r_name = 'ap';

This should return the following results, containing all countries from the ap region:

              N_NAME           |          R_NAME
    ---------------------------+---------------------------
     macau                     | ap
     new zealand               | ap
     australia                 | ap
     japan                     | ap
     hong kong                 | ap
    (5 rows)

Congratulations, you have completed the lab. You have successfully created the lab database, two tables, and database users and user groups with various privileges. You also ran a couple of simple queries. In further labs you will continue to use this database by creating the full schema.
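For readers without access to the appliance, the join from step 11 can be tried out locally. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for nzsql; it is not the lab environment. The table definitions follow the lab, but the DISTRIBUTE ON clause is omitted because SQLite has no such concept. The nation key values and the 'canada' row are assumptions added for illustration, and the helper names `build_lab_tables` and `nations_in_region` are hypothetical.

```python
import sqlite3

REGIONS = [
    (1, "na", "north america"),
    (2, "sa", "south america"),
    (3, "emea", "europe, middle east, africa"),
    (4, "ap", "asia pacific"),
]
# Nation names for 'ap' come from the lab output; the key values and the
# 'canada' row are assumptions made for this sketch.
NATIONS = [
    (1, "australia", 4),
    (2, "hong kong", 4),
    (3, "japan", 4),
    (4, "macau", 4),
    (5, "new zealand", 4),
    (6, "canada", 1),
]


def build_lab_tables():
    """Create REGION and NATION in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table region (r_regionkey integer, r_name char(25), r_comment varchar(152));
        create table nation (n_nationkey integer not null, n_name char(25) not null,
                             n_regionkey integer not null, n_comment varchar(152));
        """
    )
    conn.executemany("insert into region values (?, ?, ?)", REGIONS)
    conn.executemany(
        "insert into nation (n_nationkey, n_name, n_regionkey) values (?, ?, ?)",
        NATIONS,
    )
    return conn


def nations_in_region(conn, region_name):
    """Same join as step 11, with ORDER BY added so the output is stable."""
    query = """
        select n_name, r_name
        from nation, region
        where n_regionkey = r_regionkey
          and r_name = ?
        order by n_name
    """
    return conn.execute(query, (region_name,)).fetchall()


conn = build_lab_tables()
for n_name, r_name in nations_in_region(conn, "ap"):
    print(f"{n_name:<12} | {r_name}")
```

The join condition n_regionkey = r_regionkey pairs each nation with its region, and the filter on r_name keeps only the Asia Pacific rows. Without an ORDER BY clause, as in the lab query, the row order is not guaranteed.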
  • 30.
© Copyright IBM Corporation 2011. All Rights Reserved.

IBM Canada
8200 Warden Avenue
Markham, ON L6G 1C7
Canada

IBM, the IBM logo, ibm.com and Tivoli are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at ibm.com/legal/copytrade.shtml

Other company, product and service names may be trademarks or service marks of others.

References in this publication to IBM products and services do not imply that IBM intends to make them available in all countries in which IBM operates.

No part of this document may be reproduced or transmitted in any form without written permission from IBM Corporation.

Product data has been reviewed for accuracy as of the date of initial publication. Product data is subject to change without notice. Any statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

THE INFORMATION PROVIDED IN THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT.

IBM products are warranted according to the terms and conditions of the agreements (e.g. IBM Customer Agreement, Statement of Limited Warranty, International Program License Agreement, etc.) under which they are provided.
  • 31. Data Distribution Hands-On Lab IBM PureData System for Analytics … Powered by Netezza Technology
  • 32.
Table of Contents
1 Introduction
    1.1 Objectives
2 Skew
    2.1 Data Skew
    2.2 Processing Skew
3 Co-Location
    3.1 Investigation
    3.2 Co-Located Joins
4 Schema Creation
    4.1 Investigation
    4.2 Solution
  • 33.
1 Introduction

IBM PureData System is a family of data-warehousing appliances that combine high performance with low administrative effort. Due to the data-warehousing-centric architecture of PureData System, most performance tuning tasks are either unnecessary or automated. Unlike conventional data warehousing solutions, no tablespaces need to be created or tuned, and there are no indexes, buffer pools, or partitions.

Since PureData System is built on a massively parallel architecture that distributes data and workloads over a large number of processing and data nodes, the single most important tuning factor is picking the right distribution key. The distribution key governs which rows of a table are distributed to which data slice. It is very important to pick an optimal distribution key to avoid data skew and processing skew, and to make joins co-located whenever possible.

1.1 Objectives

In this lab we will cover a typical scenario in a POC (proof of concept) or customer engagement, which involves an existing data warehouse for customer transactions.

Figure 1: LABDB database

Figure 1 shows a visualization of the tables in the data warehouse and the relationships between the tables. The warehouse contains the customers of the company, their orders, and the line items that are part of each order. The warehouse also has a list of suppliers, which provide the parts that are part of the shipped line items.

For this lab we already have the DDLs for the creation of the tables and the load files containing the warehouse data. Both have already been transformed into a format usable by PureData System. In this lab we will define the distribution keys for these tables.

In addition to the data and the DDLs, we have also received a couple of queries from the customer that are usually run against the warehouse. These are important input as well for picking optimal distribution keys.
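The effect of the distribution key on how evenly rows land on data slices can be illustrated with a small simulation. The following is only a sketch under stated assumptions, written in Python: the number of data slices (4) is made up, CRC32 stands in for the appliance's internal hash function (which is not described in this lab), and the function names are hypothetical. It compares a key with only two distinct values, a key where every value is distinct, and round-robin distribution.

```python
import zlib

NUM_SLICES = 4  # assumption: a small system with four data slices


def hash_distribute(keys, num_slices=NUM_SLICES):
    """Count rows per slice when each row goes to hash(key) mod num_slices."""
    counts = [0] * num_slices
    for key in keys:
        # CRC32 is a stand-in for the real (undocumented here) hash function.
        counts[zlib.crc32(str(key).encode("utf-8")) % num_slices] += 1
    return counts


def round_robin_distribute(num_rows, num_slices=NUM_SLICES):
    """Count rows per slice when rows are dealt out one slice at a time."""
    base, extra = divmod(num_rows, num_slices)
    return [base + (1 if s < extra else 0) for s in range(num_slices)]


def skew_ratio(counts):
    """Largest slice relative to the ideal even share (1.0 means no skew)."""
    return max(counts) / (sum(counts) / len(counts))


rows = 10_000
status_flags = ["O" if i % 2 else "F" for i in range(rows)]  # two distinct values
order_keys = range(1, rows + 1)                              # all values distinct

print("low-cardinality key :", hash_distribute(status_flags))
print("high-cardinality key:", hash_distribute(order_keys))
print("round-robin         :", round_robin_distribute(rows))
```

With only two distinct key values, at most two of the four slices can ever receive rows, so the fullest slice holds at least twice its fair share. A key with many distinct values spreads rows over all slices, and round-robin is even by construction, although it gives up any chance of co-locating rows that share a join key.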
  • 34.
2 Skew

Tables in PureData System are distributed across data slices based on the distribution method and key. If a bad data distribution method has been picked, it will result in skewed tables or processing skew. Data skew occurs when the distribution method puts significantly more records of a table on one data slice than on other data slices. Apart from bad performance, this also results in a situation where the PureData System can hold significantly less data than expected. Processing skew occurs if the processing of queries mainly takes place on some data slices, for example because queries only apply to data on those data slices. Both types of skew result in suboptimal performance, since in a parallel system the slowest node defines the total execution time.

2.1 Data Skew

The first table we will create is LINEITEM, the main fact table of the schema. It contains roughly 6 million rows.

1. Connect to the Netezza image using PuTTY. Log in to 192.168.239.2 as user "nz" with password "nz". (192.168.239.2 is the default IP address for a local VM, which is used for most bootcamp environments. In cases where the images are hosted remotely, the instructors will provide the host IPs, which will vary between machines.)

2. If you are continuing from the previous lab and are already connected to nzsql, quit the nzsql console with the \q command.

3. To create the LINEITEM table, switch to the lab directory /labs/dataDistribution. To do this, use the following command (notice that you can use bash auto-completion with the Tab key to complete folder and file names):

    [nz@netezza ~]$ cd /labs/dataDistribution/

4. Create the LINEITEM table by using the following script. Since the fact table is quite large, this can take a couple of minutes.

    [nz@netezza dataDistribution]$ ./create_lineitem_1.sh

You should see a result similar to the following. The error message at the beginning is expected since the script tries to clean up existing LINEITEM tables:

    [nz@netezza dataDistribution]$ ./create_lineitem_1.sh
    ERROR: Table 'LINEITEM' does not exist
    CREATE TABLE
    Load session of table 'LINEITEM' completed successfully

5. Now let's have a look at the created table. Open the nzsql console by entering the command nzsql:

    [nz@netezza dataDistribution]$ nzsql
    Welcome to nzsql, the Netezza SQL interactive terminal.

    Type:  \h for help with SQL commands
           \? for help on internal slash commands
           \g or terminate with semicolon to execute query
           \q to quit

    SYSTEM(ADMIN)=>
  • 35. 6. Connect to the database LABDB as user LABADMIN by typing the following command:

    SYSTEM(ADMIN)=> \c LABDB LABADMIN

You should now be connected to the LABDB database as the LABADMIN user:

    SYSTEM(ADMIN)=> \c LABDB LABADMIN
    You are now connected to database LABDB as user LABADMIN.
    LABDB(LABADMIN)=>

7. Let’s have a look at the table we just created. First we want to see a description of its columns and distribution key. Use the NZSQL describe command \d LINEITEM to get a description of the table. This should have the following result:

    TPCH(TPCHADMIN)=> \d LINEITEM
                           Table "LINEITEM"
        Attribute    |         Type          | Modifier | Default Value
    -----------------+-----------------------+----------+---------------
     L_ORDERKEY      | INTEGER               | NOT NULL |
     L_PARTKEY       | INTEGER               | NOT NULL |
     L_SUPPKEY       | INTEGER               | NOT NULL |
     L_LINENUMBER    | INTEGER               | NOT NULL |
     L_QUANTITY      | NUMERIC(15,2)         | NOT NULL |
     L_EXTENDEDPRICE | NUMERIC(15,2)         | NOT NULL |
     L_DISCOUNT      | NUMERIC(15,2)         | NOT NULL |
     L_TAX           | NUMERIC(15,2)         | NOT NULL |
     L_RETURNFLAG    | CHARACTER(1)          | NOT NULL |
     L_LINESTATUS    | CHARACTER(1)          | NOT NULL |
     L_SHIPDATE      | DATE                  | NOT NULL |
     L_COMMITDATE    | DATE                  | NOT NULL |
     L_RECEIPTDATE   | DATE                  | NOT NULL |
     L_SHIPINSTRUCT  | CHARACTER(25)         | NOT NULL |
     L_SHIPMODE      | CHARACTER(10)         | NOT NULL |
     L_COMMENT       | CHARACTER VARYING(44) | NOT NULL |
    Distributed on hash: "L_LINESTATUS"

We can see that the LINEITEM table has 16 columns with different data types. Some of the columns have a “key” suffix and contain the names of other tables; these are most likely foreign keys to dimension tables. The distribution key is L_LINESTATUS, which has the data type CHAR(1).

8. Now let’s have a look at the data in the table. To return a limited number of rows you can use the LIMIT keyword in your SELECT queries. Execute the following SELECT command to return 10 rows of the LINEITEM table. For readability we only select a few columns: the order key, the quantity, the ship date and the line status, which is the distribution key:

    LABDB(LABADMIN)=> SELECT L_ORDERKEY, L_QUANTITY, L_SHIPDATE, L_LINESTATUS FROM LINEITEM LIMIT 10;

The results are shown on the next slide.
  • 36. You will see results similar to the following:

    LABDB(LABADMIN)=> SELECT L_ORDERKEY, L_QUANTITY, L_SHIPDATE, L_LINESTATUS FROM LINEITEM LIMIT 10;
     L_ORDERKEY | L_QUANTITY | L_SHIPDATE | L_LINESTATUS
    ------------+------------+------------+--------------
              2 |      38.00 | 1997-01-28 | O
              6 |      37.00 | 1992-04-27 | F
             34 |      13.00 | 1998-10-23 | O
             34 |      22.00 | 1998-10-09 | O
             34 |       6.00 | 1998-10-30 | O
             38 |      44.00 | 1996-09-29 | O
             66 |      31.00 | 1994-02-19 | F
             66 |      41.00 | 1994-02-21 | F
             70 |       8.00 | 1994-01-12 | F
             70 |      13.00 | 1994-03-03 | F
    (10 rows)

From this limited sample we cannot make any definite judgments, but we can make a couple of assumptions. While the L_ORDERKEY column is not unique, it seems to have a lot of distinct values. The L_SHIPDATE column also appears to have a lot of distinct shipping date values. Our current distribution key L_LINESTATUS, on the other hand, shows only two values, which may make it a bad distribution key. Since a database table is an unordered set, you may well get different rows, for example only “O” or only “F” values in the L_LINESTATUS column.

9. We will now verify the number of distinct values in the L_LINESTATUS column with a “SELECT DISTINCT …” call. To return a list of all values in the L_LINESTATUS column, execute the following SQL command:

    LABDB(LABADMIN)=> SELECT DISTINCT L_LINESTATUS FROM LINEITEM;

You should see the following results:

    LABDB(LABADMIN)=> SELECT DISTINCT L_LINESTATUS FROM LINEITEM;
     L_LINESTATUS
    --------------
     O
     F
    (2 rows)

We can see that the L_LINESTATUS column contains only two distinct values. As a distribution key, this results in a table that is distributed to at most two of the available dataslices.

10. We verify this by executing the following SQL call, which returns a list of all dataslices that contain rows of the LINEITEM table, together with the number of rows stored on each:

    LABDB(LABADMIN)=> SELECT DATASLICEID, COUNT(*) FROM LINEITEM GROUP BY DATASLICEID;

The output of this query is shown on the next slide.
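Why can a two-valued key never fill more than two dataslices? The following is a minimal Python sketch of hash distribution, not PureData code. The function `slice_for` is a hypothetical stand-in for the appliance's internal hash function, which this lab does not describe, and the four slices mirror the lab VM. It only illustrates that every row with the same key value must land on the same slice.

```python
import hashlib
from collections import Counter

NUM_SLICES = 4  # the lab VM has four dataslices


def slice_for(key, num_slices=NUM_SLICES):
    """Hypothetical stand-in for the appliance hash: map a key value to a slice id 1..num_slices."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_slices + 1


def rows_per_slice(keys):
    """Similar in spirit to SELECT DATASLICEID, COUNT(*) ... GROUP BY DATASLICEID."""
    return Counter(slice_for(k) for k in keys)


# Low-cardinality key: only 'O' and 'F', like L_LINESTATUS
status_keys = ["O" if i % 2 == 0 else "F" for i in range(100_000)]
# High-cardinality key: many distinct integers, like L_ORDERKEY
order_keys = list(range(100_000))

by_status = rows_per_slice(status_keys)
by_order = rows_per_slice(order_keys)

print("distributed on status   :", dict(by_status))
print("distributed on order key:", dict(by_order))
```

With two distinct key values the counter can never hold more than two slice ids, no matter how many rows are loaded, while the many-valued key spreads rows over all four slices.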
  • 37. The query from step 10 will produce output similar to the following:

    LABDB(LABADMIN)=> SELECT DATASLICEID, COUNT(*) FROM LINEITEM GROUP BY DATASLICEID;
     DATASLICEID |  COUNT
    -------------+---------
               1 | 3004998
               4 | 2996217
    (2 rows)

Every PureData System table has a hidden column DATASLICEID, which contains the id of the dataslice on which the selected row is stored. A SQL query that does a GROUP BY on this column and counts the number of rows for each dataslice id detects data skew. In this case the table has been distributed, as expected, to only two of the four available dataslices. This means that we use only half of the available space, and most queries will perform poorly.

In general a good distribution key should have a large number of distinct values with an even value distribution. Columns with a low number of distinct values, especially boolean columns, should not be considered as distribution keys.

2.2 Processing Skew

Even in tables that are distributed evenly across dataslices, data processing for queries can be concentrated on, or skewed to, a limited number of dataslices. This can happen because PureData System is able to ignore data extents (sets of data pages) that do not match a given WHERE condition. We will cover the mechanism behind that in the zone map chapter.

1. First we will pick a new distribution key. As we have seen, it should have a large number of distinct values. One of the columns that fit this description was the L_SHIPDATE column. Check the number of distinct values in the ship date column with the COUNT(DISTINCT … ) statement:

    LABDB(LABADMIN)=> SELECT COUNT(DISTINCT L_SHIPDATE) FROM LINEITEM;

You will get a result similar to the following:

    LABDB(LABADMIN)=> SELECT COUNT(DISTINCT L_SHIPDATE) FROM LINEITEM;
     COUNT
    -------
      2526
    (1 row)

The column has over 2500 distinct values, which is more than enough to allow a good data distribution on 4 dataslices, assuming that the value distribution is good as well.

2. Now let’s reload the LINEITEM table with the new distribution key. For this we need to change the SQL of the loading script we executed at the beginning of the lab. Exit the nzsql console by entering: \q

3. You should now be in the lab directory /labs/dataDistribution. The table creation statement is in the lineitem.sql file. We will need to change the file with a text editor. Open the file with the default Linux text editor vi by entering the following command: vi lineitem.sql

4. The vi editor has two modes: a command mode used to save files, quit the editor and so on, and an insert mode. Initially you will be in command mode. To change the file you need to switch to insert mode by pressing “i”. The editor will show -- INSERT -- at the bottom of the screen.

5. You can now use the cursor keys to navigate to the DISTRIBUTE ON clause at the bottom of the create command. Change the distribution key to “l_shipdate”. The resulting editor content is shown on the next slide.
  • 38. After the change in step 5, the editor should look like the following:

    create table lineitem
    (
        l_orderkey integer not null ,
        l_partkey integer not null ,
        l_suppkey integer not null ,
        l_linenumber integer not null ,
        l_quantity decimal(15,2) not null ,
        l_extendedprice decimal(15,2) not null ,
        l_discount decimal(15,2) not null ,
        l_tax decimal(15,2) not null ,
        l_returnflag char(1) not null ,
        l_linestatus char(1) not null ,
        l_shipdate date not null ,
        l_commitdate date not null ,
        l_receiptdate date not null ,
        l_shipinstruct char(25) not null ,
        l_shipmode char(10) not null ,
        l_comment varchar(44) not null
    )
    DISTRIBUTE ON (l_shipdate);
    ~
    ~
    ~
    -- INSERT --

6. We will now save our changes. Press “Esc” to switch back to command mode. The “-- INSERT --” string at the bottom of the screen should vanish. Enter :wq! and press Enter to write the file and quit the editor without any prompts. If you made a mistake while editing and would like to discard it, press “Esc”, enter :q! and go back to step 3.

7. Now repeat steps 4-6 of section 2.1 Data Skew:
   a. Recreate and load the LINEITEM table with your new distribution key by executing the ./create_lineitem_1.sh command.
   b. Use the nzsql command to enter the command console.
   c. Switch to the LABDB database by using the \c LABDB LABADMIN command.

8. Now we verify that the new distribution key results in a good data distribution. For this we repeat the query that returns the number of rows for each dataslice id of the LINEITEM table. Execute the following command:

    LABDB(LABADMIN)=> SELECT DATASLICEID, COUNT(*) FROM LINEITEM GROUP BY DATASLICEID;

The results are shown on the next slide.
  • 39. Your results for the query in step 8 should look similar to the following:

    TPCH(TPCHADMIN)=> SELECT DATASLICEID, COUNT(*) FROM LINEITEM GROUP BY DATASLICEID;
     DATASLICEID |  COUNT
    -------------+---------
               2 | 1497649
               3 | 1501760
               4 | 1501816
               1 | 1499990
    (4 rows)

We can see that the data distribution is much better now. All four data slices hold a roughly equal number of rows.

9. Now that we have a database table with a good data distribution, let’s look at a couple of queries we have received from the customer. The following query is executed regularly by the customer. It returns the average quantity shipped on a given day, grouped by the shipping mode. Execute the following query:

    LABDB(LABADMIN)=> SELECT AVG(L_QUANTITY) AS AVG_Q, L_SHIPMODE FROM LINEITEM WHERE L_SHIPDATE = '1996-03-29' GROUP BY L_SHIPMODE;

Your results should look like the following:

    LABDB(LABADMIN)=> SELECT AVG(L_QUANTITY) AS AVG_Q, L_SHIPMODE FROM LINEITEM WHERE L_SHIPDATE = '1996-03-29' GROUP BY L_SHIPMODE;
       AVG_Q   | L_SHIPMODE
    -----------+------------
     26.045455 | MAIL
     27.147826 | TRUCK
     26.038567 | FOB
     24.780282 | RAIL
     25.708556 | AIR
     24.494186 | REG AIR
     25.562500 | SHIP
    (7 rows)

This query takes all rows with a ship date of March 29, 1996 and computes the average value of the L_QUANTITY column for each L_SHIPMODE value. It is a typical warehousing query insofar as a date column is used to restrict the row set that is taken as input for the computation. In this example most rows of the LINEITEM table are filtered out; only rows with the specified date are used as input for the AVG aggregation.

10. Execute the following SQL statement to see on which data slice the rows for March 29, 1996 can be found:

    LABDB(LABADMIN)=> SELECT COUNT(*), DATASLICEID FROM LINEITEM WHERE L_SHIPDATE = '1996-03-29' GROUP BY DATASLICEID;

The result is shown on the next slide.
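The effect that step 10 checks for can be reproduced in a small simulation. This is a sketch under stated assumptions, not appliance behavior: `slice_for` is a hypothetical hash function, and the toy table has 2,000 ship dates with 100 rows each. It compares which slices still hold matching rows after a filter on one date, once with the table distributed on the ship date and once on the order key.

```python
import hashlib
from collections import Counter
from datetime import date, timedelta

NUM_SLICES = 4  # as in the lab VM


def slice_for(key, num_slices=NUM_SLICES):
    """Hypothetical stand-in for the appliance hash: key value -> slice id 1..num_slices."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_slices + 1


# Toy LINEITEM rows: (order_key, ship_date), 100 rows for each of 2,000 dates
start = date(1992, 1, 1)
rows = [(d * 100 + i, start + timedelta(days=d))
        for d in range(2000) for i in range(100)]

target = date(1996, 3, 29)  # the date used in the customer query


def slices_touched(table, dist_col, target_date):
    """Rows per slice that survive WHERE ship_date = target_date
    when the table is distributed on column index dist_col."""
    return Counter(slice_for(r[dist_col]) for r in table if r[1] == target_date)


by_date_key = slices_touched(rows, 1, target)   # distributed on ship date
by_order_key = slices_touched(rows, 0, target)  # distributed on order key

print("distributed on ship date:", dict(by_date_key))
print("distributed on order key:", dict(by_order_key))
```

When the filter column is also the distribution key, all qualifying rows share one hash value and therefore one slice, so a single slice does all the work. Distributed on the order key, the same 100 rows are spread over several slices.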
  • 40. For the query in step 10 you should see the following:

    LABDB(LABADMIN)=> SELECT COUNT(*), DATASLICEID FROM LINEITEM WHERE L_SHIPDATE = '1996-03-29' GROUP BY DATASLICEID;
     COUNT | DATASLICEID
    -------+-------------
      2501 |           2
    (1 row)

Since we used the shipping date column as the distribution key, all rows for a specific date can be found on one data slice and therefore also on one SPU. This means that for our previous query all rows on the other data slices are dismissed and the computation takes place on only one dataslice and SPU. This is known as processing skew. While this one SPU is working, the other SPUs are idle.

Columns that are often used in WHERE conditions shouldn’t be used as distribution keys, since this can easily result in processing skew. In warehousing environments this is especially true for date columns.

Good distribution keys are key columns; they have lots of distinct values and very rarely result in processing skew. In our example we have a few candidate distribution keys to choose from: L_SUPPKEY, L_ORDERKEY and L_PARTKEY. All have a large number of distinct values.

3 Co-Location

The most basic warehouse schema consists of a fact table containing a list of all business transactions and a set of dimension tables that contain the different actors, objects, locations and time points that have taken part in these transactions. This means that most queries will not access just one database table but will require joins between tables.

In a PureData System database, tables are distributed over a potentially large number of data slices on different SPUs. This means that during a join of two tables there are two possibilities:
• Rows of the two tables that belong together are situated on the same dataslice, which means that they are co-located and can be joined locally.
• Rows that belong together are situated on different dataslices, which means that the tables need to be redistributed.

3.1 Investigation

Co-location has big performance advantages, because rows do not have to be moved between dataslices before they can be joined. In the following section we will demonstrate this by introducing a second table, ORDERS.

1. Switch to the Linux command line if you are in the NZSQL console. Do this with the \q command.

2. Switch to the data distribution lab directory with the command cd /labs/dataDistribution

3. Create and load the ORDERS table by executing the following command: ./create_orders_1.sh

4. Enter the NZSQL console with the nzsql labdb labadmin command.

5. Let’s take a look at the ORDERS table with the \d orders command. The results are shown on the next slide.
  • 41. For the \d orders command you should see the following results:

    LABDB(LABADMIN)=> \d orders
                           Table "ORDERS"
        Attribute    |         Type          | Modifier | Default Value
    -----------------+-----------------------+----------+---------------
     O_ORDERKEY      | INTEGER               | NOT NULL |
     O_CUSTKEY       | INTEGER               | NOT NULL |
     O_ORDERSTATUS   | CHARACTER(1)          | NOT NULL |
     O_TOTALPRICE    | NUMERIC(15,2)         | NOT NULL |
     O_ORDERDATE     | DATE                  | NOT NULL |
     O_ORDERPRIORITY | CHARACTER(15)         | NOT NULL |
     O_CLERK         | CHARACTER(15)         | NOT NULL |
     O_SHIPPRIORITY  | INTEGER               | NOT NULL |
     O_COMMENT       | CHARACTER VARYING(79) | NOT NULL |
    Distributed on random: (round-robin)

The ORDERS table has a key column O_ORDERKEY that is most likely the primary key of the table. The table contains information on the order value, priority and date, and it has been distributed on random. This means that PureData System doesn’t use a hash-based algorithm to distribute the data. Instead, rows are distributed randomly across the available data slices.

You can check the data distribution of the table using the methods we used before for the LINEITEM table. The data distribution will be very even. There will also not be any processing skew for queries on the single table, since with a random distribution there can be no correlation between any WHERE condition and the placement of the rows.

6. We have received another typical query from our customer. It returns the average total price and item quantity of all orders, grouped by the order priority. This query has to join the LINEITEM and ORDERS tables to get the total order cost from the ORDERS table and the quantity for each shipped item from the LINEITEM table. The tables are joined with an inner join on the L_ORDERKEY and O_ORDERKEY columns. Execute the following query and note the approximate execution time:

    LABDB(LABADMIN)=>SELECT AVG(O.O_TOTALPRICE) AS PRICE, AVG(L.L_QUANTITY) AS QUANTITY, O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;

You should see the following results:

    LABDB(LABADMIN)=>SELECT AVG(O.O_TOTALPRICE) AS PRICE,AVG(L.L_QUANTITY) AS QUANTITY, O.O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;
         PRICE     | QUANTITY  | O_ORDERPRIORITY
    ---------------+-----------+-----------------
     189285.029553 | 25.526186 | 2-HIGH
     189219.594349 | 25.532474 | 5-LOW
     189093.608965 | 25.513563 | 1-URGENT
     189026.093657 | 25.494518 | 3-MEDIUM
     188546.457203 | 25.472923 | 4-NOT SPECIFIED
    (5 rows)

Notice that the query takes about a minute to complete on our machine. The actual execution times on your machine will be different.

7. Remember that the ORDERS table was distributed randomly and the LINEITEM table is still distributed by the L_SHIPDATE column. The join, on the other hand, takes place on the L_ORDERKEY and O_ORDERKEY columns. We will now have a quick look at what happens inside PureData System in this scenario. To do this we use the PureData System EXPLAIN function, which is covered more thoroughly in the Optimization lab.
  • 42. Execute the following command:

    LABDB(LABADMIN)=>EXPLAIN VERBOSE SELECT AVG(O.O_TOTALPRICE) AS PRICE, AVG(L.L_QUANTITY) AS QUANTITY, O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;

You will get a long output. Scroll up until you see your command in the text window. The start of the EXPLAIN output should look like the following:

    EXPLAIN VERBOSE SELECT AVG(O.O_TOTALPRICE) AS PRICE,AVG(L.L_QUANTITY) AS QUANTITY, O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;

    QUERY VERBOSE PLAN:

    Node 1.
      [SPU Sequential Scan table "ORDERS" as "O" {}]
          -- Estimated Rows = 1500000, Width = 27, Cost = 0.0 .. 578.6, Conf = 100.0
          Projections:
            1:O.O_TOTALPRICE 2:O.O_ORDERPRIORITY 3:O.O_ORDERKEY
      [SPU Distribute on {(O.O_ORDERKEY)}]
      [HashIt for Join]
    Node 2.
      [SPU Sequential Scan table "LINEITEM" as "L" {(L.L_SHIPDATE)}]
          -- Estimated Rows = 6001215, Width = 12, Cost = 0.0 .. 2417.5, Conf = 100.0
          Projections:
            1:L.L_QUANTITY 2:L.L_ORDERKEY
      [SPU Distribute on {(L.L_ORDERKEY)}]
    ...

The EXPLAIN functionality will be covered in detail in a following chapter, but it is easy to see what is happening here: the [SPU Distribute on ...] entries show that the system redistributes both the ORDERS and the LINEITEM table on the join key. Both tables are of significant size, so this causes considerable overhead. The redistribution occurs because neither table is distributed on the join column. In the next section we will fix this.

3.2 Co-Located Joins

In the last section we have seen that a query using joins can result in costly data redistribution during join execution when the joined tables are not distributed on the join key. In this section we will reload the tables based on the mutual join key to improve join performance.

1. Exit the NZSQL console with the \q command.

2. Switch to the dataDistribution directory with the cd /labs/dataDistribution command.

3. Change the distribution key in the lineitem.sql file to L_ORDERKEY:
   a. Open the file with the vi editor by executing the command: vi lineitem.sql
   b. Switch to INSERT mode by pressing “i”.
   c. Navigate with the cursor keys to the DISTRIBUTE ON clause and change it to DISTRIBUTE ON (L_ORDERKEY).
   d. Exit the INSERT mode by pressing ESC.
  • 43. e. Enter :wq! in the command line of the vi editor and press Enter. Before pressing Enter your screen should look like the following:

    create table lineitem
    (
        l_orderkey integer not null ,
        l_partkey integer not null ,
        l_suppkey integer not null ,
        l_linenumber integer not null ,
        l_quantity decimal(15,2) not null ,
        l_extendedprice decimal(15,2) not null ,
        l_discount decimal(15,2) not null ,
        l_tax decimal(15,2) not null ,
        l_returnflag char(1) not null ,
        l_linestatus char(1) not null ,
        l_shipdate date not null ,
        l_commitdate date not null ,
        l_receiptdate date not null ,
        l_shipinstruct char(25) not null ,
        l_shipmode char(10) not null ,
        l_comment varchar(44) not null
    )
    DISTRIBUTE ON (l_orderkey);
    ~
    :wq!

4. Change the distribution key in the orders.sql file to O_ORDERKEY:
   a. Open the file with the vi editor by executing the command: vi orders.sql
   b. Switch to INSERT mode by pressing “i”.
   c. Navigate with the cursor keys to the DISTRIBUTE ON clause and change it to DISTRIBUTE ON (O_ORDERKEY).
   d. Exit the INSERT mode by pressing ESC.
   e. Enter :wq! in the command line of the vi editor and press Enter. Before pressing Enter your screen should look like the following:

    create table orders
    (
        o_orderkey integer not null ,
        o_custkey integer not null ,
        o_orderstatus char(1) not null ,
        o_totalprice decimal(15,2) not null ,
        o_orderdate date not null ,
        o_orderpriority char(15) not null ,
        o_clerk char(15) not null ,
        o_shippriority integer not null ,
        o_comment varchar(79) not null
    )
    DISTRIBUTE ON (o_orderkey);
    ~
    :wq!

5. Recreate and load the LINEITEM table with the distribution key L_ORDERKEY by executing the command: ./create_lineitem_1.sh
  • 44. 6. Recreate and load the ORDERS table with the distribution key O_ORDERKEY by executing the command: ./create_orders_1.sh

7. Enter the NZSQL console by executing the following command: nzsql labdb labadmin

8. Run the EXPLAIN of our join query from the previous section again by executing the following command:

    LABDB(LABADMIN)=>EXPLAIN VERBOSE SELECT AVG(O.O_TOTALPRICE) AS PRICE, AVG(L.L_QUANTITY) AS QUANTITY, O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;

The query itself has not been changed. The only changes are in the distribution keys of the involved tables. You will again see a long output. Scroll up to the start of the output, directly after your query. You should see output similar to the following:

    EXPLAIN VERBOSE SELECT AVG(O.O_TOTALPRICE) AS PRICE, AVG(L.L_QUANTITY) AS QUANTITY, O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;

    QUERY VERBOSE PLAN:

    Node 1.
      [SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
          -- Estimated Rows = 1500000, Width = 27, Cost = 0.0 .. 578.6, Conf = 100.0
          Projections:
            1:O.O_TOTALPRICE 2:O.O_ORDERPRIORITY 3:O.O_ORDERKEY
      [HashIt for Join]
    Node 2.
      [SPU Sequential Scan table "LINEITEM" as "L" {(L.L_ORDERKEY)}]
          -- Estimated Rows = 6001215, Width = 12, Cost = 0.0 .. 2417.5, Conf = 100.0
          Projections:
            1:L.L_QUANTITY 2:L.L_ORDERKEY
    ...

Again we do not want to make a complete analysis of the EXPLAIN output; we will cover that in more detail in later chapters. But if you compare this output with the output of the last section, you will see that the [SPU Distribute on {(O.O_ORDERKEY)}] and [SPU Distribute on {(L.L_ORDERKEY)}] entries have vanished. The reason is that the join is now co-located, because both tables are distributed on the join key. You may see a distribution node further down for the execution of the GROUP BY clause, but it is estimated to distribute only about a hundred rows, which has no noticeable performance impact.

9. Finally, execute the join query again:

    LABDB(LABADMIN)=>SELECT AVG(O.O_TOTALPRICE) AS PRICE, AVG(L.L_QUANTITY) AS QUANTITY, O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;

The query should return the same results as in the previous section but run much faster, even in the VMware environment. In a real PureData System appliance with 6, 12 or more SPUs the difference would be much more significant.

You have now loaded the LINEITEM and ORDERS tables into your PureData System appliance using the distribution key that is best for these tables in most situations:
   a. Both tables are distributed evenly across dataslices, so there is no data skew.
  • 45. b. The distribution key is highly unlikely to result in processing skew, since most WHERE conditions that restrict a key column select rows evenly across the dataslices.
   c. Since ORDERS is a parent table of LINEITEM, with a foreign key relationship between them, most queries that join them will use the join key. These queries will be co-located.

Finally, we will pick the distribution keys for the full schema.

4 Schema Creation

Now that we have created the ORDERS and LINEITEM tables we need to pick the distribution keys for the remaining tables as well.

4.1 Investigation

Figure 2 LABDB database

You will notice that it is much harder to find optimal distribution keys in a more complicated schema like this. In many situations you will be forced to choose between enabling co-located joins for one set of tables or for another. The following provides some details on the tables:

    Table    | Number of Rows | Primary Key
    ---------+----------------+-------------
    REGION   |              5 | R_REGIONKEY
    NATION   |             25 | N_NATIONKEY
    CUSTOMER |         150000 | C_CUSTKEY
    ORDERS   |        1500000 | O_ORDERKEY
    SUPPLIER |          10000 | S_SUPPKEY
    PART     |         200000 | P_PARTKEY
    PARTSUPP |         800000 | --
    LINEITEM |        6000000 | --

And on the involved relationships:

    Parent Table | Child Table | Parent Table Join Column | Child Table Join Column
    -------------+-------------+--------------------------+-------------------------
    REGION       | NATION      | R_REGIONKEY              | N_REGIONKEY
    NATION       | CUSTOMER    | N_NATIONKEY              | C_NATIONKEY
    NATION       | SUPPLIER    | N_NATIONKEY              | S_NATIONKEY
    CUSTOMER     | ORDERS      | C_CUSTKEY                | O_CUSTKEY
    ORDERS       | LINEITEM    | O_ORDERKEY               | L_ORDERKEY
    SUPPLIER     | LINEITEM    | S_SUPPKEY                | L_SUPPKEY
    SUPPLIER     | PARTSUPP    | S_SUPPKEY                | PS_SUPPKEY
    PART         | LINEITEM    | P_PARTKEY                | L_PARTKEY
    PART         | PARTSUPP    | P_PARTKEY                | PS_PARTKEY

Given all that you heard in the presentation and lab, try to fill in the distribution keys in the chart below. Let’s assume that we will not change the distribution keys for LINEITEM and ORDERS anymore.

    Table    | Distribution Key (up to 4 columns) or Random
    ---------+----------------------------------------------
    REGION   |
    NATION   |
    CUSTOMER |
    SUPPLIER |
    PART     |
    PARTSUPP |
    ORDERS   | O_ORDERKEY
    LINEITEM | L_ORDERKEY

4.2 Solution

It is important to note that there is no single optimal way to pick distribution keys. It always depends on the queries that run against the database. Without these queries it is only possible to follow some general rules:
• Co-location between big tables (especially if a fact table is involved) is more important than between small tables.
  • 47. • Very small tables can be broadcast by the system with little performance penalty. If one table of a join is broadcast, the other will not need to be redistributed.
• If you suspect that there will be lots of queries joining two big tables but you cannot distribute both of them on the expected join key, distributing one table on the join key is better than nothing, since it leads to a single redistribute instead of a double redistribute.

If we break down the problem, we can see that PART and PARTSUPP are the two biggest of the remaining tables. Based on the available customer queries we have already distributed the LINEITEM table on the order key, so it makes sense to distribute PART and PARTSUPP on their join keys.

CUSTOMER is big as well and has two relationships. The first is with the very small NATION table, which is easily broadcast by the system. The second is with the ORDERS table, which is big as well but already distributed on the order key. As mentioned above, a single redistribute is better than a double redistribute. Therefore it makes sense to distribute the CUSTOMER table on the customer key, which is also the join key of this relationship.

The situation is very similar for the SUPPLIER table. It has two very large child tables, PARTSUPP and LINEITEM, which are both related to it through the supplier key, so it should be distributed on this key.

NATION and REGION are both very small and will most likely be broadcast by the optimizer. You could distribute those tables randomly, on their primary keys, or on their join keys. In this case we have decided to distribute both on their primary keys, but there is no definite right or wrong approach.

One possible solution for the distribution keys could be the following:

    Table    | Distribution Key (up to 4 columns) or Random
    ---------+----------------------------------------------
    REGION   | R_REGIONKEY
    NATION   | N_NATIONKEY
    CUSTOMER | C_CUSTKEY
    SUPPLIER | S_SUPPKEY
    PART     | P_PARTKEY
    PARTSUPP | PS_PARTKEY
    ORDERS   | O_ORDERKEY
    LINEITEM | L_ORDERKEY

Finally we will load the remaining tables.

1. You should still be connected to the LABDB database. We now need to recreate the NATION and REGION tables with a new distribution key. To drop the old versions, execute the following command:

    LABDB(LABADMIN)=> DROP TABLE NATION, REGION;

2. Quit the NZSQL console with the \q command.

3. Navigate to the lab folder by executing the following command: cd /labs/dataDistribution

4. Review the SQL script that creates the remaining 6 tables with the command: more remaining_tables.sql

You will see the SQL script used for creating the remaining tables with the distribution keys mentioned above. Press the Enter key to scroll down until you reach the end of the file.

5. Create the remaining tables and load the data into them with the following command: ./create_remaining.sh
  • 48. You should see the following results. The error message at the top is expected, since the script tries to clean up any old tables of the same name in case a reload is necessary.

    [nz@netezza dataDistribution]$ ./create_remaining.sh
    ERROR: Table 'NATION' does not exist
    CREATE TABLE
    CREATE TABLE
    CREATE TABLE
    CREATE TABLE
    CREATE TABLE
    CREATE TABLE
    Load session of table 'NATION' completed successfully
    Load session of table 'REGION' completed successfully
    Load session of table 'CUSTOMER' completed successfully
    Load session of table 'SUPPLIER' completed successfully
    Load session of table 'PART' completed successfully
    Load session of table 'PARTSUPP' completed successfully

Congratulations! You have just defined data distribution keys for a customer data schema in PureData System. You can have a look at the created tables and their definitions with the commands you used in the previous chapters. We will continue to use the tables we created in the following labs.
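The difference between a co-located join, a single redistribute and a double redistribute from sections 3 and 4.2 can be illustrated with a small simulation. This is a simplified sketch, not a model of the actual PureData optimizer: `slice_for` is a hypothetical hash function, the toy tables and column positions are made up for the example, and broadcasting of small tables is ignored. A row counts as moved when the slice it lives on differs from the slice a join on the order key needs it on.

```python
import hashlib

NUM_SLICES = 4  # as in the lab VM


def slice_for(key, num_slices=NUM_SLICES):
    """Hypothetical stand-in for the appliance hash: key value -> slice id 1..num_slices."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_slices + 1


# Toy tables: ORDERS(order_key, cust_key) and LINEITEM(order_key, part_key)
orders = [(o, o % 500) for o in range(1, 5001)]
lineitem = [(o, (o * 7 + n) % 2000) for o in range(1, 5001) for n in range(4)]


def rows_to_move(table, dist_col, join_col):
    """Count rows whose current slice (hash of dist_col) differs from the
    slice required for a join on join_col (hash of join_col)."""
    return sum(1 for row in table
               if slice_for(row[dist_col]) != slice_for(row[join_col]))


# Both tables distributed on the join key: co-located, nothing moves.
colocated = rows_to_move(orders, 0, 0) + rows_to_move(lineitem, 0, 0)
# Only ORDERS distributed on the join key: single redistribute (LINEITEM moves).
single = rows_to_move(orders, 0, 0) + rows_to_move(lineitem, 1, 0)
# Neither table distributed on the join key: double redistribute.
double = rows_to_move(orders, 1, 0) + rows_to_move(lineitem, 1, 0)

print("co-located join      :", colocated, "rows moved")
print("single redistribute  :", single, "rows moved")
print("double redistribute  :", double, "rows moved")
```

The ordering of the three counts is the point of the sketch: distributing both tables on the join key moves nothing, and distributing one of them on the join key moves fewer rows than distributing neither.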
  • 50. IBM Software Information Management IBM PureData System Administrator Hands-On Lab IBM PureData System for Analytics … Powered by Netezza Technology
  • 51. Table of Contents
    1 Introduction
    2 Installing NzAdmin
    3 The System Tab
    4 The Database Tab
    5 Tools
    5.1 Workload Management
    5.2 Table Storage
    5.3 Table Skew
  • 52. 1 Introduction
    In this lab we will explore the features of the IBM PureData System Administrator GUI tool, NzAdmin. NzAdmin is a Windows-based application that allows users to manage the system, obtain hardware information and status, and manage various aspects of user databases, tables, and objects. NzAdmin consists of two distinct environments: the System tab and the Database tab. We will look at both. When you click either tab, the system displays the tree view on the left side and the data view on the right side.
    The VMware image we are using in the labs differs significantly from a normal PureData System appliance. There is only one virtualized SPA and SPU, only 4 data slices, and no data slice mirroring. In addition, some NzAdmin functions do not work with the VM. For example, the SPU and SPA sections are blank and the data distribution of a table cannot be displayed. Nevertheless, most functionality works and should provide a good overview.
    2 Installing NzAdmin
    NzAdmin is part of the PureData System client tools for Windows. It can be installed with a standard Windows installer and doesn’t require the JDBC or ODBC drivers to be installed, since it contains its own connection libraries.
    1. The installation package is in “C:\Bootcamp\Netezza_Bootcamp_VMs\Tools\nzsetup.exe”. (The base directory C:\Bootcamp may differ in your environment. There should be a shortcut on your Desktop as well. If you cannot find it, ask the instructor for help.)
    2. Install the NzAdmin client by double-clicking on the installer and accepting all standard settings.
    3. You can start NzAdmin from the Windows Start Menu: Programs -> IBM PureData System -> IBM PureData System Administrator
    4. Connect to your PureData System host with the IP address taken from the VM where the PureData System host is running (you can use ifconfig eth0 in the Linux terminal window). In our lab the IP address is 192.168.239.2, the username is “admin”, and the password is “password”.
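    The lookup in step 4 can be sketched as follows. This is only an illustration: the sample text is a hypothetical stand-in for what ifconfig eth0 might print on an older Linux host (the “inet addr:” layout is an assumption, and newer systems format this differently). On the VM you would pipe the real ifconfig eth0 output into the same sed expression.

```shell
# Hypothetical sample of `ifconfig eth0` output (older net-tools layout, assumed).
sample='eth0      Link encap:Ethernet  HWaddr 00:0C:29:AA:BB:CC
          inet addr:192.168.239.2  Bcast:192.168.239.255  Mask:255.255.255.0'

# Keep only the IPv4 address that follows "inet addr:".
ip=$(printf '%s\n' "$sample" | sed -n 's/.*inet addr:\([0-9.]*\).*/\1/p')

echo "Host IP to enter in NzAdmin: $ip"
```

    If the address printed on your VM differs from 192.168.239.2, use the one your VM reports.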
  • 53. 5. You should see the following:
    The Admin client has a navigation menu on the left with two views, System and Database.
    The System view contains information about the general status of the PureData System hardware and the PureData System Performance Server software. It displays system information and provides information about possible system problems like a hard disc failure. It also contains statistics like the user space usage.
    The Database view contains information about the user databases in the system. It displays information about all database objects like tables, views, sequences, synonyms, user-defined functions, procedures, etc. It also provides the user with the tools necessary to manage groups and access privileges. You can also view the currently active database sessions and their queries, and a recent history of all queries that have been run on the system. Finally, you can see the backup history for each database.
    The menu bar contains common actions like refresh or connect. In addition, it provides access to some administration tools like Workload Management, a tool for the identification of table skew, etc.
    3 The System Tab
    In this section we will use NzAdmin to inspect the hardware components that make up a PureData System appliance, including the SPUs and the data slices.
  • 54. 1. The default view is the main hardware view, which shows a dashboard representing the general health of the system. Unfortunately, the hardware information cannot be gathered for our VM, but we can see the disc usage at the bottom. Note that the most important measure is actually the maximum storage utilization: if one disc runs full, no new data can be added to the system.
    2. Unfortunately, the SPA and SPU sections are empty for our VM system. Normally we could see health information for all snippet processing arrays, snippet processing units, and their data slices and hard discs. The next available section is Data Slices. When you select it, you can see that our VM has 4 data slices (1-4) on four hard discs (1002-1005). Normally we would also see which disc contains the mirrors of these discs, but our VM system doesn’t mirror its data slices.
    3. Under the Data Slices section you can see the currently active event rules. Event rules monitor the system for noteworthy occurrences and act on them, for example by sending an email to an administrator or raising an alert in case of a hardware failure. Unlike on a real PureData System appliance, only a very small set of event rules is enabled in the VM. You could use the New Event Rule wizard to add new events or generate test events.
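    To illustrate why step 1 singles out the maximum storage utilization rather than the average, here is a small sketch. The per-data-slice percentages are made-up values, not readings from the lab VM.

```shell
# Hypothetical utilization per data slice: "<data slice id> <percent used>".
usage='1 42
2 45
3 91
4 44'

# Average and maximum utilization across the data slices.
summary=$(printf '%s\n' "$usage" | awk '
  { sum += $2; if ($2 > max) max = $2 }
  END { printf "avg=%.1f max=%d", sum / NR, max }')

echo "$summary"
```

    With these sample numbers the average looks harmless, while one data slice is already above 90 percent. Per the note in step 1, it is that fullest disc that decides when no more data can be added.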
  • 55. 4 The Database Tab
    In this section we will learn how NzAdmin can be used to view and manage user database objects, including tables, users, groups, and sessions.
    1. Switch to the Database tab. This is the area where database tables, users, groups, and sessions can be viewed and managed. You may not have some of the database objects displayed in the image; this shouldn’t change the lab in any way.
    2. In the Database tab, expand Databases and click on LABDB. NzAdmin can view all the objects of the following types: tables, views, sequences, synonyms, functions, aggregates, procedures, and libraries. You can also create objects of the following types: tables, views, sequences, and synonyms. Furthermore, many of these object types can be managed in some way using NzAdmin; for example, we have control over tables. Finally, we can see the currently running Groom processes in the Groom Status section.
  • 56. 3. Click on Tables in the tree or data view. For each table in the LABDB database you can view information such as the owner, creation date, number of columns, size of the table on disk, data skew, row count, and percentage organized (if enabled).
    4. If you right-click on a table, you can select ways in which to manage the table, including changing the name, owner, columns, organizing keys, or default values; generating or viewing statistics; viewing record distribution; and truncating or dropping the table.
  • 57. Unfortunately, one of the most important menu entries, “Record Distribution”, which gives you a graphical view of the table's data distribution, doesn’t work in our VMware environment.
    5. To look at information about the columns, distribution key, organizing keys, DDL, and users/groups with privileges for a table, double-click on the table entry to bring up the details:
    This view shows the columns of the table and their constraints. It also shows whether the columns are Zone Map enabled or not. Zone Maps are an important performance feature in PureData System and will be discussed later in this course. You can set access privileges to the table with the Privileges button. The DDL button returns the command to create the table and is a convenient feature for administrators.
  • 58. 6. Close the Table Properties view again and click on the Users field of the left navigation tree. Here you can create and manage users. Users can be created either from a context menu on the Users folder or from the Options menu at the top of the screen. To manage a user, use the context menu that is displayed when you right-click on that user. NzAdmin allows you to rename or drop users, change their privileges and workload management settings, etc.
    7. You can do the same management for groups in the Groups section of the Database tab.
    8. Click on Sessions in the Database tab. Here you can see who is currently logged into the PureData System and the commands they have issued. You can also abort sessions or change their priority in a workload management environment (this has to be set up before you can change the priority).
  • 59. 9. To see and manage active queries, you can expand Sessions in the Database tab and click on Active Queries. However, there are no queries running at this time.
    10. Comprehensive query history information can be seen by clicking on the Query History section in the Database tab. PureData System keeps a query history of the last 2000 queries. For a full audit log you would need to use the Query History database. Select the View Query Entry menu item from the context menu to get a more structured view of the values of a specific query:
    11. Another window is opened showing the fields of the query history table in a more structured way:
  • 60. For each recent query, PureData System saves a significant amount of information, which can help you identify queries that behave badly. Important values are the estimated and actual elapsed seconds, the number of result rows, and of course the actual SQL statement.
    12. It is also possible to get information about the actual execution plan of the query. We will discuss this in more detail in future modules. To see a graphical representation of an Explain output, right-click on a query and select Explain->Graph:
    13. You should see a similar window. The actual graph may differ depending on the query you pick:
  • 61. This graph shows how the PureData System appliance plans to execute the query in question. It is an important part of troubleshooting the occasional misbehaving query. We will discuss this in more detail in the Query Optimization module. You can also get a textual view by selecting Explain->Verbose.
    14. Close the graph and display the plan file using the “View Plan File” entry in the context menu. You should see a window similar to the following. Please scroll down to the bottom:
  • 62. Plan files look similar to Verbose Explain information, but there is a significant difference. Explain information tells you how the system plans to execute a query. Plan files add information on how the query was actually executed, including actual execution times, and are an invaluable help for debugging queries that failed or took longer than expected.
    15. Finally, select the Backup->History field in the navigator. PureData System logs all backup and restore sessions in the system. You will see an empty list, but if you return to this view after the Backup and Restore lab you will see the backup and restore processes you started. The backup history allows PureData System to provide easy incremental or cumulative backups and to synchronize backups with the groom process. We will discuss this further in a later section.
  • 63. 5 Tools
    In this section we will learn how to set workload management system settings with NzAdmin, as well as how to search for data skew and view disk usage by database or user.
    5.1 Workload Management
    1. From the menu bar at the top of NzAdmin click on Tools -> Workload Management -> System Settings. Using this tool we can limit the maximum number of rows allowed in a table and set query timeouts, session idle timeouts, and the default session priority.
    2. From the Workload Management menu option, click on Performance Summary.
    3. From the Summary pane, we can look at activities that happened in the last hour in an aggregate view. Keep this in mind for the Workload Management module.
  • 64. 5.2 Table Storage
    1. From the menu bar at the top of NzAdmin click on Tools -> Table Storage. This tool tells us the total size in MB of each database, or the total size of all the databases a user owns.
    5.3 Table Skew
    1. From the menu bar at the top of NzAdmin click on Tools -> Table Skew. This tool displays tables that meet or exceed a specified data skew threshold between data slices. Once an administrator has seen in the main overview that the maximum storage utilization differs significantly from the average storage utilization, they can use this tool to find the skewed tables. They can then fix them, for example by redistributing them into a new table with a CTAS (CREATE TABLE AS SELECT) statement. Skewed tables not only limit the available storage but also significantly lower performance.
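    A rough way to reason about skew outside NzAdmin is sketched below. The row counts are invented, and the max-to-average ratio is a simple illustrative measure, not the metric NzAdmin itself reports. The SQL shown in the comments uses placeholder names in angle brackets.

```shell
# Hypothetical rows per data slice, as might be returned by a query such as:
#   SELECT datasliceid, COUNT(*) FROM <table> GROUP BY datasliceid;
counts='1 250000
2 10000
3 12000
4 8000'

# Ratio of the fullest data slice to the average; 1.00 would be perfectly even.
ratio=$(printf '%s\n' "$counts" | awk '
  { sum += $2; if ($2 > max) max = $2 }
  END { printf "%.2f", max / (sum / NR) }')

echo "max/avg row ratio: $ratio"

# A skewed table could then be rebuilt on a better distribution key with CTAS:
#   CREATE TABLE <new_table> AS SELECT * FROM <table> DISTRIBUTE ON (<better_key>);
```

    A ratio far above 1, as in this sample, means one data slice holds most of the rows and will both fill up first and do most of the query work.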
  • 66. Loading and Unloading Data Hands-On Lab IBM PureData System for Analytics … Powered by Netezza Technology
  • 67. Table of Contents
    1 Introduction
    1.1 Objectives
    2 External Tables
    2.1 Unloading Data using External Tables
    2.2 Dropping External Tables
    2.3 Loading Data using External Tables
    3 Loading Data using the nzload Utility
    3.1 Using the nzload Utility with Command Line Options
    3.2 Using the nzload Utility with a Control File
    3.3 (Optional) Using nzload with Bad Records
  • 68. 1 Introduction
    In every data warehouse environment there is a need to load new data into the database. Loading data is not a one-time task but a continuous operation that can occur hourly, daily, weekly, or even monthly. It is a vital operation that needs to be supported by the data warehouse system. IBM PureData System provides a framework to support not only the loading of data into the PureData System database environment but also the unloading of data from it. This framework contains several components, including:
    • External Tables – These are tables stored as flat files on the host or client systems and registered like tables in the PureData System catalog. They can be used to load data into the PureData System appliance or unload data to the file system.
    • nzload – This is a command line tool that wraps external tables and provides an easy method of loading data into the PureData System appliance.
    • Format Options – These are options for formatting the data loaded to and from external tables.
    1.1 Objectives
    This lab will help you explore the IBM PureData System framework components for loading and unloading data from the database. You will use the various commands to create external tables to unload and load data. Then you will get a basic understanding of the nzload utility. In this lab the REGION and NATION tables in the LABDB database are used to illustrate the use of external tables and the nzload utility. After this lab you will have a good understanding of how to load and unload data from a PureData System database environment.
    • The first part of this lab will explore using External Tables to unload and load data.
    • The second part of this lab will discuss using the nzload utility to load records into tables.
    2 External Tables
    An external table allows PureData System to treat an external file as a database table. An external table has a definition (a table schema) in the PureData System catalog, but the actual data exists outside of the PureData System appliance database.
  • 69. This is referred to as a datasource file. External tables can be used to access files which are stored on the file system. After you have created the external table definition, you can use INSERT INTO statements to load data from the external file into a database table, or SELECT FROM statements to query the external table. Different methods of creating and using external tables through the nzsql interface are described. The external datasource files for the external tables are also examined, so a second session will be used to view these files.
    I. Connect to your PureData System image using PuTTY. Log in to 192.168.239.2 as user “nz” with password “nz”. (192.168.239.2 is the default IP address for a local VM; the IP may be different for your Bootcamp.)
    II. Change to the lab working directory /labs/movingData with the following command:
        [nz@netezza ~]$ cd /labs/movingData
    III. Connect to the LABDB database as the database owner, LABADMIN, using the nzsql interface:
        [nz@netezza ~]$ nzsql -d LABDB -u labadmin -pw password
    You should see the following results:
        Welcome to nzsql, the Netezza SQL interactive terminal.
        Type: \h for help with SQL commands
              \? for help on internal slash commands
              \g or terminate with semicolon to execute query
              \q to quit
        LABDB(LABADMIN)=>
    IV. In this lab we will need to alternately execute SQL commands and operating system commands. To make this easier, we will open a second PuTTY session for operating system commands such as running nzload or viewing the generated external files. It will be referred to as session 2 throughout the lab. The picture above shows the two PuTTY windows that you will need. Session 1 will be used for SQL commands and session 2 for operating system prompt commands.
    V. Open another session using PuTTY. Log in to 192.168.239.2 as user “nz” with password “nz”. (192.168.239.2 is the default IP address for a local VM; the IP may be different for your Bootcamp.) Also make sure that you change to the correct directory, /labs/movingData, using the cd command from step II.
  • 70. 2.1 Unloading Data using External Tables
    External tables will be used to unload rows from the LABDB database as records into an external datasource file. Various methods of creating and using external tables will be explored by unloading rows from either the REGION or the NATION table. Five basic use cases are presented for you to follow so that you can gain a better understanding of how to use external tables to unload data from a database.
    2.1.1 Unloading data with an External Table created with the SAMEAS clause
    The first external table will be used to unload data from the REGION table into an ASCII delimited text file. This external table will be named ET1_REGION and will use the same column definition as the REGION table. After the ET1_REGION external table is created, you will use it to unload all the rows from the REGION table. The records for the ET1_REGION external table will be in the external datasource file, et1_region_flat_file. The basic syntax to create this type of external table is:
        CREATE EXTERNAL TABLE table_name SAMEAS table_name USING external_table_options
    The SAMEAS clause allows the external table to be created with the same column definition as the referenced table. This is referred to as implicit schema definition.
    1. As the LABDB database owner, LABADMIN, you will create the first basic external table using the same column definitions as the REGION table:
        LABDB(LABADMIN)=> create external table et1_region sameas region using (dataobject ('/labs/movingData/et1_region_flat_file'));
    2. To list the external tables in the LABDB database, use the internal slash option \dx:
        LABDB(LABADMIN)=> \dx
    Which will list the external table you just created:
                  List of relations
            Name    |      Type      |  Owner
        ------------+----------------+----------
         ET1_REGION | EXTERNAL TABLE | LABADMIN
        (1 rows)
    3. You can also list the properties of the external table using the following internal slash option to describe the table, \d <external table name>:
        LABDB(LABADMIN)=> \d et1_region
    Which will list the properties of the ET1_REGION external table:
  • 71.
        External Table "ET1_REGION"
          Attribute  |          Type          | Modifier
        -------------+------------------------+----------
         R_REGIONKEY | INTEGER                |
         R_NAME      | CHARACTER(25)          |
         R_COMMENT   | CHARACTER VARYING(152) |

        DataObject - '/labs/movingData/et1_region_flat_file'
        adjustdistzeroint - f
        bool style - 1_0
        code set -
        compress - FALSE
        cr in string - f
        ctrl chars - f
        date delim - -
        date style - YMD
        delim - |
        encoding - INTERNAL
        escape -
        fill record - f
        format - TEXT
        ignore zero - f
        log dir - /tmp
        max errors - 1
        max rows - 0
        null value - NULL
        quoted value - NO
        remote source -
        require quotes - f
        skip rows - 0
        socket buf size - 8388608
        timedelim - :
        time round nanos - f
        time style - 24HOUR
        trunc string - f
        y2base - 0
        includezeroseconds - f
        record length -
        record delimiter -
        nullindicator bytes -
        layout -
        decimaldelim -
    This output includes the columns and associated data types in the external table. You will notice that this is similar to the REGION table, since the external table was created using the SAMEAS clause in the CREATE EXTERNAL TABLE command. The output also includes the properties of the external table. The most notable property is the DataObject property, which shows the location and the name of the external datasource file used for the external table. We will examine some of the other properties in this lab.
    4. Now that the external table is created, you can use it to unload data from the REGION table using an INSERT statement:
  • 72.
        LABDB(LABADMIN)=> insert into et1_region select * from region;
    5. You can use this external table like a regular table by issuing SQL statements. Try issuing a simple SELECT FROM statement against the ET1_REGION external table:
        LABDB(LABADMIN)=> select * from et1_region;
    Which will return all the rows in the ET1_REGION external table:
         R_REGIONKEY |          R_NAME           |          R_COMMENT
        -------------+---------------------------+-----------------------------
                   2 | sa                        | south america
                   1 | na                        | north america
                   4 | ap                        | asia pacific
                   3 | emea                      | europe, middle east, africa
        (4 rows)
    You will notice that this is the same data that is in the REGION table. However, the data retrieved for this SELECT statement came from the datasource file of this external table and not from the data within the database.
    6. The main reason for creating an external table is to unload data from a table to a file. Using the second PuTTY session, review the file that was created, et1_region_flat_file, in the /labs/movingData directory:
        [nz@netezza movingData]$ more et1_region_flat_file
    The file should look similar to the following:
        2|sa|south america
        1|na|north america
        4|ap|asia pacific
        3|emea|europe, middle east, africa
    This is an ASCII delimited flat file containing the data from the REGION table. The column delimiter used in this file is the default character, ‘|’.
    2.1.2 Unloading data with an External Table using the AS SELECT clause
    The second external table will also be used to unload data from the REGION table into an ASCII delimited text file, but using a different method. The external table will be created and the data will be unloaded in the same CREATE statement, so a separate step is not required to unload the data. The external table will be named ET2_REGION and the external datasource file will be named et2_region_flat_file. The basic syntax to create this type of external table is:
        CREATE EXTERNAL TABLE table_name 'filename' AS select_statement;
    The AS clause allows the external table to be created with the same columns that are returned by the SELECT FROM statement, which is referred to as implicit table schema definition. It also unloads the rows at the same time the external table is created.
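    Because the unloaded file from step 6 of section 2.1.1 is plain ‘|’-delimited text, it can be inspected with ordinary shell tools in session 2. The sketch below recreates the sample content shown above in a temporary directory (so it does not touch /labs/movingData) and then counts the records and lists the R_NAME values in key order. The temporary copy is only for illustration.

```shell
# Recreate the sample flat file in a scratch directory (illustrative copy only).
tmpdir=$(mktemp -d)
cat > "$tmpdir/et1_region_flat_file" <<'EOF'
2|sa|south america
1|na|north america
4|ap|asia pacific
3|emea|europe, middle east, africa
EOF

# Number of unloaded records (one per line).
rows=$(wc -l < "$tmpdir/et1_region_flat_file" | tr -d ' ')

# Sort numerically on the first field (R_REGIONKEY) and print the second (R_NAME).
names=$(sort -t'|' -k1,1n "$tmpdir/et1_region_flat_file" | awk -F'|' '{ print $2 }' | paste -sd' ' -)

echo "rows=$rows names=$names"
rm -rf "$tmpdir"
```

    On the lab VM the same sort and awk commands can be pointed directly at /labs/movingData/et1_region_flat_file.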
  • 73. 1. The first method used to create an external table required the data to be unloaded in a second step using an INSERT statement. Now you will create an external table and unload the data in a single step:
        LABDB(LABADMIN)=> create external table et2_region '/labs/movingData/et2_region_flat_file' as select * from region;
    This command created the external table ET2_REGION using the same definition as the REGION table and also unloaded the data to the et2_region_flat_file.
    2. List the external tables in the LABDB database:
        LABDB(LABADMIN)=> \dx
    Which will list all the external tables in the LABDB database:
                  List of relations
            Name    |      Type      |  Owner
        ------------+----------------+----------
         ET1_REGION | EXTERNAL TABLE | LABADMIN
         ET2_REGION | EXTERNAL TABLE | LABADMIN
        (2 rows)
    You will notice that there are now two external tables. You can also list the properties of the external table, but the output will be similar to the output in the last section, except for the filename.
    3. Using the second session, review the file that was created, et2_region_flat_file, in the /labs/movingData directory:
        [nz@netezza movingData]$ more et2_region_flat_file
    The file should look similar to the following:
        2|sa|south america
        1|na|north america
        4|ap|asia pacific
        3|emea|europe, middle east, africa
    This file is exactly the same as the file you reviewed in the last section. The difference this time is that we didn’t need to unload it explicitly.
    2.1.3 Unloading data with an external table using defined columns
    The first two external tables that you created used exactly the same columns as the REGION table, through an implicit table schema. You can also create an external table by explicitly specifying the columns. This is referred to as explicit table schema. The third external table that you create will still be used to unload data from the REGION table, but only from the R_NAME and R_COMMENT columns. The ET3_REGION external table will be created in one step, and then the data will be unloaded into the et3_region_flat_file ASCII delimited text file using a different delimiter string. The basic syntax to create this type of external table is:
        CREATE EXTERNAL TABLE table_name ({column_name type} [, ... ]) [USING external_table_options]
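    The claim in step 3 of section 2.1.2, that the second unload file matches the first, can be checked rather than eyeballed. The sketch below uses two scratch files with the sample content. The second file's row order is shuffled deliberately, on the assumption that a SELECT without ORDER BY gives no guarantee about the order in which rows are unloaded, so both files are sorted before comparing.

```shell
# Scratch copies of the two unload files (sample content; second file reordered on purpose).
tmpdir=$(mktemp -d)
printf '%s\n' '2|sa|south america' '1|na|north america' \
  '4|ap|asia pacific' '3|emea|europe, middle east, africa' > "$tmpdir/et1_region_flat_file"
printf '%s\n' '1|na|north america' '3|emea|europe, middle east, africa' \
  '2|sa|south america' '4|ap|asia pacific' > "$tmpdir/et2_region_flat_file"

# Sort both files so that row order does not matter, then compare byte for byte.
sort "$tmpdir/et1_region_flat_file" > "$tmpdir/et1.sorted"
sort "$tmpdir/et2_region_flat_file" > "$tmpdir/et2.sorted"
if cmp -s "$tmpdir/et1.sorted" "$tmpdir/et2.sorted"; then
  result=identical
else
  result=different
fi

echo "unload files are $result"
rm -rf "$tmpdir"
```

    On the lab VM the two sort commands would read the real files in /labs/movingData instead of the scratch copies.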
  • 74. 1. You will create a new external table that includes only the R_NAME and R_COMMENT columns and excludes the R_REGIONKEY column from the REGION table. Along with this, you will change the delimiter string from the default ‘|’ to ‘=’:
        LABDB(LABADMIN)=> create external table et3_region (r_name char(25), r_comment varchar(152)) USING (dataobject ('/labs/movingData/et3_region_flat_file') DELIMITER '=');
    2. List the properties of the ET3_REGION external table:
        LABDB(LABADMIN)=> \d et3_region
    Which will list the properties of the ET3_REGION external table:
  • 75. IBM Software Information Management IBM PureData System for Analytics © Copyright IBM Corp. 2012. All rights reserved Page 10 of 32 You will notice that there are only two columns for this external table since you only specified two columns when creating the external table. The rest of the output is very similar to the properties of the other two external tables that you created, with two main exceptions. The first difference is obviously the Dataobjects field, since the filename is different. The other difference is the string used for the delimiter, since it is now ‘=’ instead of the default, ‘|’. 3. Now you will unload the data from the REGION table but only the data from columns R_NAME and R_COMMENT: (Alternatively, you could create the external table and unload the data in one step using the following command: LABDB(LABADMIN)=> insert into et3_region select r_name, r_comment from region; External Table "ET3_REGION" Attribute | Type | Modifier -------------+------------------------+---------- R_NAME | CHARACTER(25) | R_COMMENT | CHARACTER VARYING(152) | DataObject - '/labs/movingData/et3_region_flat_file' adjustdistzeroint - f bool style - 1_0 code set - compress - FALSE cr in string - f ctrl chars - f date delim - - date style - YMD delim - = encoding - INTERNAL escape - fill record - f format - TEXT ignore zero - f log dir - /tmp max errors - 1 max rows - 0 null value - NULL quoted value - NO remote source - require quotes - f skip rows - 0 socket buf size - 8388608 timedelim - : time round nanos - f time style - 24HOUR trunc string - f y2base - 0 includezeroseconds - f record length - record delimiter - nullindicator bytes - layout - decimaldelim -
  • 76.
        create external table et4_test '/labs/movingData/et4_region_flat_file' using (delimiter '=') as select r_name, r_comment from region;

    4. Using the second session, review the file that was created, et3_region_flat_file, in the /labs/movingData directory:

        [nz@netezza movingData]$ more et3_region_flat_file

    The file should look similar to the following:

        sa=south america
        na=north america
        ap=asia pacific
        emea=europe, middle east, africa

    You will notice that only two columns are present in the flat file, with the '=' string as the delimiter.

    2.1.4 (Optional) Unloading data with an External Table from two tables

    The first three external tables unloaded data from one table. The next external table you create will be based on a join between the REGION and NATION tables. The two tables will be joined on the REGIONKEY, and only the N_NAME and R_NAME columns will be defined for the external table. This exercise illustrates how data can be unloaded using SQL statements other than a simple SELECT FROM statement. The external table will be named ET_NATION_REGION and will use another ASCII delimited text file named et_nation_region_flat_file.

    1. For the next external table you will unload data from both the REGION and NATION tables, joined on the REGIONKEY column, to list all of the countries and their associated regions. Instead of specifying the columns in the create external table statement you will use the AS SELECT option:

        LABDB(LABADMIN)=> create external table et_nation_region '/labs/movingData/et_nation_region_flat_file' as select n_name, r_name from nation, region where n_regionkey=r_regionkey;

    2. List the properties of the ET_NATION_REGION external table:

        LABDB(LABADMIN)=> \d et_nation_region

    Which will show the properties of the ET_NATION_REGION table:
  • 77.
        External Table "ET_NATION_REGION"
         Attribute |     Type      | Modifier
        -----------+---------------+----------
         N_NAME    | CHARACTER(25) | NOT NULL
         R_NAME    | CHARACTER(25) |

        DataObject - '/labs/movingData/et_nation_region_flat_file'
        adjustdistzeroint - f
        bool style - 1_0
        code set -
        compress - FALSE
        cr in string - f
        ctrl chars - f
        date delim - -
        date style - YMD
        delim - |
        encoding - INTERNAL
        escape -
        fill record - f
        format - TEXT
        ignore zero - f
        log dir - /tmp
        max errors - 1
        max rows - 0
        null value - NULL
        quoted value - NO
        remote source -
        require quotes - f
        skip rows - 0
        socket buf size - 8388608
        timedelim - :
        time round nanos - f
        time style - 24HOUR
        trunc string - f
        y2base - 0
        includezeroseconds - f
        record length -
        record delimiter -
        nullindicator bytes -
        layout -
        decimaldelim -

    You will notice that the external table was created using the two columns specified in the SELECT clause: N_NAME and R_NAME.

    3. View the data of the ET_NATION_REGION external table:

        LABDB(LABADMIN)=> select * from et_nation_region;

    Which will show all the rows from the ET_NATION_REGION table:
  • 78.
                  N_NAME           |          R_NAME
        ---------------------------+---------------------------
         brazil                    | sa
         guyana                    | sa
         venezuela                 | sa
         portugal                  | emea
         australia                 | ap
         united kingdom            | emea
         united arab emirates      | emea
         south africa              | emea
         hong kong                 | ap
         new zealand               | ap
         japan                     | ap
         macau                     | ap
         canada                    | na
         united states             | na
        (14 rows)

    This is the result of joining the NATION and REGION tables on the REGIONKEY column to return just the N_NAME and R_NAME columns.

    4. Now, using the second session, review the file that was created, et_nation_region_flat_file, in the /labs/movingData directory:

        [nz@netezza movingData]$ more et_nation_region_flat_file

    Which should look similar to the following:

        brazil|sa
        guyana|sa
        venezuela|sa
        portugal|emea
        australia|ap
        united kingdom|emea
        united arab emirates|emea
        south africa|emea
        hong kong|ap
        new zealand|ap
        japan|ap
        macau|ap
        canada|na
        united states|na

    You can see that we created a delimited flat file from a more complex SQL statement. External tables are a very flexible and powerful way to load, unload and transfer data.

    2.1.5 (Optional) Unloading data with an External Table using the compress format

    The previous external tables that you created used the default ASCII delimited text format. This last external table will be similar to the second external table that you created, but instead of using an ASCII delimited text format you will use the compressed binary format. The name of the external table will be ET4_REGION and the datasource file name will be et4_region_compress. The basic syntax to create this type of external table is:
  • 79.
        CREATE EXTERNAL TABLE table_name 'filename' USING (COMPRESS true FORMAT 'internal') AS select_statement;

    The external table options COMPRESS and FORMAT must be specified to use the compressed binary format.

    1. You will now create one last external table using a method similar to the one you used to create the second external table, in section 2.1.2. But instead of using an ASCII delimited text format the datasource will be compressed. This is achieved by using the COMPRESS and FORMAT external table options:

        LABDB(LABADMIN)=> create external table et4_region '/labs/movingData/et4_region_compress' using (compress true format 'internal') as select * from region;

    As a reminder, the external table is created and the data is unloaded in the same operation using the AS SELECT clause.

    2. List the properties of the ET4_REGION external table:

        LABDB(LABADMIN)=> \d et4_region

    Which will list the properties of the ET4_REGION table:
  • 80.
        External Table "ET4_REGION"
          Attribute  |          Type          | Modifier
        -------------+------------------------+----------
         R_REGIONKEY | INTEGER                |
         R_NAME      | CHARACTER(25)          |
         R_COMMENT   | CHARACTER VARYING(152) |

        DataObject - '/labs/movingData/et4_region_compress'
        adjustdistzeroint -
        bool style -
        code set -
        compress - TRUE
        cr in string -
        ctrl chars -
        date delim -
        date style -
        delim -
        encoding -
        escape -
        fill record -
        format - INTERNAL
        ignore zero -
        log dir -
        max errors -
        max rows -
        null value -
        quoted value -
        remote source -
        require quotes -
        skip rows -
        socket buf size - 8388608
        timedelim -
        time round nanos -
        time style -
        trunc string -
        y2base -
        includezeroseconds -
        record length -
        record delimiter -
        nullindicator bytes -
        layout -
        decimaldelim -

    You will notice that the COMPRESS option has changed from FALSE to TRUE, indicating that the datasource file is compressed. The FORMAT has changed from TEXT to INTERNAL, which is required for compressed files.

    2.2 Dropping External Tables

    Dropping an external table is similar to dropping a regular PureData System table. The column definition for the external table is removed from the PureData System catalog. Keep in mind that dropping the table doesn't delete the external datasource file so
  • 81. this also has to be maintained. So the external datasource file can still be used for loading data into a different table. In this chapter you will drop the ET1_REGION table, but you will not delete the associated external datasource file, et1_region_flat_file. This datasource file will be used later in this lab to load data into the REGION table.

    1. Drop the first external table that you created, ET1_REGION, using the DROP TABLE command:

        LABDB(LABADMIN)=> drop table et1_region;

    The same drop command is used for regular tables and external tables, so there is no separate DROP EXTERNAL TABLE command.

    2. Verify that the external table has been dropped using the internal slash option, \dx:

        LABDB(LABADMIN)=> \dx

    Which will list all the external tables in the LABDB database:

                     List of relations
               Name       |      Type      |  Owner
        ------------------+----------------+----------
         ET2_REGION       | EXTERNAL TABLE | LABADMIN
         ET3_REGION       | EXTERNAL TABLE | LABADMIN
         ET4_REGION       | EXTERNAL TABLE | LABADMIN
         ET_NATION_REGION | EXTERNAL TABLE | LABADMIN
        (4 rows)

    In this list the four remaining external tables that you created still exist.

    3. Even though the external table definition no longer exists within the LABDB database, the flat file named et1_region_flat_file still exists in the /labs/movingData directory. Verify this by using the second PuTTY session:

        [nz@netezza movingData]$ ls

    Which will list all of the files in the /labs/movingData directory:

        et1_region_flat_file et2_region_flat_file et4_region_compress et3_region_flat_file et_nation_region

    You can see that the file et1_region_flat_file still exists. This file can still be used to load data into another similar table.

    2.3 Loading Data using External Tables

    External tables can also be used to load data into tables in the database. In this chapter data will be loaded into the REGION table, so you will first have to remove the existing rows from the REGION table. The method to load data from external tables into a table is similar to using the DML INSERT INTO and SELECT FROM statements. You will use two different methods to load data into the REGION table, one using an external table and the other using the external datasource file directly.

    Loading data into a table from any external table produces an associated log file with a default name of <table_name>.<database_name>.nzlog

    1. Before loading the data into the REGION table, delete the rows from the table using the TRUNCATE TABLE command:
  • 82.
        LABDB(LABADMIN)=> truncate table region;

    2. Check that the table is empty with the SELECT * command:

        LABDB(LABADMIN)=> select * from region;

    You should see that the table contains no data.

    3. You will load data into the REGION table from the ET2_REGION external table using an INSERT statement:

        LABDB(LABADMIN)=> insert into region select * from et2_region;

    4. Check to ensure that the table contains the four rows using the SELECT * statement:

        LABDB(LABADMIN)=> select * from region;

    You should see that the table now contains four rows.

    5. Again delete the rows in the REGION table:

        LABDB(LABADMIN)=> truncate table region;

    6. Check to ensure that the table is empty using the SELECT * statement:

        LABDB(LABADMIN)=> select * from region;

    You should see that the table contains no rows.

    7. You will load data into the REGION table using the ASCII delimited file that was created for external table ET1_REGION. Remember that the definition of the external table was removed from the database, but the external datasource file, et1_region_flat_file, still exists:

        LABDB(LABADMIN)=> insert into region select * from external '/labs/movingData/et1_region_flat_file';

    8. Check to ensure that the table contains the four rows using the SELECT * statement:

        LABDB(LABADMIN)=> select * from region;

    You should see that the table now contains four rows.

    9. Since this is a load operation there is always an associated log file, <table>.<database>.nzlog, created for each load performed. By default this log file is created in the /tmp directory. In the second session review this file:

        [nz@netezza movingData]$ more /tmp/REGION.LABDB.nzlog
  • 83. The log file should look similar to the following:

        Load started at:01-Jan-11 12:34:56 EDT
          Database:  LABDB
          Tablename: REGION
          Datafile:  /labs/movingData/et1_region_flat_file
          Host:      netezza

        Load Options
          Field delimiter:       '|'          NULL value:              NULL
          File Buffer Size (MB): 8            Load Replay REGION (MB): 0
          Encoding:              INTERNAL     Max errors:              1
          Skip records:          0            Max rows:                0
          FillRecord:            No           Truncate String:         No
          Escape Char:           None         Accept Control Chars:    No
          Allow CR in string:    No           Ignore Zero:             No
          Quoted data:           NO           Require Quotes:          No
          BoolStyle:             1_0          Decimal Delimiter:       '.'
          Date Style:            YMD          Date Delim:              '-'
          Time Style:            24HOUR       Time Delim:              ':'
          Time extra zeros:      No

        Statistics
          number of records read:      4
          number of bad records:       0
          -------------------------------------------------
          number of records loaded:    4

          Elapsed Time (sec): 3.0

        -----------------------------------------------------------------------------
        Load completed at: 01-Jan-11 12:34:59 EDT
        =============================================================================

    You will notice that the log file contains the Load Options and the Statistics of the load, along with environment information to identify the table.

    3 Loading Data using the nzload Utility

    The nzload command is a SQL CLI client application that allows you to load data from the local host or a remote client, on all the supported client platforms. The nzload command processes command-line load options to send queries to the host to create an external table definition, run the insert/select query to load data, and, when the load completes, drop the external table. The nzload command is a command-line program that accepts options from multiple sources, including the:

    • Command line
  • 84. • Control file
    • NZ Environment Variables

    Without a control file, you can only do one load at a time. Using a control file allows multiple loads. The nzload command connects to a database with a user name and password, just like any other PureData System appliance client application. The user name specifies an account with a particular set of privileges, and the system uses this account to verify access.

    For this section of the lab you will continue to use the LABADMIN user to load data into the LABDB database. The nzload utility will be used to load records from an external datasource file into the REGION table. Along with this, the nzload log files will be reviewed to examine the nzload options. Since you will be loading data into a populated REGION table, you will use the TRUNCATE TABLE command to remove the rows from the table. We will continue to use the two PuTTY sessions from the external table lab.

    • Session One, which is connected to the NZSQL console to execute SQL commands, for example to review tables after load operations
    • Session Two, which will be used for operating system commands, to execute nzload commands, view data files, …

    3.1 Using the nzload Utility with Command Line Options

    The first method for using the nzload utility to load data in the REGION table will specify options at the command line. You will only need to specify the datasource file. We will use default options for the rest. The datasource file will be the et1_region_flat_file that you created in the External Tables section. The basic syntax for this type of command is:

        nzload -db <database> -u <username> -pw <password> -df <datasource filename>

    1. As the LABDB database owner, LABADMIN, first remove the rows in the REGION table:

        LABDB(LABADMIN)=> truncate table region;

    2. Check to ensure that the rows have been removed from the table using the SELECT * statement:

        LABDB(LABADMIN)=> select * from region;

    The REGION table should return no rows.

    3. Using the second session at the OS command line you will use the nzload utility to load data from the et1_region_flat_file into the REGION table using the following command line options: -db <database name>, -u <user>, -pw <password>, -t <table name>, -df <data file>, and -delimiter <string>:

        [nz@netezza movingData]$ nzload -db labdb -u labadmin -pw password -t region -df et1_region_flat_file -delimiter '|'

    Note: The file name is et1_region_flat_file, with the digit 1. Earlier versions of the screenshot for this step showed it as et"L"_region_flat_file, which is an inconsistency.

    Which will return the following status message:

        Load session of table 'REGION' completed successfully
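When the same kind of load has to be scripted for several tables, the command line from step 3 can be assembled programmatically. The following is a minimal Python sketch and not part of the lab: `build_nzload_args` is a hypothetical helper name, and it uses only the options shown in this section (-db, -u, -pw, -t, -df, -delimiter). It only builds and prints the command; running it, for example with `subprocess.run`, assumes that nzload is installed and on the PATH.

```python
import shlex


def build_nzload_args(database, user, password, table, datafile, delimiter="|"):
    """Return the nzload argument list for a simple delimited-text load.

    Hypothetical helper for illustration; it only uses options named in this lab.
    """
    return [
        "nzload",
        "-db", database,
        "-u", user,
        "-pw", password,
        "-t", table,
        "-df", datafile,
        "-delimiter", delimiter,
    ]


args = build_nzload_args("labdb", "labadmin", "password", "region", "et1_region_flat_file")

# shlex.join quotes the '|' so the printed line can be pasted into a shell safely.
print(shlex.join(args))
```

Keeping the arguments as a list rather than one string avoids quoting problems with delimiters such as '|' when the command is later passed to a process launcher.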
  • 85. 4. Check to ensure that the rows have been loaded into the table using the SELECT * statement:

        LABDB(LABADMIN)=> select * from region;

    Which will return all of the rows in the REGION table:

         R_REGIONKEY |          R_NAME           |          R_COMMENT
        -------------+---------------------------+-----------------------------
                   1 | na                        | north america
                   4 | ap                        | asia pacific
                   2 | sa                        | south america
                   3 | emea                      | europe, middle east, africa
        (4 rows)

    These rows were loaded from the records in the et1_region_flat_file file.

    5. For every load task performed there is always an associated log file, <table>.<db>.nzlog, created. By default this log file is created in the current working directory, which is the /labs/movingData directory. In the second session review this file:

        [nz@netezza movingData]$ more REGION.LABDB.nzlog
  • 86.
        Load started at:01-Jan-11 12:34:56 EDT
          Database:  LABDB
          Tablename: REGION
          Datafile:  /labs/movingData/et1_region_flat_file
          Host:      netezza

        Load Options
          Field delimiter:       '|'          NULL value:              NULL
          File Buffer Size (MB): 8            Load Replay REGION (MB): 0
          Encoding:              INTERNAL     Max errors:              1
          Skip records:          0            Max rows:                0
          FillRecord:            No           Truncate String:         No
          Escape Char:           None         Accept Control Chars:    No
          Allow CR in string:    No           Ignore Zero:             No
          Quoted data:           NO           Require Quotes:          No
          BoolStyle:             1_0          Decimal Delimiter:       '.'
          Date Style:            YMD          Date Delim:              '-'
          Time Style:            24HOUR       Time Delim:              ':'
          Time extra zeros:      No

        Statistics
          number of records read:      4
          number of bad records:       0
          -------------------------------------------------
          number of records loaded:    4

          Elapsed Time (sec): 3.0

        -----------------------------------------------------------------------------
        Load completed at: 01-Jan-11 12:34:59 EDT
        =============================================================================

    You will notice that the log file contains the Load Options and the Statistics of the load, along with environment information to identify the database and table.

    The -db, -u, and -pw options specify the database name, the user, and the password, respectively. You could omit these options if the NZ environment variables were set to the appropriate database, username and password values. In this lab the NZ environment variables NZ_DATABASE, NZ_USER, and NZ_PASSWORD are set to system, admin, and password, so you need to specify these options to run the load against the LABDB database as the LABADMIN user.

    The other options:

        -t          specifies the target table name in the database
        -df         specifies the datasource file to be loaded
        -delimiter  specifies the string to use as the delimiter in an ASCII delimited text file

    There are other options that you can use with the nzload utility. These options were not specified here since the default values were sufficient for this load task.
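Reviewing the Statistics section by eye works for one load, but the three counters are also easy to extract in a script. The sketch below is a hedged illustration rather than anything shipped with nzload: `parse_load_statistics` is a hypothetical helper, and the sample text is simply the Statistics section of the log shown in this section. It assumes the counter lines always have the form `number of <name>: <integer>`.

```python
import re

# Statistics section copied from the sample log in this section.
SAMPLE_LOG = """\
Statistics
  number of records read:      4
  number of bad records:       0
  -------------------------------------------------
  number of records loaded:    4

  Elapsed Time (sec): 3.0
"""

# Matches the three counter lines and captures the counter name and its value.
STAT_PATTERN = re.compile(r"number of (records read|bad records|records loaded):\s*(\d+)")


def parse_load_statistics(log_text):
    """Extract the record counters from the Statistics section of an nzlog file."""
    return {name: int(value) for name, value in STAT_PATTERN.findall(log_text)}


stats = parse_load_statistics(SAMPLE_LOG)
print(stats)

# A load is clean when nothing was rejected and every record read was loaded.
clean = stats["bad records"] == 0 and stats["records read"] == stats["records loaded"]
print("clean load:", clean)
```

In practice the text would be read from the log file itself, for example REGION.LABDB.nzlog in the current working directory.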
  • 87. The following command is equivalent to the nzload command we used above. It is shown only to demonstrate some of the options that can be used with the nzload command and that can be omitted when the default values are used:

        nzload -db labdb -u labadmin -pw password -t region -df et1_region_flat_file -delimiter '|' -outputDir '<current directory>' -lf <table>.<database>.nzlog -bf <table>.<database>.nzbad -compress false -format text -maxErrors 1

    The -lf, -bf, and -maxErrors options are explained in the next exercise. The -compress false and -format text options indicate that the datasource file is an ASCII delimited text file. For a compressed binary datasource file the following options would be used: -compress true -format internal.

    3.2 Using the nzload Utility with a Control File

    As demonstrated in section 3.1 you can run the nzload command by specifying the options on the command line. Alternatively, you can specify the options in a file, which is referred to as a control file. This is useful because loading data into a data warehouse is a continuous operation, and the file can be updated and modified over time. An nzload control file has the following basic structure:

        DATAFILE <filename>
        {
        [<option name> <option value>]
        }

    And the -cf option is used at the nzload command line to use a control file:

        nzload -u <username> -pw <password> -cf <control file>

    The -u and -pw options are optional if the NZ_USER and NZ_PASSWORD environment variables are set to the appropriate user and password. Using the -u and -pw options overrides the values in the NZ environment variables.

    In this session you will again load rows into an empty REGION table using the nzload utility with a control file. The control file will set the following options: delimiter, logDir, logFile, and badFile, along with the database and tablename. The datasource file to be used in this session is the region.del file.

    1. As the LABDB database owner, LABADMIN, first remove the rows in the REGION table:

        LABDB(LABADMIN)=> truncate table region;

    Check to ensure that the rows have been removed from the table using the SELECT * statement. The table should contain no rows.

    2. Using the second session at the OS command line you will create the control file to be used with the nzload utility to load data into the REGION table using the region.del data file. The control file will include the following options:

        Parameter    Value
        Database     Database name
  • 88.
        Tablename    Table name
        Delimiter    Delimiter string
        LogDir       Log directory
        LogFile      Log file name
        BadFile      Bad record log file name

    The data file will be the region.del file instead of the et1_region_flat_file that you used in section 3.1. We already created the control file in the lab directory. Review it in the second PuTTY session with the following command:

        [nz@netezza movingData]$ more control_file

    The control file looks like the following:

        DATAFILE /labs/movingData/region.del
        {
        Database labdb
        Tablename region
        Delimiter '|'
        LogDir /labs/movingData
        LogFile region.log
        BadFile region.bad
        }

    3. Still in the second session, load the data using the nzload utility with the control file, using the following command line options: -u <user>, -pw <password>, -cf <control file>

        [nz@netezza movingData]$ nzload -u labadmin -pw password -cf control_file

    Which will return the following status message:

        Load session of table 'REGION' completed successfully

    4. Check the nzload log, which was renamed from the default to region.log and is located in the /labs/movingData directory:

        [nz@netezza movingData]$ more region.log

    You should see a successful load.

    5. Check to ensure that the rows are in the REGION table in the first PuTTY session with the nzsql console:

        LABDB(LABADMIN)=> select * from region;

    You should see the added rows.
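Because a control file is plain text with a fixed layout (a DATAFILE line followed by option/value pairs in braces), it can be generated rather than edited by hand when loads are repeated. The following Python sketch is an illustration only: `render_control_file` is a hypothetical helper, and it reproduces just the layout and the six options used in this section. It makes no claim about other options that nzload may accept.

```python
def render_control_file(datafile, options):
    """Render one DATAFILE block in the control-file layout shown in this section.

    `options` maps option names to values exactly as they should appear in the
    file, so a delimiter must already carry its quotes, for example "'|'".
    """
    lines = [f"DATAFILE {datafile}", "{"]
    for name, value in options.items():
        lines.append(f"{name} {value}")
    lines.append("}")
    return "\n".join(lines) + "\n"


control_text = render_control_file(
    "/labs/movingData/region.del",
    {
        "Database": "labdb",
        "Tablename": "region",
        "Delimiter": "'|'",
        "LogDir": "/labs/movingData",
        "LogFile": "region.log",
        "BadFile": "region.bad",
    },
)
print(control_text)
```

The printed text matches the control_file reviewed in step 2. Writing it to a file and passing that file to nzload with -cf would be the next step, which this sketch deliberately leaves out.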
  • 89. 3.3 (Optional) Using nzload with Bad Records

    The first two methods illustrated how to use the nzload utility to load data into an empty table using command line options or a control file. In a data warehousing environment you will usually add data incrementally to a table that already contains rows. There will be instances where records from a datasource do not match the datatypes in the table. When this occurs the load aborts as soon as the first bad record is encountered. This is the default behaviour and is controlled by the maxErrors option, which has a default value of 1.

    For this exercise you will add additional rows to the NATION table. Since you will be adding rows to the NATION table there is no need to truncate the table. The datasource file you will be using is the nation.del file, which unfortunately has a bad record.

    1. First check the NATION table by listing all of the rows in the table using the SELECT * statement in the first PuTTY session:

        LABDB(LABADMIN)=> select * from nation;

    Which will list all the rows in the NATION table:

         N_NATIONKEY |          N_NAME           | N_REGIONKEY |            N_COMMENT
        -------------+---------------------------+-------------+----------------------------------
                   1 | canada                    |           1 | canada
                   2 | united states             |           1 | united states of america
                  10 | australia                 |           4 | australia
                   5 | venezuela                 |           2 | venezuela
                   8 | united arab emirates      |           3 | al imarat al arabiyah multahidah
                   9 | south africa              |           3 | south africa
                   3 | brazil                    |           2 | brasil
                  11 | japan                     |           4 | nippon
                  12 | macau                     |           4 | aomen
                  14 | new zealand               |           4 | new zealand
                   4 | guyana                    |           2 | guyana
                   6 | united kingdom            |           3 | united kingdom
                   7 | portugal                  |           3 | portugal
                  13 | hong kong                 |           4 | xianggang
        (14 rows)

    2. Using the second session at the OS command line you will use the nzload utility to load data from the nation.del file into the NATION table using the following command line options: -db <database name>, -u <user>, -pw <password>, -t <table name>, -df <data file>, and -delimiter <string>

        [nz@netezza movingData]$ nzload -db LABDB -u labadmin -pw password -t nation -df nation.del -delimiter '|'

    Which will return the following status message:

        Error: External Table : count of bad input rows reached maxerrors limit
        See /labs/movingData/NATION.LABDB.nzlog file
        Error: Load Failed, records not inserted.

    This is an indication that the load has failed due to a bad record in the datasource file.
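The effect of the maxErrors threshold can be pictured with a small simulation. This Python sketch is not how nzload is implemented: `simulate_load` is a hypothetical function, the three sample records are made up, and a record counts as bad here only when its first field is not an integer. It assumes, based on the status message above, that the load aborts and inserts nothing as soon as the number of bad records reaches the threshold.

```python
def simulate_load(records, max_errors=1, delimiter="|"):
    """Mimic the all-or-nothing behaviour described above.

    Returns (loaded_rows, bad_records, aborted). A record is treated as bad
    when its first field is not an integer, a stand-in for real type checks.
    """
    good, bad = [], []
    for record in records:
        fields = record.split(delimiter)
        try:
            int(fields[0])
        except ValueError:
            bad.append(record)
            if len(bad) >= max_errors:
                # Threshold reached: abort and discard everything read so far.
                return [], bad, True
            continue
        good.append(fields)
    return good, bad, False


# Made-up sample data: the second record has a non-numeric key.
sample = ["1|alpha", "x2|beta", "3|gamma"]

print(simulate_load(sample))                 # default threshold of 1
print(simulate_load(sample, max_errors=10))  # higher threshold
```

With the default threshold the single bad record aborts the whole load, while a higher threshold lets the two good records through and sets the bad one aside.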
  • 90. 3. Since the load has failed no rows were loaded into the NATION table, which you can confirm by using the SELECT * statement (in the first session):

        LABDB(LABADMIN)=> select * from nation;

    Which will return the rows in the NATION table:

         N_NATIONKEY |          N_NAME           | N_REGIONKEY |            N_COMMENT
        -------------+---------------------------+-------------+----------------------------------
                   1 | canada                    |           1 | canada
                   2 | united states             |           1 | united states of america
                  10 | australia                 |           4 | australia
                   5 | venezuela                 |           2 | venezuela
                   8 | united arab emirates      |           3 | al imarat al arabiyah multahidah
                   9 | south africa              |           3 | south africa
                   3 | brazil                    |           2 | brasil
                  11 | japan                     |           4 | nippon
                  12 | macau                     |           4 | aomen
                  14 | new zealand               |           4 | new zealand
                   4 | guyana                    |           2 | guyana
                   6 | united kingdom            |           3 | united kingdom
                   7 | portugal                  |           3 | portugal
                  13 | hong kong                 |           4 | xianggang
        (14 rows)

    4. In the second session you can check the log file, NATION.LABDB.nzlog, to determine the problem:

        [nz@netezza movingData]$ more NATION.LABDB.nzlog
  • 91.
        Load started at:01-Jan-11 12:34:56 EDT
          Database:  LABDB
          Tablename: NATION
          Datafile:  /home/nz/movingData/nation.del
          Host:      netezza

        Load Options
          Field delimiter:       '|'          NULL value:              NULL
          File Buffer Size (MB): 8            Load Replay REGION (MB): 0
          Encoding:              INTERNAL     Max errors:              1
          Skip records:          0            Max rows:                0
          FillRecord:            No           Truncate String:         No
          Escape Char:           None         Accept Control Chars:    No
          Allow CR in string:    No           Ignore Zero:             No
          Quoted data:           NO           Require Quotes:          No
          BoolStyle:             1_0          Decimal Delimiter:       '.'
          Date Style:            YMD          Date Delim:              '-'
          Time Style:            24HOUR       Time Delim:              ':'
          Time extra zeros:      No

        Found bad records

        bad #: input row #(byte offset to last char examined) [field #, declaration] diagnostic, "text consumed"[last char examined]
        ----------------------------------------------------------------------------------------------------------------------------
        1: 10(1) [1, INT4] expected field delimiter or end of record, "2"[t]

        Statistics
          number of records read:      10
          number of bad records:       1
          -------------------------------------------------
          number of records loaded:    0

          Elapsed Time (sec): 1.0

        -----------------------------------------------------------------------------
        Load completed at: 01-Jan-11 12:34:57 EDT
        =============================================================================

    The Statistics section indicates that the load had read 10 records when it aborted on the bad record. As expected no rows were inserted into the table, since the default is to abort the load when one bad record is encountered. The log file also provides information about the bad record:

        10(1) [1, INT4] expected field delimiter or end of record, "2"[t]

    10(1) indicates the input record number (10) within the file and the byte offset (1) within the row where the problem was encountered.
    [1, INT4] indicates the field number (1) within the row and the declared data type (INT4) of the column.
    "2"[t] shows the text consumed so far ("2") and the last character examined ([t]), which is the character that caused the problem.

    Putting this all together, the problem is that '2t' was found in a field for an INT4 column, which is the N_NATIONKEY column of the NATION table. '2t' is not a valid integer, so the load marked this as a bad record.

    5. You can confirm that this observation is correct by examining the nation.del datasource file that was used for the load. In the second session execute the following command:

        [nz@netezza movingData]$ more nation.del

    Which will display the nation.del file with the following text:
  • 92.
        15|andorra|2|andorra
        16|ascension islan|3|ascension
        17|austria|3|osterreich
        18|bahamas|2|bahamas
        19|barbados|2|barbados
        20|belgium|3|belqique
        21|chile|2|chile
        22|cuba|2|cuba
        23|cook islands|4|cook islands
        2t|denmark|3|denmark
        25|ecuador|2|ecuador
        26|falkland islands|3|islas malinas
        27|fiji|4|fiji
        28|finland|3|suomen tasavalta
        29|greenland|1|kalaallit nunaat
        30|great britain|3|great britian
        31|gibraltar|3|gibraltar
        32|hungary|3|magyarorszag
        33|iceland|3|lyoveldio island
        34|ireland|3|eire
        35|isle of man|3|isle of man
        36|jamaica|2|jamaica
        37|korea|4|han-guk
        38|luxembourg|3|luxembourg
        39|monaco|3|Monaco

    You will notice the following record on the 10th line:

        2t|denmark|3|denmark

    So we made the correct assumption that the '2t' is causing the problem. From the surrounding keys in this list you can assume that the correct value should be 24.

    6. Alternatively, you could examine the nzload bad log file NATION.LABDB.nzbad, which contains all bad records that are processed during a load. In the second session execute the following command:

        [nz@netezza movingData]$ more NATION.LABDB.nzbad

    Which will display the NATION.LABDB.nzbad file text:

        2t|denmark|3|denmark

    This is the same row that you identified in the nation.del file by using the log file to locate the record. Since the default is to stop the load after the first bad record is processed, there is only one row. If you were to change the default behaviour to allow more bad records to be processed, this file could contain more records. It provides a convenient overview of all the records that caused exceptions during the load.

    7. We have the option of editing the nation.del file to change '2t' to '24' and then rerunning the same nzload command as in step 2. Instead you will rerun a similar load, but you will allow 10 bad records to be encountered during the load process. To change the default behaviour you need to use the command option -maxErrors. You will also change the name of the nzbad file using the -bf command option and the log filename using the -lf command option:
  • 93.
        [nz@netezza movingData]$ nzload -db labdb -u labadmin -pw password -t nation -df nation.del -delimiter '|' -maxerrors 10 -bf nation.bad -lf nation.log

    Which will return the following status message:

        Load session of table 'NATION' completed successfully

    Now the load is successful.

    8. Check to ensure that the newly loaded rows are in the NATION table:

        LABDB(LABADMIN)=> select * from nation;

    Which will list all of the rows in the NATION table:

         N_NATIONKEY |          N_NAME           | N_REGIONKEY |            N_COMMENT
        -------------+---------------------------+-------------+----------------------------------
                   2 | united states             |           1 | united states of america
                  11 | japan                     |           4 | nippon
                  18 | bahamas                   |           2 | bahamas
                  19 | barbados                  |           2 | barbados
                  20 | belgium                   |           3 | belqique
                  25 | ecuador                   |           2 | ecuador
                  33 | iceland                   |           3 | lyoveldio island
                  34 | ireland                   |           3 | eire
                  39 | monaco                    |           3 | monaco
                   3 | brazil                    |           2 | brasil
                   4 | guyana                    |           2 | guyana
                   5 | venezuela                 |           2 | venezuela
                   9 | south africa              |           3 | south africa
                  13 | hong kong                 |           4 | xianggang
                  15 | andorra                   |           2 | andorra
                  27 | fiji                      |           4 | fiji
                  28 | finland                   |           3 | suomen tasavalta
                  30 | great britain             |           3 | great britian
                  36 | jamaica                   |           2 | jamaica
                  37 | korea                     |           4 | han-guk
                  38 | luxembourg                |           3 | luxembourg
                   6 | united kingdom            |           3 | united kingdom
                   7 | portugal                  |           3 | portugal
                  10 | australia                 |           4 | australia
                  12 | macau                     |           4 | aomen
                  14 | new zealand               |           4 | new zealand
                  26 | falkland islands          |           3 | islas malinas
                  29 | greenland                 |           1 | kalaallit nunaat
                  31 | gibraltar                 |           3 | gibraltar
                  32 | hungary                   |           3 | magyarorszag
                   1 | canada                    |           1 | canada
                   8 | united arab emirates      |           3 | al imarat al arabiyah multahidah
                  16 | ascension islan           |           3 | ascension
                  17 | austria                   |           3 | osterreich
                  21 | chile                     |           2 | chile
                  22 | cuba                      |           2 | cuba
                  23 | cook islands              |           4 | cook islands
                  35 | isle of man               |           3 | isle of man
        (38 rows)

    So now all of the new records were loaded except for the one bad row, which should have nation key 24.
9. Even though the nzload command reported success, it is good practice to review the nzload log file for any problems, for example bad rows that are under the -maxerrors threshold. In the second PuTTY session execute the following command:

    [nz@netezza movingData]$ more nation.log

The log file should be similar to the following:

    Load started at:01-Jan-11 12:34:56 EDT
    Database: LABDB
    Tablename: NATION
    Datafile: /home/nz/movingData/nation.del
    Host: netezza

    Load Options
    Field delimiter: '|'
    NULL value: NULL
    File Buffer Size (MB): 8
    Load Replay REGION (MB): 0
    Encoding: INTERNAL
    Max errors: 1
    Skip records: 0
    Max rows: 0
    FillRecord: No
    Truncate String: No
    Escape Char: None
    Accept Control Chars: No
    Allow CR in string: No
    Ignore Zero: No
    Quoted data: NO
    Require Quotes: No
    BoolStyle: 1_0
    Decimal Delimiter: '.'
    Date Style: YMD
    Date Delim: '-'
    Time Style: 24HOUR
    Time Delim: ':'
    Time extra zeros: No

    Found bad records
    bad #: input row #(byte offset to last char examined) [field #, declaration] diagnostic, "text consumed"[last char examined]
    -----------------------------------------------------------------------------
    1: 10(1) [1, INT4] expected field delimiter or end of record, "2"[t]

    Statistics
    number of records read: 25
    number of bad records: 1
    -------------------------------------------------
    number of records loaded: 24
    Elapsed Time (sec): 3.0
    -----------------------------------------------------------------------------
    Load completed at: 01-Jan-11 12:34:59 EDT
    =============================================================================

The main difference from before is that all 25 data records in the data source file were processed. 24 records were loaded because there was one bad record in the data source file.

10. Now you will correct the bad row and load it into the NATION table. There are a couple of options you could use. One option is to extract the bad row from the original data source file and create a new data source file with the corrected record. However, this task could be tedious when dealing with large data source files and potentially many bad records. The other option, which is more appropriate, is to use the bad records file. All bad records that cannot be loaded into the table are placed in this file. So in the second session use vi to open and edit the nation.bad file and change the '2t' to '24' in the first field.

    [nz@netezza movingData]$ vi nation.bad
11. The vi editor has two modes: a command mode, used to save files, quit the editor and so on, and an insert mode. Initially you will be in command mode. To change the file you need to switch into insert mode by pressing "i". The editor will show -- INSERT -- at the bottom of the screen.

12. You can now use the cursor keys to navigate. Change the first two characters of the bad row from 2t to 24. Your screen should look like the following:

    24|denmark|3|denmark
    ~
    ~
    ~
    ~
    ~
    ~
    ~
    ~
    ~
    ~
    -- INSERT --

13. We will now save our changes. Press "Esc" to switch back into command mode. You should see that the "-- INSERT --" string at the bottom of the screen vanishes. Enter :wq! and press Enter to write the file and quit the editor without any prompts.

14. After the nation.bad file has been modified to correct the record, issue an nzload to load the modified nation.bad file:

    [nz@netezza movingData]$ nzload -db labdb -u labadmin -pw password -t nation -df nation.bad -delimiter '|'

Which will return the following status message:

    Load session of table 'NATION' completed successfully

15. Now check that the new row has been loaded into the table in session one:

    LABDB(LABADMIN)=> select * from nation;

Which will return all rows in the NATION table:
     N_NATIONKEY | N_NAME               | N_REGIONKEY | N_COMMENT
    -------------+----------------------+-------------+----------------------------------
               1 | canada               |           1 | canada
               8 | united arab emirates |           3 | al imarat al arabiyah multahidah
              16 | ascension islan      |           3 | ascension
              17 | austria              |           3 | osterreich
              21 | chile                |           2 | chile
              22 | cuba                 |           2 | cuba
              23 | cook islands         |           4 | cook islands
              35 | isle of man          |           3 | isle of man
              24 | denmark              |           3 | denmark
               2 | united states        |           1 | united states of america
              11 | japan                |           4 | nippon
              18 | bahamas              |           2 | bahamas
              19 | barbados             |           2 | barbados
              20 | belgium              |           3 | belqique
              25 | ecuador              |           2 | ecuador
              33 | iceland              |           3 | lyoveldio island
              34 | ireland              |           3 | eire
              39 | monaco               |           3 | monaco
               3 | brazil               |           2 | brasil
               4 | guyana               |           2 | guyana
               5 | venezuela            |           2 | venezuela
               9 | south africa         |           3 | south africa
              13 | hong kong            |           4 | xianggang
              15 | andorra              |           2 | andorra
              27 | fiji                 |           4 | fiji
              28 | finland              |           3 | suomen tasavalta
              30 | great britain        |           3 | great britian
              36 | jamaica              |           2 | jamaica
              37 | korea                |           4 | han-guk
              38 | luxembourg           |           3 | luxembourg
               6 | united kingdom       |           3 | united kingdom
               7 | portugal             |           3 | portugal
              10 | australia            |           4 | australia
              12 | macau                |           4 | aomen
              14 | new zealand          |           4 | new zealand
              26 | falkland islands     |           3 | islas malinas
              29 | greenland            |           1 | kalaallit nunaat
              31 | gibraltar            |           3 | gibraltar
              32 | hungary              |           3 | magyarorszag
    (39 rows)

The row with nation key 24 (denmark) is the new row that was added to the table, which was the bad record you corrected.
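The log review and the manual vi edit from steps 9 to 14 can also be scripted. The following is a minimal sketch and not part of the original lab: it assumes the log and bad-record layouts shown above, recreates small sample files in a temporary directory (a stand-in for /home/nz/movingData), and uses sed in place of vi. The file name nation.fixed is hypothetical, and the final nzload call is left as a comment because it needs a live system.

```shell
#!/bin/sh
# Sketch: read the bad-record count from an nzload log, then repair the bad file.
workdir=$(mktemp -d)

# Sample files modelled on the lab output (assumption: same layout as above).
cat > "$workdir/nation.log" <<'EOF'
Statistics
number of records read: 25
number of bad records: 1
number of records loaded: 24
EOF
printf '2t|denmark|3|denmark\n' > "$workdir/nation.bad"

# Extract the number that follows "number of bad records:".
bad_count=$(awk -F': *' '/number of bad records/ {print $2}' "$workdir/nation.log")
echo "bad records: $bad_count"

# Replace the malformed key 2t with 24, but only at the start of the first field.
sed 's/^2t|/24|/' "$workdir/nation.bad" > "$workdir/nation.fixed"
fixed_row=$(cat "$workdir/nation.fixed")
echo "$fixed_row"

# On the lab system the repaired file would then be loaded as in step 14:
# nzload -db labdb -u labadmin -pw password -t nation -df nation.fixed -delimiter '|'
```

Anchoring the sed pattern with ^ and the trailing delimiter keeps the substitution from touching a "2t" that might appear in a name or comment field.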
  • 97. IBM Software Information Management IBM PureData System for Analytics © Copyright IBM Corp. 2012. All rights reserved Page 32 of 32 © Copyright IBM Corporation 2011 All Rights Reserved. IBM Canada 8200 Warden Avenue Markham, ON L6G 1C7 Canada IBM, the IBM logo, ibm.com and Tivoli are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml Other company, product and service names may be trademarks or service marks of others. References in this publication to IBM products and services do not imply that IBM intends to make them available in all countries in which IBM operates. No part of this document may be reproduced or transmitted in any form without written permission from IBM Corporation. Product data has been reviewed for accuracy as of the date of initial publication. Product data is subject to change without notice. Any statements regarding IBM’s future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. THE INFORMATION PROVIDED IN THIS DOCUMENT IS DISTRIBUTED “AS IS” WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements (e.g. IBM Customer Agreement, Statement of Limited Warranty, International Program License Agreement, etc.) under which they are provided.
Backup & Restore
Hands-On Lab
Table of Contents
1 Introduction
  1.1 Objectives
2 Creating a QA Database
3 Creating the Test Database
4 Backing up and Restoring a Database
  4.1 Backing up the Database
  4.2 Verifying the Backups
  4.3 Restoring the Database
  4.4 Single Table Restore
5 Backing up User Data and Host Data
  5.1 User Data Backup
  5.2 Host Data Backup
1 Introduction

PureData System appliances are 99.99% reliable and all internal components are redundant. Nevertheless, regular backups should be part of operating any data warehouse. The first reason is disaster recovery, for example in case of a fire in the data center. The second reason is to undo changes such as accidental deletes.

For disaster recovery, backups should be stored in a different location than the data center that hosts the data warehouse. For most big companies this will be a backup server running Veritas NetBackup, Tivoli Storage Manager, or similar software. Backing up to a file server is also possible.

1.1 Objectives

In the previous labs we created our LABDB database and loaded data into it. In this lab we will first set up a QA database that contains a subset of the tables and data of the full database. To create the tables we will use cross-database access from our QA database to the LABDB production database. We will then use the schema-only function of nzbackup to create a test database that contains the same tables and data objects as the QA database but no data. Test data will later be added specifically for testing needs.

After that we will do a multistep backup of our QA database and test the restore functionality. Testing backups by restoring them is generally a good idea and should be done during the development phase and also at regular intervals. After all, you cannot be fully sure what a backup contains until you restore it.

Finally we will back up the system user data and the host data. While a database backup saves all users and groups that are involved in that database, a full user backup may be needed to get the full picture, for example to archive users and groups that are not used in any database. Host data should also be backed up regularly. In case of a host failure that leaves the user data on the S-Blades intact, having a recent host backup will make the recovery of the appliance much faster and more straightforward.
Figure 1 LABDB database

2 Creating a QA Database

In this chapter we will create a QA database called LABDBQA, which contains a subset of the tables. It will contain the full NATION and REGION tables and the CUSTOMER table with a subset of the data. We will first create our QA database, then connect to it and use CREATE TABLE AS (CTAS) statements to create the table copies. We will use cross-database access to create our CTAS tables from the foreign LABDB database. This is possible because PureData System allows read-only cross-database access if fully qualified names are used.

In this lab we will switch regularly between the operating system prompt and the nzsql console. The operating system prompt will be used to execute the backup and restore commands and to review the created files. The nzsql console will be used to create the tables and to review the changes that the restore commands make to the user data.

To make this easier you should open two PuTTY sessions. The first one will be used to execute the operating system commands and will be referred to as session 1 or the OS session. In the second session we will start the nzsql console; it will be referred to as session 2 or the nzsql session. You can also see which session to use from the command prompt in the screenshots.
Figure 2 The two PuTTY sessions for this lab, OS session 1 on the left, NZSQL session 2 on the right

1. Open the first PuTTY session. Log in to 192.168.239.2 as user "nz" with password "nz". (192.168.239.2 is the default IP address for a local VM; the IP may be different for your Bootcamp.)

2. Access the lab directory for this lab with the following command:

    [nz@netezza ~]$ cd /labs/backupRestore/

3. Open the second PuTTY session. Log in to 192.168.239.2 as user "nz" with password "nz". (192.168.239.2 is the default IP address for a local VM; the IP may be different for your Bootcamp.)

4. Access the lab directory for this lab with the same command as before:

    [nz@netezza ~]$ cd /labs/backupRestore/

5. Start the NZSQL console with the following command:

    nzsql

This will connect you to the SYSTEM database with the ADMIN user. These are the default settings stored in the environment variables of the NZ user.

6. Create our empty QA database with the following command:

    SYSTEM(ADMIN)=> create database LABDBQA;

7. Connect to the QA database with the following command:

    SYSTEM(ADMIN)=> \c LABDBQA

8. Create a full copy of the REGION table from the LABDB database:

    LABDBQA(ADMIN)=> create table region as select * from labdb..region;

With this statement we create a local REGION table in the currently connected QA database that has the same definition and content as the REGION table from the LABDB database. The CREATE TABLE AS statement is one of the most flexible administrative tools for a PureData System administrator.
We can easily access tables of databases we are currently not connected to, but only for read operations. We could not insert data into a database we are not connected to.

9. Let's verify that the content has been copied over correctly. First let's look at the original data in the LABDB database:

    LABDBQA(ADMIN)=> select * from labdb..region;

You should see four rows in the result set:

     R_REGIONKEY | R_NAME | R_COMMENT
    -------------+--------+-----------------------------
               2 | sa     | south america
               3 | emea   | europe, middle east, africa
               1 | na     | north america
               4 | ap     | asia pacific
    (4 rows)

To access a table from a foreign database we need to use the fully qualified name. Notice that we leave out the schema name between the two dots. Schemas are not fully supported in PureData System, and since each table name needs to be unique in a given database the schema can be omitted.

10. Now let's compare that to our local REGION table:

    LABDBQA(ADMIN)=> select * from region;

You should see the same rows as before, although they can be in a different order:

     R_REGIONKEY | R_NAME | R_COMMENT
    -------------+--------+-----------------------------
               1 | na     | north america
               3 | emea   | europe, middle east, africa
               4 | ap     | asia pacific
               2 | sa     | south america
    (4 rows)

11. Now we copy over the NATION table as well:

    LABDBQA(ADMIN)=> create table nation as select * from labdb..nation;

12. Finally we will copy over a subset of our CUSTOMER table. We will only use the rows from the automobile market segment for the QA database:

    LABDBQA(ADMIN)=> create table customer as select * from labdb..customer where c_mktsegment = 'AUTOMOBILE';

You will see that this inserts almost 30000 rows into the QA customer table, which is roughly a fifth of the original table:
    INSERT 0 29752

13. We will now create a view NATIONSBYREGIONS which returns a list of nation names with their corresponding region names. This is used in a couple of applications:

    LABDBQA(ADMIN)=> create view nationsbyregions as select r_name, n_name from nation, region where r_regionkey = n_regionkey;

14. Let's have a look at what the view returns:

    LABDBQA(ADMIN)=> select * from nationsbyregions;

You should get a list of all nations and their corresponding region name:

     R_NAME | N_NAME
    --------+----------------------
     sa     | guyana
     emea   | united arab emirates
     ap     | macau
     sa     | brazil
     emea   | portugal
     ap     | japan
     na     | canada
     sa     | venezuela
     emea   | south africa
     ap     | hong kong
     na     | united states
     emea   | united kingdom
     ap     | australia
     ap     | new zealand
    (14 rows)

Views are a very convenient way to hide SQL complexity. They can also be used to implement column-level security by creating views of tables that only contain a subset of columns. They are fully supported by PureData System.

15. Verify the created tables with the following command:

    \d

You will see that our QA database contains only the three tables and the view we just created:
    LABDBQA(ADMIN)=> \d
             List of relations
           Name       | Type  | Owner
    ------------------+-------+-------
     CUSTOMER         | TABLE | ADMIN
     NATION           | TABLE | ADMIN
     NATIONSBYREGIONS | VIEW  | ADMIN
     REGION           | TABLE | ADMIN
    (4 rows)

16. Finally we will create a QA user and make this user the owner of the database. Create the user with:

    LABDBQA(ADMIN)=> create user qauser;

17. Make the user the owner of the QA database:

    LABDBQA(ADMIN)=> alter database labdbqa owner to qauser;

We have successfully created our QA database using cross-database CTAS statements. Our QA database contains three tables and a view, and we have a user that is the owner of this database. In the next chapter we will use backup and restore to create an empty copy of the QA database for the test database.

3 Creating the Test Database

In this chapter we will use schema-only backup and restore to create an empty copy of the QA database as a test database. It will not contain any data since the developers will fill it with test-specific data. Schema-only backup is a convenient way to recreate databases without the contained user data.

1. Switch to the OS session and create the schema-only backup of our QA database:

    [nz@netezza backupRestore]$ nzbackup -schema-only -db labdbqa -dir /tmp/bkschema

To do this we need to specify three parameters: the database we want to back up, the file system location where the backup files are saved, and the -schema-only parameter to specify that user data should not be backed up.

Normally backups should not be saved on the host hard disks but on a remote network file server. Not only is this essential for disaster recovery, but the host hard disks are small, optimized for speed and not intended to hold large amounts of data. They are strictly intended for PureData System software and operational data.

Later we will have a deeper look at the created files and the logs, but for the moment we will not go into that.

2. Now we will restore the test database from this backup:

    [nz@netezza backupRestore]$ nzrestore -dir /tmp/bkschema -db labdbtest -sourcedb labdbqa -schema-only

We can restore a database to a different database name. We simply need to specify the new name in the -db parameter and the old name in the -sourcedb parameter.
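The schema-only backup and restore pair above can be wrapped in a small helper. This is a sketch only: the function names run and clone_schema and the DRY_RUN switch are hypothetical, and the nzbackup and nzrestore options are exactly the ones used in steps 1 and 2. In dry-run mode the commands are printed rather than executed, so the snippet can be tried without an appliance.

```shell
#!/bin/sh
# Sketch: clone the schema of one database into a new, empty database.
# DRY_RUN=1 (the default here) only prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# clone_schema SOURCE_DB TARGET_DB BACKUP_DIR (hypothetical helper)
clone_schema() {
    run nzbackup -schema-only -db "$1" -dir "$3" &&
    run nzrestore -dir "$3" -db "$2" -sourcedb "$1" -schema-only
}

plan=$(clone_schema labdbqa labdbtest /tmp/bkschema)
echo "$plan"
```

Chaining the two calls with && means the restore is only attempted when the backup command succeeds.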
3. In the nzsql session we will verify that we successfully created an empty copy of our database. See all available databases with the following command:

    LABDBTEST(ADMIN)=> \l
      List of databases
     DATABASE  | OWNER
    -----------+----------
     INZA      | ADMIN
     LABDB     | LABADMIN
     LABDBQA   | QAUSER
     LABDBTEST | QAUSER
     MASTER_DB | ADMIN
     NZA       | ADMIN
     NZM       | ADMIN
     NZR       | ADMIN
     SYSTEM    | ADMIN
    (9 rows)

You can see that the LABDBTEST database was successfully created and that its privilege information has been copied as well: the owner is QAUSER, as in the LABDBQA database.

4. We do not want the QA user to be the owner of the test database, so change the owner to ADMIN for now:

    LABDBTEST(ADMIN)=> alter database labdbtest owner to admin;

5. Now let's check the contents of our test database. Connect to it with:

    \c labdbtest

6. Check whether our test database contains all the objects of the QA database:

    LABDBTEST(ADMIN)=> \d
             List of relations
           Name       | Type  | Owner
    ------------------+-------+-------
     CUSTOMER         | TABLE | ADMIN
     NATION           | TABLE | ADMIN
     NATIONSBYREGIONS | VIEW  | ADMIN
     REGION           | TABLE | ADMIN
    (4 rows)

You will see the three tables and the view we created. PureData System backup saves all database objects, including views, stored procedures and so on. All users, groups and privileges that refer to the backed-up database are saved as well.

7. Since we used the -schema-only option we have not copied any data. Verify this for the NATION table with the following command:

    LABDBTEST(ADMIN)=> select * from nation;

You will see an empty result set as expected.

The -schema-only backup option is a convenient way to save your database schema and to create empty copies of your database. Apart from the missing user data it will create a full 1:1 copy of the original database. You could also restore the database to a different PureData System appliance. This would only require that the backup server location is accessible from both appliances. The target could even be a differently sized appliance, and it can have a higher version number of the NPS software than the source. It cannot be lower though.

4 Backing up and Restoring a Database

PureData System's user data backup will create a backup of the complete database, including all database objects and user data. Even global objects like users and privileges that are used in the database are backed up. Backup and restore is therefore a very easy and straightforward process.

Since PureData System has no transaction log, point-in-time restore is not possible. Therefore frequent backups are advisable. NPS supports full, differential and cumulative backups that allow easy and fast regular data backups. An example backup strategy would be monthly full backups, weekly cumulative backups and daily differential backups. Since PureData System is neither designed nor intended to be used as an OLTP database, this should provide enough backup flexibility for most situations. For example, differential backups can be run after the daily ETL processes that feed the warehouse.

Figure 3 A typical backup strategy

In this chapter we will create a backup of our QA database. We will then do a differential backup and then do a restore. Our VMware environment has some specific restrictions that only allow the restoration of up to 2 increments. The labs will work correctly, but do not be surprised by errors during restore operations of more than 2 increments.

4.1 Backing up the Database

PureData System's backup is organized in so-called backup sets. Every new full backup creates a new backup set. By default, differential and cumulative backups are added to the last backup set, but they can be added to a different backup set as well.
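The example strategy from the chapter introduction (monthly full, weekly cumulative, daily differential) could be driven by a small script that picks the increment type from the calendar. The following is a sketch under stated assumptions: the schedule (full on the 1st, cumulative on Sundays) and the function name backup_option are illustrative choices, and -cumulative is assumed to be the nzbackup counterpart of the -differential option used in this lab. The command is only printed, not executed.

```shell
#!/bin/sh
# Sketch: pick the nzbackup increment option from the calendar day.
# Assumed schedule: full on the 1st, cumulative on Sundays, differential otherwise.

# backup_option DAY_OF_MONTH DAY_OF_WEEK  (formats of `date +%d` and `date +%u`)
backup_option() {
    if [ "$1" = "01" ]; then
        echo ""                  # full backup: no increment option
    elif [ "$2" = "7" ]; then
        echo "-cumulative"       # assumption: option name for cumulative backups
    else
        echo "-differential"
    fi
}

opt=$(backup_option "$(date +%d)" "$(date +%u)")
# Print the command rather than running it; remove echo on a real host.
echo nzbackup -db labdbqa -dir /tmp/bk1 /tmp/bk2 $opt
```

The day of month is compared as a string because `date +%d` returns a zero-padded value such as 01.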
In this section we will switch between the two PuTTY sessions.

1. In the OS session execute the following command to create a full backup of the QA database:

    [nz@netezza backupRestore]$ nzbackup -db labdbqa -dir /tmp/bk1 /tmp/bk2

You should get the following result:
    Backup of database labdbqa to backupset 20111214173551 completed successfully.

This command will create a full user data backup of the LABDBQA database. Each backup set has a unique id that can later be used to access it. By default the last active backup set is used for restore and differential backups.

In this lab we split up the backup between two file system locations. You can specify up to 16 file system locations after the -dir parameter. Alternatively you could use a directory list file with the -dirfile option. Splitting up the backup between different file servers will result in higher backup performance.

2. In the NZSQL session we will now add a new row to the REGION table. First connect to the QA database:

    LABDBTEST(ADMIN)=> \c labdbqa

3. Now add a new entry for the north pole to the REGION table:

    LABDBQA(ADMIN)=> insert into region values (5, 'np', 'north pole');

4. In the OS session create a differential backup:

    [nz@netezza backupRestore]$ nzbackup -db labdbqa -dir /tmp/bk1 /tmp/bk2 -differential

We now create a differential backup with the -differential option. This adds a new entry to the backup set we created previously, containing only the differences since the full backup. You can see that the backup set id has not changed.

5. In the NZSQL session add the south pole to the REGION table:

    LABDBQA(ADMIN)=> insert into region values (6, 'sp', 'south pole');

You now have one full backup with the original 4 rows in the REGION table, a differential backup that additionally has the north pole entry, and a current state that also contains the south pole region.

4.2 Verifying the Backups

In this subchapter we will have a closer look at the files and logs that are created during the PureData System backup process.

1. In the OS session display the backup history of your appliance:

    [nz@netezza backupRestore]$ nzbackup -history

You should get the following result:
    Database Backupset      Seq # OpType Status    Date                Log File
    -------- -------------- ----- ------ --------- ------------------- ------------------------------
    LABDBQA  20111213225029 1     SCHEMA COMPLETED 2011-12-13 17:50:29 backupsvr.10724.2011-12-13.log
    LABDBQA  20111214173551 1     FULL   COMPLETED 2011-12-14 12:35:51 backupsvr.21406.2011-12-14.log
    LABDBQA  20111214173551 2     DIFF   COMPLETED 2011-12-14 12:44:53 backupsvr.21594.2011-12-14.log

PureData System keeps track of all backups and saves them in the system catalog. This is used for differential backups and it is also integrated with the Groom process. Since PureData System does not use transaction logs, it needs logically deleted rows for differential backups. By default Groom does not remove a logically deleted row that has not been backed up yet. Therefore the Groom process is integrated with the backup history. We will explain this in more detail in the Transaction and Groom modules.

On our machine we have done three backups: one backup set containing the schema-only backup, and two backups in the second backup set, one full and one differential. Let's have a closer look at the log that has been generated for the last differential backup.

2. In the OS session, switch to the log directory of the backupsvr process, which is the process responsible for backing up data:

    [nz@netezza backupRestore]$ cd /nz/kit/log/backupsvr/

The /nz/kit/log directory contains the log directories for all PureData System processes.

3. Display the end of the log for the last differential backup process. You will need to replace the xxx values with the actual values of your log. You can cut and paste the log name from the history output above. We are interested in the last differential backup process:

    [nz@netezza backupsvr]$ tail backupsvr.xxxxx.xxxx-xx-xx.log

You will see the following result:

    [nz@netezza backupsvr]$ tail backupsvr.21594.2011-12-14.log
    2011-12-14 12:44:59.445051 EST Info: [21604] Postgres client pid: 21606, session: 19206
    2011-12-14 12:45:00.461034 EST Info: Capturing deleted rows
    2011-12-14 12:45:03.971731 EST Info: Backing up table REGION
    2011-12-14 12:45:04.675441 EST Info: Backing up table NATION
    2011-12-14 12:45:06.077822 EST Info: Backing up table CUSTOMER
    2011-12-14 12:45:08.673602 EST Info: Operation committed
    2011-12-14 12:45:08.673636 EST Info: Wrote 264 bytes in less than one second to location 1
    2011-12-14 12:45:08.673643 EST Info: Wrote 385 bytes in less than one second to location 2
    2011-12-14 12:45:08.682316 EST Info: Backup of database labdbqa to backupset 20111214173551 completed successfully.
    2011-12-14 12:45:08.767215 EST Info: NZ-00023: --- program 'backupsvr' (21594) exiting on host 'netezza' ... ---
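Instead of copying the log name by hand, the log file of the latest differential backup can be picked out of the history listing with awk. This is a minimal sketch that assumes the column layout shown in the history output above; it parses a sample copied from that output so it runs without an appliance, and the variable names are illustrative.

```shell
#!/bin/sh
# Sketch: find the log file of the most recent differential backup.
# On the lab host the input would come from `nzbackup -history`; here a
# sample copied from the output above is used so the snippet runs anywhere.
hist=$(cat <<'EOF'
Database Backupset      Seq # OpType Status    Date                Log File
-------- -------------- ----- ------ --------- ------------------- ------------------------------
LABDBQA  20111213225029 1     SCHEMA COMPLETED 2011-12-13 17:50:29 backupsvr.10724.2011-12-13.log
LABDBQA  20111214173551 1     FULL   COMPLETED 2011-12-14 12:35:51 backupsvr.21406.2011-12-14.log
LABDBQA  20111214173551 2     DIFF   COMPLETED 2011-12-14 12:44:53 backupsvr.21594.2011-12-14.log
EOF
)

# Columns: 1 database, 2 backup set, 3 sequence, 4 type, 5 status, 6-7 date, 8 log file.
# The last matching row wins, so the most recent DIFF entry is reported.
last_diff_log=$(echo "$hist" |
    awk -v db=LABDBQA '$1 == db && $4 == "DIFF" { f = $8 } END { print f }')
echo "$last_diff_log"

# The log could then be inspected with:
# tail "/nz/kit/log/backupsvr/$last_diff_log"
```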
The log shows that the process backed up the three tables REGION, NATION and CUSTOMER and wrote the result to two different locations. You also see the amount of data written to these locations. Since we only added a single row, the amount of data is tiny. If you look at the log of the full backup you will see a lot more data being written.

4. Now let's have a look at the files that are created during the backup process. Enter the first backup location:

    [nz@netezza backupsvr]$ cd /tmp/bk1

5. Display the contents with ls. You will see the following result:

    [nz@netezza bk1]$ ls
    Netezza

The Netezza folder contains all backup sets for all PureData System appliances that use this backup location. If you need to move the backup you always have to move the complete folder.

6. Enter the Netezza folder with cd Netezza and display the contents with ls. You will see the following result:

    [nz@netezza Netezza]$ ls
    netezza

Under the main Netezza folder you will find subfolders for each Netezza host that is backed up to this location. In our case we only have one Netezza host called "netezza". If your company had multiple Netezza hosts you would find them here.

7. Enter the host folder with cd netezza and display the contents with ls:

    [nz@netezza netezza]$ ls
    LABDBQA

Below the host you will find all the databases of the host that have been backed up to this location, in our case the QA database.

8. Enter the LABDBQA folder with cd LABDBQA and display the contents with ls:

    [nz@netezza LABDBQA]$ ls
    20111214173551

In this folder you can see all the backup sets that have been saved for this database. Each backup set corresponds to one full backup plus an optional set of differential and cumulative backups. Note that we backed up the schema to a different location, so we only have one backup set in here.

9. Enter the backup set folder with cd <your backupset id> and display the contents with ls:
    [nz@netezza 20111214173551]$ ls
    1 2

Under the backup set are folders for each backup that has been added to that backup set. "1" is always the full backup, followed by additional differential or cumulative backups. We will later use these numbers to restore our database to a specific backup of the backup set.

10. Enter the full backup with cd 1 and display the contents with ls:

    [nz@netezza 1]$ ls
    FULL

As expected it is a full backup.

11. Enter the FULL folder with cd FULL and display the contents with ls:

    [nz@netezza FULL]$ ls
    data md

The data folder contains the user data; the md folder contains metadata, including the schema definition of the database.

12. Enter the data folder with cd data and display detailed information with ll:

    [nz@netezza data]$ ll
    total 1120
    -rw------- 1 nz nz     338 Dec 14 12:36 206208.full.2.1
    -rw------- 1 nz nz     451 Dec 14 12:36 206222.full.2.1
    -rw------- 1 nz nz 1132410 Dec 14 12:36 206238.full.1.1

You can see that there are three data files: two small files for the REGION and NATION tables and a big file for the CUSTOMER table.

13. Now switch to the md folder with cd ../md and display the contents with ls:

    [nz@netezza md]$ ls
    contents.txt loc1 schema.xml stream.0.1 stream.1.1.1.1

This folder contains information about the files that contribute to the backup, and the schema definition of the database in schema.xml.

14. Let's have a quick look inside the schema.xml file:

    [nz@netezza md]$ more schema.xml

You should see the following result:
  • 112.
    <ARCHIVE archive_major="4" archive_minor="0" product_ver="Release 6.1, Dev 2 [Build 16340]" catalog_ver="3.976" hostname="netezza" dataslices="4" createtime="2011-12-14 17:35:57" lowercase="f" hpfrel="4.10" model="WMware" family="vmware" platform="xs">
    <OPERATION backupset="20111214173551" increment="1" predecessor="0" optype="0" dbname="LABDBQA"/>
    <DATABASE name="LABDBQA" owner="QAUSER" oid="206144" delimited="f" odelim="f" charset="LATIN9" collation="BINARY" collecthist="t">
    <STATISTICS column_count="15"/>
    <TABLE ver="2" name="REGION" owner="ADMIN" oid="206208" delimited="f" odelim="f" rowsecurity="f" origoid="206208">
    <COLUMN name="R_REGIONKEY" owner="" oid="206209" delimited="f" odelim="t" seq="1" type="INTEGER" typeno="23" typemod="-1" notnull="t"/>
    ...
As you see, this file contains a full XML description of your database, including table definitions, views, users etc.
15. Switch back to the lab folder:
    [nz@netezza md]$ cd /labs/backupRestore/
You should now have a pretty good understanding of the PureData System backup process. In the next subchapter we will demonstrate the restore functionality.
4.3 Restoring the Database
In this subchapter we will first restore our database to the first increment and then upgrade it to the next increment. PureData System allows you to return a database to a specific increment in your backup set. If you want to do an incremental restore the database must be locked. Tables can be queried but not changed until the database is in the desired state and unlocked.
1. In the NZSQL session we will now drop the QA database and the QA user. First connect to the SYSTEM database:
    LABDBQA(ADMIN)=> \c SYSTEM
2. Now drop the QA database:
    SYSTEM(ADMIN)=> DROP DATABASE LABDBQA;
3. Now drop the QA user:
    SYSTEM(ADMIN)=> DROP USER QAUSER;
4. Let’s verify that the QA database really has been deleted with \l. You will see that the LABDBQA database has been removed:
  • 113.
    SYSTEM(ADMIN)=> \l
      List of databases
     DATABASE  | OWNER
    -----------+----------
     INZA      | ADMIN
     LABDB     | LABADMIN
     LABDBTEST | ADMIN
     MASTER_DB | ADMIN
     NZA       | ADMIN
     NZM       | ADMIN
     NZR       | ADMIN
     SYSTEM    | ADMIN
    (8 rows)
5. In the OS session we will now restore the database to the first increment:
    [nz@netezza md]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -increment 1 -lockdb true
Notice that we have specified the increment with the -increment option. In our case this is the first full backup in our backup set. We didn’t need to specify a backup set; by default the most recent one is used. Since we are not sure to which increment we want to restore the database, we have to lock the database with the -lockdb option. This allows only read-only access until the desired increment has been restored.
6. In the NZSQL session verify that the database has been recreated with \l. You will see the LABDBQA database, and you can also see that the owner QAUSER has been recreated and is again the database owner:
    SYSTEM(ADMIN)=> \l
      List of databases
     DATABASE  | OWNER
    -----------+----------
     INZA      | ADMIN
     LABDB     | LABADMIN
     LABDBQA   | QAUSER
     LABDBTEST | ADMIN
     MASTER_DB | ADMIN
     NZA       | ADMIN
     NZM       | ADMIN
     NZR       | ADMIN
     SYSTEM    | ADMIN
    (9 rows)
7. Connect to the LABDBQA database with:
    SYSTEM(ADMIN)=> \c labdbqa
You will see that the LABDBQA database is currently in read-only mode.
  • 114.
    SYSTEM(ADMIN)=> \c labdbqa
    NOTICE: Database 'LABDBQA' is available for read-only
    You are now connected to database labdbqa.
8. Verify the contents of the REGION table from the LABDBQA database:
    LABDBQA(ADMIN)=> select * from region;
     R_REGIONKEY | R_NAME | R_COMMENT
    -------------+--------+-----------------------------
               2 | sa     | south america
               1 | na     | north america
               3 | emea   | europe, middle east, africa
               4 | ap     | asia pacific
    (4 rows)
You can see that we have returned the database to its state at the time of the first full backup. There is no north or south pole in the comments column.
9. Try to insert a row to verify the read-only mode:
    LABDBQA(ADMIN)=> insert into region values (5, 'np', 'north pole');
    ERROR: Database 'LABDBQA' is available for read-only (command ignored)
As expected this is prohibited until we unlock the database.
10. In the OS session we will now apply the next increment to the database:
    [nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -increment next -lockdb true
    Restore of increment 2 from backupset 20111214173551 to database 'labdbqa' committed.
You will see that we now apply the second increment to the database.
11. Since we do not need to load any more increments, we can now unlock the database:
    [nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -unlockdb
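The locked restore workflow used in steps 5 to 11 (restore one increment, keep the database read-only, apply the next increment, then unlock) can be pictured as a small state machine. The sketch below is a toy model of that behavior as the lab describes it, not a description of how nzrestore works internally; the class LockedRestore and its methods are invented for illustration.

```python
class LockedRestore:
    """Toy model of the incremental restore workflow described in the lab."""

    def __init__(self, increments_available: int):
        self.available = increments_available  # increments in the backup set
        self.current = 0                       # 0 means nothing restored yet
        self.locked = False

    def restore(self, increment, lockdb=True):
        """Restore a numbered increment, or the following one with "next"."""
        if self.current and not self.locked:
            # Mirrors the lab's statement: once unlocked, no further
            # increments can be applied and the restore must start over.
            raise RuntimeError("database is unlocked; restart the restore")
        target = self.current + 1 if increment == "next" else int(increment)
        if not 1 <= target <= self.available:
            raise ValueError(f"increment {target} is not in the backup set")
        self.current = target
        self.locked = lockdb
        return target

    def unlock(self):
        self.locked = False

    def can_write(self):
        # A locked database is available read-only.
        return self.current > 0 and not self.locked


db = LockedRestore(increments_available=2)
db.restore(1)            # like -increment 1 -lockdb true
db.restore("next")       # like -increment next -lockdb true
db.unlock()              # like -unlockdb
print(db.current, db.can_write())
```

The model makes the ordering constraint explicit: all increments you need must be applied before the unlock.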
  • 115. After the database unlock we cannot apply any further increments to this database. To jump to a different increment we would need to start from the beginning.
12. In the NZSQL session we have a look at the REGION table again:
    LABDBQA(ADMIN)=> select * from region;
     R_REGIONKEY | R_NAME | R_COMMENT
    -------------+--------+-----------------------------
               2 | sa     | south america
               3 | emea   | europe, middle east, africa
               4 | ap     | asia pacific
               1 | na     | north america
               5 | np     | north pole
    (5 rows)
You can see that we have added the north pole region, which was created before the first differential backup.
13. Verify that the database is unlocked and ready for use again by adding a new set of customers to the CUSTOMER table. In addition to the Automobile users we want to add the machinery users from the main database:
    LABDBQA(ADMIN)=> insert into customer select * from labdb..customer where c_mktsegment = 'MACHINERY';
You will see that we can now use the database in a normal fashion again.
14. We had around 30000 customers before. Verify that the new user set has been added successfully:
    LABDBQA(ADMIN)=> select count(*) from customer;
You will see that we now have around 60000 rows in the CUSTOMER table. You have now done a full restore cycle for the database and applied a full and an incremental backup to your database. In the next chapter we will demonstrate single table restore and the ability to restore from any backup set.
4.4 Single Table Restore
In this chapter we will demonstrate the targeted restore of a subset of tables from a backup set. We will also demonstrate how to restore from a specific older backup set.
1. First we will create a second backup set with the new customer data. In the OS session execute the following command:
  • 116.
    [nz@netezza backupRestore]$ nzbackup -db labdbqa -dir /tmp/bk1 /tmp/bk2
Since we didn’t specify anything else this is a full database backup. In this case PureData System automatically creates a new backup set.
2. We want to return the CUSTOMER table to its previous condition, but we do not want to change the REGION or the NATION tables. To do this we need to know the backup set id of the previous backup set. Execute the history command again:
    [nz@netezza backupRestore]$ nzbackup -history
    Database  Backupset      Seq # OpType Status    Date                Log File
    --------- -------------- ----- ------ --------- ------------------- ------------------------------
    (LABDBQA) 20111213225029 1     SCHEMA COMPLETED 2011-12-13 17:50:29 backupsvr.10724.2011-12-13.log
    (LABDBQA) 20111214173551 1     FULL   COMPLETED 2011-12-14 12:35:51 backupsvr.21406.2011-12-14.log
    (LABDBQA) 20111214173551 2     DIFF   COMPLETED 2011-12-14 12:44:53 backupsvr.21594.2011-12-14.log
    LABDBQA   20111214205536 1     FULL   COMPLETED 2011-12-14 15:55:36 backupsvr.23621.2011-12-14.log
We now see three different backup sets: the schema-only backup, the two-step backup set and the new full backup set. Remember the backup set id of the two-step backup set.
3. To return only the CUSTOMER table to its condition in the two-step backup set we can do a table-level restore with the following command:
    [nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -backupset <your_backup_set_id> -tables CUSTOMER
This command will only restore the tables listed after the -tables option. If you want to restore multiple tables you can simply write them in a list after the option. We use the -backupset option to specify a specific backup set. Remember to replace the id with the value you retrieved with the history command. Notice that the table name is case sensitive, in contrast to the database name. You will get the following error:
    [nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -backupset 20111214173551 -tables CUSTOMER
    Error: Specify -droptables to force drop of tables in the -tables list.
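Picking the right backup set id out of the history listing is easy to get wrong by eye. As an illustration, the sketch below groups the rows of the history output shown above by backup set and reports the sets that contain more than one increment. It assumes the whitespace-separated column order of the sample output (database, backup set, sequence number, operation type); the function names are hypothetical and the parser is not an official tool.

```python
# Rows copied from the nzbackup -history output shown in this lab.
HISTORY = """\
Database Backupset Seq # OpType Status Date Log File
--------- -------------- ----- ------- --------- -------------------
(LABDBQA) 20111213225029 1 SCHEMA COMPLETED 2011-12-13 17:50:29 backupsvr.10724.2011-12-13.log
(LABDBQA) 20111214173551 1 FULL COMPLETED 2011-12-14 12:35:51 backupsvr.21406.2011-12-14.log
(LABDBQA) 20111214173551 2 DIFF COMPLETED 2011-12-14 12:44:53 backupsvr.21594.2011-12-14.log
LABDBQA 20111214205536 1 FULL COMPLETED 2011-12-14 15:55:36 backupsvr.23621.2011-12-14.log
"""


def backup_sets(history_text):
    """Map each backup set id to its list of (sequence, operation type)."""
    sets = {}
    for line in history_text.splitlines():
        fields = line.split()
        # Skip the header, the separator line and blank lines.
        if len(fields) < 5 or not fields[1].isdigit():
            continue
        backupset, seq, optype = fields[1], int(fields[2]), fields[3]
        sets.setdefault(backupset, []).append((seq, optype))
    return sets


def multi_step_sets(history_text):
    """Return the ids of backup sets that hold more than one increment."""
    return [bs for bs, incs in backup_sets(history_text).items()
            if len(incs) > 1]


print(multi_step_sets(HISTORY))
```

For the sample history this singles out the two-step backup set, which is the id passed to the -backupset option in step 3.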
  • 117. PureData System cannot restore a table that exists in the target database. You can either drop the table before restoring it, or use the -droptables option.
4. Repeat the previous command with the added -droptables option:
    [nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -backupset <your_backup_set_id> -tables CUSTOMER -droptables
You will get the following result:
    [nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -backupset 20111214173551 -tables CUSTOMER -droptables
    [Restore Server] : Dropping TABLE 'CUSTOMER'
    Restore of increment 1 from backupset 20111214173551 to database 'labdbqa' committed.
    Restore of increment 2 from backupset 20111214173551 to database 'labdbqa' committed.
You can see that the target table was dropped before the restore happened and that the specified backup set was used. Since we didn’t stipulate a specific increment, the full backup set has been applied with all increments. Also, the table is automatically unlocked after the restore process finishes.
5. Finally let’s verify that the restore worked as expected. In the NZSQL console count the rows of the customer table again:
    LABDBQA(ADMIN)=> select count(*) from customer;
     COUNT
    -------
     29752
    (1 row)
You will see that we are back to around 30000 rows. This means that we have reverted the most recent changes. In this chapter you have executed a single table restore and you did a restore from a specific backup set.
5 Backing up User Data and Host Data
In the previous chapters you have learned to back up PureData System databases. This backs up all the database objects that are used in the database and the user data from the S-Blades. These are the most critical components to back up in a PureData System appliance. They will allow you to recreate your databases even if you need to switch to a completely new appliance. But there are two other things that should be backed up:
• The global user information.
  • 118.
• The host data
In this chapter you will do a backup of these components, so you would be able to revert your appliance to the exact condition it was in before the backup.
5.1 User Data Backup
Users, groups, and privileges that are not used in databases will not be backed up by the user data backup. To be able to revert a PureData System appliance completely to its original condition you need to have a backup of the global user information as well, to capture for example administrative users that are not part of any database. This is done with the -users option of the nzbackup command:
1. In the OS session execute the following command:
    [nz@netezza backupRestore]$ nzbackup -dir /tmp/bk1 /tmp/bk2 -users
You will see the following results:
    [nz@netezza backupRestore]$ nzbackup -dir /tmp/bk1 /tmp/bk2 -users
    Backup of users, groups, and global permissions completed successfully.
This will create a backup of all users, groups and privileges. Restoring it will not delete any users; it will only add missing users, groups and privileges, so it doesn’t need to be fully synchronized with the user data backup. You can even restore an older user backup without fear of destroying information.
5.2 Host Data Backup
Until now we have always backed up database content. This is essentially catalog and user data that can be applied to a new PureData System appliance. PureData System also provides the functionality to back up and restore host data. This is essentially the data in the /nz/data and /export/nz directories of the host server. There are two reasons for regularly backing up host data. The first is a host crash. If the S-Blades of your appliance are intact but the host file system has been destroyed you could recreate all databases from the user backup. But in very large systems this might take a long time. It is much easier to only restore the host information and reconnect to the undamaged user tables on the S-Blades. The second reason is that the host data contains configuration information, log and plan files etc. that are not saved by the user backup. If, for example, you changed the system configuration, that information would be lost. Therefore it is advisable to regularly back up host data.
1. To back up the host data execute the following command in the OS session:
    [nz@netezza backupRestore]$ nzhostbackup /tmp/hostbackup
This will pause your system and copy the host files into the specified file name:
  • 119.
    [nz@netezza backupRestore]$ nzhostbackup /tmp/hostbackup
    Starting host backup. System state is 'online'.
    Pausing the system ...
    Checkpointing host catalog ...
    Archiving system catalog ...
    Resuming the system ...
    Host backup completed successfully. System state is 'online'.
As you can see, the system has been paused for the duration of the host backup but is automatically resumed after the backup is successful. Also notice that the host backup is done with the nzhostbackup command instead of the standard nzbackup command.
2. Let’s have a look at the created file:
    [nz@netezza backupRestore]$ ll /tmp
You will see the following results:
    [nz@netezza backupRestore]$ ll /tmp
    total 66160
    drwxrwxrwx 3 nz   nz       4096 Dec 14 12:35 bk1
    drwxrwxrwx 3 nz   nz       4096 Dec 14 12:35 bk2
    drwxrwxrwx 3 nz   nz       4096 Dec 13 17:50 bkschema
    -rw------- 1 nz   nz   67628809 Dec 14 16:37 hostbackup
    drwxrwxr-x 2 nz   nz       4096 Dec 12 14:55 inza1.1.2
    drwxrwxrwx 2 root root    16384 Jan 20  2011 lost+found
    srwxrwxrwx 1 nz   nz          0 Dec 12 15:04 nzaeus__nzmpirun___
    -rw-rw-r-- 1 nz   nz         33 Dec 12 15:04 nzaeus__nzmpirun_____Process
    -rw-rw-r-- 1 nz   nz          0 Dec 12 15:05 nzcm.lock
    drwx------ 2 nz   nz       4096 Dec 12 14:46 nzcm-temp_18uEeq
    drwx------ 2 nz   nz       4096 Dec 12 12:55 nzcm-temp_rvAZXR
You can see that a backup file has been created. It’s a compressed file containing the system catalog and PureData System host information. If possible, host backups should be done regularly. If, for example, an old host backup is restored there might exist so-called orphaned tables. These are tables that were created after the host backup and exist on the S-Blades but are no longer registered in the system catalog. During host restore PureData System will create a script to clean up these orphaned tables, so they do not take up any disk space.
Congratulations, you have finished the Backup & Restore lab and you have had a chance to see all components of a successful PureData System backup strategy. The one missing component is that we only used file system backup. In a real environment you would more likely use a Veritas or TSM backup server. For further information regarding the setup steps please refer to the excellent system administration guide.
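The notion of orphaned tables after restoring an old host backup boils down to a set difference: tables that still exist on the S-Blades but are missing from the restored catalog. The sketch below illustrates only that idea with made-up table names. It is not the cleanup script that PureData System generates, and find_orphaned_tables is a hypothetical helper.

```python
def find_orphaned_tables(tables_on_sblades, tables_in_catalog):
    """Return tables present on the S-Blades but unknown to the catalog."""
    return sorted(set(tables_on_sblades) - set(tables_in_catalog))


# Hypothetical state: NEW_ORDERS was created after the host backup was
# taken, so the restored (older) catalog does not know about it.
on_sblades = {"REGION", "NATION", "CUSTOMER", "NEW_ORDERS"}
in_restored_catalog = {"REGION", "NATION", "CUSTOMER"}

print(find_orphaned_tables(on_sblades, in_restored_catalog))
```

The more recent the host backup, the smaller this difference is, which is one reason to take host backups regularly.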
  • 120. IBM Software Information Management IBM PureData System for Analytics © Copyright IBM Corp. 2012. All rights reserved Page 23 of 23 © Copyright IBM Corporation 2011 All Rights Reserved. IBM Canada 8200 Warden Avenue Markham, ON L6G 1C7 Canada IBM, the IBM logo, ibm.com and Tivoli are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml Other company, product and service names may be trademarks or service marks of others. References in this publication to IBM products and services do not imply that IBM intends to make them available in all countries in which IBM operates. No part of this document may be reproduced or transmitted in any form without written permission from IBM Corporation. Product data has been reviewed for accuracy as of the date of initial publication. Product data is subject to change without notice. Any statements regarding IBM’s future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. THE INFORMATION PROVIDED IN THIS DOCUMENT IS DISTRIBUTED “AS IS” WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements (e.g. IBM Customer Agreement, Statement of Limited Warranty, International Program License Agreement, etc.) under which they are provided.
  • 121. IBM Software Information Management Query Optimization Hands-On Lab IBM PureData System for Analytics … Powered by Netezza Technology
  • 122. Table of Contents
1 Introduction
1.1 Objectives
2 Generate Statistics
3 Identifying Join Problems
4 HTML Explain
  • 123. 1 Introduction
PureData System uses a cost-based optimizer to determine the best method for scan and join operations, join order, and data movement between SPUs (redistribute or broadcast operations if necessary). For example, the planner tries to avoid redistributing large tables because of the performance impact. The optimizer can also dynamically rewrite queries to improve query performance. The optimizer takes a SQL query as input and creates a detailed execution or query plan for the database system. For the optimizer to create the execution plan that results in the best performance, it must have the most up-to-date statistics. You can use EXPLAIN, HTML (also known as bubble), and text plans to analyze how PureData System executes a query. Explain is a very useful tool to spot and identify performance problems, bad distribution keys, badly written SQL queries and out-of-date statistics.
1.1 Objectives
During our POC we have identified a couple of very long running customer queries that have significantly worse performance than the number of rows involved would suggest. In this lab we will use Explain functionality to identify the concrete bottlenecks and if possible fix them to improve query performance.
2 Generate Statistics
Our first long running customer query returns the average order price by customer segment for a given year and order priority. It joins the customer table for the market segment and the orders table for the total price of the order. Due to restrictive join conditions it shouldn’t require too much processing time. But on our test systems it runs a very long time. In this chapter we will use PureData System Explain functionality to find out why this is the case. The customer query in question:
    SELECT c.c_mktsegment, AVG(o.o_totalprice)
    FROM orders AS o, CUSTOMER as c
    WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996
    AND o.o_orderpriority = '1-URGENT'
    GROUP BY c.c_mktsegment;
1. Connect to your PureData System image using PuTTY. Log in to 192.168.239.2 as user “nz” with password “nz”. (192.168.239.2 is the default IP address for a local VM; the IP may be different for your Bootcamp.)
2. First we will make sure that the system doesn’t run a different workload that could influence our tests. Use the following nzsession command to verify that the system is free:
    [nz@netezza ~]$ nzsession show
You should get a similar result to the following:
    [nz@netezza ~]$ nzsession show
    ID    Type User  Start Time              PID  Database State  Priority Name Client IP Client PID Command
    ----- ---- ----- ----------------------- ---- -------- ------ ------------- --------- ---------- ------------------------
    16023 sql  ADMIN 29-Apr-11, 09:18:13 EDT 4795 SYSTEM   active normal        127.0.0.1 4794       SELECT session_id, clien
  • 124. This result shows that there is currently only one session connected to the database, which is the nzsession command itself. By default the database user in your VMware image is ADMIN. Executing this command before doing any performance measurements ensures that other workloads are not influencing the performance of the system. You can use the nzsession command as well to abort bad or locked sessions.
3. After we have verified that the system is free we can start analyzing the query. Connect to the lab database with the following command:
    [nz@netezza ~]$ nzsql labdb labadmin
4. Let’s first have a look at the two tables and the WHERE conditions to get an idea of the row numbers involved. Our query joins the CUSTOMER table, without any WHERE condition applied to it, and the ORDERS table, which has two WHERE conditions restricting it on the date and order priority. From the data distribution lab we know that the CUSTOMER table has 150000 rows. To get the rows that are involved from the ORDERS table execute the following COUNT(*) command:
    LABDB(LABADMIN)=> SELECT COUNT(*) FROM orders WHERE EXTRACT(YEAR FROM o_orderdate) = 1996 AND o_orderpriority = '1-URGENT';
You should get the following results:
    LABDB(LABADMIN)=> SELECT COUNT(*) FROM orders WHERE EXTRACT(YEAR FROM o_orderdate) = 1996 AND o_orderpriority = '1-URGENT';
     COUNT
    -------
     46014
    (1 row)
So the ORDERS table has 46014 rows that fit the WHERE condition. We will use EXPLAIN functionality to check if the available statistics allow the PureData System optimizer to estimate this correctly for its plan creation.
5. The PureData System optimizer uses statistics about the data in the system to estimate the number of rows that result from WHERE conditions, joins, etc. Wrong approximations can lead to bad execution plans. For example, a huge result set could be broadcast for a join instead of doing a double redistribution. To see its estimated rows for the WHERE conditions in our query run the following EXPLAIN command:
    LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT COUNT(*) FROM orders WHERE EXTRACT(YEAR FROM o_orderdate) = 1996 AND o_orderpriority = '1-URGENT';
You will see a long output. Scroll up to your command and you should see the following:
  • 125.
    explain verbose select count(*) from orders as o where EXTRACT(YEAR FROM o.o_orderdate) = 1996 and o.o_orderpriority = '1-URGENT';
    QUERY VERBOSE PLAN:
    Node 1.
      [SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
          -- Estimated Rows = 150, Width = 0, Cost = 0.0 .. 578.6, Conf = 64.0
          Restrictions:
            ((O.O_ORDERPRIORITY = '1-URGENT'::BPCHAR) AND (DATE_PART('YEAR'::"VARCHAR", O.O_ORDERDATE) = 1996))
          Projections:
    Node 2.
      [SPU Aggregate]
    ...
The execution plan of this query consists of two nodes or snippets. First the table is scanned and the WHERE conditions are applied, which can be seen in the Restrictions sub node. Since we use a COUNT(*) the Projections node is empty. Then an Aggregation node is applied to count the rows that are returned by node 1. When we look at the estimated number of rows we can see that it is way off the mark. The PureData System optimizer estimates from its available statistics that only 150 rows are returned by the WHERE conditions. We have seen before that in reality it’s 46014, or roughly 300 times as many.
6. One way to help the optimizer in its estimates is the collection of detailed statistics about the involved tables. Execute the following command to generate detailed statistics about the ORDERS table:
    LABDB(LABADMIN)=> generate statistics on orders;
Since generating full statistics involves a table scan this command may take some time to execute.
7. We will now check if generating statistics has improved the estimates. Execute the EXPLAIN command again:
    LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT COUNT(*) FROM orders WHERE EXTRACT(YEAR FROM o_orderdate) = 1996 AND o_orderpriority = '1-URGENT';
Scroll up to your command and you should now see the following:
    explain verbose select count(*) from orders as o where EXTRACT(YEAR FROM o.o_orderdate) = 1996 and o.o_orderpriority = '1-URGENT';
    QUERY VERBOSE PLAN:
    Node 1.
      [SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
          -- Estimated Rows = 3000, Width = 0, Cost = 0.0 .. 578.6, Conf = 64.0
          Restrictions:
            ((O.O_ORDERPRIORITY = '1-URGENT'::BPCHAR) AND (DATE_PART('YEAR'::"VARCHAR", O.O_ORDERDATE) = 1996))
          Projections:
    Node 2.
      [SPU Aggregate]
    ...
As we can see, the estimated rows of the SELECT query have improved drastically. The optimizer now assumes this WHERE condition will apply to 3000 rows of the ORDERS table. That is still significantly off the true number of about 46000, but better by a factor of 20 than the original estimate of 150.
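The factors quoted above follow directly from the three numbers in this chapter: the actual count of 46014 rows and the two optimizer estimates of 150 and 3000 rows. A minimal sketch of the arithmetic, with misestimate_factor as a made-up helper name:

```python
def misestimate_factor(actual_rows, estimated_rows):
    """How many times larger the real row count is than the estimate."""
    return actual_rows / estimated_rows


actual = 46014                                  # result of the COUNT(*) query
before = misestimate_factor(actual, 150)        # estimate before GENERATE STATISTICS
after = misestimate_factor(actual, 3000)        # estimate after GENERATE STATISTICS
improvement = before / after                    # equals 3000 / 150

print(f"before: {before:.0f}x, after: {after:.1f}x, improvement: {improvement:.0f}x")
```

So the estimate moved from roughly 300 times too low to roughly 15 times too low.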
  • 126. Estimations are very difficult to make. Obviously the optimizer cannot do the actual computation during planning. It relies on current statistics about the involved columns. Statistics include min/max values, distinct values, numbers of null values etc. Some of these statistics are collected on the fly, but the most detailed statistics can be generated manually with the Generate Statistics command. Generating full statistics after loading a table or changing its content significantly is one of the most important administration tasks in PureData System. The PureData System appliance will automatically generate express statistics after many tasks like load operations, and just-in-time statistics during planning. Nevertheless full statistics should be generated on a regular basis.
3 Identifying Join Problems
In the last chapter we have taken a first look at the tables involved in our join query and have improved optimizer estimates by generating statistics on the involved tables. Now we will have a look at the complete execution plan, paying specific attention to the distribution and the join involved. In our example we have a query that doesn’t finish in a reasonable amount of time. It is taking much longer than you would expect from the involved data sizes. We will now analyze why this is the case.
1. Let’s analyze the execution plan for this query using the EXPLAIN VERBOSE command:
    LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM orders AS o, CUSTOMER as c WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996 AND o.o_orderpriority = '1-URGENT' GROUP BY c.c_mktsegment;
You should see the following results (scroll up to your query):
    QUERY VERBOSE PLAN:
    Node 1.
      [SPU Sequential Scan table "CUSTOMER" as "C" {(C.C_CUSTKEY)}]
          -- Estimated Rows = 150000, Width = 10, Cost = 0.0 .. 90.5, Conf = 100.0
          Projections:
            1:C.C_MKTSEGMENT
      [SPU Broadcast]
    Node 2.
      [SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
          -- Estimated Rows = 3000, Width = 8, Cost = 0.0 .. 578.6, Conf = 64.0
          Restrictions:
            ((O.O_ORDERPRIORITY = '1-URGENT'::BPCHAR) AND (DATE_PART('YEAR'::"VARCHAR", O.O_ORDERDATE) = 1996))
          Projections:
            1:O.O_TOTALPRICE
    Node 3.
      [SPU Nested Loop Stream "Node 2" with Temp "Node 1" {(O.O_ORDERKEY)}]
          -- Estimated Rows = 450000007, Width = 18, Cost = 1048040.0 .. 7676127.0, Conf = 64.0
          Restrictions:
            't'::BOOL
          Projections:
            1:C.C_MKTSEGMENT 2:O.O_TOTALPRICE
    Node 4.
      [SPU Group {(C.C_MKTSEGMENT)}]
          -- Estimated Rows = 100, Width = 18, Cost = 1048040.0 .. 7732377.0, Conf = 0.0
          Projections:
            1:C.C_MKTSEGMENT 2:O.O_TOTALPRICE
      [SPU Distribute on {(C.C_MKTSEGMENT)}]
      [SPU Merge Group]
    Node 5.
      [SPU Aggregate {(C.C_MKTSEGMENT)}]
          -- Estimated Rows = 100, Width = 26, Cost = 1048040.0 .. 7732377.0, Conf = 0.0
          Projections:
            1:C.C_MKTSEGMENT 2:(SUM(O.O_TOTALPRICE) / "NUMERIC"(COUNT(O.O_TOTALPRICE)))
      [SPU Return]
      [Host Return]
    ... Removed Plan Text ...
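When a verbose plan has many nodes it can help to pull the per-node estimates into a small table before reading the details. The sketch below does this with regular expressions for the first three nodes of the plan shown above. It assumes the textual layout of this particular EXPLAIN VERBOSE output, so it is an illustration rather than a general plan parser; summarize_plan and the pattern names are invented here.

```python
import re

# First three nodes of the plan above, abbreviated to the lines we parse.
PLAN = '''\
Node 1.
 [SPU Sequential Scan table "CUSTOMER" as "C" {(C.C_CUSTKEY)}]
 -- Estimated Rows = 150000, Width = 10, Cost = 0.0 .. 90.5, Conf = 100.0
Node 2.
 [SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
 -- Estimated Rows = 3000, Width = 8, Cost = 0.0 .. 578.6, Conf = 64.0
Node 3.
 [SPU Nested Loop Stream "Node 2" with Temp "Node 1" {(O.O_ORDERKEY)}]
 -- Estimated Rows = 450000007, Width = 18, Cost = 1048040.0 .. 7676127.0, Conf = 64.0
'''

# "Node N." followed by the bracketed operation description.
NODE_RE = re.compile(r"Node (\d+)\.\s*\[([^\]]*)\]")
# The estimate line that follows each operation.
EST_RE = re.compile(
    r"Estimated Rows = (\d+), Width = (\d+), Cost = ([\d.]+) \.\. ([\d.]+)")


def summarize_plan(plan_text):
    """Return one dict per plan node with its row, width and cost estimates."""
    nodes = []
    for match in NODE_RE.finditer(plan_text):
        est = EST_RE.search(plan_text, match.end())
        if est is None:
            continue
        nodes.append({
            "node": int(match.group(1)),
            "operation": match.group(2),
            "rows": int(est.group(1)),
            "width": int(est.group(2)),
            "max_cost": float(est.group(4)),
        })
    return nodes


for n in summarize_plan(PLAN):
    print(n["node"], n["rows"], n["max_cost"], n["operation"][:30])
```

Sorting such a summary by estimated rows or cost quickly points at the node that deserves a closer look.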
  • 127.
2. First try to answer the following questions through the execution plan yourself. Take your time. We will walk through the answers after that.
Question | Answer
a. Which columns of Table Customer are used in further computations? |
b. Is Table Customer redistributed, broadcast or can it be joined locally? |
c. Is Table Orders redistributed, broadcast or can it be joined locally? |
d. In which node are the WHERE conditions applied and how many rows does PureData System expect to fulfill the WHERE condition? |
e. What kind of join takes place and in which node? |
f. What is the number of estimated rows for the join? |
g. What is the most expensive node and why? |
Hint: a stream operation in PureData System Explain is a join whose output isn’t persisted on disk but streamed to further computation nodes or snippets.
3. So let’s walk through the questions:
a. Which columns of Table Customer are used in further computations?
    Node 1.
      [SPU Sequential Scan table "CUSTOMER" as "C" {}]
          -- Estimated Rows = 150000, Width = 10, Cost = 0.0 .. 90.5, Conf = 100.0
          Projections:
            1:C.C_MKTSEGMENT
      [SPU Broadcast]
The first node in the execution plan does a sequential scan of the CUSTOMER table on the SPUs. It estimates that 150000 rows are returned, which we know is the number of rows in the CUSTOMER table. The statement that tells us which columns are used in further computations is the “Projections:” clause. We can see that only the C_MKTSEGMENT column is carried on from the CUSTOMER table. All other columns are thrown away. Since C_MKTSEGMENT is a CHAR(10) column the returned result set has a width of 10.
b. Is Table Customer redistributed, broadcast or can it be joined locally?
During the scan the table is broadcast to the other SPUs. This means that a complete CUSTOMER table is assembled on the host and broadcast to each SPU for further computation of the query. This may seem surprising at first since we have a substantial number of rows. But since the width of the result set is only 10 we are talking about 150000 rows * 10 bytes = 1.5 MB. This is almost nothing for a warehousing system.
    Node 2.
      [SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
          -- Estimated Rows = 3000, Width = 8, Cost = 0.0 .. 578.6, Conf = 64.0
          Restrictions:
            ((O.O_ORDERPRIORITY = '1-URGENT'::BPCHAR) AND (DATE_PART('YEAR'::"VARCHAR", O.O_ORDERDATE) = 1996))
          Projections:
            1:O.O_TOTALPRICE
  • 128.
c. Is table ORDERS redistributed, broadcast, or can it be joined locally?
The second node of the execution plan does a scan of the ORDERS table. One column, O_TOTALPRICE, is projected and used in further computations. We cannot see any distribution or broadcast clauses, so this table can be joined locally. This is possible because the CUSTOMER table is broadcast to all SPUs. If one table of a join is broadcast, the other table doesn’t need any redistribution.

d. In which node are the WHERE conditions applied, and how many rows does PureData System expect to fulfill the WHERE condition?
We can see in the “Restrictions” clause that the WHERE conditions of our query are applied in the second node as well. This should be clear since both WHERE conditions refer to the ORDERS table and can be evaluated during the scan of that table. As we can see in the “Estimated Rows” clause, the optimizer estimates a returned set of 3000 rows, which is not accurate: in reality 46014 rows are returned from this table.

e. What kind of join takes place, and in which node?
The third node of our execution plan contains the join between the two tables. It is a Nested Loop Join, which means that every row of the first join set is compared to each row of the second join set. If the join condition holds true, the joined row is added to the result set. This can be a very efficient join for small tables, but for large tables its cost grows with the product of the two input sizes, so it is generally slower than, for example, a Hash Join. The Hash Join, though, cannot be used for inequality join conditions, floating point join keys, etc.

Node 3. [SPU Nested Loop Stream "Node 2" with Temp "Node 1" {(O.O_ORDERKEY)}]
  -- Estimated Rows = 450000007, Width = 18, Cost = 1048040.0 .. 7676127.0, Conf = 64.0
  Restrictions: 't'::BOOL
  Projections: 1:C.C_MKTSEGMENT 2:O.O_TOTALPRICE

f. What is the number of estimated rows for the join?
We can see in the “Estimated Rows” clause that the optimizer estimates this join node to return roughly 450 million rows, which is the number of rows from the first node times the number of rows from the second node (150000 * 3000).

g. What is the most expensive node, and why?
As we can see from the “Cost” clause, the optimizer estimates that the join has a cost in the range 1048040.0 .. 7676127.0. This is roughly 1,800 to 13,000 times the upper cost estimate of Node 2 (578.6), and Node 1 is cheaper still. Nodes 4 and 5, which group and aggregate the result set, add comparatively little cost. So our performance problems clearly originate in the join in Node 3.

So what is happening here? If we take a look at the query, we can assume that it is intended to compute the average order cost per market segment. This means we should join all customers to their corresponding order rows. But for this to happen we would need a join condition that joins the CUSTOMER table and the ORDERS table on the customer key. Instead the query performs a Cartesian join, joining each customer row to each orders row. This is a very work-intensive query that results in the behavior we have seen. The joined result set becomes huge, and the query even returns results that do not answer the intended question.

4. So how do we fix this? By adding a join condition to the query that makes sure that customers are only joined to their own orders. This additional join condition is O.O_CUSTKEY=C.C_CUSTKEY. Execute the following EXPLAIN command for the modified query.
  • 129.
LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT c.c_mktsegment, AVG(o.o_totalprice)
                  FROM orders AS o, CUSTOMER as c
                  WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996
                    AND o.o_orderpriority = '1-URGENT'
                    AND o.o_custkey = c.c_custkey
                  GROUP BY c.c_mktsegment;

You should see the following results. Scroll up to your query to see the scan and join nodes.

QUERY VERBOSE PLAN:
Node 1. [SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
  -- Estimated Rows = 3000, Width = 12, Cost = 0.0 .. 578.6, Conf = 64.0
  Restrictions: ((O.O_ORDERPRIORITY = '1-URGENT'::BPCHAR) AND (DATE_PART('YEAR'::"VARCHAR", O.O_ORDERDATE) = 1996))
  Projections: 1:O.O_TOTALPRICE 2:O.O_CUSTKEY
  Cardinality: O.O_CUSTKEY 3.0K (Adjusted)
  [SPU Distribute on {(O.O_CUSTKEY)}]
  [HashIt for Join]
Node 2. [SPU Sequential Scan table "CUSTOMER" as "C" {(C.C_CUSTKEY)}]
  -- Estimated Rows = 150000, Width = 14, Cost = 0.0 .. 90.5, Conf = 100.0
  Projections: 1:C.C_MKTSEGMENT 2:C.C_CUSTKEY
Node 3. [SPU Hash Join Stream "Node 2" with Temp "Node 1" {(C.C_CUSTKEY,O.O_CUSTKEY)}]
  -- Estimated Rows = 150000, Width = 18, Cost = 578.6 .. 746.7, Conf = 51.2
  Restrictions: (C.C_CUSTKEY = O.O_CUSTKEY)
  Projections: 1:C.C_MKTSEGMENT 2:O.O_TOTALPRICE
  Cardinality: O.O_CUSTKEY 100 (Adjusted)

As you can see, there have been some changes to the execution plan. The ORDERS table is now scanned first and redistributed on the customer key. The CUSTOMER table is already distributed on the customer key, so no redistribution is needed for it. Both tables are then joined in Node 3 through a Hash Join on the customer key. The estimated number of rows is now 150000, the same as the number of customers. Since we have a 1:n relationship between customers and orders, this is as we would expect. The estimated cost of Node 3 has also come down significantly, to 578.6 .. 746.7.

5. Let’s make sure that the query performance has indeed improved. Switch on the display of elapsed query time with the following command:

LABDB(LABADMIN)=> \time

If you want, you can later switch off the elapsed time display by executing the same command again. It is a toggle.

6. Now execute our modified query:

LABDB(LABADMIN)=> SELECT c.c_mktsegment, AVG(o.o_totalprice)
                  FROM orders AS o, CUSTOMER as c
                  WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996
                    AND o.o_orderpriority = '1-URGENT'
                    AND o.o_custkey = c.c_custkey
                  GROUP BY c.c_mktsegment;

You should see the following results:
  • 130.
LABDB(LABADMIN)=> SELECT c.c_mktsegment, AVG(o.o_totalprice)
                  FROM orders AS o, CUSTOMER as c
                  WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996
                    AND o.o_orderpriority = '1-URGENT'
                    AND o.o_custkey = c.c_custkey
                  GROUP BY c.c_mktsegment;
 C_MKTSEGMENT |      AVG
--------------+---------------
 HOUSEHOLD    | 150196.009267
 BUILDING     | 151275.977882
 AUTOMOBILE   | 151488.825830
 MACHINERY    | 151348.971079
 FURNITURE    | 150998.129771
(5 rows)
Elapsed time: 0m1.129s

Before we made our changes, the query took so long that we couldn’t wait for it to finish. After our changes the execution time has improved to slightly more than a second. In this relatively simple case we might have been able to pinpoint the problem by analyzing the SQL on its own. But this can be almost impossible for the complicated multi-join queries that are often used in warehousing. Reporting and BI tools tend to create very complicated portable SQL as well. In these cases EXPLAIN can be a valuable tool to pinpoint the problem.

4 HTML Explain

In this section we will look at the HTML plangraph for the customer query that we just fixed. Besides the text description of the execution plan we used in the previous chapter, PureData System can also generate a graphical query tree. This is done with the help of HTML, so plangraph files can be created and viewed in your internet browser. PureData System can be configured to save an HTML plangraph or plantext file for every executed SQL query. But in this chapter we will use the basic EXPLAIN PLANGRAPH command and use copy and paste to export the file to your host computer.

1. Enter the query with the keyword EXPLAIN PLANGRAPH to generate the HTML plangraph:

LABDB(LABADMIN)=> EXPLAIN PLANGRAPH SELECT c.c_mktsegment, AVG(o.o_totalprice)
                  FROM orders AS o, CUSTOMER as c
                  WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996
                    AND o.o_orderpriority = '1-URGENT'
                    AND o.o_custkey = c.c_custkey
                  GROUP BY c.c_mktsegment;

You will get a long print output of the HTML file content on your screen:
  • 131.
LABDB(LABADMIN)=> EXPLAIN PLANGRAPH SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM orders AS o, CUSTOMER as c WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996 AND o.o_orderpriority = '1-URGENT' AND o.o_custkey = c.c_custkey GROUP BY c.c_mktsegment;
NOTICE: QUERY PLAN:
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Generator" content="Netezza Performance Server">
<meta http-equiv="Author" content="Babu Tammisetti <btammisetti@netezza.com>">
<style>
v:* {behavior:url(#default#VML);}
</style>
</head>
<body lang="en-US">
<pre style="font:normal 68% verdana,arial,helvetica;background:#EEEEEE;margin-top:1em;margin-bottom:1em;margin-left:0px;padding:5pt;">
EXPLAIN PLANGRAPH SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM orders AS o, CUSTOMER as c WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996 AND o.o_orderpriority = '1-URGENT' AND o.o_custkey = c.c_custkey GROUP BY c.c_mktsegment;
</pre>
<v:textbox style="position:absolute;margin-left:230pt;margin-top:19pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">AGG<br/>r=100 w=26 s=2.5KB</p></v:textbox>
<v:oval style="position:absolute;margin-left:231pt;margin-top:15pt;width:78pt;height:25pt;z-index:9;"></v:oval>
<v:textbox style="position:absolute;margin-left:230pt;margin-top:0pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">snd,ret</p></v:textbox>
<v:textbox style="position:absolute;margin-left:230pt;margin-top:54pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">GROUP<br/>r=100 w=18 s=1.8KB</p></v:textbox>
<v:oval style="position:absolute;margin-left:231pt;margin-top:50pt;width:78pt;height:25pt;z-index:9;"></v:oval>
<v:line style="position:absolute;z-index:8;" from="270pt,27pt" to="270pt,62pt"/>
<v:textbox style="position:absolute;margin-left:233pt;margin-top:42pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">dst,m-grp</p></v:textbox>
<v:textbox style="position:absolute;margin-left:230pt;margin-top:89pt;width:80pt;height:31pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">HASHJOIN<br/>r=150.0K w=18 s=2.6MB<br/>(C_CUSTKEY = O_CUSTKEY)</p></v:textbox>
<v:oval style="position:absolute;margin-left:231pt;margin-top:85pt;width:78pt;height:31pt;z-index:9;"></v:oval>
<v:line style="position:absolute;z-index:8;" from="270pt,62pt" to="270pt,100pt"/>
<v:textbox style="position:absolute;margin-left:190pt;margin-top:124pt;width:80pt;height:31pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">SEQSCAN<br/>r=150.0K w=14 s=2.0MB<br/>C</p></v:textbox>
<v:oval style="position:absolute;margin-left:191pt;margin-top:120pt;width:78pt;height:31pt;z-index:9;"></v:oval>
<v:line style="position:absolute;z-index:8;" from="270pt,97pt" to="230pt,135pt"/>
<v:textbox style="position:absolute;margin-left:270pt;margin-top:124pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">HASH<br/>r=3.0K w=12 s=35.2KB</p></v:textbox>
<v:oval style="position:absolute;margin-left:271pt;margin-top:120pt;width:78pt;height:25pt;z-index:9;"></v:oval>
<v:line style="position:absolute;z-index:8;" from="270pt,97pt" to="310pt,132pt"/>
<v:textbox style="position:absolute;margin-left:253pt;margin-top:112pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">dst{(O_CUSTKEY)}</p></v:textbox>
<v:textbox style="position:absolute;margin-left:270pt;margin-top:159pt;width:80pt;height:31pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">SEQSCAN<br/>r=3.0K w=12 s=35.2KB<br/>O</p></v:textbox>
<v:oval style="position:absolute;margin-left:271pt;margin-top:155pt;width:78pt;height:31pt;z-index:9;"></v:oval>
<v:line style="position:absolute;z-index:8;" from="310pt,132pt" to="310pt,170pt"/>
</body>
</html>
EXPLAIN

Next open your host computer’s text editor. If your workstation runs Windows, open Notepad; if you use a Linux desktop, use the default text editor such as KEdit or gedit. Copy the output of the EXPLAIN PLANGRAPH command from your PuTTY window into the editor. Make sure that you only copy the HTML file from the <html… start tag to the </html> end tag.

2. Save the file as “explain.html” on your desktop.
  • 132.
3. Now on your desktop double-click “explain.html”. On Windows make sure to open it with Internet Explorer, since this will result in the best output.

You can see a graphical representation of the query we analyzed before. The left leg of the tree is the scan node of the CUSTOMER table C. The right leg contains a scan of the ORDERS table O and a node hashing the result set from ORDERS in preparation for the HASHJOIN node, which joins the result sets of the two table scans on the customer key. After the join, the result is fed into a GROUP node and an aggregation node that computes the average total price, before being returned to the caller.
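The manual copy in steps 1 and 2 could also be scripted once the terminal output has been captured as text. The following is a minimal Python sketch under that assumption; the function names and the sample capture are hypothetical and are not part of the lab or of any PureData System tooling. It keeps only the text from the `<html` start tag to the `</html>` end tag, as the lab instructs.

```python
import re
import tempfile
from pathlib import Path


def extract_plangraph_html(captured: str) -> str:
    """Return only the text from the '<html' start tag to the '</html>' end tag."""
    match = re.search(r"<html\b.*?</html>", captured, flags=re.DOTALL | re.IGNORECASE)
    if match is None:
        raise ValueError("no <html>...</html> block found in the captured output")
    return match.group(0)


def save_plangraph(captured: str, target: Path) -> Path:
    """Write the extracted HTML to 'target' (for example explain.html)."""
    target.write_text(extract_plangraph_html(captured), encoding="utf-8")
    return target


# Hypothetical, heavily shortened stand-in for a captured nzsql session.
sample_capture = (
    "LABDB(LABADMIN)=> EXPLAIN PLANGRAPH SELECT 1;\n"
    "NOTICE: QUERY PLAN:\n"
    '<html xmlns:v="urn:schemas-microsoft-com:vml">\n'
    "<body><p>AGG</p></body>\n"
    "</html>\n"
    "EXPLAIN\n"
)

html = extract_plangraph_html(sample_capture)

# A temporary directory is used here; in the lab the target would be your desktop.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_plangraph(sample_capture, Path(tmp) / "explain.html")
    saved_text = saved.read_text(encoding="utf-8")
```

The prompt line, the NOTICE line and the trailing EXPLAIN tag are all dropped, which is the same trimming you do by hand when copying from the PuTTY window.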
  • 133.
A graphical representation of the execution plan can be valuable for complicated multi-join queries to get an overview of the joins.

Congratulations! In this lab you have used the PureData System Explain functionality to analyze a query.
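As a recap of the join problem analyzed in this lab, here is a minimal Python sketch of a nested loop join and a hash join on a few made-up rows. The table contents and function names are hypothetical, and the code only models the idea; it is not how PureData System implements its joins. It shows that a join without a join condition produces every combination of rows, and that the resulting averages no longer answer the intended question.

```python
from collections import defaultdict

# Made-up sample rows, not lab data.
customers = [(1, "BUILDING"), (2, "MACHINERY"), (3, "BUILDING")]  # (c_custkey, c_mktsegment)
orders = [(1, 100.0), (1, 300.0), (2, 50.0), (3, 200.0)]          # (o_custkey, o_totalprice)


def nested_loop_join(outer, inner, predicate):
    """Compare every outer row with every inner row; keep pairs where predicate holds."""
    return [(o, i) for o in outer for i in inner if predicate(o, i)]


def hash_join(build, probe, build_key, probe_key):
    """Hash the build side once, then look up each probe row by its key (equality only)."""
    table = defaultdict(list)
    for row in build:
        table[build_key(row)].append(row)
    return [(p, b) for p in probe for b in table.get(probe_key(p), [])]


def avg_price_per_segment(pairs):
    """Average o_totalprice per c_mktsegment over (order, customer) pairs."""
    totals = defaultdict(lambda: [0.0, 0])
    for order, customer in pairs:
        acc = totals[customer[1]]
        acc[0] += order[1]
        acc[1] += 1
    return {segment: total / count for segment, (total, count) in totals.items()}


# No join condition: the predicate is always true, like Restrictions: 't'::BOOL.
cartesian = nested_loop_join(orders, customers, lambda o, c: True)

# With the join condition o_custkey = c_custkey, both algorithms return the same pairs.
keyed_nested = nested_loop_join(orders, customers, lambda o, c: o[0] == c[0])
keyed_hash = hash_join(customers, orders, lambda c: c[0], lambda o: o[0])

print(len(cartesian), len(keyed_hash))
print(avg_price_per_segment(cartesian))
print(avg_price_per_segment(keyed_hash))
```

In the Cartesian case every segment sees every order, so all segments end up with the same overall average. With the join condition each order is counted only for its own customer's segment.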
  • 135. Optimization Objects Hands-On Lab IBM PureData System for Analytics … Powered by Netezza Technology
  • 136.
Table of Contents
1 Introduction
1.1 Objectives
2 Materialized Views
2.1 Wide Tables
2.2 Lookup of small set of rows
3 Cluster Based Tables (CBT)
3.1 Cluster Based Table Usage
3.2 Cluster Based Table Maintenance
  • 137.
1 Introduction

A PureData System appliance is designed to provide excellent performance in most cases without any specific tuning or index creation. One of the key technologies used to achieve this is zone maps: automatically computed and maintained records of the minimum and maximum values stored in each extent of a database table. In general, data is loaded into data warehouses ordered by the time dimension; therefore zone maps have the biggest performance impact on queries that restrict the time dimension as well.

This approach works well for most situations, but PureData System provides additional functionality to enhance specific workloads, which we will use in this lab. We will first use materialized views to enhance the performance of queries against wide tables and of queries that look up only a small set of rows. Then we will use Cluster Based Tables to enhance the performance of queries that use multiple lookup dimensions.

1.1 Objectives

In the last couple of labs we have recreated a customer database in our PureData System. We have picked distribution keys, loaded the data and made some first performance investigations. In this lab we will take a deeper look at some customer queries and try to enhance their performance by tuning the system.

Figure 1 LABDB database

2 Materialized Views

A materialized view is a view of a database table that projects a subset of the base table’s columns and can be sorted on a specific set of the projected columns. When a materialized view is created, the sorted projection of the base table’s data is stored in a materialized table on disk. Materialized views reduce the width of data being scanned in a base table. They are beneficial for wide tables that contain many columns (e.g. 50-500 columns) where typical queries only reference a small subset of the columns.
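To make the zone map idea from the introduction and the effect of a sorted projection concrete, here is a minimal Python sketch. It is only a model under stated assumptions: dates are represented as plain day numbers, an extent is a fixed-size slice of a list, and the function names are hypothetical. It is not how PureData System stores data.

```python
import random


def zone_map(values, extent_size):
    """Split values into fixed-size extents and record (min, max) for each extent."""
    extents = [values[i:i + extent_size] for i in range(0, len(values), extent_size)]
    return [(min(extent), max(extent)) for extent in extents]


def extents_to_scan(zmap, wanted):
    """Return the 1-based numbers of extents whose min/max range could contain 'wanted'."""
    return [n for n, (low, high) in enumerate(zmap, start=1) if low <= wanted <= high]


random.seed(0)
ship_days = random.sample(range(1, 601), k=600)  # hypothetical ship dates in load order
sorted_days = sorted(ship_days)                  # the same values in a sorted projection

unsorted_map = zone_map(ship_days, extent_size=100)
sorted_map = zone_map(sorted_days, extent_size=100)

print("unsorted data, extents to scan:", extents_to_scan(unsorted_map, 250))
print("sorted data, extents to scan:  ", extents_to_scan(sorted_map, 250))
```

With the sorted values only the one extent covering day 250 qualifies. With values in arbitrary order, most or all extents typically have a range that includes the wanted day, so few extents can be skipped.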
  • 138.
Materialized views also provide fast lookups of a single record or a few records. The thin materialized view is automatically substituted by the optimizer for the base table, allowing faster response, particularly for shorter tactical queries that examine only a small segment of the overall database table.

2.1 Wide Tables

In our customer scenario we have a couple of queries that do some basic computations on the LINEITEM table but only touch a small number of columns of the table.

1. Connect to your Netezza image using PuTTY. Log in to 192.168.239.2 as user “nz” with password “nz”. (192.168.239.2 is the default IP address for a local VM; the IP may be different for your Bootcamp.)

2. Enter NZSQL and connect to LABDB as user LABADMIN.

[nz@netezza labs]$ nzsql LABDB LABADMIN

3. The first thing we need to do is to make sure table statistics have been generated, so that the EXPLAIN commands we will be looking at report more accurate estimated query costs. Generate statistics for the ORDERS and LINEITEM tables using the following commands.

LABDB(LABADMIN)=> GENERATE STATISTICS ON ORDERS;
LABDB(LABADMIN)=> GENERATE STATISTICS ON LINEITEM;

4. The following query computes the total quantity of items shipped and their average tax rate for a given month. In this case it is the fourth month, April. Execute the following query:

LABDB(LABADMIN)=> SELECT SUM(L_QUANTITY), AVG(L_TAX) FROM LINEITEM WHERE EXTRACT(MONTH FROM L_SHIPDATE) = 4;

Your results should look similar to the following:

LABDB(LABADMIN)=> SELECT SUM(L_QUANTITY), AVG(L_TAX) FROM LINEITEM WHERE EXTRACT(MONTH FROM L_SHIPDATE) = 4;
     SUM     |   AVG
-------------+----------
 13136228.00 | 0.039974
(1 row)

Notice the EXTRACT(MONTH FROM L_SHIPDATE) expression. EXTRACT can be used to retrieve parts of a date or time column, such as YEAR, MONTH or DAY.

5. Now let’s have a look at the cost of this query. To get the projected cost from the Optimizer we use the following EXPLAIN VERBOSE command:

LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT SUM(L_QUANTITY), AVG(L_TAX) FROM LINEITEM WHERE EXTRACT(MONTH FROM L_SHIPDATE) = 4;

You will see a long output on the screen. Scroll up till you reach the command you just executed. You should see something similar to the following:
  • 139.
QUERY SQL:
EXPLAIN VERBOSE SELECT SUM(L_QUANTITY), AVG(L_TAX) FROM LINEITEM WHERE EXTRACT(MONTH FROM L_SHIPDATE) = 4;
QUERY VERBOSE PLAN:
Node 1. [SPU Sequential Scan table "LINEITEM" {(LINEITEM.L_ORDERKEY)}]
  -- Estimated Rows = 60012, Width = 16, Cost = 0.0 .. 2417.5, Conf = 80.0
  Restrictions: (DATE_PART('MONTH'::"VARCHAR", LINEITEM.L_SHIPDATE) = 4)
  Projections: 1:LINEITEM.L_QUANTITY 2:LINEITEM.L_TAX
Node 2. [SPU Aggregate]
  -- Estimated Rows = 1, Width = 32, Cost = 2440.0 .. 2440.0, Conf = 0.0
  Projections: 1:SUM(LINEITEM.L_QUANTITY) 2:(SUM(LINEITEM.L_TAX) / "NUMERIC"(COUNT(LINEITEM.L_TAX)))
  [SPU Return]
  [HOST Merge Aggs]
  [Host Return]
...

Notice the cost associated with the table scan. In our example it is a value of over 2400.

6. Since this query is run very frequently, we want to enhance the scanning performance. And since it only uses 3 of the 16 LINEITEM columns, we have decided to create a materialized view covering these three columns. This should significantly increase scan speed since only a small subset of the data needs to be scanned. To create the materialized view THINLINEITEM execute the following command:

LABDB(LABADMIN)=> CREATE MATERIALIZED VIEW THINLINEITEM AS SELECT L_QUANTITY, L_TAX, L_SHIPDATE FROM LINEITEM;

This command can take several minutes since we effectively create a copy of the three columns of the table.

7. Repeat the explain call from step 5. Execute the following command:

LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT SUM(L_QUANTITY), AVG(L_TAX) FROM LINEITEM WHERE EXTRACT(MONTH FROM L_SHIPDATE) = 4;

Again scroll up till you reach your command. The results should now look like the following:
  • 140.
QUERY SQL:
EXPLAIN VERBOSE SELECT SUM(L_QUANTITY), AVG(L_TAX) FROM LINEITEM WHERE EXTRACT(MONTH FROM L_SHIPDATE) = 4;
QUERY VERBOSE PLAN:
Node 1. [SPU Sequential Scan mview "_MTHINLINEITEM" {(LINEITEM.L_ORDERKEY)}]
  -- Estimated Rows = 511888, Width = 16, Cost = 0.0 .. 174.1, Conf = 90.0 [MV: MaxPages=136 TotalPages=544] [BT: MaxPages=549 TotalPages=2193] (JIT-Stats)
  Restrictions: (DATE_PART('MONTH'::"VARCHAR", LINEITEM.L_SHIPDATE) = 4)
  Projections: 1:LINEITEM.L_QUANTITY 2:LINEITEM.L_TAX
Node 2. [SPU Aggregate]
  -- Estimated Rows = 1, Width = 32, Cost = 366.0 .. 366.0, Conf = 0.0
  Projections: 1:SUM(LINEITEM.L_QUANTITY) 2:(SUM(LINEITEM.L_TAX) / "NUMERIC"(COUNT(LINEITEM.L_TAX)))
  [SPU Return]
  [HOST Merge Aggs]
  [Host Return]
...

Notice that the PureData System Optimizer has automatically replaced the LINEITEM table with the view THINLINEITEM. We didn’t need to make any changes to the query. Also notice that the expected scan cost has been reduced to 174, which is less than 10% of the original.

As you have seen, when you have wide database tables and queries that touch only a subset of their columns, a materialized view of the hot columns can significantly increase performance for these queries without any changes to the executed queries.

2.2 Lookup of small set of rows

Materialized views not only reduce the width of tables; they can also be used in a similar way to indexes to increase the speed of queries that only access a very limited set of rows.

1. First we drop the view we used in the last chapter with the following command:

LABDB(LABADMIN)=> DROP VIEW THINLINEITEM;

2. The following query returns the number of returned shipments vs. total shipments for a specific shipping day. Execute the following command:

LABDB(LABADMIN)=> SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET, COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';

You should have a similar result to the following:
  • 141.
LABDB(LABADMIN)=> SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET, COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
 RET | TOTAL
-----+-------
 176 |  2550
(1 row)

You can see that on 15 June 1995 there were 176 returned shipments out of a total of 2550. Notice the use of the CASE expression to change the L_RETURNFLAG column into a 0/1 value, which is easily countable.

3. We will now take a look at the underlying data distribution of the LINEITEM table and its zone map values. To do this, exit the NZSQL console by executing the \q command.

4. In our demo image we have installed the PureData System support tools. You can normally find them as an installation package in /nz on your PureData System appliances, or you can retrieve them from IBM support. One of these tools is the nz_zonemap tool, which returns detailed information about the zone map values associated with a given database table. First let’s have a look at the zone mappable columns of the LINEITEM table. Execute the following command:

[nz@netezza ~]$ nz_zonemap LABDB LINEITEM

You should get the following result:

[nz@netezza ~]$ nz_zonemap LABDB LINEITEM
Database: LABDB
Object Name: LINEITEM
Object Type: TABLE
Object ID : 243252
The zonemappable columns are:
 Column # |  Column Name  | Data Type
----------+---------------+-----------
        1 | L_ORDERKEY    | INTEGER
        2 | L_PARTKEY     | INTEGER
        3 | L_SUPPKEY     | INTEGER
        4 | L_LINENUMBER  | INTEGER
       11 | L_SHIPDATE    | DATE
       12 | L_COMMITDATE  | DATE
       13 | L_RECEIPTDATE | DATE
(7 rows)

This command returns an overview of the zonemappable columns of the LINEITEM table in the LABDB database. Seven of the sixteen columns have zone maps created for them. Zonemappable columns include integer and date data types. We see that the L_SHIPDATE column, which we use in the WHERE condition of the customer query, is zonemappable.

5. Now we will have a look at the zone map values for the L_SHIPDATE column. Execute the following command:

[nz@netezza ~]$ nz_zonemap LABDB LINEITEM L_SHIPDATE

This command returns a list of all extents that make up the LINEITEM table and the minimum and maximum values of the data in the L_SHIPDATE column for each extent. Your results should look like the following:
  • 142.
[nz@netezza ~]$ nz_zonemap LABDB LINEITEM L_SHIPDATE
Database: LABDB
Object Name: LINEITEM
Object Type: TABLE
Object ID : 243252
Data Slice: 1
Column 1: L_SHIPDATE (DATE)
 Extent # | L_SHIPDATE (Min) | L_SHIPDATE (Max) | ORDER'ed
----------+------------------+------------------+----------
        1 | 1992-01-04       | 1998-11-29       |
        2 | 1992-01-06       | 1998-11-30       |
        3 | 1992-01-03       | 1998-11-28       |
        4 | 1992-01-02       | 1998-11-29       |
        5 | 1992-01-04       | 1998-11-29       |
        6 | 1992-01-03       | 1998-11-28       |
        7 | 1992-01-04       | 1998-11-29       |
        8 | 1992-01-04       | 1998-11-30       |
        9 | 1992-01-07       | 1998-12-01       |
       10 | 1992-01-03       | 1998-11-28       |
       11 | 1992-01-05       | 1998-11-27       |
       12 | 1992-01-03       | 1998-12-01       |
       13 | 1992-01-03       | 1998-11-30       |
       14 | 1992-01-04       | 1998-11-30       |
       15 | 1992-01-06       | 1998-11-27       |
       16 | 1992-01-03       | 1998-11-30       |
       17 | 1992-01-02       | 1998-11-29       |
       18 | 1992-01-07       | 1998-11-29       |
       19 | 1992-01-04       | 1998-11-30       |
       20 | 1992-01-04       | 1998-11-30       |
       21 | 1992-01-03       | 1998-11-30       |
       22 | 1992-01-04       | 1998-11-29       |
       23 | 1992-01-02       | 1998-11-26       |
(23 rows)

You can see that the LINEITEM table consists of 23 extents of data (3MB chunks on each dataslice). We can also see the minimum and maximum values of the L_SHIPDATE column in each extent. These values are stored in the zone map and automatically updated when rows are inserted, updated or deleted. If a query has a WHERE condition on the L_SHIPDATE column that falls outside the data range of an extent, the whole extent can be discarded by PureData System without scanning it. In this case the ship dates are spread evenly across all extents: every extent covers almost the full date range. This means that our query, which has a WHERE condition on 15 June 1995, doesn’t profit from the zone maps and requires a full table scan. Not a single extent can be safely ruled out.

6. Enter the NZSQL console again by entering the nzsql labdb labadmin command.

7. We will now create a materialized view that is ordered on the L_SHIPDATE column. Execute the following command:

LABDB(LABADMIN)=> CREATE MATERIALIZED VIEW SHIPLINEITEM AS SELECT L_SHIPDATE FROM LINEITEM ORDER BY L_SHIPDATE;

Note that our customer query has a WHERE condition on the L_SHIPDATE column but aggregates the L_RETURNFLAG column. Nevertheless, we didn’t add the L_RETURNFLAG column to the materialized view. We could have done so to enhance the performance of our specific query even more. But in this case we assume that there are lots of customer queries which are restricted on the ship date and access different columns of the LINEITEM table. A materialized view retains the information about the location of a parent row in the base table and can be used for lookups even if columns of the parent table are accessed in the SELECT clause.

  • 143.
You can specify more than one order column. In that case rows are ordered by the first column; rows with equal values in the first column are ordered by the next column, and so on. In general only the first order column provides a significant impact on performance.

8. Let’s have a look at the zone map of the newly created view. Leave the NZSQL console again with the \q command.

9. Display the zone map values of the materialized view SHIPLINEITEM with the following command:

[nz@netezza ~]$ nz_zonemap LABDB SHIPLINEITEM L_SHIPDATE

The results should look like the following:

[nz@netezza ~]$ nz_zonemap LABDB SHIPLINEITEM L_SHIPDATE
Database: LABDB
Object Name: SHIPLINEITEM
Object Type: MATERIALIZED VIEW
Object ID : 252077
Data Slice: 1
Column 1: L_SHIPDATE (DATE)
 Extent # | L_SHIPDATE (Min) | L_SHIPDATE (Max) | ORDER'ed
----------+------------------+------------------+----------
        1 | 1992-01-02       | 1993-04-11       |
        2 | 1993-04-11       | 1994-05-24       | TRUE
        3 | 1994-05-24       | 1995-07-03       | TRUE
        4 | 1995-07-03       | 1996-08-14       | TRUE
        5 | 1996-08-14       | 1997-09-24       | TRUE
        6 | 1997-09-24       | 1998-12-01       | TRUE
(6 rows)

We can make a couple of observations here. First, the materialized view is significantly smaller than the base table, since it only contains one column. We can also see that the data values in the extents are ordered on the L_SHIPDATE column. This means that for our query, which accesses data from 15 June 1995, only extent 3 needs to be accessed at all, since only this extent has a data range that contains this date value.

10. Now let’s verify that our materialized view is indeed used for this query. Enter the NZSQL console by entering the following command: nzsql labdb labadmin

11. Use the EXPLAIN command again to verify that our materialized view is used by the Optimizer:

LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET, COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';

You will see a long text output; scroll up till you find the command you just executed. Your result should look like the following:
  • 144. IBM Software Information Management IBM PureData System for Analytics © Copyright IBM Corp. 20112 All rights reserved Page 10 of 17 Notice that the Optimizer has automatically changed the table scan to a scan of the view SHIPLINEITEM we just created. This is possible even though the projection is taking place on column L_RETURNFLAG of the base table. 12. In some cases you might want to disable or suspend an associated materialized view. For troubleshooting or administrative tasks on the base table. For these cases use the following command to suspend the view: 13. We want to make sure that the view is not used anymore during query execution. Execute the EXPLAIN command for our query again: Scroll up till you see your explain query. With the view suspended we can see that the optimizer again scans the original table LINEITEM. EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET, COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15'; QUERY VERBOSE PLAN: Node 1. [SPU Sequential Scan table "LINEITEM" {(LINEITEM.L_ORDERKEY)}] -- Estimated Rows = 60012, Width = 1, Cost = 0.0 .. 2417.5, Conf = 80.0 Restrictions: ... LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET, COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15'; LABDB(LABADMIN)=> ALTER VIEW SHIPLINEITEM MATERIALIZE SUSPEND; EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET, COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15'; QUERY VERBOSE PLAN: Node 1. [SPU Sequential Scan mview index "_MSHIPLINEITEM" {(LINEITEM.L_ORDERKEY)}] -- Estimated Rows = 2193, Width = 1, Cost = 0.0 .. 61.7, Conf = 90.0 [MV: MaxPages=24 TotalPages=24] [BT: MaxPages=549 TotalPages=2193] (JIT-Stats) Restrictions: (LINEITEM.L_SHIPDATE = '1995-06-15'::DATE) Projections: 1:LINEITEM.L_RETURNFLAG Node 2. [SPU Aggregate] -- Estimated Rows = 1, Width = 24, Cost = 62.2 .. 
62.2, Conf = 0.0 Projections: 1:SUM(CASE WHEN (LINEITEM.L_RETURNFLAG <> 'N'::BPCHAR) THEN 1 ELSE 0 END) 2:COUNT(*) [SPU Return] [HOST Merge Aggs] [Host Return] ...
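The extent pruning observed in step 9 can be sketched in a few lines of Python. This is only an illustration of the idea behind zone maps, not the appliance's implementation: the function name extents_to_scan is hypothetical, the zone map values are copied from the nz_zonemap output for data slice 1, and only the first three extents of the base table are included to keep the listing short.

```python
from datetime import date

# Zone map entries (extent number, min, max) of the ordered materialized
# view SHIPLINEITEM, copied from the nz_zonemap output for data slice 1.
MVIEW_ZONES = [
    (1, date(1992, 1, 2), date(1993, 4, 11)),
    (2, date(1993, 4, 11), date(1994, 5, 24)),
    (3, date(1994, 5, 24), date(1995, 7, 3)),
    (4, date(1995, 7, 3), date(1996, 8, 14)),
    (5, date(1996, 8, 14), date(1997, 9, 24)),
    (6, date(1997, 9, 24), date(1998, 12, 1)),
]

# First three extents of the unordered base table LINEITEM (23 in total).
BASE_ZONES = [
    (1, date(1992, 1, 4), date(1998, 11, 29)),
    (2, date(1992, 1, 6), date(1998, 11, 30)),
    (3, date(1992, 1, 3), date(1998, 11, 28)),
]


def extents_to_scan(zones, value):
    """Return the extents whose [min, max] range may contain the value."""
    return [number for number, low, high in zones if low <= value <= high]


target = date(1995, 6, 15)
print("materialized view:", extents_to_scan(MVIEW_ZONES, target))  # [3]
print("base table:", extents_to_scan(BASE_ZONES, target))          # [1, 2, 3]
```

Because every extent of the unordered base table spans almost the full date range, none of them can be skipped, while the ordered view leaves a single candidate extent.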
14. Note that we have only suspended our view, not dropped it. We will now reactivate it with the following refresh command:

LABDB(LABADMIN)=> ALTER VIEW SHIPLINEITEM MATERIALIZE REFRESH;

This command can also be used to reorder materialized views after the base table has changed. While INSERTs, UPDATEs and DELETEs on the base table are automatically reflected in associated materialized views, the view is not reordered for every change. It is therefore advisable to refresh materialized views periodically, especially after major changes to the base table.

15. To check that the Optimizer again uses the materialized view for query execution, execute the following command:

LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET, COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';

Make sure that the Optimizer again uses the materialized view for its first scan operation. The output should look the same as before you suspended the view:

EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET, COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';

QUERY VERBOSE PLAN:

Node 1.
  [SPU Sequential Scan mview index "_MSHIPLINEITEM" {(LINEITEM.L_ORDERKEY)}]
      -- Estimated Rows = 2193, Width = 1, Cost = 0.0 .. 61.7, Conf = 90.0 [MV: MaxPages=24 TotalPages=24] [BT: MaxPages=549 TotalPages=2193] (JIT-Stats)
      Restrictions:
        (LINEITEM.L_SHIPDATE = '1995-06-15'::DATE)
      Projections:
        1:LINEITEM.L_RETURNFLAG
...

16. If you execute the query again you should get the same results as before you created the materialized view. Execute the query again:

LABDB(LABADMIN)=> SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET, COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';

You should see the following output:

 RET | TOTAL
-----+-------
 176 |  2550
(1 row)

There is a defect in our VMware image which in some cases returns the rows from only one data slice instead of all four when a materialized view is used. In that case you will see a TOTAL of 623 instead of 2550 (or a similar number, depending on your data distribution and on which data slice is returned). You can solve this problem by restarting your PureData System database. The problem does not occur on a real PureData System appliance.
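Why the Optimizer can scan SHIPLINEITEM even though the query projects L_RETURNFLAG can be sketched as follows. This is a conceptual model only: the lab text states that a materialized view retains the location of the parent row, but the list-based layout, the sample rows and the lookup function below are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical miniature of LINEITEM in load order: (L_SHIPDATE, L_RETURNFLAG).
base_table = [
    ("1995-06-15", "N"),
    ("1993-02-01", "R"),
    ("1995-06-15", "A"),
    ("1997-11-30", "N"),
    ("1995-06-15", "N"),
]

# Model of the materialized view: only the projected column, sorted, with
# each entry remembering the position of its parent row in the base table.
mview = sorted((shipdate, pos) for pos, (shipdate, _) in enumerate(base_table))
mview_keys = [shipdate for shipdate, _ in mview]


def lookup(shipdate):
    """Find matching entries in the sorted view, then fetch the parent rows."""
    first = bisect_left(mview_keys, shipdate)
    last = bisect_right(mview_keys, shipdate)
    return [base_table[pos] for _, pos in mview[first:last]]


rows = lookup("1995-06-15")
ret = sum(1 for _, flag in rows if flag != "N")
print("RET =", ret, "TOTAL =", len(rows))  # RET = 1 TOTAL = 3
```

The restriction is evaluated on the small, ordered view; only the parent rows that qualify are then read from the base table to obtain the column that the view does not contain.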
You have just created a materialized view to speed up queries that look up small numbers of rows. A materialized view can provide a significant performance improvement and is transparent to end users and applications accessing the database. But it also creates additional overhead during INSERTs, UPDATEs and DELETEs, requires additional disk space, and may require regular maintenance. Materialized views should therefore be used sparingly. In the next chapter we will discuss an alternative approach to speeding up scans on a database table.

3 Cluster Based Tables (CBT)

We have received a set of new customer queries on the ORDERS table that restrict the table not only by order date but also to orders in a given price range. These queries make up a significant part of the system workload, and we will look into ways to increase performance for them. The following query is a template for the queries in question. It returns the aggregated total price of all orders by order priority for a given year (in this case 1996) and price range (in this case between 150000 and 180000).

SELECT O_ORDERPRIORITY, SUM(O_TOTALPRICE) FROM ORDERS WHERE EXTRACT(YEAR FROM O_ORDERDATE) = 1996 AND O_TOTALPRICE > 150000 AND O_TOTALPRICE <= 180000 GROUP BY O_ORDERPRIORITY;

In this example we have a very restrictive WHERE condition on two columns, O_ORDERDATE and O_TOTALPRICE, which can help us to increase performance. The ORDERS table has around 220,000 rows with an order date in 1996 and 160,000 rows in the given price range, but only 20,000 rows that satisfy both conditions.

Materialized views provide their main performance improvement on one column. In addition, INSERTs into the ORDERS table are frequent and time critical. We would therefore prefer not to use materialized views and will in this chapter investigate the use of cluster based tables.

Cluster based tables are PureData System tables that are created with an ORGANIZE ON clause. They use a special space filling algorithm to organize a table by up to four columns. Zone maps for a cluster based table provide approximately the same performance increase for all organization columns. This is useful if your query restricts a table on more than one column, or if your workload consists of multiple queries that hit the same table using different columns in their WHERE conditions. In contrast to materialized views no additional disk space is needed, since the base table itself is reordered.

3.1 Cluster Based Table Usage

Cluster based tables are created like normal PureData System database tables. They need to be flagged as a CBT during table creation by specifying up to four organization columns. An existing PureData System table can also be altered at any time to become a cluster based table.

1. We are going to change the create table command for ORDERS to create a new cluster based table called ORDERS_CBT. Exit the NZSQL console by executing the \q command.

2. Switch to the optimization lab directory by executing the following command:

cd /labs/optimizationObjects

3. We have supplied the script for the creation of the ORDERS_CBT table, but we need to add the ORGANIZE ON (O_ORDERDATE, O_TOTALPRICE) clause to create the table as a cluster based table organized on the O_ORDERDATE and O_TOTALPRICE columns. To change the CREATE statement, open the orders_cbt.sql script in the vi editor with the following command:

vi orders_cbt.sql

4. Enter insert mode by pressing "i". The editor should now show "-- INSERT --" in the bottom line.
5. Navigate the cursor to the semicolon ending the statement. Press Enter to move the semicolon to a new line, then enter the line "organize on (o_orderdate, o_totalprice)" before it. Your screen should now look like the following:

create table orders_cbt
(
    o_orderkey integer not null ,
    o_custkey integer not null ,
    o_orderstatus char(1) not null ,
    o_totalprice decimal(15,2) not null ,
    o_orderdate date not null ,
    o_orderpriority char(15) not null ,
    o_clerk char(15) not null ,
    o_shippriority integer not null ,
    o_comment varchar(79) not null
)
distribute on (o_orderkey)
organize on (o_orderdate, o_totalprice);
~
-- INSERT --

6. Exit insert mode by pressing Esc.

7. Type :wq! on the command line and press Enter to save and exit without further prompts.

8. Create and load the orders_cbt table by executing the following script:

./create_orders_test.sh

9. This may take a couple of minutes because of our virtualized environment. You may see an error message that the table orders_cbt does not exist. This is expected, since the script first tries to clean up an existing orders_cbt table.

10. We will now have a look at how Netezza has organized the data in this table. For this we use the nz_zonemap utility again. Execute the following command:

[nz@netezza optimizationObjects]$ nz_zonemap labdb orders_cbt

You will get the following result:

Database: LABDB
Object Name: ORDERS_CBT
Object Type: TABLE
Object ID : 264428

The zonemappable columns are:

 Column # | Column Name    | Data Type
----------+----------------+---------------
        1 | O_ORDERKEY     | INTEGER
        2 | O_CUSTKEY      | INTEGER
        4 | O_TOTALPRICE   | NUMERIC(15,2)
        5 | O_ORDERDATE    | DATE
        8 | O_SHIPPRIORITY | INTEGER
(5 rows)

This command shows you the zone mappable columns of the ORDERS_CBT table. If you compare it with the output of the nz_zonemap tool for the ORDERS table, you will see that it contains the additional column O_TOTALPRICE. Numeric columns are not zone mapped by default for performance reasons, but zone maps are created for them if they are organization columns.

11. Execute the following command to see the zone map values of the O_ORDERDATE column:

[nz@netezza optimizationObjects]$ nz_zonemap labdb orders_cbt o_orderdate

You will get the following results:

Database: LABDB
Object Name: ORDERS_CBT
Object Type: TABLE
Object ID : 264428
Data Slice: 1
Column 1: O_ORDERDATE (DATE)

 Extent # | O_ORDERDATE (Min) | O_ORDERDATE (Max) | ORDER'ed
----------+-------------------+-------------------+----------
        1 | 1992-01-01        | 1998-08-02        |
        2 | 1992-01-01        | 1998-08-02        |
        3 | 1992-01-01        | 1998-08-02        |
        4 | 1992-01-01        | 1998-08-02        |
        5 | 1992-01-01        | 1998-08-02        |
        6 | 1992-01-01        | 1998-08-02        |
        7 | 1992-01-01        | 1998-08-02        |
(7 rows)

This is unexpected. Since we used O_ORDERDATE as an organization column we would have expected some kind of order in the data values, but they are again distributed equally over all extents. The reason is that the organization process takes place during a command called groom. Instead of creating a new table we could also have altered the existing ORDERS table to become a cluster based table. Creating a table as a cluster based table, or altering it to become one, does not change the physical table layout until the groom command has been run.
The groom command will be covered in detail in the following presentation and lab, but we will use it in the next section to reorganize the table.

3.2 Cluster Based Table Maintenance

When a table is created as a cluster based table in Netezza, the data is not organized at load time. In addition, similar to ordered materialized views, a cluster based table can become partially unordered through INSERTs, UPDATEs and DELETEs. A threshold is defined for reorganization, and the groom command can be used at any time to reorganize a cluster based table based on its organization keys.

1. To organize the table you created in the last section you need to switch to the NZSQL console again. Execute the following command:

nzsql labdb labadmin

2. Execute the following command to groom your cluster based table:

LABDB(LABADMIN)=> groom table orders_cbt;

This command does a variety of things which will be covered in a later presentation and lab. In this case it organizes the cluster based table based on its organization keys. The command requires a lot of RAM on the SPUs. Our VMware systems have been tuned so that the command should be able to finish. Since the whole table is reordered it may take a couple of minutes, but if you get the impression that the system is stuck please inform the lecturer.

3. Let's have a look at the data organization in the table. To do this quit the NZSQL console with the \q command.

4. Review the zone maps of the two organization columns by executing the following command:

[nz@netezza optimizationObjects]$ nz_zonemap labdb orders_cbt o_orderdate o_totalprice

Your results should look like the following (we removed the ORDER'ed columns from the output to make it easier to read):

Database: LABDB
Object Name: ORDERS_CBT
Object Type: TABLE
Object ID : 264428
Data Slice: 1
Column 1: O_ORDERDATE (DATE)
Column 2: O_TOTALPRICE (NUMERIC(15,2))

 Extent # | O_ORDERDATE (Min) | O_ORDERDATE (Max) | O_TOTALPRICE (Min) | O_TOTALPRICE (Max)
----------+-------------------+-------------------+--------------------+--------------------
        1 | 1992-01-01        | 1994-06-22        |             912.10 |          144450.63
        2 | 1993-08-27        | 1996-12-08        |             875.52 |          144451.22
        3 | 1996-02-13        | 1998-08-02        |             884.52 |          144446.76
        4 | 1995-04-18        | 1998-08-02        |           78002.23 |          215555.39
        5 | 1993-08-27        | 1998-08-02        |          196595.73 |          530604.44
        6 | 1992-01-01        | 1995-04-18        |          144451.94 |          296228.30
        7 | 1992-01-01        | 1993-08-27        |          196591.22 |          555285.16
(7 rows)
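How a space filling curve can give every organization column a narrower value range per extent can be illustrated with a small Python sketch. The lab text does not name the algorithm the appliance uses, so the Z-order (Morton) curve below is a stand-in chosen purely for illustration; the bucket values, the extent size of 16 rows and all function names are hypothetical.

```python
def interleave_bits(x, y, bits=3):
    """Z-order (Morton) code: bits of x go to even positions, y to odd."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def zone_ranges(rows, extent_size):
    """Cut the rows into extents and return (min, max) per column."""
    ranges = []
    for start in range(0, len(rows), extent_size):
        extent = rows[start:start + extent_size]
        dates = [d for d, _ in extent]
        prices = [p for _, p in extent]
        ranges.append(((min(dates), max(dates)), (min(prices), max(prices))))
    return ranges


# Hypothetical rows: (date bucket, price bucket), each in the range 0..7.
rows = [(d, p) for d in range(8) for p in range(8)]

# Organize the rows along the curve instead of sorting on a single column.
organized = sorted(rows, key=lambda row: interleave_bits(row[0], row[1]))

for number, zone in enumerate(zone_ranges(organized, 16), start=1):
    print(number, "date", zone[0], "price", zone[1])
```

In this toy data set each extent ends up covering only half of the date range and half of the price range, so a restriction on either column can rule out extents. A plain sort on the date bucket alone would narrow the date range per extent but leave every extent spanning the full price range.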
You can see that both columns have some form of order now. Our query restricts rows on two ranges:

Condition 1: the year of O_ORDERDATE is 1996
AND
Condition 2: 150000 < O_TOTALPRICE <= 180000

Below we list the minimum and maximum values of each extent and mark (with an X) whether the values contained in the extent overlap with the above conditions.

 Min(Date)  | Max(Date)  | Min(Price) | Max(Price) | Cond 1 | Cond 2 | Both Cond
------------+------------+------------+------------+--------+--------+-----------
 1992-01-01 | 1994-06-22 |     912.10 |  144450.63 |        |        |
 1993-08-27 | 1996-12-08 |     875.52 |  144451.22 |   X    |        |
 1996-02-13 | 1998-08-02 |     884.52 |  144446.76 |   X    |        |
 1995-04-18 | 1998-08-02 |   78002.23 |  215555.39 |   X    |   X    |     X
 1993-08-27 | 1998-08-02 |  196595.73 |  530604.44 |   X    |        |
 1992-01-01 | 1995-04-18 |  144451.94 |  296228.30 |        |   X    |
 1992-01-01 | 1993-08-27 |  196591.22 |  555285.16 |        |        |

As you can see, there are now 4 extents that can contain rows from 1996 and 2 extents that can contain rows in the price range from 150000 to 180000. But only one extent can contain rows that satisfy both conditions, so only this extent needs to be scanned during query execution.

In this scenario we would probably have been able to get similar results with one organization column or a materialized view, but with bigger tables and more extents cluster based tables gain a performance advantage.

Congratulations, you have finished the Optimization Objects lab. In this lab you have created materialized views to speed up scans of wide tables and queries that look up only small numbers of rows. Finally you created a cluster based table and used the groom command to organize it. Throughout the lab you have used the nz_zonemap tool to display zone maps and get a better idea of how data is stored in the Netezza appliance.
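The overlap check for the two conditions in section 3.2 can be reproduced with a short Python sketch. The extent boundaries are copied from the nz_zonemap output for ORDERS_CBT; the helper names are hypothetical, and the code only models the range comparison, not how the appliance evaluates zone maps internally.

```python
from datetime import date

# (extent, min date, max date, min price, max price) for data slice 1.
EXTENTS = [
    (1, date(1992, 1, 1), date(1994, 6, 22), 912.10, 144450.63),
    (2, date(1993, 8, 27), date(1996, 12, 8), 875.52, 144451.22),
    (3, date(1996, 2, 13), date(1998, 8, 2), 884.52, 144446.76),
    (4, date(1995, 4, 18), date(1998, 8, 2), 78002.23, 215555.39),
    (5, date(1993, 8, 27), date(1998, 8, 2), 196595.73, 530604.44),
    (6, date(1992, 1, 1), date(1995, 4, 18), 144451.94, 296228.30),
    (7, date(1992, 1, 1), date(1993, 8, 27), 196591.22, 555285.16),
]


def may_contain_year(low, high, year):
    """Condition 1: the extent's date range touches the given year."""
    return low.year <= year <= high.year


def may_contain_price(low, high, above, up_to):
    """Condition 2: the extent's price range overlaps (above, up_to]."""
    return high > above and low <= up_to


cond1 = [n for n, dlo, dhi, _, _ in EXTENTS if may_contain_year(dlo, dhi, 1996)]
cond2 = [n for n, _, _, plo, phi in EXTENTS
         if may_contain_price(plo, phi, 150000, 180000)]
both = [n for n in cond1 if n in cond2]

print("Cond 1:", cond1)  # [2, 3, 4, 5]
print("Cond 2:", cond2)  # [4, 6]
print("Both:  ", both)   # [4]
```

The result matches the table: four extents qualify for the date condition, two for the price condition, and only extent 4 for both.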
© Copyright IBM Corporation 2011
All Rights Reserved.

IBM Canada
8200 Warden Avenue
Markham, ON
L6G 1C7
Canada

IBM, the IBM logo, ibm.com and Tivoli are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml

Other company, product and service names may be trademarks or service marks of others.

References in this publication to IBM products and services do not imply that IBM intends to make them available in all countries in which IBM operates.

No part of this document may be reproduced or transmitted in any form without written permission from IBM Corporation.

Product data has been reviewed for accuracy as of the date of initial publication. Product data is subject to change without notice. Any statements regarding IBM’s future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

THE INFORMATION PROVIDED IN THIS DOCUMENT IS DISTRIBUTED “AS IS” WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT.

IBM products are warranted according to the terms and conditions of the agreements (e.g. IBM Customer Agreement, Statement of Limited Warranty, International Program License Agreement, etc.) under which they are provided.
IBM Software
Information Management
Groom
Hands-On Lab
IBM PureData System for Analytics … Powered by Netezza Technology

Table of Contents
1 Introduction
1.1 Objectives
2 Transactions
2.1 Insert Transaction
2.2 Update and Delete Transactions
2.3 Aborting Transactions
2.4 Cleaning up
3 Grooming Logically Deleted Rows
4 Performance Benefits of GROOM
5 Changing the Data Type of a Column
1 Introduction

As part of your routine database maintenance activities, you should plan to recover disk space occupied by outdated or deleted rows. In normal PureData System operation, an UPDATE or DELETE of a table row does not remove the physical row from disk. Instead the old row is marked as deleted, together with the transaction id of the deleting transaction, and in the case of an update a new row is created. This approach is called multiversioning. Rows that could potentially be visible to other transactions with an older transaction id are still accessible. Over time, however, the outdated or deleted rows are no longer of interest to any transaction and need to be removed to free up disk space and improve performance. After the rows have been captured in a backup, you can reclaim the space they occupy using the SQL GROOM TABLE command. The GROOM TABLE command does not lock a table while it is running; you can continue to SELECT, UPDATE, and INSERT into the table while the table is being groomed.

1.1 Objectives

In this lab we will use the GROOM command to prepare our tables for the customer. During the course of the POC we have deleted and updated a number of rows. At the end of a POC it is sensible to clean up the system: groom the created tables, generate statistics, and perform other cleanup tasks.

2 Transactions

In this section we will show how transactions can leave logically deleted rows in a table, which later need to be removed with the groom command as an administrative task. We will go through the different transaction types and show you what happens under the covers in a PureData System appliance.

2.1 Insert Transaction

In this section we will add a new row to the REGION table and review the hidden fields that are saved in the database. As you remember from the Transactions presentation, PureData System uses a concept called multiversioning for transactions. Each transaction has its own image of the table and does not influence other transactions. This is done by adding a number of hidden fields to each PureData System table. The most important ones are CREATEXID and DELETEXID. Each PureData System transaction has a unique transaction id that increases with each new transaction.

1. Connect to your Netezza image using PuTTY. Log in to 192.168.239.2 as user "nz" with password "nz". (192.168.239.2 is the default IP address for a local VM; the IP may be different for your Bootcamp.)

2. Start NZSQL with:

nzsql

3. Connect to the database LABDB as user LABADMIN by typing the following command:

SYSTEM(ADMIN)=> \c LABDB LABADMIN

4. Select all rows from the REGION table:

LABDB(LABADMIN)=> SELECT * FROM REGION;

You should see the following output with 4 existing regions:

 R_REGIONKEY | R_NAME | R_COMMENT
-------------+--------+-----------------------------
           2 | sa     | south america
           4 | ap     | asia pacific
           3 | emea   | europe, middle east, africa
           1 | na     | north america
(4 rows)

5. Insert a new row into the REGION table for the region Australia with the following SQL command:

LABDB(LABADMIN)=> INSERT INTO REGION VALUES (5, 'as', 'australia');

6. Now we will again do a select on the REGION table. But this time we will also query the hidden fields CREATEXID, DELETEXID and ROWID:

LABDB(LABADMIN)=> SELECT CREATEXID,DELETEXID,ROWID,* FROM REGION;

You should see the following results:

 CREATEXID | DELETEXID |   ROWID   | R_REGIONKEY | R_NAME | R_COMMENT
-----------+-----------+-----------+-------------+--------+---------------
    365584 |         0 | 163100000 |           5 | as     | australia
    357480 |         0 | 161271001 |           1 | na     | north america
    357480 |         0 | 161271002 |           2 | sa     | south america
    357480 |         0 | 161271000 |           3 | emea   | europe, ...
    357480 |         0 | 161271003 |           4 | ap     | asia pacific
(5 rows)

As you can see we now have five rows in the REGION table. The new row for Australia has the id of the last transaction as CREATEXID and "0" as DELETEXID, since it has not yet been deleted. Other transactions with a lower transaction id that might still be running will not be able to see this new row. Note also that each row has a unique rowid. Rowids do not need to be consecutive, but they are unique across all data slices for one table.

2.2 Update and Delete Transactions

Delete transactions in PureData System do not physically remove rows but update the DELETEXID field of a row to mark it as logically deleted. These logically deleted rows need to be removed regularly with the administrative groom command. Update transactions in PureData System consist of a logical delete of the old row and an insert of a new row with the updated fields.

To show this effectively we need to change a system parameter that allows us to switch off the invisibility lists in PureData System. Note that the parameter we will be using is dangerous and should not be used in a real PureData System environment. There is also a safer environment variable, but it has some restrictions.

1. First we will change the system variable that allows us to see deleted rows in the system. To do this exit the console with \q.

2. Stop the PureData System database with nzstop.
3. Navigate to the system config directory with the following command:

[nz@netezza ~]$ cd /nz/data/config

4. Open the system.cfg file that contains the PureData System configuration with vi:

[nz@netezza config]$ vi system.cfg

5. Enter insert mode by pressing "i". The editor should now show "-- INSERT --" in the bottom line.

6. Navigate the cursor to the end of the last line. Press Enter to create a new line and enter the line host.fpgaAllowXIDOverride=yes. Your screen should now look like the following:

system.enableCompressedTables=false
system.realFpga=no
system.useFpgaPrep=yes
system.enableCompressedTables=yes
system.enclosurePollInterval=0
system.envPollInterval=0
system.esmPollInterval=0
system.hbaPollInterval=0
system.diskPollInterval=0
system.enableCTA2=1
system.enableCTAColumns=1
sysmgr.coreCountWarning=1
sysmgr.coreCountFailover=1
system.emulatorMode=64
system.emulatorThreads=4
host.fpgaAllowXIDOverride=yes
~
~
-- INSERT --

7. Exit insert mode by pressing Esc.

8. Type :wq! on the command line and press Enter to save and exit without further prompts.

9. Start the system again with the nzstart command. Note that on a real PureData System, changing system configuration parameters can be very dangerous and is normally not advisable without PureData System service support.

10. Enter the NZSQL console again with the following command:

[nz@netezza config]$ nzsql labdb labadmin

11. Now we will update the row we inserted into the REGION table in the last section:

LABDB(LABADMIN)=> UPDATE REGION SET R_COMMENT='Australia' WHERE R_REGIONKEY=5;
12. Do a SELECT on the REGION table again:

LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID,* FROM REGION;

You should see the following output:

 CREATEXID | DELETEXID |   ROWID   | R_REGIONKEY | R_NAME | R_COMMENT
-----------+-----------+-----------+-------------+--------+---------------
    357480 |         0 | 161271003 |           4 | ap     | asia pacific
    357480 |         0 | 161271000 |           3 | emea   | europe, ...
    357480 |         0 | 161271002 |           2 | sa     | south america
    365584 |    369666 | 163100000 |           5 | as     | australia
    369666 |         0 | 163100000 |           5 | as     | Australia
    357480 |         0 | 161271001 |           1 | na     | north america
(6 rows)

Normally you would now see 5 rows with the updated value. But since we disabled the invisibility lists you now see 6 rows in the REGION table. The transaction that updated the row had the transaction id 369666. You can see that the original row with the lowercase "australia" in the comment column is still there and now has a DELETEXID field that contains the id of the transaction that deleted it. Transactions with a higher transaction id will not see this row, because its DELETEXID shows that it was logically deleted before they started. We also see a newly inserted row with the new comment value 'Australia'. It has the same rowid as the deleted row, and its CREATEXID is the id of the updating transaction.

13. Finally let's clean up the table again by deleting the Australia row:

LABDB(LABADMIN)=> DELETE FROM REGION WHERE R_REGIONKEY=5;

14. Do a SELECT on the REGION table again:

LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID,* FROM REGION;

You should see the following output:

 CREATEXID | DELETEXID |   ROWID   | R_REGIONKEY | R_NAME | R_COMMENT
-----------+-----------+-----------+-------------+--------+---------------
    357480 |         0 | 161271000 |           3 | emea   | europe, ...
    365584 |    369666 | 163100000 |           5 | as     | australia
    369666 |    369670 | 163100000 |           5 | as     | Australia
    357480 |         0 | 161271001 |           1 | na     | north america
    357480 |         0 | 161271003 |           4 | ap     | asia pacific
    357480 |         0 | 161271002 |           2 | sa     | south america
(6 rows)
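The way an insert, an update and a delete stamp the hidden fields in steps 11 to 14 can be modelled with a few lines of Python. This is a simplified sketch under stated assumptions: the Table class, its methods, and the transaction ids (starting at 100 and increasing by 2) are hypothetical and stand in for the real storage engine.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Row:
    createxid: int
    deletexid: int  # 0 means "not deleted"
    rowid: int
    key: int
    comment: str


class Table:
    def __init__(self):
        self.rows = []
        self._xids = count(100, 2)  # hypothetical transaction id source
        self._rowids = count(1)

    def insert(self, key, comment):
        xid = next(self._xids)
        self.rows.append(Row(xid, 0, next(self._rowids), key, comment))

    def delete(self, key):
        # A delete only stamps DELETEXID; the row stays on disk.
        xid = next(self._xids)
        for row in self.rows:
            if row.key == key and row.deletexid == 0:
                row.deletexid = xid

    def update(self, key, comment):
        # An update is a logical delete plus an insert in one transaction;
        # the new version keeps the rowid of the old one.
        xid = next(self._xids)
        for old in [r for r in self.rows if r.key == key and r.deletexid == 0]:
            old.deletexid = xid
            self.rows.append(Row(xid, 0, old.rowid, key, comment))


region = Table()
region.insert(5, "australia")
region.update(5, "Australia")
region.delete(5)
for row in region.rows:
    print(row.createxid, row.deletexid, row.rowid, row.comment)
```

Both versions of the row remain stored: the first carries the id of the updating transaction as DELETEXID, and the second carries that same id as CREATEXID and the id of the deleting transaction as DELETEXID, mirroring the output of step 14.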
We can now see that we have logically deleted our updated row as well. Its DELETEXID field now holds the id of the deleting transaction. New transactions will again see the original table from the start of this lab.

Normally the logically deleted rows are filtered out automatically by the FPGA. If you do a SELECT, the FPGA will remove all rows that:

• have a CREATEXID which is bigger than the current transaction id.
• have a CREATEXID of an uncommitted transaction.
• have a DELETEXID which is smaller than the current transaction id, but only if the transaction of the DELETEXID field is committed.
• have a DELETEXID of "1", which means that the insert has been aborted.

2.3 Aborting Transactions

PureData System never physically deletes a row during a transaction, even if the transaction is rolled back. In this section we will show what happens if a transaction is rolled back. Since an update transaction consists of a delete and an insert, we will demonstrate the behavior for all three transaction types with it.

1. To start a transaction that we can later roll back we need to use the BEGIN keyword. By default all SQL statements entered into the NZSQL console are auto-committed. All SQL statements that are executed after BEGIN belong to a single transaction. To end the transaction use either COMMIT to commit it, or ROLLBACK to roll back all changes made since the BEGIN statement was executed.

LABDB(LABADMIN)=> BEGIN;

2. Update the row for the AP region:

LABDB(LABADMIN)=> UPDATE REGION SET R_COMMENT='AP' WHERE R_REGIONKEY=4;

3. Do a SELECT on the REGION table again:

LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID,* FROM REGION;

You should see the following output:

 CREATEXID | DELETEXID |  ROWID   | R_REGIONKEY | R_NAME | R_COMMENT
-----------+-----------+----------+-------------+--------+-----------------------------
      5160 |         0 | 37801002 |           2 | sa     | south america
      5160 |         0 | 37801001 |           1 | na     | north america
      5172 |      9218 | 38962000 |           5 | as     | australia
      9218 |      9222 | 38962000 |           5 | as     | Australia
      5160 |         0 | 37801000 |           3 | emea   | europe, middle east, africa
      5160 |      9226 | 37801003 |           4 | ap     | asia pacific
      9226 |         0 | 37801003 |           4 | ap     | AP
(7 rows)
  • 159. Note that we have the same results as in the last chapter: the original row for the AP region was logically deleted by updating its DELETEXID field, and a new row with the updated comment has been added (in the output above it carries the same ROWID as the old version). Its CREATEXID is the same as the DELETEXID of the old row, since both were written by the same transaction.

4. Now let's roll back the transaction:

    LABDB(LABADMIN)=> ROLLBACK;

5. Do a SELECT on the REGION table again:

    LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID,* FROM REGION;

You should see the following output:

     CREATEXID | DELETEXID |  ROWID   | R_REGIONKEY | R_NAME | R_COMMENT
    -----------+-----------+----------+-------------+--------+---------------
          5160 |         0 | 37801002 |           2 | sa     | south america
          5160 |         0 | 37801000 |           3 | emea   | Europe …
          5160 |         0 | 37801003 |           4 | ap     | asia pacific
          9226 |         1 | 37801003 |           4 | ap     | AP
          5160 |         0 | 37801001 |           1 | na     | north america
          5172 |      9218 | 38962000 |           5 | as     | australia
          9218 |      9222 | 38962000 |           5 | as     | Australia
    (7 rows)

We can see that the transaction has been rolled back. The DELETEXID of the old version of the row has been reset to “0”, which means that it is a valid row that can be seen by other transactions, and the DELETEXID of the new row has been set to “1”, which marks it as aborted.

2.4 Cleaning up

In this section we will use the GROOM command to remove the logically deleted rows we have created, and we will remove the system parameter from the configuration file. The GROOM command is covered in more detail in the next chapter. It is the main maintenance command in PureData System, and we have already used it in the Clustered Base Table labs to reorder a CBT. It also removes all logically deleted rows from a table and frees up the space on the machine again.

1. Execute the GROOM command on the REGION table:

    LABDB(LABADMIN)=> groom table region;

You should see the following result:

    NOTICE: Groom processed 4 pages; purged 3 records; scan size unchanged; table size unchanged.
    GROOM RECORDS ALL

You can see that the groom command purged 3 rows, exactly the number of aborted and logically deleted rows we generated in the previous chapter.
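The bookkeeping behind the rollback in section 2.3 and the groom in section 2.4 can be reproduced with a few lines of Python. This is a toy model under stated assumptions, not how the appliance implements it: `rollback` and `groom` are hypothetical helper names, rows are plain `(createxid, deletexid, comment)` tuples taken from the lab output, and `groom` simply drops every row whose DELETEXID is not 0, which assumes that all deleting transactions are committed and that no older transaction still needs those rows.

```python
NOT_DELETED = 0  # DELETEXID of a live row version
ABORTED = 1      # DELETEXID that marks an aborted insert


def rollback(rows, xid):
    """Undo transaction `xid` as described in section 2.3."""
    result = []
    for createxid, deletexid, comment in rows:
        if createxid == xid:
            deletexid = ABORTED      # row inserted by the aborted transaction
        elif deletexid == xid:
            deletexid = NOT_DELETED  # row the aborted transaction had deleted
        result.append((createxid, deletexid, comment))
    return result


def groom(rows):
    """Keep only live rows; return them with the number of purged rows."""
    kept = [row for row in rows if row[1] == NOT_DELETED]
    return kept, len(rows) - len(kept)


# REGION row versions while transaction 9226 (the UPDATE) is still open.
rows = [
    (5160, 0, "south america"),
    (5160, 0, "north america"),
    (5172, 9218, "australia"),
    (9218, 9222, "Australia"),
    (5160, 0, "europe, middle east, africa"),
    (5160, 9226, "asia pacific"),
    (9226, 0, "AP"),
]

after_rollback = rollback(rows, 9226)
kept, purged = groom(after_rollback)
print(purged, [comment for _, _, comment in kept])
```

In this model the groom purges three row versions (the two logically deleted "as" rows and the aborted "AP" row), the same count that the NOTICE line in step 1 reports.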
  • 160. 2. Now select the rows from the REGION table again:

    LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID,* FROM REGION;

You should see the following result:

     CREATEXID | DELETEXID |   ROWID   | R_REGIONKEY | R_NAME | R_COMMENT
    -----------+-----------+-----------+-------------+--------+---------------
        357480 |         0 | 161271002 |           2 | sa     | south america
        357480 |         0 | 161271000 |           3 | emea   | europe,...
        369682 |         0 | 164100000 |           1 | na     | north america
        357480 |         0 | 161271003 |           4 | ap     | asia pacific
    (4 rows)

You can see that the groom command has removed all logically deleted rows from the table. Remember that we still have the parameter switched on that allows us to see any logically deleted rows. Especially for tables that change heavily, with lots of updates and deletes, running the groom command frees up hard drive space and increases performance.

3. Finally we will remove the system parameter again. Quit the nzsql console with the \q command.

4. Stop the PureData System database with nzstop.

5. Navigate to the system config directory with the following command:

    [nz@netezza ~]$ cd /nz/data/config

6. Open the system.cfg file, which contains the PureData System configuration, with vi:

    [nz@netezza config]$ vi system.cfg

7. Navigate the cursor to the last line and delete it by pressing “d” twice. Your screen should look like the following:
  • 161.
    system.enableCompressedTables=false
    system.realFpga=no
    system.useFpgaPrep=yes
    system.enableCompressedTables=yes
    system.enclosurePollInterval=0
    system.envPollInterval=0
    system.esmPollInterval=0
    system.hbaPollInterval=0
    system.diskPollInterval=0
    system.enableCTA2=1
    system.enableCTAColumns=1
    sysmgr.coreCountWarning=1
    sysmgr.coreCountFailover=1
    system.emulatorMode=64
    system.emulatorThreads=4
    ~
    ~
    "system.cfg" 16L, 421C

8. Enter :wq! in the command line and press Enter to save and exit without questions.

9. Start the system again with the nzstart command.

We have now returned the system to its original state. Logically deleted rows will again be hidden by the database.

3 Grooming Logically Deleted Rows

In this section we will delete rows and verify that they have not really been deleted from disk. Then we will use groom to physically delete the rows.

1. First determine the physical size on disk of the table ORDERS using the following command:

    [nz@netezza ~]$ nz_db_size LABDB

You should see the following results:

      Object   | Name     |    Bytes    |   KB    | MB  | GB | TB
    -----------+----------+-------------+---------+-----+----+----
     Appliance | netezza  | 769,785,856 | 751,744 | 734 | .7 | .0
     Database  | LABDB    | 761,921,536 | 744,064 | 727 | .7 | .0
     Table     | CUSTOMER |  13,631,488 |  13,312 |  13 | .0 | .0
     Table     | LINEITEM | 588,644,352 | 574,848 | 561 | .5 | .0
     Table     | NATION   |     524,288 |     512 |   1 | .0 | .0
     Table     | ORDERS   |  78,118,912 |  76,288 |  75 | .1 | .0
     Table     | PART     |  12,058,624 |  11,776 |  12 | .0 | .0
     Table     | PARTSUPP |  67,502,080 |  65,920 |  64 | .1 | .0
     Table     | REGION   |     393,216 |     384 |   0 | .0 | .0
     Table     | SUPPLIER |   1,048,576 |   1,024 |   1 | .0 | .0

Notice that the ORDERS table is 75 MB in size.

2. Now we are going to delete some rows from the ORDERS table. Delete all rows where the order status is marked as ‘F’ for finished using the following command:
  • 162.
    [nz@netezza labs]$ nzsql LABDB LABADMIN
    LABDB(LABADMIN)=> DELETE FROM ORDERS WHERE O_ORDERSTATUS='F';

The output should be:

    DELETE 729413

3. Now check the physical table size for ORDERS and see if the size decreased, using the same command as before. You must first exit NZSQL to the shell using \q.

    LABDB(LABADMIN)=> \q
    [nz@netezza ~]$ nz_db_size LABDB

The output should be the same as above, showing that the ORDERS table did not change in size and is still 75 MB. This is because the deleted rows were only logically deleted and are still left on disk. The rows are still accessible to transactions that started before the DELETE statement we just executed (i.e. that have a lower transaction id).

4. Next let’s physically delete what we just logically deleted, using the GROOM TABLE command and specifying the table ORDERS. The GROOM TABLE command removes outdated and deleted records from tables.

    [nz@netezza labs]$ nzsql LABDB LABADMIN
    LABDB(LABADMIN)=> GROOM TABLE ORDERS;

The output should be:

    NOTICE: Groom processed 596 pages; purged 729413 records; scan size shrunk by 288 pages; table size shrunk by 12 extents.
    GROOM RECORDS ALL

You can see that 729413 rows were removed from disk, shrinking the table by 12 extents. Notice that this is the same number of rows we deleted in the previous step.

5. Check whether the ORDERS table size on disk has shrunk using the nz_db_size command. You must first exit NZSQL to the shell using \q.

    LABDB(LABADMIN)=> \q
    [nz@netezza ~]$ nz_db_size LABDB

The output is shown below. Note the reduced size of the ORDERS table:

      Object   | Name     |    Bytes    |   KB    | MB  | GB | TB
    -----------+----------+-------------+---------+-----+----+----
     Appliance | netezza  | 430,833,664 | 420,736 | 411 | .4 | .0
     Database  | LABDB    | 422,969,344 | 413,056 | 403 | .4 | .0
     Table     | CUSTOMER |  13,631,488 |  13,312 |  13 | .0 | .0
     Table     | LINEITEM | 294,256,640 | 287,360 | 281 | .3 | .0
     Table     | NATION   |     524,288 |     512 |   1 | .0 | .0
     Table     | ORDERS   |  40,370,176 |  39,424 |  39 | .0 | .0
     Table     | PART     |   5,242,880 |   5,120 |   5 | .0 | .0
     Table     | PARTSUPP |  67,502,080 |  65,920 |  64 | .1 | .0
     Table     | REGION   |     393,216 |     384 |   0 | .0 | .0
     Table     | SUPPLIER |   1,048,576 |   1,024 |   1 | .0 | .0
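The extent count in the GROOM notice can be cross-checked against the two nz_db_size outputs with a little arithmetic. The Python sketch below is only a worked calculation: the 3 MB extent size is the figure the lab text states, the byte counts and the NOTICE line are copied from the outputs above, and `extents_freed` is a hypothetical helper name.

```python
import re

EXTENT_MB = 3  # extent size as stated in the lab text

notice = ("NOTICE: Groom processed 596 pages; purged 729413 records; "
          "scan size shrunk by 288 pages; table size shrunk by 12 extents.")


def extents_freed(groom_notice):
    """Extract the number of freed extents from a GROOM notice line."""
    match = re.search(r"table size shrunk by (\d+) extents", groom_notice)
    return int(match.group(1)) if match else 0  # "table size unchanged" -> 0


bytes_before = 78_118_912  # ORDERS before the groom (nz_db_size)
bytes_after = 40_370_176   # ORDERS after the groom (nz_db_size)

freed_mb_from_notice = extents_freed(notice) * EXTENT_MB
freed_mb_from_sizes = (bytes_before - bytes_after) / (1024 * 1024)
print(freed_mb_from_notice, freed_mb_from_sizes)
```

Both calculations give 36 MB, so the extent count reported by GROOM matches the size difference reported by nz_db_size.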
  • 163. We can see that GROOM did purge the deleted rows from disk. GROOM reported that the table size was reduced by 12 extents, and we can confirm this because the size of the table shrank by 36 MB, which is the correct size for 12 extents (one extent is 3 MB).

4 Performance Benefits of GROOM

In this section we will show that grooming a table can also improve performance, because the amount of data that needs to be scanned is smaller. Outdated rows are still present on disk. They can be discarded by the FPGA chip, but the system still needs to read them from disk.

In this example we need, for accounting reasons, to increase the price of all orders. This means that we need to update every row in the ORDERS table. We will measure query performance before and after grooming the table.

1. Update the ORDERS table so that the price of everything is increased by $1. Do this using the following command:

    [nz@netezza labs]$ nzsql LABDB LABADMIN
    LABDB(LABADMIN)=> UPDATE ORDERS SET O_TOTALPRICE = O_TOTALPRICE+1;

Output:

    UPDATE 770587

All rows are affected by the update, which doubles the number of physical rows in the table. This is because the update operation leaves a copy of each row as it was before the update, in case a transaction is still operating on it. New rows are created and the results of the UPDATE are written to these rows. The old rows that are left on disk are marked as logically deleted.

2. To measure the performance of our test query, we can configure the NZSQL console to show the elapsed execution time using the following command:

    LABDB(LABADMIN)=> \time

Output:

    Query time printout on

3. Run our test query and note the performance:

    LABDB(LABADMIN)=> SELECT COUNT(*) FROM ORDERS;

Output:

     COUNT
    --------
     770587
    (1 row)

    Elapsed time: 0m0.502s

4. Rerun the query once or twice more to see roughly what a consistent query time is on your machine.

    LABDB(LABADMIN)=> SELECT COUNT(*) FROM ORDERS;

5. Now run the GROOM TABLE command on the ORDERS table again:

    LABDB(LABADMIN)=> GROOM TABLE ORDERS;
  • 164. The output should be:

    LABDB(ADMIN)=> GROOM TABLE ORDERS;
    NOTICE: Groom processed 616 pages; purged 770587 records; scan size shrunk by 308 pages; table size shrunk by 16 extents.
    GROOM RECORDS ALL

Can you tell how much disk space this saved? (It’s the number of extents times 3 MB.)

6. Now run our test query again and you should see a difference in performance:

    LABDB(LABADMIN)=> SELECT COUNT(*) FROM ORDERS;

Output:

     COUNT
    --------
     770587
    (1 row)

    Elapsed time: 0m0.315s

You should see that the query ran faster than before. This is because GROOM reduced the number of rows that must be scanned to complete the query. The COUNT(*) query returns the same number of rows before and after the GROOM command, since it can only see the current version of the table, that is, all rows that have not been deleted by a lower transaction id. Since our UPDATE command hasn’t changed the number of logical rows, this count does not change. Nevertheless the outdated rows, which have been logically deleted by our UPDATE command, are still present on disk. The COUNT(*) query cannot access these rows, but they do take up space on disk and need to be scanned. GROOM purges these logically deleted rows from disk, where they would otherwise increase disk usage and the amount of data that has to be scanned.

You should GROOM tables that receive frequent updates or deletes more often than tables that are seldom updated. You might want to schedule tasks that routinely GROOM the frequently updated tables, or run a GROOM command as part of your ETL process.

5 Changing the Data Type of a Column

In some situations you will realize that the initially chosen data types are not suitable for long-term use, for example because new entries exceed the range of an initially picked integer type. You cannot directly change the data type by using the ALTER statement, but there are two approaches that allow you to do it without unloading and reloading the data.

The first approach is to:
• Create a CTAS table from the old table with a CAST to the new data type for the column you want to change
• Drop the old table
• Rename the new table to the name of the old table

In general this is a good approach because it lets you keep the order of the columns. But in this example we will use a second approach to highlight the groom command and its role during ADD and DROP column commands. Its disadvantage is that the order of the columns will change, which may cause difficulties for third-party applications that access columns by their position.

In this chapter we will:
• Add a new column to the table with the new data type
• Copy over all values from the old column to the new one with an UPDATE command
• Drop the old column
  • 165. • Rename the new column to the name of the old one
• Use the groom command to materialize the results of our table changes

For our example, we find out that we have a new region to add to our REGION table whose name, “Australia, New Zealand, and Tasmania”, exceeds the limit of the CHAR(25) field R_NAME. We decide to enlarge the R_NAME field to a CHAR(40) field.

1. Add a new column to the REGION table with the name R_NAME_TEMP and data type CHAR(40):

    LABDB(LABADMIN)=> ALTER TABLE REGION ADD COLUMN R_NAME_TEMP CHAR(40);

Notice that the ALTER command is practically instantaneous. This holds true even for huge tables. Under the covers the system creates a new, empty version of the table. It does not lock and change the whole table.

2. Let's insert a row into the table using the new name column:

    LABDB(LABADMIN)=> INSERT INTO REGION VALUES (5,'', 'South Pacific Region', 'Australia, New Zealand, and Tasmania');

3. Now do a select on the table:

    LABDB(LABADMIN)=> SELECT * FROM REGION;

You should get the following results:

     R_REGIONKEY | R_NAME | R_COMMENT            | R_NAME_TEMP
    -------------+--------+----------------------+--------------------------------------
               1 | na     | north america        |
               5 |        | South Pacific Region | Australia, New Zealand, and Tasmania
               4 | ap     | asia pacific         |
               2 | sa     | south america        |
               3 | emea   | europe,              |
    (5 rows)

You can see that the results are exactly what you would expect, but how does the system actually achieve this? Remember that inside the PureData System appliance we now have two versions of the table: one containing the old columns and rows, and one containing the new column and the new row.

4. Let’s do an EXPLAIN on the SELECT query:

    LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT * FROM REGION;

You should get the following results:
  • 166.
    LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT * FROM REGION;
    NOTICE: QUERY PLAN:
    QUERY SQL:
    EXPLAIN VERBOSE SELECT * FROM REGION;
    QUERY VERBOSE PLAN:
    Node 1.
      [SPU Sequential Scan table ""_TV_315893_2"" {("_TV_315893_2".R_REGIONKEY)}]
          -- Estimated Rows = 1, Width = 221, Cost = 0.0 .. 0.0, Conf = 100.0
          User table: REGION version 2
          Projections:
            1:"_TV_315893_2".R_REGIONKEY 2:"_TV_315893_2".R_NAME
            3:"_TV_315893_2".R_COMMENT 4:"_TV_315893_2".R_NAME_TEMP
    Node 2.
      [SPU Sub-query Scan table "*SELECT* 1" Node "1" {(0."1")}]
          -- Estimated Rows = 1, Width = 221, Cost = 0.0 .. 0.0, Conf = 0.0
          Projections:
            1:0."1" 2:0."2" 3:0."3" 4:0."4"
    Node 3.
      [SPU Sequential Scan table ""_TV_315893_1"" {("_TV_315893_1".R_REGIONKEY)}]
          -- Estimated Rows = 8, Width = 221, Cost = 0.0 .. 0.0, Conf = 100.0
          User table: REGION version 1
          Projections:
            1:"_TV_315893_1".R_REGIONKEY 2:"_TV_315893_1".R_NAME
            3:"_TV_315893_1".R_COMMENT 4:(NULL::BPCHAR)::CHAR(40)
    Node 4.
      [SPU Sub-query Scan table "*SELECT* 2" Node "3" {(0."1")}]
          -- Estimated Rows = 8, Width = 221, Cost = 0.0 .. 0.0, Conf = 0.0
          Projections:
            1:0."1" 2:0."2" 3:0."3" 4:0."4"
    Node 5.
      [SPU Append Nodes: , "2", "4 (stream)" {(0."1")}]
          -- Estimated Rows = 9, Width = 221, Cost = 0.0 .. 0.0, Conf = 0.0
          Projections:
            1:0."1" 2:0."2" 3:0."3" 4:0."4"
    Node 6.
      [SPU Sub-query Scan table "_BV_315893" Node "5" {("_BV_315893".R_REGIONKEY)}]
          -- Estimated Rows = 9, Width = 221, Cost = 0.0 .. 0.0, Conf = 100.0
          Projections:
            1:"_BV_315893".R_REGIONKEY 2:"_BV_315893".R_NAME
            3:"_BV_315893".R_COMMENT 4:"_BV_315893".R_NAME_TEMP
      [SPU Return]
      [Host Return]
    …

Normally the query would result in a single table scan node, but now we see a more complicated query plan. The optimizer automatically translates the simple SELECT into a UNION of two tables. The two tables are internal: _TV_315893_1 is the old version of the table from before the ALTER statement, and _TV_315893_2 is the new version of the table after the ALTER statement, containing the new column R_NAME_TEMP.

Notice that for the old table a fourth column of type CHAR(40) with the default value NULL is added. This is necessary for the UNION to succeed. The two tables are merged in Node 5, which takes both result sets and appends them.

But let's proceed with our data type change operation.

5. First let's remove the new row again:

    LABDB(LABADMIN)=> DELETE FROM REGION WHERE R_REGIONKEY > 4;

6. Now we will move all values of the R_NAME column to the R_NAME_TEMP column by updating them:

    LABDB(LABADMIN)=> UPDATE REGION SET R_NAME_TEMP = R_NAME;

7. Let's have a look at the table again:

    LABDB(LABADMIN)=> SELECT * FROM REGION;
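The append of the two table versions that the query plan above performs (Node 5) can be pictured with a small model. The Python sketch below is an illustration under assumptions, not the optimizer's implementation: each table version is a list of dictionaries, `select_all` is a hypothetical helper, and a column missing from the old version is filled with `None` to stand in for the NULL::CHAR(40) projection shown in Node 3. The rows are the ones present right after the INSERT of the new region.

```python
def select_all(versions, columns):
    """Append the rows of every table version, padding missing columns."""
    result = []
    for version in versions:
        for row in version:
            # dict.get returns None for a column this version does not have,
            # mirroring the NULL placeholder added for the old table version.
            result.append(tuple(row.get(column) for column in columns))
    return result


columns = ["R_REGIONKEY", "R_NAME", "R_COMMENT", "R_NAME_TEMP"]

# Version 1: rows written before ALTER TABLE ... ADD COLUMN (three columns).
version_1 = [
    {"R_REGIONKEY": 1, "R_NAME": "na", "R_COMMENT": "north america"},
    {"R_REGIONKEY": 2, "R_NAME": "sa", "R_COMMENT": "south america"},
    {"R_REGIONKEY": 3, "R_NAME": "emea", "R_COMMENT": "europe, middle east, africa"},
    {"R_REGIONKEY": 4, "R_NAME": "ap", "R_COMMENT": "asia pacific"},
]

# Version 2: rows written after the ALTER (four columns).
version_2 = [
    {"R_REGIONKEY": 5, "R_NAME": "", "R_COMMENT": "South Pacific Region",
     "R_NAME_TEMP": "Australia, New Zealand, and Tasmania"},
]

# The plan scans version 2 first (Node 1) and then version 1 (Node 3).
rows = select_all([version_2, version_1], columns)
for row in rows:
    print(row)
```

Every query against the table pays for this extra append until the versions are merged, which is why the lab finishes the change with a GROOM of the table versions.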
  • 167. You should get the following results:

     R_REGIONKEY | R_NAME | R_COMMENT                   | R_NAME_TEMP
    -------------+--------+-----------------------------+-------------
               3 | emea   | europe, middle east, africa | emea
               1 | na     | north america               | na
               2 | sa     | south america               | sa
               4 | ap     | asia pacific                | ap
    (4 rows)

8. Now let's remove the old column:

    LABDB(LABADMIN)=> ALTER TABLE REGION DROP COLUMN R_NAME RESTRICT;

9. And rename the new column:

    LABDB(LABADMIN)=> ALTER TABLE REGION RENAME COLUMN R_NAME_TEMP TO R_NAME;

10. Let's have a look at the table again:

    LABDB(LABADMIN)=> SELECT * FROM REGION;

You should get the following results:

     R_REGIONKEY | R_COMMENT                   | R_NAME
    -------------+-----------------------------+--------
               4 | asia pacific                | ap
               3 | europe, middle east, africa | emea
               2 | south america               | sa
               1 | north america               | na
    (4 rows)

We have successfully changed the data type of the R_NAME column. The column order has changed, but our R_NAME column has the same values as before and now supports longer region names.

But we have one last step to do. Under the covers the system now has three different versions of the table, which are merged for each query against the REGION table. This not only uses up space, it is also bad for query performance. So we have to materialize these table changes with the groom command.

11. Groom the REGION table with the VERSIONS keyword to merge the table versions:

    LABDB(LABADMIN)=> GROOM TABLE REGION VERSIONS;

You should get the following results:

    NOTICE: Groom processed 8 pages; purged 5 records; scan size shrunk by 4 pages; table size shrunk by 4 extents.
    GROOM VERSIONS
  • 168. 12. And finally we will look at the EXPLAIN output again:

    LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT * FROM REGION;

You should get the following results:

    NOTICE: QUERY PLAN:
    QUERY SQL:
    EXPLAIN VERBOSE SELECT * FROM REGION;
    QUERY VERBOSE PLAN:
    Node 1.
      [SPU Sequential Scan table "REGION" {(REGION.R_REGIONKEY)}]
          -- Estimated Rows = 4, Width = 73, Cost = 0.0 .. 0.0, Conf = 100.0
          Projections:
            1:REGION.R_REGIONKEY 2:REGION.R_COMMENT 3:REGION.R_NAME
      [SPU Return]
      [Host Return]
    …

Now this is much nicer. As we would expect, we only have a single table scan snippet in the query plan and a single version of the REGION table.

13. Finally we will return the REGION table to the old column ordering so that it does not interfere with future labs. To do this we will use a CTAS statement:

    LABDB(LABADMIN)=> CREATE TABLE REGION_NEW AS SELECT R.R_REGIONKEY, R.R_NAME, R.R_COMMENT FROM REGION R;

14. Now drop the REGION table:

    LABDB(LABADMIN)=> DROP TABLE REGION;

15. And finally rename the REGION_NEW table to make the transformation complete:

    LABDB(LABADMIN)=> ALTER TABLE REGION_NEW RENAME TO REGION;

If a table can be inaccessible for a short period of time, using a CTAS table can be a better way to change data types than using an ALTER TABLE statement.

In this lab you have looked behind the scenes of the PureData System appliance. You have seen how transactions are implemented, and we have shown different reasons for using the groom command. It not only removes rows that were logically deleted by DELETE and UPDATE operations, as well as the rows left behind by aborted INSERTs and loads, it also materializes table changes and reorders clustered base tables.
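The first approach to changing a data type (CTAS with a CAST, drop, rename) was described in section 5 but not executed in the lab. The sketch below runs those three steps with Python's built-in sqlite3 module, purely so the sequence can be tried without an appliance. This is an assumption-laden stand-in: SQLite is not PureData System, it does not enforce CHAR lengths, and the lowercase table definition with a TEXT comment column is illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Illustrative stand-in for the REGION table (SQLite ignores CHAR lengths).
    CREATE TABLE region (r_regionkey INTEGER, r_name CHAR(25), r_comment TEXT);
    INSERT INTO region VALUES
        (1, 'na', 'north america'),
        (2, 'sa', 'south america'),
        (3, 'emea', 'europe, middle east, africa'),
        (4, 'ap', 'asia pacific');

    -- 1. CTAS with a CAST to the new data type; the column order is kept.
    CREATE TABLE region_new AS
        SELECT r_regionkey, CAST(r_name AS CHAR(40)) AS r_name, r_comment
        FROM region;

    -- 2. Drop the old table.
    DROP TABLE region;

    -- 3. Rename the new table to the name of the old table.
    ALTER TABLE region_new RENAME TO region;
""")

# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
columns = [row[1] for row in conn.execute("PRAGMA table_info(region)")]
row_count = conn.execute("SELECT COUNT(*) FROM region").fetchone()[0]
print(columns, row_count)
```

The column order stays R_REGIONKEY, R_NAME, R_COMMENT, which is the advantage the lab attributes to this approach; the price is that the table is briefly unavailable between the DROP and the RENAME.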
  • 170. Stored Procedures Hands-On Lab IBM PureData System for Analytics … Powered by Netezza Technology
  • 171. Table of Contents
1 Introduction
1.1 Objectives
2 Implementing the addCustomer stored procedure
2.1 Create Insert Stored Procedure
2.2 Adding integrity checks
2.3 Managing your stored procedure
3 Implementing the checkRegions stored procedure
  • 172. 1 Introduction

Stored procedures are subroutines that are saved in PureData System. They are executed inside the database server and are only available by accessing the NPS system. They combine the capabilities of SQL to query and manipulate database information with the capabilities of procedural programming languages, like branching and iteration. This makes them an ideal solution for tasks like data validation, writing event logs, or encrypting data. They are especially suited for repetitive tasks that can be easily encapsulated in a subroutine.

1.1 Objectives

In the previous labs we created our database, loaded the data, and performed some optimization and administration tasks. In this lab we will enhance the database with a couple of stored procedures. As we mentioned in a previous chapter, PureData System doesn’t check referential or unique constraints. This is normally not critical, since data loading in a data warehousing environment is a controlled task. In our PureData System implementation we have the requirement to allow some non-administrative database users to add new customers to the customer table. This happens rarely, so there are no performance requirements, and we have decided to implement it with a stored procedure that is accessible to these users and checks the input values and referential constraints.

In a second part we will implement a business logic function as a stored procedure returning a result set.

[Figure 1: LABDB database]

2 Implementing the addCustomer stored procedure
  • 173. In this chapter we will create the stored procedure to insert data into the customer table. The information that is added for a new customer will be the customer key, name, phone number and nation; the rest of the information is updated through other processes.

2.1 Create Insert Stored Procedure

First we will review the customer table and define the interface of the insert stored procedure.

1. Connect to your Netezza image using PuTTY. Log in to 192.168.239.2 as user “nz” with password “nz”. (192.168.239.2 is the default IP address for a local VM; the IP may be different for your Bootcamp.)

2. Access the lab directory for this lab with the following command. This folder already contains empty files for the stored procedure scripts we will create later. If you want, review them with the ls command:

    [nz@netezza ~]$ cd /labs/storedProcedure/

3. Enter NZSQL and connect to LABDB as user LABADMIN:

    [nz@netezza labs]$ nzsql LABDB LABADMIN

4. Describe the customer table with the following command:

    LABDB(ADMIN)=> \d customer

You should see the following:

    Table "CUSTOMER"
      Attribute   |          Type          | Modifier | Default Value
    --------------+------------------------+----------+---------------
     C_CUSTKEY    | INTEGER                | NOT NULL |
     C_NAME       | CHARACTER VARYING(25)  | NOT NULL |
     C_ADDRESS    | CHARACTER VARYING(40)  | NOT NULL |
     C_NATIONKEY  | INTEGER                | NOT NULL |
     C_PHONE      | CHARACTER(15)          | NOT NULL |
     C_ACCTBAL    | NUMERIC(15,2)          | NOT NULL |
     C_MKTSEGMENT | CHARACTER(10)          | NOT NULL |
     C_COMMENT    | CHARACTER VARYING(117) | NOT NULL |
    Distributed on hash: "C_CUSTKEY"

We will now create a stored procedure that adds a new customer entry and sets the 4 fields C_CUSTKEY, C_NAME, C_NATIONKEY, and C_PHONE. All other fields will be set to an empty value or 0, since the fields are flagged as NOT NULL.

5. To create a stored procedure we will use the internal vi editor of the nzsql console. Open the already existing empty file addCustomer.sql with the following command (note that you can use tab completion for the filename):

    LABDB(ADMIN)=> \e addCustomer.sql

6. You are now in the familiar vi interface and can edit the file. Switch to INSERT mode by pressing “i”.

7. We will now create the interface of the stored procedure so we can test creating it. We need the 4 input fields mentioned above and will return an integer return code. Enter the text shown in the following, then exit insert mode by pressing ESC, type :wq! and press Enter to save the file and quit vi.
  • 174.
    CREATE OR REPLACE PROCEDURE addCustomer(integer, varchar(25), integer, varchar(15))
    LANGUAGE NZPLSQL RETURNS INT4 AS
    BEGIN_PROC
    END_PROC;
    ~
    ~
    :wq!

The minimal stored procedure we create here doesn’t do anything yet, since it has an empty body. We simply create the signature with the input and output variables. We use the command CREATE OR REPLACE so that we can later execute the same command multiple times to update the stored procedure with more code. The input variables cannot be given names, so we only list the data types of our input parameters key, name, nation and phone. We also return an integer return code. Note that we have to specify the procedure language even though NZPLSQL is the only available option in PureData System.

8. Back in the nzsql command line, execute the script we just created with \i addCustomer.sql:

    LABDB(ADMIN)=> \i addCustomer.sql
    CREATE PROCEDURE

You should see that the procedure has been created successfully.

9. Display all stored procedures in the LABDB database with the following command:

    LABDB(LABADMIN)=> SHOW PROCEDURE;

You will see the following result:

     RESULT  |  PROCEDURE  | BUILTIN | ARGUMENTS
    ---------+-------------+---------+------------------------------------------------------------------
     INTEGER | ADDCUSTOMER | f       | (INTEGER, CHARACTER VARYING(25), INTEGER, CHARACTER VARYING(15))
    (1 row)

You can see the procedure ADDCUSTOMER with the arguments we specified.

10. Execute the stored procedure with the following dummy input parameters:

    LABDB(LABADMIN)=> call addcustomer(1,'test', 2, 'test');
  • 175. You should see the following:

    LABDB(LABADMIN)=> call addcustomer(1,'test', 2, 'test');
    NOTICE: plpgsql: ERROR during compile of ADDCUSTOMER near line 1
    ERROR: syntax error, unexpected <EOF>, expecting BEGIN at or near ""

The result shows that we have a syntax error in our stored procedure. Every stored procedure needs at least one BEGIN .. END block that encapsulates the code to be executed. Stored procedures are compiled when they are first executed, not when they are created, so errors in the code only become visible during execution.

11. Switch back to the vi view with the following command:

    LABDB(LABADMIN)=> \e addCustomer.sql

12. Switch to insert mode by pressing “i”.

13. We will now create a simple stored procedure that inserts the new entry into the customer table. But first we will add some variables that alias the input variables $1, $2, etc. After the BEGIN_PROC statement enter the following lines:

    DECLARE
        C_KEY ALIAS FOR $1;
        C_NAME ALIAS FOR $2;
        N_KEY ALIAS FOR $3;
        PHONE ALIAS FOR $4;

Each BEGIN..END block in the stored procedure can have its own DECLARE section. Variables are valid in the block they belong to. It is good practice to map the input parameters to readable variable names to keep the stored procedure code maintainable. We will later add some additional parameters to our procedures as well. Be careful not to use variable names that are restricted by PureData System, for example NAME.

14. Next we will add the BEGIN..END block with the INSERT statement:

    BEGIN
        INSERT INTO CUSTOMER VALUES (C_KEY, C_NAME, '', N_KEY, PHONE, 0, '', '');
    END;

This statement adds a new row to the customer table using the input variables. It fills the remaining fields, like the account balance, with default values that can be updated later. It is also possible to execute dynamic SQL queries, which we will do in a later chapter.

Your complete stored procedure should now look like the following:
  • 176.
    CREATE OR REPLACE PROCEDURE addCustomer(integer, varchar(25), integer, varchar(15))
    LANGUAGE NZPLSQL RETURNS INT4 AS
    BEGIN_PROC
    DECLARE
        C_KEY ALIAS for $1;
        C_NAME ALIAS for $2;
        N_KEY ALIAS for $3;
        PHONE ALIAS for $4;
    BEGIN
        INSERT INTO CUSTOMER VALUES (C_KEY, C_NAME, '', N_KEY, PHONE, 0, '', '');
    END;
    END_PROC;

15. Save and exit VI again by pressing ESC to enter the command mode, entering “:wq!” and pressing Enter. This will bring you back to the nzsql console.

16. Execute the stored procedure script with the following command: \i addCustomer.sql

17. Now let’s try our stored procedure. Let’s add a new customer John Smith with customer key 999999, phone number 555-5555 and nation 2 (which is the key for the United States in our nation table). You can also check first that the customer doesn’t exist yet if you want.

    LABDB(LABADMIN)=> CALL addCustomer(999999,'John Smith', 2, '555-5555');

You should get the following results:

    LABDB(LABADMIN)=> CALL addcustomer(999999,'John Smith', 2, '555-5555');
     ADDCUSTOMER
    -------------

    (1 row)

18. Let’s check if the insert was successful:

    LABDB(LABADMIN)=> SELECT * FROM CUSTOMER WHERE C_CUSTKEY = 999999;

You should get the following results:

    LABDB(LABADMIN)=> SELECT * FROM CUSTOMER WHERE C_CUSTKEY = 999999;
     C_CUSTKEY |   C_NAME   | C_ADDRESS | C_NATIONKEY | C_PHONE  | C_ACCTBAL | C_MKTSEGMENT | C_COMMENT
    -----------+------------+-----------+-------------+----------+-----------+--------------+-----------
        999999 | John Smith |           |           2 | 555-5555 |      0.00 |              |
    (1 row)

We can see that our insert was successful. Congratulations, you have built your first PureData System stored procedure.
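For readers who want to experiment with the shape of this procedure without a PureData System at hand, here is a minimal Python sketch of the same idea: positional inputs are mapped to readable names and the remaining columns are filled with placeholder values. It uses an in-memory SQLite database as a stand-in; the function name add_customer and the simplified column types are assumptions made for this illustration and are not part of the lab.

```python
import sqlite3


def add_customer(conn, c_key, c_name, n_key, phone):
    """Insert one customer row; the named arguments play the role of the
    ALIAS FOR $1..$4 declarations in the NZPLSQL procedure."""
    conn.execute(
        # Remaining columns (address, balance, segment, comment) get
        # placeholder values, as in the lab's INSERT statement.
        "INSERT INTO customer VALUES (?, ?, '', ?, ?, 0, '', '')",
        (c_key, c_name, n_key, phone),
    )


# Stand-in for the CUSTOMER table (column types simplified for SQLite).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "c_custkey INTEGER, c_name TEXT, c_address TEXT, c_nationkey INTEGER, "
    "c_phone TEXT, c_acctbal REAL, c_mktsegment TEXT, c_comment TEXT)"
)

add_customer(conn, 999999, "John Smith", 2, "555-5555")
row = conn.execute(
    "SELECT * FROM customer WHERE c_custkey = 999999"
).fetchone()
print(row)
```

The sketch only mirrors the data flow; compilation on first execution and the NZPLSQL block structure have no equivalent here.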
  • 177. 2.2 Adding integrity checks

In this chapter we will add integrity checks to the stored procedure we just created. We will make sure that no duplicate customer is entered into the CUSTOMER table by querying it before the insert. We will then check with an IF condition whether the value has already been inserted into the CUSTOMER table and abort the insert in that case. We will also check the foreign key relationship to the nation table and make sure that no customer is inserted for a nation that doesn’t exist. If either check fails, the procedure will abort and display an error message.

1. Switch back to the VI view of the procedure with the following command. In case of a message warning about duplicate files press Enter.

    LABDB(LABADMIN)=> \e addCustomer.sql

2. Switch to insert mode by pressing “i”.

3. Add a new variable REC with the type RECORD in the DECLARE section:

    REC RECORD;

A RECORD is a row set with dynamic fields. It can refer to any row that is selected in a SELECT INTO statement. You can later refer to fields with, for example, REC.C_PHONE.

4. Add the following statement before the INSERT statement:

    SELECT * INTO REC FROM CUSTOMER WHERE C_CUSTKEY = C_KEY;

This statement fills the REC variable with the results of the query. If there are already one or more customers with the specified key, it will contain the first. Otherwise the variable will be null.

5. Now we add the IF condition to abort the stored procedure in case a record already exists. After the newly added SELECT statement add the following lines:

    IF FOUND REC THEN
        RAISE EXCEPTION 'Customer with key % already exists', C_KEY;
    END IF;

In this case we use an IF condition to check whether a customer record with the key already exists and has been selected by the previous SELECT statement. We could check the record or any of its fields against the null value, but PureData System provides a number of special variables that make this more convenient.

• FOUND specifies if the last SELECT INTO statement has returned any records
• ROW_COUNT contains the number of rows found by the last SELECT INTO statement
• LAST_OID is the object id of the last inserted row; this variable is not very useful unless used for catalog tables.

Finally we use a RAISE EXCEPTION statement to throw an error and abort the stored procedure. To add variable values to the message string, use the % symbol anywhere in the string. This is similar to the approach used, for example, by the C printf function.

6. We will also check the foreign key relationship to NATION. Add the following lines after the last ones:
  • 178.
    SELECT * INTO REC FROM NATION WHERE N_NATIONKEY = N_KEY;
    IF NOT FOUND REC THEN
        RAISE EXCEPTION 'No Nation with nation key %', N_KEY;
    END IF;

This is very similar to the last check, except that this time we check if a record was NOT found. Notice that we can reuse the REC record since it is not typed to a particular table. Your stored procedure should now look like the following:

    CREATE OR REPLACE PROCEDURE addCustomer(integer, varchar(25), integer, varchar(15))
    LANGUAGE NZPLSQL RETURNS INT4 AS
    BEGIN_PROC
    DECLARE
        C_KEY ALIAS for $1;
        C_NAME ALIAS for $2;
        N_KEY ALIAS for $3;
        PHONE ALIAS for $4;
        REC RECORD;
    BEGIN
        SELECT * INTO REC FROM CUSTOMER WHERE C_CUSTKEY = C_KEY;
        IF FOUND REC THEN
            RAISE EXCEPTION 'Customer with key % already exists', C_KEY;
        END IF;
        SELECT * INTO REC FROM NATION WHERE N_NATIONKEY = N_KEY;
        IF NOT FOUND REC THEN
            RAISE EXCEPTION 'No Nation with nation key %', N_KEY;
        END IF;
        INSERT INTO CUSTOMER VALUES (C_KEY, C_NAME, '', N_KEY, PHONE, 0, '', '');
    END;
    END_PROC;

7. Save the stored procedure by pressing ESC, then entering “:wq!” and pressing Enter.

8. In nzsql create the stored procedure from the script by executing the following command (remember that you can cycle through previous commands by pressing the UP key):

    LABDB(LABADMIN)=> \i addCustomer.sql

9. Now let’s test the check for duplicate customer ids by repeating our last CALL statement. We already know that a customer record with the id 999999 exists:

    LABDB(LABADMIN)=> CALL addCustomer(999999,'John Smith', 2, '555-5555');

You should get the following results:

    LABDB(LABADMIN)=> CALL addCustomer(999999,'John Smith', 2, '555-5555');
    ERROR:  Customer with key 999999 already exists
    LABDB(LABADMIN)=>

This is what we expected: the key value already exists and our first error condition is raised.
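The two checks can also be tried outside the appliance. The following Python sketch, again against an in-memory SQLite stand-in, follows the same order as the procedure: duplicate check, foreign key check, then insert. The exception class ProcedureError, the function add_customer_checked and the two-column nation table are assumptions made for this illustration; the message texts are taken from the lab.

```python
import sqlite3


class ProcedureError(Exception):
    """Stand-in for the error that RAISE EXCEPTION produces."""


def add_customer_checked(conn, c_key, c_name, n_key, phone):
    # Duplicate check: SELECT ... INTO REC followed by IF FOUND.
    rec = conn.execute(
        "SELECT * FROM customer WHERE c_custkey = ?", (c_key,)
    ).fetchone()
    if rec is not None:
        raise ProcedureError("Customer with key %s already exists" % c_key)

    # Foreign key check: the same record variable is reused for NATION.
    rec = conn.execute(
        "SELECT * FROM nation WHERE n_nationkey = ?", (n_key,)
    ).fetchone()
    if rec is None:
        raise ProcedureError("No Nation with nation key %s" % n_key)

    conn.execute(
        "INSERT INTO customer VALUES (?, ?, '', ?, ?, 0, '', '')",
        (c_key, c_name, n_key, phone),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nation (n_nationkey INTEGER, n_name TEXT)")
conn.execute(
    "CREATE TABLE customer ("
    "c_custkey INTEGER, c_name TEXT, c_address TEXT, c_nationkey INTEGER, "
    "c_phone TEXT, c_acctbal REAL, c_mktsegment TEXT, c_comment TEXT)"
)
conn.execute("INSERT INTO nation VALUES (2, 'UNITED STATES')")

messages = []
calls = [
    (999999, "John Smith", 2, "555-5555"),    # succeeds
    (999999, "John Smith", 2, "555-5555"),    # duplicate key
    (999998, "James Brown", 99, "555-5555"),  # unknown nation
]
for args in calls:
    try:
        add_customer_checked(conn, *args)
        messages.append("ok")
    except ProcedureError as err:
        messages.append(str(err))
print(messages)
```

As in the procedure, the order of the checks determines which message a caller sees when both conditions are violated.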
  • 179. 10. Now let’s check the foreign key integrity by executing the following command with a customer id that does not exist yet and a nation key that doesn’t exist in the NATION table either. You can double-check this using SELECT statements if you want:

    LABDB(LABADMIN)=> CALL addCustomer(999998,'James Brown', 99, '555-5555');

You should see the following output:

    LABDB(LABADMIN)=> CALL addCustomer(999998,'James Brown', 99, '555-5555');
    ERROR:  No Nation with nation key 99
    LABDB(LABADMIN)=>

This is also as expected. The customer key didn’t exist yet, so the first IF condition does not raise an exception, but the check of the nation key does.

11. Finally let’s try a working example. Execute the following command with a customer id that doesn’t exist yet and the nation key 2 for the United States:

    LABDB(LABADMIN)=> CALL addCustomer(999998,'James Brown', 2, '555-5555');

You should see a successful execution.

12. Let’s check that the value was correctly inserted:

    LABDB(LABADMIN)=> SELECT C_CUSTKEY, C_NAME FROM CUSTOMER WHERE C_CUSTKEY = 999998;

This will give you the following results:

    LABDB(LABADMIN)=> SELECT C_CUSTKEY, C_NAME FROM CUSTOMER WHERE C_CUSTKEY = 999998;
     C_CUSTKEY |   C_NAME
    -----------+-------------
        999998 | James Brown
    (1 row)
    LABDB(LABADMIN)=>

We have successfully created a stored procedure that can be used to insert values into the CUSTOMER table and that checks for unique and foreign key constraints. Remember that PureData System isn’t optimized for lookup queries, so this will be a fairly slow operation and shouldn’t be used for thousands of inserts. For occasional management tasks, however, it is a perfectly valid solution to the problem of missing constraints in PureData System.

2.3 Managing your stored procedure

In the last chapters we have created a stored procedure that inserts values into the CUSTOMER table and checks constraints. We will now give a user the right to execute this procedure, and we will use the management functions to make changes to the stored procedure and verify them.

1. First we will create a user custadmin who will be responsible for adding customers. To do this we need to switch to the admin user, since users are global objects:

    LABDB(LABADMIN)=> \c labdb admin
  • 180. 2. Now we create the user:

    LABDB(ADMIN)=> create user custadmin with password 'password';

You can see that he has the same password as the other users in our labs. We do this for simplicity, since it allows us to omit the password during user switches; this would of course not be done in a production environment.

3. Now we will grant access to the labdb database, otherwise he couldn’t log on:

    LABDB(ADMIN)=> grant list, select on labdb to custadmin;

4. Finally we will grant him the right to select from the customer table. He will need this to verify any changes he has made:

    LABDB(ADMIN)=> grant select on customer to custadmin;

5. Now let’s test this. First switch to the user custadmin:

    LABDB(ADMIN)=> \c labdb custadmin

6. Now try to select something from the NATION table to verify that the user only has access to the customer table:

    LABDB(CUSTADMIN)=> select * from nation;

You should get the message that access is refused:

    LABDB(CUSTADMIN)=> select * from nation;
    ERROR:  Permission denied on "NATION".
    LABDB(CUSTADMIN)=>

7. Now let’s select something from the CUSTOMER table:

    LABDB(CUSTADMIN)=> select c_custkey, c_name from customer where c_custkey = 999998;

The user should be able to select the row from the CUSTOMER table:

    LABDB(CUSTADMIN)=> select c_custkey, c_name from customer where c_custkey = 999998;
     C_CUSTKEY |   C_NAME
    -----------+-------------
        999998 | James Brown
    (1 row)
    LABDB(CUSTADMIN)=>

8. Finally let’s verify that the user doesn’t have INSERT rights on the table:

    LABDB(CUSTADMIN)=> INSERT INTO CUSTOMER VALUES (1, '','',1,'',1,'','');

The insert into the customer table will be refused:
  • 181.
    LABDB(CUSTADMIN)=> INSERT INTO CUSTOMER VALUES (1, '','',1,'',1,'','');
    ERROR:  Permission denied on "CUSTOMER".
    LABDB(CUSTADMIN)=>

9. We now need to switch back to the admin user to give custadmin the right to execute the stored procedure:

    LABDB(CUSTADMIN)=> \c labdb admin

10. To grant the right to execute a specific stored procedure we need to specify the full name including all input parameters. The easiest way to get these in the correct syntax is to first list them with the SHOW PROCEDURE command:

    LABDB(ADMIN)=> show procedure all;

You should see the following screen. You can either cut and paste the arguments or copy them manually:

    LABDB(ADMIN)=> show procedure all;
     RESULT  |  PROCEDURE  | BUILTIN | ARGUMENTS
    ---------+-------------+---------+------------------------------------------------------------------
     INTEGER | ADDCUSTOMER | f       | (INTEGER, CHARACTER VARYING(25), INTEGER, CHARACTER VARYING(15))
    (1 row)

11. Now grant the right to execute this stored procedure to CUSTADMIN:

    LABDB(ADMIN)=> grant execute on addcustomer(INTEGER, CHARACTER VARYING(25), INTEGER, CHARACTER VARYING(15)) to custadmin;

12. Let’s check the rights of the custadmin user now with \dpu custadmin. You should get the following results:

    LABDB(ADMIN)=> \dpu custadmin
    User object permissions for user 'CUSTADMIN'
     Database Name | Object Name | L S I U D T L A D B L G O E C R X A | D G U T E X Q Y V M I B R C S H F A L P N S
    ---------------+-------------+-------------------------------------+---------------------------------------------
     LABDB         | ADDCUSTOMER |                           X         |
     LABDB         | CUSTOMER    |   X                                 |
     GLOBAL        | LABDB       | X X                                 |
    (3 rows)
    Object Privileges
     (L)ist (S)elect (I)nsert (U)pdate (D)elete (T)runcate (L)ock
     (A)lter (D)rop a(B)ort (L)oad (G)enstats Gr(O)om (E)xecute
     Label-A(C)cess Label-(R)estrict Label-E(X)pand Execute-(A)s
    Administration Privilege
     (D)atabase (G)roup (U)ser (T)able T(E)mp E(X)ternal
     Se(Q)uence S(Y)nonym (V)iew (M)aterialized View (I)ndex (B)ackup
     (R)estore va(C)uum (S)ystem (H)ardware
     (F)unction (A)ggregate (L)ibrary (P)rocedure U(N)fence (S)ecurity
  • 182. You can see that the user has only the rights we have given him. He can select data from the customer table and execute our stored procedure, but he is not allowed to change the customer table directly or execute anything but the stored procedure.

13. Let’s test this. Switch to the custadmin user with the following command: \c labdb custadmin

14. Add another customer to the customer table:

    LABDB(CUSTADMIN)=> CALL addCustomer(999997,'Jake Jones', 2, '555-5554');

The insert will succeed and you will have another row in your table. You can check this with a SELECT query if you want.

15. We will now make some changes to the stored procedure. To do this we need to switch back to the administrative account:

    LABDB(CUSTADMIN)=> \c labdb admin

16. Before we modify the stored procedure, let’s have a detailed look at it:

    LABDB(ADMIN)=> show procedure addcustomer verbose;

You should see the following screen:

    LABDB(ADMIN)=> show procedure addcustomer verbose;
     RESULT | PROCEDURE | BUILTIN | ARGUMENTS | OWNER | EXECUTEDASOWNER | VARARGS | DESCRIPTION | PROCEDURESOURCE
    ------
     INTEGER | ADDCUSTOMER | f | (INTEGER, CHARACTER VARYING(25), INTEGER, CHARACTER VARYING(15)) | ADMIN | t | f | |
    DECLARE
        C_KEY ALIAS for $1;
        C_NAME ALIAS for $2;
        N_KEY ALIAS for $3;
        PHONE ALIAS for $4;
        REC RECORD;
    BEGIN
        SELECT * INTO REC FROM CUSTOMER WHERE C_CUSTKEY = C_KEY;
        IF FOUND REC THEN
            RAISE EXCEPTION 'Customer with key % already exists', C_KEY;
        END IF;
        SELECT * INTO REC FROM NATION WHERE N_NATIONKEY = N_KEY;
        IF NOT FOUND REC THEN
            RAISE EXCEPTION 'No Nation with nation key %', N_KEY;
        END IF;
        INSERT INTO CUSTOMER VALUES (C_KEY, C_NAME, '', N_KEY, PHONE, 0, '', '');
    END;

You can see the input and output arguments, the procedure name, the owner, whether it is executed as owner or caller, and other details. VERBOSE also shows you the source code of the stored procedure. We see that the description field is still empty, so let’s add a comment to the stored procedure. This is important if you have a large number of stored procedures in your system.

Note: nzadmin is a very convenient way to manage your stored procedures. It provides most of the management functionality used in this lab in a graphical UI.

17. Add a description to the stored procedure:
  • 183.
    LABDB(ADMIN)=> comment on procedure addcustomer(INTEGER, CHARACTER VARYING(25), INTEGER, CHARACTER VARYING(15)) IS 'This procedure adds a new customer entry to the CUSTOMER table';

It is necessary to specify the exact stored procedure signature including the input arguments. These can be cut and pasted from the output of the SHOW PROCEDURE command. The COMMENT ON command can be used to add descriptions to more or less all database objects you own, from procedures and tables down to columns.

18. Verify that your description has been set:

    LABDB(ADMIN)=> show procedure addcustomer verbose;

The description field will now contain your comment:

    LABDB(ADMIN)=> show procedure addcustomer verbose;
     RESULT | PROCEDURE | BUILTIN | ARGUMENTS | OWNER | EXECUTEDASOWNER | VARARGS | DESCRIPTION | PROCEDURESOURCE
    ------
     INTEGER | ADDCUSTOMER | f | (INTEGER, CHARACTER VARYING(25), INTEGER, CHARACTER VARYING(15)) | ADMIN | t | f | This procedure adds a new customer entry to the CUSTOMER table |
    …
    …

19. We will now alter the stored procedure to be executed as the caller instead of the owner. This means that whoever executes the stored procedure needs to have access rights to all the objects that are touched in the stored procedure, otherwise it will fail. This should be the default for stored procedures that encapsulate business logic and do not do extensive data checking:

    LABDB(ADMIN)=> alter procedure addcustomer(INTEGER, CHARACTER VARYING(25), INTEGER, CHARACTER VARYING(15)) execute as caller;

20. Since the admin user has access to the customer table, he will be able to execute the stored procedure:

    LABDB(ADMIN)=> call addCustomer(999996,'Karl Schwarz', 2, '555-5553');

21. Now let’s switch to the custadmin user: \c labdb custadmin

22. Try to add another customer as custadmin:

    LABDB(CUSTADMIN)=> call addCustomer(999995, 'John Schwarz', 2, '555-5553');

You should see the following results:

    LABDB(CUSTADMIN)=> CALL addCustomer(999995,'John Schwarz', 2, '555-5552');
    NOTICE:  Error occurred while executing PL/pgSQL function ADDCUSTOMER
    NOTICE:  line 12 at select into variables
    ERROR:  Permission denied on "NATION".

As expected, the stored procedure now fails. The user custadmin has read access to the CUSTOMER table but no read access to the NATION table, therefore this check results in an exception. While EXECUTE AS CALLER is more secure in some circumstances, it doesn’t fit our use case, where we specifically want to expose some data modification ability to a user who shouldn’t be able to modify the table otherwise. Therefore we will change the stored procedure back:
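The difference between the two execution modes can be summarised in a small Python model. This is only a sketch of the reasoning in steps 19 to 22, not of how PureData System evaluates privileges internally: the GRANTS dictionary, the REQUIRED list and the function first_denied are hypothetical names, the admin user is simply modelled as holding every required privilege, and the EXECUTE privilege on the procedure itself is not modelled.

```python
# Privileges the procedure body needs, in the order the statements run.
REQUIRED = [
    ("SELECT", "CUSTOMER"),  # duplicate check
    ("SELECT", "NATION"),    # foreign key check
    ("INSERT", "CUSTOMER"),  # the insert itself
]

# Simplified privilege table for the two users of this lab.
GRANTS = {
    "ADMIN": set(REQUIRED),
    "CUSTADMIN": {("SELECT", "CUSTOMER")},
}


def first_denied(owner, caller, execute_as_owner):
    """Return the first (privilege, object) pair that is missing,
    or None if the procedure body can run to completion."""
    # EXECUTE AS OWNER checks the owner's rights, EXECUTE AS CALLER
    # checks the rights of whoever issued the CALL.
    effective_user = owner if execute_as_owner else caller
    for needed in REQUIRED:
        if needed not in GRANTS[effective_user]:
            return needed
    return None


as_owner = first_denied("ADMIN", "CUSTADMIN", execute_as_owner=True)
as_caller = first_denied("ADMIN", "CUSTADMIN", execute_as_owner=False)
print(as_owner, as_caller)
```

In this model the call as caller stops at the NATION lookup, which matches the permission error shown in step 22, while the call as owner passes all three checks.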
  • 184. 23. First switch back to the admin user: \c labdb admin

24. Change the stored procedure back to being executed as owner:

    LABDB(ADMIN)=> alter procedure addcustomer(INTEGER, CHARACTER VARYING(25), INTEGER, CHARACTER VARYING(15)) execute as owner;

In this chapter you have set up the permissions for the addCustomer stored procedure and for the user custadmin who is supposed to use it. You also added a comment to the stored procedure.

3 Implementing the checkRegions stored procedure

In this chapter we will implement a stored procedure that performs a check on all rows of the REGION table. The call of the stored procedure will be very simple and will not contain input arguments. The stored procedure encapsulates a sanity check of the REGION table that is executed regularly in the PureData System for administrative purposes. Our stored procedure will check each row of the REGION table for three things:

1. Whether the region key is smaller than 1
2. Whether the name string is empty
3. Whether the description is lower case only; this is needed for application reasons.

The procedure will return each row of the REGION table together with additional columns that describe whether the above constraints are broken. It will also return a notice with the number of faulty rows. This chapter will teach you to use loops in a stored procedure and to return table results. You will also use dynamic query execution to create queries on the fly. You should be familiar with the use of VI for the development of stored procedures from the last chapter. Alternatives to a standard text editor for the creation of your stored procedure are graphical development environments like Aginity or the PureData System Eclipse plugins, which can be downloaded from the PureData System Developer Network.

1. Open the already existing empty file checkRegion.sql with the following command (note that you can complete the filename with the Tab key):

    LABDB(ADMIN)=> \e checkRegion.sql

2. You are now in the familiar VI interface and can edit the file. Switch to INSERT mode by pressing “i”.

3. First we will define the stored procedure header, similar to the last procedure. It will be very simple since we will not use any input arguments. Enter the following code in the editor:

    CREATE OR REPLACE PROCEDURE checkRegions()
    LANGUAGE NZPLSQL RETURNS REFTABLE(tb1) AS
    BEGIN_PROC
    END_PROC;
    -- INSERT --
  • 185. Let’s have a detailed look at the RETURNS section. We want to return a result set but do not have to describe the column names or datatypes of the table object that is returned. Instead we reference an existing table, which needs to exist at the time the stored procedure is created. This means we will need to create the table TB1 before executing the CREATE PROCEDURE command.

When the stored procedure is executed, it creates under the covers an empty temporary table that has the same definition as the referenced table. So the results will not actually be saved in the referenced table, which is only used for the definition. This means that multiple stored procedures can be executed at the same time without influencing each other. Since the created table is temporary, it will be cleaned up once the connection to the database is closed.

Note: If the referenced table contains rows, they will neither be changed nor copied over to the temporary table; the table is strictly used for reference.

4. For our stored procedure we need four variables. Add the following lines after the BEGIN_PROC statement:

    DECLARE
        rec RECORD;
        errorRows INTEGER;
        fieldEmpty BOOLEAN;
        descUpper BOOLEAN;

The four variables are used as follows:

• rec is a RECORD structure. While we loop through the rows of the table, we will use it to save and access the values of each row and check them against our constraints
• errorRows will contain the total number of rows that violate our constraints
• fieldEmpty will store whether the row violates either the constraint that the name must not be empty or the constraint that the region key must not be smaller than 1. Combining the two is appropriate since values of -1 or 0 in the region key are used to denote that it is empty
• descUpper will be true if a record violates the constraint that the description needs to be lowercase

5. We will now add the main BEGIN..END clause and initialize the errorRows variable. Add the following rows after the DECLARE section:

    BEGIN
        RAISE NOTICE 'Start check of Region';
        errorRows := 0;
    END;

Each stored procedure must contain at least one BEGIN .. END clause, which encapsulates the executed commands. We also initially set the number of error rows to 0 and display a short message.

6. We will now add the main loop. It will iterate through all rows of the REGION table and store each row in the rec variable. Add the following lines before the END statement:

    FOR rec IN SELECT * FROM REGION ORDER BY R_REGIONKEY LOOP
        fieldEmpty := false;
        descUpper := false;
    END LOOP;
    RAISE NOTICE ' % rows had an error see result set', errorRows;
  • 186. The FOR rec IN expression LOOP .. END LOOP command is used to iterate through a result set, in our case a SELECT * on the REGION table. The loop body is executed once for every row in the expression and the current row is saved in the rec variable. The loop needs to be ended with the END LOOP keyword. There are many other types of loops in NZPLSQL; for a complete list refer to the stored procedure guide.

For each iteration of the loop we initially set the values of fieldEmpty and descUpper to false. Variables are assigned with the ‘:=’ operator. Finally we display a notice that shows the number of rows that either had an empty field or an upper case description. This number is kept in the errorRows variable.

7. Now it’s time to check the rows against our constraints and set our variables accordingly. Enter the following rows after the variable initialization and before the END LOOP keyword:

    IF rec.R_NAME = '' OR rec.R_REGIONKEY < 1 THEN
        fieldEmpty := true;
    END IF;
    IF rec.R_COMMENT <> LOWER(rec.R_COMMENT) THEN
        descUpper := true;
    END IF;
    IF (fieldEmpty = true) OR (descUpper = true) THEN
        errorRows := errorRows + 1;
    END IF;

In this section we check our constraints for each row and set our three variables accordingly. First we check if the name field of the row is the empty string or if the region key is smaller than one. In that case fieldEmpty is set to true. Note how we can access the fields by appending the field name to our loop record. The second IF statement checks if the comment field of the row differs from the lower case version of the comment field. This is the case if it contains uppercase characters. Note that we can use the available PureData System functions like LOWER in the stored procedure, as if it were a SQL statement. Finally, if one of these variables has been set to true by the previous checks, we increase the value of the errorRows variable by one. The final number will be displayed at the end by the RAISE NOTICE statement we already added to the stored procedure.

8. Finally add the following lines after the lines you just added and before the END LOOP statement:

    EXECUTE IMMEDIATE 'INSERT INTO '|| REFTABLENAME ||' VALUES ('
        || rec.R_REGIONKEY ||','''
        || trim(rec.R_NAME) ||''','''
        || trim(rec.R_COMMENT) ||''','
        || fieldEmpty ||','
        || descUpper ||')';

These lines add the row of the REGION table to the result set of our stored procedure, together with two columns containing the fieldEmpty and descUpper flags for this row. There are a couple of important points here. For each call of a stored procedure with a result set as return value, a temporary table is created that is later returned to the caller. Since its name is unique for each call, it needs to be referenced through a variable. This is the REFTABLENAME variable. Apart from that, adding values to the result set is identical to other INSERT operations.
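The EXECUTE IMMEDIATE statement from step 8 assembles its INSERT text by string concatenation, and it helps to see the finished statement once. The Python sketch below performs the same assembly so that the result can be printed and inspected. The function name build_insert and the table name TMP_REFTABLE_1 are hypothetical, and rendering the boolean flags as true/false is an assumption of this sketch rather than documented PureData System behavior.

```python
def build_insert(reftablename, regionkey, name, comment, field_empty, desc_upper):
    """Assemble the INSERT text piece by piece, in the same order as the
    concatenation in step 8."""
    return (
        "INSERT INTO " + reftablename + " VALUES ("
        + str(regionkey) + ",'"            # numeric value, then opening quote
        + name.strip() + "','"             # trimmed name, closing/opening quote
        + comment.strip() + "',"           # trimmed comment, closing quote
        + str(field_empty).lower() + ","   # boolean rendered as true/false (assumption)
        + str(desc_upper).lower() + ")"
    )


# The name is padded to 25 characters to imitate a CHARACTER(25) column.
stmt = build_insert(
    "TMP_REFTABLE_1", 1, "na".ljust(25), "north america", False, False
)
print(stmt)
```

Printing the statement before executing it is the same debugging aid the lab suggests with RAISE NOTICE. Note that in this sketch a value that itself contains a single quote would produce an invalid statement.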
  • 187. Since the name of the table is dynamic, we need to execute the INSERT operations as a dynamic statement. This means that the EXECUTE IMMEDIATE statement is used with a string that contains the query that is to be executed. To add variable values to the string, the concatenation operator || is used. Note that the values for R_NAME and R_COMMENT are inserted as strings, which means they need to be surrounded by quotes. To add a quote to a string it needs to be escaped with a second quote character. This is the reason that R_NAME and R_COMMENT are surrounded by triple quotes. Apart from that we trim them, so that the inserted VARCHAR values are not padded with blanks.

It can be tricky to construct a string like that, and you will only see an error once the statement is executed. For debugging it can be useful to construct the string and display it with a RAISE NOTICE statement.

9. Your VI should now look like this, containing the complete stored procedure:

    CREATE OR REPLACE PROCEDURE checkRegions()
    LANGUAGE NZPLSQL RETURNS REFTABLE(tb1) AS
    BEGIN_PROC
    DECLARE
        rec RECORD;
        errorRows INTEGER;
        fieldEmpty BOOLEAN;
        descUpper BOOLEAN;
    BEGIN
        RAISE NOTICE 'Start check of Region';
        errorRows := 0;
        FOR rec IN SELECT * FROM REGION ORDER BY R_REGIONKEY LOOP
            fieldEmpty := false;
            descUpper := false;
            IF rec.R_NAME = '' OR rec.R_REGIONKEY < 1 THEN
                fieldEmpty := true;
            END IF;
            IF rec.R_COMMENT <> lower(rec.R_COMMENT) THEN
                descUpper := true;
            END IF;
            IF (fieldEmpty = true) OR (descUpper = true) THEN
                errorRows := errorRows + 1;
            END IF;
            EXECUTE IMMEDIATE 'INSERT INTO '|| REFTABLENAME ||' VALUES ('
                || rec.R_REGIONKEY ||','''
                || trim(rec.R_NAME) ||''','''
                || trim(rec.R_COMMENT) ||''','
                || fieldEmpty ||','
                || descUpper ||')';
        END LOOP;
        RAISE NOTICE ' % rows had an error see result set', errorRows;
    END;
    END_PROC;
    -- INSERT --

10. Save and exit VI. Press ESC to enter the command mode, enter “:wq!” to save and force quit, and press Enter.

11. To create the stored procedure, the referenced table tb1 needs to exist. Create the table with the following statement:

    LABDB(ADMIN)=> create table TB1 as select *, false AS FIELDEMPTY, false as DESCUPPER from region limit 0;
  • 188. This command creates a table TB1 that has all the columns of the REGION table and two additional BOOLEAN fields FIELDEMPTY and DESCUPPER. It will also be empty because we used the LIMIT 0 clause.

12. Describe the reference table with \d TB1. You should see the following result:

    LABDB(ADMIN)=> \d TB1
                             Table "TB1"
      Attribute  |          Type          | Modifier | Default Value
    -------------+------------------------+----------+---------------
     R_REGIONKEY | INTEGER                | NOT NULL |
     R_NAME      | CHARACTER(25)          | NOT NULL |
     R_COMMENT   | CHARACTER VARYING(152) |          |
     FIELDEMPTY  | BOOLEAN                |          |
     DESCUPPER   | BOOLEAN                |          |
    Distributed on hash: "R_REGIONKEY"

You can see the three columns of the REGION table and the two additional BOOLEAN fields that will record for each row whether it violates the specified constraints. Note that this table needs to exist before the procedure can be created.

13. Now create the stored procedure. Execute the script you just created with the following command:

    LABDB(ADMIN)=> \i checkRegion.sql

You should successfully create your stored procedure.

14. Now let’s have a look at our REGION table. Select all rows:

    LABDB(ADMIN)=> SELECT * FROM REGION;

You will get the following results:

    LABDB(ADMIN)=> SELECT * FROM REGION;
     R_REGIONKEY |          R_NAME           |          R_COMMENT
    -------------+---------------------------+-----------------------------
               2 | sa                        | south america
               1 | na                        | north america
               4 | ap                        | asia pacific
               3 | emea                      | europe, middle east, africa
    (4 rows)

We can see that none of the rows violates the constraints we defined, which would make for a pretty boring test. So let’s test our stored procedure by adding two rows that violate our constraints.

15. Add the two violating rows with the following commands:

    LABDB(ADMIN)=> INSERT INTO REGION VALUES (0, 'as', 'Australia');
  • 189. This row violates the lower case constraint for the comment field and the empty field constraint for the region key.

    LABDB(ADMIN)=> INSERT INTO REGION VALUES (6, '', 'mongolia');

This row violates the empty field constraint for the region name.

16. Now finally let’s try our checkRegions stored procedure:

    LABDB(ADMIN)=> call checkRegions();

You should see the following output:

    LABDB(ADMIN)=> call checkRegions();
    NOTICE:  Start check of Region
    NOTICE:   2 rows had an error see result set
     R_REGIONKEY |          R_NAME           |          R_COMMENT          | FIELDEMPTY | DESCUPPER
    -------------+---------------------------+-----------------------------+------------+-----------
               1 | na                        | north america               | f          | f
               3 | emea                      | europe, middle east, africa | f          | f
               0 | as                        | Australia                   | t          | t
               4 | ap                        | asia pacific                | f          | f
               2 | sa                        | south america               | f          | f
               6 |                           | mongolia                    | t          | f
    (6 rows)

You can see the expected results. Our stored procedure has found two rows that violate the constraints we check for. In the FIELDEMPTY and DESCUPPER columns we can easily see that the row with the key 0 has both an empty field and an uppercase comment. We can also see that row 6 only violates the empty field constraint. Note that the TB1 table we created doesn’t contain any rows; it is only used as a template.

17. Finally let’s clean up our REGION table again:

    LABDB(ADMIN)=> DELETE FROM REGION WHERE R_REGIONKEY = 0 OR R_REGIONKEY = 6;

18. And let’s run our checkRegions procedure again:

    LABDB(ADMIN)=> call checkRegions();

You will see the following results:

    LABDB(ADMIN)=> call checkRegions();
    NOTICE:  Start check of Region
    NOTICE:   0 rows had an error see result set
     R_REGIONKEY |          R_NAME           |          R_COMMENT          | FIELDEMPTY | DESCUPPER
    -------------+---------------------------+-----------------------------+------------+-----------
               3 | emea                      | europe, middle east, africa | f          | f
               4 | ap                        | asia pacific                | f          | f
               1 | na                        | north america               | f          | f
               2 | sa                        | south america               | f          | f
    (4 rows)

You can see that the table is now error free and all constraint violation fields are false.
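The row checks performed by checkRegions can be replayed on the six sample rows without a database. The Python sketch below applies the same three conditions and counts the faulty rows; the function name check_regions and the in-memory list REGION are assumptions made for this illustration, and the names are stored without the blank padding of a CHARACTER(25) column.

```python
def check_regions(rows):
    """Apply the checkRegions conditions to (regionkey, name, comment)
    tuples and return the flagged rows plus the number of faulty rows."""
    result = []
    error_rows = 0
    # ORDER BY R_REGIONKEY in the procedure's loop query.
    for key, name, comment in sorted(rows, key=lambda r: r[0]):
        # Name is empty or region key is smaller than 1.
        field_empty = name == "" or key < 1
        # Comment differs from its lower case version.
        desc_upper = comment != comment.lower()
        if field_empty or desc_upper:
            error_rows += 1
        result.append((key, name, comment, field_empty, desc_upper))
    return result, error_rows


# The four original rows plus the two violating rows added in step 15.
REGION = [
    (2, "sa", "south america"),
    (1, "na", "north america"),
    (4, "ap", "asia pacific"),
    (3, "emea", "europe, middle east, africa"),
    (0, "as", "Australia"),
    (6, "", "mongolia"),
]

result, error_rows = check_regions(REGION)
print(error_rows, "rows had an error")
for row in result:
    print(row)
```

The sketch returns its rows as a list, whereas the stored procedure writes them into the temporary table referenced through REFTABLENAME.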
  • 190. Congratulations, you have finished the stored procedure lab and created two stored procedures that help you manage your database.