NENUG Apr14 Talk - data modeling for netezza

Data
Modeling
and
Netezza
Biju
Nair
NENUG
Talk
30-‐Apr-‐2014

Mo@va@on
and
Goal
• Mo@va@on
– Performance
degrada@on
with
data
volume
• Goal
– Highlight
considera@ons
while
modeling
for
NZ
2

Data
Modeling
• Logical
Data
Modeling
– Business
domain
data
representa@on
– Independent
of
DBMS
technology
– NZ
is
an
appliance
for
analy@cs
• Set
based
processing
vs
row
based
processing
• De-‐normaliza@on
• Snow
flake/Star
schema
• Physical
Data
Modeling
– Takes
in
the
DBMS
features
and
constraints
– Need
to
understand
the
DBMS
architecture
3

NZ
Architecture
Host
Snippet
Processors
Snippet
Processors
Snippet
Processors
Data
Data
Data
-‐
Parse,
Op+mize
and
Compile
query
-‐
Schedule
snippets
-‐
Distribute
data
Executes
snippets
-‐
Shared
nothing
MPP
-‐
Custom
IP
backbone
-‐ Appliance
efficiency
is
maximized
when
-‐ snippet
processors
can
run
independently
i.e.
data
independence
-‐ All
snippet
processors
are
u@lized
uniformly
4

Snippet
Processors
Data
Accelerator
(FPGA)
CPU
Compute
Host
Reads
compressed
data
Un-‐compress
Remove
columns
Restrict
rows
(where)
Perform
computa+on
Send
data
to
host
-‐ Disk
reads
are
incredibly
slow
rela@ve
to
other
components
especially
seek
@me
-‐ While
the
CPU
overhead
is
reduced,
volume
of
data
read
will
impact
performance
5

Data
Storage
Host
Snippet
Processors
Data
Extend
Page
Extend
Extend
…
Meta-‐Data
-‐
Meta
data
iden@fies
extends/pages
to
read
or
skip
6

Modeling
Priori@es
• U@liza@on
of
all
snippet
processors
– Need
to
be
able
to
u@lize
uniformly
• Maximize
MPP
capability
of
snippet
processors
– Ideally
snippet
processors
should
be
independent
• Minimize
data
read
from
disk
– Minimize
data
stored
• Improve
computa@on
in
snippet
processor
– Compounded
with
data
volume
will
help
performance
7

Snippet
Processor
U@liza@on
Data
Distribu+on
Host
Snippet
Processors
Snippet
Processors
Snippet
Processors
1,MA,1212,…
3,MA,0414,…
2,CA,0113,…
1,MA,1212,…
2,CA,0113,…
3,MA,0414,…
Distribute
by
state
Data
Skew
-‐ NZ
will
pick
one
of
the
columns
to
distribute
if
none
specified
in
table
defini@on
-‐ First
column
in
the
table
8

Snippet
Processor
U@liza@on
Data
Distribu+on
Host
1,MA,1212,…
2,CA,0113,…
3,MA,0414,…
Snippet
Processors
Snippet
Processors
Distribute
by
mo-‐yr
Snippet
Processors
1,MA,1212,…
2,CA,0113,…
3,MA,0414,…
-‐ Snippet
processors
are
u@lized
uniformly
-‐ What
if
most
of
the
query
is
on
for
the
current
month?
-‐ Processing
skew
9

Snippet
Processor
U@liza@on
Data
Distribu+on
Host
Snippet
Processors
Snippet
Processors
Snippet
Processors
1,MA,1212,…
4,CA,0414,…
2,CA,0113,…
3,MA,0414,…
1,MA,1212,…
2,CA,0113,…
3,MA,0414,…
4,CA,0414,…
Distribute
random
-‐ Snippet
processors
are
u@lized
uniformly
-‐ Helps
prevent
processing
skew
10

Snippet
Processor
U@liza@on
Data
Distribu+on
and
Table
Joins
3,ORD1,ITEM1,…
2,ORD1,ITEM1,…
4,ORD1,ITEM1,…
1,ORD1,ITEM1,…
Host
Snippet
Processors
Snippet
Processors
Snippet
Processors
4,CA,0414,…
1,MA,1212,…
3,MA,0414,…
2,CA,0113,…
1,ORD1,ITEM1,…
2,ORD1,ITEM1,…
3,ORD1,ITEM1,…
4,ORD1,ITEM1,…
Distribute
random
Need
to
redistribute
data
from
both
tables
-‐ Snippet
processors
are
u@lized
uniformly
-‐ Makes
snippet
processors
dependent
on
others
impac@ng
MPP
maximiza@on
11

Snippet
Processor
U@liza@on
Data
Distribu+on
and
Table
Joins
2,ORD1,ITEM1,…
3,ORD1,ITEM1,…
1,ORD1,ITEM1,…
4,ORD1,ITEM1,…
Host
Snippet
Processors
Snippet
Processors
Snippet
Processors
4,CA,0414,…
1,MA,1212,…
3,MA,0414,…
2,CA,0113,…
1,ORD1,ITEM1,…
2,ORD1,ITEM1,…
3,ORD1,ITEM1,…
4,ORD1,ITEM1,…
Distribute
on
Join
column
-‐
cid
-‐ Snippet
processors
are
u@lized
uniformly
-‐ Makes
snippet
processors
dependent
on
others
impac@ng
MPP
maximiza@on
-‐ Becer
than
the
previous
scenario
Need
to
redistribute
data
from
one
table
12

Snippet
Processor
U@liza@on
Data
Distribu+on
and
Table
Joins
2,ORD1,ITEM1,…
Distribute
both
tables
on
Join
column
-‐
cid
3,ORD1,ITEM1,…
1,ORD1,ITEM1,…
4,ORD1,ITEM1,…
Host
Snippet
Processors
Snippet
Processors
Snippet
Processors
1,MA,1212,…
4,CA,0414,…
2,CA,0113,…
3,MA,0414,…
-‐ Snippet
processors
are
u@lized
uniformly
-‐ Makes
snippet
processors
independent
maximizing
MPP
13

Snippet
Processor
U@liza@on
Data
Distribu+on
• Iden@fy
keys
to
distribute
data
uniformly
– Avoid
data
and
processing
skew
– Try
using
join
columns
as
the
distribu@on
keys
– Choose
same
data
types
for
join
columns
• If
table
size
is
small
random
distribu@on
is
fine
– If
one
of
the
join
table
is
small,
NZ
will
broadcast
• Redistribu@on
may
not
be
an
overkill
for
small
data
– For
e.g.,
selec@ng
a
small
number
if
columns
14

Snippet
Processor
U@liza@on
Distribu+on
and
Query
Time
3.5
3
2.5
2
1.5
1
0.5
0
Query
Time
For
Different
Distribu+ons
Random
1
Correct
Distribu@on
2
Correct
Distribu@on
Time
(min)
15

Snippet
Processor
U@liza@on
Join
Column
Type
and
Query
Time
2.5
2
1.5
1
0.5
0
Join
query
+me
-‐
same
and
diff
data
types
Incorrect
Data
Types
Correct
Data
Types
Time
(min)
16

Minimize
Data
Read
From
Disk
Zone
Maps
• Data
types
which
supports
Zone
Maps
– All
integer
data
types
• int1
• int2
• int4
• int8
– Date
– Timestamp
17
Refer
to
the
product
manual
for
the
version
of
NZ
used
for
the
complete
list
of
zone
map
able
data
types

Minimize
Data
Read
From
Disk
Table
column
(cid)
is
numeric(10,0)
Host
Snippet
Processors
Data
Extend
1
Page
Extend
2
…
Extend
3
Zone
Maps
Meta-‐Data
-‐
May
end
up
reading
all
data
from
disk
No
zone
map
for
cid
18

Minimize
Data
Read
From
Disk
Table
column
(cid)
is
bigint
Host
Snippet
Processors
Data
Extend
Page
Extend
Extend
…
Zone
Maps
Meta-‐Data
-‐
Zone
maps
can
be
used
to
minimize
data
read
from
disk
19

Minimize
Data
Read
From
Disk
Zone
Maps
and
Query
Time
1.6
1.55
1.5
1.45
1.4
1.35
1.3
Query
+me
with
and
without
zone
map
Incorrect
Data
Type
Correct
Data
Type
Time
(min)
20

Minimize
Data
Read
From
Disk
Clustered
Base
Tables
• NZ
stores
data
with
same
organize
keys
closely
• Addi@onal
data
types
are
zone
map
able
– char
– varchar
– nchar
– nvarchar
– float
– double
– bool
– @me
– Interval
• Helps
improve
performance
of
mul@
table
join
21

Minimize
Data
Read
From
Disk
Extend
Extend
Extend
Clustered
Base
Tables
Table
distributed
on
cid
1,MA,Boston,…
5,CA,LA,…
3,FL,Tampa,,…
1,MA,Salem,…
5,CA,SF,…
3,FL,Orlando,…
1,MA,Lowell,…
5,CA,Pasadena,
…
3,FL,Miami,…
Table
distributed
on
cid
organize
on
state
Extend
1,MA,Boston,…
1,MA,Salem,…
1,MA,Lowell,…
Extend
3,FL,Tampa,,…
3,FL,Orlando,…
3,FL,Miami,…
Extend
5,CA,LA,…
5,CA,SF,…
5,CA,Pasadena,
…
State
Extend
1,AL,Alabama,…
5,CA,California,
…
10,FL,Florida,…
22,MA,Mass,…
22

Minimize
Data
Read
From
Disk
Clustered
Base
Table
2.5
2
1.5
1
0.5
0
Query
Time
with
and
without
org
No
Org+Correct
Dist
Org+Correct
Dist
Time
(min)
23

Minimize
Data
Read
From
Disk
Materialized
Views
• View
with
frequently
used
columns
of
base
table
– Unlike
views,
materialized
view
stores
data
– Reduced
data
read
from
disk
– Addi@onal
storage
required
– Need
to
be
refreshed
if
base
table
data
changes
• Can
be
used
as
an
index
against
base
table
– Stores
loca@on
of
base
table
data
loc
in
a
column
24

Minimize
Data
Read
From
Disk
Extend
Extend
Extend
Materialized
Views
T_EMP
distributed
on
cid
1,MA,Mike,…
5,CA,Fally,…
3,FL,Chris,…
4,MA,Robert,…
7,CA,Mary,…
2,FL,Jus+n,…
6,MA,Harini,…
8,CA,Mike,…
9,FL,Martha,…
SELECT
ID,
NAME
FROM
T_EMP;
MV
on
T_EMP
order
by
state,
name
Extend
5,CA,Fally
7,CA,Mary
8,CA,Mike
3,FL,Chris
2,FL,Jus+n
9,FL,Martha
6,MA,Harini
1,MA,Mike
4,MA,Robert
SELECT
*
FROM
T_EMP
WHERE
STATE
=
‘CA’
25

Minimize
Data
Read
From
Disk
Materialized
Views
1.4
1.2
1
0.8
0.6
0.4
0.2
0
Query
+me
on
single
table
-‐MV
vs
No
MV
No
MV
With
MV
Time
(min)
26

Minimize
Data
Read
From
Disk
Materialized
Views
1.8
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
Join
Query
Time
-‐
With
and
without
MV
No
MV
With
MV
Time
(min)
27

Minimize
Data
Read
From
Disk
Minimize
Data
Stored
• Choose
storage
efficient
data
types
– Difference
between
bigint
and
int
is
4
bytes
• Use
char
instead
of
varchar
if
the
data
length
is
fixed
– varchar
has
a
2
byte
overhead
• Define
columns
as
“not
null”
where
possible
• Store
only
the
required
data
in
table
columns
• Encode
duplicate
data
stored
in
rows
28

Improve
Computa@on
in
Snippet
Processor
NZ
Object
Defini+ons
• Define
columns
as
“not
null”
where
possible
– Removes
logic
to
check
nulls
• Define
table
keys
and
rela@onships
– Helps
NZ
query
op@mizer
to
generate
efficient
code
29

30
bnair@asquareb.com
blog.asquareb.com
https://github.com/bijugs
@gsbiju

NENUG Apr14 Talk - data modeling for netezza

More Related Content

What's hot (20)

Viewers also liked (7)

Similar to NENUG Apr14 Talk - data modeling for netezza (20)

More from Biju Nair (8)

Recently uploaded (20)

NENUG Apr14 Talk - data modeling for netezza