Query-Driven Visualization in the Cloud with MapReduce

Bill Howe, UW
Huy Vo, Utah
Claudio Silva, Utah
Juliana Freire, Utah
YingYi Bu, UW

3/12/09 Bill Howe, UW, VisTrails + GridFields
All Science is reducing to a database problem
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, independent of hypotheses)
 Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
 Medicine: ubiquitous digital records, MRI, ultrasound
 Oceanography: high-resolution models, cheap sensors, satellites
 Biology: lab automation, high-throughput sequencing
“Increase Data Collection Exponentially in Less Time, with FlowCAM”
Empirical X → Analytical X → Computational X → X-informatics

Why “Query-driven”?
 Vis perspective:
 query = subsetting
 DB perspective:
 query = manipulation, preparation, restructuring, index-building,
aggregation, regridding, downsampling, simplification,
reformatting, etc.
Database Maxims:
1. Push the computation to the data.
2. Declarative programming is a good thing.

Why Visualization?
 Super-charged aggregation
 High bandwidth of the human visual cortex
 Query-writing presupposes a known goal
“What does the salt wedge look like?”

Why Cloud?
 “Cloud”?
 Software as a Service (SaaS)
 Infrastructure as a Service (IaaS)
 Platform as a Service (PaaS)
 Working definition:
General, elastic, data-intensive, scalable computing
This work: Vis techniques + DB techniques in the Cloud

Visualization + Data Management
“Transferring the whole data generated … to a storage device or a visualization
machine could become a serious bottleneck, because I/O would take most of the …
time. A more feasible approach is to reduce and prepare the data in situ for
subsequent visualization and data analysis tasks.”
-- SciDAC Review
We can no longer afford two separate systems

Converging Requirements
Core vis techniques (isosurfaces, volume rendering, …)
Emphasis on interactive performance
Mesh data as a first-class citizen
Vis DB

Declarative languages
Automatic data-parallelism
Algebraic optimization
Vis DB

Vis: “Query-driven Visualization”
Vis: “In Situ Visualization”
Vis: “Remote Visualization”
DB: “Push the computation to the data”
Vis DB

Thesis
 We can no longer afford to build separate
visualization and data management systems
 Data is increasingly destined for the cloud
 At least two approaches:
1. Build “cloud” Vis platform with DM capabilities
2. Extend “cloud” DM platforms with Vis capabilities
 We are assessing option 2.

This Talk
 Brief Technology Review
 Relational Databases
 MapReduce: Data-Intensive Scalable Programming
 GridFields: Mesh Algebra
 VisTrails: Workflow and Provenance
 Preliminary Results with Hadoop/MapReduce
 Climatology queries on a shared cloud
 Core vis algorithms on a private cluster

Relational Database History
Pre-Relational: if your data changed, your application broke.
Early RDBMS were buggy and slow (and often reviled), but
required only 5% of the application code.
“Activities of users at terminals and most application programs
should remain unaffected when the internal representation of data
is changed and even when some aspects of the external
representation are changed.”
-- Codd 1970
Key Ideas: Programs that manipulate tabular data exhibit an
algebraic structure allowing reasoning and manipulation
independently of physical data representation

Key Idea: Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division operator
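
The rewrite above can be sketched as a small bottom-up term rewriter. This is only an illustration, not how any particular optimizer is built: the tuple encoding of expressions and the names `simplify` and `count_ops` are assumptions made for the example.

```python
# Expressions are nested tuples (op, left, right); leaves are numbers or variable names.
# The encoding and function names are hypothetical, chosen for illustration.

def simplify(e):
    """Rewrite an expression bottom-up using the four laws on the slide."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e
    a, b = simplify(a), simplify(b)
    if op == "+" and b == 0:            # law 1: x + 0 = x
        return a
    if op == "/" and b == 1:            # law 2: x / 1 = x
        return a
    if (op == "+" and isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == "*" and b[0] == "*" and a[1] == b[1]):
        # law 3: n*x + n*y = n*(x + y); law 4 then moves the sum to the front
        return ("*", ("+", a[2], b[2]), a[1])
    return (op, a, b)

def count_ops(e):
    """Count operator nodes in an expression."""
    if not isinstance(e, tuple):
        return 0
    return 1 + count_ops(e[1]) + count_ops(e[2])

# N = ((z*2)+((z*3)+0))/1
N = ("/", ("+", ("*", "z", 2), ("+", ("*", "z", 3), 0)), 1)
optimized = simplify(N)   # ("*", ("+", 2, 3), "z"), i.e. (2+3)*z
```

The point is that the rewriter never evaluates anything: it only needs the algebraic laws, so it can run before the value of z is known.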

Key Idea: An Algebra of Tables
select
project
join join
Other operators: aggregate, union, difference, cross product
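
A minimal sketch of the three core operators over lists of dictionaries may help make the "algebra" concrete. The station and reading data are made up for the example, and the function names are assumptions, not part of any database API.

```python
def select(rows, predicate):
    """Keep only the rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]

def project(rows, columns):
    """Keep only the named columns of each row."""
    return [{c: r[c] for c in columns} for r in rows]

def join(left, right, key):
    """Equijoin on a shared column, using a hash index on the right input."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

# Hypothetical tables for illustration
stations = [{"station": "A", "depth": 5}, {"station": "B", "depth": 20}]
readings = [{"station": "A", "salt": 29.4},
            {"station": "B", "salt": 30.1},
            {"station": "A", "salt": 28.0}]

def is_shallow(r):
    return r["depth"] < 10

# Plan 1: join first, then select
plan1 = project(select(join(readings, stations, "station"), is_shallow),
                ["station", "salt"])
# Plan 2: push the select below the join (an algebraic rewrite)
plan2 = project(join(readings, select(stations, is_shallow), "station"),
                ["station", "salt"])
```

Both plans return the same rows, which is the kind of equivalence an optimizer exploits when it reorders operators.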

GridFields: An Algebra for Unstructured Grids
unstructured grids model
complex domains at multiple
scales simultaneously
…but complicate processing
Columbia River Estuary
red = high salinity (~34 psu)
blue = fresh water (~0 psu)

GridFields: Data Model
x     y     salt  temp
13.8  10.6  29.4  12.1
13.9  9.4   29.8  12.5
14.3  9.0   28.0  12.0
13.4  9.0   30.1  13.2

flux  area
11.5  3.3
13.9  5.5
13.1  4.5

GridFields: Operators
 Lifted set operations
 Union, Intersection, Cross Product
 Scan/Bind
 Read a grid/attribute from disk
 Restrict
 Remove cells that do not satisfy a predicate
 Accrete
 “Grow” a grid by including neighbors of cells
 Regrid
 Map the data of one grid onto another
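
The operators listed above can be sketched over a toy representation in which a grid is a dictionary from cell id to attributes. This is a simplification for illustration, not the GridFields implementation: the dictionary layout, the `neighbors` map, and the function signatures are all assumptions.

```python
def restrict(cells, predicate):
    """Remove cells that do not satisfy the predicate."""
    return {cid: attrs for cid, attrs in cells.items() if predicate(attrs)}

def accrete(selected, all_cells, neighbors):
    """Grow a selection by including the neighbors of each selected cell."""
    grown = dict(selected)
    for cid in selected:
        for n in neighbors.get(cid, []):
            grown[n] = all_cells[n]
    return grown

def regrid(source, target_ids, assign, aggregate, attr):
    """Map one attribute of the source cells onto target cells and aggregate."""
    buckets = {t: [] for t in target_ids}
    for attrs in source.values():
        t = assign(attrs)
        if t in buckets:
            buckets[t].append(attrs[attr])
    return {t: aggregate(vals) for t, vals in buckets.items() if vals}

def mean(vals):
    return sum(vals) / len(vals)

# Values taken from the data model table; cell ids and adjacency are hypothetical.
cells = {
    0: {"x": 13.8, "y": 10.6, "salt": 29.4, "temp": 12.1},
    1: {"x": 13.9, "y": 9.4, "salt": 29.8, "temp": 12.5},
    2: {"x": 14.3, "y": 9.0, "salt": 28.0, "temp": 12.0},
    3: {"x": 13.4, "y": 9.0, "salt": 30.1, "temp": 13.2},
}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

salty = restrict(cells, lambda a: a["salt"] > 29.5)       # cells 1 and 3
grown = accrete(salty, cells, neighbors)                  # adds cells 0 and 2
coarse = regrid(cells, ["north", "south"],
                lambda a: "north" if a["y"] > 9.2 else "south",
                mean, "salt")                             # two coarse cells
```

Because each operator takes grids in and returns a grid (or grid-shaped data), they compose the same way relational operators do.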

GridFields: Query Algebra
A = H0 : (x,y,b) ⊗ V0 : (z)
B = restrict(0, z > b) applied to A (color is depth)
C = H0 : (x,y,b) ⊗ V0 : (σ), then bind(0, surf), then apply(0, z=(surf − b) * σ) (color is salinity)
Algebraic Manipulation of Scientific Datasets,
B. Howe, D. Maier, VLDBJ 2005

GridFields: Optimization
[Bar chart: execution time in secs (axis 0 to 45) for vtk(3D), interpolate, simple, interp_o, simple_o]
But: only an 800 MB dataset

GridFields + VisTrails

UW + Utah CluE Program
 Goals
 10+-year “climatologies” at interactive speeds
 …with provenance, reproducibility, collaboration
…on a shared-nothing, commodity platform
 In general: Explore the intersection of scientific
databases and scientific visualization, at scale
 Methods
 “Cloud-Enable” two projects

GridFields: Query algebra for mesh data

VisTrails: Scientific workflow and provenance

Why MapReduce?
 Need to scale to hundreds or thousands of CPUs
 Parallel databases expensive, proprietary, difficult
 Not shown to scale to thousands of computers
 MapReduce is a lightweight framework providing automatic
 Data parallelism
 Fault-tolerance
 I/O scheduling

Why Hadoop?
[Chart labels: paraview, hadoop]

Some distributed algorithm…
Map
(Shuffle)
Reduce

MapReduce Programming Model
 Input & Output: each a set of key/value pairs
 Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
 Processes input key/value pair
 Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
 Combines all intermediate values for a particular key
 Produces a set of merged output values (usually just one)
slide source: Google, Inc.
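
The two signatures can be exercised with a single-process sketch of the model. It only mimics the map, shuffle, and reduce phases in memory; it is not Hadoop's API, and `run_mapreduce` and the sample records are hypothetical.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map to every input pair, group by key (shuffle), then reduce each group."""
    groups = defaultdict(list)
    for in_key, in_value in records:
        for out_key, intermediate in map_fn(in_key, in_value):
            groups[out_key].append(intermediate)          # shuffle
    return {k: reduce_fn(k, groups[k]) for k in sorted(groups)}

def map_fn(filename, line):
    """map (in_key, in_value) -> list(out_key, intermediate_value)"""
    station, salt = line.split()
    yield station, float(salt)

def reduce_fn(station, salts):
    """reduce (out_key, list(intermediate_value)) -> list(out_value)"""
    return [max(salts)]

# Hypothetical input: (file name, "station salinity") pairs
records = [("f1", "A 29.4"), ("f1", "B 30.1"), ("f2", "A 28.0")]
output = run_mapreduce(records, map_fn, reduce_fn)   # max salinity per station
```

In a real framework the map calls and the reduce calls each run in parallel on different machines; the programmer still writes only the two functions.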

Application Domain: Oceanography
<Vis movie>
Key idea: Zooplankton correlated with temperature

Example Query: Climatology
Average Surface Salinity by Month
Columbia River Plume 1999-2006
[Animated figure: surface salinity in psu, panels Feb and May; map labels Washington, Oregon, Columbia River]
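
One way to phrase such a climatology as a map and a reduce step is sketched below. The record layout (date string, node id, salinity) and the function names are assumptions for illustration; the talk does not show the actual job. Emitting (sum, count) pairs rather than raw values is a common choice because partial sums can be merged before the final average.

```python
from collections import defaultdict

def map_obs(date, node, salt):
    """Key each observation by (month, node); emit a (sum, count) pair."""
    month = int(date[5:7])             # assumes ISO dates such as "1999-02-03"
    yield (month, node), (salt, 1)

def reduce_avg(key, pairs):
    """Merge (sum, count) pairs for one key into an average."""
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return total / count

def climatology(observations):
    """Run map, group by key, then reduce, all in one process."""
    groups = defaultdict(list)
    for obs in observations:
        for key, pair in map_obs(*obs):
            groups[key].append(pair)
    return {key: reduce_avg(key, pairs) for key, pairs in groups.items()}

# Hypothetical observations at one surface node across several years
observations = [
    ("1999-02-03", 7, 30.0), ("2000-02-11", 7, 28.0),
    ("1999-05-20", 7, 20.0), ("2001-05-02", 7, 24.0),
]
monthly = climatology(observations)    # average salinity per (month, node)
```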

CluE Query Results
CluE: 400 node shared Hadoop platform provided by Google, IBM, NSF
4-year-old commodity hardware, suspect IO performance

Preliminary results
 Managing Hadoop jobs with VisTrails
 GridField queries in Hadoop
 Core Visualization algorithms in Hadoop

Core Vis Algorithms in MapReduce
 Scalar/Volume Rendering
 Map: Rasterization
 Reduce: Compositing, blending
 Isosurface Extraction
 Map: Isosurface Extraction
 Reduce: Combine like isovalues
 Mesh Simplification
 Map: Bin vertices
 Reduce: Collapse binned triangles
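
As a rough sketch of the rendering decomposition, map can emit one fragment per covered pixel and reduce can composite the fragments that land on the same pixel. The example below substitutes a precomputed pixel list for real rasterization and an opaque nearest-fragment test for blending; both are simplifications, and all names are hypothetical.

```python
from collections import defaultdict

def map_rasterize(triangle):
    """Stand-in for rasterization: emit (pixel, fragment) for each covered pixel."""
    for pixel in triangle["pixels"]:
        yield pixel, (triangle["depth"], triangle["color"])

def reduce_composite(pixel, fragments):
    """Composite all fragments for one pixel; here the nearest fragment wins."""
    depth, color = min(fragments, key=lambda f: f[0])
    return color

def render(triangles):
    """Group fragments by pixel (the shuffle), then composite each pixel."""
    groups = defaultdict(list)
    for tri in triangles:
        for pixel, fragment in map_rasterize(tri):
            groups[pixel].append(fragment)
    return {pixel: reduce_composite(pixel, frags) for pixel, frags in groups.items()}

# Hypothetical input: each triangle lists the pixels it covers
triangles = [
    {"pixels": [(0, 0), (1, 0)], "depth": 5.0, "color": "red"},
    {"pixels": [(1, 0), (1, 1)], "depth": 2.0, "color": "blue"},
]
image = render(triangles)
```

Keying on the pixel means each reducer owns a disjoint part of the image, so no further merge step is needed.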

ATLAS dataset

Rendering (Preliminary)
[Chart: x-axis is # of mappers; 57-node Nehalem cluster]

Isosurface Extraction (Preliminary)
[Chart axis values: 32, 48, 64, 96, 128]

Conclusions
 Converging requirements in DB and Vis communities
 Motivation exists for a “VisDB”
 declarative query + high-performance vis, at full scale
 We are evaluating Hadoop as a “substrate” for a VisDB
 Scalability and reduced development time are promising
 Interactive performance requires some changes

Acknowledgments
http://escience.washington.edu

BACKUP SLIDES

Shared Nothing Parallel Databases
 Teradata
 Greenplum
 Netezza
 Aster Data Systems
 DATAllegro (Microsoft)
 Vertica
 MonetDB (recently commercialized as “Vectorwise”)

Taxonomy of Parallel Architectures
Easiest to program, but
$$$$
Scales to 1000s of nodes

VisTrails
screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah

Version Tree
screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah

Collaboration
Bill Howe @ UW
computes salt flux
using GridFields
Erik Anderson @ Utah
adds vector
streamlines and
adjusts opacity
Bill Howe @ UW adds
an isosurface of
salinity
Peter Lawson adds
discussion of the
scientific
interpretation
Howe et al., eScience 2008

Preliminary results

Hadoop in VisTrails
 Wrap Hadoop Streaming/HDFS Operations
 Plug “PreProcess” to actual Vis Pipeline

Hadoop in VisTrails
 Provenance and Monitoring

Preliminary results

Key Idea: Declarative Languages
Find all orders from today, along with the items ordered
SELECT *
FROM Order o, Item i
WHERE o.item = i.item
AND o.date = today()
Query plan: join (o.item = i.item) of select (date = today()) over scan Order o, with scan Item i

Example System: Teradata
AMP = unit of parallelism

Step 1 (AMP 1, AMP 2, AMP 3): scan Order o, select date=today(), hash h(item), send to AMP 4, AMP 5, AMP 6

Step 2 (AMP 1, AMP 2, AMP 3): scan Item i, hash h(item), send to AMP 4, AMP 5, AMP 6

Step 3 (AMP 4, AMP 5, AMP 6): join o.item = i.item
Each join AMP contains all orders and all lines for its hash values, e.g. where hash(item) = 1
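
The repartitioned join can be imitated in a single process to show why it is correct: rows with equal item values always hash to the same partition, so each partition can be joined independently. This is a sketch, not Teradata's implementation; integer item ids, `item % n_amps` as the hash h(item), and the sample rows are assumptions.

```python
def partition(rows, n_amps):
    """Route each row to a partition by hashing its item id (h(item) = item % n_amps)."""
    parts = [[] for _ in range(n_amps)]
    for row in rows:
        parts[row["item"] % n_amps].append(row)
    return parts

def parallel_join(orders, items, today, n_amps=3):
    """Select today's orders, repartition both inputs on item, join per partition."""
    todays_orders = [o for o in orders if o["date"] == today]   # select while scanning
    order_parts = partition(todays_orders, n_amps)
    item_parts = partition(items, n_amps)
    result = []
    for o_part, i_part in zip(order_parts, item_parts):         # one pair per join AMP
        by_item = {}
        for i in i_part:
            by_item.setdefault(i["item"], []).append(i)
        for o in o_part:
            for i in by_item.get(o["item"], []):
                result.append({**o, **i})
    return result

# Hypothetical tables
orders = [
    {"order": 1, "item": 10, "date": "2009-03-12"},
    {"order": 2, "item": 11, "date": "2009-03-12"},
    {"order": 3, "item": 10, "date": "2009-03-11"},
]
items = [
    {"item": 10, "name": "buoy"},
    {"item": 11, "name": "sensor"},
    {"item": 12, "name": "cable"},
]
joined = parallel_join(orders, items, today="2009-03-12")
```

In the sketch the loop over partitions runs sequentially; on a shared-nothing system each iteration would run on its own node with no communication between them.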

Workflow Execution Plans
Need execution plans spanning client/server/cloud

Example: Isosurface Browsing

 Plan A
tstep 0, tstep 1, tstep 2, tstep 3: one Subset per time step

 Plan B: Build an index
Build Index, e.g., an Interval Tree (Cignoni 97)
tstep 0, tstep 1, tstep 2, tstep 3: each goes through Subset, then Isosurface, then Render
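
To illustrate why an index helps when browsing isovalues, the sketch below stores each cell's scalar range and answers "which cells does this isovalue cross?" without scanning cells whose minimum is too high. It is a sorted-interval list, a simpler stand-in for the interval tree named above, and the class name and sample ranges are hypothetical.

```python
import bisect

class IntervalIndex:
    """Index cells by their (min, max) scalar range; built once, queried many times."""

    def __init__(self, cell_ranges):
        # cell_ranges: {cell_id: (min_value, max_value)}
        self._by_min = sorted((lo, hi, cid) for cid, (lo, hi) in cell_ranges.items())
        self._mins = [lo for lo, _, _ in self._by_min]

    def query(self, isovalue):
        """Return ids of cells whose range contains the isovalue."""
        # Only cells with min <= isovalue can qualify; binary search finds the cutoff.
        end = bisect.bisect_right(self._mins, isovalue)
        return [cid for lo, hi, cid in self._by_min[:end] if hi >= isovalue]

# Hypothetical per-cell salinity ranges (psu)
index = IntervalIndex({0: (0.0, 10.0), 1: (5.0, 20.0), 2: (25.0, 34.0), 3: (12.0, 30.0)})
cells_at_15 = index.query(15.0)
cells_at_28 = index.query(28.0)
```

A true interval tree also bounds the work on the max side; the sketch only prunes on the min side, which is enough to show the build-once, query-often trade-off behind Plan B.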

 Plan C: Build a spatial index to support panning
 Plan D: Build a multi-resolution index to support zoom
 …and so on
 Why not precompute all appropriate indexes?
 Some will (partially) reside on client
 Storage is not as cheap as we pretend
 Need a flexible system where
 a “query result” can be explored interactively, and
 we prepare for similar queries
 similarity defined by natural “browsing patterns” in visualization
systems

Why MapReduce/Hadoop?
 Popular
 AWS Elastic MapReduce
 100s of startups
 # of downloads
 # of blog posts
 Free as in Speech
 Free as in Beer
 Flexible, Lightweight
 Scalable
 Fault-tolerant

Reducing Latency
 Online processing/progressive refinement
 Deliver approximate/partial results
 Standing Queries/Prepared plans
 Exploit indexes
Changes to Hadoop and/or other
tools required (e.g., HBase)

Masking Latency
 Caching/materialized views
 Reuse old results
 Pre-fetching
 Stage and prepare new results
 Speculative processing
 Anticipate future results
No change to Hadoop required

source: Antonio Baptista, NSF CMOP STC

Why Visualization? (2)
[Figure labels: north channel, south channel]

MapReduce?
 Hadoop simplifies parallel data processing
 ++ scalability
 ++ fault tolerance
 ++ less programming
 -- latency is an issue

Climatology Queries
[Figure (b): labels 1 to 31; scale in psu]

As a GridField Expression
C = H0 : (x,y,b) ⊗ V0 : (σ), then bind(0, surf), then apply(0, z=(surf − b) * σ)
H = Scan(contxt, "H")
rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H)
T = Scan(contxt, "T")
V = Scan(contxt, "V")
HxV = Cross(H, V)
HxVxT = Cross(HxV, T)
salt = Bind(contxt, HxVxT, "salt")
onemonth = Regrid(salt, HxV, equijoin("hpos,vpos"), avg())

As a SQL Query
Select hpos, vpos, avg(salt)
from ocean
group by hpos, vpos
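
The query can be tried on a toy table using Python's built-in sqlite3 module. The table layout (a time-step column `t` alongside `hpos`, `vpos`, `salt`) and the rows are assumptions for illustration, and ORDER BY is added only to make the result order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (horizontal position, vertical position, time step)
conn.execute("CREATE TABLE ocean (hpos INTEGER, vpos INTEGER, t INTEGER, salt REAL)")
conn.executemany(
    "INSERT INTO ocean VALUES (?, ?, ?, ?)",
    [(1, 0, 0, 30.0), (1, 0, 1, 28.0), (2, 0, 0, 10.0), (2, 0, 1, 14.0)],
)

# Average salinity over all time steps at each grid position
rows = conn.execute(
    "SELECT hpos, vpos, avg(salt) FROM ocean "
    "GROUP BY hpos, vpos ORDER BY hpos, vpos"
).fetchall()
conn.close()
```

The GROUP BY plays the role that Regrid with avg() plays in the GridField expression: it collapses the time dimension onto each (hpos, vpos) position.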

Scientific Workflow Systems
 Value proposition: More time on science, less time on code
 How: By providing language features emphasizing sharing,
reuse, reproducibility, rapid prototyping, efficiency
 Provenance
 Visual programming
 Caching
 Integration with domain-specific tools
 Scheduling

Related Vis Work
 Parallel visualization systems
 ParaView, VisIt
 Query-Driven Visualization
 [Bethel et al 2006,2008,2009]
 FastBit Index
 [Shoshani et al 2007]
 DB Vis systems
 Tableau

Feeding the Pipeline
source: Ken Moreland
missing step?

Cannot Ignore “Preprocessing”
Hadoop

Role 2: Move Computation to the Data
“Transferring the whole data generated … to a storage device or a
visualization machine could become a serious bottleneck, because I/O
would take most of the … time. A more feasible approach is to reduce
and prepare the data in situ for subsequent visualization and data
analysis tasks.”
-- SciDAC Review

Remote Visualization
 Reduce and render remotely, transfer images
 ++ transfers less data
 -- specialized hardware, high load
 Reduce remotely, transfer data/geometry, render locally
 ++ uses local graphics pipeline
 -- transfers more data

Scientific Vis System Roundup
 General
 ParaView [KitWare, Los Alamos, Sandia]
 VisIt [LLNL]
 Specialized
 SALSA, particles, Quinn, UW
 VISUS, streaming/progressive, Jones, LLNL
 SAGE,
 Hyperwall, tiled display, NASA
