Bill Howe
Information School
Computer Science & Engineering
University of Washington
Big Data + Big Sim:
Query Processing over
Unstructured CFD Models
8/7/2017 Bill Howe, UW 1
Scott Moe
Applied Math
University of Washington
This morning…
• Data-intensive science in oceanography
• Background on databases and query
algebras
• Regridding: Integrating ocean models using
a database-style algebra
• If time: Responsible data science
Motivation Algebraic Optimization Regridding End
My position for this talk…
• Simulations are sources of data
• Analysis requires querying across
heterogeneous data sources, including
simulations
• The CS database community has the
right set of concepts and approaches
…but ultimately we’re just plumbers
The Fourth Paradigm
1. Empirical + experimental
2. Theoretical
3. Computational
4. Data-Intensive
Jim Gray
Nearly every field of discovery is transitioning
from “data poor” to “data rich”
Astronomy: LSST
Physics: LHC
Oceanography: OOI
Social Sciences
Biology: Sequencing
Economics
Neuroscience: EEG, fMRI
Complex
System
“Little linear windows”
Academic research
Practitioners
One view of “data science” is to streamline the discovery, interpretation,
and operationalization of semi-robust local patterns that have predictive
power for some task.
In general, these don’t exist. But in specific situations, they do.
slide: John Delaney, UW
Regional Scale Nodes
John
Delaney
10s of Gigabits/second from the ocean floor
17 federal organizations named as partners
11 Regional Associations
“a strategy for incorporating observation systems from …
near shore waters as part of … a network of observatories.”
Center for Coastal Margin
Observation and Prediction (CMOP)
Antonio
Baptista
Virtual Mekong Basin
img src: Mark Stoermer, UW Center for Environmental Visualization
Jeff
Richey
So what?
• Geosciences are transitioning from
expedition-based to observatory-based
science
• Enormous investments in integrating
sensors and models
• The big problem: ad hoc queries over
large, heterogeneous, distributed datasets
and models
So what do we do about querying across
heterogeneous sources?
Raise the level of abstraction and let the
system handle the details
Digression: Relational Database History
Pre-Relational: if your data changed, your application broke.
Early RDBMS were buggy and slow (and often reviled), but
required only 5% of the application code.
“Activities of users at terminals and most application programs should
remain unaffected when the internal representation of data is changed and
even when some aspects of the external representation are changed.”
-- Codd 1979
Key Idea: Programs that manipulate tabular data exhibit an algebraic
structure allowing reasoning and manipulation independently of physical
data representation
Key Idea: An Algebra of Tables
select
project
join join
Other operators: aggregate, union, difference, cross product
Review: Algebraic Optimization
N = ((4*2)+((4*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2: N = (2+3)*4
two operations instead of five, no division operator
Same idea works with very large tables, but the payoff is much higher
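The rewrite above can be sketched as a tiny rule-based simplifier. This is only an illustration, assuming expressions are stored as nested tuples `(op, left, right)`; the function names `simplify`, `evaluate`, and `count_ops` are made up for this sketch and are not part of any system described in the talk.

```python
# Expressions are nested tuples (op, left, right) or plain numbers.

def simplify(e):
    """Rewrite bottom-up using the algebraic laws listed on the slide."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e[0], simplify(e[1]), simplify(e[2])
    if op == "+" and b == 0:              # law 1: x + 0 = x
        return a
    if op == "/" and b == 1:              # law 2: x / 1 = x
        return a
    if (op == "+" and isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == "*" and b[0] == "*" and a[1] == b[1]):
        # law 3: n*x + n*y = n*(x + y); law 4 lets us write the factor last
        return ("*", ("+", a[2], b[2]), a[1])
    return (op, a, b)

def evaluate(e):
    """Evaluate an expression tree to a number."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e[0], evaluate(e[1]), evaluate(e[2])
    if op == "+":
        return a + b
    if op == "*":
        return a * b
    return a / b

def count_ops(e):
    """Number of operator nodes in the tree."""
    return 0 if not isinstance(e, tuple) else 1 + count_ops(e[1]) + count_ops(e[2])

N = ("/", ("+", ("*", 4, 2), ("+", ("*", 4, 3), 0)), 1)
plan = simplify(N)
print(plan, count_ops(N), count_ops(plan), evaluate(plan))
```

The simplified tree has two operators instead of five and no division, and it evaluates to the same number as the original.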
Algebraic Optimization:
Find a better logical plan
Logical plan:
δ( Π[x.name, z.name]( σ[price>100 and city='Seattle']( (Product ⋈[pid=pid] Purchase) ⋈[cid=cid] Customer ) ) )
Product(pid, name, price)
Purchase(pid, cid, store)
Customer(cid, name, city)
SELECT DISTINCT x.name, z.name
FROM Product x, Purchase y, Customer z
WHERE x.pid = y.pid and y.cid = z.cid and
x.price > 100 and z.city = 'Seattle'
Algebraic Optimization:
Find a better logical plan
Logical plan with selections pushed down:
δ( Π[x.name, z.name]( (σ[price>100](Product) ⋈[pid=pid] Purchase) ⋈[cid=cid] σ[city='Seattle'](Customer) ) )
Query optimization =
finding cheaper,
equivalent expressions
SELECT DISTINCT x.name, z.name
FROM Product x, Purchase y, Customer z
WHERE x.pid = y.pid and y.cid = z.cid and
x.price > 100 and z.city = 'Seattle'
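To make "cheaper, equivalent expressions" concrete, here is a small sketch of the two plans over in-memory lists of dicts. The table contents are invented for illustration, and the two plan functions are hypothetical stand-ins for what a query optimizer would choose between.

```python
# Toy tables following the schema on the slide; the rows are made up.
product = [
    {"pid": 1, "name": "laptop", "price": 900},
    {"pid": 2, "name": "pen", "price": 2},
]
purchase = [
    {"pid": 1, "cid": 10, "store": "A"},
    {"pid": 2, "cid": 10, "store": "A"},
    {"pid": 1, "cid": 11, "store": "B"},
]
customer = [
    {"cid": 10, "name": "Ann", "city": "Seattle"},
    {"cid": 11, "name": "Bob", "city": "Portland"},
]

def plan_late_selection():
    """Join all three tables first, then filter (the first plan)."""
    joined = [(x, y, z)
              for x in product for y in purchase if x["pid"] == y["pid"]
              for z in customer if y["cid"] == z["cid"]]
    kept = [(x, y, z) for x, y, z in joined
            if x["price"] > 100 and z["city"] == "Seattle"]
    return {(x["name"], z["name"]) for x, y, z in kept}, len(joined)

def plan_pushed_selection():
    """Filter the base tables first, then join the smaller inputs."""
    xs = [x for x in product if x["price"] > 100]
    zs = [z for z in customer if z["city"] == "Seattle"]
    joined = [(x, y, z)
              for x in xs for y in purchase if x["pid"] == y["pid"]
              for z in zs if y["cid"] == z["cid"]]
    return {(x["name"], z["name"]) for x, y, z in joined}, len(joined)

late_result, late_size = plan_late_selection()
pushed_result, pushed_size = plan_pushed_selection()
print(late_result == pushed_result, late_size, pushed_size)
```

Both plans return the same set of name pairs, but the pushed-down plan builds a smaller intermediate join result.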
Same logical expression, different physical
algorithms
Which is faster?
SELECT *
FROM Order o, Item i
WHERE o.order = i.order
Logical plan: join(o.order = i.order) over scan(Order o) and scan(Item i)
N = number of Item records, M = number of Order records
Option 1
for each record i in Item:                          O(N)
    for each record o in Order:                     O(M)
        if o.order = i.order:                       O(1)
            emit (o, i)                             O(1)
overall: O(N*M)
Option 2
for each record i in Item:                          O(N)
    insert into hashtable                           O(1)
for each record o in Order:                         O(M)
    lookup corresponding records in hashtable       O(~1)
return matching pairs
overall: O(N+M)
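The two options can be written out as a minimal Python sketch. The record layout (dicts with an `order` key) and the sample rows are assumptions for illustration only.

```python
from collections import defaultdict

def nested_loop_join(orders, items):
    """Option 1: compare every pair of records, O(N*M) comparisons."""
    return [(o, i) for i in items for o in orders if o["order"] == i["order"]]

def hash_join(orders, items):
    """Option 2: build a hash table on Item, then probe it with Order."""
    table = defaultdict(list)
    for i in items:                       # build phase, O(N)
        table[i["order"]].append(i)
    # probe phase, O(M) lookups at roughly constant cost each
    return [(o, i) for o in orders for i in table.get(o["order"], [])]

orders = [{"order": 1, "customer": "a"}, {"order": 2, "customer": "b"}]
items = [{"order": 1, "sku": "x"}, {"order": 1, "sku": "y"}, {"order": 3, "sku": "z"}]

def pair_key(pair):
    return (pair[0]["order"], pair[1]["sku"])

slow = sorted(nested_loop_join(orders, items), key=pair_key)
fast = sorted(hash_join(orders, items), key=pair_key)
print(slow == fast, len(fast))
```

Both functions implement the same logical join; only the physical algorithm differs, which is why an optimizer is free to pick between them.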
H0 : (x,y,b) ⊗ V0 : (z)
A
restrict(0, z > b)
B
color is depth
Algebraic Manipulation of Scientific Datasets,
B. Howe, D. Maier, VLDBJ 2005
H0 : (x,y,b) ⊗ V0 : ( )
apply(0, z = (surf − b) * σ)
bind(0, surf)
C
color is salinity
GridFields: An Algebra of Meshes
Example (1)
H = Scan(context, "H")
rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H)
[Figure: H and rH, colored by bathymetry; callouts label the string argument of Restrict as the predicate and 0 as the dimension]
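As a rough illustration of what a restriction at dimension 0 does, the sketch below filters nodes by a predicate and keeps only cells whose nodes all survive. The `GridField` class and `restrict_nodes` function are hypothetical and are not the GridFields library's API; dropping cells that lose a node is one plausible semantics, assumed here to keep the result well-formed.

```python
from dataclasses import dataclass

@dataclass
class GridField:
    # Hypothetical minimal structure, not the GridFields library's types.
    nodes: dict   # node id -> attribute dict, e.g. {"x": .., "y": .., "b": ..}
    cells: list   # each cell is a tuple of node ids

def restrict_nodes(predicate, gf):
    """Keep nodes satisfying the predicate, and cells whose nodes all survive."""
    kept = {nid: attrs for nid, attrs in gf.nodes.items() if predicate(attrs)}
    cells = [c for c in gf.cells if all(n in kept for n in c)]
    return GridField(kept, cells)

H = GridField(
    nodes={
        0: {"x": 330, "y": 290, "b": 5.0},
        1: {"x": 340, "y": 290, "b": 7.0},
        2: {"x": 335, "y": 300, "b": 6.0},
        3: {"x": 350, "y": 295, "b": 9.0},   # outside the x range
    },
    cells=[(0, 1, 2), (1, 3, 2)],
)
# Same bounding box as the predicate string in the example above.
rH = restrict_nodes(lambda a: 326 < a["x"] < 345 and 287 < a["y"] < 302, H)
print(sorted(rH.nodes), rH.cells)
```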
8/7/2017 howeb@stccmop.org
Example: Transect
P
Transect: Bad Query Plan
Plan: H(x,y,b) ⊗ V(z) → r(z>b) → b(s) → regrid onto P ⊗ V
1) Construct full-size 3D grid
2) Construct 2D transect grid
3) Interpolate 1) onto 2)
Transect: Optimized Plan
Plan: regrid H(x,y,b) onto P → ⊗ V(z) → b(s) → regrid onto P ⊗ V
1) Find 2D cells containing points
2) Create “stacks” of 2D cells carrying data
3) Create 2D transect grid
4) Interpolate 2) onto 3)
1) Find cells containing points in P
2) Construct “stacks” of cells
4) Interpolate
Transect: Results
[Bar chart: running time in secs (axis 0 to 45) for vtk(3D), interpolate, simple, interp_o, and simple_o; 800 MB, 1 timestep]
Back to integrating models:
What is the right abstraction?
• Claim: Everything reduces to regridding
• Model-data comparison (skill assessment)?
Regrid observations onto model mesh
• Model-model comparison?
Regrid one model’s mesh onto the other’s
• Model coupling?
Regrid a meso-scale atmospheric model onto your regional ocean model
• Visualization?
Regrid onto a 3D mesh, or regrid onto a 2D array of pixels
Status Quo
• “FTP + MATLAB”
• “Nascent Databases”
– File-based, format-specific API
– UniData’s NetCDF, HDF5
– Some IO optimization, some indexing
• “Data Servers”
– Same as file-based systems, but with RPC support
– Example: Hyrax
None of this scales
- up with data volumes
- up with number of sources
- down with developer expertise
Summary so far
• “Integration” means “regridding”
– mesh to pixels, mesh to mesh, trajectory to mesh
– satellites to models, models to models, observations to models
• Regridding is hard
– Must be easy, tolerant of unusual grids, numerically conservative, efficient
Our goal
• Define a “universal regridding” operator with nice algebraic
properties
• Use it to implement efficient distributed data sharing applications,
parallel algorithms, and more
What are some complexities we want to
hide?
• Unstructured Grids
• Numerical Conservation
• Choice of Algorithms
Washington
Oregon
Columbia River Estuary
[Figure: SciDB, Hyrax, GridFields, ESMF, and VTK/Paraview placed on a spectrum from “easy; good support” to “hard; poor support”]
Structured grids are easy
• The data model…
(Cartesian products of coordinate variables)
• …immediately implies a representation,
(multidimensional arrays)
• …an API,
(reading and writing subslabs)
• …and an efficient implementation
(address calculation using array “shape”)
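The chain from data model to implementation can be illustrated with a short sketch of row-major address calculation and subslab reads. The function names `strides`, `offset`, and `read_subslab` are illustrative assumptions, not the API of NetCDF, HDF5, or any other array library.

```python
from itertools import product

def strides(shape):
    """Row-major strides, in elements, derived from the array shape alone."""
    out, step = [], 1
    for n in reversed(shape):
        out.append(step)
        step *= n
    return tuple(reversed(out))

def offset(index, shape):
    """Flat position of a multidimensional index."""
    return sum(i * s for i, s in zip(index, strides(shape)))

def read_subslab(flat, shape, start, count):
    """Read a rectangular block given per-dimension start indices and counts."""
    ranges = [range(s, s + c) for s, c in zip(start, count)]
    return [flat[offset(idx, shape)] for idx in product(*ranges)]

shape = (2, 3, 4)
flat = list(range(24))            # values equal their own flat offsets
print(strides(shape), offset((1, 2, 3), shape))
print(read_subslab(flat, shape, start=(0, 1, 2), count=(2, 1, 2)))
```

No topology needs to be stored: neighbors and addresses follow from the shape. That is exactly what an unstructured grid lacks.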
What are some complexities we want to
hide?
• Unstructured Grids
• Numerical Conservation
• Choice of Algorithms
Naïve Method: Interpolation (Spatial Join)
For each vertex in the target grid,
Find containing cell in the source grid,
Evaluate the basis functions to interpolate
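The three steps above can be sketched for a 2D triangular mesh with linear basis functions, where evaluating the basis functions amounts to computing barycentric coordinates. This is a minimal sketch under those assumptions: it uses a brute-force cell search where a real system would use a spatial index, and the function names are made up for illustration.

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates of point p in triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    l1 = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    l2 = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return l1, l2, 1.0 - l1 - l2

def interpolate(targets, src_xy, src_tris, src_vals, eps=1e-12):
    """For each target vertex, find its containing source cell and interpolate."""
    out = []
    for p in targets:
        value = None                       # stays None outside the source domain
        for i, j, k in src_tris:           # brute-force search over source cells
            w = barycentric(p, src_xy[i], src_xy[j], src_xy[k])
            if min(w) >= -eps:             # p lies inside (or on) this triangle
                value = w[0] * src_vals[i] + w[1] * src_vals[j] + w[2] * src_vals[k]
                break
        out.append(value)
    return out

# Unit square split into two triangles, carrying f(x, y) = x + 2y at the nodes.
src_xy = [(0, 0), (1, 0), (0, 1), (1, 1)]
src_tris = [(0, 1, 2), (1, 3, 2)]
src_vals = [x + 2 * y for x, y in src_xy]
result = interpolate([(0.25, 0.25), (0.75, 0.75), (2, 2)], src_xy, src_tris, src_vals)
print(result)
```

Point interpolation like this reproduces linear fields exactly, but it makes no attempt to preserve integrals of the field, which motivates the conservative methods that follow.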
Supermeshing [Farrell 10]
For each cell in the target grid,
Find overlapping cells in the source grid,
Compute their intersections
Derive new coefficients to minimize L2 norm
* Guaranteed Conservative
* Minimizes Error
But:
Domains must match exactly
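A one-dimensional analogue shows why intersecting cells gives conservation. The sketch assumes piecewise-constant cell values on intervals, in which case the L2-optimal target value is the overlap-weighted mean of the source values. It is a simplified stand-in for illustration, not the supermeshing algorithm of [Farrell 10], and the function names are invented.

```python
def conservative_regrid(src_edges, src_vals, tgt_edges):
    """Project piecewise-constant cell values from source cells to target cells."""
    out = []
    for t0, t1 in zip(tgt_edges, tgt_edges[1:]):
        total = 0.0
        for (s0, s1), v in zip(zip(src_edges, src_edges[1:]), src_vals):
            overlap = min(t1, s1) - max(t0, s0)   # length of the intersection
            if overlap > 0:
                total += v * overlap
        out.append(total / (t1 - t0))
    return out

def integral(edges, vals):
    """Total mass of a piecewise-constant field."""
    return sum(v * (b - a) for a, b, v in zip(edges, edges[1:], vals))

src_edges, src_vals = [0.0, 1.0, 2.0, 4.0], [1.0, 3.0, 2.0]
tgt_edges = [0.0, 1.5, 4.0]                       # same domain, different cells
tgt_vals = conservative_regrid(src_edges, src_vals, tgt_edges)
print(tgt_vals, integral(src_edges, src_vals), integral(tgt_edges, tgt_vals))
```

The integral is preserved only because both grids cover the same interval; if the target extended past the source, the uncovered part would contribute no mass, which is the "domains must match" caveat in miniature.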
What are some complexities we want to
hide?
• Unstructured Grids
• Numerical Conservation
• Choice of algorithms
Finding mesh intersections
Commutativity of Regrid and Restrict:
Restrict(Regrid(X,Y)) = Regrid(Restrict(X), Restrict(Y))
G0 = Regrid(Restrict0(X), Restrict0(Y))
G1 = Regrid(Restrict1(X), Restrict1(Y))
:
GN = Regrid(RestrictN(X), RestrictN(Y))
R = Stitch(G0, G1, …, GN)
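The decomposition can be checked on a one-dimensional toy case. The sketch assumes piecewise-constant interval cells and an overlap-weighted regrid; `restrict_source` and `partitioned_regrid` are hypothetical names, the pieces run sequentially here although each could run on a separate worker, and stitching is plain concatenation.

```python
def regrid(src, tgt_edges):
    """src: list of (lo, hi, value) cells; returns overlap-weighted means."""
    out = []
    for t0, t1 in zip(tgt_edges, tgt_edges[1:]):
        mass = sum(v * max(0.0, min(t1, hi) - max(t0, lo)) for lo, hi, v in src)
        out.append(mass / (t1 - t0))
    return out

def restrict_source(src, lo, hi):
    """Keep only source cells that overlap the interval [lo, hi]."""
    return [c for c in src if c[1] > lo and c[0] < hi]

def partitioned_regrid(src, tgt_edges, pieces):
    """Regrid each restricted piece independently, then stitch the results."""
    n = len(tgt_edges) - 1
    size = -(-n // pieces)                           # ceiling division
    result = []
    for start in range(0, n, size):
        edges = tgt_edges[start:start + size + 1]    # restricted target
        local = restrict_source(src, edges[0], edges[-1])  # restricted source
        result.extend(regrid(local, edges))          # one G_k, stitched in order
    return result

src = [(0.0, 1.0, 1.0), (1.0, 2.0, 3.0), (2.0, 4.0, 2.0)]
tgt_edges = [0.0, 0.5, 1.5, 3.0, 4.0]
whole = regrid(src, tgt_edges)
stitched = partitioned_regrid(src, tgt_edges, pieces=2)
print(whole, stitched)
```

Because each piece only needs the source cells that overlap it, the restricted inputs can be shipped to different machines without changing the answer.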
“Lumping”
• Globally conservative
• Parallelizable
• Commutes with user-selected restrictions
• Masking to handle mismatched domains
Todos:
• Characterize the error relative to plain supermeshing
• Universal Regridding-as-a-Service
Outreach and Usage
• Code is available, but in transition to github
– Search “gridfields” on google code
– http://code.google.com/p/gridfields/
– C++ with Python bindings
• Integrated into the Hyrax Data Server
– OPULS project funded by NOAA
– Server-side processing of unstructured grids
• Other users
– US Geological Survey
– NOAA
• Screenshot of OPeNDAP demo
http://ec2-174-129-186-110.compute-1.amazonaws.com:8088/nc/test4.nc.nc?ugrid_restrict(0,"Y>41.5&Y<42.75&X>-68.0&X<-66.0")
Wrap up
• Integration of big data and big models is the game
• Database-style systems are about hiding complexity
and raising the level of abstraction
• A database-style query algebra for FEMs emphasizing
interpolation and regridding across data and models
made sense to us
• But more broadly: a richer infrastructure for comparing
and sharing model results and data
• One idea: “Virtual datasets” where the model is
executed in response to queries, perhaps with simpler
grids and relaxed assumptions
Propublica, May 2016
Motivation Regridding Supermeshing Database Algebras Evaluation Numerical conservation Responsible Data Science
The Special Committee on Criminal Justice Reform's
hearing on reducing the pre-trial jail population.
Technical.ly, September 2016
Philadelphia is grappling with the prospect of a racist computer algorithm
Any background signal in the
data of institutional racism is
amplified by the algorithm
operationalized by the algorithm
legitimized by the algorithm
“Should I be afraid of risk assessment tools?”
“No, you gotta tell me a lot more about yourself.
At what age were you first arrested?
What is the date of your most recent crime?”
“And what’s the culture of policing in the
neighborhood in which I grew up in?”
Amazon Prime Now Delivery Area: Atlanta Bloomberg, 2016
Amazon Prime Now Delivery Area: Boston Bloomberg, 2016
Amazon Prime Now Delivery Area: Chicago Bloomberg, 2016
The way I think about this… (1)
First decade of Data Science research and practice:
What can we do with massive, noisy, heterogeneous datasets?
Next decade of Data Science research and practice:
What should we do with massive, noisy, heterogeneous datasets?
The way I think about this… (2)
Decisions are based on two sources of information:
1. Past examples
e.g., “prior arrests tend to increase likelihood of future arrests”
2. Societal constraints
e.g., “we must avoid racial discrimination”
8/7/2017 Data, Responsibly / SciTech NW 62
We’ve become very good at automating the use of past examples
We’ve only just started to think about incorporating societal constraints
The way I think about this… (3)
How do we apply societal constraints to algorithmic
decision-making?
Option 1: Rely on human oversight
Ex: EU General Data Protection Regulation requires that a
human be involved in legally binding algorithmic decision-making
Ex: Wisconsin Supreme Court says a human must review
algorithmic decisions made by recidivism models
Issues with scalability, prejudice
Option 2: Build systems to help enforce these constraints
This is the approach we are exploring
The way I think about this… (4)
On transparency vs. accountability:
• For human decision-making, sometimes explanations are
required, improving transparency
– Supreme court decisions
– Employee reprimands/termination
• But when transparency is difficult, accountability takes over
– medical emergencies, business decisions
• As we shift decisions to algorithms, we lose both
transparency AND accountability
• “The buck stops where?”
So what do we do about it?
Fides: A platform for responsible data science
joint with Stoyanovich [US], Abiteboul [FR], Miklau [US], Sahuguet [US], Weikum [DE]
Data Curation
Novel features to support:
• Fairness
• Accountability
• Transparency
• Privacy
• Reproducibility

Big Data + Big Sim: Query Processing over Unstructured CFD Models

  • 1. Bill Howe Information School Computer Science & Engineering University of Washington Big Data + Big Sim: Query Processing over Unstructured CFD Models 8/7/2017 Bill Howe, UW 1 Scott Moe Applied Math University of Washington
  • 2. This morning… • Data-intensive science in oceanography • Background on databases and query algebras • Regridding: Integrating ocean models using a database-style algebra • If time: Responsible data science 8/7/2017 Bill Howe, UW 2 Motivation Algebraic Optimization Regridding End
  • 3. My position for this talk… • Simulations are sources of data • Analysis requires querying across heterogeneous data sources, including simulations • The CS database community has the right set of concepts and approaches …but ultimately we’re just plumbers 8/7/2017 Bill Howe, UW 3 Motivation Algebraic Optimization Regridding End
  • 4. The Fourth Paradigm 1. Empirical + experimental 2. Theoretical 3. Computational 4. Data-Intensive Jim Gray 8/7/2017 Bill Howe, UW 4 Motivation Algebraic Optimization Regridding End
  • 5. Nearly every field of discovery is transitioning from “data poor” to “data rich” Astronomy: LSST Physics: LHC Oceanography: OOI Social Sciences Biology: Sequencing Economics Neuroscience: EEG, fMRI Motivation Algebraic Optimization Regridding End
  • 6. 8/7/2017 Bill Howe, UW 6 Complex System “Little linear windows” Academic research Practitioners One view of “data science” is the streamline the discovery, interpretation, and operationalization of semi-robust local patterns that have predictive power for some task.1 In general, these don’t exist. But in specific situations, they do.
  • 7. slide: John Delaney, UW Motivation Algebraic Optimization Regridding End
  • 8. Regional Scale Nodes 8/7/2017 Bill Howe, UW 8 John Delaney 10s of Gigabits/second from the ocean floor Motivation Algebraic Optimization Regridding End
  • 9. 8/7/2017 Bill Howe, UW 9 17 federal organizations named as partners 11 Regional Associations “a strategy for incorporating observation systems from … near shore waters as part of … a network of observatories.” Motivation Algebraic Optimization Regridding End
  • 10. Center for Coastal Margin Observation and Prediction (CMOP) 8/7/2017 Bill Howe, UW 10 Antonio Baptista Motivation Algebraic Optimization Regridding End
  • 11. Virtual Mekong Basin 8/7/2017 Bill Howe, UW 11 img src: Mark Stoermer, UW Center for Environmental Visualization Jeff Richey Motivation Algebraic Optimization Regridding End
  • 12. So what? • Geosciences are transitioning from expedition-based to observatory-based science • Enormous investments in integrating sensors and models • The big problem: ad hoc queries over large, heterogeneous, distributed datasets and models 8/7/2017 Bill Howe, UW 12 Motivation Algebraic Optimization Regridding End
  • 13. So what do we do about querying across heterogeneous sources? Raise the level of abstraction and let the system handle the details 8/7/2017 Bill Howe, UW 13 Motivation Algebraic Optimization Regridding End
  • 14. Pre-Relational: if your data changed, your application broke. Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code. “Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation Digression: Relational Database History -- Codd 1979 Motivation Algebraic Optimization Regridding End
  • 15. Key Idea: An Algebra of Tables select project join join Other operators: aggregate, union, difference, cross product Motivation Algebraic Optimization Regridding End
  • 16. 16 Review: Algebraic Optimization N = ((4*2)+((4*3)+0))/1 Algebraic Laws: 1. (+) identity: x+0 = x 2. (/) identity: x/1 = x 3. (*) distributes: (n*x+n*y) = n*(x+y) 4. (*) commutes: x*y = y*x Apply rules 1, 3, 4, 2: N = (2+3)*4 two operations instead of five, no division operator Same idea works with very large tables, but the payoff is much higher Motivation Algebraic Optimization Regridding End
  • 17. 17 Algebraic Optimization: Find a better logical plan Product Purchase pid=pid price>100 and city=‘Seattle’ x.name,z.name δ cid=cid Customer Π σ Product(pid, name, price) Purchase(pid, cid, store) Customer(cid, name, city) SELECT DISTINCT x.name, z.name FROM Product x, Purchase y, Customer z WHERE x.pid = y.pid and y.cid = z.cid and x.price > 100 and z.city = ‘Seattle’ Motivation Algebraic Optimization Regridding End
  • 18. 18 Algebraic Optimization: Find a better logical plan Product Purchase pid=pid city=‘Seattle’ x.name,z.name δ cid=cid Customer Π σ price>100 σ Query optimization = finding cheaper, equivalent expressions SELECT DISTINCT x.name, z.name FROM Product x, Purchase y, Customer z WHERE x.pid = y.pid and y.cid = z.cid and x.price > 100 and z.city = ‘Seattle’ Motivation Algebraic Optimization Regridding End
  • 19. Same logical expression, different physical algorithms Which is faster? SELECT * FROM Order o, Item i WHERE o.order = i.order join scan scan o.order = i.order Order oItem i for each record i in Item: for each record o in Order: if o.order = i.order: return (r,s) Option 1 for each record i in Item: insert into hashtable for each record o in Order: lookup corresponding records in hashtable return matching pairs Option 2 O(N) O(1) O(M) O(1) O(N) O(1) O(~1) O(M) overall: O(N*M) overall: O(N+M) Motivation Algebraic Optimization Regridding End
  • 20. 3/12/09 Bill Howe, eScience Institute. GridFields: An Algebra of Meshes. Algebraic Manipulation of Scientific Datasets, B. Howe, D. Maier, VLDBJ 2005. Example A: ⊗ of H0 : (x,y,b) and V0 : (z). Example B: restrict(0, z >b); color is depth. Example C: ⊗ of H0 : (x,y,b) and V0 : ( ), then apply(0, z=(surf  b) *  ) and bind(0, surf); color is salinity.
  • 21. Example (1): H = Scan(context, "H"); rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H). The arguments to Restrict are a predicate and a dimension (here 0). The figures show H and rH; color: bathymetry.
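To make the effect of a dimension-0 restriction concrete, here is a sketch in plain Python. It is not the GridFields API: `restrict_nodes` is a hypothetical helper, and it assumes that a triangle survives only if all of its nodes satisfy the predicate.

```python
def restrict_nodes(nodes, triangles, predicate):
    """Keep nodes satisfying the predicate and the triangles they fully support.

    nodes:     list of dicts with x, y and bathymetry b
    triangles: list of 3-tuples of node indices
    """
    keep = [k for k, n in enumerate(nodes) if predicate(n)]
    renumber = {old: new for new, old in enumerate(keep)}
    new_nodes = [nodes[k] for k in keep]
    new_triangles = [tuple(renumber[v] for v in tri)
                     for tri in triangles
                     if all(v in renumber for v in tri)]
    return new_nodes, new_triangles


# Hypothetical five-node mesh; coordinates chosen around the slide's window.
nodes = [{"x": 320, "y": 290, "b": 5.0},
         {"x": 330, "y": 290, "b": 7.5},
         {"x": 340, "y": 295, "b": 9.0},
         {"x": 335, "y": 300, "b": 8.0},
         {"x": 350, "y": 295, "b": 12.0}]
triangles = [(0, 1, 3), (1, 2, 3), (2, 4, 3)]


def in_window(n):
    """The predicate from the slide."""
    return 326 < n["x"] < 345 and 287 < n["y"] < 302


r_nodes, r_triangles = restrict_nodes(nodes, triangles, in_window)
print(len(r_nodes), r_triangles)
```

Renumbering the surviving nodes keeps the restricted mesh self-contained, so it can be passed to later operators without reference to the original grid.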
  • 22. 8/7/2017 howeb@stccmop.org Example: Transect P
  • 23. Transect: Bad Query Plan. Plan tree labels: ⊗, H(x,y,b), V(z), r(z>b), b(s), regrid, ⊗, P, P ⊗ V. Steps: 1) Construct full-size 3D grid 2) Construct 2D transect grid 3) Interpolate 1) onto 2)
  • 24. Transect: Optimized Plan. Plan tree labels: P ⊗ V, V(z), P, H(x,y,b), regrid, b(s), regrid, ⊗. Steps: 1) Find 2D cells containing points 2) Create “stacks” of 2D cells carrying data 3) Create 2D transect grid 4) Interpolate 2) onto 3)
  • 25. 1) Find cells containing points in P
  • 26. 1) Find cells containing points in P 2) Construct “stacks” of cells 4) Interpolate
  • 27. Transect: Results. [Bar chart: running time in seconds (axis from 0 to 45) for vtk(3D), interpolate, simple, interp_o, and simple_o; input is 800 MB (1 timestep).]
  • 28. Back to integrating models: What is the right abstraction? • Claim: Everything reduces to regridding • Model-data comparison (skill assessment)? Regrid observations onto the model mesh • Model-model comparison? Regrid one model’s mesh onto the other’s • Model coupling? Regrid a meso-scale atmospheric model onto your regional ocean model • Visualization? Regrid onto a 3D mesh, or regrid onto a 2D array of pixels
  • 29. Status Quo • “FTP + MATLAB” • “Nascent Databases”: file-based, format-specific API; UniData’s NetCDF, HDF5; some IO optimization, some indexing • “Data Servers” (e.g., Hyrax): same as file-based systems, but with RPC support. None of this scales: not up with data volumes, not up with the number of sources, and not down with developer expertise.
  • 30. Summary so far • “Integration” means “regridding”: mesh to pixels, mesh to mesh, trajectory to mesh; satellites to models, models to models, observations to models • Regridding is hard: it must be easy to use, tolerant of unusual grids, numerically conservative, and efficient. Our goal • Define a “universal regridding” operator with nice algebraic properties • Use it to implement efficient distributed data sharing applications, parallel algorithms, and more
  • 31. What are some complexities we want to hide? • Unstructured Grids • Numerical Conservation • Choice of Algorithms
  • 33. [Map: Washington, Oregon, Columbia River Estuary]
  • 34. [Map: Washington, Oregon, Columbia River Estuary]
  • 35. [Diagram: SciDB, Hyrax, GridFields, ESMF, and VTK/Paraview placed on a scale from “easy; good support” to “hard; poor support”.]
  • 36. Structured grids are easy. The data model (Cartesian products of coordinate variables) immediately implies a representation (multidimensional arrays), an API (reading and writing subslabs), and an efficient implementation (address calculation using array “shape”).
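The “address calculation using array shape” step can be sketched in a few lines. The function name `flat_offset` is made up for this example, and row-major (C-order) layout is assumed.

```python
from itertools import product


def flat_offset(index, shape):
    """Row-major (C-order) offset of a multidimensional index.

    No lookup structure is needed: the array shape alone determines
    where every cell lives in the flat storage.
    """
    if len(index) != len(shape):
        raise ValueError("index and shape must have the same rank")
    offset = 0
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError(f"index {i} out of range for extent {n}")
        offset = offset * n + i   # Horner-style accumulation
    return offset


shape = (4, 5, 6)
print(flat_offset((2, 3, 1), shape))  # 2*30 + 3*6 + 1 = 79

# Enumerating indices in row-major order visits offsets 0, 1, 2, ...
consistent = all(flat_offset(idx, shape) == k
                 for k, idx in enumerate(product(*map(range, shape))))
```

An unstructured grid has no such formula: cell adjacency must be stored explicitly, which is one reason the tools on the previous slide differ so much in how well they support it.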
  • 37. What are some complexities we want to hide? • Unstructured Grids • Numerical Conservation • Choice of Algorithms
  • 38. Naïve Method: Interpolation (Spatial Join). For each vertex in the target grid: find the containing cell in the source grid, then evaluate the basis functions to interpolate.
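A minimal sketch of the naïve method for a 2D triangular source mesh with linear basis functions, which on a triangle are the barycentric coordinates. The brute-force cell search and the names `barycentric` and `interpolate` are assumptions for illustration; a real implementation would use a spatial index for the “find containing cell” step.

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates of point p in triangle (a, b, c)."""
    (x, y), (x1, y1), (x2, y2), (x3, y3) = p, a, b, c
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    return w1, w2, 1.0 - w1 - w2


def interpolate(targets, vertices, triangles, values, eps=1e-12):
    """For each target point, find its source triangle and interpolate.

    Returns None for target points outside the source mesh.
    """
    out = []
    for p in targets:
        result = None
        for (i, j, k) in triangles:            # spatial join, brute force
            w = barycentric(p, vertices[i], vertices[j], vertices[k])
            if min(w) >= -eps:                 # p lies inside this triangle
                result = w[0] * values[i] + w[1] * values[j] + w[2] * values[k]
                break
        out.append(result)
    return out


# Unit square split into two triangles, carrying f(x, y) = 2x + 3y + 1.
vertices = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
triangles = [(0, 1, 2), (0, 2, 3)]
values = [2 * x + 3 * y + 1 for (x, y) in vertices]

interpolated = interpolate([(0.25, 0.5), (2.0, 2.0)], vertices, triangles, values)
print(interpolated)
```

Linear interpolation reproduces a linear field exactly, which makes a convenient sanity check, but nothing in this procedure preserves the integral of the field; that gap is what the conservation discussion addresses.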
  • 40. Supermeshing [Farrell 10]. For each cell in the target grid: find the overlapping cells in the source grid, compute their intersections, and derive new coefficients to minimize the L2 norm of the error. * Guaranteed conservative * Minimizes error. But: domains must match exactly.
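The idea is easiest to see in one dimension with piecewise-constant fields, where cell intersections are interval overlaps and the L2-optimal target value is the overlap-weighted average. The sketch below is that simplified analogue, not Farrell's algorithm or any library's implementation; the names `conservative_remap` and `integral` are invented for the example.

```python
def conservative_remap(src_edges, src_values, dst_edges):
    """Remap piecewise-constant cell values from a source to a target 1D grid.

    Each target value is the average of the source field over the target
    cell, computed from the intersections with the source cells.
    """
    dst_values = []
    for j in range(len(dst_edges) - 1):
        lo, hi = dst_edges[j], dst_edges[j + 1]
        total = 0.0
        for i in range(len(src_edges) - 1):
            overlap = min(hi, src_edges[i + 1]) - max(lo, src_edges[i])
            if overlap > 0:                       # cells intersect
                total += src_values[i] * overlap
        dst_values.append(total / (hi - lo))
    return dst_values


def integral(edges, values):
    """Integral of a piecewise-constant field over its grid."""
    return sum(v * (edges[k + 1] - edges[k]) for k, v in enumerate(values))


src_edges = [0.0, 1.0, 2.0, 4.0]
src_values = [3.0, 1.0, 2.0]
dst_edges = [0.0, 1.5, 4.0]          # same domain, different cells

dst_values = conservative_remap(src_edges, src_values, dst_edges)
print(integral(src_edges, src_values), integral(dst_edges, dst_values))
```

The integral is preserved because every piece of every source cell is assigned to exactly one target cell. That bookkeeping only works when the two grids cover the same domain, which is the limitation noted on the slide.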
  • 42. What are some complexities we want to hide? • Unstructured Grids • Numerical Conservation • Choice of algorithms
  • 44. Finding mesh intersections
  • 47. Commutativity of Regrid and Restrict: Restrict(Regrid(X,Y)) = Regrid(Restrict(X), Restrict(Y)). This lets the work be partitioned: G0 = Regrid(Restrict0(X), Restrict0(Y)); G1 = Regrid(Restrict1(X), Restrict1(Y)); ...; GN = Regrid(RestrictN(X), RestrictN(Y)); R = Stitch(G0, G1, ..., GN).
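The partition-and-stitch pattern can be checked on a 1D stand-in for Regrid. Everything here is a sketch under stated assumptions: `regrid` is an overlap-weighted average on piecewise-constant 1D cells, `restrict` keeps the source cells that overlap a window, and stitching is plain concatenation. None of these are the GridFields operators themselves.

```python
def regrid(src_edges, src_values, dst_edges):
    """Overlap-weighted average of source cells onto each target cell."""
    out = []
    for j in range(len(dst_edges) - 1):
        lo, hi = dst_edges[j], dst_edges[j + 1]
        total = 0.0
        for i in range(len(src_values)):
            overlap = min(hi, src_edges[i + 1]) - max(lo, src_edges[i])
            if overlap > 0:
                total += src_values[i] * overlap
        out.append(total / (hi - lo))
    return out


def restrict(edges, values, lo, hi):
    """Keep the contiguous run of cells that overlap the window [lo, hi]."""
    keep = [k for k in range(len(values))
            if edges[k] < hi and edges[k + 1] > lo]
    first, last = keep[0], keep[-1]
    return edges[first:last + 2], values[first:last + 1]


def regrid_in_pieces(src_edges, src_values, dst_edges, pieces):
    """Regrid each restricted piece independently, then stitch the results."""
    n = len(dst_edges) - 1
    bounds = [p * n // pieces for p in range(pieces + 1)]
    stitched = []
    for a, b in zip(bounds, bounds[1:]):
        if a == b:
            continue
        sub_dst = dst_edges[a:b + 1]                          # Restrict_k(Y)
        sub_edges, sub_values = restrict(src_edges, src_values,
                                         sub_dst[0], sub_dst[-1])  # Restrict_k(X)
        stitched.extend(regrid(sub_edges, sub_values, sub_dst))    # G_k
    return stitched                                           # Stitch(G_0, ...)


src_edges = [0.0, 1.0, 2.0, 4.0, 5.0, 7.0, 8.0]
src_values = [3.0, 1.0, 2.0, 5.0, 4.0, 6.0]
dst_edges = [0.0, 0.5, 1.5, 3.0, 4.5, 6.0, 8.0]

whole = regrid(src_edges, src_values, dst_edges)
pieced = regrid_in_pieces(src_edges, src_values, dst_edges, pieces=3)
print(all(abs(u - v) < 1e-12 for u, v in zip(whole, pieced)))
```

The equality holds here because each target cell depends only on the source cells it overlaps, so each piece could run on a different worker before stitching.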
  • 49. “Lumping”
  • 52. • Globally conservative • Parallelizable • Commutes with user-selected restrictions • Masking to handle mismatched domains. Todos: • Characterize the error relative to plain supermeshing • Universal Regridding-as-a-Service
  • 53. Outreach and Usage • Code is available, but in transition to GitHub: search “gridfields” on Google Code, http://code.google.com/p/gridfields/; C++ with Python bindings • Integrated into the Hyrax Data Server: OPULS project funded by NOAA; server-side processing of unstructured grids • Other users: US Geological Survey, NOAA
  • 54. Screenshot of OPeNDAP demo: http://ec2-174-129-186-110.compute-1.amazonaws.com:8088/nc/test4.nc.nc?ugrid_restrict(0,"Y>41.5&Y<42.75&X>-68.0&X<-66.0")
  • 55. Wrap up • Integration of big data and big models is the game • Database-style systems are about hiding complexity and raising the level of abstraction • A database-style query algebra for FEMs emphasizing interpolation and regridding across data and models made sense to us • But more broadly: a richer infrastructure for comparing and sharing model results and data • One idea: “Virtual datasets” where the model is executed in response to queries, perhaps with simpler grids and relaxed assumptions
  • 56. ProPublica, May 2016. Motivation Regridding Supermeshing Database Algebras Evaluation Numerical conservation Responsible Data Science
  • 57. Technical.ly, September 2016: “Philadelphia is grappling with the prospect of a racist computer algorithm.” Photo caption: The Special Committee on Criminal Justice Reform's hearing on reducing the pre-trial jail population. Any background signal of institutional racism in the data is amplified by the algorithm, operationalized by the algorithm, and legitimized by the algorithm. “Should I be afraid of risk assessment tools?” “No, you gotta tell me a lot more about yourself. At what age were you first arrested? What is the date of your most recent crime?” “And what’s the culture of policing in the neighborhood in which I grew up in?”
  • 58. Amazon Prime Now Delivery Area: Atlanta. Bloomberg, 2016
  • 59. Amazon Prime Now Delivery Area: Boston. Bloomberg, 2016
  • 60. Amazon Prime Now Delivery Area: Chicago. Bloomberg, 2016
  • 61. The way I think about this… (1) First decade of Data Science research and practice: What can we do with massive, noisy, heterogeneous datasets? Next decade of Data Science research and practice: What should we do with massive, noisy, heterogeneous datasets?
  • 62. The way I think about this… (2) Decisions are based on two sources of information: 1. Past examples, e.g., “prior arrests tend to increase likelihood of future arrests” 2. Societal constraints, e.g., “we must avoid racial discrimination”. We’ve become very good at automating the use of past examples. We’ve only just started to think about incorporating societal constraints. 8/7/2017 Data, Responsibly / SciTech NW
  • 63. The way I think about this… (3) How do we apply societal constraints to algorithmic decision-making? Option 1: Rely on human oversight. Ex: EU General Data Protection Regulation requires that a human be involved in legally binding algorithmic decision-making. Ex: Wisconsin Supreme Court says a human must review algorithmic decisions made by recidivism models. Issues with scalability, prejudice. Option 2: Build systems to help enforce these constraints. This is the approach we are exploring.
  • 64. The way I think about this… (4) On transparency vs. accountability: • For human decision-making, sometimes explanations are required, improving transparency: Supreme Court decisions; employee reprimands/termination • But when transparency is difficult, accountability takes over: medical emergencies, business decisions • As we shift decisions to algorithms, we lose both transparency AND accountability • “The buck stops where?”
  • 65. So what do we do about it? Fides: A platform for responsible data science, joint with Stoyanovich [US], Abiteboul [FR], Miklau [US], Sahuguet [US], Weikum [DE]. Data curation; novel features to support: Fairness, Accountability, Transparency, Privacy, Reproducibility.