How to interactively visualise and
explore a billion objects (with vaex)
Maarten Breddels
&
Amina Helmi
PyData Paris 2016
Outline
• Motivation
• Technical
• Vaex
• Conclusions
Copyright ESA/ATG medialab; background: ESO/S. Brunier
• Gaia satellite
• launched by ESA in December 2013
• determines positions, velocities and astrophysical parameters of >10^9 stars of the Milky Way
• First catalogue in September 2016
• Final catalogue ~2020
Credit: NASA/Adler/U. Chicago/Wesleyan/JPL-Caltech
image:Devin Powell
Motivation
• We will soon have the Gaia catalogue
• > 10^9 objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neural network
• Problem
• Scatter plots do not work well for 10^9 rows/objects (like Gaia)
• Work with densities/averages in 1, 2 and 3d
• Interactive?
• Zoom, pan etc.
• Explore
• selections/queries
• Subspace ranking
• Python
• rich set of libraries, becoming the default in science
Situation
• TOPCAT comes close, but is not fast enough, works with individual rows/particles, has no exploratory tools, and is written in Java.
• Your own IDL/Python code: a lot to consider to do it optimally (multicore, efficient storage, efficient algorithms, interactivity becomes complex)
• DataShader?
• We want something to visualize 10^9 rows/objects in ~1 second
• Do we need to resort to Big Data solutions?
Big data?
• Big data
• definitions vary
• Practical question to ask
• Do you need Big Data solutions?
• Many computers / grid
• Invest in software
• Or
• Can we do it on a single computer, ‘old style’?
Interactive? How?
• Interactive?
• 10^9 * 2 * 8 bytes = 16 GB ≈ 15 GiB (double is 8 bytes)
• Memory bandwidth: 10-20 GiB/s: ~1 second
• CPU: 3 GHz (but multicore, say 4): 12 * 10^9 cycles/second, i.e. ~12 cycles per row
• Few cycles per row/object, simple algorithm
• Histograms/Density grids
• Yes, but
• If it fits/cached in memory, otherwise SSD/HDD speeds (30-100 seconds)
• proper storage and reading of data
• simple and fast algorithm for binning
• Aquarius A-2: pure dark matter N-body simulation of a Milky Way-like halo
• 6 * 10^8 particles
• < 1 second
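The arithmetic on this slide can be checked with a few lines of Python. This is only a back-of-envelope sketch using the slide's round numbers (10^9 rows, two float64 columns, 10-20 GiB/s, four cores at 3 GHz); the variable names are illustrative.

```python
# Back-of-envelope numbers; all inputs are the slide's round figures.
n_rows = 10**9          # objects in the catalogue
n_columns = 2           # e.g. x and y for a 2d histogram
bytes_per_value = 8     # float64 ("double")

data_bytes = n_rows * n_columns * bytes_per_value
data_gib = data_bytes / 2**30

# Time to stream the data once through memory at 10 and 20 GiB/s.
t_slow = data_gib / 10.0
t_fast = data_gib / 20.0

# CPU budget: 4 cores at 3 GHz, spread over all rows within one second.
cycles_per_second = 4 * 3e9
cycles_per_row = cycles_per_second / n_rows

print(f"data size: {data_gib:.1f} GiB")
print(f"read time: {t_fast:.2f}-{t_slow:.2f} s")
print(f"budget:    {cycles_per_row:.0f} cycles per row")
```

With roughly a dozen cycles per row, only very simple per-row work (such as incrementing one grid cell) fits in the budget.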
How to store and read the data
• Storage: native, column based (HDF5, FITS)
• Normal (POSIX read) method:
• Allocate memory
• read from disk to memory
• Actually: from disk, to OS cache, to memory (if unbuffered, otherwise another copy)
• Wastes memory (actually disk cache)
• 15 GB of data requires 30 GB if you want to use the file system cache
• Memory mapping:
• get direct access to the OS memory cache, no copy, no setup (apart from the kernel setting up the pages)
• avoid memory copies, more cache available
• In previous example:
• copying 15 GB will take about 0.75-1.5 seconds, at 10-20 GB/s
• Can be 2-3x slower (CPU cache helps a bit)
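The difference between a conventional read and memory mapping can be sketched with NumPy. This is a minimal illustration, assuming a raw binary file holding one float64 column; the file name is hypothetical and real HDF5/FITS files need an offset into the file.

```python
import os
import tempfile
import numpy as np

# Write a small float64 column to a raw binary file (stand-in for one column
# of a column-based file; the file name is hypothetical).
path = os.path.join(tempfile.mkdtemp(), "x_column.bin")
x = np.arange(1_000_000, dtype=np.float64)
x.tofile(path)

# Conventional read: allocates a new array and copies the file contents into it.
x_copy = np.fromfile(path, dtype=np.float64)

# Memory mapping: the array is backed by the OS page cache; pages are loaded
# lazily on first access and no second copy is made in process memory.
x_mapped = np.memmap(path, dtype=np.float64, mode="r")

total = float(x_mapped.sum())
```

Both arrays hold the same values, but only `x_copy` occupies extra process memory on top of the file system cache.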
Translate to something
visual
• 1d: histogram
• 2d:
• histogram
• use colormap to map
to a color
• 3d: volume rendering
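For the 2d case, the two steps (histogram, then colormap) might look as follows. This is a sketch with synthetic data; the log scaling and the black-to-yellow colour ramp are assumptions chosen for illustration, not the colormap used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x + rng.normal(size=100_000)

# 2d histogram: counts on a 64x64 grid.
counts, xedges, yedges = np.histogram2d(x, y, bins=64, range=[[-4, 4], [-6, 6]])

# Compress the dynamic range, then normalise to [0, 1].
scaled = np.log1p(counts)
scaled /= scaled.max()

# Hypothetical two-colour map: linear blend from black to yellow.
low = np.array([0.0, 0.0, 0.0])
high = np.array([1.0, 1.0, 0.0])
rgb = low + scaled[..., np.newaxis] * (high - low)   # shape (64, 64, 3)
image = (rgb * 255).astype(np.uint8)
```

The resulting `image` array has a fixed size regardless of how many rows were binned, which is what makes densities practical where scatter plots are not.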
How to get good performance
• Pure Python
• slow, GIL
• NumPy
• numpy.histogram slow
• C extension
• fast, no GIL, slower development
• less boilerplate code: cffi
• Numba @jit(nopython=True, nogil=True)
• Python code, C performance
• (nogil didn’t exist)
600 times faster
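The binning itself needs only a few operations per row on a regular grid. The sketch below shows that algorithm in plain NumPy as a stand-in; it is not the C extension from the talk, and the function name `bin2d` and its signature are hypothetical. A compiled version would run the same arithmetic in a single loop over the rows.

```python
import numpy as np

def bin2d(x, y, limits, shape):
    """Count points on a regular 2d grid; limits = ((xmin, xmax), (ymin, ymax))."""
    (xmin, xmax), (ymin, ymax) = limits
    nx, ny = shape
    # Scale each coordinate to a fractional bin index.
    fx = (x - xmin) / (xmax - xmin) * nx
    fy = (y - ymin) / (ymax - ymin) * ny
    # Keep only points inside the half-open ranges [min, max).
    inside = (fx >= 0) & (fx < nx) & (fy >= 0) & (fy < ny)
    ix = fx[inside].astype(np.intp)
    iy = fy[inside].astype(np.intp)
    # Flatten the 2d index and count occurrences in one pass.
    flat = ix * ny + iy
    return np.bincount(flat, minlength=nx * ny).reshape(nx, ny)

x = np.array([0.1, 0.1, 0.9, 0.5, 1.5])
y = np.array([0.1, 0.1, 0.9, 0.5, 0.5])
grid = bin2d(x, y, limits=((0, 1), (0, 1)), shape=(2, 2))
```

Because the grid is regular, the bin index follows from one subtraction, one multiplication and a truncation, with no search over bin edges.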
Great… 2d histograms…
• There is more
• Weighted histogram
• Don’t sum 1’s
• Sum values
[Figure: columns x = (1, 32, 4, .., 9, 11), y = (4, 7, 41, .., 91, 61) and v = (2, 1, 4, .., 8, 3); each row adds +1 to its (x, y) bin for a count, or +v (or v^2) for a weighted histogram]
• Possibilities
• Total: flux, mass
• Mean: velocity, metallicity
• Dispersions: velocity…
• Basically statistics on a grid
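Summing 1, v and v^2 per bin is enough to derive totals, means and dispersions. The sketch below does this in one dimension with NumPy, assuming the integer bin index of each row is already known; the function name is hypothetical, and the 2d case works the same way on a flattened bin index.

```python
import numpy as np

def grid_statistics(ix, v, nbins):
    """Per-bin count, total, mean and dispersion of v, given integer bin indices ix."""
    count = np.bincount(ix, minlength=nbins)
    total = np.bincount(ix, weights=v, minlength=nbins)          # sum of v
    total_sq = np.bincount(ix, weights=v * v, minlength=nbins)   # sum of v**2
    with np.errstate(invalid="ignore", divide="ignore"):
        mean = total / count                       # NaN for empty bins
        variance = total_sq / count - mean**2
    dispersion = np.sqrt(np.clip(variance, 0, None))
    return count, total, mean, dispersion

ix = np.array([0, 0, 1, 1, 1])
v = np.array([1.0, 3.0, 2.0, 2.0, 2.0])
count, total, mean, dispersion = grid_statistics(ix, v, nbins=3)
```

Empty bins come out as NaN in the mean and dispersion, which a plotting step can render as missing.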
Examples
Exploration
• Selections
• expressions
• visual
• linked views
• Subspace ranking
Vaex: Visualization And EXploration
• A library
• python package
• ‘import vaex’
• reading of data
• multithreading
• binning (1,2,3, Nd)
• selections/queries
• server/client
• integrates with IPython notebook
• A GUI program
• Gives interactive navigation, zoom, pan
• interactive selection (lasso, rectangle)
• client
• undo/redo
Example data / vaex.example()
• Helmi and de Zeeuw 2000
• build up of a MW (stellar) halo from 33 satellites
• 3.3 * 10^6 particles
• Almost smooth in configuration space (x,y,z)
• Structure visible in E-Lz-L space
Exploration: Automated
• High dimensionality → many subspaces
• E-L, E-y, E-x, E-vx, E-vy, E-vz, E-z, E-Lz, L-y, L-x, L-vx, L-vy, L-vz, L-z, L-Lz, y-x, y-vx, y-vy, y-vz, y-z, y-Lz, x-vx, x-vy, x-vz, x-z, x-Lz, vx-vy, vx-vz, vx-z, vx-Lz, vy-vz, vy-z, vy-Lz, vz-z, vz-Lz, z-Lz
• Can we automate this / at least help?
• Ranking subspaces using Mutual Information
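The list of 2d subspaces on this slide is every pair drawn from nine dimensions, which can be generated rather than typed out. A small sketch using the dimension names from the slide:

```python
from itertools import combinations

# The nine dimensions that appear in the slide's list of pairs.
dimensions = ["E", "L", "y", "x", "vx", "vy", "vz", "z", "Lz"]

# Every unordered pair is one 2d subspace to inspect.
subspaces = list(combinations(dimensions, 2))
print(len(subspaces))   # 36
```

The count grows quadratically with the number of columns, which is why inspecting each pair by eye stops being practical.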
Mutual information
• Measure the information loss between p(x,y) and p(x)p(y)
• information loss measured using the KL divergence:
I(X;Y) = ∫∫ p(x,y) log[ p(x,y) / (p(x) p(y)) ] dx dy
• In short:
• How much does breaking up the correlation (not just linear) change the density distribution?
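On a binned density, mutual information reduces to a sum over grid cells. The sketch below estimates it from a 2d histogram with NumPy; it is a plain plug-in estimate on synthetic data, not necessarily how vaex computes its ranking, and the function name is hypothetical.

```python
import numpy as np

def mutual_information(counts):
    """Mutual information (in nats) estimated from a 2d histogram of counts."""
    pxy = counts / counts.sum()              # joint p(x, y)
    px = pxy.sum(axis=1, keepdims=True)      # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)      # marginal p(y)
    independent = px * py                    # p(x) p(y)
    filled = pxy > 0                         # empty bins contribute 0
    return float(np.sum(pxy[filled] * np.log(pxy[filled] / independent[filled])))

rng = np.random.default_rng(1)
a = rng.normal(size=200_000)
b = rng.normal(size=200_000)
uncorrelated, _, _ = np.histogram2d(a, b, bins=32)
correlated, _, _ = np.histogram2d(a, a + 0.3 * b, bins=32)
mi_low = mutual_information(uncorrelated)
mi_high = mutual_information(correlated)
```

Sorting subspaces by this value puts strongly structured pairs first and nearly independent pairs last.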
Ranking by Mutual information
[Figure: panels showing p(x,y) and p(x)p(y) for two subspaces, with their mutual information I(X;Y)]
• x,y: I(X;Y) = 0.003
• Lz,L: I(X;Y) = 0.837
Expressions/Virtual columns
• Not just visualize existing columns
• mathematical operations
• sqrt(x**2+y**2)
• log(r)
• Virtual columns
• r=sqrt(x**2+y**2)
• Suggest common ones
• equatorial to galactic coordinates
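The idea of a virtual column (an expression that is stored and only evaluated when needed) can be sketched with a toy class. This is a hypothetical illustration, not the vaex API: the class `Table`, its method names and the use of `eval` are all assumptions, and `eval` should not be used on untrusted input.

```python
import numpy as np

class Table:
    """Toy column store with virtual columns (a hypothetical sketch, not the vaex API)."""

    def __init__(self, **columns):
        self.columns = columns       # name -> NumPy array
        self.virtual = {}            # name -> expression string

    def add_virtual_column(self, name, expression):
        self.virtual[name] = expression   # stored, not computed

    def evaluate(self, expression):
        # Names resolve to real columns, virtual columns (evaluated on demand,
        # in the order they were added) or a few whitelisted math functions.
        scope = {"sqrt": np.sqrt, "log": np.log, **self.columns}
        for name, virtual_expression in self.virtual.items():
            scope[name] = eval(virtual_expression, {"__builtins__": {}}, scope)
        return eval(expression, {"__builtins__": {}}, scope)

t = Table(x=np.array([3.0, 0.0]), y=np.array([4.0, 1.0]))
t.add_virtual_column("r", "sqrt(x**2+y**2)")
r = t.evaluate("r")
log_r = t.evaluate("log(r)")
```

Nothing is stored for `r` beyond its expression, so a virtual column costs no memory until it is used.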
How to get the data
• Gaia: ~1-2 TB
• download
• torrent
• Do you need to?
• server/client?
Client/Server
• 2d histogram example
• raw data: 15 GB
• binned 256x256 grid
• 500 KB, or 10-100 KB compressed
• Proof of concept
• over HTTP / WebSocket
• vaex webserver [file …]
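The size argument for client/server can be reproduced on a small scale: the server keeps the raw rows and sends only the binned grid. The sketch below uses synthetic data and zlib as an assumed compression step; the actual wire format of the vaex webserver is not shown on the slide.

```python
import zlib
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1_000_000)
y = rng.normal(size=1_000_000)

# The server bins the raw rows; only the grid is sent to the client.
grid, _, _ = np.histogram2d(x, y, bins=256, range=[[-10, 10], [-10, 10]])

raw_bytes = x.nbytes + y.nbytes            # what stays on the server
grid_bytes = grid.nbytes                   # 256 * 256 * 8 bytes
compressed_bytes = len(zlib.compress(grid.tobytes()))

print(raw_bytes, grid_bytes, compressed_bytes)
```

The grid size is fixed by its resolution, so the payload stays the same whether the server holds a million rows or a billion.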
Random subset / shuffled storage
…
dask?
wendelin?
Get vaex
• standalone binary (OS X, Linux) (just download and start)
• www.astro.rug.nl/~breddels/vaex (or google ‘vaex
visualisation’)
• www.github.com/maartenbreddels/vaex
• In Python
• Vanilla Python
• pip install --user --pre vaex
• Anaconda
• conda install -c maartenbreddels vaex
Summary
• vaex: visualisation and exploration
• of large datasets: 10^6 to 10^9 objects
• fast: > 10^9 objects/second
• 1, 2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections + linked views or automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N-body)