Round Table Introduction: Analytics on 100 TB+ catalogs

Analytics on 100 TB+ catalogs
Enabling astronomy in the era of massive survey telescopes
Mario Juric <mjuric@astro.washington.edu>
UW Astronomy | DIRAC | eScience
@mjuric

Zwicky Transient Facility
> 1000 images/night, 576 Mpix
> 300 M detected sources/night
> 1 billion objects, 75-250 measurements/object/year
> 1 M alerts/night
http://ztf.caltech.edu

> 1 TB/night (raw), 10 TB/night (processed)
> 150 GB sources/night
> 20-40 GB alerts/night

> 2.5 PB images/yr
> 37.5 TB of sources/year
> 5-10 TB of alerts/year
Starting: ~NOW!
Uncertainty: ~3x!
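
The yearly figures follow from the nightly ones if one assumes roughly 250 observing nights per year. That number is not stated on the slide; it is simply the ratio implied by 150 GB/night and 37.5 TB/year. A quick sanity check, using decimal units (1 TB = 1000 GB):

```python
# Assumption (not on the slide): ~250 usable observing nights per year,
# inferred from 37.5 TB/yr divided by 150 GB/night.
NIGHTS_PER_YEAR = 250

def per_year_tb(gb_per_night, nights=NIGHTS_PER_YEAR):
    """Convert a nightly volume in GB to a yearly volume in TB (decimal units)."""
    return gb_per_night * nights / 1000.0

images_tb = per_year_tb(10_000)                   # 10 TB/night of processed images
sources_tb = per_year_tb(150)                     # 150 GB/night of sources
alerts_tb = (per_year_tb(20), per_year_tb(40))    # 20-40 GB/night of alerts
rows_per_year = 300e6 * NIGHTS_PER_YEAR           # 300 M detected sources/night

print(f"images:  {images_tb / 1000:.1f} PB/yr")
print(f"sources: {sources_tb:.1f} TB/yr")
print(f"alerts:  {alerts_tb[0]:.0f}-{alerts_tb[1]:.0f} TB/yr")
print(f"rows:    {rows_per_year:.2e} per year")
```

Under the same assumption the source table grows by about 75 billion rows per year, before applying the ~3x uncertainty quoted above.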

Spatial Extent: the Entire Sky
Example: The sky footprint of early Pan-STARRS PS1 data

Spatial Extent: the ~Entire Sky
Example: The sky footprint of early Pan-STARRS PS1 data
(zoomed in on a “medium deep field”)

New Science: the Time Component
> Time series analysis
(classification)
> Rapid identification and
alerting on “interesting”
variability
> Identification of moving
sources
Example: RR Lyrae light curves from Székely et al. (2007)

The Wishlist: What we’re looking for in a DBMS
> Must be able to reliably store the data
> Must enable efficient batch processing
– E.g., “compute this statistic over all time series”, in ~hours
> Must enable fast extraction of individual time series
– E.g., “give me the light curve of X”, in <1 s
> Must enable fast spatial queries, fast histograms
– E.g., “give me all objects in this area on the sky”, in <1 s to start
> Must enable easy “cross matching”
– Positionally cross-match N catalogs, find neighbors
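
To make “positionally cross-match” concrete, here is a minimal brute-force sketch for two catalogs. The function names, the 1-arcsecond default radius, and the list-of-tuples catalog layout are illustrative assumptions, not part of any system discussed here. It is O(N*M), so it only illustrates the operation; at survey scale a spatial index or partitioning scheme would be needed.

```python
import math

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(min(1.0, a)))) * 3600.0

def crossmatch(cat1, cat2, radius_arcsec=1.0):
    """For each (ra, dec) in cat1, return the index of the nearest cat2 entry
    within radius_arcsec, or -1 if none is that close. Brute force, O(N*M)."""
    matches = []
    for ra1, dec1 in cat1:
        best_idx, best_sep = -1, radius_arcsec
        for j, (ra2, dec2) in enumerate(cat2):
            sep = angular_sep_arcsec(ra1, dec1, ra2, dec2)
            if sep <= best_sep:
                best_idx, best_sep = j, sep
        matches.append(best_idx)
    return matches

# Toy catalogs of (ra, dec) in degrees; the third cat_a source has no counterpart.
cat_a = [(10.0, 0.0), (50.0, 20.0), (120.0, 45.0)]
cat_b = [(50.0001, 20.0), (10.0, 0.0001), (200.0, -5.0)]
matches = crossmatch(cat_a, cat_b)
print(matches)
```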

The Wishlist: What we’re looking for in a DBMS
> Must support insertions of ~300M rows/night
> Must scale to ~100 TB+ catalogs in ~3 years
> Must be efficient in multi-user mode
> Should (must) be easy to use
– Shallow learning curve, ease of install, strong Python APIs
– Ideally easy for astronomers to replicate and manage
– SQL-like interface is a plus (declarative queries)
> Ideally, we would like to get it up and running in ~4-6 months

Options We’re Looking At
> Relational Databases
– Postgres, Oracle, qserv (experimental)
– Challenging to have tables of ~100 billion rows (expectation after ~1yr)
– Slow time-series extraction
> Parquet+Spark
– Looks like it may scale.
– Not easy to set up, steep learning curve
– No native multi-user awareness
> Custom solution (“Large Survey Database”; http://lsddb.org)
– Partitioned tree of HDF5 files (“Parquet before Parquet”) plus a Python client
– Special snowflake: will need eternal support, has no community
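
As a rough illustration of why a spatially partitioned layout helps with both spatial queries and light-curve extraction, the sketch below groups detections first by sky cell and then by object. It is a toy in-memory model under stated assumptions (a fixed 1-degree equal-angle grid, rows keyed on each object's reference position, hypothetical function names); it is not the layout used by the Large Survey Database or by any Parquet-based system.

```python
import math
from collections import defaultdict

CELL_DEG = 1.0  # hypothetical cell size; a real system would tune this

def cell_key(ra, dec, cell_deg=CELL_DEG):
    """Map a position in degrees to an integer (ra_cell, dec_cell) grid key."""
    return (int(math.floor((ra % 360.0) / cell_deg)),
            int(math.floor((dec + 90.0) / cell_deg)))

def partition(detections, cell_deg=CELL_DEG):
    """Group (obj_id, ra, dec, mjd, mag) rows by cell, then by object, so that
    one object's measurements are stored together inside a single partition.
    ra/dec are assumed to be the object's reference position."""
    cells = defaultdict(lambda: defaultdict(list))
    for obj_id, ra, dec, mjd, mag in detections:
        cells[cell_key(ra, dec, cell_deg)][obj_id].append((mjd, mag))
    return cells

def light_curve(cells, obj_id, ra, dec, cell_deg=CELL_DEG):
    """Fetch one object's time-sorted light curve by looking only in its cell."""
    return sorted(cells.get(cell_key(ra, dec, cell_deg), {}).get(obj_id, []))

detections = [
    (1, 10.2, 5.5, 58000.1, 17.3),
    (2, 10.7, 5.1, 58000.1, 19.0),
    (1, 10.2, 5.5, 57999.9, 17.1),
    (3, 200.0, -30.0, 58000.2, 15.0),
]
cells = partition(detections)
lc = light_curve(cells, 1, 10.2, 5.5)
print(lc)
```

In an on-disk version each cell would correspond to one file or directory, so a query for a small sky region or a single object touches only a handful of partitions rather than a 100-billion-row table.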

Discuss
Are there other areas that have to deal with ~a billion time series of 100+ measurements each?
What technology choices do you use to manage your data sets? What should we be looking at?

A Related Problem: Telemetry Databases
> ~100+ sensors, ≤10 Hz sampling
– ~500 MB/night
– ~150 GB/yr
> Slightly different slicing needs
– “Give me the data from all sensors in the following time window”, as opposed to “give me all the data for the following set of objects”
> Simple HDF5 may work
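
The time-window access pattern is simple when samples are stored in time order with one row per timestamp and one column per sensor. The sketch below shows the idea with plain Python lists and a binary search. The function name and the integer-millisecond timestamps are illustrative assumptions; the same slicing would apply to a time-ordered dataset in an HDF5 file.

```python
from bisect import bisect_left

def window(timestamps, samples, t_start, t_end):
    """Return (timestamps, samples) with t_start <= t < t_end.

    timestamps must be sorted ascending; samples[i] holds one reading
    per sensor taken at timestamps[i].
    """
    lo = bisect_left(timestamps, t_start)
    hi = bisect_left(timestamps, t_end)
    return timestamps[lo:hi], samples[lo:hi]

# Toy data: 3 sensors sampled at 10 Hz for one second (timestamps in ms).
timestamps = [i * 100 for i in range(10)]
samples = [[i + s for s in range(3)] for i in range(10)]

ts, rows = window(timestamps, samples, 200, 500)
print(ts)
print(rows)
```

Because the rows are time-ordered, the lookup costs two binary searches plus one contiguous read, regardless of how many sensors there are.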

The Next Problem (in 2022)
The Large Synoptic Survey Telescope
An automated 8.4-meter telescope that for 10 years will image half the sky every ~3 days, generate ~50 PB of (raw) imaging data, issue real-time alerts on any changes in the sky (~10 million/night), measure properties of ~40 billion objects in the sky (~1000 times each), and make the results available in a web-accessible database.
http://lsst.org
