Analytics on 100 TB+ catalogs
Enabling astronomy in the era of massive survey telescopes
Mario Juric <mjuric@astro.washington.edu>
UW Astronomy | DIRAC | eScience
@mjuric
Zwicky Transient Facility
> 1000 images/night, 576 Mpix
> 300 M detected sources/night
> 1 billion objects, 75-250 measurements/object/year
> 1 M alerts/night
http://ztf.caltech.edu
Zwicky Transient Facility
> 1 TB/night (raw), 10 TB (processed)
> 150 GB sources/night
> 20-40 GB alerts/night
http://ztf.caltech.edu
Zwicky Transient Facility
> 2.5 PB images/yr
> 37.5 TB of sources/year
> 5-10 TB of alerts/year
Starting: ~NOW!
Uncertainty: ~3x!
http://ztf.caltech.edu
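A quick sanity check of how the nightly rates roll up into the yearly totals. This is only a back-of-envelope sketch: the slides do not state the number of observing nights, so the ~250 nights/year used here is an assumption, chosen because it reproduces the quoted yearly figures. Decimal units (1 TB = 1000 GB) are assumed as well.

```python
# Back-of-envelope check of the yearly volumes quoted on the slides.
# ASSUMPTION: ~250 observing nights per year (not stated on the slides;
# it is the value that reproduces the quoted yearly totals).
NIGHTS_PER_YEAR = 250

# ASSUMPTION: decimal units.
GB_PER_TB = 1000
TB_PER_PB = 1000

# Nightly rates taken from the slides.
processed_images_tb_per_night = 10
sources_gb_per_night = 150
alerts_gb_per_night = (20, 40)          # quoted as a range
source_rows_per_night = 300_000_000     # detected sources per night

images_pb_per_year = processed_images_tb_per_night * NIGHTS_PER_YEAR / TB_PER_PB
sources_tb_per_year = sources_gb_per_night * NIGHTS_PER_YEAR / GB_PER_TB
alerts_tb_per_year = tuple(gb * NIGHTS_PER_YEAR / GB_PER_TB for gb in alerts_gb_per_night)
source_rows_per_year = source_rows_per_night * NIGHTS_PER_YEAR

print(f"images:  {images_pb_per_year} PB/yr")
print(f"sources: {sources_tb_per_year} TB/yr ({source_rows_per_year:.1e} rows/yr)")
print(f"alerts:  {alerts_tb_per_year[0]}-{alerts_tb_per_year[1]} TB/yr")
```

With the quoted ~3x uncertainty, each of these totals should be read as an order-of-magnitude estimate rather than a hard number.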
Spatial Extent: the Entire Sky
Example: The sky footprint of early Pan-STARRS PS1 data
Spatial Extent: the ~Entire Sky
Example: The sky footprint of early Pan-STARRS PS1 data
(zoomed in on a “medium deep field”)
New Science: the Time Component
> Time series analysis (classification)
> Rapid identification of, and alerting on, “interesting” variability
> Identification of moving sources
Example RR Lyrae light curves from Székely et al. (2007)
The Wishlist: What we’re looking for in a DBMS
> Must be able to reliably store the data
> Must enable efficient batch processing
– E.g., “compute this statistic over all time series”, in ~hours
> Must enable fast extraction of individual time series
– E.g., “give me the light curve of X”, in <1s
> Must enable fast spatial queries, fast histograms
– E.g., “give me all objects in this area on the sky”, in <1s to start
> Must enable easy “cross matching”
– Positionally cross-match N catalogs, find neighbors
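To make the last item concrete, here is a minimal sketch of what a positional cross-match between two catalogs means. It is a brute-force illustration only, not how any of the candidate systems do it: the function names, the tuple layout `(id, ra_deg, dec_deg)`, and the 1 arcsec radius are all assumptions made for the example. A real system at this scale would need a spatial index or partitioning rather than an O(N×M) loop.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2)
    sin_dra = math.sin((ra2 - ra1) / 2)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def cross_match(catalog_a, catalog_b, radius_arcsec=1.0):
    """Map each id in catalog_a to the id of the nearest catalog_b object
    within radius_arcsec, or to None if there is no such neighbor.

    Catalogs are iterables of (id, ra_deg, dec_deg). Brute force: O(N*M).
    """
    radius_deg = radius_arcsec / 3600.0
    matches = {}
    for id_a, ra_a, dec_a in catalog_a:
        best_id, best_sep = None, radius_deg
        for id_b, ra_b, dec_b in catalog_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_id, best_sep = id_b, sep
        matches[id_a] = best_id
    return matches

# Tiny made-up catalogs: a1 has a neighbor ~0.5 arcsec away, a2 has none.
catalog_a = [("a1", 10.0000, 20.0000), ("a2", 150.0, -5.0)]
catalog_b = [("b1", 10.0001, 20.0001), ("b2", 200.0, 45.0)]
matches = cross_match(catalog_a, catalog_b, radius_arcsec=1.0)
print(matches)
```

The “find neighbors” case is the same loop run between a catalog and itself, skipping the trivial self-match.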
The Wishlist: What we’re looking for in a DBMS
> Must support insertions of ~300M rows/night
> Must scale to ~100TB+ catalogs in ~3 years
> Efficient in multi-user mode
> Should (must) be easy to use
– Shallow learning curve, ease of install, strong Python APIs
– Ideally easily replicated and manageable by astronomers.
– SQL-like interface is a plus (declarative queries)
> Ideally would like to be able to get it up and running in ~4-6 months.
Options We’re Looking At
> Relational Databases
– Postgres, Oracle, qserv (experimental)
– Challenging to have tables of ~100 billion rows (expected after ~1 yr)
– Slow time-series extraction
> Parquet+Spark
– Looks like it may scale.
– Not easy to set up, steep learning curve
– No native multi-user awareness
> Custom solution (“Large Survey Database”; http://lsddb.org)
– Partitioned tree of HDF5 files (“Parquet before Parquet”) plus a Python client
– Special snowflake, will need eternal support, no community.
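One way to picture the “partitioned tree” idea, and why it helps with time-series extraction: if every detection is filed under a key derived from its sky position, all measurements of one object end up in the same partition, so a light curve can be read from one place instead of being gathered from a giant table. The sketch below is a toy illustration under assumed names and an assumed 1-degree grid; it is not the actual layout used by the Large Survey Database, Parquet, or any other system named above, and it uses in-memory dictionaries where a real system would use files.

```python
import math
from collections import defaultdict

CELL_SIZE_DEG = 1.0  # hypothetical cell size, chosen only for illustration

def cell_key(ra_deg, dec_deg, cell_size=CELL_SIZE_DEG):
    """Return a directory-like partition key for a sky position (degrees)."""
    ra_idx = int(math.floor((ra_deg % 360.0) / cell_size))
    dec_idx = int(math.floor((dec_deg + 90.0) / cell_size))
    return f"dec{dec_idx:03d}/ra{ra_idx:03d}"

def partition(detections):
    """Group (object_id, ra, dec, mjd, mag) rows by sky cell, then by object."""
    tree = defaultdict(lambda: defaultdict(list))
    for object_id, ra, dec, mjd, mag in detections:
        tree[cell_key(ra, dec)][object_id].append((mjd, mag))
    return tree

# Made-up detections: objects 1 and 2 fall in the same cell, object 3 elsewhere.
detections = [
    (1, 10.2, 20.7, 58000.1, 17.3),
    (2, 10.9, 20.1, 58000.1, 18.0),
    (1, 10.2, 20.7, 58001.1, 17.5),
    (3, 250.5, -30.5, 58000.2, 16.1),
]
tree = partition(detections)

# "Give me the light curve of object 1": only one partition is touched.
light_curve = sorted(tree[cell_key(10.2, 20.7)][1])
print(cell_key(10.2, 20.7), light_curve)
```

The same key also serves spatial queries: “all objects in this area” reduces to listing the cells that overlap the area and reading only those.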
Discuss
Are there other areas that have to deal with ~a billion time series of 100+ measurements each?
What technology choices do you use to manage your data sets?
What should we be looking at?
A Related Problem: Telemetry Databases
> ~100+ sensors, <=10 Hz sampling
– ~500 MB/night
– ~150 GB/yr
> Slightly different slicing needs
– “Give me the data from all sensors in the following time window”, as opposed to “give me all the data for the following set of objects”
> Simple HDF5 may work
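The difference in slicing can be sketched in a few lines. If telemetry is stored in time order with one column per sensor, a time-window query is a single contiguous slice found by binary search, which is why a simple file format may be enough here. The sensor names and values below are made up, and plain Python lists stand in for whatever on-disk arrays would actually be used.

```python
from bisect import bisect_left, bisect_right

# Hypothetical telemetry table: one shared, sorted time axis (seconds since
# start of night) and one column of readings per sensor.
timestamps = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]  # 10 Hz sampling
readings = {
    "dome_temp_c": [11.0, 11.0, 11.1, 11.1, 11.2, 11.2],
    "wind_speed_ms": [3.2, 3.4, 3.1, 3.0, 3.3, 3.5],
}

def time_window(timestamps, readings, t_start, t_end):
    """Return data from all sensors with t_start <= t <= t_end.

    Because rows are time-ordered, the window is one contiguous slice,
    located with two binary searches.
    """
    lo = bisect_left(timestamps, t_start)
    hi = bisect_right(timestamps, t_end)
    window = {name: values[lo:hi] for name, values in readings.items()}
    return timestamps[lo:hi], window

times, window = time_window(timestamps, readings, 0.1, 0.3)
print(times, window)
```

The catalog case is the opposite: the natural slice there is per object, so the same time-ordered layout would scatter each light curve across the whole data set.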
The Next Problem (in 2022)
The Large Synoptic Survey Telescope
An automated 8.4-meter telescope that for 10 years will image half the sky every ~3 days, generate ~50 PB of (raw) imaging data, issue real-time alerts on any changes in the sky (~10 million/night), measure properties of ~40 billion objects in the sky (~1000 times each), and make the results available in a web-accessible database.
http://lsst.org


Round Table Introduction: Analytics on 100 TB+ catalogs
