Processing of raw astronomical data of large volume by
MapReduce model
Gerasimov S.V., Kolosov I.Y., Glotov E.S., Popov I.S.
Faculty of Computational Mathematics and Cybernetics Lomonosov Moscow State University
Meshcheryakov A.V.
Space Research Institute of the Russian Academy of Sciences
Digital Sky Surveys
Telescope → raw images → pipeline → catalogue
Multiwavelength astronomy
Data volumes
2010-2014 (PS1)
● 1400 megapixels
● 0.4 PB/year
● 2 PB in total
1998-2009 (SDSS)
● 120 megapixels
● 0.08 PB in total
Data volumes
2022-2032 (LSST)
● 3200 megapixels
● 15 TB of data every night
● 60 PB of raw data in 10 years
● 15 PB catalogue
2013-2018 (DES)
● 570 megapixels
● 0.1 PB/year
● 0.5 PB in total
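The quoted totals can be cross-checked with simple arithmetic. The sketch below is only a consistency check: the five-year durations, the 365 observing nights per year (an upper bound, since real surveys lose nights to weather and maintenance) and the convention 1 PB = 1000 TB are assumptions, not figures from the slides.

```python
def total_from_rate(pb_per_year, years):
    """Total volume in PB for a constant yearly rate."""
    return round(pb_per_year * years, 3)


def total_from_nightly(tb_per_night, nights_per_year, years):
    """Total volume in PB for a constant nightly rate (1 PB = 1000 TB assumed)."""
    return round(tb_per_night * nights_per_year * years / 1000.0, 3)


if __name__ == "__main__":
    print(total_from_rate(0.4, 5))          # 2010-2014 survey at 0.4 PB/year
    print(total_from_rate(0.1, 5))          # 2013-2018 survey at 0.1 PB/year
    print(total_from_nightly(15, 365, 10))  # 2022-2032 survey at 15 TB/night
```

The two yearly rates reproduce the quoted 2 PB and 0.5 PB. The nightly rate gives roughly 55 PB, the same order as the quoted 60 PB, so the 15 TB figure is presumably rounded.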
The astronomical science of big data sets
Biggest questions
● nature of dark energy
● nature of dark matter
Small effects
● require large data volumes (all-sky and high-depth imaging)
● systematics are important
Small teams
● require the ability to analyse big data sets
Our research
● Research into big data technologies for processing and storing raw astrophysical data (current report)
● Creation of an experimental prototype of a configurable astronomical image data pipeline based on MapReduce (current report)
● Research and development of machine learning algorithms and their “big” versions to solve topical astrophysical tasks:
○ estimation of the distance (and other properties) of extragalactic objects
○ star/galaxy/quasar classification
○ detection and classification of transient sky objects
○ exploration of hidden structure in the distribution of sky objects
… to help astrophysicists do their job better!
Pipeline steps
Telescope / sky survey side steps
● Raw processing: removal of CCD noise and artefacts
● Astrometric calibration: determination of the World Coordinate System (WCS)
● Photometric calibration: calibration of object intensities
Intelligent steps
● Co-addition
○ Projection
○ Background estimation and subtraction
○ Stacking
● Catalogue creation
○ Background estimation and subtraction
○ Filtering of image areas
○ Image segmentation (extraction of object groups on filtered images)
○ Deblending (extraction of objects inside groups)
○ Artefact removal (cleaning)
○ Measurement of basic object features
○ Star/galaxy classification
○ Creation of a PSF model from stars
○ Accurate measurement of all object features, taking the PSF model into account
Co-addition depth effect
[Figure: the same field as a single image and with + 53 more images co-added]
Pipeline tools: astromatic.org
● SWarp: image co-addition
● SExtractor: object extraction and measurement of object features
● PSFEx: Point Spread Function (PSF) modelling
Impact of telescope and atmospheric effects on astronomical images
[Figure courtesy of LSST PhotoSIM]
PSFEx uses stars detected on the image for empirical PSF modelling.
Distributed pipelines
● Astromatic-Wrapper: https://github.com/fred3m/astromatic_wrapper
● Wiley et al., Astronomy in the Cloud: Using MapReduce for Image Coaddition
● Farivar et al., Cloud Based Processing of Large Photometric Surveys
● Montage, an astronomical image mosaic engine: http://montage.ipac.caltech.edu/
● LSST: www.lsst.org, http://confluence.lsstcorp.org/
Proposed approach
pipeline → catalogue → machine learning, data mining
Proposed approach
[Diagram: MapReduce + HDFS cluster with a head node; SWarp, SExtractor and PSFEx with their config files]
Co-addition step
[Figure: sky grid with cells (1,1), (1,2), (2,1), (2,2)]
map(filename, image):
    if image does not belong to the target sky fragment:
        return
    projected_image = projection_and_background_subtraction(image)  # SWarp
    return ((i,j), projected_image)
reduce((i,j), list<image>):
    cell = stacking(list<image>)  # SWarp
    return ((i,j), cell)
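The map and reduce roles above can be illustrated with a small in-memory sketch. It is not the authors' implementation: SWarp is replaced by stand-ins (median subtraction for background removal, a pixel-wise mean for stacking), the image representation (a dict with `ra`, `dec` and `pixels`), the 0.5° grid pitch and the helper names `cell_of`, `map_image`, `reduce_cell` and `run` are all hypothetical, and reprojection is omitted.

```python
from collections import defaultdict

import numpy as np

CELL_DEG = 0.5  # assumed grid pitch in degrees (hypothetical)


def cell_of(ra, dec, ra0, dec0):
    """1-based (i, j) index of the grid cell containing a sky position."""
    return (int((dec - dec0) // CELL_DEG) + 1, int((ra - ra0) // CELL_DEG) + 1)


def map_image(filename, image, target):
    """Emit ((i, j), processed_pixels), or nothing if the image is off-target."""
    ra_min, ra_max, dec_min, dec_max = target
    if not (ra_min <= image["ra"] < ra_max and dec_min <= image["dec"] < dec_max):
        return []
    pixels = image["pixels"]
    processed = pixels - np.median(pixels)  # stand-in for SWarp background subtraction
    return [(cell_of(image["ra"], image["dec"], ra_min, dec_min), processed)]


def reduce_cell(key, images):
    """Stack all images of one cell; a plain mean stands in for SWarp stacking."""
    return key, np.mean(np.stack(images), axis=0)


def run(inputs, target):
    """Tiny driver imitating the shuffle: group map output by key, then reduce."""
    groups = defaultdict(list)
    for filename, image in inputs:
        for key, value in map_image(filename, image, target):
            groups[key].append(value)
    return dict(reduce_cell(k, v) for k, v in groups.items())
```

Keying the map output by cell index is what lets the framework deliver every exposure of one sky cell to a single reducer.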
Catalogue creation step
map((i,j), stacked_image):
    basic_object_features, object_icons = basic_extract(stacked_image)  # SExtractor
    psf_model = create_psf_model(basic_object_features, object_icons)  # PSFEx
    object_features = extract(stacked_image, psf_model)  # SExtractor
    return ((i,j), object_features)
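The two-pass structure of this step (rough extraction, PSF model, refined measurement) can be shown with a toy version. The functions below only borrow the names from the pseudocode; they are simplified stand-ins and do not reproduce the SExtractor or PSFEx algorithms. The local-maximum detection, the 3×3 cutouts, the averaged PSF and the threshold value are all assumptions made for illustration.

```python
import numpy as np


def basic_extract(image, threshold, half=1):
    """First pass: local maxima above threshold, plus a small cutout of each."""
    features, icons = [], []
    height, width = image.shape
    for y in range(half, height - half):
        for x in range(half, width - half):
            patch = image[y - half:y + half + 1, x - half:x + half + 1]
            if image[y, x] > threshold and image[y, x] == patch.max():
                features.append((y, x))
                icons.append(patch.copy())
    return features, icons


def create_psf_model(icons):
    """Toy PSF: mean of the cutouts after normalising each to unit sum."""
    return np.stack([icon / icon.sum() for icon in icons]).mean(axis=0)


def extract(image, features, psf, half=1):
    """Second pass: PSF-weighted (least-squares) flux for each detection."""
    objects = []
    for y, x in features:
        patch = image[y - half:y + half + 1, x - half:x + half + 1]
        flux = float((patch * psf).sum() / (psf ** 2).sum())
        objects.append({"y": y, "x": x, "flux": flux})
    return objects


def map_cell(key, stacked_image, threshold=3.0):
    """Map function for one stacked cell: detect, model the PSF, measure."""
    features, icons = basic_extract(stacked_image, threshold)
    if not features:
        return key, []
    psf = create_psf_model(icons)
    return key, extract(stacked_image, features, psf)
```

Because each stacked cell is processed independently, this step needs no reducer, which matches the map-only pseudocode above.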
Objects on cell frontier
[Figure: objects lying on the borders between cells (1,1), (1,2), (2,1), (2,2)]
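The slides do not say how border objects are handled. One common approach, sketched here purely as an assumption, is to extract each cell with an overlap margin and keep a detection only in the cell whose core region contains its centroid. The 0.5° core and 0.1° margin per side are chosen to be consistent with the 0.7° (0.5° + 0.2°) cell size listed in the experiment settings; the function names and the (row, column) index convention are hypothetical.

```python
CORE_DEG = 0.5    # assumed core width of a cell, degrees
MARGIN_DEG = 0.1  # assumed overlap on each side (0.2 in total per axis)


def owner_cell(ra, dec):
    """1-based (i, j) of the cell whose core contains the position."""
    return (int(dec // CORE_DEG) + 1, int(ra // CORE_DEG) + 1)


def cell_bounds(i, j):
    """Padded bounds (ra_min, ra_max, dec_min, dec_max) of cell (i, j)."""
    ra0 = (j - 1) * CORE_DEG
    dec0 = (i - 1) * CORE_DEG
    return (ra0 - MARGIN_DEG, ra0 + CORE_DEG + MARGIN_DEG,
            dec0 - MARGIN_DEG, dec0 + CORE_DEG + MARGIN_DEG)


def in_padded(cell, ra, dec):
    """True if the position falls inside the padded area of the cell."""
    ra_min, ra_max, dec_min, dec_max = cell_bounds(*cell)
    return ra_min <= ra < ra_max and dec_min <= dec < dec_max


def deduplicate(detections):
    """Keep each (cell, ra, dec) detection only in the cell that owns it."""
    return [d for d in detections if owner_cell(d[1], d[2]) == d[0]]
```

With these numbers an object at RA 0.52° is seen in the padded areas of both cell (1,1) and cell (1,2), but only the copy from cell (1,2) survives deduplication.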
Experiments
Stripe 82, SDSS DR12
● stripe
● run (north / south)
Experiments
12 × D12 nodes (4 cores, 28 GB RAM, 200 GB SSD each)
Co-addition
cell size: 0.7° × 0.7° (0.5° + 0.2°)
cell count: 60 × 5
mapreduce.map.memory.mb = 5 GB
mapreduce.reduce.memory.mb = 5 GB
Catalogue creation
mapreduce.map.memory.mb = 5 GB
Source FITS files (~12 MB each) were first converted to SequenceFile format to suit the HDFS block size.
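Packing many small files into larger containers is a standard remedy for the HDFS small-files problem. The sketch below shows one plausible reading of that conversion: a greedy grouping of files into batches that stay within one block. The 128 MB block size is the Hadoop 2.x default and is an assumption here, as are the function name and the file names. The slides do not describe the actual packing strategy.

```python
def pack_files(files, block_mb=128):
    """Greedily group (name, size_mb) pairs into batches of at most block_mb."""
    batches, current, used = [], [], 0.0
    for name, size in files:
        # Start a new batch when the next file would overflow the block.
        if current and used + size > block_mb:
            batches.append(current)
            current, used = [], 0.0
        current.append(name)
        used += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    fits = [(f"frame_{n:03d}.fits", 12) for n in range(25)]  # hypothetical names
    for batch in pack_files(fits):
        print(len(batch), "files")
```

With 12 MB files and a 128 MB block, ten files fit per batch, so 25 files yield batches of 10, 10 and 5.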
Scalability: co-addition
Scalability: catalogue creation
Co-addition: MapReduce metrics
Processing time, min.
Volume   map   shuffle   sort   reduce   total
14 GB    16    17        1      10       27
21 GB    17    17        1      14       33
33 GB    23    27        1      21       49
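Two derived quantities help in reading this table: the effective throughput of each run, and how much the phases overlap (the total is smaller than the sum of the phase times, which is expected because the shuffle starts while map tasks are still running). The numbers are taken from the table; the function names are illustrative.

```python
# (volume GB, map, shuffle, sort, reduce, total), all times in minutes
RUNS = [
    (14, 16, 17, 1, 10, 27),
    (21, 17, 17, 1, 14, 33),
    (33, 23, 27, 1, 21, 49),
]


def throughput(runs):
    """Effective throughput in GB per minute for each run."""
    return [round(r[0] / r[5], 3) for r in runs]


def phase_overlap(runs):
    """Minutes by which the summed phase times exceed the wall-clock total."""
    return [sum(r[1:5]) - r[5] for r in runs]


if __name__ == "__main__":
    print(throughput(RUNS))
    print(phase_overlap(RUNS))
```

Throughput rises from about 0.52 to about 0.67 GB/min as the volume grows, which suggests that fixed overheads weigh less on larger inputs. Three data points are too few to draw a firm conclusion.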
Results & next steps
● The current measurements indicate that the MapReduce pipeline is practically useful for astrophysicists
● Further experiments are scheduled to measure scalability on larger data volumes (1+ TB)
● Some enhancements to the MapReduce pipeline will be implemented and their impact on performance studied
● 2016-2017: research and development of “big” versions of machine learning and data mining algorithms for astrophysics
Acknowledgment
The project is supported by RFBR grant #15-29-07085 ofi_m.
Cloud resources were provided under the “Microsoft Azure for Research” grant.
