Analyzing Andromeda
Galaxy data using Spark
Jose Nandez
SHARCNET – University of Western Ontario
jnandez@sharcnet.ca
What is SHARCNET?
• Shared Hierarchical Academic Research
NETwork,
• A consortium of 18 Ontario academic
institutions, led by the University of Western
Ontario
• Partner of Compute Canada that
oversees funding and distribution of
equipment.
• Sysadmins and HPC specialists, 20 in
total, distributed across 6 institutions.
What does SHARCNET do?
• Provides service and support to all SHARCNET
researchers in High Performance Computing.
• Researchers are part of partner universities across
Ontario.
• Starting to provide service for large data needs:
– With storage and processing of large data sets
– Data processing using Spark, Hadoop, etc.
– Data mining and Machine Learning
What is the Andromeda Galaxy?
• Known as M31, or Messier 31
• Spiral galaxy
• 2.5 million light-years away
• Closest major galaxy to ours
• Bigger galaxy than ours
Why Andromeda galaxy?
• Cool wallpaper
• T-shirts
• Mugs …
• Science?
Andromeda Galaxy in Science
• It has ~1 trillion stars
• 2.5 times longer than our galaxy
• Thought to have merged with another galaxy
• It contains about 26 known black holes
• It can be used as a galaxy laboratory for
extragalactic astronomy
• Our galaxy will collide with it (in about 4 billion
years)
Why Andromeda?
Particularly…
• The extended reach of Andromeda
has been recognized.
• The area shows the
extent of the galaxy,
farther than previously thought.
• M. Rafiei Ravandi et al.
2016.
Extended Andromeda
• The observations were taken with
Spitzer-IRAC, the infrared camera
on the Spitzer Space Telescope.
• The catalog has 426,529 new
sources.
• Extends observations
for disc and halo.
Classification of these objects
• Are all these sources (426,529) part
of Andromeda?
• Are they all known from previous
catalogs?
• What types of objects (such as black holes,
galaxies, etc.) are those new sources?
• What can we learn from these new
objects?
Which catalogs?
• Astronomical databases:
– SIMBAD (39,022)
– NED (126,862)
– MAST (118,854,914)
• Only sources around M31;
sources are at different
wavelengths (IR, optical, UV)
• Then compare them with the
observed objects.
How hard could it be?
We defined 2″, or 2 arcsec, as a good match.
1 arcsec = 1/3600°, an angular measurement,
not a linear measurement (such as miles/km).
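The 2 arcsec criterion can be illustrated with a short sketch. This is plain Python rather than the code used in the talk; the function names `separation_arcsec` and `is_good_match` are hypothetical, and positions are assumed to be (RA, Dec) in decimal degrees.

```python
import math

MATCH_RADIUS_ARCSEC = 2.0  # the "good match" threshold quoted on the slide


def separation_arcsec(ra1, dec1, ra2, dec2):
    """Angular separation in arcsec between two sky positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula: numerically stable for the tiny angles used here.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    angle = 2 * math.asin(min(1.0, math.sqrt(h)))
    return math.degrees(angle) * 3600.0  # 1 degree = 3600 arcsec


def is_good_match(ra1, dec1, ra2, dec2, radius=MATCH_RADIUS_ARCSEC):
    """True when the two positions lie within the match radius."""
    return separation_arcsec(ra1, dec1, ra2, dec2) <= radius
```

Because the measurement is angular, an offset in RA counts for less than the same offset in Dec: on the sky it is shrunk by a factor of cos(Dec), so comparing raw coordinate differences would be misleading.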
key-value = ((RA, DEC), Observations) →
join(), groupByKey(),
filter(), map(), sortByKey()
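One way such a key-value crossmatch might be organised is sketched below in plain Python, with comments naming the Spark operation each step stands in for. The slides do not say how the (RA, DEC) keys were built, so the grid-cell key, the cell size, the record layout `(name, ra, dec)`, and every function name here are assumptions made for illustration only.

```python
import math
from collections import defaultdict

MATCH_RADIUS_ARCSEC = 2.0
# Assumed cell size: twice the match radius. With a 3x3 neighbourhood this
# covers every possible match while cos(Dec) >= 0.5 (|Dec| <= 60 degrees).
# RA wrap-around at 0/360 degrees is ignored; M31 lies far from it.
CELL_DEG = 2 * MATCH_RADIUS_ARCSEC / 3600.0


def separation_arcsec(ra1, dec1, ra2, dec2):
    """Haversine angular separation in arcsec; inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0


def cell_key(ra, dec):
    """Hypothetical key: integer grid cell containing the position."""
    return (math.floor(ra / CELL_DEG), math.floor(dec / CELL_DEG))


def crossmatch(observed, catalog):
    """Return (observed_id, catalog_name, separation) for the closest match."""
    # groupByKey() analogue: bucket catalog entries by grid cell.
    by_cell = defaultdict(list)
    for name, ra, dec in catalog:
        by_cell[cell_key(ra, dec)].append((name, ra, dec))

    matches = []
    for obs_id, ra, dec in observed:
        kx, ky = cell_key(ra, dec)
        # join() analogue: pull candidates from the 3x3 block of cells.
        candidates = [entry
                      for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                      for entry in by_cell.get((kx + dx, ky + dy), [])]
        # map() analogue: attach a separation to each candidate.
        scored = [(separation_arcsec(ra, dec, cra, cdec), cname)
                  for cname, cra, cdec in candidates]
        # filter() analogue: keep only candidates within the match radius.
        scored = [pair for pair in scored if pair[0] <= MATCH_RADIUS_ARCSEC]
        if scored:
            dist, cname = min(scored)
            matches.append((obs_id, cname, dist))
    # sortByKey() analogue: order results by observed-source id.
    return sorted(matches)


# Made-up positions near M31, for illustration only.
catalog = [("star_a", 10.6847, 41.2690), ("gal_b", 10.7000, 41.3000)]
observed = [("irac_2", 10.7000, 41.3000 + 1.0 / 3600),
            ("irac_1", 10.6847, 41.2690),
            ("irac_3", 11.0000, 42.0000)]
result = crossmatch(observed, catalog)
```

Keying on grid cells means each observed source is compared only with catalog entries in nearby cells instead of with every entry, which matters when one catalog holds more than a hundred million sources.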
NED+SIMBAD+IRAC
Counts?
• 613 Stars
• 70 Globular clusters
• 63 X-ray sources
• 62 Galaxies
• 52 Star clusters
• Total known
sources: 1,391
And the rest?
• They are not part of SIMBAD, NED or MAST
• What about other catalogs?
• Can we classify them?
• Can we use machine learning?
Conclusions
• MAST has a higher resolution than IRAC-catalog,
SIMBAD and NED.
• Only 1,391 known sources came from a match between
NED + SIMBAD + IRAC-catalog.
• The rest could be classified with ML, using the
features of the known objects.
• We need more data for a better classification.
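The idea of classifying the unmatched sources from the features of the known ones can be sketched with a nearest-centroid classifier. The slides do not name an algorithm or a feature set, so both are assumptions: each source is described here by two hypothetical numeric features (for example an infrared magnitude and a colour), and all values below are made up.

```python
import math
from collections import defaultdict


def train_centroids(labelled):
    """Average the feature vectors of each known class."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (f1, f2), label in labelled:
        acc = sums[label]
        acc[0] += f1
        acc[1] += f2
        acc[2] += 1
    return {label: (s1 / n, s2 / n) for label, (s1, s2, n) in sums.items()}


def classify(features, centroids):
    """Assign the class whose centroid is closest in feature space."""
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))


# Made-up training data standing in for the matched, already classified sources.
known = [((15.0, 0.1), "star"), ((15.4, 0.0), "star"),
         ((18.0, 0.9), "galaxy"), ((18.4, 1.1), "galaxy")]
centroids = train_centroids(known)
guess_a = classify((15.1, 0.2), centroids)
guess_b = classify((18.1, 0.8), centroids)
```

With only 1,391 labelled sources against hundreds of thousands of unlabelled ones, any such classifier would be limited by the small and uneven training set, which is in line with the call for more data.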
Thank You!
Collaborator: Prof. Pauline Barmby, Department of Physics,
University of Western Ontario
Photos:
Mainly from NASA, ESO, EarthSky, MacOS.


Analysis Andromeda Galaxy Data Using Spark: Spark Summit East Talk by Jose Nandez
