FAIRSpectra
Towards a common data file format
for SIMS images
Alex Henderson
University of Manchester, UK
Office for Open Research
https://fairspectra.net
https://alexhenderson.info
Thanks…
• For financial support
• University of Manchester’s Office for Open Research
• SurfaceSpectra Ltd.
• For in-kind support
• 101st IUVSTA Workshop (metadata workshop)
• UK Surface Analysis Users Forum (UKSAF) (free exhibition space)
• SIMS Europe (free exhibition space)
• SpringSciX 2024 (free exhibition space)
What is FAIR?
The FAIR Guiding Principles
Findable
Accessible
Interoperable
Reusable
https://www.go-fair.org
Interoperable
• Integration with other data, applications and workflows for analysis, storage and processing
Reusable
• Well-described so they can be replicated and/or combined in different settings
What is FAIRSpectra?
Community-driven initiative
Focus on hyperspectral imaging techniques
• File formats for hyperspectral imaging
• No standards exist right now
• Software tools to support these
• Metadata requirements
• Education and training
• Raising awareness
What is FAIRSpectra?
https://fairspectra.net
https://fairspectra.zulipchat.com
https://github.com/FAIRSpectra
Survey from SIMS Europe, UKSAF, and SpringSciX
Positives
• Everyone wanted to see something done, but was not sure how
Barriers
• People have difficulty sharing
• Poor documentation
• Proprietary file formats – loss of information
• Raw data vs. processed data – large file size
• Gazumping / IP & prior art / confidentiality
• Time consuming
Feedback from the community
What are the issues?
…for academia
Funders require ‘data’ to be deposited in (open) repositories
But…
• No dedicated repositories
• Metadata terms are patchy
• Instrument data in proprietary file formats
• Many software packages not compatible with open formats
Researchers willing to share, but don’t know how
What are the issues?
…for industry
Barriers
• FAIR often confused with Open
• In-house processes considered good enough
• Worry about certain metadata usage giving secrets away
Benefits
• Easier to share data in-house, between labs and (overseas) sites
• FAIR practices lead to better records retention
• Acquisitions and mergers become more straightforward
• Third-party (open source) software becomes more easily accessible
• Incoming staff already familiar with systems
What are the issues?
…for instrument vendors
Barriers
• Concern about giving away commercial advantage
• Internal effort to support additional export format
Benefits
• No need to be a ‘data science house’
• ‘Outsource’ multivariate statistics/machine learning/AI to academia
• Cherry-pick externally developed methods into their software
• Software team can concentrate on instrument-specific tasks
• New & exciting software solutions sell the technique in new areas
→ drives instrument sales
Instrument manufacturer buy-in is vital
The problem with SIMS
Answers on a postcard to…
fairspectra.net
Photo by Kelly Sikkema on Unsplash
SIMS data characteristics
• Huge number of data channels
• Very sparse – almost all data channels are zero counts
• Non-zero values are ‘clumped’ together → peaks
• Example from Gus
• File (IONTOF in .grd format) is 306 MB
• 256 × 256 pixels × 65 layers × ~1 million channels would be 15.5 TB if stored in full
• Only holds locations (4D space: Z, Y, X, Ch) of non-zero positions
• Lossless compression
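The 15.5 TB figure can be checked with a few lines of arithmetic. This is a sketch under two assumptions that are not stated on the slide: each channel value is stored as a 32-bit (4-byte) number, and the sizes are read in binary units (TiB and MiB).

```python
# Dimensions quoted on the slide; bytes_per_value is an assumption.
pixels_y, pixels_x = 256, 256
layers = 65
channels = 1_000_000        # "~1 million" on the slide
bytes_per_value = 4         # assumed 32-bit storage per channel value

n_values = pixels_y * pixels_x * layers * channels
dense_bytes = n_values * bytes_per_value
dense_tib = dense_bytes / 2**40

file_bytes = 306 * 2**20    # the 306 MB file, read here as MiB
ratio = dense_bytes / file_bytes

print(f"{n_values:.3e} values, {dense_tib:.1f} TiB if stored in full")
print(f"roughly {ratio:,.0f} times larger than the sparse file")
```

Under these assumptions the dense array comes out at about 15.5 TiB, which matches the slide and suggests how much the coordinate-only encoding saves.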
Solutions to the sparsity problem
Lossless compression
• Compressing then decompressing returns the original data unchanged
• Use off-the-shelf lossless compression (e.g. ZIP)
• Only record non-zero events as coordinates
Lossy compression
• Some data discarded during compression
• Use off-the-shelf lossy compression (e.g. MP3-like, or JPEG-like)
• Down-bin spectra to lower mass resolution
• Use peak detection and only record centroid and area
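The difference between the two families can be shown on a toy data cube. The sketch below uses NumPy and invented dimensions; it stores only the coordinates of non-zero events (lossless, so the cube can be rebuilt exactly) and then down-bins the spectral axis (lossy, the position within each bin is gone). The variable names are illustrative and not taken from any vendor format.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sparse cube: (layers Z, rows Y, columns X, channels Ch), mostly zero counts.
cube = np.zeros((2, 4, 4, 1000), dtype=np.uint16)
cube[..., 400:405] = rng.integers(1, 20, size=(2, 4, 4, 5))  # one 'clump' of counts

# Lossless: keep only the coordinates and counts of the non-zero events.
coords = np.argwhere(cube)              # shape (n_events, 4): Z, Y, X, Ch
counts = cube[tuple(coords.T)]

# Rebuild the full cube from the event list and compare with the original.
restored = np.zeros_like(cube)
restored[tuple(coords.T)] = counts
round_trip_ok = np.array_equal(restored, cube)

# Lossy: down-bin by summing every 10 adjacent channels.
binned = cube.reshape(2, 4, 4, 100, 10).sum(axis=-1)

print("round trip exact:", round_trip_ok)
print("values in full cube:", cube.size, "events stored:", len(counts))
print("down-binned shape:", binned.shape)
```

The down-binned cube keeps the total counts but cannot be expanded back to the original channels, which is the ‘round-trip’ problem in miniature.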
Problems with lossless compression
• May need to unpack/unzip data to work on it
• Compression methods not tuned to SIMS data
• Designed for floating point numbers, or text
• Bespoke encoding requires bespoke software
• Need to reinvent the wheel for each algorithm
Image by Freeimages.com
Problems with lossy compression
• Throwing away data
• Possible loss of mass or spatial resolution
• Not possible to ‘round-trip’ data
What does an ideal scenario look like?
Depends on the context
• Visualisation
• Multivariate statistical analysis (central limit theorem-based)
• Machine learning analysis (AI?)
• Library search
• Data fusion with other techniques
• Use existing software from other techniques
• Fast to read and write
• Yet to be determined…
Accessible by novice users in addition to experts
Potential solutions
Photo by Neil Thomas on Unsplash
imzML
• Developed by MALDI community
• Metadata – largely biological (HUPO)
• Data stored in two flavours
• Processed – discrete peaks/regions
• Who decided the peak positions? Parameters?
• Data locked in
• Continuous – all spectral channels, no compression
• Data becomes unwieldy/impossible to store
• No facility for 3D data
Aside on peak detection
• Many methods available
• Unclear what instrument vendors do
• My method (in ChiToolbox on GitHub)
• Determine total ion spectrum
• Gaussian smooth
• Second derivative
• Determine zero-crossing points to get peak limits
• Determine channel containing centroid of total ion spectrum peak
• Apply these limits to each pixel
• Calculate area of discovered peak
https://github.com/AlexHenderson/ChiToolbox/blob/master/ChiToolbox/%40ChiMSCharacter/peakdetect.m
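The steps listed above can be sketched in Python with NumPy. This is an illustration of the general idea only, not a port of the ChiToolbox MATLAB code: the function names, the smoothing width `sigma`, and the `min_fraction` filter that discards negligible regions are all assumptions introduced here, and the data are synthetic and noise-free.

```python
import numpy as np

def gaussian_smooth(y, sigma):
    """Smooth a 1-D spectrum by convolving with a normalised Gaussian kernel."""
    half = int(np.ceil(4 * sigma))
    x = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return np.convolve(y, kernel / kernel.sum(), mode="same")

def detect_peak_limits(total_ion, sigma=2.0, min_fraction=0.01):
    """Return (start, stop) channel pairs, stop exclusive, taken from the
    zero crossings of the second derivative of the smoothed spectrum."""
    smooth = gaussian_smooth(total_ion, sigma)
    second = np.gradient(np.gradient(smooth))
    # A peak is where the second derivative is negative; pad so edges pair up.
    inside = np.concatenate(([False], second < 0, [False]))
    edges = np.diff(inside.astype(np.int8))
    starts = np.flatnonzero(edges == 1)
    stops = np.flatnonzero(edges == -1)
    floor = min_fraction * smooth.max()   # assumed filter for negligible regions
    return [(int(a), int(b)) for a, b in zip(starts, stops)
            if smooth[a:b].max() >= floor]

# Synthetic 4 x 4 pixel image with two peaks, at channels 50 and 120.
channels = np.arange(200)
shape_a = np.exp(-0.5 * ((channels - 50) / 3.0) ** 2)
shape_b = np.exp(-0.5 * ((channels - 120) / 3.0) ** 2)
weights = np.linspace(1.0, 2.0, 16).reshape(4, 4)
cube = weights[..., None] * shape_a + 0.5 * weights[::-1, ::-1, None] * shape_b

total_ion = cube.sum(axis=(0, 1))                       # total ion spectrum
limits = detect_peak_limits(total_ion)                  # limits from zero crossings
centroids = [int(round(float(np.average(channels[a:b], weights=total_ion[a:b]))))
             for a, b in limits]                        # channel holding each centroid
areas = np.stack([cube[:, :, a:b].sum(axis=-1) for a, b in limits], axis=-1)

print(limits, centroids, areas.shape)
```

Because the limits come from the total ion spectrum and are then applied to every pixel, the same channels are integrated everywhere, so the resulting peak images stay aligned.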
Issues with peak detection
• Works well for intense peaks
• Noisy peaks become ‘perfect’
• Noise characteristics/statistics lost
• Peaks could have shoulders included
• Moves centroid: OK for visualisation, bad for library search
• Parameters usually not shared
• Original data may no longer be available/shared/readable
• Loss of peak shape
• Detecting on each pixel means centroids not aligned in image data
• Don’t know what we don’t know…
Photo by Louise Tollisen on Unsplash
Methods from other fields
Astronomy – HDF5
Climate research – netCDF, HDF5, Zarr
Microscopy – OME-NGFF, OME-ZARR
• Chunked formats
• Built-in lossless compression
• Plugin compression methods
Alternative data encoding
Formats like HDF5 and Zarr can be chunked and compressed
Photo by Markus Spiske on Unsplash
Chunking of a hyper-mango
https://www.ambitiouskitchen.com/how-to-cut-a-mango/
Each mini-cube is separately addressable
https://www.blosc.org/posts/blosc2-ndim-intro/
Only relevant segments are loaded to RAM
→ cloud storage friendly
[Figure: data cube labelled X, Y, Spectral range and Image]
Segments are cached and garbage collected
https://commons.wikimedia.org/wiki/File:OLAPcube.png
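The point that only the relevant segments need to be loaded can be illustrated with some index arithmetic. The sketch below assumes a hypothetical 256 × 256 × 1,000,000 array cut into 64 × 64 × 10,000 chunks (both sizes are invented for illustration) and counts the chunks touched by two typical requests. `chunks_for_selection` is a made-up helper, not part of HDF5 or Zarr, which do this bookkeeping internally.

```python
import math
from itertools import product

def chunks_for_selection(selection, chunk_shape):
    """Grid indices of every chunk touched by a half-open selection.

    selection   -- one (start, stop) pair per axis, stop exclusive
    chunk_shape -- chunk length along each axis
    """
    ranges = [range(start // size, (stop - 1) // size + 1)
              for (start, stop), size in zip(selection, chunk_shape)]
    return list(product(*ranges))

array_shape = (256, 256, 1_000_000)   # Y, X, spectral channels (toy numbers)
chunk_shape = (64, 64, 10_000)
total_chunks = math.prod(-(-n // c) for n, c in zip(array_shape, chunk_shape))

# One ion image: every pixel, but only channels 123,400 to 123,499.
image_chunks = chunks_for_selection(
    ((0, 256), (0, 256), (123_400, 123_500)), chunk_shape)

# One full spectrum from a single pixel.
spectrum_chunks = chunks_for_selection(
    ((10, 11), (20, 21), (0, 1_000_000)), chunk_shape)

print(len(image_chunks), len(spectrum_chunks), total_chunks)  # 16 100 1600
```

With these toy numbers an ion image touches 16 of the 1,600 chunks and a single-pixel spectrum touches 100, so neither request needs the whole dataset in RAM.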
Different chunk sizes
Smaller chunks give finer granularity in selection,
but more addressable segments increase data size
Chunks can be compressed.
The contents of a chunk affect its compressibility.
Difficult to predict overall file size
https://www.bbc.co.uk/bitesize/guides/zjs9dxs/revision/3
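One way to get a feel for the chunk-size trade-off is to compress a toy cube chunk by chunk and add up the bytes. This sketch uses NumPy and Python's built-in `zlib` as a stand-in for the compressors built into chunked formats; the cube dimensions, peak positions and candidate chunk shapes are all invented, and real files also carry per-chunk index overhead that is not counted here.

```python
import zlib
import numpy as np

rng = np.random.default_rng(1)

# Toy sparse cube (Y, X, channels) with three narrow peaks in every pixel.
cube = np.zeros((32, 32, 2000), dtype=np.uint16)
for centre in (300, 900, 1500):
    cube[:, :, centre - 2:centre + 3] = rng.integers(1, 50, size=(32, 32, 5))

def compressed_size(array, chunk_shape):
    """Total bytes after compressing every chunk separately with zlib."""
    cy, cx, cc = chunk_shape
    total = 0
    for y in range(0, array.shape[0], cy):
        for x in range(0, array.shape[1], cx):
            for c in range(0, array.shape[2], cc):
                chunk = array[y:y + cy, x:x + cx, c:c + cc]
                total += len(zlib.compress(chunk.tobytes()))
    return total

sizes = {shape: compressed_size(cube, shape)
         for shape in [(32, 32, 2000), (8, 8, 500), (4, 4, 100)]}

for shape, size in sizes.items():
    print(shape, size, "bytes,", f"{cube.nbytes / size:.0f}x smaller than raw")
```

Changing the peak positions or the chunk shapes changes the totals, which is why the overall file size is hard to predict in advance.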
Compromise
This Photo by Unknown Author is licensed under CC BY-NC-ND
Peaks vs sticks
• Intermediate encoding
• Pseudo thresholding
• Use peak detection limits to separate ‘interesting’ from ‘uninteresting’ spectral ranges
• Not the same as intensity thresholding
• Peak detection can still be performed on regions
• Compression of entire spectrum more efficient
Photo by Stéphane Fellay on Unsplash
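The contrast between ‘sticks’ and the intermediate encoding can be sketched on a single synthetic spectrum. In the code below the peak limits are hard-coded, hypothetical values standing in for the output of a peak detection step, the counts are simulated with Poisson noise, and `zlib` is only a stand-in for whatever compressor a chunked format would apply.

```python
import zlib
import numpy as np

rng = np.random.default_rng(2)
channels = np.arange(2000)

# Synthetic spectrum: sparse background counts plus two peaks.
spectrum = rng.poisson(0.05, size=channels.size).astype(np.uint16)
for centre in (400, 1200):
    profile = 200 * np.exp(-0.5 * ((channels - centre) / 4.0) ** 2)
    spectrum += rng.poisson(profile).astype(np.uint16)

# Hypothetical peak limits (start, stop), stop exclusive.
limits = [(388, 413), (1188, 1213)]

# Pseudo-thresholding: keep every channel inside the limits at full
# resolution and zero everything between the peaks.
keep = np.zeros(channels.size, dtype=bool)
for start, stop in limits:
    keep[start:stop] = True
regions = np.where(keep, spectrum, 0).astype(np.uint16)

# 'Sticks': each peak reduced to one centroid position and one area.
sticks = [(float(np.average(channels[a:b], weights=spectrum[a:b])),
           int(spectrum[a:b].sum()))
          for a, b in limits]

raw_size = len(zlib.compress(spectrum.tobytes()))
regions_size = len(zlib.compress(regions.tobytes()))

print("sticks:", sticks)
print("compressed bytes, full spectrum:", raw_size, "regions only:", regions_size)
```

Unlike the sticks, the regions keep the peak shape and noise inside the limits, so peak detection can still be rerun on them later with different parameters.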
Encapsulate in chunked format
• Hold the full spectral resolution of the detected peaks
• Discard data between peaks
• Start with chunked data format
• Develop compression plugin for these formats
• Produces a pseudo-continuous spectrum with acceptable compression
Sean Lucas
Summary
The researchers are willing, but their resources are weak
• Few solutions currently exist
• Metadata terms missing
• Proprietary file formats are a barrier
• Instrument vendor buy-in required
• Lack of awareness persists
But…
• Some low-hanging fruit
• Opportunity to make an impact
• Even Closed FAIR can still have benefits to industry
There’s lots to do, but FAIRSpectra is just getting started!
https://fairspectra.net
https://alexhenderson.info
