Generating Training Data from Noisy
Measurements
HAMED ALEMOHAMMAD
LEAD GEOSPATIAL DATA SCIENTIST
ML Hub Earth
 Machine Learning commons for EO
 Training data
 Models
 Standards and best practices
Global Land Cover Training Dataset
 Human-verified training dataset
 Using open-source Sentinel-2 imagery
 10 m spatial resolution
 Global and geo-diverse
Workflow
[Diagram: S2 L2A Reflectance; S2 L2A Classification; GlobeLand30 Labels (2010); Filtered Labels; Class Predictions; Class Verification (Human); Model Training]
Data
 Input Data:
 10 Sentinel-2 bands: Red, Green, Blue, Red-Edge1-3, NIR, Narrow NIR, SWIR1-2
 20 m bands resampled to 10 m using bicubic interpolation
 Reference/Label Data:
 GlobeLand30 labels for 2010 used as a source
 Classes mapped to REF Land Cover Taxonomy
 Labels re-gridded to the Sentinel-2 grid using nearest-neighbor resampling
 Labels filtered by agreement with classes from Sentinel-2's 20 m scene classification
(produced as part of atmospheric correction)
 Filtered labels used as reference labels for training
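The re-gridding and filtering steps above can be sketched as follows. This is a minimal illustration, not the pipeline used for the dataset: it assumes the 30 m label grid and the 20 m scene-classification grid are already aligned with the 10 m grid, so nearest-neighbor resampling reduces to integer replication. The `COMPATIBLE` table, which decides when a label "agrees" with a scene-classification code, is hypothetical; the slides do not give the actual mapping.

```python
NODATA = None

# Hypothetical agreement table (not from the slides): land cover labels that
# are treated as consistent with each Sentinel-2 scene-classification code
# (4 = vegetation, 5 = not vegetated, 6 = water, 11 = snow).
COMPATIBLE = {
    4: {"woody vegetation", "cultivated vegetation", "(semi) natural vegetation"},
    5: {"natural bare ground", "artificial bare ground"},
    6: {"water"},
    11: {"snow/ice"},
}


def regrid_nearest(grid, factor):
    """Upsample a 2-D grid by an integer factor with nearest-neighbor lookup."""
    rows, cols = len(grid), len(grid[0])
    return [
        [grid[r // factor][c // factor] for c in range(cols * factor)]
        for r in range(rows * factor)
    ]


def filter_labels(labels_10m, scl_10m):
    """Keep a label only where the co-located scene class is compatible."""
    return [
        [lab if lab in COMPATIBLE.get(scl, set()) else NODATA
         for lab, scl in zip(label_row, scl_row)]
        for label_row, scl_row in zip(labels_10m, scl_10m)
    ]


# Toy 60 m x 60 m extent: 2x2 labels at 30 m, 3x3 scene classes at 20 m.
labels_30m = [["water", "water"],
              ["cultivated vegetation", "natural bare ground"]]
scl_20m = [[6, 6, 6],
           [4, 4, 6],
           [4, 4, 5]]

labels_10m = regrid_nearest(labels_30m, 3)   # 6x6 grid at 10 m
scl_10m = regrid_nearest(scl_20m, 2)         # 6x6 grid at 10 m
filtered = filter_labels(labels_10m, scl_10m)
```

Pixels where the two sources disagree become `NODATA` and drop out of the reference set, so only labels supported by both sources are used for training.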
Methodology
 A pixel-based supervised Random Forest model is trained for each scene.
 Pixels without valid reflectance are excluded from training.
 Training uses class-stratified samples of half the pixels in a scene, with one
Sentinel-2 pixel at 10 m for each label pixel at 30 m.
 Predictions are made on all pixels marked with usable classes during Level-2A
processing, including pixels labeled as unclassified.
 Annual labels will be generated by aggregating time series of predictions and
probabilities from the same tile throughout the year.
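The class-stratified sampling step above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `stratified_half_sample`, the flat list of per-pixel labels, and the fixed seed are all hypothetical. Pixels without a reference label are skipped, and each class contributes the same fraction of its pixels.

```python
import random
from collections import defaultdict


def stratified_half_sample(labels, fraction=0.5, seed=0):
    """Return sorted pixel indices, sampling each class at the same fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        if lab is not None:              # skip pixels without a usable label
            by_class[lab].append(idx)
    chosen = []
    for idxs in by_class.values():
        k = max(1, round(len(idxs) * fraction))   # at least one pixel per class
        chosen.extend(rng.sample(idxs, k))
    return sorted(chosen)


# Toy scene: 40 water pixels, 10 woody vegetation pixels, 6 unlabeled pixels.
labels = ["water"] * 40 + ["woody vegetation"] * 10 + [None] * 6
train_idx = stratified_half_sample(labels)
```

The selected indices would then pick the reflectance vectors and labels passed to the per-scene classifier. Sampling per class keeps rare classes represented in the training set in proportion to their share of the scene.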
Results
 88.75% average model accuracy across 4 diverse scenes.
 Some classes, like water and snow/ice, were predicted with high accuracy and high
confidence across all scenes.
 Other classes, like wetland and (semi) natural vegetation, are subtler and were
expected to be more difficult to classify.
 Woody vegetation and cultivated vegetation were predicted relatively
accurately and not confused with each other, as a result of including the 20 m
red-edge bands, resampled to 10 m.
 Artificial bare ground tended to be predicted in regions left unclassified in the
reference data, taking over areas of natural bare ground and cultivated
vegetation. This suggests that traces of human activity lead to pixels being
classified as artificial bare ground outside the vegetation season.
What about non-categorical variables?
 True value of categorical variables vs true value of continuous variables:
 Crop Yield
 Soil Moisture
 Temperature
 Precipitation
 All measurements of continuous variables are prone to uncertainty (noise and
bias).
 How to reduce/eliminate these uncertainties in training data?
[Figure: Noisy and biased measurement systems (in-situ, model, satellite) around the truth. Slide courtesy of K. McColl.]
Generating Training Dataset
 Triple collocation (TC) is a technique for estimating the unknown error standard
deviations (or RMSEs) of three mutually independent measurement systems,
without treating any one system as zero-error “truth”.
$$Q_{ij} \equiv \mathrm{Cov}(X_i, X_j), \qquad \sigma_{\varepsilon_i} = \sqrt{Q_{ii} - \frac{Q_{ij}\,Q_{ik}}{Q_{jk}}}$$
 TC-based RMSE estimates at each pixel are used to compute the a priori probability
($P_i$) of selecting a particular dataset:
$$P_i = \frac{1/\sigma_{\varepsilon_i}^{2}}{\sum_{j=1}^{3} 1/\sigma_{\varepsilon_j}^{2}}$$
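The two formulas above can be checked on synthetic data. The sketch below is illustrative only: the function names `tc_error_std` and `selection_probabilities`, the sine-wave "truth", and the noise levels 0.1, 0.2, and 0.4 are all assumptions made for the example. It relies on the usual triple collocation assumption that the three error series are uncorrelated with each other and with the truth.

```python
import math
import random


def cov(x, y):
    """Sample covariance of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def tc_error_std(x1, x2, x3):
    """Triple collocation estimate of each system's error standard deviation."""
    series = (x1, x2, x3)
    q = [[cov(a, b) for b in series] for a in series]
    sigmas = []
    for i in range(3):
        j, k = (m for m in range(3) if m != i)
        variance = q[i][i] - q[i][j] * q[i][k] / q[j][k]
        # Sampling noise can push the estimate slightly below zero; clip it.
        sigmas.append(math.sqrt(max(variance, 0.0)))
    return sigmas


def selection_probabilities(sigmas):
    """P_i proportional to 1 / sigma_i^2, normalized over the three systems."""
    weights = [1.0 / s ** 2 for s in sigmas]   # assumes every sigma is > 0
    total = sum(weights)
    return [w / total for w in weights]


# Synthetic pixel time series (illustrative values, not from the slides).
rng = random.Random(42)
truth = [math.sin(0.01 * t) for t in range(5000)]
x1 = [v + rng.gauss(0.0, 0.1) for v in truth]
x2 = [v + rng.gauss(0.0, 0.2) for v in truth]
x3 = [v + rng.gauss(0.0, 0.4) for v in truth]

sigmas = tc_error_std(x1, x2, x3)
probs = selection_probabilities(sigmas)
```

With enough samples the estimated error standard deviations should land near the noise levels used to generate the data, and the least noisy system receives the largest selection probability, all without ever consulting `truth` in the estimation.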
Sample time series of a pixel
[Figure: time series $X_1$, $X_2$, $X_3$ at times $t_1, t_2, t_3, \ldots, t_N$, and the series $X_T$.]
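One reading of this figure is that the training series $X_T$ is assembled by drawing, at each time step, one of the three datasets with probability $P_i$. The slide text does not spell this out, so the sketch below should be taken as an interpretation. The function name `build_target_series`, the example values, and the probabilities are hypothetical.

```python
import random


def build_target_series(x1, x2, x3, probs, seed=0):
    """Pick one of the three datasets at each time step with probability P_i."""
    rng = random.Random(seed)
    series = (x1, x2, x3)
    # sources[t] is the index (0, 1, or 2) of the dataset chosen at time t.
    sources = rng.choices([0, 1, 2], weights=probs, k=len(x1))
    target = [series[s][t] for t, s in enumerate(sources)]
    return target, sources


# Illustrative values for one pixel (not from the slides).
x1 = [0.10, 0.12, 0.15, 0.11]
x2 = [0.09, 0.14, 0.13, 0.10]
x3 = [0.20, 0.05, 0.18, 0.16]
probs = [0.7, 0.2, 0.1]   # e.g. TC-based a priori probabilities

target, sources = build_target_series(x1, x2, x3, probs)
```

Because the draw is random rather than a fixed choice of the best dataset, every system contributes to the training target in proportion to its estimated reliability.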
Backup Slides
Alemohammad, et al., Biogeosciences, 2017
Things to check
 Sentinel-2 L2A classes
 What are the usable classes there?
 Plot actual scene + artificial bare ground