This document proposes a new self-supervised learning paradigm for radio-visual perception that does not require human-generated labels. Training radio sensing systems typically relies on labels derived from vision, which are expensive to produce at scale. The proposed approach instead uses synchronized radio and vision data to learn representations without explicit labels: mutual information is maximized between the encoded radio and vision streams to learn cross-modal features, realized through spatial binning of the encodings combined with contrastive predictive coding. The learned model can then generate self-labels for downstream training, enabling radio-only deployment and addressing the scale and cost challenges of next-generation radio systems.
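To make the cross-modal objective concrete, the sketch below illustrates one plausible form of it: radio and vision feature maps are pooled into spatial bins, projected into a shared embedding space, and trained with an InfoNCE-style contrastive loss (the loss family underlying contrastive predictive coding) that treats time-aligned bins from the two modalities as positive pairs. This is not the authors' implementation; the encoder output shapes, bin count, embedding size, and temperature are illustrative assumptions.

```python
# Minimal sketch of a cross-modal contrastive objective with spatial binning.
# All dimensions and hyperparameters below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalContrastive(nn.Module):
    def __init__(self, radio_dim=64, vision_dim=128, embed_dim=32,
                 n_bins=8, temperature=0.1):
        super().__init__()
        # Pool each modality's feature map into a fixed grid of spatial bins.
        self.pool = nn.AdaptiveAvgPool2d((n_bins, n_bins))
        # Project both modalities into a shared embedding space.
        self.radio_proj = nn.Linear(radio_dim, embed_dim)
        self.vision_proj = nn.Linear(vision_dim, embed_dim)
        self.temperature = temperature

    def forward(self, radio_feat, vision_feat):
        # radio_feat:  (B, radio_dim,  Hr, Wr)  encoded radio heatmap (assumed shape)
        # vision_feat: (B, vision_dim, Hv, Wv)  encoded camera frame (assumed shape)
        r = self.pool(radio_feat).flatten(2).transpose(1, 2)   # (B, n_bins^2, radio_dim)
        v = self.pool(vision_feat).flatten(2).transpose(1, 2)  # (B, n_bins^2, vision_dim)
        r = F.normalize(self.radio_proj(r), dim=-1)            # (B, n_bins^2, embed_dim)
        v = F.normalize(self.vision_proj(v), dim=-1)
        # Flatten batch and bin dimensions: each spatially aligned
        # (radio bin, vision bin) pair from the same frame is a positive;
        # all other pairs in the batch act as negatives.
        r = r.reshape(-1, r.size(-1))
        v = v.reshape(-1, v.size(-1))
        logits = r @ v.t() / self.temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE: radio-to-vision and vision-to-radio directions.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    loss_fn = CrossModalContrastive()
    radio = torch.randn(4, 64, 16, 16)    # placeholder radio encodings
    vision = torch.randn(4, 128, 32, 32)  # placeholder vision encodings
    loss = loss_fn(radio, vision)
    loss.backward()
    print(loss.item())
```

Minimizing this loss is equivalent (up to a bound) to maximizing mutual information between the binned radio and vision embeddings; once trained, the radio branch can be used alone to produce self-labels for downstream, radio-only models.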