This document proposes a new self-supervised learning paradigm for radio-visual perception that does not require human-generated labels. Training radio sensing systems typically relies on labels derived from vision, which are expensive to produce at scale. The proposed approach instead uses synchronized radio and vision data to learn representations without explicit labels: mutual information is maximized between the encoded radio and vision streams to learn cross-modal features, realized through spatial binning of the encodings combined with contrastive predictive coding. The learned model can then generate self-labels for downstream training, enabling radio-only deployment and addressing the scale and cost challenges of next-generation radio systems.
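To make the cross-modal objective concrete, the sketch below illustrates one plausible form of it: radio and vision feature maps are pooled into spatial bins, projected into a shared embedding space, and trained with an InfoNCE-style contrastive loss (the loss family underlying contrastive predictive coding) that treats time-aligned bins from the two modalities as positive pairs. This is not the authors' implementation; the encoder output shapes, bin count, embedding size, and temperature are illustrative assumptions.

```python
# Minimal sketch of a cross-modal contrastive objective with spatial binning.
# All dimensions and hyperparameters below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalContrastive(nn.Module):
    def __init__(self, radio_dim=64, vision_dim=128, embed_dim=32,
                 n_bins=8, temperature=0.1):
        super().__init__()
        # Pool each modality's feature map into a fixed grid of spatial bins.
        self.pool = nn.AdaptiveAvgPool2d((n_bins, n_bins))
        # Project both modalities into a shared embedding space.
        self.radio_proj = nn.Linear(radio_dim, embed_dim)
        self.vision_proj = nn.Linear(vision_dim, embed_dim)
        self.temperature = temperature

    def forward(self, radio_feat, vision_feat):
        # radio_feat:  (B, radio_dim,  Hr, Wr)  encoded radio heatmap (assumed shape)
        # vision_feat: (B, vision_dim, Hv, Wv)  encoded camera frame (assumed shape)
        r = self.pool(radio_feat).flatten(2).transpose(1, 2)   # (B, n_bins^2, radio_dim)
        v = self.pool(vision_feat).flatten(2).transpose(1, 2)  # (B, n_bins^2, vision_dim)
        r = F.normalize(self.radio_proj(r), dim=-1)            # (B, n_bins^2, embed_dim)
        v = F.normalize(self.vision_proj(v), dim=-1)
        # Flatten batch and bin dimensions: each spatially aligned
        # (radio bin, vision bin) pair from the same frame is a positive;
        # all other pairs in the batch act as negatives.
        r = r.reshape(-1, r.size(-1))
        v = v.reshape(-1, v.size(-1))
        logits = r @ v.t() / self.temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE: radio-to-vision and vision-to-radio directions.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    loss_fn = CrossModalContrastive()
    radio = torch.randn(4, 64, 16, 16)    # placeholder radio encodings
    vision = torch.randn(4, 128, 32, 32)  # placeholder vision encodings
    loss = loss_fn(radio, vision)
    loss.backward()
    print(loss.item())
```

Minimizing this loss is equivalent (up to a bound) to maximizing mutual information between the binned radio and vision embeddings; once trained, the radio branch can be used alone to produce self-labels for downstream, radio-only models.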