This document proposes a bilayer graph-based learning framework for multimodal image classification using limited labeled pixels. It constructs a simple graph in the first layer where each vertex is a pixel and edge weights encode pixel similarity. Unsupervised learning estimates grouping relations among pixels. These relations form a hypergraph in the second layer, on which semisupervised learning classifies pixels to address challenges of complex relationships and limited labels in multimodal images. The framework effectively exploits the underlying data structure.