Resolution-agnostic Remote Sensing Scene Classification with Implicit Neural Representations

Keyan Chen1,2,3,4
Wenyuan Li1,2,3,4
Jianqi Chen1,2,3,4
Zhengxia Zou✉ 1,4
Zhenwei Shi 1,2,3,4

Beihang University1
Beijing Key Laboratory of Digital Media2
State Key Laboratory of Virtual Reality Technology and Systems3
Shanghai Artificial Intelligence Laboratory4
Code [GitHub]
Paper [PDF]
Cite [BibTeX]


Remote sensing scene classification is an important yet challenging task. In recent years, the excellent feature representation ability of Convolutional Neural Networks (CNNs) has led to substantial improvements in scene classification accuracy. However, handling resolution variations of remote sensing images remains challenging because CNNs are not inherently capable of modeling multi-resolution inputs. In this letter, we propose a novel scene classification method with scale and resolution adaptation ability by leveraging recent advances in Implicit Neural Representations (INRs). Unlike previous CNN-based methods that make predictions from rasterized image inputs, the proposed method converts images into continuous functions via INR optimization and then performs classification within the function space. When an image is represented as a function, its resolution is decoupled from the pixel values, so resolution changes have little impact on classification performance. Our method also shows great potential for multi-resolution remote sensing scene classification. Using only a simple Multilayer Perceptron (MLP) classifier in the proposed function space, our method achieves classification accuracy comparable to deep CNNs while exhibiting better adaptability to image scale and resolution changes.


An overview of the Modulator and Synthesizer in our method. The proposed RASNet decomposes scene classification into two subtasks: 1. Optimize each image as a data point in function space. 2. Build a classifier in the function space. The Modulator and Synthesizer together address subtask 1, as shown in the figure above. The Modulator converts each image's unique latent code into a shift of the bias of each fully connected (FC) layer in the Synthesizer, a process known as shift modulation.
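The shift-modulation idea can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's actual architecture: the layer sizes, the linear Modulator, and the ReLU nonlinearity are assumptions made for demonstration.

```python
import numpy as np

def synthesizer_layer(x, W, b, shift):
    # One FC layer of the Synthesizer. The Modulator's output `shift`
    # is added to the layer bias (shift modulation), then a nonlinearity
    # is applied (ReLU here, purely for illustration).
    return np.maximum(W @ x + b + shift, 0.0)

# Hypothetical dimensions: a 64-d image-specific latent code is mapped
# by a linear Modulator to a 32-d bias shift for one Synthesizer layer.
rng = np.random.default_rng(0)
latent = rng.normal(size=64)        # unique latent code of one image
W_mod = rng.normal(size=(32, 64))   # Modulator weights (assumed linear)
shift = W_mod @ latent              # bias shift for this layer

W, b = rng.normal(size=(32, 2)), np.zeros(32)
coord = np.array([0.5, -0.5])       # a 2-D pixel coordinate
h = synthesizer_layer(coord, W, b, shift)
```

Because only the latent code differs per image, the shared Synthesizer weights capture structure common to all images, while the shifts encode what is specific to each one.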

In subtask 1, the Synthesizer directly maps the coordinates of image pixels to the corresponding pixels' RGB values. In addition to the Modulator and the Synthesizer, we also design a Perceptor to increase the semantic expression capability. Both pixel-level consistency and semantic-level consistency are considered in the optimization. Since fitting each sample demands substantial computation, meta-learning is used to learn a better initialization that accelerates the optimization of the latent code in RASNet. The parameters of the Modulator and Synthesizer are shared across data to describe the common structure of images, which not only reduces the dimension of data points in the function space but also brings out the differences between data points.
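The two consistency terms can be sketched as a combined fitting objective. This is a hedged NumPy sketch, assuming simple L2 losses for both terms and a hypothetical weight `w_sem`; the paper's exact loss forms and weighting are not specified here.

```python
import numpy as np

def fit_loss(pred_rgb, gt_rgb, pred_feat, gt_feat, w_sem=0.1):
    # Pixel-level consistency: L2 between the RGB values synthesized
    # from pixel coordinates and the ground-truth pixel values.
    pixel_loss = np.mean((pred_rgb - gt_rgb) ** 2)
    # Semantic-level consistency: L2 between Perceptor features of the
    # synthesized image and of the ground-truth image.
    semantic_loss = np.mean((pred_feat - gt_feat) ** 2)
    # `w_sem` is a hypothetical balancing weight.
    return pixel_loss + w_sem * semantic_loss

# Demo on dummy data: identical predictions give zero loss.
rgb = np.ones((3, 4))
feat = np.zeros(5)
loss = fit_loss(rgb, rgb, feat, feat)
```

During latent-code optimization, only this scalar objective is minimized per image, starting from the meta-learned initialization.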

In subtask 2, we aim to categorize the data (latent codes) in the function space. We show that, with the help of implicit modulation and meta-learning, only a few basic MLP layers are sufficient to achieve considerable classification performance.
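Such a classifier over latent codes can be as simple as the following NumPy sketch. The latent dimension, hidden width, and class count are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def mlp_classify(latent, W1, b1, W2, b2):
    # A basic MLP head over a latent code (a data point in function
    # space): one hidden ReLU layer, then a linear classification layer.
    h = np.maximum(W1 @ latent + b1, 0.0)
    logits = W2 @ h + b2
    return int(np.argmax(logits))

# Hypothetical sizes: 64-d latent codes, 15 scene classes.
rng = np.random.default_rng(0)
latent = rng.normal(size=64)
W1, b1 = rng.normal(size=(128, 64)), np.zeros(128)
W2, b2 = rng.normal(size=(15, 128)), np.zeros(15)
label = mlp_classify(latent, W1, b1, W2, b2)
```

Because the latent codes already encode per-image structure relative to the shared Modulator/Synthesizer, a lightweight head like this suffices.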

Quantitative Results

R1: Benchmark on GID dataset

We compare the proposed RASNet with other deep learning-based image classification methods, including ResNet18, VGG16, and the INR-based Functa, on different test datasets. The following Tab. shows the comparison results; Functa is evaluated separately as part of the ablation study. We make the following observations. 1) When evaluated on Dte28, RASNet performs on par with the baselines but falls short of VGG16. This is because we only utilize a tiny MLP classifier, and there is still room for accuracy improvement in the data-space transfer. 2) When the spatial resolution is held constant (Res. = 8) and only the scale of the scene varies, as in Dte56 and Dte84, the performance of CNN-based methods drops significantly, whereas RASNet maintains a high score. This suggests that RASNet adapts well to varying spatial ranges, i.e., it is spatial-dimension agnostic. 3) When we downsize images from Dte56 and Dte84 to 28 × 28 (the same size as the training set Dtr28), thereby modifying the spatial resolution, CNN-based classifiers perform better but still suffer a considerable performance decrease. RASNet maintains its performance despite a slight drop, demonstrating that it generalizes across multiple resolutions, i.e., it is resolution-dimension agnostic. In contrast to the CNN-based approaches, the modest decline of RASNet may be due to the loss of image detail at low resolution, which can easily be fixed by encoding the input image at a higher resolution. 4) With the perceptor loss, RASNet* achieves state-of-the-art performance.
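The resolution decoupling underlying these results can be illustrated with coordinate grids: since each image is a continuous function, the same function can be queried on a grid of any size. A minimal NumPy sketch, assuming coordinates normalized to [-1, 1] (a common INR convention, not necessarily the paper's exact one):

```python
import numpy as np

def coord_grid(h, w):
    # Normalized (y, x) coordinate grid in [-1, 1]^2. Any (h, w) queries
    # the same continuous image function, so resolution is decoupled
    # from the stored representation (the latent code).
    ys = np.linspace(-1.0, 1.0, h)
    xs = np.linspace(-1.0, 1.0, w)
    return np.stack(np.meshgrid(ys, xs, indexing="ij"), axis=-1)

g28 = coord_grid(28, 28)   # a 28 x 28 query, as in the training set
g84 = coord_grid(84, 84)   # an 84 x 84 query of the same function
```

A CNN, by contrast, sees a different rasterized tensor at each resolution, which is one way to understand its larger performance drop.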

R2: Confusion Matrix on GID dataset

The following Fig. reports the confusion matrix, where the entry in the i-th row and j-th column denotes the rate of images from the i-th class classified as the j-th class.
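The figure's convention can be stated precisely with a small sketch: each row of the matrix is normalized by its class count, so entry (i, j) is a rate. A minimal NumPy version, for illustration only:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    # Row-normalized confusion matrix: entry (i, j) is the fraction of
    # class-i images that were classified as class j.
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    row_totals = cm.sum(axis=1, keepdims=True)
    return cm / np.maximum(row_totals, 1)  # avoid dividing by zero
```

With this normalization, the diagonal entries are per-class recall rates and each non-empty row sums to 1.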