Building Extraction from Remote Sensing Images with Sparse Token Transformers

Keyan Chen1,2,3
Zhengxia Zou1,2,3
Zhenwei Shi ✉ 1,2,3

Beihang University1
Beijing Key Laboratory of Digital Media2
State Key Laboratory of Virtual Reality Technology and Systems3
Code [GitHub]
Paper [PDF]
Cite [BibTeX]


The Fig. shows the speed–accuracy trade-off and some comparable results between the proposed method and other state-of-the-art segmentation methods. Throughput (images with 512 × 512 pixels per second on a 2080Ti GPU) versus accuracy (IoU) on WHU aerial building extraction test set. Here, we only calculate the model inference time, not including the time to read images. Our model (SST) outperforms other segmentation methods with a clear margin. For STT, Base (S4), and Base (S5), points on the line from the left to the right refers to the models with different CNN feature extractor of ResNet50, VGG16 and ResNet18, respectively.


Deep learning methods have achieved considerable progress in remote sensing image building extraction. Most building extraction methods are based on Convolutional Neural Networks (CNN). Recently, vision transformers have provided a better perspective for modeling long-range context in images, but usually suffer from high computational complexity and memory usage. In this paper, we explored the potential of using transformers for efficient building extraction. We design an efficient dual-pathway transformer structure that learns the long-term dependency of tokens in both their spatial and channel dimensions and achieves state-of-the-art accuracy on benchmark building extraction datasets. Since single buildings in remote sensing images usually only occupy a very small part of the image pixels, we represent buildings as a set of "sparse" feature vectors in their feature space by introducing a new module called "sparse token sampler". With such a design, the computational complexity in transformers can be greatly reduced over an order of magnitude. We refer to our method as Sparse Token Transformers (STT). Experiments conducted on the Wuhan University Aerial Building Dataset (WHU) and the Inria Aerial Image Labeling Dataset (INRIA) suggest the effectiveness and efficiency of our method. Compared with some widely used segmentation methods and some state-of-the-art building extraction methods, STT has achieved the best performance with low time cost.


An overview of the proposed method. Our method consists of a CNN feature extractor, a spatial/channel sparse token sampler, a transformer-based encoder/decoder, and a prediction head. In the transformer part, when modeling the long-range dependency between different channels/spatial locations of the input image, instead of using all features vectors, we select the most important ks (ksHW) spatial tokens and kc (kcC) channel tokens. The sparse tokens greatly reduce the computational and memory consumption. Based on the semantic tokens generated by the encoder, a transformer decoder is employed to refine the original features. Finally, in our prediction head, we apply two upsampling layers to produce high-resolution building extraction results.

Quantitative Results

R1: Benchmark on WHU and INRIA building datasets

Comparison with some well-known image labeling methods and state-of-the-art building extraction methods on the WHU and INRIA building datasets. UNet, SegNet, DANet, and DeepLabV3 are commonly used methods for segmentation tasks in CNN framework. SETR is a transformer-based method for segmentation. The methods mentioned in the second row are all based on the CNN framework for specific building extraction tasks. To validate the efficiency, we report number of parameters (Params.), multiply-accumulate operations (MACs) and images with 512 × 512 pixels per second on a 2080Ti GPU (Throughput).


In the following Fig., we show the qualitative results of STT on WHU and INRIA benchmarks. The results of different methods on samples from the WHU (a-e) and INRIA (f-j) building datasets are visualized. The figure is colored differently to facilitate viewing, with white representing true positive pixels, black representing true negative pixels, red representing false positive pixels, and green representing false negative pixels.