Transformer-based hierarchical dynamic decoders for salient object detection

Qingping Zheng, Ling Zheng, Jiankang Deng, Ying Li*, Changjing Shang, Qiang Shen

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

4 Citations (Scopus)
12 Downloads (Pure)

Abstract

Global context and global contrast are crucial cues for Salient Object Detection (SOD) in images. Most advanced SOD methods use CNN-based architectures and achieve impressive results. However, these methods are intrinsically limited in capturing long-range global information, since a CNN extracts features in local sliding windows. In contrast, transformers extract features through a self-attention mechanism, which gives them a powerful capability for learning global cues. Nonetheless, a pure transformer-based network incurs a large computational overhead and easily suffers from attention collapse as it goes deeper. To address these issues, we propose a Transformer-based Hierarchical Dynamic Decoder (T-HDDNet) for image salient object detection. Specifically, T-HDDNet employs the transformer to encode each image patch into multi-level and multi-resolution features based on the long-range dependencies among pixels. To obtain an accurate, high-resolution saliency map, we develop a dynamic dual upsampling mechanism that enlarges the spatial size of features in a data-driven manner, together with a dynamic feature fusion unit. Finally, hierarchical dynamic decoders built on these two units progressively produce the final saliency map. Extensive experimental results show that the proposed method achieves the best performance on all benchmarks in comparison with state-of-the-art methods.
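
The contrast the abstract draws between local sliding windows and self-attention can be illustrated with a toy sketch. The code below is not the T-HDDNet encoder; it is a minimal, assumption-laden example in plain Python. The function names (`self_attention`, `local_average`), the identity query/key/value projections, and the 1-D token sequence are all hypothetical simplifications chosen for illustration. It perturbs a distant token and compares how much the first output changes under each operator.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    Assumption: real transformers use learned projections; they are
    omitted here so the global mixing is easy to see.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Every query is scored against every key, so each output
        # token can depend on the whole sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def local_average(tokens, radius=1):
    """Sliding-window average: each output sees only nearby tokens."""
    d = len(tokens[0])
    out = []
    for i in range(len(tokens)):
        window = tokens[max(0, i - radius): i + radius + 1]
        out.append([sum(t[j] for t in window) / len(window) for j in range(d)])
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 0.0]]
changed = [row[:] for row in tokens]
changed[4] = [5.0, 5.0]  # perturb only the last (most distant) token

# How much does the first output token move in each case?
attn_shift = abs(self_attention(tokens)[0][0] - self_attention(changed)[0][0])
local_shift = abs(local_average(tokens)[0][0] - local_average(changed)[0][0])

print("attention shift:", attn_shift)
print("local-window shift:", local_shift)
```

In this sketch the first output of the local operator is unaffected by the distant change, while the attention output shifts, which is the long-range behaviour the abstract attributes to transformers. A single local layer is used here for simplicity; stacked convolutions enlarge the receptive field only gradually.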

Original language: English
Article number: 111075
Number of pages: 11
Journal: Knowledge-Based Systems
Volume: 282
Early online date: 30 Oct 2023
Publication status: Published - 20 Dec 2023

Keywords

  • Deep learning
  • Hierarchical dynamic decoders
  • Salient object detection
  • Vision transformer
