TY - JOUR
T1 - Transformer-based hierarchical dynamic decoders for salient object detection
AU - Zheng, Qingping
AU - Zheng, Ling
AU - Deng, Jiankang
AU - Li, Ying
AU - Shang, Changjing
AU - Shen, Qiang
N1 - Funding Information:
This work is supported by the National Natural Science Foundation of China under Grant (62271400) and the Key R&D projects of Shaanxi Province, China (2023-GHZD-02).
Publisher Copyright:
© 2023 Elsevier B.V.
PY - 2023/12/20
Y1 - 2023/12/20
N2 - Global context and global contrast are crucial cues for Salient Object Detection (SOD) in images. Most advanced SOD methods exploit CNN-based architectures, achieving impressive results. However, these methods have intrinsic limitations in capturing long-range global information, since a CNN extracts features within local sliding windows. In contrast, transformers exploit a self-attention mechanism to extract features, gaining a powerful capability for learning global cues. Nonetheless, a pure transformer-based network incurs a large computational overhead and easily suffers from attention collapse as it goes deeper. To address these issues, we propose a Transformer-based Hierarchical Dynamic Decoder (T-HDDNet) for image salient object detection. Specifically, our T-HDDNet employs the transformer to encode each image patch into multi-level and multi-resolution features based on the long-range dependencies among pixels. To obtain an accurate, high-resolution saliency map, we develop a dynamic dual upsampling mechanism that enlarges the spatial size of features in a data-driven manner, together with a dynamic feature fusion unit. Ultimately, the hierarchical dynamic decoders built on these two units are used to progressively produce the final saliency map. Extensive experimental results show that the proposed method achieves the best performance on all benchmarks compared with state-of-the-art methods.
AB - Global context and global contrast are crucial cues for Salient Object Detection (SOD) in images. Most advanced SOD methods exploit CNN-based architectures, achieving impressive results. However, these methods have intrinsic limitations in capturing long-range global information, since a CNN extracts features within local sliding windows. In contrast, transformers exploit a self-attention mechanism to extract features, gaining a powerful capability for learning global cues. Nonetheless, a pure transformer-based network incurs a large computational overhead and easily suffers from attention collapse as it goes deeper. To address these issues, we propose a Transformer-based Hierarchical Dynamic Decoder (T-HDDNet) for image salient object detection. Specifically, our T-HDDNet employs the transformer to encode each image patch into multi-level and multi-resolution features based on the long-range dependencies among pixels. To obtain an accurate, high-resolution saliency map, we develop a dynamic dual upsampling mechanism that enlarges the spatial size of features in a data-driven manner, together with a dynamic feature fusion unit. Ultimately, the hierarchical dynamic decoders built on these two units are used to progressively produce the final saliency map. Extensive experimental results show that the proposed method achieves the best performance on all benchmarks compared with state-of-the-art methods.
KW - Deep learning
KW - Hierarchical dynamic decoders
KW - Salient object detection
KW - Vision transformer
UR - http://www.scopus.com/inward/record.url?scp=85175426777&partnerID=8YFLogxK
U2 - 10.1016/j.knosys.2023.111075
DO - 10.1016/j.knosys.2023.111075
M3 - Article
AN - SCOPUS:85175426777
SN - 0950-7051
VL - 282
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 111075
ER -