Most existing RGB-D salient object detection (SOD) models adopt a two-stream structure to extract information from the input RGB and depth images. Since they use two subnetworks for unimodal feature extraction and multiple multi-modal feature fusion modules for extracting cross-modal complementary information, these models require a huge number of parameters, thus hindering their real-life applications. To remedy this situation, we propose a novel middle-level feature fusion structure that enables the design of a lightweight RGB-D SOD model. Specifically, the proposed structure first employs two shallow subnetworks to extract low- and middle-level unimodal RGB and depth features, respectively. Afterward, instead of integrating middle-level unimodal features multiple times at different layers, we fuse them only once via a specially designed fusion module. On top of that, high-level multi-modal semantic features are further extracted for final salient object detection via an additional subnetwork. This design greatly reduces the network's parameters. Moreover, to compensate for the performance loss caused by parameter reduction, a relation-aware multi-modal feature fusion module is specially designed to effectively capture cross-modal complementary information during the fusion of middle-level multi-modal features. By enabling feature-level and decision-level information to interact, we maximize the usage of the fused cross-modal middle-level features and the extracted cross-modal high-level features for saliency prediction. Experimental results on several benchmark datasets verify the effectiveness and superiority of the proposed method over state-of-the-art methods. Remarkably, our proposed model has only 3.9M parameters and runs at 33 FPS.
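The data flow described above — two shallow unimodal encoders, a single fusion of middle-level features, then one shared high-level subnetwork — can be sketched as follows. This is a minimal structural sketch, not the paper's implementation: all function names are hypothetical, and the toy element-wise operations merely stand in for the actual convolutional layers and the relation-aware fusion module.

```python
# Structural sketch of a single-fusion pipeline: fuse middle-level
# features ONCE, instead of fusing at every layer of two deep streams.
# All names and operations here are illustrative placeholders.

def shallow_encoder(pixels):
    """Stand-in for a shallow subnetwork producing low- and
    middle-level unimodal features (toy element-wise ops)."""
    low = [p * 0.5 for p in pixels]   # low-level features
    mid = [v + 1.0 for v in low]      # middle-level features
    return low, mid

def fuse_once(mid_rgb, mid_depth):
    """Single middle-level fusion step (in the paper, a relation-aware
    fusion module; here, a simple element-wise average)."""
    return [(r + d) / 2.0 for r, d in zip(mid_rgb, mid_depth)]

def high_level_subnet(fused):
    """Shared subnetwork extracting high-level multi-modal semantics
    from the already-fused features."""
    return [f * f for f in fused]

def predict_saliency(rgb, depth):
    _, mid_rgb = shallow_encoder(rgb)
    _, mid_depth = shallow_encoder(depth)
    fused = fuse_once(mid_rgb, mid_depth)  # one fusion, not per-layer
    return high_level_subnet(fused)

print(predict_saliency([1.0, 2.0], [3.0, 4.0]))  # → [4.0, 6.25]
```

The parameter saving comes from the shape of this pipeline: only the shallow stages are duplicated per modality, and everything after the single fusion point is shared, so no per-layer cross-modal fusion modules are needed.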
Number of pages: 14
Journal: IEEE Transactions on Image Processing
Early online date: 18 Oct 2022
Publication status: Published - 01 Nov 2022
Keywords:
- Lightweight RGB-D salient object detection
- Feature-level and decision-level information mutual guidance
- Relation-aware multi-modal feature fusion