TY - JOUR
T1 - Dual Attention with the Self-Attention Alignment for Efficient Video Super-resolution
AU - Chu, Yuezhong
AU - Qiao, Yunan
AU - Liu, Heng
AU - Han, Jungong
N1 - Funding Information:
This work was funded in part by the National Natural Science Foundation of China (Grant No. 61971004), the Natural Science Foundation of Anhui Province (Grant No. 2008085MF190), and the Key Project of Natural Science of the Anhui Provincial Department of Education (Grant No. KJ2019A0083).
Publisher Copyright:
© 2021, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2022/5/1
Y1 - 2022/5/1
N2 - By selectively enhancing the features extracted from convolutional networks, the attention mechanism has shown its effectiveness for low-level vision tasks, especially image super-resolution (SR). However, due to the spatiotemporal continuity of video sequences, simply applying image attention to a video does not yield good SR results, and a suitable attention structure for efficient video SR is still lacking. In this work, building upon dual attention, i.e., position attention and channel attention, we propose a deep dual attention network underpinned by self-attention alignment (DASAA) for video SR. Specifically, we start by constructing a dual attention module (DAM) to strengthen the acquired spatiotemporal features, and we adopt a self-attention structure with a morphological mask to achieve attention alignment. Then, on top of the attention features, we utilize an up-sampling operation to reconstruct the super-resolved video frames and introduce an LSTM (long short-term memory) network to guarantee the coherent consistency of the generated video frames both temporally and spatially. Experimental results and comparisons on the real-world Youku-VESR dataset and the typical benchmark dataset Vimeo-90K demonstrate that our proposed approach achieves the best video SR quality while requiring the least computation. Specifically, on the Youku-VESR dataset, our approach achieves a test PSNR of 35.290 dB and an SSIM of 0.939. On the Vimeo-90K dataset, the PSNR/SSIM values of our approach are 32.878 dB and 0.774. Moreover, the computational cost of our approach is as low as 6.39 GFLOPs (floating-point operations). The proposed DASAA method surpasses all video SR algorithms in the comparison. It is also revealed that there is no linear relationship between position attention and channel attention, which suggests that our DASAA with its LSTM coherent-consistency architecture may have great potential for many low-level vision video applications.
AB - By selectively enhancing the features extracted from convolutional networks, the attention mechanism has shown its effectiveness for low-level vision tasks, especially image super-resolution (SR). However, due to the spatiotemporal continuity of video sequences, simply applying image attention to a video does not yield good SR results, and a suitable attention structure for efficient video SR is still lacking. In this work, building upon dual attention, i.e., position attention and channel attention, we propose a deep dual attention network underpinned by self-attention alignment (DASAA) for video SR. Specifically, we start by constructing a dual attention module (DAM) to strengthen the acquired spatiotemporal features, and we adopt a self-attention structure with a morphological mask to achieve attention alignment. Then, on top of the attention features, we utilize an up-sampling operation to reconstruct the super-resolved video frames and introduce an LSTM (long short-term memory) network to guarantee the coherent consistency of the generated video frames both temporally and spatially. Experimental results and comparisons on the real-world Youku-VESR dataset and the typical benchmark dataset Vimeo-90K demonstrate that our proposed approach achieves the best video SR quality while requiring the least computation. Specifically, on the Youku-VESR dataset, our approach achieves a test PSNR of 35.290 dB and an SSIM of 0.939. On the Vimeo-90K dataset, the PSNR/SSIM values of our approach are 32.878 dB and 0.774. Moreover, the computational cost of our approach is as low as 6.39 GFLOPs (floating-point operations). The proposed DASAA method surpasses all video SR algorithms in the comparison. It is also revealed that there is no linear relationship between position attention and channel attention, which suggests that our DASAA with its LSTM coherent-consistency architecture may have great potential for many low-level vision video applications.
KW - Dual attention
KW - FLOPs
KW - Self-attention alignment
KW - Video super-resolution
UR - http://www.scopus.com/inward/record.url?scp=85106001358&partnerID=8YFLogxK
U2 - 10.1007/s12559-021-09874-1
DO - 10.1007/s12559-021-09874-1
M3 - Article
AN - SCOPUS:85106001358
SN - 1866-9956
VL - 14
SP - 1140
EP - 1151
JO - Cognitive Computation
JF - Cognitive Computation
IS - 3
ER -