TY - JOUR
T1 - Regional Attention Network (RAN) for Head Pose and Fine-Grained Gesture Recognition
AU - Behera, Ardhendu
AU - Wharton, Zachary
AU - Liu, Yonghuai
AU - Ghahremani, Morteza
AU - Kumar, Swagat
AU - Bessis, Nik
N1 - Funding Information:
ACKNOWLEDGMENTS The authors would like to express special thanks to Andrew Gidney, who, although no longer with us, contributed to the original idea published as a conference paper [50]. This work was supported by the Research Investment Fund (RIF) at Edge Hill University (EHU) and the UKIERI-DST Grant CHARM (DST UKIERI-2018-19-10). They would also like to thank Daniel Robinson and Keiron Quinn for providing annotations for the VGGFace2 and MultiLab datasets. The GPU used in this research was generously donated by the NVIDIA Corporation.
Publisher Copyright:
© 2010-2012 IEEE.
PY - 2023/1/1
Y1 - 2023/1/1
N2 - Affect is often expressed via non-verbal body language such as actions/gestures, which are vital indicators of human behavior. Recent studies on the recognition of fine-grained actions/gestures in monocular images have mainly focused on modeling the spatial configuration of body parts representing body pose, human-object interactions, and variations in local appearance. The results show that this is a brittle approach, since it relies on accurate detection of body parts/objects. In this work, we argue that there exist local discriminative semantic regions whose 'informativeness' can be evaluated by an attention mechanism for inferring fine-grained gestures/actions. To this end, we propose a novel end-to-end regional attention network (RAN), a fully convolutional neural network (CNN) that combines multiple contextual regions through an attention mechanism, focusing on the parts of an image that are most relevant to a given task. Our regions consist of one or more consecutive cells and are adapted from the strategies used in computing the HOG (Histogram of Oriented Gradients) descriptor. The model is extensively evaluated on ten datasets belonging to three different scenarios: 1) head pose recognition, 2) driver state recognition, and 3) human action and facial expression recognition. The proposed approach outperforms the state of the art by a considerable margin on different metrics.
AB - Affect is often expressed via non-verbal body language such as actions/gestures, which are vital indicators of human behavior. Recent studies on the recognition of fine-grained actions/gestures in monocular images have mainly focused on modeling the spatial configuration of body parts representing body pose, human-object interactions, and variations in local appearance. The results show that this is a brittle approach, since it relies on accurate detection of body parts/objects. In this work, we argue that there exist local discriminative semantic regions whose 'informativeness' can be evaluated by an attention mechanism for inferring fine-grained gestures/actions. To this end, we propose a novel end-to-end regional attention network (RAN), a fully convolutional neural network (CNN) that combines multiple contextual regions through an attention mechanism, focusing on the parts of an image that are most relevant to a given task. Our regions consist of one or more consecutive cells and are adapted from the strategies used in computing the HOG (Histogram of Oriented Gradients) descriptor. The model is extensively evaluated on ten datasets belonging to three different scenarios: 1) head pose recognition, 2) driver state recognition, and 3) human action and facial expression recognition. The proposed approach outperforms the state of the art by a considerable margin on different metrics.
KW - Annotations
KW - Attention Mechanism
KW - Computer Vision
KW - Convolutional Neural Network
KW - Face recognition
KW - Facial Expressions Recognition
KW - Fine-grained Gesture Recognition
KW - Gesture Recognition
KW - Head
KW - Head Pose Recognition
KW - Human-Object Interaction
KW - Image recognition
KW - Pose estimation
KW - Regional Attention Network
KW - Task analysis
UR - http://www.scopus.com/inward/record.url?scp=85092905226&partnerID=8YFLogxK
U2 - 10.1109/TAFFC.2020.3031841
DO - 10.1109/TAFFC.2020.3031841
M3 - Article
AN - SCOPUS:85092905226
SN - 1949-3045
VL - 14
SP - 549
EP - 562
JO - IEEE Transactions on Affective Computing
JF - IEEE Transactions on Affective Computing
IS - 1
ER -