Temporal Saliency Query Network for Efficient Video Recognition

Boyang Xia, Zhihao Wang, Wenhao Wu, Haoran Wang, Jungong Han

Research output: Chapter in Book/Report/Conference proceeding, Conference Proceeding (Non-Journal item)

4 Citations (SciVal)

Abstract

Efficient video recognition has become an active research topic with the explosive growth of multimedia data on the Internet and mobile devices. Most existing methods select salient frames without awareness of class-specific saliency scores, neglecting the implicit association between the saliency of a frame and the category it belongs to. To alleviate this issue, we devise a novel Temporal Saliency Query (TSQ) mechanism, which introduces class-specific information to provide fine-grained cues for saliency measurement. Specifically, we model the class-specific saliency measuring process as a query-response task. For each category, its common pattern is employed as a query, and the most salient frames are returned as the response. The calculated similarities are then adopted as the frame saliency scores. To achieve this, we propose a Temporal Saliency Query Network (TSQNet) that includes two instantiations of the TSQ mechanism, based on visual appearance similarities and textual event-object relations. Cross-modality interactions are then imposed to promote information exchange between them. Finally, we use the class-specific saliencies of the most confident categories generated by the two modalities to select the salient frames. Extensive experiments demonstrate the effectiveness of our method, which achieves state-of-the-art results on the ActivityNet, FCVID and Mini-Kinetics datasets. Our project page is at https://lawrencexia2008.github.io/projects/tsqnet.
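
The query-response idea in the abstract can be illustrated with a minimal sketch. This is not the authors' TSQNet implementation: it assumes precomputed frame features and one query vector per category, uses cosine similarity as the similarity measure, and averages the saliency rows of the most confident categories before picking frames. The function names, the `class_confidence` input, and the averaging step are all hypothetical choices made for illustration, and the two-modality design and cross-modality interaction are omitted.

```python
import numpy as np


def class_specific_saliency(frame_feats, class_queries):
    """Cosine similarity between each class query and each frame.

    frame_feats:   (T, D) array, one feature vector per frame.
    class_queries: (C, D) array, one query vector per category (assumed given).
    Returns a (C, T) array of class-specific saliency scores.
    """
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = class_queries / np.linalg.norm(class_queries, axis=1, keepdims=True)
    return q @ f.T


def select_salient_frames(saliency, class_confidence, top_classes=2, num_frames=3):
    """Pick frames using the saliency rows of the most confident categories."""
    # Indices of the most confident categories (hypothetical confidence input).
    top = np.argsort(class_confidence)[::-1][:top_classes]
    # Average saliency over those categories only (an assumed aggregation).
    pooled = saliency[top].mean(axis=0)
    # Keep the highest-scoring frames and return them in temporal order.
    chosen = np.argsort(pooled)[::-1][:num_frames]
    return np.sort(chosen)
```

With four frames and two categories, calling `class_specific_saliency` yields a 2 x 4 score matrix, and `select_salient_frames` then returns the indices of the frames that score highest for the categories the classifier is most confident about.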

Original language: English
Title of host publication: Computer Vision – ECCV 2022 - 17th European Conference, Proceedings
Subtitle of host publication: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIV
Editors: Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, Tal Hassner
Pages: 741-759
Number of pages: 19
Publication status: Published - 22 Oct 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13694 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Keywords

  • Temporal sampling
  • Transformer
  • Video recognition
