Searching multi-rate and multi-modal temporal enhanced networks for gesture recognition

Yu, Zitong; Zhou, Benjia; Wan, Jun; Wang, Pichao; Liu, Xin; Li, Stan Z.; Zhao, Guoying (2021-06-14)

URL:

https://doi.org/10.1109/TIP.2021.3087348

Institute of Electrical and Electronics Engineers

14.06.2021

Z. Yu et al., "Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition," in IEEE Transactions on Image Processing, vol. 30, pp. 5626-5640, 2021, doi: 10.1109/TIP.2021.3087348

© 2021 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.


The permanent address of this publication is
https://urn.fi/URN:NBN:fi-fe2021090144890

Abstract

Gesture recognition has attracted considerable attention owing to its great potential in applications. Although great progress has been made recently in multi-modal learning, existing methods still lack effective integration to fully exploit the synergies among spatio-temporal modalities for gesture recognition. This is partly because existing manually designed network architectures are inefficient at jointly learning from multiple modalities. In this paper, we propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition. The proposed method includes two key components: 1) an enhanced temporal representation via the proposed 3D Central Difference Convolution (3D-CDC) family, which captures rich temporal context by aggregating temporal difference information; and 2) optimized backbones for multi-sampling-rate branches and lateral connections among varied modalities. The resultant multi-modal multi-rate network provides a new perspective for understanding the relationship between RGB and depth modalities and their temporal dynamics. Comprehensive experiments on three benchmark datasets (IsoGD, NvGesture, and EgoGesture) demonstrate state-of-the-art performance in both single- and multi-modality settings. The code is available at https://github.com/ZitongYu/3DCDC-NAS.
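
The abstract does not spell out the 3D-CDC operator, so the following is only a one-dimensional toy sketch of the general central-difference idea it alludes to: a plain convolution along the time axis is blended with a term that aggregates differences between each neighbouring frame and the centre frame. The function name `temporal_cdc_1d`, the blend factor `theta`, and the zero padding are assumptions made for illustration; the actual 3D-CDC definitions are given in the paper and the linked repository.

```python
def temporal_cdc_1d(signal, weights, theta=0.5):
    """Toy 1-D central-difference convolution along time (illustrative only).

    Computes, with zero padding at the borders,
        y[t] = sum_n w[n] * x[t + n]  -  theta * x[t] * sum_n w[n]
    which equals
        (1 - theta) * vanilla[t] + theta * sum_n w[n] * (x[t + n] - x[t]).
    `theta` (hypothetical name) is assumed to lie in [0, 1]:
    0 gives a plain convolution, 1 gives a pure difference response.
    """
    if len(weights) % 2 == 0:
        raise ValueError("weights must have odd length")
    radius = len(weights) // 2
    weight_sum = sum(weights)
    output = []
    for t in range(len(signal)):
        # Plain (vanilla) convolution response at time step t.
        vanilla = 0.0
        for offset, w in zip(range(-radius, radius + 1), weights):
            idx = t + offset
            if 0 <= idx < len(signal):  # frames outside the clip count as zero
                vanilla += w * signal[idx]
        # Subtract the centre-frame term to inject temporal difference information.
        output.append(vanilla - theta * weight_sum * signal[t])
    return output


if __name__ == "__main__":
    static_clip = [2.0, 2.0, 2.0, 2.0, 2.0]  # no change over time
    print(temporal_cdc_1d(static_clip, [1.0, 1.0, 1.0], theta=1.0))
```

With `theta=1.0`, the interior outputs for the constant clip are zero, since only change over time contributes to the response; the border values are non-zero solely because of the assumed zero padding.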

Unless otherwise stated, this material is licensed under https://creativecommons.org/licenses/by/4.0/