University of Oulu

B. D. Romaissa, O. Mourad and N. Brahim, "Vision-Based Multi-Modal Framework for Action Recognition," 2020 25th International Conference on Pattern Recognition (ICPR), 2021, pp. 5859-5866, doi: 10.1109/ICPR48806.2021.9412863

Vision-based multi-modal framework for action recognition

Author: Romaissa, Beddiar Djamila (1,2); Mourad, Oussalah (2); Brahim, Nini (1)
Organizations: (1) Research Laboratory on Computer Science's Complex Systems, University Laarbi Ben M'hidi, Oum El Bouaghi, Algeria
(2) Center for Machine Vision and Signal Analysis, University of Oulu, Finland
Format: article
Version: accepted version
Access: open
Online Access: PDF Full Text (PDF, 0.4 MB)
Persistent link: http://urn.fi/urn:nbn:fi-fe2021102552121
Language: English
Published: IEEE Computer Society, 2021
Publish Date: 2021-10-25
Description:

Abstract

Human activity recognition plays a central role in the development of intelligent systems for video surveillance, public security, health care and home monitoring, where the detection and recognition of activities can improve the quality of life and security of humans. Typically, automated, intuitive and real-time systems are required to recognize human activities and accurately identify unusual behaviors in order to prevent dangerous situations. In this work, we explore the combination of three modalities (RGB, depth and skeleton data) to design a robust multi-modal framework for vision-based human activity recognition. In particular, spatial information, body shape/posture and the temporal evolution of actions are highlighted using illustrative representations obtained from a combination of dynamic RGB images, dynamic depth images and skeleton data representations. Each video is therefore represented by three images that summarize the ongoing action. Our framework takes advantage of transfer learning from pre-trained models to extract significant features from these newly created images. Next, we fuse the extracted features using Canonical Correlation Analysis and train a Long Short-Term Memory network to classify actions from the visual descriptive images. Experimental results demonstrate the reliability of our feature-fusion framework, which captures highly significant features and achieves state-of-the-art performance on the public UTD-MHAD and NTU RGB+D datasets.

Series: International Conference on Pattern Recognition
ISSN: 1051-4651
ISSN-L: 1051-4651
ISBN: 978-1-7281-8809-6
ISBN Print: 978-1-7281-8808-9
Pages: 5859-5866
DOI: 10.1109/ICPR48806.2021.9412863
OADOI: https://oadoi.org/10.1109/ICPR48806.2021.9412863
Host publication: 2020 25th International Conference on Pattern Recognition (ICPR)
Conference: International Conference on Pattern Recognition
Type of Publication: A4 Article in conference proceedings
Field of Science: 113 Computer and information sciences
Funding: This work is partly supported by the Algerian Residential Training Program Abroad Outstanding National Program (PNE), which supported the first author's stay at the University of Oulu, and by the European YoungRes project (Ref. 823701); both are gratefully acknowledged.
Copyright information: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.