Campus violence detection based on artificial intelligent interpretation of surveillance video sequences

Ye, Liang; Liu, Tong; Han, Tian; Ferdinando, Hany; Seppänen, Tapio; Alasaarela, Esko (2021-02-09)

URL: https://doi.org/10.3390/rs13040628

Multidisciplinary Digital Publishing Institute

09.02.2021

Ye, L.; Liu, T.; Han, T.; Ferdinando, H.; Seppänen, T.; Alasaarela, E. Campus Violence Detection Based on Artificial Intelligent Interpretation of Surveillance Video Sequences. Remote Sens. 2021, 13, 628. https://doi.org/10.3390/rs13040628

https://creativecommons.org/licenses/by/4.0/
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

The permanent address of this publication is
https://urn.fi/URN:NBN:fi-fe202103298633

Abstract

Campus violence is a common social phenomenon worldwide and is the most harmful type of school bullying. As artificial intelligence and remote sensing techniques develop, several methods for detecting campus violence have become possible, e.g., movement sensor-based methods and video sequence-based methods. Sensors and surveillance cameras are used to detect campus violence. In this paper, the authors use image features and acoustic features for campus violence detection. Campus violence data are gathered by role-playing, and 4096-dimension feature vectors are extracted from every 16 frames of video. The C3D (Convolutional 3D) neural network is used for feature extraction and classification, and an average recognition accuracy of 92.00% is achieved. Mel-frequency cepstral coefficients (MFCCs) are extracted as acoustic features, and three speech emotion databases are involved. The C3D neural network is used for classification, and the average recognition accuracies on the three databases are 88.33%, 95.00%, and 91.67%, respectively. To solve the problem of evidence conflict, the authors propose an improved Dempster–Shafer (D–S) algorithm. Compared with existing D–S theory, the improved algorithm increases the recognition accuracy by 10.79%, and the recognition accuracy can ultimately reach 97.00%.
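
To illustrate the evidence-fusion step mentioned in the abstract, the following is a minimal sketch of the classical Dempster rule of combination, which is the baseline the authors compare against. It is not the improved algorithm proposed in the paper, whose details are not given in this record. The function name `dempster_combine`, the two-hypothesis frame, and the mass values for the video and audio classifiers are all hypothetical and chosen only for illustration.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Fuse two mass functions with the classical Dempster rule.

    Each mass function maps a frozenset of hypotheses to a mass,
    and the masses of each function are assumed to sum to 1.
    Returns the fused mass function and the conflict coefficient K.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            # Agreeing evidence supports the intersection of the two sets.
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Disjoint sets contribute to the conflict coefficient K.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    norm = 1.0 - conflict
    return {k: v / norm for k, v in combined.items()}, conflict


# Hypothetical frame of discernment and illustrative (made-up) masses.
V = frozenset({"violence"})
N = frozenset({"non-violence"})
U = V | N  # mass left on the whole frame expresses uncertainty

video = {V: 0.7, N: 0.2, U: 0.1}  # assumed output of a video classifier
audio = {V: 0.6, N: 0.3, U: 0.1}  # assumed output of an audio classifier

fused, k = dempster_combine(video, audio)
print(f"conflict K = {k:.2f}")
for hypothesis, mass in fused.items():
    print(sorted(hypothesis), round(mass, 3))
```

With these illustrative masses the conflict coefficient is 0.33, and the fused mass on "violence" rises to about 0.821, higher than either source alone. When K approaches 1 the normalization amplifies small agreeing masses, which is the evidence-conflict problem that improved D–S variants aim to mitigate.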

Unless otherwise stated, this material is licensed under https://creativecommons.org/licenses/by/4.0/