Evaluating state-of-the-art vision-language models for video recognition on real world dataset
Khan, Sheraz (2023-06-15)
© 2023 Sheraz Khan. This item is protected by copyright and/or related rights. You may use the item in ways permitted by the copyright and related rights legislation that applies to your use. For any other use, you need permission from the rights holders.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:oulu-202306152534
Abstract
One of the main challenges in computer vision is training custom models from scratch. This process is compute-intensive and time-consuming, and it requires vast amounts of labeled data to achieve reasonable results. Recently, various foundation models trained with self-supervised learning techniques have been proposed, claiming to achieve good results on downstream tasks after fine-tuning. This document discusses the results obtained by three multi-class video recognition methods based on such vision-language foundation models, using a dataset that closely corresponds to real-world conditions. The primary objective of this work was to investigate how many training instances these models require to provide competitive results.
Three models, namely VideoMAE, X-CLIP, and Text4Vis, were chosen for evaluation in this study. Their performance was assessed on the YT8M dataset, which consists of YouTube videos captured in uncontrolled environments that closely resemble real-world settings. Notably, Text4Vis stood out by achieving a weighted F1-score of 0.87 after fine-tuning with just 1142 videos. The results of X-CLIP were competitive with those of Text4Vis, while VideoMAE exhibited comparatively lower performance.
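The weighted F1-score reported above is commonly defined as the per-class F1 averaged with weights proportional to each class's number of true instances. The following is a minimal sketch of that common definition, not the evaluation code used in the thesis; the function name `weighted_f1` and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative sketch)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    support = Counter(y_true)      # number of true instances per class
    predicted = Counter(y_pred)    # number of predictions per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = 0.0
    for label, n_true in support.items():
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += n_true * f1       # weight each class by its support

    return total / len(y_true)


# Hypothetical toy labels, not data from the study.
score = weighted_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
print(round(score, 4))
```

Weighting by support means that frequent classes dominate the score, which is worth keeping in mind when the class distribution of a real-world dataset is imbalanced.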
Collections
- Open access [32049]