Evaluating state-of-the-art vision-language models for video recognition on real world dataset

Khan, Sheraz

Khan, Sheraz (2023-06-15)

© 2023 Sheraz Khan. This Item is protected by copyright and/or related rights. You may use the Item in any way permitted by the copyright and related-rights legislation that applies to your use. For other uses you need permission from the rights holders.

The permanent address of the publication is
https://urn.fi/URN:NBN:fi:oulu-202306152534

Abstract

One of the main challenges in Computer Vision is the training of custom models from scratch. This process is highly compute-intensive, time-consuming, and requires vast amounts of labeled data to achieve reasonable results. Recently, various foundation models trained using self-supervised learning techniques have been proposed, claiming to achieve good results on downstream tasks after fine-tuning. This document discusses the results obtained by three multi-class video recognition methods based on such vision-language foundation models, using a dataset that closely corresponds to real-world conditions. The primary objective of this work was to investigate the number of training instances these models require to provide competitive results.

Three models, namely VideoMAE, X-CLIP, and Text4Vis, were chosen for evaluation in this study. Their performance was assessed using the YT8M dataset, which includes YouTube videos captured in uncontrolled environments that closely resemble real-world settings. Notably, Text4Vis stood out by achieving an impressive weighted F1-score of 0.87 after fine-tuning with just 1142 videos. The results of X-CLIP are also competitive with those of Text4Vis, while VideoMAE exhibits comparatively lower performance.
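
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a weighted F1-score is commonly computed for multi-class predictions: the per-class F1 scores are averaged with weights proportional to each class's number of true instances. The function name `weighted_f1` and the toy labels are illustrative assumptions and are not taken from the thesis, which may have used a library implementation instead.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    total = len(y_true)
    support = Counter(y_true)  # number of true instances per class
    score = 0.0

    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class by its share of the true labels
        score += (n / total) * f1

    return score


# Toy example with hypothetical labels (not data from the study)
y_true = ["sports", "sports", "sports", "music"]
y_pred = ["sports", "sports", "music", "music"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Because the weights follow class frequency, a weighted F1-score on an imbalanced dataset is dominated by the most frequent classes, which is worth keeping in mind when comparing it with a macro-averaged score.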
