University of Oulu

Evaluating state-of-the-art vision-language models for video recognition on real world dataset

Author: Khan, Sheraz¹
Organizations: ¹University of Oulu, Faculty of Information Technology and Electrical Engineering, Computer Science
Format: ebook
Version: published version
Access: open
Online Access: PDF Full Text (PDF, 4.7 MB)
Pages: 55
Persistent link:
Language: English
Published: Oulu : S. Khan, 2023
Publish Date: 2023-06-16
Thesis type: Master's thesis (tech)
Tutor: Zhao, Guoying
Mo, Hanlin
Reviewer: Zhao, Guoying
Mo, Hanlin


One of the main challenges in computer vision is training custom models from scratch. This process is compute-intensive and time-consuming, and it requires vast amounts of labeled data to achieve reasonable results. Recently, various foundation models trained with self-supervised learning techniques have been proposed, with claims of good results on downstream tasks after fine-tuning. This work discusses the results obtained by three multi-class video recognition methods based on such vision-language foundation models, using a dataset that closely resembles real-world conditions. The primary objective was to investigate how many instances these models need to deliver competitive results.

Three models, namely VideoMAE, X-CLIP, and Text4Vis, are chosen for evaluation in this study. Their performance is assessed on the YT8M dataset, which includes YouTube videos captured in uncontrolled environments that closely resemble real-world settings. Notably, Text4Vis stood out by achieving a weighted F1-score of 0.87 after fine-tuning with just 1142 videos. The results of X-CLIP are competitive with those of Text4Vis, while VideoMAE exhibits comparatively lower performance.


Copyright information: © Sheraz Khan, 2023. This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.