Evaluating state-of-the-art vision-language models for video recognition on real world dataset |
|
Author: | Khan, Sheraz¹ |
Organizations: |
¹University of Oulu, Faculty of Information Technology and Electrical Engineering, Computer Science |
Format: | ebook |
Version: | published version |
Access: | open |
Online Access: | PDF Full Text (PDF, 4.7 MB) |
Pages: | 55 |
Persistent link: | http://urn.fi/URN:NBN:fi:oulu-202306152534 |
Language: | English |
Published: |
Oulu : S. Khan,
2023
|
Publish Date: | 2023-06-16 |
Thesis type: | Master's thesis (tech) |
Tutor: |
Zhao, Guoying; Mo, Hanlin |
Reviewer: |
Zhao, Guoying; Mo, Hanlin |
Description: |
Abstract One of the main challenges in computer vision is training custom models from scratch. This process is computationally intensive and time-consuming, and it requires vast amounts of labeled data to achieve reasonable results. Recently, various foundation models trained using self-supervised learning techniques have been proposed, claiming to achieve good results on downstream tasks after fine-tuning. This document discusses the results obtained by three multi-class video recognition methods based on such vision-language foundation models, using a dataset that closely corresponds to real-world conditions. The primary objective of this work was to investigate the number of training instances these models require to provide competitive results. Three models, namely VideoMAE, X-CLIP, and Text4Vis, are evaluated in this study. Their performance is assessed using the YT8M dataset, which includes YouTube videos captured in uncontrolled environments that closely resemble real-world settings. Notably, Text4Vis stood out by achieving a weighted F1-score of 0.87 after fine-tuning with just 1142 videos. The results of X-CLIP are competitive with those of Text4Vis, while VideoMAE exhibits comparatively lower performance.
|
Subjects: | |
Copyright information: |
© Sheraz Khan, 2023. This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited. |