Temporal self-ensembling teacher for semi-supervised object detection |
|
Author: | Chen, Cong (1); Dong, Shouyang (2); Tian, Ye (3) |
Organizations: |
1 Keya Medical Technology, Shenzhen 518116, China
2 Software Department at Cambricon, Beijing 100010, China
3 Hippocrates Research Laboratory at Tencent, Shenzhen 518052, China
4 College of System Engineering, National University of Defense Technology, Changsha 410073, China
5 Center for Machine Vision and Signal Analysis at the University of Oulu, 90570 Oulu, Finland
6 Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China |
Format: | article |
Version: | accepted version |
Access: | open |
Online Access: | PDF Full Text (PDF, 6.9 MB) |
Persistent link: | http://urn.fi/urn:nbn:fi-fe2023040535144 |
Language: | English |
Published: | Institute of Electrical and Electronics Engineers, 2022 |
Publish Date: | 2023-04-05 |
Description: |
Abstract: This paper focuses on semi-supervised object detection (SSOD), which makes good use of unlabeled data to boost performance. We face the following obstacles when adapting the knowledge distillation (KD) framework to SSOD. (1) The teacher model serves a dual role as a teacher and a student, such that the teacher's predictions on unlabeled images may limit the upper bound of the student. (2) The data imbalance caused by the large quantity of consistent predictions between the teacher and student hinders efficient knowledge transfer between them. To mitigate these issues, we propose a novel SSOD model called Temporal Self-Ensembling Teacher (TSET). Our teacher model ensembles its temporal predictions for unlabeled images under stochastic perturbations. It then ensembles its model weights with those of the student model by an exponential moving average. These ensembling strategies ensure data and model diversity, and lead to better teacher predictions for unlabeled images. In addition, we adapt the focal loss to formulate the consistency loss for handling the data imbalance. Together with a thresholding method, the focal loss automatically reweights the inconsistent predictions, which preserves the knowledge for difficult-to-detect objects in the unlabeled images. The mAP of our model reaches 80.73% and 40.52% on the VOC2007 test set and the COCO2014 minival5k set, respectively, outperforming a strong fully supervised detector by 2.37% and 1.49%. Furthermore, the mAP of our model (80.73%) sets a new state-of-the-art performance in SSOD on the VOC2007 test set.
|
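The two ensembling strategies and the focal-style consistency reweighting described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch only: the function names, the simplified focal formulation, and all hyperparameter values (`alpha`, `beta`, `gamma`, `threshold`) are assumptions, not the paper's actual implementation.

```python
import math

def ema_update(teacher_w, student_w, alpha=0.999):
    """Model-weight ensembling: blend the teacher's weights toward the
    student's via an exponential moving average (illustrative alpha)."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher_w, student_w)]

def temporal_ensemble(history, new_pred, beta=0.6):
    """Prediction ensembling: maintain a running average of the teacher's
    temporal predictions for one unlabeled image across perturbed passes."""
    return [beta * h + (1.0 - beta) * p for h, p in zip(history, new_pred)]

def focal_consistency(p_teacher, p_student, gamma=2.0, threshold=0.05):
    """Focal-style reweighting of a per-prediction consistency loss
    (hypothetical simplified form, not the paper's exact loss).

    Teacher/student pairs that already agree (small probability gap) are
    down-weighted by the (gap ** gamma) factor, and gaps below `threshold`
    are dropped entirely, so inconsistent, hard-to-detect objects dominate.
    """
    total = 0.0
    for pt, ps in zip(p_teacher, p_student):
        gap = abs(pt - ps)
        if gap < threshold:  # thresholding: skip already-consistent pairs
            continue
        total += (gap ** gamma) * -math.log(max(1.0 - gap, 1e-12))
    return total
```

In this sketch, the EMA keeps the teacher a slowly moving ensemble of past students (model diversity), while the temporal ensemble smooths the teacher's targets over stochastic perturbations (data diversity); the focal weighting then concentrates the consistency loss on the minority of inconsistent predictions.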
Series: |
IEEE transactions on multimedia |
ISSN: | 1520-9210 |
ISSN-E: | 1941-0077 |
ISSN-L: | 1520-9210 |
Volume: | 24 |
Pages: | 3679 - 3692 |
DOI: | 10.1109/tmm.2021.3105807 |
OADOI: | https://oadoi.org/10.1109/tmm.2021.3105807 |
Type of Publication: |
A1 Journal article – refereed |
Field of Science: |
113 Computer and information sciences |
Funding: |
This work was supported by the National Natural Science Foundation of China (No. 12073047). |
Copyright information: |
© 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. |