University of Oulu

C. Chen, S. Dong, Y. Tian, K. Cao, L. Liu and Y. Guo, "Temporal Self-Ensembling Teacher for Semi-Supervised Object Detection," in IEEE Transactions on Multimedia, vol. 24, pp. 3679-3692, 2022, doi: 10.1109/TMM.2021.3105807

Temporal self-ensembling teacher for semi-supervised object detection

Saved in:
Author: Chen, Cong1; Dong, Shouyang2; Tian, Ye3; Cao, K.; Liu, L.; Guo, Y.
Organizations: 1Keya Medical Technology, Shenzhen 518116, China
2Software Department at Cambricon, Beijing 100010, China
3Hippocrates Research Laboratory at Tencent, Shenzhen 518052, China
4College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
5Center for Machine Vision and Signal Analysis, University of Oulu, 90570 Oulu, Finland
6Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Format: article
Version: accepted version
Access: open
Online Access: PDF Full Text (PDF, 6.9 MB)
Persistent link:
Language: English
Published: Institute of Electrical and Electronics Engineers, 2022
Publish Date: 2023-04-05


This paper focuses on semi-supervised object detection (SSOD), which exploits unlabeled data to boost detection performance. Adapting the knowledge distillation (KD) framework to SSOD raises two obstacles. (1) The teacher model serves a dual role as teacher and student, so the teacher's predictions on unlabeled images may limit the upper bound of the student. (2) The large number of consistent predictions shared by the teacher and student creates a data imbalance that hinders efficient knowledge transfer between them. To mitigate these issues, we propose a novel SSOD model called Temporal Self-Ensembling Teacher (TSET). Our teacher model ensembles its temporal predictions for unlabeled images under stochastic perturbations, and additionally ensembles its model weights with those of the student model via an exponential moving average (EMA). These ensembling strategies ensure data and model diversity, leading to better teacher predictions on unlabeled images. In addition, we adapt the focal loss to formulate the consistency loss and thereby handle the data imbalance issue. Together with a thresholding method, the focal loss automatically reweights the inconsistent predictions, preserving the knowledge of objects that are difficult to detect in the unlabeled images. The mAP of our model reaches 80.73% on the VOC2007 test set and 40.52% on the COCO2014 minival5k set, outperforming a strong fully supervised detector by 2.37% and 1.49%, respectively. Furthermore, our model's 80.73% mAP sets a new state-of-the-art for SSOD on the VOC2007 test set.
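The three mechanisms named in the abstract (EMA weight ensembling, temporal prediction ensembling, and a focal-style consistency loss with thresholding) can be illustrated with a minimal numpy sketch. This is an assumption-laden illustration, not the paper's exact formulation: the function names, the decay constants, and the use of a cross-entropy-style discrepancy term are all hypothetical choices made here for clarity.

```python
import numpy as np

def ema_update(teacher_w, student_w, alpha=0.999):
    """Ensemble teacher weights with student weights via an
    exponential moving average (per-parameter)."""
    return {k: alpha * teacher_w[k] + (1 - alpha) * student_w[k]
            for k in teacher_w}

def ensemble_predictions(running, current, beta=0.6):
    """EMA over the teacher's temporal predictions for an unlabeled
    image, accumulated across epochs under stochastic perturbations."""
    return beta * running + (1 - beta) * current

def focal_consistency(p_teacher, p_student, gamma=2.0, thresh=0.05, eps=1e-7):
    """Focal-style reweighting of a per-prediction consistency loss.
    Pairs that already agree (discrepancy below `thresh`) are zeroed
    out, and the remaining loss is modulated by discrepancy**gamma,
    focusing learning on inconsistent, hard-to-detect objects."""
    diff = np.abs(p_teacher - p_student)          # per-class discrepancy
    loss = -p_teacher * np.log(p_student + eps)   # cross-entropy-style term
    weight = diff ** gamma                        # focal modulation
    weight[diff < thresh] = 0.0                   # drop consistent pairs
    return float(np.mean(weight * loss))
```

With these definitions, a teacher/student pair whose class probabilities agree contributes zero consistency loss, while a disagreeing pair contributes a loss that grows with the discrepancy — which is the imbalance-handling behavior the abstract describes.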


Series: IEEE transactions on multimedia
ISSN: 1520-9210
ISSN-E: 1941-0077
ISSN-L: 1520-9210
Volume: 24
Pages: 3679 - 3692
DOI: 10.1109/TMM.2021.3105807
Type of Publication: A1 Journal article – refereed
Field of Science: 113 Computer and information sciences
Funding: This work was supported by the National Natural Science Foundation of China (No. 12073047).
Copyright information: © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.