Adversarial Erasing Framework via Triplet with Gated Pyramid Pooling Layer for Weakly Supervised Semantic Segmentation
– Published Date : TBD
– Category : Weakly Supervised Semantic Segmentation (WSSS)
– Place of publication : European Conference on Computer Vision (ECCV) 2022
Abstract:
Weakly supervised semantic segmentation (WSSS) with image-level labels attracts much interest due to its practicality. Although most WSSS methods utilize Class Activation Maps (CAMs) to localize the target object when only image-level labels are given, CAMs typically produce imprecise results that do not fit object boundaries and highlight only the most discriminative regions. One main reason for the imprecise CAMs is the usage of Global Average Pooling (GAP). To resolve this problem, we propose a Gated Pyramid Pooling (GPP) layer that decouples CAMs into pyramid features at multiple spatial resolutions. By using them as weights for class prediction, CAMs are trained not only to capture the global context but also to preserve fine details of the image. On the other hand, to handle the bias of CAMs toward the most discriminative regions, Adversarial Erasing (AE) methods have been proposed. AE methods effectively extend CAMs to proper object regions by erasing the most discriminative regions of an image, but they usually suffer from an over-expansion problem due to the absence of a guideline on when to stop erasing. To guide CAMs to explore the less discriminative regions while preventing over-erasing, we propose a novel Adversarial Erasing Framework via Triplet (AEFT), which reformulates the AE method as metric learning (specifically, with a triplet loss). Unlike previous AE methods based on rigid classification, AEFT prevents the over-expansion problem by using the triplet loss, a more flexible criterion for applying the AE method. By utilizing the distance between GPP features as the metric in AEFT, we achieve new state-of-the-art results on the PASCAL VOC 2012 val/test sets and the MS-COCO 2014 val set with only image-level supervision: 70.9%/71.7% and 40.6% mIoU, respectively.
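The core idea of replacing GAP with gated multi-resolution pooling over CAMs can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the module name `GatedPyramidPoolSketch`, the pyramid bin sizes, and the scalar softmax gating are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedPyramidPoolSketch(nn.Module):
    """Sketch of a gated multi-resolution pooling head over CAMs (assumed design)."""

    def __init__(self, bins=(1, 2, 4)):
        super().__init__()
        self.bins = bins
        # one learnable gate per pyramid level, normalized with softmax
        self.gates = nn.Parameter(torch.zeros(len(bins)))

    def forward(self, cams: torch.Tensor) -> torch.Tensor:
        # cams: (B, C, H, W) class activation maps, one channel per class
        weights = torch.softmax(self.gates, dim=0)
        logits = cams.new_zeros(cams.shape[:2])            # (B, C)
        for w, b in zip(weights, self.bins):
            pooled = F.adaptive_avg_pool2d(cams, b)        # (B, C, b, b)
            logits = logits + w * pooled.mean(dim=(2, 3))  # aggregate this level
        return logits  # image-level class scores
```

Training such a head with a multi-label classification loss (e.g. `BCEWithLogitsLoss`) on image-level labels would push the CAMs to remain informative at several resolutions rather than only under global averaging, which is the intuition the abstract describes. The triplet-based erasing criterion can likewise be illustrated with a standard triplet margin loss over pooled features; the specific anchor/positive/negative assignment and margin that AEFT uses are not given in the abstract, so the roles below are purely illustrative.

```python
def triplet_erasing_loss(f_anchor: torch.Tensor,
                         f_positive: torch.Tensor,
                         f_negative: torch.Tensor,
                         margin: float = 1.0) -> torch.Tensor:
    # f_*: (B, D) pooled (e.g. GPP) feature embeddings; which images play the
    # anchor / positive / negative roles here is an assumption for this sketch
    return F.triplet_margin_loss(f_anchor, f_positive, f_negative, margin=margin)
```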