Audio-visual segmentation (AVS) aims to segment sound sources in the video sequence, requiring a pixel-level understanding of audio-visual correspondence.
The Cityscapes dataset is a large and famous city street scene semantic segmentation dataset. 19 classes of which 30 classes of this dataset are considered for training and...