Multi-scale Bottleneck Transformer for Weakly Supervised Multimodal Violence Detection

Weakly supervised multimodal violence detection (MVD) aims to learn a violence detection model by leveraging multiple modalities, such as RGB, optical flow, and audio, when only video-level annotations are available. Three key challenges to effective MVD are information redundancy, modality imbalance, and modality asynchrony.

Cite this as

Shengyang Sun, Xiaojin Gong (2024). Dataset: Multi-scale Bottleneck Transformer for Weakly Supervised Multimodal Violence Detection. https://doi.org/10.57702/4v7170o5

DOI retrieved: December 16, 2024

Additional Info

Field         Value
Created       December 16, 2024
Last update   December 16, 2024
Author        Shengyang Sun
More Authors  Xiaojin Gong
Homepage      https://github.com/shengyangsun/MSBT