Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization

Temporal action localization is the task of localizing the start and end timestamps of action instances and recognizing their categories. In recent years, many works have focused on the fully supervised setting and achieved strong results. However, these fully supervised methods require extensive manual frame- or snippet-level annotations. To address this problem, many weakly supervised temporal action localization (WS-TAL) methods have been proposed to detect action instances in the given videos using only video-level supervision, which is easier for annotators to provide.

BibTeX: