-
ToG: Text of Gaze Dataset
Text-to-gaze dataset containing over 90k text descriptions of human gaze behavior.
-
Localizing moments in video with natural language
Localizing moments in video with natural language
-
InterHuman
Humanoid Reaction Synthesis is pivotal for creating highly interactive and empathetic robots that can seamlessly integrate into human environments, enhancing the way we live,...
-
Clotho: An audio captioning dataset
Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio. Most audio captioning methods are based on deep neural...
-
Natural Image-Text Dataset
The dataset used for training the Vary-base model, containing natural image-text pairs.