Clotho: An audio captioning dataset

Audio captioning is a multi-modal task that uses natural language to describe the contents of general audio. Most audio captioning methods are based on deep neural networks, employing an encoder-decoder scheme and a dataset of audio clips paired with natural language descriptions (i.e., captions). A significant challenge for audio captioning is the distribution of words in the captions: some words are very frequent but acoustically non-informative, while others are infrequent but informative.
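
The word-distribution issue described above can be illustrated with a minimal sketch. The captions below are hypothetical examples invented for illustration and are not taken from Clotho; the helper name `word_frequencies` is likewise an assumption, not part of any dataset tooling. The sketch only counts whitespace-separated tokens to show how function words dominate while content words appear rarely.

```python
from collections import Counter

# Hypothetical toy captions, invented for illustration (not taken from Clotho).
captions = [
    "a dog is barking in the distance",
    "a car is passing by on the road",
    "rain is falling on a metal roof",
    "a door is creaking in the wind",
]


def word_frequencies(captions):
    """Count how often each whitespace-separated token occurs across all captions."""
    counts = Counter()
    for caption in captions:
        counts.update(caption.lower().split())
    return counts


counts = word_frequencies(captions)

# The most frequent tokens tend to be function words with little acoustic meaning.
frequent = [word for word, _ in counts.most_common(3)]

# Tokens that occur only once are mostly the content words describing the sound.
rare = sorted(word for word, n in counts.items() if n == 1)

print("most frequent:", frequent)
print("occurring once:", rare)
```

In this toy set the frequent tokens are articles and auxiliaries, while words such as "barking" or "creaking" each appear once, which mirrors the imbalance the description refers to.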

Cite this as

Emre Çakır, Konstantinos Drossos, Tuomas Virtanen (2024). Dataset: Clotho: An audio captioning dataset. https://doi.org/10.57702/wmmj8pxy

DOI retrieved: December 2, 2024

Additional Info

Field Value
Created December 2, 2024
Last update December 2, 2024
Defined In https://doi.org/10.48550/arXiv.2007.04660
Author Emre Çakır
More Authors Konstantinos Drossos, Tuomas Virtanen
Homepage https://arxiv.org/abs/1910.09387