Clotho: An audio captioning dataset

doi:10.57702/wmmj8pxy

Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio. Most audio captioning methods are based on deep neural networks, employing an encoder-decoder scheme and a dataset with audio clips and corresponding natural language descriptions (i.e. captions). A significant challenge for audio captioning is the distribution of words in the captions: some words are very frequent but acoustically non-informative, and other words are infrequent but informative.
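
The word-distribution imbalance described above can be illustrated with a simple frequency count. This is only a minimal sketch: the captions are invented toy examples, not taken from Clotho, and the helper name `word_frequencies` and the frequency threshold of 2 are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy captions, invented for illustration; not taken from Clotho.
captions = [
    "a dog is barking in the distance",
    "a car is passing by on the road",
    "rain is falling on a metal roof",
]


def word_frequencies(texts):
    """Count how often each lowercase, whitespace-separated token occurs."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


freqs = word_frequencies(captions)

# Words seen at least twice (arbitrary threshold) versus words seen once.
frequent = sorted(w for w, c in freqs.items() if c >= 2)
rare = sorted(w for w, c in freqs.items() if c == 1)

print("frequent:", frequent)
print("rare:", rare)
```

In this toy set, the repeated words are function words such as "a" and "is", while sound-related words such as "barking" and "rain" each occur once, which mirrors the imbalance the description refers to.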

Data and Resources

Original Metadata (JSON)
The JSON representation of the dataset with its distributions based on DCAT.

Cite this as

Emre Çakır, Konstantinos Drossos, Tuomas Virtanen (2024). Dataset: Clotho: An audio captioning dataset. https://doi.org/10.57702/wmmj8pxy

DOI retrieved: December 2, 2024

Additional Info

Field	Value
Created	December 2, 2024
Last update	December 2, 2024
Defined In	https://doi.org/10.48550/arXiv.2007.04660
Author	Emre Çakır
More Authors	Konstantinos Drossos, Tuomas Virtanen
Homepage	https://arxiv.org/abs/1910.09387