Clotho: An audio captioning dataset

doi:10.57702/wmmj8pxy

Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio. Most audio captioning methods are based on deep neural networks, employing an encoder-decoder scheme and a dataset with audio clips and corresponding natural language descriptions (i.e. captions). A significant challenge for audio captioning is the distribution of words in the captions: some words are very frequent but acoustically non-informative, and other words are infrequent but informative.

BibTeX:

@dataset{Emre_Çakır_and_Konstantinos_Drossos_and_Tuomas_Virtanen_2024,
    abstract = {Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio. Most audio captioning methods are based on deep neural networks, employing an encoder-decoder scheme and a dataset with audio clips and corresponding natural language descriptions (i.e. captions). A significant challenge for audio captioning is the distribution of words in the captions: some words are very frequent but acoustically non-informative, and other words are infrequent but informative.},
    author = {Emre Çakır and Konstantinos Drossos and Tuomas Virtanen},
    doi = {10.57702/wmmj8pxy},
    institution = {No Organization},
    keyword = {audio captioning, deep neural networks, natural language},
    month = {dec},
    publisher = {TIB},
    title = {Clotho: An audio captioning dataset},
    url = {https://service.tib.eu/ldmservice/dataset/clotho--an-audio-captioning-dataset},
    year = {2024}
}