-
CommonVoice
Sequence-to-sequence approaches are now widely used in speech recognition (SR), and many research works are dedicated to showing their capabilities relying on a single... -
Europarl-v7
Multilingual document classification task, where labeled data is available only for one language (e.g. English) while classification must be performed in a different language... -
Dictation dataset
A dictation dataset spanning 39 locales, covering Latin (Albanian, Icelandic, Slovak), Arabic (Levant, Maghrebi), Cyrillic (Macedonian, Kazakh), and Devanagari (Nepali) scripts, among others. -
LJ Speech Dataset
The LJ Speech dataset consists of speech samples recorded from a single speaker reading passages from 7 non-fiction books. -
Wikipedia Comparable Corpora
A multilingual dataset for topic modeling, based on aligned Wikipedia articles extracted from Wikipedia Comparable Corpora. -
Librispeech
The Librispeech dataset is a large-scale speaker-dependent speech corpus containing 1080 hours of speech, 5600 utterances, and 1000 speakers. -
LibriLight
The dataset used in this paper comes from a large-scale production ASR system, which includes multi-domain (MD) data sets in English. The MD data sets include medium-form (MF) and...