49 datasets found

Tags: Multilingual

  • CommonVoice

    The sequence-to-sequence approach is now widely used in speech recognition (SR), and many research works are dedicated to showing that their capabilities relying on a single...
  • MuST-C

    MuST-C is a multilingual speech translation dataset, which contains at least 385 hours of audio recordings from TED Talks, with their manual transcriptions and translations at...
  • Europarl-v7

    Multilingual document classification task, where labeled data is available only for one language (e.g. English) while classification must be performed in a different language...
  • Dictation dataset

    A dictation dataset spanning 39 locales and several scripts, including Latin (Albanian, Icelandic, Slovak), Arabic (Levant, Maghrebi), Cyrillic (Macedonian, Kazakh), and Devanagari (Nepali), among others.
  • LJ Speech Dataset

    The LJ Speech dataset consists of speech samples recorded from a single speaker reading passages from 7 non-fiction books.
  • Wikipedia Comparable Corpora

    Multilingual dataset for topic modeling, based on aligned Wikipedia articles extracted from the Wikipedia Comparable Corpora.
  • VATEX

    The dataset used in the paper is a video question answering dataset, applied to large-scale video-language pre-training.
  • Librispeech

    The Librispeech dataset is a large-scale speaker-dependent speech corpus containing 1080 hours of speech, 5600 utterances, and 1000 speakers.
  • LibriLight

    The dataset used in this paper comes from a large-scale production ASR system and includes multi-domain (MD) data sets in English. The MD data sets include medium-form (MF) and...
You can also access this registry using the API (see API Docs).