Dataset - LDM

M4

The M4 dataset consists of human-written texts from several data sources; its English subset includes Wikipedia, Reddit, and arXiv. It pairs the human-written...
- Dataset
- JSON

BABEL-Pashto

The BABEL-Pashto dataset is a multilingual speech recognition dataset containing Pashto speech recordings.
- Dataset
- JSON

MCV-10

This work showcases a cost-effective method for generating training data for speech processing tasks. MCV-10 is a multilingual dataset that contains 50 hours of...
- Dataset
- JSON

TransMuCoRes

Translated dataset for Multilingual Coreference Resolution (TransMuCoRes) in 31 South Asian languages.
- Dataset
- JSON

Very Deep Multilingual Convolutional Neural Networks for LVCSR

Convolutional neural networks (CNNs) are a standard component of many current state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems. However, CNNs in...
- Dataset
- JSON

CommonVoice

The sequence-to-sequence approach is now widely used in speech recognition (SR), and many research works are dedicated to showing that their capabilities relying on a single...
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

6 datasets found