Dataset - LDM

Corpus of Spontaneous Japanese

The Corpus of Spontaneous Japanese, described in "The Corpus of Spontaneous Japanese: Its design and evaluation" [30], is a dataset of spontaneous Japanese speech.
- Dataset
- JSON
MSR

The MSR dataset is a widely used vulnerability detection dataset, consisting of 10,900 vulnerable examples and 177,736 non-vulnerable examples.
- Dataset
- JSON
K-NCT

A Korean grammatical error correction (GEC) dataset that includes text-level errors and is used to construct the ASR Post-Processing dataset.
- Dataset
- JSON
Text-Level Error Type Classification Criteria

The proposed text-level error type classification criteria, which consider 13 text-level errors that can occur in speech recognition situations.
- Dataset
- JSON
Speech-Level Error Type Classification Criteria

The proposed speech-level error type classification criteria, which consider 24 sub-types for noise errors and 13 sub-types for speaker characteristics.
- Dataset
- JSON
Error Explainable Benchmark (EEB) dataset

The proposed Error Explainable Benchmark (EEB) dataset covers both speech- and text-level error types and is intended for diagnosing and validating ASR models and post-processors.
- Dataset
- JSON
SEAME

A Mandarin-English code-switched corpus used for the code-switched speech recognition task.
- Dataset
- JSON
ASCEND

A Mandarin-English code-switched corpus used for the code-switched speech recognition task.
- Dataset
- JSON
DeepSpeech

The DeepSpeech dataset used for evaluation of the proposed watermarking scheme.
- Dataset
- JSON
Speech Pattern Based Black-Box Model Watermarking for Automatic Speech Recogn...

The proposed black-box model watermarking framework for protecting the IP of ASR models.
- Dataset
- JSON
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction o...

Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
- Dataset
- JSON
wav2vec: Unsupervised Pre-Training for Speech Recognition

Unsupervised Pre-Training for Speech Recognition
- Dataset
- JSON
AISHELL-1

The AISHELL-1 dataset is a Mandarin speech corpus, consisting of 178 hours of speech, with 11 domains and 400 speakers from different accent areas in China.
- Dataset
- JSON
Speech Intelligibility Prediction with DNN-based Performance Measures

The dataset used for speech intelligibility prediction with DNN-based performance measures.
- Dataset
- JSON
Transformer based Whisper Bangla ASR model

A transformer-based Whisper Bangla ASR model
- Dataset
- JSON
Fleurs

Few-shot learning evaluation of universal representations of speech
- Dataset
- JSON
ALLSSTAR

Large-scale dataset of L1 and L2 scripted and spontaneous transcripts and recordings
- Dataset
- JSON
BD-4SK-ASR

BD-4SK-ASR is an experimental dataset used in the first attempt to develop an ASR system for Sorani Kurdish.
- Dataset
- JSON
IWSLT2018 Speech Translation Task

The IWSLT2018 speech translation task dataset consists of five parts: TED corpus, Speech-translation TED corpus, TED LIUM corpus, WMT18 data and...
- Dataset
- JSON
Wall Street Journal

The Wall Street Journal dataset is used for syntactic linearization. It contains a large corpus of news articles with their corresponding syntactic trees.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

46 datasets found