Speech Recognition - Groups

FedNST: Federated Noisy Student Training for Automatic Speech Recognition

Federated Noisy Student Training for Automatic Speech Recognition

Dataset
JSON

VCTK Dataset

The VCTK dataset is a large corpus of speech recordings, each containing a single speaker and a single sentence.

Dataset
JSON

Switchboard

Human speech data comprises a rich set of domain factors such as accent, syntactic and semantic variety, or acoustic environment.

Dataset
JSON

TIMIT, Aurora-4, AMI, and LibriSpeech

Four different corpora are used for our experiments, which are TIMIT, Aurora-4, AMI, and LibriSpeech. TIMIT contains broadband 16kHz recordings of phonetically-balanced read...

Dataset
JSON

BABEL dataset

The dataset used in this paper is the BABEL dataset, which contains 10881 motion sequences, with 65926 subsequences and the corresponding textual labels.

Dataset
JSON

Librispeech

The Librispeech dataset is a large-scale speaker-dependent speech corpus containing 1080 hours of speech, 5600 utterances, and 1000 speakers.

Dataset
JSON

LibriLight

The dataset used in this paper is a large-scale production ASR system, which includes multi-domain (MD) data sets in English. The MD data sets include medium-form (MF) and...

Dataset
JSON

CREMA-D

The CREMA-D dataset is an audio-visual dataset for emotion recognition task, each video in which consists of both facial and acoustic emotional expressions.

Dataset
JSON

MSR-VTT

The dataset used in the paper is MSR-VTT, a large video description dataset for bridging video and language. The dataset contains 10k video clips with length varying from 10 to...

Dataset
JSON

Generating holistic 3D human motion from speech

Generating holistic 3D human motion from speech.

Dataset
JSON

VoxCeleb

Speaker verification systems experience significant performance degradation when tasked with short-duration trial recordings. To address this challenge, a multi-scale feature...

Dataset
JSON

UCONV-CONFORMER: HIGH REDUCTION OF INPUT SEQUENCE LENGTH FOR END-TO-END SPEEC...

Optimization of modern ASR architectures is among the highest priority tasks since it saves many computational resources for model training and inference. The work proposes a...

Dataset
JSON

VCTK

Voice conversion (VC) is a technique that alters the voice of a source speaker to a target style, such as speaker identity, prosody, and emotion, while keeping the linguistic...

Dataset
JSON

VoxCeleb1

Speaker recognition aims to identify speaker information from input speech. A type of speaker recognition is speaker verification (SV). It determines whether the test speaker's...

Dataset
JSON

AudioMNIST

The AudioMNIST dataset consists of 60 speakers, 33% female, who were recorded speaking individual digits (0-9) 50 times each.

Dataset
JSON

Google Speech Commands Dataset Version II

The Google Speech Commands Dataset Version II contains 105,829 utterances of 35 words from 2,618 speakers with a sampling rate of 16 kHz.

Dataset
JSON

TED-LIUM dataset

Dataset
JSON

LibriSpeech dataset

The dataset used in the paper is the LibriSpeech dataset, which contains about 1,000 hours of English speech derived from audiobooks.

Dataset
JSON

Deep Speech model

Dataset
JSON

LibriTTS

A popular text-based VC approach is to use an automatic speech recognition (ASR) model to extract phonetic posteriorgram (PPG) as content representation.

Dataset
JSON

160 datasets found