Internal Dataset
The internal dataset contains 6 million real-world driving scenarios from Las Vegas (LV), Seattle (SEA), San Francisco (SF), and the campus of the Stanford Linear Accelerator...
Corpus and voices for Catalan speech synthesis
Voice Bank Corpus
The Voice Bank Corpus is a large regional-accent speech database containing over 10 hours of speech data from 20 speakers.
Streamwise StyleMelGAN vocoder for wideband speech coding at very low bit rate
A GAN vocoder that generates wideband speech waveforms from parameters coded at 1.6 kbit/s.
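To put the 1.6 kbit/s figure in perspective, a minimal sketch of the compression ratio, assuming uncompressed 16 kHz, 16-bit mono PCM as the wideband reference (an assumption, not stated in the entry):

```python
# Compression ratio of 1.6 kbit/s coded parameters versus an assumed
# uncompressed wideband reference of 16 kHz, 16-bit mono PCM.
pcm_bitrate = 16_000 * 16      # 256,000 bit/s for the PCM reference
coded_bitrate = 1_600          # 1.6 kbit/s coded parameter stream
ratio = pcm_bitrate / coded_bitrate
print(ratio)  # 160.0
```

Under that reference, the coded stream is 160 times smaller than raw PCM.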
A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognitio...
The Kazakh speech corpus (KSC) contains around 332 hours of transcribed audio comprising over 153,000 utterances spoken by participants from different regions and age groups, as...
Tacotron: Towards End-to-End Speech Synthesis
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model, and an audio synthesis module.
Generative Pre-Training for Speech
Generative models have gained increasing attention in recent years for their remarkable success in tasks that require estimating and sampling a data distribution to generate...
JSUT corpus
The dataset is a large-vocabulary Japanese accent dictionary built using the proposed technique.
ASVspoof 2019
The ASVspoof 2019 dataset is a large-scale public dataset for automatic speaker verification spoofing and countermeasures research. The dataset contains various types of audio files, including...
KSS dataset and LJSpeech dataset
The Korean Single Speaker Speech (KSS) dataset and the LJSpeech dataset are used for the speech synthesis experiments.
ParallelWaveGAN
ParallelWaveGAN is a GAN-based vocoder that generates waveforms in parallel rather than autoregressively.
FastSpeech2
FastSpeech2 is a non-autoregressive text-to-speech model whose predicted mel-spectrograms are converted to waveforms with a GAN-based vocoder.
SNIPER Training: Single-Shot Sparse Training for Text-to-Speech
Text-to-speech (TTS) models have achieved remarkable naturalness in recent years, yet like most deep neural models, they have more parameters than necessary. Sparse TTS models...
LJSpeech-1.1
The LJSpeech-1.1 dataset is a large-scale speech dataset containing approximately 24 hours of single-speaker speech recorded at 22,050 Hz.
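From the two figures in this entry, the total sample count can be worked out directly; a minimal sketch of that arithmetic:

```python
# Approximate total sample count of LJSpeech-1.1:
# ~24 hours of single-speaker audio at a 22,050 Hz sampling rate.
hours = 24
sample_rate = 22_050
total_samples = hours * 3600 * sample_rate
print(total_samples)  # 1905120000, i.e. roughly 1.9 billion samples
```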
Reverberation and Noise Contaminated Speech Datasets
Training and test datasets were generated by contaminating the clean data with reverberation and noise.
Proprietary Speech Dataset
The proprietary speech dataset consists of 184 hours of high-quality US English speech from 11 female and 10 male speakers.
SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive No...
SpecGrad is a proposed neural vocoder that adapts the spectral envelope of the diffusion noise based on the conditioning log-mel spectrogram.