Neural Codec Language Models
Neural codec language models are zero-shot text-to-speech synthesizers. -
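The entry above names the general recipe behind neural codec language models: an autoregressive language model over discrete audio-codec tokens, conditioned on text and on codec tokens from a short acoustic prompt, so an unseen voice can be imitated zero-shot. Below is a minimal PyTorch sketch of such a decoder; the vocabulary sizes, model width, greedy `synthesize` loop, and all names are illustrative assumptions, not any specific paper's implementation.

```python
# Minimal sketch of a neural codec language model for zero-shot TTS.
# Assumptions (not from the entry above): vocabulary sizes, model width,
# and the greedy decoding loop are placeholders for illustration only.
import torch
import torch.nn as nn

class CodecLM(nn.Module):
    def __init__(self, n_text=256, n_codec=1024, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.text_emb = nn.Embedding(n_text, d_model)         # phoneme/character tokens
        self.codec_emb = nn.Embedding(n_codec + 1, d_model)   # +1 for a BOS token
        self.bos = n_codec
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_codec)

    def forward(self, text_ids, codec_ids):
        # text_ids: (B, T_text) conditioning; codec_ids: (B, T_codec) shifted-right targets.
        memory = self.text_emb(text_ids)
        tgt = self.codec_emb(codec_ids)
        t = codec_ids.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        h = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(h)  # (B, T_codec, n_codec) logits over codec tokens

@torch.no_grad()
def synthesize(model, text_ids, prompt_codec, max_len=50):
    """Greedy zero-shot decoding: the acoustic prompt's codec tokens fix the voice."""
    model.eval()
    tokens = torch.cat([torch.full((1, 1), model.bos), prompt_codec], dim=1)
    for _ in range(max_len):
        logits = model(text_ids, tokens)
        nxt = logits[:, -1].argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens[:, 1:]  # codec tokens, to be turned into audio by the codec decoder

if __name__ == "__main__":
    model = CodecLM()
    text = torch.randint(0, 256, (1, 20))
    prompt = torch.randint(0, 1024, (1, 30))   # codec tokens of a short enrolment clip
    out = synthesize(model, text, prompt)
    print(out.shape)  # torch.Size([1, 80])
```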
Chinese Prosody Prediction Dataset
Dataset used in the paper on automatic prosody prediction for Chinese speech synthesis with a BLSTM-RNN and embedding features. -
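For context on the modelling approach this entry refers to, here is a minimal sketch of a bidirectional LSTM over embedding features that predicts per-token prosodic-boundary labels; the vocabulary size, hidden width, and four-class label set are assumptions, not values taken from the dataset or paper.

```python
# Sketch of a BLSTM prosody-boundary predictor over embedding features.
# Vocabulary size, hidden size, and the 4-class label set (no boundary,
# prosodic word, prosodic phrase, intonational phrase) are assumptions.
import torch
import torch.nn as nn

class ProsodyBLSTM(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=128, hidden=128, n_labels=4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.blstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_labels)

    def forward(self, token_ids):
        h, _ = self.blstm(self.emb(token_ids))   # (B, T, 2*hidden)
        return self.out(h)                        # per-token prosody-label logits

if __name__ == "__main__":
    model = ProsodyBLSTM()
    tokens = torch.randint(0, 5000, (2, 30))      # a batch of 2 tokenized sentences
    labels = torch.randint(0, 4, (2, 30))
    logits = model(tokens)
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, 4), labels.reshape(-1))
    print(logits.shape, float(loss))
```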
Aozorabunko dataset
Aozorabunko dataset used for pre-training the PnG BERT model. -
Wikipedia2 and Aozorabunko datasets
Wikipedia2 and Aozorabunko datasets used for pre-training the PnG BERT model. -
Diffusion Models for Minimally-Supervised Speech Synthesis
A minimally-supervised speech synthesis method based on diffusion models. Introduces the CTAP method as an intermediate semantic representation and uses... -
Speech Corpus
A speech corpus of size 7,000 used to train and validate the FCI module. -
Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias
Scaling text-to-speech to large, in-the-wild datasets has proven highly effective in achieving timbre and speech-style generalization, particularly in zero-shot TTS.... -
Development of HMM-based Indonesian speech synthesis
Development of HMM-based Indonesian speech synthesis. -
TIMIT Corpus
The TIMIT corpus is a database of read American English speech from 630 speakers across eight dialect regions, widely used for speech recognition, speaker recognition, and speech synthesis research. -
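TIMIT distributes time-aligned phonetic transcriptions as .PHN files, one `start end phone` triple per line, with start and end given as sample indices at 16 kHz. A small reader sketch follows; the file path is a placeholder for a local copy of the corpus.

```python
# Sketch: read TIMIT phone alignments from a .PHN file.
# Each line is "start_sample end_sample phone"; audio is sampled at 16 kHz.
# The example path is a placeholder for a local TIMIT copy.
from pathlib import Path

def read_phn(path):
    """Return a list of (start_sec, end_sec, phone) tuples."""
    segments = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        start, end, phone = line.split()
        segments.append((int(start) / 16000.0, int(end) / 16000.0, phone))
    return segments

if __name__ == "__main__":
    for start, end, phone in read_phn("TIMIT/TRAIN/DR1/FCJF0/SA1.PHN"):
        print(f"{start:.3f}-{end:.3f}s  {phone}")
```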
Internal Dataset
The internal dataset contains 6 million real-world driving scenarios from Las Vegas (LV), Seattle (SEA), San Francisco (SF), and the campus of the Stanford Linear Accelerator... -
Corpus and voices for Catalan speech synthesis
Corpus and voices for Catalan speech synthesis. -
Voice Bank Corpus
The Voice Bank Corpus is a speech database covering regional accents, containing over 10 hours of speech from 20 speakers. -
Streamwise StyleMelGAN vocoder for wideband speech coding at very low bit rate
A GAN vocoder that generates wideband speech waveforms from parameters coded at 1.6 kbit/s. -
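To make that concrete, below is a minimal sketch of a GAN-vocoder-style generator that upsamples frame-rate coded parameters to a waveform with transposed convolutions; the channel widths and upsampling factors are assumptions and do not reproduce the Streamwise StyleMelGAN architecture.

```python
# Sketch of a GAN-vocoder-style generator: low-rate coded parameter frames in,
# waveform samples out. Upsampling factors and channel widths are assumptions
# and do not reproduce the Streamwise StyleMelGAN design.
import torch
import torch.nn as nn

class UpsamplingGenerator(nn.Module):
    def __init__(self, n_params=20, channels=128, factors=(8, 8, 4)):
        super().__init__()
        layers = [nn.Conv1d(n_params, channels, kernel_size=7, padding=3)]
        ch = channels
        for f in factors:  # each stage raises the temporal rate by `f`
            layers += [
                nn.LeakyReLU(0.2),
                nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * f, stride=f, padding=f // 2),
            ]
            ch //= 2
        layers += [nn.LeakyReLU(0.2), nn.Conv1d(ch, 1, kernel_size=7, padding=3), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, params):
        # params: (B, n_params, T_frames) -> waveform (B, 1, T_frames * prod(factors))
        return self.net(params)

if __name__ == "__main__":
    gen = UpsamplingGenerator()
    frames = torch.randn(1, 20, 50)   # 50 frames of decoded low-bit-rate parameters
    wav = gen(frames)
    print(wav.shape)                  # torch.Size([1, 1, 12800])
```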
A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognitio...
The Kazakh speech corpus (KSC) contains around 332 hours of transcribed audio comprising over 153,000 utterances spoken by participants from different regions and age groups, as...