Dataset - LDM

Multilingual Context-Based Pronunciation Learning for Text-to-Speech

Multilingual pronunciation learning for text-to-speech (TTS) systems. Phonetic information and linguistic knowledge are essential components of a TTS front-end.
- Dataset
- JSON
Experiments with multilingual and language-specific pre-trained masked langua...

The datasets used in the experiments are annotated according to the Unimorph schema guidelines.
- Dataset
- JSON
MuST-C v1.0

MuST-C v1.0 is a multilingual corpus for end-to-end speech translation, containing 8 language pairs.
- Dataset
- JSON
mC4

Parameter-efficient fine-tuning (PEFT) using labeled task data can significantly improve the performance of large language models (LLMs) on the downstream task. However, there...
- Dataset
- JSON
OPUS-100

The dataset used in the paper is a subset of the OPUS-MT dataset, containing 1M randomly sampled examples from the OPUS-100 dataset.
- Dataset
- JSON
Europarl-ST

Europarl-ST is a multilingual speech corpus that contains transcriptions of parliamentary debates in multiple languages.
- Dataset
- JSON
MTG: A Benchmark Suite for Multilingual Text Generation

MTG is a multilingual multiway text generation benchmark suite. It is the first-proposed multilingual multiway text generation dataset with the largest human-annotated data...
- Dataset
- JSON
PMIndiaSum

The PMIndiaSum dataset contains multilingual and cross-lingual headline summarization data for languages of India.
- Dataset
- JSON
Wikipedia as multilingual source of comparable corpora

Wikipedia as multilingual source of comparable corpora.
- Dataset
- JSON
M4

The M4 dataset consists of human-written texts from several data sources, including Wikipedia, Reddit, and arXiv in the English subset of the dataset. It pairs the human-written...
- Dataset
- JSON
CLIPfa

The CLIPfa dataset is a multilingual image-text dataset.
- Dataset
- JSON
SemEval-2023 Task 1: Visual Word Sense Disambiguation

The SemEval-2023 Visual Word Sense Disambiguation (V-WSD) Task dataset consists of a silver dataset with 12,869 V-WSD instances. Each sample is a 4-tuple ⟨f, c, I, i∗ ∈ I⟩ where...
- Dataset
- JSON
PAN Profiling Fake News Spreader Task

The PAN Profiling Fake News Spreader Task contains a dataset in English, whose samples were collected from Twitter.
- Dataset
- JSON
PAN Profiling Hate Speech Spreader Task

The PAN Profiling Hate Speech Spreader Task contains a dataset in English and Spanish, whose samples were collected from Twitter.
- Dataset
- JSON
BABEL-Pashto

The BABEL-Pashto dataset is a multilingual speech recognition dataset containing Pashto speech recordings.
- Dataset
- JSON
A Multilingual African Embedding for FAQ Chatbots

A multilingual African embedding for FAQ chatbots.
- Dataset
- JSON
Yandex

Multilingual neural machine translation datasets.
- Dataset
- JSON
OSCAR corpus

The dataset used in this study is the OSCAR corpus, a multilingual corpus obtained by filtering the Common Crawl corpus.
- Dataset
- JSON
SemEval-2024 task 1: Semantic textual relatedness for African and Asian langu...

A collection of semantic textual relatedness datasets for African and Asian languages.
- Dataset
- JSON
SemRel2024

A collection of semantic textual relatedness datasets for 14 languages.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

49 datasets found