Dataset - LDM

Multilingual CommonsenseQA

Multilingual CommonsenseQA (mCSQA) is a dataset for evaluating the common sense reasoning capabilities of multilingual LMs.
- Dataset
- JSON
MGSM

The MGSM dataset is a multilingual math reasoning dataset containing around 7,500 training samples and 1,319 testing samples.
- Dataset
- JSON
mCoT-MATH

The mCoT-MATH dataset is a large-scale multilingual math CoT reasoning dataset containing around 6.3 million samples in eleven diverse languages.
- Dataset
- JSON
mT5

A multilingual version of the seq2seq architecture trained on the Colossal Clean Crawled Corpus.
- Dataset
- JSON
ConceptNet 5.5

The ConceptNet 5.5 dataset is an open multilingual graph of general knowledge.
- Dataset
- JSON
mC4

Parameter-efficient fine-tuning (PEFT) using labeled task data can significantly improve the performance of large language models (LLMs) on the downstream task. However, there...
- Dataset
- JSON
Historical texts for spelling normalization

A dataset of historical texts in eight languages, used for historical spelling normalization.
- Dataset
- JSON
XArgMining dataset

XArgMining, a multilingual stance detection dataset from the IBM Debater project, contains human-authored data points for stance detection in English, as well as such data points...
- Dataset
- JSON
BELEBELE Benchmark

A multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants.
- Dataset
- JSON
BabelSememe

The BabelSememe dataset is a multilingual sememe knowledge base built on BabelNet, containing over 15 thousand BabelNet synsets manually annotated with sememes.
- Dataset
- JSON
Multilingual dataset

A high-quality dataset in English and 12 other languages, augmented with rhyme schema at the paragraph level.
- Dataset
- JSON
A dataset and baselines for multilingual reply suggestion

A dataset and baselines for multilingual reply suggestion.
- Dataset
- JSON
xDial-Eval

A multilingual open-domain dialogue evaluation benchmark featuring 14,930 annotated turns and 8,691 dialogues in 10 languages.
- Dataset
- JSON
Multilingual Offensive Language Identification Dataset (OLID)

A multilingual offensive language identification dataset for social media, containing posts in Arabic, Danish, English, Greek, and Turkish.
- Dataset
- JSON
Multilingual Eye-movement Corpus (MECO)

The Multilingual Eye-movement Corpus (MECO) is a collection of eye-tracking data from participants reading texts in 13 languages.
- Dataset
- JSON
DiS-ReX: A Multilingual Dataset for Distantly Supervised Relation Extraction

DiS-ReX: A multilingual dataset for distantly supervised relation extraction.
- Dataset
- JSON
MultiTACRED: A Multilingual Version of the TAC Relation Extraction Dataset

Relation extraction (RE) is a fundamental task in information extraction, whose extension to multilingual settings has been hindered by the lack of supervised resources...
- Dataset
- JSON
MuLiMa

MuLiMa is a multilingual dictionary of mathematics curated by mathematicians.
- Dataset
- JSON
Xl-sum: Large-scale multilingual abstractive summarization

The Xl-sum dataset for multilingual abstractive summarization.
- Dataset
- JSON
Cross-Lingual Ability of Multilingual BERT

The Cross-Lingual Ability of Multilingual BERT dataset.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).
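As a rough illustration of API access, the sketch below builds a search request and summarizes a response. It assumes the registry exposes a CKAN-style `package_search` action, which this page does not confirm. The base URL is a hypothetical placeholder, and `build_search_url` and `summarize` are illustrative helper names rather than part of any documented client.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the page does not state the registry's address.
BASE_URL = "https://registry.example.org"


def build_search_url(query, rows=20, start=0, base_url=BASE_URL):
    """Build a search URL, assuming a CKAN-style package_search endpoint."""
    params = urlencode({"q": query, "rows": rows, "start": start})
    return f"{base_url}/api/3/action/package_search?{params}"


def summarize(response):
    """Return (total_count, [(title, [formats])]) from a parsed JSON response.

    Assumes the CKAN response shape: {"result": {"count": ..., "results": [...]}}.
    """
    result = response["result"]
    entries = []
    for pkg in result["results"]:
        # Collect the distinct, non-empty resource formats (e.g. "JSON").
        formats = sorted(
            {r["format"] for r in pkg.get("resources", []) if r.get("format")}
        )
        entries.append((pkg.get("title") or pkg.get("name", ""), formats))
    return result["count"], entries
```

To run a live query, the URL from `build_search_url("multilingual")` could be fetched with `urllib.request.urlopen`, decoded with `json.load`, and passed to `summarize`. Check the registry's API Docs for the actual endpoint and parameters first.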

42 datasets found