Dataset - LDM

MCV-10

This work showcases a cost-effective method for generating training data for speech processing tasks. The dataset MCV-10 is a multilingual dataset that contains 50 hours of...
- Dataset
- JSON
TransMuCoRes

Translated dataset for Multilingual Coreference Resolution (TransMuCoRes) in 31 South Asian languages.
- Dataset
- JSON
DBP15K ZH-EN, DBP15K JA-EN, and DBP15K FR-EN datasets

The DBP15K ZH-EN, DBP15K JA-EN, and DBP15K FR-EN datasets are used for cross-lingual entity alignment. The datasets contain entities, relations, and attributes, and are used to...
- Dataset
- JSON
Multilingual Text Classification Dataset

A multilingual text classification dataset covering 17 languages.
- Dataset
- JSON
MARC

The MARC dataset is a multilingual text classification dataset that covers 6 languages.
- Dataset
- JSON
MEANTIME

MEANTIME, the NewsReader Multilingual Event and Time Corpus.
- Dataset
- JSON
Robust Multilingual Named Entity Recognition with Shallow Semi-Supervised Fea...

Multilingual Named Entity Recognition approach based on a robust and general set of features across languages and datasets.
- Dataset
- JSON
UD Dataset v2.7

The UD Dataset v2.7 is a multilingual dataset for dependency parsing.
- Dataset
- JSON
WikiANN

The WikiANN dataset is a multilingual dataset for named entity recognition.
- Dataset
- JSON
Parallel Meaning Bank

A semantically annotated parallel corpus for English, German, Italian, and Dutch where sentences are aligned with scoped meaning representations in order to capture the...
- Dataset
- JSON
CLEF 2003

The dataset used for the experiments in the paper.
- Dataset
- JSON
PaLI

PaLI is a large-scale multilingual language-image model.
- Dataset
- JSON
WikiMatrix

The WikiMatrix dataset is a multilingual dataset that contains parallel texts between English and other languages.
- Dataset
- JSON
DBpedia

DBpedia is a public knowledge graph which is derived from structured information in Wikipedia, mainly infoboxes.
- Dataset
- JSON
MMLU

The dataset is used for instruction-tuning of LLMs in multiple languages using reinforcement learning from human feedback.
- Dataset
- JSON
HellaSwag

The dataset is used for instruction-tuning of LLMs in multiple languages using reinforcement learning from human feedback.
- Dataset
- JSON
ARC

The dataset is used for instruction-tuning of LLMs in multiple languages using reinforcement learning from human feedback.
- Dataset
- JSON
BABEL

The BABEL dataset is a multilingual speech recognition dataset containing over 1,000 hours of speech from 6 languages.
- Dataset
- JSON
Very Deep Multilingual Convolutional Neural Networks for LVCSR

Convolutional neural networks (CNNs) are a standard component of many current state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems. However, CNNs in...
- Dataset
- JSON
XNLI

The XNLI dataset comprises pairs of sentences with a label categorizing the semantic relationship between the two sentences into one of three classifications: entailment,...
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

49 datasets found