Dataset - LDM

No language left behind: Scaling human-centered machine translation

The dataset is used to train and evaluate multilingual language models.
- Dataset
- JSON
Sumerian Cuneiform Dataset

A dataset for the study of Sumerian cuneiform, including part-of-speech tagging, named entity recognition, and machine translation.
- Dataset
- JSON
AfriSenti

AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages
- Dataset
- JSON
AfriSenti-SemEval-2023 Task 12

AfriSenti-SemEval-2023 Task 12: Multilingual fine-tuning for sentiment classification in low-resource languages
- Dataset
- JSON
UK-PODS-ALIGN

This work showcases a cost-effective method for generating training data for speech processing tasks. UK-PODS-ALIGN is a dataset that features modern conversational...
- Dataset
- JSON
UK-PODS

This work showcases a cost-effective method for generating training data for speech processing tasks. The UK-PODS dataset features modern conversational Ukrainian.
- Dataset
- JSON
Ligurian Monolingual Corpus

The first open-source monolingual corpus for Ligurian.
- Dataset
- JSON
Normalized Ligurian Corpus

A dataset of 4,394 Ligurian sentences in different spelling systems paired with normalized versions.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

8 datasets found