- No language left behind: Scaling human-centered machine translation
The dataset is used to train and evaluate multilingual language models.
- Sumerian Cuneiform Dataset
A dataset for the study of Sumerian cuneiform, including part-of-speech tagging, named entity recognition, and machine translation.
- AfriSenti-SemEval-2023 Task 12
AfriSenti-SemEval-2023 Task 12: Multilingual fine-tuning for sentiment classification in low-resource languages
- UK-PODS-ALIGN
This work showcases a cost-effective method for generating training data for speech processing tasks. UK-PODS-ALIGN is a dataset that features modern conversational...
- Ligurian Monolingual Corpus
The first open-source monolingual corpus for Ligurian.
- Normalized Ligurian Corpus
A dataset of 4,394 Ligurian sentences in different spelling systems paired with normalized versions.