-
Umsuka English-isiZulu Parallel Corpus
The Umsuka English-isiZulu Parallel Corpus provides a novel, high-quality parallel dataset for machine translation, containing English sentences sampled from both News Crawl...
-
MADAR dataset
The MADAR dataset is a parallel corpus for low-resource languages.
-
Sumerian Cuneiform Dataset
A dataset for the study of Sumerian cuneiform, covering part-of-speech tagging, named entity recognition, and machine translation.
-
UK-PODS-ALIGN
This work showcases a cost-effective method for generating training data for speech processing tasks. UK-PODS-ALIGN is a dataset that features modern conversational...
-
Ligurian Monolingual Corpus
The first open-source monolingual corpus for Ligurian.
-
Normalized Ligurian Corpus
A dataset of 4,394 Ligurian sentences in different spelling systems paired with normalized versions.
-
BABEL dataset
The dataset used in this paper is the BABEL dataset, which contains 10,881 motion sequences with 65,926 subsequences and their corresponding textual labels.