Low-Resource Languages - Groups

Umsuka English-isiZulu Parallel Corpus

The Umsuka English-isiZulu Parallel Corpus provides a novel, high-quality parallel dataset for machine translation, containing English sentences sampled from both News Crawl...

Dataset
JSON

MADAR dataset

The MADAR dataset is a parallel corpus of Arabic dialects.

Dataset
JSON

Sumerian Cuneiform Dataset

A dataset for the study of Sumerian cuneiform, covering part-of-speech tagging, named entity recognition, and machine translation.

Dataset
JSON

AfriSenti

A Twitter sentiment analysis benchmark for African languages.

Dataset
JSON

UK-PODS-ALIGN

This work showcases a cost-effective method for generating training data for speech processing tasks. The UK-PODS-ALIGN dataset features modern conversational...

Dataset
JSON

UK-PODS

This work showcases a cost-effective method for generating training data for speech processing tasks. The UK-PODS dataset features modern conversational Ukrainian.

Dataset
JSON

Ligurian Monolingual Corpus

The first open-source monolingual corpus for Ligurian.

Dataset
JSON

Normalized Ligurian Corpus

A dataset of 4,394 Ligurian sentences in different spelling systems paired with normalized versions.

Dataset
JSON

BABEL dataset

The BABEL dataset contains 10,881 motion sequences, with 65,926 subsequences and their corresponding textual labels.

Dataset
JSON

9 datasets found