9 datasets found

Formats: JSON

  • Umsuka English-isiZulu Parallel Corpus

    The Umsuka English-isiZulu Parallel Corpus provides a novel, high-quality parallel dataset for machine translation, containing English sentences sampled from both News Crawl...
  • MADAR dataset

    The MADAR dataset is a parallel corpus for low-resource languages.
  • Sumerian Cuneiform Dataset

    A dataset for the study of Sumerian cuneiform, covering part-of-speech tagging, named entity recognition, and machine translation.
  • AfriSenti

    AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages
  • UK-PODS-ALIGN

    This work showcases a cost-effective method for generating training data for speech processing tasks. The UK-PODS-ALIGN dataset features modern conversational...
  • UK-PODS

    This work showcases a cost-effective method for generating training data for speech processing tasks. The UK-PODS dataset features modern conversational Ukrainian.
  • Ligurian Monolingual Corpus

    The first open-source monolingual corpus for Ligurian.
  • Normalized Ligurian Corpus

    A dataset of 4,394 Ligurian sentences in different spelling systems paired with normalized versions.
  • BABEL dataset

    The BABEL dataset contains 10,881 motion sequences with 65,926 subsequences and their corresponding textual labels.