Dataset - LDM

Umsuka English-isiZulu Parallel Corpus

The Umsuka English-isiZulu Parallel Corpus provides a novel, high-quality parallel dataset for machine translation, containing English sentences sampled from both News Crawl...
- Dataset
- JSON

MADAR dataset

The MADAR dataset is a parallel corpus of low-resource Arabic dialects.
- Dataset
- JSON

MASSIVE

The MASSIVE dataset is a comprehensive collection of approximately one million annotated utterances for various natural language understanding tasks such as slot-filling, intent...
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

3 datasets found