Dataset - LDM

English-Hindi Parallel Corpus

An English-Hindi parallel corpus used for training and testing machine translation systems.
- Dataset
- JSON

OpenSubtitles2018

This dataset is used to evaluate the performance of context-aware machine translation systems. It consists of English-Russian subtitles with varying levels of context.
- Dataset
- JSON

IWSLT17

The IWSLT17 dataset is a multilingual parallel corpus covering five languages.
- Dataset
- JSON

PC32

PC32 is a multilingual parallel corpus covering 32 English-centric language pairs.
- Dataset
- JSON

WikiMatrix

WikiMatrix is a multilingual dataset of parallel texts between English and other languages.
- Dataset
- JSON

United Nations Parallel Corpus

High-quality human translations from books, leveraging the inductive bias that high-quality human translations are superior to machine-generated translations.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

6 datasets found