- Clinical Notes Dataset
  A large corpus of clinical notes from the clinical data warehouse of a local hospital in France, used to learn word embeddings that improve model performance.
- Generated Training Dataset for Biomedical NLU
  The dataset consists of user utterances for querying Electronic Health Records (EHRs) in the biomedical domain, generated using templates and augmented with paraphrases. A total...
- HASOC 2019 Dataset
  The HASOC dataset includes abusive language data in Hindi, English, and German, designed for identifying hate speech and offensive content, with several classification subtasks.
- British National Corpus
  The British National Corpus (BNC) contains spontaneous conversations from UK English speakers collected with portable tape recorders in the early 1990s, featuring significant...
- Europarl v7 Corpus
  The Europarl v7 corpus is a parallel corpus of European Parliament proceedings, used for training a Transformer-based NMT system for German-to-English translation.
- Wikipedia Sequence Dataset
  The dataset contains 992 sequences extracted from Wikipedia, each consisting of two consecutive paragraphs in the form: [CLS] paragraph1 [SEP] paragraph2...
- DailyDialog Dataset
  The DailyDialog dataset consists of dialogues from daily communication and serves as a benchmark for dialog response generation tasks.
- NLU-Benchmark
  The NLU-Benchmark dataset is annotated with scenarios, actions, and entities for various home assistant tasks. It contains 25,716 utterances categorized into 64 intents and 54...
- GIGA-CM Dataset
  GIGA-CM is a large-scale dataset comprising millions of documents, created to facilitate the pre-training of hierarchical document encoding models for summarization tasks.
- New York Times Dataset
  The New York Times dataset is used for summarization tasks and consists of articles from the New York Times, with summaries created by editors, enabling the assessment of...
- WMT18 English-Turkish Translation Dataset
  The WMT18 dataset is used for English-Turkish translation and serves as a standard low-resource dataset.
- WAT English-Japanese Translation Dataset
  The WAT dataset contains English-Japanese sentence pairs for translation tasks, focusing particularly on low-resource scenarios.
- WMT17 English-German Translation Dataset
  The WMT17 dataset serves as a benchmark for translation tasks involving English and German, containing a large set of parallel sentences.
- WMT16 English-German Translation Dataset
  The WMT16 dataset is used for English-German translation and includes a number of parallel sentences for robust machine translation tasks.
- News Commentary v11 (NC11)
  The NC11 dataset contains translations for low-resource English↔German tasks and is used to demonstrate improvements in machine translation in resource-limited scenarios.