-
French Wikipedia
French Wikipedia corpus -
Asian Scientific Paper Excerpt Corpus (ASPEC)
Asian Scientific Paper Excerpt Corpus (ASPEC) -
Swiss SMS corpus
Swiss SMS corpus dataset -
Penn Discourse Treebank 2.0
The Penn Discourse Treebank 2.0 (PDTB 2.0) is a large-scale corpus containing 2,312 Wall Street Journal (WSJ) articles. -
MS MARCO V1 corpus
MS MARCO V1 corpus -
Speech Corpus
A speech corpus of size 7,000 used for training and validation of the FCI module. -
OSCAR corpus
The dataset used in this study is the OSCAR corpus, a multilingual corpus obtained by filtering the Common Crawl corpus. -
Gutenberg Corpus
A dataset of 2,857 books written by 141 authors, used for pre-training and fine-tuning a language model for author-stylized text generation. -
Brown Corpus
The Brown corpus is an out-of-domain dataset. -
Parallel Meaning Bank
A semantically annotated parallel corpus for English, German, Italian, and Dutch where sentences are aligned with scoped meaning representations in order to capture the... -
Temple University Hospital EEG Seizure Corpus
The Temple University Hospital EEG Seizure corpus (TUSZ) v1.5.2 is a publicly available seizure corpus with 5,612 EDF files and 3,050 annotated seizures from clinical... -
JSUT corpus
The dataset is a large vocabulary Japanese accent dictionary built using the proposed technique. -
Switchboard Corpus
The Switchboard corpus is a dataset of recorded two-sided telephone conversations between speakers of American English. -
Corpus of Regional African American Language (CORAAL)
This dataset comprises more than 150 sociolinguistic interviews with African-American English speakers born between 1891 and 2005. -
Voice Bank speech corpus
A selection of ten British English speakers, both male and female, from the Voice Bank speech corpus, each of whom has around 400 clean... -
Offensive Hebrew Corpus
A new offensive language corpus in Hebrew, manually annotated with a label, targets, topics, and offensive phrases.