-
Grammarly Argument Quality Corpus (GAQCorpus)
A large, domain-diverse annotated corpus of theory-based argument quality assessment. -
Presto: A Multilingual Dataset for Task-Oriented Dialogue Parsing
A multilingual dataset for task-oriented dialogue parsing. -
Diabla: A Corpus of Bilingual Spontaneous Written Dialogues
A corpus of bilingual spontaneous written dialogues for machine translation. -
DISCO: A Large Scale Human Annotated Corpus for Disfluency Correction
A large-scale human-annotated corpus for disfluency correction in four Indo-European languages: English, Hindi, German, and French. -
Semantic Scholar Open Research Corpus
The Semantic Scholar Open Research Corpus contains metadata for 46,947,044 published research papers in Computer Science, Neuroscience, and Biomedicine from 1936 to 2019. -
ROC-Stories: A Corpus for Evaluating Story Generation Models
ROC-Stories: A Corpus for Evaluating Story Generation Models -
PropBank.Br
The PropBank.Br corpus is a corpus of Brazilian Portuguese texts annotated with semantic roles. -
French Wikipedia
French Wikipedia corpus -
Asian Scientific Paper Excerpt Corpus (ASPEC)
Asian Scientific Paper Excerpt Corpus (ASPEC) -
Swiss SMS corpus
Swiss SMS corpus dataset -
Penn Discourse Treebank 2.0
The Penn Discourse Treebank 2.0 (PDTB 2.0) is a large-scale corpus containing 2,312 Wall Street Journal (WSJ) articles. -
MS MARCO V1 corpus
MS MARCO V1 corpus -
Speech Corpus
A speech corpus of size 7,000 used for training and validation of the FCI module. -
OSCAR corpus
The dataset used in this study is the OSCAR corpus, a multilingual corpus obtained by filtering the Common Crawl corpus. -
Gutenberg Corpus
A dataset of 2,857 books written by 141 authors, used for pre-training and fine-tuning a language model for author-stylized text generation.