- Causal-TimeBank (CausalTB)
CausalTB comprises 2,470 sentences, of which 244 are identified as causal. It was created using causal signal and causal link tags, focusing on extracting causal sentences from...
- SemEval-2010 (Task 8)
The SemEval-2010 dataset contains 10,674 samples, of which 1,325 are causal sentences annotated with a pair of entities and the type of their relationship, focused on multi-way...
- eSCAPE-NMT
The eSCAPE-NMT dataset is a large-scale synthetic corpus designed for training and fine-tuning Automatic Post-Editing (APE) models.
- WMT19 English-German APE Dataset
The WMT19 English-German APE dataset consists of training and development sets used for Automatic Post-Editing (APE) tasks.
- Write & Improve (W&I)+LOCNESS corpus
The W&I+LOCNESS corpus combines essays from the Write & Improve platform with the LOCNESS corpus, annotated for grammatical error correction (GEC).
- National University of Singapore Corpus of Learner English (NUCLE)
The NUCLE corpus is a collection of essays written by learners of English that includes grammatical annotations for various error types.
- Lang-8 learner corpus
The Lang-8 learner corpus consists of sentences written by learners of English, providing a rich source for analyzing grammatical errors.
- First Certificate in English (FCE) corpus
The FCE corpus is used for grammatical error correction tasks, containing sentences written by learners along with annotations for erroneous structures.
- Coronary Arteriography Reports
The dataset consists of coronary arteriography reports collected from Shuguang Hospital, including five types of entities and five relations relevant to medical text processing.
- Stanford Natural Language Inference Corpus (SNLI)
The Stanford Natural Language Inference (SNLI) corpus is used for natural language inference tasks.
- Stanford Sentiment Treebank (SST-5)
The SST-5 dataset is a sentiment analysis dataset consisting of movie reviews with five labels for sentiment classification.
- WNUT16 NER
WNUT16 is a shared task dataset for named entity recognition over Twitter, consisting of annotated tweets used for identifying named entities in informal digital text.
- CoNLL 2003 NER dataset
The CoNLL 2003 shared task dataset is used for named entity recognition.
- CoNLL 2000 chunking dataset
The CoNLL 2000 shared task dataset is used for chunking tasks in natural language processing.
- Universal Dependencies v. 1.3
This dataset contains English part-of-speech tags derived from the first 500 sentences of the Universal Dependencies corpus; the training set is deliberately reduced to increase difficulty.
- ACE Entities/Events
The ACE 2005 dataset consists of annotated documents for event and entity detection, with a focus on various domains including newswire and blogs.
- IMDB Movie Reviews
The IMDB dataset consists of 54,000 movie reviews intended as a background corpus for evaluating spell correction models, containing a larger vocabulary for robust word recognition.
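
The Universal Dependencies entry above describes a part-of-speech training set built from the first 500 sentences of the corpus. The following is a minimal sketch of that kind of truncation, assuming the treebank is available in CoNLL-U format (tab-separated columns, with a blank line between sentences). The names `read_conllu`, `first_n_sentences`, and `SAMPLE` are illustrative and do not come from the original work, and the exact preprocessing used there may differ.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]  # (word form, universal POS tag) pairs


def read_conllu(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield one sentence at a time from CoNLL-U formatted lines."""
    sentence: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment line
        cols = line.split("\t")
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword-token ranges and empty nodes
        sentence.append((cols[1], cols[3]))  # columns 2 and 4: form, universal POS
    if sentence:
        yield sentence  # last sentence when the input lacks a trailing blank line


def first_n_sentences(lines: Iterable[str], n: int = 500) -> List[Sentence]:
    """Keep only the first n sentences (n=500 matches the description above)."""
    return list(islice(read_conllu(lines), n))


# Tiny hypothetical sample, not real treebank data.
SAMPLE = (
    "# text = Dogs bark .\n"
    "1\tDogs\tdog\tNOUN\tNNS\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# text = Cats sleep .\n"
    "1\tCats\tcat\tNOUN\tNNS\t_\t2\tnsubj\t_\t_\n"
    "2\tsleep\tsleep\tVERB\tVBP\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
)

subset = first_n_sentences(SAMPLE.splitlines(), n=1)
print(subset)
```

Reading lazily and stopping with `islice` means the rest of the treebank file never has to be parsed once the first `n` sentences have been collected.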