No Organization - Organizations

Causal-TimeBank (CausalTB)

CausalTB comprises 2,470 sentences, of which 244 are identified as causal. It was created using causal signal and causal link tags, focusing on extracting causal sentences from...

Dataset
JSON

SemEval-2010 (Task 8)

The SemEval-2010 dataset contains 10,674 samples, of which 1,325 are causal sentences annotated with a pair of entities and the type of their relationship, focused on multi-way...

Dataset
JSON

eSCAPE-NMT

The eSCAPE-NMT dataset is a large-scale synthetic corpus designed for training and fine-tuning Automatic Post-Editing (APE) models.

Dataset
JSON

WMT19 English-German APE Dataset

The WMT19 English-German APE dataset consists of training and development sets used for Automatic Post-Editing (APE) tasks.

Dataset
JSON

Write & Improve (W&I)+LOCNESS corpus

The W&I+LOCNESS corpus combines data from the Write & Improve platform with the LOCNESS corpus, providing essays annotated for grammatical error correction (GEC).

Dataset
JSON

National University of Singapore Corpus of Learner English (NUCLE)

The NUCLE corpus is a collection of essays written by learners of English, annotated with grammatical errors of various types.

Dataset
JSON

Lang-8 learner corpus

The Lang-8 learner corpus consists of sentences written by learners of English, providing a rich source for analyzing grammatical errors.

Dataset
JSON

First Certificate in English (FCE) corpus

The FCE corpus is used for grammatical error correction tasks and contains sentences written by learners of English, annotated for erroneous structures.

Dataset
JSON

Coronary Arteriography Reports

The dataset consists of coronary arteriography reports collected from Shuguang Hospital, including five types of entities and five relations relevant to medical text processing.

Dataset
JSON

Stanford Natural Language Inference Corpus (SNLI)

The Stanford Natural Language Inference (SNLI) corpus is used for natural language inference tasks.

Dataset
JSON

Stanford Sentiment Treebank (SST-5)

SST-5 is a sentiment analysis dataset of movie reviews labeled with five sentiment classes.

Dataset
JSON

WNUT16 NER

WNUT16 is a shared task dataset for named entity recognition over Twitter, consisting of annotated tweets used for identifying named entities in informal digital text.

Dataset
JSON

GENIA NER

The GENIA NER dataset consists of annotated Medline abstracts that contain information on biological entities such as proteins and genes, used for named entity recognition in...

Dataset
JSON

CoNLL 2003 NER dataset

The CoNLL 2003 shared task dataset is used for named entity recognition.

Dataset
JSON

CoNLL 2000 chunking dataset

The CoNLL 2000 shared task dataset is used for chunking tasks in natural language processing.

Dataset
JSON

Universal Dependencies v. 1.3

This dataset contains part-of-speech tags for English from the Universal Dependencies corpus. The training set is reduced to the first 500 sentences to increase difficulty.

Dataset
JSON

ACE Entities/Events

The ACE 2005 dataset consists of documents annotated for event and entity detection, covering various domains including newswire and blogs.

Dataset
JSON

MSRA

The MSRA dataset comes from the news domain and is widely used for Chinese named entity recognition.

Dataset
JSON

Weibo

Weibo NER was built from Chinese social media text and contains various types of named entities.

Dataset
JSON

IMDB Movie Reviews

The IMDB dataset consists of 54,000 movie reviews intended as a background corpus for evaluating spell correction models; it provides a larger vocabulary for robust word recognition.

Dataset
JSON

24,167 datasets found