- Corpus of Annotated Novels
The dataset comprises 13 full-text novels tagged with protagonistTagger, containing more than 35,000 mentions of literary characters.
- Protagonists' Tagger in Literary Domain
The dataset comprises 1,300 sentences from 13 classic novels of different genres, manually annotated by a novel reader.
- FIGER dataset
The FIGER dataset contains 2M data samples labeled with 113 types.
- OntoNotes dataset
The OntoNotes dataset contains 3.4M automatically labeled entity mentions for training and 11k manually annotated instances, split into 8k for the dev set and 2k for the test set.
- Recipe1M+ Dataset
The Recipe1M+ dataset is a large collection of culinary recipes labeled in respective categories with extended named entities extracted from recipe descriptions.
- Assorted, Archetypal, and Annotated Two Million (3A2M) Cooking Recipe Dataset
The 3A2M dataset is a large collection of culinary recipes labeled in respective categories with extended named entities extracted from recipe descriptions.
- Assorted, Archetypal, and Annotated Two Million Extended (3A2M+) Cooking Reci...
The 3A2M+ dataset is a large collection of culinary recipes labeled in respective categories with extended named entities extracted from recipe descriptions.
- SMM4H18_Test
The dataset consists of tweets posted by 212 Twitter users during and after their pregnancy.
- SMM4H18_Val
The dataset consists of tweets posted by 212 Twitter users during and after their pregnancy.
- SMM4H18_Train
The dataset consists of tweets posted by 212 Twitter users during and after their pregnancy.
- BioCreative_TrainTask3.1
The dataset consists of tweets posted by 212 Twitter users during and after their pregnancy.
- BioCreative_ValTask3
The dataset consists of tweets posted by 212 Twitter users during and after their pregnancy.
- BioCreative_TrainTask3.0
The dataset consists of all tweets posted by 212 Twitter users during and after their pregnancy.
- CMID, KUAKE-QIC, Intent-Merged
Biomedical intent detection and named entity recognition datasets.
- JNLPBA, DDI, BC5CDR, NCBI-Disease, AnatEM
Biomedical intent detection and named entity recognition datasets.
- The Pile dataset
The Pile dataset is a large-scale dataset containing 800GB of text data.
- LM-Extraction benchmark
The LM-Extraction benchmark, derived from The Pile dataset (Gao et al., 2020), contains 15,000 pairs of prefixes and suffixes derived from The Pile dataset (Gao et al.,...
- DSTC-FRAMES-ENHI
DSTC-FRAMES-ENHI is an extended dataset containing a total of 37,785 samples and 7 entities with 1,106 unique entity values (with IOB prefixes).
- DSTC-FRAMES-EN
A combined dataset formed from two public English task-oriented conversational datasets, from the travel and restaurant domains respectively.
- CoNLL 2003 dataset
The CoNLL 2003 dataset is a collection of news-wire articles used for sequence labeling tasks.
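
Several entries above describe sequence-labeling data with IOB-prefixed entity tags. As a rough illustration of what such tags encode, the following minimal sketch converts a per-token tag sequence into entity spans. The function name `iob_to_spans`, the sample sentence, and the `PER`/`LOC` labels are hypothetical and chosen only for illustration; they are not taken from any of the listed datasets.

```python
from typing import List, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert IOB tags to (label, start, end) spans; end is exclusive.

    Assumption: a stray "I-X" with no preceding "B-X" is leniently
    treated as the start of a new span.
    """
    spans: List[Tuple[str, int, int]] = []
    start = None
    label = None
    for i, tag in enumerate(tags):
        # The open span continues only on an "I-" tag with the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        # Open a new span on "B-", or on an "I-" when no span is open.
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical example sentence, not drawn from any dataset above.
tokens = ["John", "Smith", "flew", "to", "New", "York"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
spans = iob_to_spans(tags)
entities = [(label, " ".join(tokens[s:e])) for label, s, e in spans]
```

Here `entities` pairs each label with the surface text of its span, which is the form most downstream evaluation scripts compare against gold annotations.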