Dataset - LDM

NLPbench

NLPbench is used to evaluate large language models on solving NLP problems.
- Dataset
- JSON
Dysmenorrhea Dataset

The authors used their own dataset for dysmenorrhea classification.
- Dataset
- JSON
N2C2 Smoking Challenge

The authors used the N2C2 smoking challenge dataset for the smoking status classification task.
- Dataset
- JSON
N2C2 Obesity Challenge

Clinical note classification is a common clinical NLP task. However, annotated datasets are scarce. The authors used the N2C2 obesity challenge dataset, the N2C2 smoking...
- Dataset
- JSON
ANERcorp

ANERcorp is a named entity recognition dataset.
- Dataset
- JSON
MasakhaNER 2.0

MasakhaNER 2.0 is a NER dataset in the news domain, with annotations for 20 African languages.
- Dataset
- JSON
Sanskrit Text Annotation

The Sanskrit text is annotated for various NLP tasks, including sentence boundary detection, canonical word ordering, free-form text annotation of tokens, token classification,...
- Dataset
- JSON
Super-NaturalInstructions (SNI) dataset

The Super-NaturalInstructions (SNI) dataset is a collection of 1761 diverse NLP tasks, each belonging to one of 76 task types.
- Dataset
- JSON
Towards Improving Selective Prediction Ability of NLP Systems

SNLI, MNLI, Stress Test, Matched Mismatched, Competence, Distraction, and Noise datasets
- Dataset
- JSON
Universal Dependencies (UD) treebanks

The paper does not explicitly name its dataset, but it states that the authors used the Universal Dependencies (UD) treebanks.
- Dataset
- JSON
MACHAMP

MACHAMP is a toolkit for multi-task learning in NLP, supporting a wide range of NLP tasks.
- Dataset
- JSON
Data Management Operations and Recipes

Dataset management operations and recipes for NLP data production.
- Dataset
- JSON
A Workflow Manager for Complex NLP and Content Curation Pipelines

A workflow manager for the flexible creation and customisation of NLP processing pipelines.
- Dataset
- JSON
MatSci-NLP

The MatSci-NLP dataset is a collection of materials science text for NLP tasks.
- Dataset
- JSON
Towards Dark Jargon Interpretation in Underground Forums

Dark jargons are benign-looking words that have hidden, sinister meanings and are used by participants of underground forums for illicit behavior.
- Dataset
- JSON
ACL Anthology

The ACL Anthology dataset contains papers on natural language processing, including citation patterns, authorship, and language use over time.
- Dataset
- JSON
Cross-lingual semantic representation for NLP with UCCA

The UCCA dataset is used to test the annotation scheme in cross-lingual semantic representation for NLP.
- Dataset
- JSON
Multilingual Misinformation & Its Evolution

The dataset used in this study is a combination of data from Google Fact-Check explorer and data directly crawled from the websites of verified signatories of the International...
- Dataset
- JSON
TEL-NLP

The TEL-NLP dataset is a collection of Telugu text data for four NLP tasks: sentiment analysis, emotion identification, hate speech detection, and sarcasm detection.
- Dataset
- JSON
GLUE benchmark

The paper does not explicitly describe its dataset, but it states that the authors used three downstream tasks from the GLUE benchmark: Stanford Sentiment Treebank...
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

23 datasets found