Dataset - LDM

BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction...

The dataset used in the paper to evaluate the effectiveness of the BEEAR method in mitigating safety backdoors in instruction-tuned LLMs.
- Dataset
- JSON
PEOPLEMAP

PEOPLEMAP is an open-source interactive web-based tool that uses natural language processing (NLP) to create visual maps for researchers based on their research interests and...
- Dataset
- JSON
Existing ACQ datasets

A few existing datasets for asking clarification questions
- Dataset
- JSON
FLM-HotpotQA

A dataset for pragmatic evaluation of clarifying questions and fact-level masking
- Dataset
- JSON
SVAMP

The SVAMP dataset contains natural language math problems from various sources, including textbooks and online resources.
- Dataset
- JSON
CSQA

The CSQA dataset is a widely used benchmark for conversational KBQA, consisting of around 200K dialogues, where the training, validation, and testing sets contain 153K,...
- Dataset
- JSON
NLPeer dataset

A unified resource for the computational study of peer review.
- Dataset
- JSON
ASAP AEG dataset

The ASAP AEG dataset contains approximately 13,000 essays across 8 essay sets.
- Dataset
- JSON
Racist and sexist hate speech detection: Literature review

A review of studies on the detection of racist and sexist hate speech.
- Dataset
- JSON
YOSM: A new Yorùbá Sentiment Corpus for Movie Reviews

A dataset for sentiment analysis of Yoruba movie reviews.
- Dataset
- JSON
SemEval-2023 Task 10: Explainable Detection of Online Sexism

The dataset used for the SemEval-2023 Task 10: Explainable Detection of Online Sexism (EDOS) task, a shared task on offensive language (sexism) detection on English Gab and...
- Dataset
- JSON
ANTHROSCORE: A Computational Linguistic Measure of Anthropomorphism

Anthropomorphism in research papers and downstream news headlines
- Dataset
- JSON
Patent corpus

A dataset of over 100,000 patent documents from the Cooperative Patent Classification scheme (CPC) category A61.
- Dataset
- JSON
Mitigating Backdoor Poisoning Attacks through the Lens of Spurious Correlation

Modern NLP models are often trained over large untrusted datasets, raising the potential for a malicious adversary to compromise model behaviour.
- Dataset
- JSON
FIPO Dataset

The dataset used for Free-form Instruction-oriented Prompt Optimization (FIPO) with Preference Dataset and Modular Fine-tuning Schema.
- Dataset
- JSON
Identifying machine-paraphrased plagiarism

This dataset is used to identify machine-generated paraphrased plagiarism.
- Dataset
- JSON
Dialogue Dataset for Detecting Sentences that Do Not Require Factual Correctn...

A dialogue dataset annotated with a fact-check-needed label (DDFC) for detecting sentences that do not require factual correctness judgment
- Dataset
- JSON
Scaling laws and fluctuations in the statistics of word frequencies

The dataset consists of three large databases: Google-ngram, English Wikipedia, and a collection of scientific articles.
- Dataset
- JSON
Penn Treebank corpus

The Penn Treebank corpus contains 49,208 sentences and over 1 million words, and is used to test the proposed algorithm on a real-world dataset.
- Dataset
- JSON
Wall Street Journal (WSJ) dataset

The Wall Street Journal (WSJ) dataset is a standard benchmark dataset for coherence modeling.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

219 datasets found