Dataset - LDM

PET dataset

The PET dataset contains 45 documents annotated with information that is especially useful for creating process models in BPMN.
- Dataset
- JSON
Mitigating Backdoor Poisoning Attacks through the Lens of Spurious Correlation

Modern NLP models are often trained over large untrusted datasets, raising the potential for a malicious adversary to compromise model behaviour.
- Dataset
- JSON
FIPO Dataset

The dataset used for Free-form Instruction-oriented Prompt Optimization (FIPO) with Preference Dataset and Modular Fine-tuning Schema.
- Dataset
- JSON
Identifying machine-paraphrased plagiarism

This dataset is used to identify machine-generated paraphrased plagiarism.
- Dataset
- JSON
Dialogue Dataset for Detecting Sentences that Do Not Require Factual Correctn...

A dialogue dataset annotated with a fact-check-needed label (DDFC) for detecting sentences that do not require factual correctness judgment.
- Dataset
- JSON
NarrativeQA

The NarrativeQA dataset is a reading comprehension challenge that focuses on questions with a single entity and relation.
- Dataset
- JSON
Qasper

A dataset of information-seeking questions and answers anchored in research papers.
- Dataset
- JSON
AttenWalker

Unsupervised long-document question answering via attention-based graph walking
- Dataset
- JSON
Exponential Family Embeddings

Word embeddings are a powerful approach for capturing semantic similarity among terms in a vocabulary. In this paper, we develop exponential family embeddings, a class of...
- Dataset
- JSON
Reinforcement Learning from Human Feedback with Active Queries

Aligning large language models (LLM) with human preference plays a key role in building modern generative models and can be achieved by reinforcement learning from human...
- Dataset
- JSON
AMR Parsing using Stack-LSTMs

AMR parsing using Stack-LSTMs
- Dataset
- JSON
Interval Probabilistic Fuzzy Synsets for WordNet

Interval Probabilistic Fuzzy synsets for WordNet
- Dataset
- JSON
MNLI subsets

The MNLI subsets dataset contains subsets of the MNLI dataset, with some features being spurious.
- Dataset
- JSON
Scaling laws and fluctuations in the statistics of word frequencies

The dataset consists of three large databases: Google-ngram, English Wikipedia, and a collection of scientific articles.
- Dataset
- JSON
Towards Improving Selective Prediction Ability of NLP Systems

SNLI, MNLI, Stress Test, Matched Mismatched, Competence, Distraction, and Noise datasets
- Dataset
- JSON
SafetyPrompts

The dataset is used in the paper to test safety issues of large language models (LLMs).
- Dataset
- JSON
Penn Treebank corpus

The Penn Treebank corpus contains 49,208 sentences and over 1 million words, and is used to test the proposed algorithm on a real-world dataset.
- Dataset
- JSON
Synthesis Step by Step (S3)

Data Synthesis is a promising way to train a small model with very little labeled data. One approach for data synthesis is to leverage the rich knowledge from large language...
- Dataset
- JSON
Integer or floating point? new outlooks for low-bit quantization on large lan...

The paper does not explicitly describe the dataset it uses; it mentions only that it is a large language model dataset.
- Dataset
- JSON
A comprehensive study on post-training quantization for large language models

The ZeroQuant dataset is a large language model dataset used in the paper.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

530 datasets found