PET dataset
The PET dataset contains 45 documents annotated with information useful for creating BPMN process models.
Mitigating Backdoor Poisoning Attacks through the Lens of Spurious Correlation
Modern NLP models are often trained over large untrusted datasets, raising the potential for a malicious adversary to compromise model behaviour.
FIPO Dataset
The preference dataset used for Free-form Instruction-oriented Prompt Optimization (FIPO) with a modular fine-tuning schema.
Identifying machine-paraphrased plagiarism
This dataset is used to identify machine-paraphrased plagiarism.
Dialogue Dataset for Detecting Sentences that Do Not Require Factual Correctness Judgment
A dialogue dataset annotated with fact-check-needed labels (DDFC) for detecting sentences that do not require a factual correctness judgment.
NarrativeQA
The NarrativeQA dataset is a reading comprehension challenge over books and movie scripts, with questions that require understanding the underlying narrative.
AttenWalker
Unsupervised long-document question answering via attention-based graph walking.
Exponential Family Embeddings
Word embeddings are a powerful approach for capturing semantic similarity among terms in a vocabulary. In this paper, we develop exponential family embeddings, a class of...
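The core idea of exponential family embeddings is that each observation is modeled by an exponential family distribution whose natural parameter is an inner product between the item's embedding vector and the context vectors of surrounding items. A minimal toy sketch of this idea, assuming a Poisson observation model and hypothetical variable names (none of these come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K = 5, 3                      # toy vocabulary size, embedding dimension
rho = rng.normal(size=(V, K))    # per-word embedding vectors
alpha = rng.normal(size=(V, K))  # per-word context vectors

def poisson_log_lik(word, context, rho, alpha):
    """Log-likelihood of observing count x=1 for `word`, where the
    natural parameter is eta = rho[word] . mean(alpha[context]).
    Poisson: log p(x=1 | lambda) = eta - lambda (up to the log x! term)."""
    eta = rho[word] @ alpha[context].mean(axis=0)
    lam = np.exp(eta)            # Poisson rate from the natural parameter
    return eta - lam

ll = poisson_log_lik(0, [1, 2], rho, alpha)
```

In a full model, `rho` and `alpha` would be fit by maximizing this log-likelihood summed over all (word, context) positions; the sketch only shows the objective for a single observation.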
Reinforcement Learning from Human Feedback with Active Queries
Aligning large language models (LLMs) with human preference plays a key role in building modern generative models and can be achieved by reinforcement learning from human...
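Active querying in this setting means spending the limited human-labeling budget on the comparisons the reward model is least sure about. The sketch below is not the paper's actual acquisition rule, just a generic uncertainty-sampling heuristic: pick the candidate response pairs whose predicted preference probability is closest to 0.5.

```python
import numpy as np

def select_queries(pref_probs, budget):
    """Pick the `budget` candidate pairs whose predicted preference
    probability is closest to 0.5, i.e. where the reward model is
    most uncertain and a human label is most informative."""
    probs = np.asarray(pref_probs)
    uncertainty = -np.abs(probs - 0.5)           # higher = more uncertain
    return np.argsort(uncertainty)[::-1][:budget].tolist()

# Toy example: four candidate pairs with predicted win probabilities.
picked = select_queries([0.95, 0.52, 0.49, 0.80], budget=2)
# Pairs at indices 2 and 1 (probs 0.49 and 0.52) are the most uncertain.
```

Real systems typically use richer acquisition scores (e.g. ensemble disagreement), but the budget-vs-informativeness trade-off is the same.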
AMR Parsing using Stack-LSTMs
A transition-based AMR parser built on stack-LSTMs.
Interval Probabilistic Fuzzy Synsets for WordNet
Interval probabilistic fuzzy synsets for WordNet.
MNLI subsets
The MNLI subsets dataset contains subsets of the MNLI dataset in which some features are spuriously correlated with the labels.
Scaling laws and fluctuations in the statistics of word frequencies
The dataset consists of three large databases: Google-ngram, English Wikipedia, and a collection of scientific articles.
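The scaling law in question is the classic rank-frequency relation (Zipf's law: frequency roughly proportional to 1/rank). A minimal sketch of the basic measurement, on a toy corpus rather than the databases above:

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, count) pairs sorted by descending frequency.
    Under Zipf's law, count ~ C / rank for some constant C."""
    counts = Counter(tokens)
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))

corpus = "the cat sat on the mat the cat ran and the dog sat".split()
pairs = rank_frequency(corpus)   # most frequent word first
```

On real corpora one would plot these pairs on log-log axes and examine the slope and its fluctuations; this sketch only produces the rank-frequency table itself.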
Towards Improving Selective Prediction Ability of NLP Systems
The SNLI and MNLI datasets (matched and mismatched splits), along with the Stress Test sets covering competence, distraction, and noise.
SafetyPrompts
The dataset used in the paper to test the safety of Large Language Models (LLMs).
Penn Treebank corpus
The Penn Treebank corpus contains 49,208 sentences and over 1 million words, and is used to test the proposed algorithm on a real-world dataset.
Synthesis Step by Step (S3)
Data synthesis is a promising way to train a small model with very little labeled data. One approach for data synthesis is to leverage the rich knowledge from large language...
Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models
The dataset is not explicitly described in the paper, beyond being a large language model dataset.
A Comprehensive Study on Post-Training Quantization for Large Language Models
The ZeroQuant dataset, a large language model dataset, is used in the paper.