- MNLI, QQP, and SST-2
  The dataset used in this paper consists of three tasks: Multi-Genre Natural Language Inference (MNLI), Quora Question Pairs (QQP), and Stanford Sentiment Treebank (SST-2).
- Are Larger Pretrained Language Models Uniformly Better? Comparing Performance...
  Larger language models have higher accuracy on average, but are they better on every single instance (datapoint)?
- Learning to summarize with human feedback
  The paper presents a study on the impact of synthetic data on large language models (LLMs) and proposes a method to steer LLMs towards desirable non-differentiable attributes.
- Reward Model Ensembles
  The authors used three datasets: TL;DR, HELPFULNESS, and XSUM/NLI.
- STAMP 4 NLP
  STAMP 4 NLP is an instantiable, iterative, and incremental process model for developing natural language processing applications with a focus on quality, business value, and...
- Detecting Hallucinated Content in Conditional Neural Sequence Generation
  Neural sequence models can generate highly fluent sentences, but recent studies have also shown that they are prone to hallucinate additional content not supported by the...
- A general theoretical paradigm to understand learning from human preferences
  The paper proposes a novel approach to aligning language models with human preferences, focusing on the use of preference optimization in reward-free RLHF.
- Llama: Open and efficient foundation language models
  The paper introduces LLaMA, a collection of open and efficient foundation language models trained exclusively on publicly available data.
- Mixtral of Experts
  The dataset used in the paper for the instruction-following task.
- Toward an Architecture for Never-ending Language Learning
  The paper proposes an architecture for a never-ending language learner (NELL) that continuously extracts structured facts from the web to grow its knowledge base.
- MISMATCH: Fine-grained Evaluation of Machine-generated Text
  The dataset used in the paper enables fine-grained evaluation of machine-generated text with mismatch error types.
- BERT: Pre-training of deep bidirectional transformers for language understanding
  This paper proposes BERT, a pre-trained deep bidirectional transformer for language understanding.
- Training Dataset
  The training dataset is a collection of publicly available Arabic corpora: the unshuffled OSCAR corpus (Ortiz Suárez et al., 2020), the Arabic Wikipedia dump...
- Orca: Progressive Learning from Complex Explanation Traces
  The Orca approach leverages explanation tuning to generate detailed responses from a large language model.
- Evol-Instruct: A Pipeline for Automatically Evolving Instruction Datasets
  The Evol-Instruct pipeline uses large language models to automatically evolve seed instructions into more complex and diverse instruction datasets.
- Various Datasets
  The paper uses the following datasets: WikiMIA, BookMIA, Temporal Wiki, Temporal arXiv, ArXiv-1 month, Multi-Webdata, LAION-MI, and Gutenberg.