Gemma: Open models based on Gemini research and technology
Gemma is a family of lightweight, open models built from the research and technology behind Google's Gemini models.
Llama 2: Open foundation and fine-tuned chat models
Llama 2 is a collection of pretrained and fine-tuned large language models, including the Llama 2-Chat models optimized for dialogue use cases.
Buffer of Thoughts
Buffer of Thoughts is a versatile thought-augmented reasoning approach for enhancing the accuracy, efficiency, and robustness of large language models (LLMs).
Reducing Retraining by Recycling Parameter-Efficient Prompts
Parameter-efficient methods are able to use a single frozen pre-trained large language model to perform many tasks by learning task-specific soft prompts that modulate model...
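The soft-prompt idea above can be sketched in a few lines: only a small block of prompt embeddings is trainable, and it is prepended to the frozen token embeddings before the model runs. This is a minimal illustration with made-up dimensions, not the paper's implementation.

```python
import numpy as np

EMBED_DIM = 8    # illustrative embedding size
PROMPT_LEN = 4   # number of learned soft-prompt vectors

rng = np.random.default_rng(0)

# Task-specific soft prompt: the only trainable parameters.
soft_prompt = rng.normal(size=(PROMPT_LEN, EMBED_DIM))

# Frozen token-embedding table of the pre-trained model (stand-in values).
vocab_embeddings = rng.normal(size=(100, EMBED_DIM))

def build_model_input(token_ids, vocab_embeddings, prompt):
    """Prepend the learned soft prompt to the frozen token embeddings."""
    token_embeds = vocab_embeddings[token_ids]
    return np.concatenate([prompt, token_embeds], axis=0)

inputs = build_model_input(np.array([5, 17, 42]), vocab_embeddings, soft_prompt)
print(inputs.shape)  # (7, 8): PROMPT_LEN + 3 input tokens
```

Because the backbone stays frozen, switching tasks only means swapping in a different `soft_prompt` matrix, which is what makes prompt recycling across models attractive.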
TruthfulQA
TruthfulQA is a benchmark of 817 questions designed to measure whether language models reproduce common human falsehoods.
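A TruthfulQA-style evaluation loop can be sketched as follows. The example question, reference answers, and the `model_answer` stub are illustrative; the real benchmark scores free-form answers with learned metrics rather than the toy exact-match used here.

```python
# Toy TruthfulQA-style scoring: an answer counts as truthful if it matches
# a reference true answer rather than a common misconception.
questions = [
    {
        "question": "What happens if you crack your knuckles a lot?",
        "true_answers": {"nothing in particular happens"},
        "false_answers": {"you get arthritis"},
    },
]

def model_answer(question: str) -> str:
    # Stand-in for a real model call.
    return "Nothing in particular happens"

def truthful_rate(questions) -> float:
    """Fraction of questions answered with a reference true answer."""
    truthful = 0
    for q in questions:
        if model_answer(q["question"]).lower() in q["true_answers"]:
            truthful += 1
    return truthful / len(questions)

print(truthful_rate(questions))  # 1.0 for this single toy question
```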
Evaluating large language models trained on code
This paper introduces Codex, a GPT language model fine-tuned on publicly available code, and evaluates its ability to generate Python programs.
Confidence Calibration in Large Language Models
This dataset was used to analyze the self-assessment behavior of large language models.
Proof-Pile-2
Proof-Pile-2 is a corpus of mathematical text and code used for continual pre-training of large language models, with a focus on balancing the text distribution and mitigating overfitting.
Hate Speech Detection using Large Language Models
Datasets used for probing LLMs for hate speech detection, including the HateXplain, Implicit Hate, and ToxicSpans datasets.
TruthX: Alleviating Hallucinations by Editing Large Language Models
TruthX alleviates hallucinations by editing the internal representations of large language models in a learned truthful latent space.
Orca: Progressive Learning from Complex Explanation Traces
The Orca approach uses explanation tuning: the model learns from detailed explanation traces generated by a stronger large language model, imitating its step-by-step reasoning rather than just its final answers.
Evol-Instruct: A Pipeline for Automatically Evolving Instruction Datasets
The Evol-Instruct pipeline uses a large language model to iteratively rewrite seed instructions into progressively more complex and diverse variants.
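One evolution step of such a pipeline can be sketched as below. The rewriting templates are paraphrased illustrations, not the exact prompts from the paper, and `llm` is a stand-in for a real model call.

```python
import random

# Illustrative rewriting prompts (paraphrased, not the paper's templates).
EVOLVE_TEMPLATES = [
    "Rewrite the following instruction to add one extra constraint:\n{inst}",
    "Rewrite the following instruction to require multi-step reasoning:\n{inst}",
    "Rewrite the following instruction to be more specific:\n{inst}",
]

def llm(prompt: str) -> str:
    # Stand-in for a real model call; tags the instruction for demonstration.
    return "[evolved] " + prompt.splitlines()[-1]

def evolve(instruction: str, rounds: int = 2, seed: int = 0) -> list:
    """Iteratively evolve a seed instruction into harder variants."""
    rng = random.Random(seed)
    generations = [instruction]
    for _ in range(rounds):
        template = rng.choice(EVOLVE_TEMPLATES)
        instruction = llm(template.format(inst=instruction))
        generations.append(instruction)
    return generations

out = evolve("Write a poem about the sea.")
print(len(out))  # 3: the seed plus two evolved variants
```

In the full pipeline, a filtering step would discard degenerate rewrites before the evolved instructions are used for fine-tuning.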