Natural Language Processing - Groups

Llama: Open and efficient foundation language models

The dataset used in the LLaMA paper on open and efficient foundation language models.

Dataset
JSON

Proof-Pile-2

The dataset used for continual pre-training of large language models, with a focus on balancing the text distribution and mitigating overfitting.

Dataset
JSON

Open-Orca

The dataset used for training large language models, with a focus on balancing the text distribution and mitigating overfitting.

Dataset
JSON

PipeTransformer: Automated Elastic Pipelining for Distributed Training of Tra...

The datasets used in this paper are ImageNet, SQuAD, and GLUE.

Dataset
JSON

SNOiC: Soft Labeling and Noisy Mixup based Open Intent Classification Model

This paper presents a Soft Labeling and Noisy Mixup-based open intent classification model (SNOiC). Most of the previous works have used threshold-based methods to identify open...

Dataset
JSON

Using Large Language Models to Simulate Multiple Humans

The dataset used in the paper to simulate human behavior in various experiments, including the Ultimatum Game, Garden Path Sentences, Milgram Shock Experiment, and Wisdom of...

Dataset
JSON

Self-StrAE at SemEval-2024 Task 1: Making Self-Structuring AutoEncoders Learn...

Self-StrAE is a model that processes a given sentence to generate both multi-level embeddings and a structure over the input.

Dataset
JSON

GPTFuzzer

This dataset is used to evaluate the performance of the judgement model.

Dataset
JSON

IWSLT-14 DE-EN

The dataset used in this paper is IWSLT-14 DE-EN, a German-English machine translation dataset.

Dataset
JSON

Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sour...

Learning language-conditioned robot behavior from offline data and crowd-sourced annotation.

Dataset
JSON

Wikipedia Corpus

The dataset used in the paper is a subset of the Wikipedia corpus, consisting of 7500 English Wikipedia articles belonging to one of the following categories: People, Cities,...

Dataset
JSON

A general language assistant as a laboratory for alignment

A general language assistant for aligning language models with human users.

Dataset
JSON

Realtoxicityprompts: Evaluating neural toxic degeneration in language models

A dataset for evaluating neural toxic degeneration in language models.

Dataset
JSON

Alignment of language agents

A dataset for aligning language agents.

Dataset
JSON

Text2Pos

Text2Pos addresses city-scale position localization based on textual descriptions. Given a point cloud that represents our surroundings and a query position description, Text2Pos...

Dataset
JSON

SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection

Open-vocabulary object detection (OvOD) has transformed detection into a language-guided task, empowering users to freely define their class vocabularies of interest during...

Dataset
JSON

WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese...

WanJuan: A comprehensive multimodal dataset for advancing English and Chinese large models.

Dataset
JSON

Wikipedia dataset

The dataset used in the paper is the Wikipedia dataset, which contains over six million English Wikipedia articles with a full-text field associated with 50 training queries...

Dataset
JSON

Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture

Dataset
JSON

Essay-BR

A large corpus of essays written by Brazilian high school students that were graded by experts following the evaluation criteria of the ENEM exam.

Dataset
JSON

530 datasets found