-
Llama: Open and efficient foundation language models
The LLaMA dataset is a large language model dataset used in the paper. -
Proof-Pile-2
The dataset used for continual pre-training of large language models, with a focus on balancing the text distribution and mitigating overfitting. -
PipeTransformer: Automated Elastic Pipelining for Distributed Training of Tra...
The datasets used in this paper are ImageNet, SQuAD, and GLUE. -
SNOiC: Soft Labeling and Noisy Mixup based Open Intent Classification Model
This paper presents a Soft Labeling and Noisy Mixup-based open intent classification model (SNOiC). Most of the previous works have used threshold-based methods to identify open... -
Using Large Language Models to Simulate Multiple Humans
The dataset used in the paper to simulate human behavior in various experiments, including the Ultimatum Game, Garden Path Sentences, Milgram Shock Experiment, and Wisdom of... -
Self-StrAE at SemEval-2024 Task 1: Making Self-Structuring AutoEncoders Learn...
Self-StrAE is a model that processes a given sentence to generate both multi-level embeddings and a structure over the input. -
IWSLT-14 DE-EN
The dataset used in this paper is IWSLT-14 DE-EN, a German-English machine translation dataset. -
Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sour...
Learning language-conditioned robot behavior from offline data and crowd-sourced annotation. -
Wikipedia Corpus
The dataset used in the paper is a subset of the Wikipedia corpus, consisting of 7500 English Wikipedia articles belonging to one of the following categories: People, Cities,... -
A general language assistant as a laboratory for alignment
A general language assistant for aligning language models with human users -
Realtoxicityprompts: Evaluating neural toxic degeneration in language models
A dataset for evaluating neural toxic degeneration in language models -
Alignment of language agents
A dataset for aligning language agents -
SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection
Open-vocabulary object detection (OvOD) has transformed detection into a language-guided task, empowering users to freely define their class vocabularies of interest during... -
WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese...
WanJuan: A comprehensive multimodal dataset for advancing English and Chinese large models. -
Wikipedia dataset
The dataset used in the paper is the Wikipedia dataset, which contains over six million English Wikipedia articles with a full-text field associated with 50 training queries... -
Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture
Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture