Dataset - LDM

Orca: Progressive Learning from Complex Explanation Traces

The Orca approach involves leveraging explanation tuning to generate detailed responses from a large language model.
- Dataset
- JSON
Evol-Instruct: A Pipeline for Automatically Evolving Instruction Datasets

The Evol-Instruct pipeline involves automatically evolving instruction datasets using large language models.
- Dataset
- JSON
LaMini: A Large-Scale Instruction Dataset

The LaMini approach involves generating a large-scale instruction dataset by leveraging the outputs of a large language model, gpt-3.5-turbo.
- Dataset
- JSON
Various Datasets

The datasets used in the paper are WikiMIA, BookMIA, Temporal Wiki, Temporal arXiv, ArXiv-1 month, Multi-Webdata, LAION-MI, and Gutenberg.
- Dataset
- JSON
CCNet

CCNet is the dataset used in the paper to train the Toolformer model.
- Dataset
- JSON
IMDB

The dataset used in the paper is not explicitly described, but it is mentioned that the authors tested the proposed method on three real data sets for the most relevant security...
- Dataset
- JSON
Question Classification using Convolutional Neural Networks

Question classification using Convolutional Neural Networks
- Dataset
- JSON
Penn Treebank dataset

The dataset used in the paper is the Penn Treebank dataset, a corpus of English text annotated with part-of-speech tags and syntactic parse trees.
- Dataset
- JSON
Keyphrase generation with fine-grained evaluation-guided reinforcement learning

A dataset for keyphrase generation with fine-grained evaluation-guided reinforcement learning.
- Dataset
- JSON
Unified language model pre-training for natural language understanding and ge...

Unified pre-training of a language model for natural language understanding and generation.
- Dataset
- JSON
Neural keyphrase generation via reinforcement learning with adaptive rewards

A dataset for neural keyphrase generation.
- Dataset
- JSON
Select, extract and generate: Neural keyphrase generation with layer-wise cov...

A dataset for neural keyphrase generation with layer-wise coverage attention.
- Dataset
- JSON
KPEVAL: Towards Fine-Grained Semantic-Based Keyphrase Evaluation

A comprehensive evaluation framework for keyphrase systems, including reference agreement, faithfulness, diversity, and utility.
- Dataset
- JSON
DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training

We propose DisCo-CLIP, a distributed memory-efficient CLIP training approach, to reduce the memory consumption of contrastive loss when training contrastive learning models.
- Dataset
- JSON
Customer Service Calls Dataset

A dataset consisting of ten years of customer service calls to a fleet truck company.
- Dataset
- JSON
Ubuntu Dialogue Corpus

The Ubuntu Dialogue Corpus is the largest freely available multi-turn based dialogue corpus which consists of almost one million two-way conversations extracted from the Ubuntu...
- Dataset
- JSON
Visual Genome

The Visual Genome dataset is a large-scale, densely annotated image dataset containing roughly 108,000 images, each annotated with objects, attributes, relationships, region descriptions, and question-answer pairs.
- Dataset
- JSON
CLIP

The CLIP model and its variants are becoming the de facto backbone in many applications. However, training a CLIP model from hundreds of millions of image-text pairs can be...
- Dataset
- JSON
GLUE

Pre-trained language models (PrLM) have to carefully manage input units when training on a very large text with a vocabulary consisting of millions of words. Previous works have...
- Dataset
- JSON
Interpreting Learned Feedback Patterns in Large Language Models

The dataset used in the paper is not explicitly described, but it is mentioned that the authors used a condensed representation of LLM activations obtained from sparse...
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

530 datasets found
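
The API access mentioned above could look roughly like the following sketch. It assumes the registry exposes a CKAN-style `package_search` action, which this page does not confirm. The base URL `https://registry.example.org` and the helper names `build_search_url`, `page_offsets`, `extract_titles`, and `fetch_titles` are hypothetical placeholders for illustration only. Consult the API Docs for the real endpoint and response format.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical base URL: replace with the registry's real address.
BASE_URL = "https://registry.example.org"


def build_search_url(base_url, query="", rows=20, start=0):
    """Build a search URL, assuming a CKAN-style package_search action."""
    params = urlencode({"q": query, "rows": rows, "start": start})
    return f"{base_url}/api/3/action/package_search?{params}"


def page_offsets(total, rows=20):
    """Return the start offsets needed to page through `total` results."""
    return list(range(0, total, rows))


def extract_titles(payload):
    """Pull dataset titles out of a CKAN-style response dictionary."""
    if not payload.get("success"):
        raise ValueError("API reported failure")
    return [item.get("title", "") for item in payload["result"]["results"]]


def fetch_titles(base_url, query="", rows=20, start=0):
    """Fetch one page of results over HTTP and return the dataset titles."""
    with urlopen(build_search_url(base_url, query, rows, start)) as resp:
        return extract_titles(json.load(resp))
```

If the listing is served 20 entries at a time, as this page appears to be, iterating over `page_offsets(530)` and calling `fetch_titles(BASE_URL, start=offset)` for each offset would walk through all 530 datasets.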