- CodeSearchNet: The dataset used in the paper is CodeSearchNet, a natural language code search benchmark covering six programming languages (Python, Java, JavaScript, Ruby, PHP, and Go).
- EmpatheticDialogues: A text dataset for training empathetic AI chatbots, consisting of 25k conversations grounded in emotional situations and annotated with emotion labels.
- BookCorpus: The dataset used in this paper for unsupervised sentence representation learning; a large corpus of unlabeled text drawn from books.
- PatentEval Dataset: A comprehensive dataset for evaluating patent text generation.
- Big Patent Dataset: A large-scale dataset for abstractive and coherent summarization of patent documents.
- Harvard USPTO Patent Dataset: A large-scale, well-structured, and multi-purpose corpus of patent applications.
- Training Dataset: A collection of publicly available Arabic corpora, including the unshuffled OSCAR corpus (Ortiz Suárez et al., 2020) and the Arabic Wikipedia dump...
- RPC-Lex: A dictionary to measure German right-wing populist conspiracy discourse online.
- A Benchmark Dataset for Learning to Intervene in Online Hate Speech: A benchmark dataset for training models to intervene in online hate speech.
- Orca: An approach for progressive learning from complex explanation traces, using explanation tuning to elicit detailed, step-by-step responses from a large language model as training data.
- Evol-Instruct: A pipeline that automatically evolves instruction datasets by having a large language model iteratively rewrite seed instructions into more complex and more diverse variants.
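The evolution loop behind such a pipeline can be sketched as follows. This is a minimal illustration, not the paper's exact recipe: the operation prompts, the `call_llm` stub, and the hyperparameters are all assumptions made for demonstration.

```python
import random

# Evolution operations loosely modeled on Evol-Instruct's "in-depth" and
# "in-breadth" evolving; the prompt texts here are illustrative assumptions.
IN_DEPTH_OPS = [
    "Add one explicit constraint or requirement to this instruction:",
    "Ask for a deeper, multi-step version of this instruction:",
    "Rewrite this instruction to require reasoning about a concrete input:",
]
IN_BREADTH_OP = "Write a new instruction on a related but different topic:"


def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; it just tags the instruction so the
    control flow can be demonstrated without an API."""
    return f"[evolved] {prompt.splitlines()[-1]}"


def evolve(seed_instructions, rounds=2, breadth_prob=0.25, rng=None):
    """Run several rounds of evolution over a pool of instructions."""
    rng = rng or random.Random(0)
    pool = list(seed_instructions)
    for _ in range(rounds):
        new_pool = []
        for inst in pool:
            # Pick an in-breadth operation occasionally, otherwise in-depth.
            op = IN_BREADTH_OP if rng.random() < breadth_prob else rng.choice(IN_DEPTH_OPS)
            evolved = call_llm(f"{op}\n{inst}")
            # Elimination step: fall back to the original if the rewrite
            # did not actually change the instruction.
            new_pool.append(evolved if evolved != inst else inst)
        pool = new_pool
    return pool


seeds = ["Explain recursion.", "Sort a list in Python."]
print(evolve(seeds, rounds=2))
```

In a real pipeline `call_llm` would query an instruction-following model, and the elimination step would use stronger filters (e.g. rejecting degenerate or unanswerable rewrites) before the evolved pool is used for fine-tuning.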
- LaMini: A large-scale instruction dataset generated by leveraging the outputs of a large language model, gpt-3.5-turbo.
- Various Datasets: The datasets used in the paper are WikiMIA, BookMIA, Temporal Wiki, Temporal arXiv, ArXiv-1 month, Multi-Webdata, LAION-MI, and Gutenberg.
- Question Classification using Convolutional Neural Networks.
- Penn Treebank dataset: The dataset used in the paper is the Penn Treebank, an annotated corpus of English (largely Wall Street Journal text) widely used as a language modeling and parsing benchmark.
- Keyphrase generation with fine-grained evaluation-guided reinforcement learning: A dataset for keyphrase generation using fine-grained, evaluation-guided reinforcement learning.
- Unified language model pre-training for natural language understanding and generation: A unified language model pre-trained for both natural language understanding and generation tasks.