- CodeSearchNet: The dataset used in the paper is CodeSearchNet, a natural language code search benchmark covering six programming languages (Python, Java, JavaScript, Ruby, PHP, and Go).
- EmpatheticDialogues: A text dataset for training empathetic AI chatbots, consisting of 25k conversations grounded in emotional situations and annotated with emotion labels.
- BookCorpus: The dataset used in this paper for unsupervised sentence representation learning; a large corpus of unlabeled text drawn from books.
- PatentEval Dataset: A comprehensive dataset for evaluating patent text generation.
- Big Patent Dataset: A large-scale dataset for abstractive and coherent summarization of patent documents.
- Harvard USPTO Patent Dataset: A large-scale, well-structured, and multi-purpose corpus of patent applications.
- Training Dataset: A collection of publicly available Arabic corpora, including the unshuffled OSCAR corpus (Ortiz Suárez et al., 2020) and the Arabic Wikipedia dump...
- RPC-Lex: A dictionary to measure German right-wing populist conspiracy discourse online.
- A Benchmark Dataset for Learning to Intervene in Online Hate Speech: A benchmark dataset for training models to intervene in online hate speech.
- Orca: An approach for progressive learning from complex explanation traces, using explanation tuning to elicit detailed, step-by-step responses from a large language model as training data.
- Evol-Instruct: A pipeline that automatically evolves instruction datasets by having a large language model iteratively rewrite seed instructions into more complex and more diverse variants.
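The evolution loop behind such a pipeline can be sketched as follows. This is a minimal illustration, not the paper's exact recipe: the operation prompts, the `call_llm` stub, and the hyperparameters are all assumptions made for demonstration.

```python
import random

# Evolution operations loosely modeled on Evol-Instruct's "in-depth" and
# "in-breadth" evolving; the prompt texts here are illustrative assumptions.
IN_DEPTH_OPS = [
    "Add one explicit constraint or requirement to this instruction:",
    "Ask for a deeper, multi-step version of this instruction:",
    "Rewrite this instruction to require reasoning about a concrete input:",
]
IN_BREADTH_OP = "Write a new instruction on a related but different topic:"


def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; it just tags the instruction so the
    control flow can be demonstrated without an API."""
    return f"[evolved] {prompt.splitlines()[-1]}"


def evolve(seed_instructions, rounds=2, breadth_prob=0.25, rng=None):
    """Run several rounds of evolution over a pool of instructions."""
    rng = rng or random.Random(0)
    pool = list(seed_instructions)
    for _ in range(rounds):
        new_pool = []
        for inst in pool:
            # Pick an in-breadth operation occasionally, otherwise in-depth.
            op = IN_BREADTH_OP if rng.random() < breadth_prob else rng.choice(IN_DEPTH_OPS)
            evolved = call_llm(f"{op}\n{inst}")
            # Elimination step: fall back to the original if the rewrite
            # did not actually change the instruction.
            new_pool.append(evolved if evolved != inst else inst)
        pool = new_pool
    return pool


seeds = ["Explain recursion.", "Sort a list in Python."]
print(evolve(seeds, rounds=2))
```

In a real pipeline `call_llm` would query an instruction-following model, and the elimination step would use stronger filters (e.g. rejecting degenerate or unanswerable rewrites) before the evolved pool is used for fine-tuning.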
- LaMini: A large-scale instruction dataset generated by leveraging the outputs of a large language model, gpt-3.5-turbo.
- Various Datasets: The datasets used in the paper are WikiMIA, BookMIA, Temporal Wiki, Temporal arXiv, ArXiv-1 month, Multi-Webdata, LAION-MI, and Gutenberg.
- Question Classification using Convolutional Neural Networks.
- Penn Treebank dataset: The dataset used in the paper is the Penn Treebank, an annotated corpus of English (largely Wall Street Journal text) widely used as a language modeling and parsing benchmark.
- Keyphrase generation with fine-grained evaluation-guided reinforcement learning: A dataset for keyphrase generation using fine-grained, evaluation-guided reinforcement learning.
- Unified language model pre-training for natural language understanding and generation: A unified language model pre-trained for both natural language understanding and generation tasks.