Dataset - LDM

SlimPajama

The dataset is used to evaluate the performance of the xLSTM architecture on various tasks, including language modeling, question answering, and text classification.
- Dataset
- JSON
20-Newsgroups dataset

The 20-Newsgroups dataset is a collection of text documents.
- Dataset
- JSON
REDDIT-BINARY dataset

The REDDIT-BINARY dataset contains 2,000 graphs, each labeled as a question/answer-based or discussion-based community on the content-aggregation website Reddit.
- Dataset
- JSON
Yahoo

The Yahoo dataset, which contains leaked passwords, is used for training and testing the proposed model.
- Dataset
- JSON
BERT

The dataset used in this paper is a pre-trained BERT model trained on English Wikipedia and Books datasets.
- Dataset
- JSON
Reuters-21578

Text classification has long been an interesting research field; its aim is to develop algorithms that find the categories of given documents.
- Dataset
- JSON
Amazon Review

The Amazon Review dataset is a widely used benchmark dataset for cross-domain sentiment analysis.
- Dataset
- JSON
Text Classification based on Multiple Block Convolutional Highways

Text classification based on Multiple Block Convolutional Highways
- Dataset
- JSON
OpenWebText Corpus

A dataset for language modeling, where the goal is to predict the next word in a sequence given the previous words.
- Dataset
- JSON
SQuAD

The dataset used in the paper is a multiple-choice reading comprehension dataset, which includes a passage, question, and answer. The passage is a script, and the question is a...
- Dataset
- JSON
COPD

The dataset is used in the paper for missing-value imputation with feature-specific generative adversarial networks.
- Dataset
- JSON
Disin dataset

The Disin dataset is a fake news dataset on Kaggle, including 12,600 fake news articles and 12,600 truthful news articles.
- Dataset
- JSON
Natural Questions

The Natural Questions dataset consists of questions extracted from web queries, with each question accompanied by a corresponding Wikipedia article containing the answer.
- Dataset
- JSON
TriviaQA

The TriviaQA dataset is a collection of questions sourced from Quiz League websites, with sentence-level supporting facts annotation.
- Dataset
- JSON
SST-2

The dataset used for the experiments across ten models, ranging from bag-of-words models to pre-trained transformers, and find that a model having higher AUC does not necessarily...
- Dataset
- JSON
Clothing Dataset

The Clothing dataset contains metadata, text descriptions, and images of the clothing items, with the review score as the label.
- Dataset
- JSON
COVID-19 Research Articles Classification

The dataset is used for text classification to support Epistemonikos' effort to filter and categorize research articles related to COVID-19.
- Dataset
- JSON
Stanford Alpaca

The dataset used in the paper is not explicitly described, but it is mentioned that the authors used CIFAR-10 and CIFAR-100 datasets for image classification, and ImageNet-100...
- Dataset
- JSON
AG News

The dataset used in the paper is a language domain dataset, specifically for sentiment classification, named AG News. The dataset is used to evaluate the performance of...
- Dataset
- JSON
AGNews Dataset

The AGNews dataset is a collection of news articles, where each article is labeled with a topic (e.g., politics or sports).
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

110 datasets found