Dataset - LDM

AG News

The dataset used in the paper is AG News, a language-domain dataset for news topic classification. The dataset is used to evaluate the performance of...
- Dataset
- JSON
Jigsaw Dataset

The Jigsaw dataset is a collection of text, where each text is labeled as toxic or non-toxic.
- Dataset
- JSON
Amazon

The dataset used in the paper is a series of datasets introduced in [46], comprising large corpora of product reviews crawled from Amazon.com. Top-level product categories on...
- Dataset
- JSON
Text8

Text8 is a corpus of cleaned English Wikipedia text that is commonly used to train Word2Vec, a distributed word embedding generator that uses an artificial neural network to learn dense vector representations of words.
- Dataset
- JSON
CLINC

An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction
- Dataset
- JSON
BANKING

The BANKING dataset is an intent classification dataset in the banking domain.
- Dataset
- JSON
GoogleNews

The dataset used in this paper is a collection of news articles from Google News.
- Dataset
- JSON
Wiki20K

The dataset used in this paper is a collection of English Wikipedia abstracts from DBpedia.
- Dataset
- JSON
20NewsGroups

The dataset used in this paper is a collection of roughly 20,000 newsgroup posts spread across 20 topical newsgroups.
- Dataset
- JSON
Penn Treebank

The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.
- Dataset
- JSON
Amazon review dataset

The Amazon review dataset is used for multi-source domain adaptation. It contains review texts and ratings of bought products. Products are grouped into categories. Following...
- Dataset
- JSON
BioText dataset

The BioText dataset contains more than 3,500 text samples classified into one of eight classes, which specify the type of semantic relationship between disease and treatment...
- Dataset
- JSON
Rotten Tomatoes Movie Reviews (RT) and IMDB

The dataset used in the paper is not explicitly described, but it is mentioned that the authors used a sentiment analysis task on two public benchmark datasets: Rotten Tomatoes...
- Dataset
- JSON
Book Categories

Two text classification data sets for evaluating the quality of interpretability methods.
- Dataset
- JSON
SNLI

The dataset used in the paper is the Stanford Natural Language Inference (SNLI) dataset, which consists of 549,367 premise-hypothesis pairs for train/dev/test sets and target...
- Dataset
- JSON
Ott dataset

The dataset used in this paper is for deceptive opinion detection.
- Dataset
- JSON
BookCorpus

The dataset used in this paper is for unsupervised sentence representation learning and consists of paragraphs of unlabeled text.
- Dataset
- JSON
Reuters RCV1-v2

The Reuters RCV1-v2 contains 804,414 newswire articles. There are 103 topics which form a tree hierarchy. Thus documents typically have multiple labels. The data was randomly...
- Dataset
- JSON
IMDB

The dataset used in the paper is not explicitly described, but it is mentioned that the authors tested the proposed method on three real data sets for the most relevant security...
- Dataset
- JSON
Penn Treebank dataset

The dataset used in the paper is the Penn Treebank dataset, a corpus of English text annotated with part-of-speech tags and syntactic parse trees.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

103 datasets found