Dataset - LDM

OpenWebTextCorpus

The OpenWebText corpus is a collection of text data from the web.
- Dataset
- JSON
SemEval-2021 Task 6: Detection of Persuasion Techniques in Texts and Images

The dataset for SemEval-2021 Task 6 (Detection of Persuasion Techniques in Texts and Images), used in the paper together with CLIP features.
- Dataset
- JSON
Semantic Textual Similarity

The STS benchmark (Cer et al., 2017) and SICK-Relatedness dataset (Marelli et al., 2014) respectively contain 8.6K and 9.8K labeled sentence pairs, the sizes of which are...
- Dataset
- JSON
Tweet dataset

The dataset used in this paper is a collection of short texts, including tweets, Pascal Flickr captions, and search snippets.
- Dataset
- JSON
The Pile

The Pile dataset contains 3.5 million samples of diverse text for language modeling.
- Dataset
- JSON
Google News Embeddings

The dataset used in the paper is a word2vec embedding trained on a corpus of Google News texts.
- Dataset
- JSON
Twitter Dataset

The Twitter Dataset is a collection of tweets in three languages (English, Dutch, and German), annotated with Plutchik's emotions.
- Dataset
- JSON
PubMed abstracts

The dataset used in this paper is the PubMed abstracts dataset, which contains approximately 11 million abstracts.
- Dataset
- JSON
SRLLM Training Dataset

A dataset of annotated text, used for training and evaluating the Safe and Responsible Large Language Model (SRLLM).
- Dataset
- JSON
News and Social Media Articles Dataset

A dataset of annotated news and social media articles, spanning various aspects and media.
- Dataset
- JSON
Content Moderation Dataset (CMD)

A dataset of social media content containing potentially biased (unsafe) texts, along with unbiased (safe or benign) variations.
- Dataset
- JSON
NeurIPS dataset

The NeurIPS dataset is a collection of 7241 papers published in NeurIPS from 1987 to 2016.
- Dataset
- JSON
Wikipedia Corpus

The dataset used in the paper is a subset of the Wikipedia corpus, consisting of 7500 English Wikipedia articles belonging to one of the following categories: People, Cities,...
- Dataset
- JSON
New York Times and 20Newsgroups datasets

The datasets used in the paper are the New York Times dataset and the 20Newsgroups dataset.
- Dataset
- JSON
20News

Topic modeling has been a widely used tool for unsupervised text analysis. However, comprehensive evaluations of a topic model remain challenging.
- Dataset
- JSON
Textual Sports Commentary Dataset

The textual dataset is a collection of live sports commentaries scraped from various sources, including live score websites and YouTube.
- Dataset
- JSON
20Newsgroups dataset

The 20Newsgroups dataset contains 18,846 newsgroup documents.
- Dataset
- JSON
Japanese Election Manifesto Data

The Japanese election manifesto data contains texts of Japanese election manifestos.
- Dataset
- JSON
Congressional Bills Project

The Congressional Bills Project dataset contains the texts of congressional bills.
- Dataset
- JSON
News

The News dataset consists of 5000 randomly sampled news articles from the NY Times corpus. It simulates the opinions of media consumers on news items. The units are different...
- Dataset
- JSON

You can also access this registry using the API (see API Docs).
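
As a rough illustration of API access, the sketch below assumes the registry exposes a CKAN-style Action API with a `package_search` endpoint. That is an assumption, not something this page confirms. The base URL `https://registry.example.org` and the query string `LDM` are placeholders, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical base URL: replace with the registry's real address.
BASE_URL = "https://registry.example.org"


def build_search_url(base_url, query, rows=20, start=0):
    """Build a package_search URL (assumes a CKAN-style Action API)."""
    params = urlencode({"q": query, "rows": rows, "start": start})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def extract_titles(payload):
    """Return (total_count, titles) from a package_search response dict."""
    if not payload.get("success"):
        raise ValueError("API call did not succeed")
    result = payload["result"]
    titles = [pkg.get("title", pkg.get("name")) for pkg in result["results"]]
    return result["count"], titles


def search(base_url, query, rows=20, start=0):
    """Fetch one page of search results over HTTP and parse it."""
    with urlopen(build_search_url(base_url, query, rows, start)) as resp:
        return extract_titles(json.load(resp))
```

If the assumption holds, a call such as `search(BASE_URL, "LDM", rows=20, start=20)` would request the second page of results, which is how entries beyond the first page of a listing could be retrieved.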

30 datasets found