-
OpenWebTextCorpus
The OpenWebText corpus is an open-source reproduction of OpenAI's WebText dataset, built from web pages linked in Reddit posts. -
SemEval-2021 Task 6: Detection of Persuasion Techniques in Texts and Images
The dataset from SemEval-2021 Task 6 (Detection of Persuasion Techniques in Texts and Images), used in the paper with CLIP features. -
Semantic Textual Similarity
The STS benchmark (Cer et al., 2017) and SICK-Relatedness dataset (Marelli et al., 2014) respectively contain 8.6K and 9.8K labeled sentence pairs, the sizes of which are... -
Tweet dataset
The dataset used in this paper is a collection of short texts, including tweets, Pascal Flickr captions, and search snippets. -
Google News Embeddings
The dataset used in the paper is a word2vec embedding trained on a corpus of Google News texts. -
Twitter Dataset
The Twitter Dataset is a collection of tweets in three languages (English, Dutch, and German), annotated with Plutchik's emotions. -
PubMed abstracts
The dataset used in this paper is the PubMed abstracts dataset, which contains approximately 11 million abstracts. -
SRLLM Training Dataset
A dataset of annotated text, used for training and evaluating the Safety and Responsible Large Language Model (SRLLM). -
News and Social Media Articles Dataset
A dataset of annotated news and social media articles, spanning various aspects and media. -
Content Moderation Dataset (CMD)
A dataset of social media content containing potentially biased (unsafe) texts, along with unbiased (safe or benign) variations. -
NeurIPS dataset
The NeurIPS dataset is a collection of 7,241 papers published in NeurIPS from 1987 to 2016. -
Wikipedia Corpus
The dataset used in the paper is a subset of the Wikipedia corpus, consisting of 7,500 English Wikipedia articles belonging to one of the following categories: People, Cities,... -
New York Times and 20Newsgroups datasets
The datasets used in the paper are the New York Times dataset and the 20Newsgroups dataset. -
Textual Sports Commentary Dataset
The textual dataset is a collection of live sports commentaries scraped from various sources, including live score websites and YouTube. -
20Newsgroups dataset
The 20Newsgroups dataset contains 18,846 newsgroup documents. -
Japanese Election Manifesto Data
The Japanese election manifesto data contains texts of Japanese election manifestos. -
Congressional Bills Project
The Congressional Bills Project dataset contains the texts of congressional bills.