Individual Text Corpora Predict Openness, Interests, Knowledge and Level of E...
Individual text corpora (ICs) were generated from 214 participants, with a mean of 5 million word tokens per corpus.
Twigraph: Discovering and Visualizing Influential Words between Twitter Profiles
The dataset used in the paper is a collection of 1.1M tweets from Twitter, with approximately 3000 tweets per user from various domains such as politics, sports, entertainment,...
Radiology reports dataset
The radiology reports dataset is used to test the proposed Hierarchical Latent Word Clustering algorithm.
NIPS dataset
The NIPS dataset is used to test the proposed Hierarchical Latent Word Clustering algorithm.
Hierarchical Latent Word Clustering
A dataset associated with the proposed Hierarchical Latent Word Clustering algorithm, used for its evaluation.
Opinion Mining and Sentiment Analysis
Opinion mining and sentiment analysis dataset.
GoEmotions
The GoEmotions dataset is a collection of 58k Reddit comments, each labeled according to a taxonomy of 28 emotion categories.
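Since each comment can carry more than one label from the 28-category taxonomy, annotations are naturally represented as multi-hot vectors. A minimal sketch of that representation (the helper function and the example label indices are illustrative, not part of the official release):

```python
NUM_EMOTIONS = 28  # size of the GoEmotions taxonomy

def to_multi_hot(label_ids, num_labels=NUM_EMOTIONS):
    """Convert a list of emotion label indices into a multi-hot vector."""
    vec = [0] * num_labels
    for i in label_ids:
        vec[i] = 1
    return vec

# Hypothetical example: a comment annotated with emotion indices 0 and 5
v = to_multi_hot([0, 5])
```

This multi-hot form is the usual input to a multi-label classification loss such as per-label binary cross-entropy.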
ACL Anthology Dataset
The ACL Anthology dataset contains 21,212 papers, 17,792 authors, 342 venues, and 110,975 citations.
Twitter Dataset
The Twitter Dataset is a collection of tweets in three languages (English, Dutch, and German), annotated with Plutchik's emotions.
ACL Anthology
The ACL Anthology dataset contains papers on natural language processing, including citation patterns, authorship, and language use over time.
Penn Tree Bank
The Penn Treebank dataset is a corpus split into a training set of 929k words, a validation set of 73k words, and a test set of 82k words. The...
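A quick sanity check on the split sizes quoted above (929k/73k/82k word tokens) shows roughly an 86/7/8 percent partition:

```python
# Arithmetic on the Penn Treebank split sizes quoted in the entry above.
train, valid, test = 929_000, 73_000, 82_000
total = train + valid + test          # 1,084,000 tokens overall
train_frac = train / total            # fraction of tokens in the training set
```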
MS MARCO V1 corpus
MS MARCO V1 corpus.
Wikipedia Corpus
The dataset used in the paper is a subset of the Wikipedia corpus, consisting of 7500 English Wikipedia articles belonging to one of the following categories: People, Cities,...
LiveJournal
Wikipedia dataset
The dataset used in the paper is the Wikipedia dataset, which contains over six million English Wikipedia articles with a full-text field, associated with 50 training queries...
BookCorpus Dataset
The dataset used in the paper is the BookCorpus dataset.
Yelp Review Dataset
The Yelp review dataset contains hotel and restaurant reviews, including both reviews filtered as spam and reviews recommended as legitimate by Yelp.
Reddit Million User Dataset
The Reddit Million User Dataset is a collection of 4 million comments from 400k different Reddit users.