-
Individual Text Corpora Predict Openness, Interests, Knowledge and Level of E...
Individual text corpora (ICs) were generated from 214 participants, with a mean size of 5 million word tokens. -
OpenWebTextCorpus
The OpenWebText corpus is a collection of text data from the web. -
Twigraph: Discovering and Visualizing Influential Words between Twitter Profiles
The dataset used in the paper is a collection of 1.1M tweets from Twitter, with approximately 3000 tweets per user from various domains such as politics, sports, entertainment,... -
Semantic Textual Similarity
The STS benchmark (Cer et al., 2017) and SICK-Relatedness dataset (Marelli et al., 2014) respectively contain 8.6K and 9.8K labeled sentence pairs, the sizes of which are... -
Radiology reports dataset
The Radiology reports dataset is used to test the proposed Hierarchical Latent Word Clustering algorithm. -
NIPS dataset
The NIPS dataset is used to test the proposed Hierarchical Latent Word Clustering algorithm. -
Hierarchical Latent Word Clustering
The Hierarchical Latent Word Clustering dataset is used to test the proposed Hierarchical Latent Word Clustering algorithm. -
Tweet dataset
The dataset used in this paper is a collection of short texts, including tweets, Pascal Flickr captions, and search snippets. -
Google News Embeddings
The dataset used in the paper is a word2vec embedding trained on a corpus of Google News texts. -
Twitter Dataset
The Twitter Dataset is a collection of tweets annotated with Plutchik's emotions, consisting of tweets in three different languages: English, Dutch, and German. -
PubMed abstracts
The dataset used in this paper is the PubMed abstracts dataset, which contains approximately 11 million abstracts. -
Penn Tree Bank
The Penn Tree Bank dataset is a corpus split into a training set of 929k words, a validation set of 73k words, and a test set of 82k words. The... -
MS MARCO V1 corpus
MS MARCO V1 corpus -
Wikipedia Corpus
The dataset used in the paper is a subset of the Wikipedia corpus, consisting of 7500 English Wikipedia articles belonging to one of the following categories: People, Cities,... -
LiveJournal
Matrix factorization (MF) and autoencoders (AE) are among the most successful approaches to unsupervised learning. -
Wikipedia dataset
The dataset used in the paper is the Wikipedia dataset, which contains over six million English Wikipedia articles with a full-text field associated with 50 training queries...