Individual Text Corpora Predict Openness, Interests, Knowledge and Level of E...
Individual text corpora (ICs) were generated from 214 participants, with a mean of 5 million word tokens per corpus.
Twigraph: Discovering and Visualizing Influential Words between Twitter Profiles
The dataset used in the paper is a collection of 1.1M tweets from Twitter, with approximately 3000 tweets per user from various domains such as politics, sports, entertainment,...
Radiology reports dataset
The radiology reports dataset is used to test the proposed Hierarchical Latent Word Clustering algorithm.
NIPS dataset
The NIPS dataset is used to test the proposed Hierarchical Latent Word Clustering algorithm.
Hierarchical Latent Word Clustering
A dataset associated with the proposed Hierarchical Latent Word Clustering algorithm, used for its evaluation.
Opinion Mining and Sentiment Analysis
Opinion mining and sentiment analysis dataset.
GoEmotions
The GoEmotions dataset is a collection of 58k Reddit comments, each labeled according to a taxonomy of 28 emotion categories.
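Since each comment can carry more than one label from the 28-category taxonomy, annotations are naturally represented as multi-hot vectors. A minimal sketch of that representation (the helper function and the example label indices are illustrative, not part of the official release):

```python
NUM_EMOTIONS = 28  # size of the GoEmotions taxonomy

def to_multi_hot(label_ids, num_labels=NUM_EMOTIONS):
    """Convert a list of emotion label indices into a multi-hot vector."""
    vec = [0] * num_labels
    for i in label_ids:
        vec[i] = 1
    return vec

# Hypothetical example: a comment annotated with emotion indices 0 and 5
v = to_multi_hot([0, 5])
```

This multi-hot form is the usual input to a multi-label classification loss such as per-label binary cross-entropy.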
ACL Anthology Dataset
The ACL Anthology dataset contains 21,212 papers, 17,792 authors, 342 venues, and 110,975 citations.
Twitter Dataset
The Twitter Dataset is a collection of tweets in three languages (English, Dutch, and German), annotated with Plutchik's emotions.
ACL Anthology
The ACL Anthology dataset contains papers on natural language processing, including citation patterns, authorship, and language use over time.
Penn Tree Bank
The Penn Treebank dataset is a corpus split into a training set of 929k words, a validation set of 73k words, and a test set of 82k words. The...
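A quick sanity check on the split sizes quoted above (929k/73k/82k word tokens) shows roughly an 86/7/8 percent partition:

```python
# Arithmetic on the Penn Treebank split sizes quoted in the entry above.
train, valid, test = 929_000, 73_000, 82_000
total = train + valid + test          # 1,084,000 tokens overall
train_frac = train / total            # fraction of tokens in the training set
```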
MS MARCO V1 corpus
MS MARCO V1 corpus.
Wikipedia Corpus
The dataset used in the paper is a subset of the Wikipedia corpus, consisting of 7500 English Wikipedia articles belonging to one of the following categories: People, Cities,...
LiveJournal
Wikipedia dataset
The dataset used in the paper is the Wikipedia dataset, which contains over six million English Wikipedia articles with a full-text field, associated with 50 training queries...
BookCorpus Dataset
The dataset used in the paper is the BookCorpus dataset.
Yelp Review Dataset
The Yelp review dataset contains hotel and restaurant reviews, including both reviews filtered as spam and reviews recommended as legitimate by Yelp.
Reddit Million User Dataset
The Reddit Million User Dataset is a collection of 4 million comments from 400k different Reddit users.