- Tweet2Vec: Character-Based Distributed Representations for Social Media
Text from social media provides a set of challenges that can cause traditional NLP approaches to fail. Informal language, spelling errors, abbreviations, and special characters...
- Linguistic Data Set
This linguistic data set consists of co-occurrences of 54 nouns and 58 adjectives in Charles Dickens' novel David Copperfield.
- BBC News dataset
The BBC News dataset was used for sentiment analysis of news articles.
- NIPS full paper dataset
The NIPS full paper dataset is a collection of text documents.
- Scaling laws and fluctuations in the statistics of word frequencies
The dataset consists of three large databases: Google-ngram, English Wikipedia, and a collection of scientific articles.
- Leipzig Corpus Miner
The Leipzig Corpus Miner (LCM) is a decentralized SaaS application for the analysis of large amounts of news texts.
- Individual Text Corpora Predict Openness, Interests, Knowledge and Level of E...
Individual text corpora (ICs) were generated from 214 participants, with a mean size of 5 million word tokens.
- OpenWebTextCorpus
The OpenWebText corpus is a collection of text data from the web.
- Twigraph: Discovering and Visualizing Influential Words between Twitter Profiles
The dataset used in the paper is a collection of 1.1M tweets, with approximately 3000 tweets per user, from various domains such as politics, sports, entertainment,...
- Semantic Textual Similarity
The STS benchmark (Cer et al., 2017) and the SICK-Relatedness dataset (Marelli et al., 2014) contain 8.6K and 9.8K labeled sentence pairs, respectively, the sizes of which are...
- Radiology reports dataset
The radiology reports dataset is used to test the proposed Hierarchical Latent Word Clustering algorithm.
- NIPS dataset
The NIPS dataset is used to test the proposed Hierarchical Latent Word Clustering algorithm.
- Hierarchical Latent Word Clustering
The Hierarchical Latent Word Clustering dataset is used to test the proposed algorithm of the same name.
- Tweet dataset
The dataset used in this paper is a collection of short texts, including tweets, Pascal Flickr captions, and search snippets.
- Google News Embeddings
The dataset used in the paper is a word2vec embedding trained on a corpus of Google News texts.
- Twitter Dataset
The Twitter Dataset is a collection of tweets in three languages (English, Dutch, and German), annotated with Plutchik's emotions.