Text Analysis - Groups

Tweet2Vec: Character-Based Distributed Representations for Social Media

Text from social media provides a set of challenges that can cause traditional NLP approaches to fail. Informal language, spelling errors, abbreviations, and special characters...

Dataset
JSON

Linguistic Data Set

The dataset used in this paper is a linguistic data set consisting of co-occurrences of 54 nouns and 58 adjectives in Charles Dickens' novel David Copperfield.

Dataset
JSON

BBC News dataset

The BBC News dataset was used for sentiment analysis of news articles.

Dataset
JSON

NIPS full paper dataset

The NIPS full paper dataset is a collection of text documents.

Dataset
JSON

Scaling laws and fluctuations in the statistics of word frequencies

The dataset consists of three large databases: Google-ngram, English Wikipedia, and a collection of scientific articles.

Dataset
JSON

Leipzig Corpus Miner

The Leipzig Corpus Miner (LCM) is a decentralized SaaS application for the analysis of large amounts of news texts.

Dataset
JSON

Individual Text Corpora Predict Openness, Interests, Knowledge and Level of E...

Individual text corpora (ICs) were generated from 214 participants with a mean number of 5 million word tokens.

Dataset
JSON

OpenWebTextCorpus

The OpenWebText corpus is a collection of text data from the web.

Dataset
JSON

RECIPES

The RECIPES dataset contains various annotated states (e.g., shape, composition, location, etc.) for ingredients in cooking recipes.

Dataset
JSON

PROPARA

The PROPARA dataset comprises procedural text about scientific processes. The location states of participant entities at each time step (sentence) in these processes are labeled...

Dataset
JSON

Twigraph: Discovering and Visualizing Influential Words between Twitter Profiles

The dataset used in the paper is a collection of 1.1M tweets from Twitter, with approximately 3000 tweets per user from various domains such as politics, sports, entertainment,...

Dataset
JSON

Semantic Textual Similarity

The STS benchmark (Cer et al., 2017) and SICK-Relatedness dataset (Marelli et al., 2014) respectively contain 8.6K and 9.8K labeled sentence pairs, the sizes of which are...

Dataset
JSON

Radiology reports dataset

The radiology reports dataset is used to test the proposed Hierarchical Latent Word Clustering algorithm.

Dataset
JSON

NIPS dataset

The NIPS dataset is used to test the proposed Hierarchical Latent Word Clustering algorithm.

Dataset
JSON

Hierarchical Latent Word Clustering

The Hierarchical Latent Word Clustering dataset is used to test the proposed Hierarchical Latent Word Clustering algorithm.

Dataset
JSON

Tweet dataset

The dataset used in this paper is a collection of short texts, including tweets, Pascal Flickr captions, and search snippets.

Dataset
JSON

The Pile

The Pile dataset contains 3.5 million samples of diverse text for language modeling.

Dataset
JSON

Probase

Probase is a probabilistic knowledge base and it contains millions of entities and concepts. One of the advantages of Probase is that in comparison with the well-known knowledge...

Dataset
JSON

Google News Embeddings

The dataset used in the paper is a word2vec embedding trained on a corpus of Google News texts.

Dataset
JSON

Twitter Dataset

The Twitter Dataset is a collection of tweets annotated with Plutchik's emotions, consisting of tweets in three different languages: English, Dutch, and German.

Dataset
JSON

53 datasets found