Text Mining - Groups

BioConceptVec Evaluation Datasets

The dataset contains over 25 million instances from nine independent datasets used for intrinsic and extrinsic evaluations.

PubMed abstracts and PubMed Central (PMC) full-text articles dataset

The PubMed abstracts and PubMed Central (PMC) full-text articles dataset is used for pretraining the UBERT variants.

BIOPAK FLASHER: EPIDEMIC DISEASE MONITORING AND DETECTION IN PAKISTAN USING T...

The dataset used in the paper is a collection of Urdu news articles related to epidemic diseases in Pakistan. The dataset is used to train a text mining model to extract...

DBLP papers

The dataset used in this paper is a collection of papers from the DBLP conferences between 2004 and 2014.

NIPS papers

The dataset used in this paper is a collection of papers from the NIPS conferences between 1987 and 1999.

iLCM

The iLCM project pursues the development of an integrated research environment for the analysis of structured and unstructured data in a “Software as a Service” architecture...

Stack Overflow Performance Discussions

The dataset used for the study contains 2,304 posts related to the performance of software components.

MSR Mining Challenge 2015

The dataset used for the MSR Mining Challenge 2015 contains 43,336,603 posts.

Mining and summarizing customer reviews

Music Corpus

The dataset is used for term clustering to build a modular ontology, based on a core ontology, from domain-specific text.

Russian Noun Dataset

The dataset used for clustering contains the 2000 most frequent nouns in the Russian Web corpus.

Spanish Noun Dataset

The dataset used for clustering contains the 2000 most frequent nouns in the Spanish Gigaword corpus.

English Noun Dataset

The dataset used for clustering contains the 2000 most frequent nouns in the British National Corpus (BNC) and the English Gigaword corpus.

Turkish Tweets Dataset

A collection of Turkish tweets about three different Turkish telecommunication brands gathered over one month.

CSL

The CSL dataset is a large-scale Chinese scientific literature dataset obtained from the "Qianyan" open-source NLP platform. It consists of 396,209 Chinese core journal papers'...

15 datasets found