Dataset - LDM

PubMed abstracts and PubMed Central (PMC) full-text articles dataset

The PubMed abstracts and PubMed Central (PMC) full-text articles dataset is used for pretraining the UBERT variants.
- Dataset
- JSON
BIOPAK FLASHER: EPIDEMIC DISEASE MONITORING AND DETECTION IN PAKISTAN USING T...

The dataset used in the paper is a collection of Urdu news articles related to epidemic diseases in Pakistan. The dataset is used to train a text mining model to extract...
- Dataset
- JSON
DBLP papers

The dataset used in this paper is a collection of papers from conferences indexed in DBLP between 2004 and 2014.
- Dataset
- JSON
NIPS papers

The dataset used in this paper is a collection of papers from the NIPS conferences between 1987 and 1999.
- Dataset
- JSON
Big data and big values: When companies need to rethink themselves

The dataset contains more than 94,000 tweets related to the core values of the firms listed in Fortune’s ranking of the World’s Most Admired Companies (2013-2017).
- Dataset
- JSON
iLCM

The iLCM project pursues the development of an integrated research environment for the analysis of structured and unstructured data in a “Software as a Service” architecture...
- Dataset
- JSON
Mining and summarizing customer reviews

Mining and summarizing customer reviews
- Dataset
- JSON
The Online Pivot: Lessons Learned from Teaching a Text and Data Mining Course...

A text and data mining course on Natural Language Processing, adapted for online teaching during the COVID-19 pandemic.
- Dataset
- JSON
Russian Noun Dataset

The dataset used for clustering contains the 2000 most frequent nouns in the Russian Web corpus.
- Dataset
- JSON
Spanish Noun Dataset

The dataset used for clustering contains the 2000 most frequent nouns in the Spanish Gigaword corpus.
- Dataset
- JSON
English Noun Dataset

The dataset used for clustering contains the 2000 most frequent nouns in the British National Corpus (BNC) and the English Gigaword corpus.
- Dataset
- JSON
CSL

The CSL dataset is a large-scale Chinese scientific literature dataset obtained from the "Qianyan" open-source NLP platform. It consists of 396,209 Chinese core journal papers'...
- Dataset
- JSON

You can also access this registry using the API (see API Docs).
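
As a rough illustration of programmatic access, the sketch below builds a search request and reads dataset titles from the reply. It assumes the registry exposes a CKAN-style Action API (`/api/3/action/package_search`), which this listing does not confirm. The base URL, the function names, and the sample reply are hypothetical placeholders. Consult the API Docs for the actual endpoints.

```python
import json
from urllib.parse import urlencode

# Hypothetical placeholder: the listing does not state the registry's base URL.
BASE_URL = "https://registry.example.org"


def build_search_url(query, rows=12, base_url=BASE_URL):
    """Build a search URL, assuming a CKAN-style package_search endpoint."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text):
    """Extract dataset titles from a CKAN-style package_search JSON reply."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("registry reported an unsuccessful request")
    return [item.get("title", "") for item in payload["result"]["results"]]


# Hand-written sample reply in the assumed shape (not real registry output).
SAMPLE_REPLY = json.dumps({
    "success": True,
    "result": {
        "count": 2,
        "results": [{"title": "NIPS papers"}, {"title": "CSL"}],
    },
})

if __name__ == "__main__":
    print(build_search_url("text mining"))
    print(dataset_titles(SAMPLE_REPLY))
```

The URL builder and the parser are kept separate so that the parsing step can be checked offline against a saved reply before any request is sent.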

12 datasets found