Dataset - LDM

Text8

Word2Vec is a distributed word embedding generator that uses an artificial neural network to learn dense vector representations of words.
- Dataset
- JSON
PubMed, ArXiv, and Movies datasets

The datasets used in the paper are PubMed, ArXiv, and Movies. PubMed is a medical dataset consisting of research articles from the PubMed repository. The articles' subheadings...
- Dataset
- JSON
20NewsGroups

The dataset used in this paper is a collection of documents from various domains, including news, articles, and emails.
- Dataset
- JSON
CORD-19 Research Challenge

COVID-19 research challenge dataset
- Dataset
- JSON
Penn Treebank

The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.
- Dataset
- JSON
Wikitext-103

The dataset used in this paper is Wikitext-103, a general English language corpus containing good and featured Wikipedia articles.
- Dataset
- JSON
SNLI

The dataset used in the paper is the Stanford Natural Language Inference (SNLI) dataset, which consists of 549,367 premise-hypothesis pairs for train/dev/test sets and target...
- Dataset
- JSON
IMDB

The dataset used in the paper is not explicitly described, but it is mentioned that the authors tested the proposed method on three real data sets for the most relevant security...
- Dataset
- JSON
TREC

A dataset used for sentiment analysis, question type classification, and subjectivity classification tasks.
- Dataset
- JSON
Training Language Models to Perform Tasks

A dataset for training language models to perform tasks such as question answering and text classification.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

110 datasets found