Dataset - LDM

Vakyansh

The dataset is used for training and testing the proposed punctuation restoration and inverse text normalization models.
- Dataset
- JSON
GPT-3

A large language model that is significantly larger than the largest model tested in the results discussed above.
- Dataset
- JSON
Wikitext-103 and MusDB datasets

The paper does not name the dataset explicitly, but it states that the authors trained a 16-layer Transformer-based (Vaswani et al., 2017) language model on...
- Dataset
- JSON
SST

The dataset used in the paper is the Stanford Sentiment Treebank (SST) dataset, which contains standard train/dev/test sets and two subtasks: binary sentence classification or...
- Dataset
- JSON
The Pile

The Pile dataset contains 3.5 million samples of diverse text for language modeling.
- Dataset
- JSON
LLaMA

The dataset used in the paper is LLaMA, a large language model.
- Dataset
- JSON
FastText

FastText is a subword token embedding model. It represents a word by composing the embeddings of the character n-grams that make up the word.
- Dataset
- JSON
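
The composition described in the FastText entry can be illustrated with a minimal sketch. It is not the fastText library's implementation: the helper names (`char_ngrams`, `make_table`, `word_vector`), the random embedding table, the CRC32 bucket hashing, the n-gram range of 3 to 6, and the averaging step are all assumptions chosen for illustration, and the extra whole-word token that subword models often add is omitted.

```python
import random
import zlib


def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, padded with '<' and '>' markers."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def make_table(num_buckets=1000, dim=8, seed=0):
    """Build a hypothetical embedding table with random values (stand-in for trained weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(num_buckets)]


def word_vector(word, table, n_min=3, n_max=6):
    """Average the bucket embeddings of the word's character n-grams."""
    dim = len(table[0])
    grams = char_ngrams(word, n_min, n_max)
    total = [0.0] * dim
    for gram in grams:
        # Hash each n-gram to a bucket; CRC32 is used here only because it is deterministic.
        bucket = zlib.crc32(gram.encode("utf-8")) % len(table)
        for k in range(dim):
            total[k] += table[bucket][k]
    return [x / len(grams) for x in total] if grams else total


table = make_table()
print(char_ngrams("cat", 3, 3))
print(word_vector("where", table))
```

Because the vector is built from n-grams rather than looked up by whole word, a word absent from the training vocabulary still receives a representation as long as it shares n-grams with known words.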
LibriSpeech LM

The LibriSpeech LM corpus is used for pre-training speech-text models.
- Dataset
- JSON
Morfessor 2.0 dataset

Morfessor 2.0 dataset for English, Finnish and Turkish language models
- Dataset
- JSON
Den samiske tekstbanken dataset

Den samiske tekstbanken dataset for North Sámi language model
- Dataset
- JSON
Morpho Challenge 2010 dataset

Morpho Challenge 2010 dataset for English, Finnish and Turkish language models
- Dataset
- JSON
C4

A dataset used for pre-training language models, containing a large collection of text documents.
- Dataset
- JSON
OpenWebText Corpus

A dataset for language modeling, where the goal is to predict the next word in a sequence given the previous words.
- Dataset
- JSON
One Billion Words Dataset

A dataset for language modeling, where the goal is to predict the next word in a sequence given the previous words.
- Dataset
- JSON
Text8

Word2Vec is a distributed word embedding generator that uses an artificial neural network to learn dense vector representations of words.
- Dataset
- JSON
Wiki-Auto

The Wiki-Auto dataset is a text simplification dataset.
- Dataset
- JSON
OSCAR 22.01

The OSCAR 22.01 corpus is a document-oriented corpus that is used for pre-training large generative language models. It is a multilingual corpus that contains documents holding...
- Dataset
- JSON
WikiText-103 dataset

The dataset used in this paper is the WikiText-103 dataset, which contains a large corpus of text.
- Dataset
- JSON
OSCAR

The OSCAR corpus is a multilingual web corpus that is used for pre-training large generative language models. It is a document-oriented corpus that is comparable in size and...
- Dataset
- JSON
Common Crawl

The Common Crawl (CC) project browses and indexes all content available online. It generates 200-300 TiB of data per month (around 5% of which is in French), and constitutes the...
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

22 datasets found