21 datasets found

Tags: language modeling

  • GPT-3

    GPT-3 (Brown et al., 2020) is a 175-billion-parameter autoregressive language model trained on a web-scale mix of filtered Common Crawl, WebText2, books, and English Wikipedia; it is significantly larger than the largest model evaluated in the source paper.
  • WikiText-103 and MusDB datasets

    The source paper does not name its dataset explicitly, but notes that the authors trained a 16-layer Transformer (Vaswani et al., 2017) language model on...
  • SST

    The Stanford Sentiment Treebank (SST) provides standard train/dev/test splits and two subtasks: binary sentence classification or...
  • The Pile

    The Pile (Gao et al., 2020) is an 825 GiB corpus of diverse English text for language modeling, drawn from 22 high-quality sources.
  • LLaMA

    LLaMA (Touvron et al., 2023) is a family of foundation language models, ranging from 7B to 65B parameters, trained exclusively on publicly available data; it is a model family rather than a dataset.
  • FastText

    FastText is a subword-aware word embedding model rather than a dataset: it produces a vector representation of a word by composing the embeddings of the word's character n-grams, so it can represent even out-of-vocabulary words (see the sketch after this list).
  • LibriSpeech LM

    The LibriSpeech LM corpus is used for pre-training speech-text models.
  • Morfessor 2.0 dataset

    The Morfessor 2.0 dataset supports English, Finnish, and Turkish language models.
  • Den samiske tekstbanken dataset

    The Den samiske tekstbanken ("the Sámi text bank") dataset is used for a North Sámi language model.
  • Morpho Challenge 2010 dataset

    The Morpho Challenge 2010 dataset supports English, Finnish, and Turkish language models.
  • C4

    C4 (the Colossal Clean Crawled Corpus) is a large collection of English web documents filtered from Common Crawl, introduced with T5 (Raffel et al., 2020) for pre-training language models.
  • OpenWebText Corpus

    OpenWebText is an open-source recreation of GPT-2's WebText training corpus, built from web pages linked in Reddit submissions; it is used for language modeling, where the goal is to predict the next word in a sequence given the previous words (a minimal sketch of this objective appears after this list).
  • One Billion Words Dataset

    The One Billion Word Benchmark (Chelba et al., 2013) contains nearly one billion words of English news text and is a standard benchmark for language modeling.
  • Text8

    Text8 is a 100 MB cleaned excerpt of English Wikipedia, commonly used to train and evaluate word embeddings such as Word2Vec, a neural model that learns dense vector representations of words (see the gensim sketch after this list).
  • Wiki-Auto

    Wiki-Auto is a text simplification dataset of automatically aligned sentence pairs from English Wikipedia and Simple English Wikipedia.
  • OSCAR 22.01

    The OSCAR 22.01 corpus is a document-oriented, multilingual corpus used for pre-training large generative language models; it contains documents holding...
  • WikiText-103 dataset

    WikiText-103 (Merity et al., 2016) contains over 100 million tokens from verified Good and Featured English Wikipedia articles and is a standard word-level language modeling benchmark.
  • OSCAR

    The OSCAR corpus is a multilingual web corpus used for pre-training large generative language models. It is document-oriented and comparable in size and...
  • Common Crawl

    The Common Crawl (CC) project crawls and indexes publicly available web content. It generates 200-300 TiB of data per month (around 5% of which is in French) and constitutes the...
  • IMDB

    The IMDB dataset is a binary sentiment classification benchmark of 50,000 movie reviews labeled positive or negative (Maas et al., 2011).
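The FastText entry above describes composing a word's vector from the embeddings of its character n-grams. Below is a minimal sketch of that composition, assuming a toy randomly initialized embedding table; real FastText learns hashed n-gram vectors during training, and all names here are illustrative:

```python
import numpy as np

def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list[str]:
    """Enumerate character n-grams of a word, with the '<' and '>'
    boundary markers that FastText adds around each word."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

DIM = 8
rng = np.random.default_rng(0)
ngram_vectors: dict[str, np.ndarray] = {}  # toy stand-in for the learned table

def word_vector(word: str) -> np.ndarray:
    """Compose a word vector by summing its n-gram vectors, so even
    out-of-vocabulary words receive a representation."""
    vecs = [ngram_vectors.setdefault(g, rng.normal(size=DIM))
            for g in char_ngrams(word)]
    return np.sum(vecs, axis=0)

print(char_ngrams("cat", 3, 3))         # ['<ca', 'cat', 'at>']
print(word_vector("unseenword").shape)  # (8,) even for an unseen word
```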
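Several entries (OpenWebText, One Billion Words) define language modeling as predicting the next word given the previous words. A minimal PyTorch sketch of that training objective follows; the tiny vocabulary and LSTM model are chosen purely for illustration:

```python
import torch
import torch.nn as nn

vocab_size, dim = 100, 32
tokens = torch.randint(0, vocab_size, (4, 16))  # (batch, seq_len) token ids

embed = nn.Embedding(vocab_size, dim)
encoder = nn.LSTM(dim, dim, batch_first=True)
head = nn.Linear(dim, vocab_size)

# Next-token prediction: position t is trained to predict token t+1,
# so inputs and targets are the same sequence shifted by one.
hidden, _ = encoder(embed(tokens[:, :-1]))
logits = head(hidden)                           # (batch, seq_len-1, vocab)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   tokens[:, 1:].reshape(-1))
print(f"cross-entropy {loss.item():.3f}, perplexity {loss.exp().item():.1f}")
```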
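And since the Text8 entry pairs the corpus with Word2Vec, here is a short gensim sketch of training embeddings on it; the local path to the text8 file is a placeholder, and the file must be downloaded separately:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

# Text8Corpus streams the single-line, lowercase text8 dump as sentences.
sentences = Text8Corpus("text8")  # placeholder path to the downloaded file

# Skip-gram model (sg=1) with 100-dimensional vectors.
model = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=5)

print(model.wv.most_similar("language", topn=3))
```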
You can also access this registry using the API (see API Docs).
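A sketch of programmatic access, assuming a conventional JSON-over-HTTP interface; the base URL, path, response shape, and query parameter below are placeholders, so consult the API Docs for the actual scheme:

```python
import requests

# Placeholder endpoint and filter name; see the registry's API Docs
# for the real base URL, paths, and parameters.
BASE_URL = "https://example.org/api/datasets"
resp = requests.get(BASE_URL, params={"tag": "language modeling"}, timeout=10)
resp.raise_for_status()

# Assumes the response body is a JSON list of dataset entries.
for entry in resp.json():
    print(entry.get("name"), "-", entry.get("description", "")[:60])
```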