22 datasets found

Tags: language modeling

  • Vakyansh

    The dataset is used for training and testing the proposed punctuation restoration and inverse text normalization models.
  • GPT-3

    GPT-3 is a 175-billion-parameter autoregressive language model, significantly larger than the largest model evaluated in the associated results; note that it is a model rather than a dataset.
  • Wikitext-103 and MusDB datasets

    The paper does not name its dataset explicitly, but the authors trained a 16-layer Transformer (Vaswani et al., 2017) language model on...
  • SST

    The dataset used in the paper is the Stanford Sentiment Treebank (SST) dataset, which contains standard train/dev/test sets and two subtasks: binary sentence classification or...
  • The Pile

    The Pile dataset contains 3.5 million samples of diverse text for language modeling.
  • LLaMA

    LLaMA is a family of large language models (7B to 65B parameters) released by Meta AI; it is a model rather than a dataset.
  • FastText

    FastText is a subword embedding model rather than a dataset: it produces a vector representation of a word by composing the embeddings of the character n-grams that make up the word.
  • LibriSpeech LM

    The LibriSpeech LM corpus is used for pre-training speech-text models.
  • Morfessor 2.0 dataset

    Morfessor 2.0 dataset for English, Finnish and Turkish language models
  • Den samiske tekstbanken dataset

    Den samiske tekstbanken dataset for North Sámi language model
  • Morpho Challenge 2010 dataset

    Morpho Challenge 2010 dataset for English, Finnish and Turkish language models
  • C4

    C4 (Colossal Clean Crawled Corpus) is a large collection of English web documents filtered from Common Crawl, used for pre-training language models.
  • OpenWebText Corpus

    An open-source recreation of the WebText corpus, built from web pages linked on Reddit; used for language modeling, where the goal is to predict the next word in a sequence given the previous words.
  • One Billion Words Dataset

    A language-modeling benchmark of roughly one billion words drawn from news text, where the goal is to predict the next word in a sequence given the previous words.
  • Text8

    Text8 is a roughly 100 MB cleaned excerpt of English Wikipedia, commonly used to train and evaluate word embedding models such as Word2Vec.
  • Wiki-Auto

    The Wiki-Auto dataset is a text simplification dataset of automatically aligned sentence pairs from English Wikipedia and Simple English Wikipedia.
  • OSCAR 22.01

    The OSCAR 22.01 corpus is a document-oriented corpus that is used for pre-training large generative language models. It is a multilingual corpus that contains documents holding...
  • WikiText-103 dataset

    The dataset used in this paper is the WikiText-103 dataset, a language-modeling corpus of about 103 million tokens drawn from verified Good and Featured Wikipedia articles.
  • OSCAR

    The OSCAR corpus is a multilingual web corpus that is used for pre-training large generative language models. It is a document-oriented corpus that is comparable in size and...
  • Common Crawl

    The Common Crawl (CC) project crawls and indexes content available online. It generates 200-300 TiB of data per month (around 5% of which is in French), and constitutes the...
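The FastText entry above describes how a word vector is composed from embeddings of the word's character n-grams. The following is a minimal illustrative sketch of that idea; the embedding table here is randomly initialized (a real FastText model learns it during training), and the n-gram range and hashing scheme are simplified assumptions.

```python
# Sketch of FastText-style subword composition (Bojanowski et al., 2017).
# The embeddings are random placeholders, NOT trained FastText vectors.
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word with boundary markers < and >."""
    w = f"<{word}>"
    grams = [w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)]
    return grams + [w]  # the full word is also kept as a feature

def word_vector(word, table, dim=8, buckets=2 ** 20):
    """Compose a word vector as the mean of its n-gram embeddings.
    `table` maps hashed n-gram ids to vectors (filled lazily here)."""
    ids = [hash(g) % buckets for g in char_ngrams(word)]
    vecs = [table.setdefault(i, np.random.default_rng(i).standard_normal(dim))
            for i in ids]
    return np.mean(vecs, axis=0)

table = {}
vec = word_vector("language", table)  # an 8-dimensional vector
```

Because the vector is built from subword pieces, the scheme can produce embeddings even for words never seen during training, which is the key property the description refers to.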
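Several entries above define language modeling as predicting the next word in a sequence given the previous words. As a toy illustration of that objective (not tied to any specific dataset in this registry), a bigram model predicts the most frequent continuation of the preceding word:

```python
# Toy bigram language model: predict the next word from counts.
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Most frequent continuation of `word`, or None if unseen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram(corpus)
# "the" is followed by "cat" twice and "mat" once in this corpus,
# so the model predicts "cat".
```

Modern models trained on corpora such as WikiText-103 or The Pile replace these counts with neural networks, but the training objective is the same next-word prediction.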
You can also access this registry using the API (see API Docs).