9 datasets found

Groups: Language Modeling

  • FastText

    The FastText dataset is a subword token embedding model. It produces a vector representation of a word by composing the embeddings of the character n-grams that make up the word (a minimal sketch of this composition appears after this list).
  • Penn Tree Bank

    The Penn Tree Bank dataset is a corpus split into a training set of 929k words, a validation set of 73k words, and a test set of 82k words. The...
  • Word2Vec

    Bilingual word embeddings from parallel and non-parallel corpora for cross-language text classification
  • C4

    C4 (Colossal Clean Crawled Corpus) is a large collection of web text documents used for pre-training language models.
  • OpenWebText Corpus

    A dataset for language modeling, where the goal is to predict the next word in a sequence given the previous words (a toy bigram sketch of this objective follows the list).
  • Text8

    Text8 is a small corpus consisting of the first 100 MB of a cleaned English Wikipedia dump, commonly used to train and evaluate word embedding models such as Word2Vec.
  • Penn Treebank

    The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.
  • Wikitext-103

    Wikitext-103 is a general English-language corpus containing verified Good and Featured Wikipedia articles.
  • IMDB

    The IMDB Large Movie Review Dataset contains 50,000 movie reviews (25,000 for training and 25,000 for testing); its review text is also commonly used as a corpus for language modeling.
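
As a rough illustration of the subword composition described in the FastText entry, the sketch below builds a word vector by summing embeddings of the word's character n-grams. The n-gram range, the random embedding table, and the class names are simplifying assumptions for illustration, not the reference FastText implementation (which learns hashed n-gram vectors with a skip-gram objective).

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with boundary markers as in FastText."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams

class SubwordEmbedder:
    """Toy subword embedder: a word vector is the sum of its n-gram vectors.

    Assumption: each n-gram gets a random vector on first use; a trained
    FastText model would learn these vectors from a corpus instead."""

    def __init__(self, dim=100, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.table = {}

    def _vec(self, gram):
        if gram not in self.table:
            self.table[gram] = self.rng.normal(scale=0.1, size=self.dim)
        return self.table[gram]

    def embed(self, word):
        # Compose the word vector from its character n-gram vectors.
        return np.sum([self._vec(g) for g in char_ngrams(word)], axis=0)

emb = SubwordEmbedder()
print(char_ngrams("where", 3, 4))   # ['<wh', 'whe', 'her', 'ere', 're>', '<whe', ...]
print(emb.embed("where").shape)     # (100,)
```

Because the vector is built from n-grams, even an out-of-vocabulary word still gets a representation from the pieces it shares with known words.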
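To make the next-word objective mentioned for OpenWebText (and shared by the other language-modeling corpora above) concrete, here is a toy bigram model that estimates P(next word | previous word) from counts. The miniature corpus and the absence of smoothing are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count bigrams for a maximum-likelihood bigram language model."""
    bigrams = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        bigrams[prev][nxt] += 1
    return bigrams

def next_word_probs(bigrams, prev):
    """P(next | prev) estimated from bigram counts (empty if prev unseen)."""
    counts = bigrams[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

tokens = "the cat sat on the mat and the cat slept".split()
bigrams = train_bigram(tokens)
print(next_word_probs(bigrams, "the"))  # {'cat': 0.666..., 'mat': 0.333...}
```

Modern language models replace the count table with a neural network, but the training signal is the same: predict each word from the words that precede it.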