4 datasets found

Groups: Text Classification
Organizations: No Organization
Formats: JSON

  • C4

    A large collection of text documents used for pre-training language models.
  • Text8

    A text corpus used with Word2Vec, a distributed word embedding generator that uses an artificial neural network to learn dense vector representations of words.
  • Wikitext-103

    A general English language corpus containing Good and Featured Wikipedia articles.
  • BookCorpus

    A dataset used for unsupervised sentence representation learning, consisting of paragraphs of unlabeled text.