5 datasets found

Tags: text classification

Filter Results
  • FastText

    The FastText dataset is a subword token embedding model. It produces a vector representation of a word based on composing embeddings of the character n-grams composing the word.
  • C4

    The dataset used for pre-training language models, containing a large collection of text documents.
  • Text8

    Word2Vec is a distributed word embedding generator that uses an artificial neural network to learn dense vector representations of words.
  • Penn Treebank

    The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.
  • IMDB

    The dataset used in the paper is not explicitly described, but it is mentioned that the authors tested the proposed method on three real data sets for the most relevant security...