FastText is a subword embedding model rather than a dataset: it produces a vector representation of a word by composing the embeddings of the character n-grams that make up the word.
The Penn Treebank dataset contains approximately one million words of 1989 Wall Street Journal material annotated in Treebank II style, comprising about 42k sentences of varying lengths.
The dataset used in the paper is not explicitly described; the authors mention only that they tested the proposed method on three real data sets for the most relevant security...