13 datasets found

Groups: Language Modeling · Formats: JSON

  • PG-19

    PG-19 is a well-established benchmark for long-form language modeling.
  • SST

    The Stanford Sentiment Treebank (SST) provides standard train/dev/test splits and two subtasks: binary sentence classification and fine-grained (five-class) sentiment classification.
  • Automata-based constraints for language model decoding

    The dataset used in this paper is a collection of regular expressions and grammars for constraining language models.
  • Penn Tree Bank

    The Penn Tree Bank dataset is a corpus split into a training set of 929k words, a validation set of 73k words, and a test set of 82k words.
  • Word2Vec

    Word2Vec is a method for learning dense vector representations of words; this entry covers bilingual word embeddings learned from parallel and non-parallel corpora for cross-language text classification.
  • Wikitext-2

    The paper does not describe the dataset in detail, but the authors use WikiText-2, a language-modeling corpus of roughly 2 million tokens drawn from verified Good and Featured Wikipedia articles, for text generation tasks (see the loading sketch after this list).
  • SlimPajama

    SlimPajama is a large deduplicated pre-training corpus (a cleaned 627B-token version of RedPajama); here it is used to evaluate the xLSTM architecture on language modeling, question answering, and text classification.
  • C4

    C4 (the Colossal Clean Crawled Corpus) is a large collection of English web documents filtered from Common Crawl and used for pre-training language models.
  • Chinese Poetry

    The Chinese Poetry dataset is a collection of Chinese poems used for language modeling.
  • Text8

    Text8 is a corpus consisting of the first 100 MB of cleaned English Wikipedia text, commonly used to train word embeddings such as word2vec and as a small language-modeling benchmark (see the word2vec training sketch after this list).
  • Penn Treebank

    The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.
  • IMDB

    The paper does not describe the dataset explicitly; IMDB here refers to the Large Movie Review Dataset of 50,000 movie reviews labeled for binary sentiment classification.
  • Penn Treebank (PTB) dataset

    The Penn Treebank (PTB) dataset is used for the word ordering task, i.e., to evaluate how well different models reconstruct the order of words in a sentence.
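
The entries above are catalog summaries; as a concrete illustration of working with one of the listed corpora, below is a minimal sketch of loading WikiText-2 with the Hugging Face `datasets` library. The Hub identifier `wikitext` / `wikitext-2-raw-v1` and the `text` field name are assumed from common usage rather than taken from this listing.

```python
from datasets import load_dataset

# Load the standard train/validation/test splits of WikiText-2.
wikitext2 = load_dataset("wikitext", "wikitext-2-raw-v1")

# Each row has a single "text" field; blank lines separate sections.
train_lines = [row["text"] for row in wikitext2["train"] if row["text"].strip()]
print(f"{len(train_lines)} non-empty training lines")
print(train_lines[0][:80])
```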
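
Similarly, since the Text8 and Word2Vec entries above are closely paired, here is a minimal sketch of training word2vec embeddings on Text8 with gensim. The downloader key `"text8"`, the hyperparameters, and the probe word are illustrative assumptions, not details taken from this catalog.

```python
import gensim.downloader as api
from gensim.models import Word2Vec

# The "text8" downloader key fetches the corpus and yields lists of tokens.
corpus = api.load("text8")

# Parameters are illustrative; gensim's default training algorithm (CBOW) is used.
model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # embedding dimensionality
    window=5,          # context window size
    min_count=5,       # ignore rare words
    workers=4,
)

print(model.wv.most_similar("king", topn=5))
```

Training on the roughly 17M-token Text8 corpus takes a few minutes on CPU and is a common smoke test before scaling to the larger corpora listed above.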