17 datasets found

Groups: Natural Language Processing · Organizations: No Organization · Formats: JSON

  • LLM dataset

    The source paper does not describe this dataset explicitly; it only mentions a large language model (LLM) that the authors train and evaluate on the data.
  • AGNews

    AG News is a four-class news topic classification corpus (World, Sports, Business, Sci/Tech) with 120,000 training and 7,600 test articles. The source paper does not describe it in detail, mentioning only that it is one of several datasets used for semi-supervised learning tasks.
  • C4 dataset

    The paper reports training a GPT-2 transformer language model on the C4 (Colossal Clean Crawled Corpus) dataset, a large English web corpus filtered from Common Crawl.
  • Penn Tree Bank

    The Penn Treebank dataset is a corpus split into a training set of 929k words, a validation set of 73k words, and a test set of 82k words.
  • Wikipedia dataset

    The dataset used in the paper is an English Wikipedia corpus containing over six million articles, each with a full-text field, associated with 50 training queries...
  • Word2Vec

    Word2Vec word embeddings; the associated work studies bilingual word embeddings learned from parallel and non-parallel corpora for cross-language text classification.
  • TEL-NLP

    The TEL-NLP dataset is a collection of Telugu text data for four NLP tasks: sentiment analysis, emotion identification, hate speech detection, and sarcasm detection.
  • BERT

    The entry refers to the pre-training data of BERT: the model is pre-trained on the English Wikipedia and BookCorpus (Books) datasets.
  • C4

    C4 (Colossal Clean Crawled Corpus) is a large collection of English text documents filtered from Common Crawl and used for pre-training language models.
  • SST-2

    SST-2 is the binary sentiment classification task of the Stanford Sentiment Treebank. The paper runs experiments on it across ten models, ranging from bag-of-words models to pre-trained transformers, and finds that a model with a higher AUC does not necessarily...
  • AG News

    The dataset used in the paper is a language-domain dataset for news topic classification, named AG News. The dataset is used to evaluate the performance of...
  • Text8

    Text8 is a plain-text corpus of roughly 100 MB of cleaned English Wikipedia text, commonly used to train word embeddings. Word2Vec is a distributed word embedding generator that uses an artificial neural network to learn dense vector representations of words.
  • Penn Treebank

    The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.
  • BookCorpus

    The dataset is used in the paper for unsupervised sentence representation learning and consists of paragraphs of unlabeled text drawn from books.
  • IMDB

    IMDB typically refers to the Large Movie Review dataset of 50,000 movie reviews labeled for binary sentiment classification. The paper does not describe it explicitly; it only mentions that the authors tested the proposed method on three real data sets for the most relevant security...
  • Penn Treebank dataset

    The dataset used in the paper is the Penn Treebank, a corpus of annotated Wall Street Journal text widely used for language modeling and syntactic parsing.
  • GLUE

    GLUE (General Language Understanding Evaluation) is a benchmark of nine English sentence- and sentence-pair understanding tasks used to evaluate language models. The associated paper notes that pre-trained language models (PrLM) have to carefully manage input units when training on a very large text with a vocabulary consisting of millions of words; previous works have... See the loading sketch below for one way to access several of the datasets on this page.
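
For readers who want to pull any of these corpora into a script, the sketch below shows one way they might be loaded with the Hugging Face datasets library. This is a minimal sketch under assumptions: the identifiers used here (ag_news, imdb, glue/sst2, ptb_text_only, allenai/c4) are common Hugging Face Hub names, not identifiers given by this catalog, and the catalog's own JSON distributions may differ.

    # Minimal loading sketch; the dataset identifiers below are assumed
    # Hugging Face Hub names, not taken from this catalog.
    from datasets import load_dataset

    # AG News: four-class news topic classification (~120k train / 7.6k test).
    ag_news = load_dataset("ag_news")

    # IMDB: 50k movie reviews for binary sentiment classification.
    imdb = load_dataset("imdb")

    # SST-2 is distributed as part of the GLUE benchmark.
    sst2 = load_dataset("glue", "sst2")

    # Penn Treebank, text-only variant commonly used for language modeling.
    ptb = load_dataset("ptb_text_only")

    # C4 is very large, so stream it instead of downloading the full corpus.
    c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

    print(ag_news["train"][0])           # {'text': ..., 'label': ...}
    print(next(iter(c4))["text"][:200])  # first 200 characters of a C4 document

The streaming flag for C4 avoids materializing hundreds of gigabytes on disk; the smaller classification sets (AG News, IMDB, SST-2) download in full and expose their standard splits.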