15 datasets found

Tags: Natural Language Processing

  • LLM dataset

    The paper does not explicitly describe the dataset, mentioning only that it is a large language model (LLM) and that the authors used it to train and evaluate their...
  • Sample Selection for Data Augmentation in Natural Language Processing

    Deep learning-based text classification models need abundant labeled data to obtain competitive performance. To address this, several studies have tried to use data augmentation to...
  • FNID: Fake News Inference Dataset

    A dataset for fake news inference
  • Detecting Opinion Spams and Fake News Using Text Classification

    A dataset for opinion spam and fake news detection
  • Liar, Liar Pants on Fire: A New Benchmark Dataset for Fake News Detection

    A new benchmark dataset for fake news detection, containing 12,836 short statements labeled for truthfulness, subject, context/venue, speaker, state, party, and prior history.
  • Penn Tree Bank

    The Penn Tree Bank dataset is a corpus split into a training set of 929k words, a validation set of 73k words, and a test set of 82k words. The...
  • Wikipedia dataset

    The dataset used in the paper is the Wikipedia dataset, which contains over six million English Wikipedia articles with a full-text field associated with 50 training queries...
  • BERT

    The dataset used in this paper is a pre-trained BERT model trained on English Wikipedia and Books datasets.
  • SST-2

    The dataset used for the experiments across ten models, ranging from bag-of-words models to pre-trained transformers; the authors find that a model having higher AUC does not necessarily...
  • Text8

    Word2Vec is a distributed word embedding generator that uses an artificial neural network to learn dense vector representations of words.
  • Penn Treebank

    The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.
  • IMDB

    The paper does not explicitly describe the dataset, mentioning only that the authors tested the proposed method on three real data sets for the most relevant security...
  • Penn Treebank dataset

    The dataset used in the paper is the Penn Treebank dataset, which is a large annotated corpus of English text.
  • Training Language Models to Perform Tasks

    A dataset for training language models to perform tasks such as question answering and text classification.
  • GLUE

    Pre-trained language models (PrLMs) have to carefully manage input units when training on a very large text with a vocabulary consisting of millions of words. Previous works have...