LLM dataset
The dataset used in this paper is not explicitly described; the authors state only that they used it to train and evaluate their large language model (LLM). -
ÆTHEL: Automatically Extracted Typelogical Derivations for Dutch
A semantic compositionality dataset for written Dutch, consisting of a lexicon of supertags for about 900,000 words in context and 72,192 validated derivations. -
Utilizing Prolog for converting between active and passive sentence with thre...
This work introduces a simple but efficient method for one critical aspect of English grammar: the relationship between active and passive sentences. -
Universal Dependencies (UD) treebanks
The paper does not name a bespoke dataset; it states that the authors used the Universal Dependencies (UD) treebanks. -
Reddit Comments dataset
The Reddit Comments dataset is constructed from publicly available user comments on submissions on the Reddit website. -
Open Subtitles dataset
The Open Subtitles dataset consists of transcriptions of spoken dialog in movies and television shows. -
Attacker and Defender Counting Approach for Abstract Argumentation
Arguments are evaluated by counting the number of attackers and defenders for each argument. -
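The counting idea above can be sketched as follows. This is an illustrative sketch, not the paper's code: it assumes an abstract argumentation framework given as directed attack edges (attacker, target), and takes a defender of A to be any argument that attacks one of A's attackers.

```python
from collections import defaultdict

def count_attackers_defenders(attacks):
    """attacks: iterable of (attacker, target) pairs.
    Returns {argument: (n_attackers, n_defenders)}, where a defender of A
    is an argument that attacks one of A's attackers."""
    attackers_of = defaultdict(set)
    args = set()
    for a, t in attacks:
        attackers_of[t].add(a)
        args.update((a, t))
    result = {}
    for arg in args:
        atk = attackers_of[arg]          # direct attackers of arg
        dfn = set()
        for b in atk:                    # attackers of those attackers
            dfn |= attackers_of[b]
        result[arg] = (len(atk), len(dfn))
    return result

# Example chain: C attacks B, B attacks A, so C defends A.
counts = count_attackers_defenders([("B", "A"), ("C", "B")])
```

Here `counts["A"]` is `(1, 1)`: one attacker (B) and one defender (C).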
ANY dataset
ANY dataset combines natural and synthetic data, used to probe polarity via negative polarity items (NPIs) in two pre-trained Transformer-based models (BERT and GPT-2). -
English-Hindi Parallel Corpus
A parallel corpus used for training and testing English-Hindi machine translation systems. -
English-Hindi Outputs Quality Estimation using Naive Bayes Classifier
The dataset used to train and test a Naive Bayes classifier for quality estimation of English-Hindi machine translation outputs. -
Gemma: Open models based on gemini research and technology
This dataset contains a large corpus of text for training and evaluating large language models. -
Llama 2: Open foundation and fine-tuned chat models
This dataset contains a large corpus of text for training and evaluating large language models. -
harmless/harmful anchor datasets
This dataset contains 100 harmless and 100 harmful anchor prompts for evaluating the performance of large language models. -
Decimal Addition Dataset
A collection of decimal addition tasks with input lengths ranging from 1 to 40 digits, used to evaluate the ability of... -
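A dataset of this shape can be generated procedurally. The sketch below is a hypothetical generator, not the paper's released code: the operand-sampling scheme and the input/output string format are assumptions, matching only the stated 1-to-40-digit length range.

```python
import random

def make_addition_example(n_digits, rng=random):
    """Sample two n-digit operands and return (input_string, answer_string)."""
    lo = 10 ** (n_digits - 1) if n_digits > 1 else 0  # smallest n-digit number
    hi = 10 ** n_digits - 1                           # largest n-digit number
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return f"{a}+{b}", str(a + b)

# One example per operand length from 1 to 40 digits.
examples = [make_addition_example(n) for n in range(1, 41)]
```

A real benchmark would likely sample many examples per length and fix a random seed for reproducibility.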
UzSyllable dataset
A comprehensive dataset for training and evaluating machine learning algorithms on syllable prediction.