Natural Language Processing - Groups

CFQ

CFQ (Compositional Freebase Questions) is a semantic parsing dataset of natural language questions, each paired with a SPARQL query.

Dataset
JSON

ANY dataset

The ANY dataset combines natural and synthetic data and is used to probe polarity via negative polarity items (NPIs) in two pre-trained Transformer-based models (BERT and GPT-2).

Dataset
JSON

CamemBERT

A pretrained language model for French, trained on the OSCAR corpus.

Dataset
JSON

English-Hindi Parallel Corpus

A parallel English-Hindi corpus used for training and testing machine translation systems.

Dataset
JSON

English-Hindi Outputs Quality Estimation using Naive Bayes Classifier

The dataset used for training and testing the Naive Bayes classifier for quality estimation of English-Hindi outputs.

Dataset
JSON

Gemma: Open models based on Gemini research and technology

This dataset contains a large corpus of text for training and evaluating large language models.

Dataset
JSON

Llama 2: Open foundation and fine-tuned chat models

This dataset contains a large corpus of text for training and evaluating large language models.

Dataset
JSON

Harmless/harmful anchor datasets

This dataset contains 100 harmless and 100 harmful anchor prompts for evaluating the performance of large language models.

Dataset
JSON

Decimal Addition Dataset

The dataset used in this paper is a collection of decimal addition tasks, where the input lengths range from 1 to 40 digits. The dataset is used to evaluate the ability of...

Dataset
JSON
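
The description above specifies only that the tasks are decimal additions with input lengths from 1 to 40 digits. As a rough illustration of what such a dataset could look like, here is a minimal Python sketch. The record layout (`input`, `target`, `num_digits`), the function names, and the choice to give both operands the same length are assumptions made for this example, not details taken from the paper.

```python
import random


def make_addition_example(num_digits: int, rng: random.Random) -> dict:
    """Build one addition task whose two operands each have `num_digits` digits."""
    if num_digits < 1:
        raise ValueError("num_digits must be at least 1")
    # Smallest value with this many digits; a one-digit operand may be 0.
    low = 0 if num_digits == 1 else 10 ** (num_digits - 1)
    high = 10 ** num_digits - 1
    a = rng.randint(low, high)
    b = rng.randint(low, high)
    # Hypothetical record layout: the prompt string, the expected sum, the length.
    return {"input": f"{a}+{b}", "target": str(a + b), "num_digits": num_digits}


def make_dataset(examples_per_length: int, max_digits: int = 40, seed: int = 0) -> list:
    """Generate `examples_per_length` tasks for every length from 1 to `max_digits`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        make_addition_example(n, rng)
        for n in range(1, max_digits + 1)
        for _ in range(examples_per_length)
    ]


if __name__ == "__main__":
    for record in make_dataset(examples_per_length=1, max_digits=5):
        print(record)
```

Python integers have arbitrary precision, so 40-digit operands need no special handling. Grouping records by `num_digits` makes it straightforward to report accuracy per input length.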

UzSyllable dataset

A comprehensive dataset for evaluating and training machine learning algorithms for syllable prediction accuracy and performance.

Dataset
JSON

Design and Implementation of a Tool for Extracting Uzbek Syllables

A comprehensive approach to syllabification for the Uzbek language, including rule-based techniques and machine learning algorithms.

Dataset
JSON

Wikitext-103 and MusDB datasets

The paper does not explicitly name its dataset, but it states that the authors trained a 16-layer Transformer (Vaswani et al., 2017) based language model on...

Dataset
JSON

Sentence Compression via DC Programming Approach

The dataset used in this paper for the sentence compression task.

Dataset
JSON

Buffer of Thoughts

Buffer of Thoughts is a novel and versatile thought-augmented reasoning approach for enhancing accuracy, efficiency and robustness of large language models (LLMs).

Dataset
JSON

Winograd Schema - Knowledge Extraction Using Narrative Chains

The Winograd Schema Challenge (WSC) is a test of machine intelligence, designed to be an improvement on the Turing test. A Winograd Schema consists of a sentence and a...