Natural Language Processing - Groups

Leveraging QA Datasets to Improve Generative Data Augmentation

The paper proposes a method to leverage QA datasets for training generative language models to be context generators for a given question and answer.

Dataset
JSON

Diverse and Specific Clarification Question Generation with Keywords

Product descriptions on e-commerce websites often suffer from missing important aspects. Clarification question generation (CQ-Gen) can be a promising approach to help alleviate...

Dataset
JSON

LongPile

LongPile is a diverse dataset derived from the Pile corpus.

Dataset
JSON

PG-19

PG-19 is a well-established benchmark for long-form language modeling.

Dataset
JSON

Wikitext-103 and LAMBADA datasets

The paper does not explicitly name its dataset, but it states that the authors trained a GPT-2 transformer language model on the Wikitext-103 and LAMBADA datasets.

Dataset
JSON

RedPajama Dataset

The RedPajama dataset is used for a single-turn dialogue task.

Dataset
JSON

Dense Reward for Free in RLHF

The paper does not explicitly describe its dataset beyond noting that it is a preference dataset for language models.

Dataset
JSON

Analysing Discrete Self Supervised Speech Representation for Spoken Language ...

This work profoundly analyzes discrete self-supervised speech representations (units) through the eyes of Generative Spoken Language Modeling (GSLM).

Dataset
JSON

A large annotated corpus for learning natural language inference

Dataset
JSON

Universal Sentence Encoder

Universal sentence encoder

Dataset
JSON

CrisisT6

A dataset used in a paper on crisis domain adaptation with sequence-to-sequence transformers.

Dataset
JSON

nepal_queensland

A dataset used in a paper on crisis domain adaptation with sequence-to-sequence transformers.

Dataset
JSON

Femicide perception dataset

Femicide perception dataset: a large-scale perception survey of descriptions of gender-based violence (GBV) automatically extracted from a corpus of Italian newspapers.

Dataset
JSON

Wang271K Dataset

The Wang271K dataset is used for the Chinese Spelling Check (CSC) task, with a large number of Chinese characters and their corresponding errors.

Dataset
JSON

SIGHAN Datasets

The SIGHAN datasets are used for the Chinese Spelling Check (CSC) task, with a limited number of Chinese characters and their corresponding errors.

Dataset
JSON

Chinese Spelling Check Dataset

The dataset is used for the Chinese Spelling Check (CSC) task, with a large number of Chinese characters and their corresponding errors.

Dataset
JSON

Text Summarization

A dataset for the text summarization task, in which a summarizer produces an utterance of one or more sentences that succinctly reports the main content of a text.

Dataset
JSON

Unsupervised alignment of embeddings with Wasserstein Procrustes

This study introduces a new method for unsupervised alignment of embeddings with Wasserstein Procrustes.

Dataset
JSON

Discovering Universal Geometry in Embeddings with ICA

This study utilizes Independent Component Analysis (ICA) to unveil a consistent semantic structure within embeddings of words or images.

Dataset
JSON

COVID-19 Twitter Data

The COVID-19 Twitter Data dataset contains tweets about the COVID-19 pandemic.

Dataset
JSON

420 datasets found