-
Leveraging QA Datasets to Improve Generative Data Augmentation
The paper proposes a method to leverage QA datasets for training generative language models to be context generators for a given question and answer. -
Diverse and Specific Clarification Question Generation with Keywords
Product descriptions on e-commerce websites often suffer from missing important aspects. Clarification question generation (CQ-Gen) can be a promising approach to help alleviate... -
Wikitext-103 and LAMBADA datasets
The paper does not explicitly name its dataset, but it states that the authors trained a GPT-2 transformer language model on the Wikitext-103 and LAMBADA datasets. -
RedPajama Dataset
The RedPajama dataset is used for a single-turn dialogue task. -
Dense Reward for Free in RLHF
The paper does not explicitly describe its dataset, noting only that it is a preference dataset for language models. -
ANALYSING DISCRETE SELF SUPERVISED SPEECH REPRESENTATION FOR SPOKEN LANGUAGE ...
This work analyzes discrete self-supervised speech representations (units) in depth through the lens of Generative Spoken Language Modeling (GSLM). -
A large annotated corpus for learning natural language inference
A large annotated corpus for learning natural language inference -
Universal Sentence Encoder
Universal sentence encoder -
nepal_queensland
A dataset used for crisis domain adaptation with sequence-to-sequence transformers. -
Femicide perception dataset
A large-scale perception survey of gender-based violence (GBV) descriptions automatically extracted from a corpus of Italian newspapers. -
Wang271K Dataset
The Wang271K dataset is used for the Chinese Spelling Check (CSC) task and covers a large number of Chinese characters and their corresponding errors. -
SIGHAN Datasets
The SIGHAN datasets are used for the Chinese Spelling Check (CSC) task and cover a limited number of Chinese characters and their corresponding errors. -
Chinese Spelling Check Dataset
The dataset is used for the Chinese Spelling Check (CSC) task and covers a large number of Chinese characters and their corresponding errors. -
Text Summarization
A dataset for the text summarization task, in which a summarizer produces an utterance of one or more sentences that succinctly reports the main content of a text. -
Unsupervised alignment of embeddings with Wasserstein Procrustes
This study introduces a new method for the unsupervised alignment of embeddings with Wasserstein Procrustes. -
Discovering Universal Geometry in Embeddings with ICA
This study utilizes Independent Component Analysis (ICA) to unveil a consistent semantic structure within embeddings of words or images. -
COVID-19 Twitter Data
The COVID-19 Twitter Data set contains tweets about the COVID-19 pandemic.