-
XLNet: Generalized Autoregressive Pretraining for Language Understanding
XLNet is a generalized autoregressive pretraining model for language understanding. -
PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification
PAWS-X is a cross-lingual adversarial dataset for paraphrase identification. -
RoBERTa: A Robustly Optimized BERT Pre-training Approach
RoBERTa is a robustly optimized BERT pre-training approach. -
MARGE: A Pre-trained Sequence-to-Sequence Model for Multi-lingual Paraphrasing
MARGE is a pre-trained sequence-to-sequence model learned with an unsupervised multi-lingual multi-document paraphrasing objective. -
BELEBELE Benchmark
A multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. -
Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning
Mini-ensemble low-rank adapters are a parameter-efficient fine-tuning method for natural language understanding tasks. -
ATIS2 and ATIS3
The ATIS2 and ATIS3 datasets are used to create low-latency natural language understanding components. -
General Language Understanding Evaluation (GLUE) dataset
The General Language Understanding Evaluation (GLUE) benchmark is a collection of datasets used to evaluate the performance of natural language understanding models. -
FewCLUE dataset
The FewCLUE dataset is a Chinese few-shot learning evaluation benchmark. -
WALNUT: A Benchmark on Semi-weakly Supervised Learning for Natural Language U...
WALNUT is a benchmark for semi-weakly supervised learning for natural language understanding. It consists of 8 NLU tasks with different types, including document-level and... -
SQuAD: 100,000+ Questions for Machine Comprehension of Text
The SQuAD dataset is a benchmark for natural language understanding tasks, including question answering and text classification. -
IPA dataset
The IPA dataset contains a set of Chinese utterances that were collected and annotated in the development process of a commercialized Intelligent Personal Assistant (IPA) named... -
OSQ dataset
The OSQ dataset covers 150 IND intents and also provides a set of manually labeled Out-of-Scope Queries (OSQ) that are not supported by the current system. -
TreeMix: Compositional Constituency-based Data Augmentation for Natural Langu...
TreeMix is a compositional data augmentation approach for natural language understanding. It leverages constituency parse trees to decompose sentences into sub-structures and... -
ROCStories (+GPT-J)
A corpus and cloze evaluation for deeper understanding of commonsense stories. -
ROCStories
The ROCStories corpus is a collection of crowdsourced five-sentence everyday stories rich in causal and temporal relations.