Natural Language Understanding - Groups

XLNet: Generalized Autoregressive Pretraining for Language Understanding

XLNet is a generalized autoregressive pretraining model for language understanding.

PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification

PAWS-X is a cross-lingual adversarial dataset for paraphrase identification.

RoBERTa: A Robustly Optimized BERT Pretraining Approach

RoBERTa is a robustly optimized BERT pretraining approach.

MARGE: A Pre-trained Sequence-to-Sequence Model for Multi-lingual Paraphrasing

MARGE is a pre-trained sequence-to-sequence model learned with an unsupervised multi-lingual multi-document paraphrasing objective.

TP-UK

The dataset used in the paper for testing the Privacy-Preserving Prompt Tuning framework.

BELEBELE Benchmark

A multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants.

Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning

A parameter-efficient fine-tuning method for natural language understanding tasks.

SuperGLUE

The dataset used in the paper is the SuperGLUE benchmark, which includes 17 tasks: STS-B, MRPC, MNLI, QNLI, CoLA, SST-2, GLUE, NLI, NQ, ReCoRD, ReCoRD-Sub,...

ATIS2 and ATIS3

The ATIS2 and ATIS3 datasets are used to create low-latency natural language understanding components.

General Language Understanding Evaluation (GLUE) dataset

The General Language Understanding Evaluation (GLUE) benchmark is used in the paper to evaluate the performance of natural language understanding models.

FewCLUE dataset

The FewCLUE dataset is a Chinese few-shot learning evaluation benchmark.

WALNUT: A Benchmark on Semi-weakly Supervised Learning for Natural Language U...

WALNUT is a benchmark for semi-weakly supervised learning for natural language understanding. It consists of 8 NLU tasks with different types, including document-level and...

SQuAD: 100,000+ Questions for Machine Comprehension of Text

The SQuAD dataset is a reading comprehension benchmark of 100,000+ crowdsourced questions on Wikipedia articles, where the answer to each question is a span of text from the corresponding passage.

MASSIVE

The MASSIVE dataset is a comprehensive collection of approximately one million annotated utterances for various natural language understanding tasks such as slot-filling, intent...

IPA dataset

The IPA dataset contains a set of Chinese utterances that were collected and annotated in the development process of a commercialized Intelligent Personal Assistant (IPA) named...

OSQ dataset

The OSQ dataset covers 150 in-domain (IND) intents and also provides a set of manually labeled Out-of-Scope Queries (OSQ) that are not supported by the current system.

TreeMix: Compositional Constituency-based Data Augmentation for Natural Langu...

TreeMix is a compositional data augmentation approach for natural language understanding. It leverages constituency parse trees to decompose sentences into sub-structures and...

CoLA

The CoLA dataset has 8,551 training samples and 527 in-domain development samples.

ROCStories (+GPT-J)

A corpus and cloze evaluation for deeper understanding of commonsense stories.

ROCStories

The ROCStories corpus is a collection of crowdsourced five-sentence everyday stories rich in causal and temporal relations.

28 datasets found