Unsupervised word segmentation and lexicon discovery using acoustic word embe...
A dataset for the Zero Resource Speech Challenge 2015. -
Fixed-dimensional acoustic embeddings of variable-length segments in low-reso...
A dataset for the Zero Resource Speech Challenge 2015. -
The Zero Resource Speech Challenge 2015
A dataset for the Zero Resource Speech Challenge 2015. -
A segmental Bayesian framework for fully-unsupervised large-vocabulary speech...
A segmental Bayesian model for full-coverage segmentation and clustering of conversational speech audio. -
HOLISTICBIAS
A large dataset for measuring bias in language models, including nearly 600 descriptor terms across 13 different demographic axes. -
VAULT: VAriable Unified Long Text representation for Machine Reading Comprehen...
VAULT is a lightweight, parallel-efficient paragraph representation for Machine Reading Comprehension (MRC), based on contextualized representations of long document input. -
SWSR: A Chinese Dataset and Lexicon for Online Sexism Detection
The SWSR dataset consists of two files: SexWeibo.csv and SexComment.csv, containing weibos (posts) and comments (replies) respectively. -
MNLI, QQP, and SST-2
The dataset used in this paper consists of three tasks: Multi-Genre Natural Language Inference (MNLI), Quora Question Pairs (QQP), and Stanford Sentiment Treebank (SST-2). -
Are Larger Pretrained Language Models Uniformly Better? Comparing Performance...
Larger language models have higher accuracy on average, but are they better on every single instance (datapoint)? -
Towards Efficient Dialogue Pre-training with Transferable and Interpretable L...
This paper proposes a novel dialogue model with a latent structure that is easily transferable from the general domain to downstream tasks in a lightweight and transparent way. -
WikiText-103 and Enwik8 datasets
The WikiText-103 and Enwik8 datasets are used for language modeling tasks. -
Topological Word Delay Embeddings
The dataset used in the paper is a collection of text samples, including a valid argument, an invalid argument with circular logic, and a randomly generated text with a regular... -
Twitter OOV Word Dataset
The dataset is a collection of tweets, filtered to include only English-language tweets. It is used to study out-of-vocabulary (OOV) words on Twitter. -
SPAGHETTI: Open-Domain Question Answering
SPAGHETTI: A hybrid open-domain question-answering system that combines semantic parsing and information retrieval to handle structured and unstructured data. -
Google-RE (Templates) dataset
The Google-RE (Templates) dataset contains 6.11K template-based prompts drawn from Wikipedia, covering 3 relations. -
Comparing Template-based and Template-free Language Model Probing
Template-based probing uses expert-made templates to create prompts, while template-free probing uses naturally-occurring text. -
WikipassageQA, InsuranceQA v2, and MS-MARCO
This entry covers three passage-ranking datasets: WikipassageQA, InsuranceQA v2, and MS-MARCO. -
PASSAGE RANKING WITH WEAK SUPERVISION
In this paper, we propose a weak supervision framework for neural ranking tasks based on the data programming paradigm (Ratner et al., 2016), which enables us to leverage... -
R4R Dataset
The R4R dataset is a larger VLN dataset than R2R, with more complicated navigation paths. -
R2R Dataset
The R2R dataset is based on real photos taken in indoor environments. It has attracted considerable attention for its simply stated task, which at the same time requires complex...