OPT: Open Pre-trained Transformer Language Models
The OPT dataset is the large-scale pretraining corpus used in the paper.
ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Fl...
The paper does not explicitly describe its dataset beyond noting that it is a large language model dataset.
Neural Language Correction with Character-Based Attention
Stanford Neural Machine Translation Systems for Spoken Language Domain
Corpora Generation for Grammatical Error Correction
Two approaches for generating large parallel datasets for Grammatical Error Correction (GEC) using publicly available Wikipedia data.
Wall Street Journal (WSJ) dataset
The Wall Street Journal (WSJ) dataset is a standard benchmark dataset for coherence modeling.
A Cross-Domain Transferable Neural Coherence Model
Coherence is an important aspect of text quality and is crucial for readability. The proposed coherence model is simple in structure, yet it significantly...
Learning Word Embeddings from the Portuguese Twitter Stream: A Study of some ...
This paper describes a preliminary study for producing and distributing a large-scale database of embeddings from the Portuguese Twitter stream.
GPT-3 dataset
The dataset used in the paper consists of outputs from GPT-3, a large language model.
CIDER: Context Informed Dictionary and sEmantic Reasoner
A new approach to sentiment analysis that takes context into account by combining the dictionaries generated by SocialSent with the scoring rules and base sentiment dictionary...
Unsupervised word segmentation and lexicon discovery using acoustic word embe...
A dataset for the Zero Resource Speech Challenge 2015.
Fixed-dimensional acoustic embeddings of variable-length segments in low-reso...
A dataset for the Zero Resource Speech Challenge 2015.
The Zero Resource Speech Challenge 2015
A dataset for the Zero Resource Speech Challenge 2015.
A segmental Bayesian framework for fully-unsupervised large-vocabulary speech...
A segmental Bayesian model for full-coverage segmentation and clustering of conversational speech audio.
HOLISTICBIAS
A large dataset for measuring bias in language models, including nearly 600 descriptor terms across 13 different demographic axes.
VAULT: VAriable Unified Long Text representation for Machine Reading Comprehen...
VAULT: a lightweight and parallel-efficient paragraph representation for Machine Reading Comprehension (MRC), based on contextualized representations from long document input.
SWSR: A Chinese Dataset and Lexicon for Online Sexism Detection
The SWSR dataset consists of two files: SexWeibo.csv and SexComment.csv, containing weibos (posts) and comments (replies), respectively.
MNLI, QQP, and SST-2
The dataset used in this paper consists of three tasks: Multi-Genre Natural Language Inference (MNLI), Quora Question Pairs (QQP), and the Stanford Sentiment Treebank (SST-2).
Are Larger Pretrained Language Models Uniformly Better? Comparing Performance...
Larger language models have higher accuracy on average, but are they better on every single instance (datapoint)?
Towards Efficient Dialogue Pre-training with Transferable and Interpretable L...
This paper proposes a novel dialogue model with a latent structure that is easily transferable from the general domain to downstream tasks in a lightweight and transparent way.