Dataset - LDM

GMEG-wiki and GMEG-yahoo

The GMEG-wiki and GMEG-yahoo datasets are used to evaluate the proposed approach.
- Dataset
- JSON
BEA-2019

The Break-It-Fix-It (BIFI) framework has demonstrated strong results on learning to repair a broken program without any labeled examples.
- Dataset
- JSON
CoNLL-2014

The task of grammatical error correction (GEC) is to map an ungrammatical sentence x_bad into a grammatical version of it, x_good.
- Dataset
- JSON
LM-Critic: Language Models for Unsupervised Grammatical Error Correction

Training a model for grammatical error correction (GEC) requires a set of labeled ungrammatical / grammatical sentence pairs, but manually annotating such pairs can be expensive.
- Dataset
- JSON
GLUE benchmark

The dataset used in the paper is not explicitly described, but it is mentioned that the authors used three downstream tasks from the GLUE benchmark: Stanford Sentiment Treebank...
- Dataset
- JSON
PANGeA: Procedural Artificial Narrative using Generative AI for Turn-Based, R...

PANGeA: Procedural Artificial Narrative using Generative AI for Turn-Based, Role-Playing Video Games
- Dataset
- JSON
Contra State Dataset

The dataset used in the paper is a collection of instruction sets and states for the Contra game, used to train a language model and a reinforcement learning policy.
- Dataset
- JSON
Contra Instruction Dataset

The dataset used in the paper is a collection of instruction sets and states for the Contra game, used to train a language model and a reinforcement learning policy.
- Dataset
- JSON
Contra Dataset

The dataset used in the paper is a collection of instruction sets and states for the Contra game, used to train a language model and a reinforcement learning policy.
- Dataset
- JSON
SlimPajama

The dataset is used to evaluate the performance of the xLSTM architecture on various tasks, including language modeling, question answering, and text classification.
- Dataset
- JSON
TESS: Text-to-Text Self-Conditioned Simplex Diffusion

Diffusion models have emerged as a powerful paradigm for generation, obtaining strong performance in various continuous domains. However, applying continuous diffusion models...
- Dataset
- JSON
How do large language models capture the ever-changing world knowledge?

This paper presents a review of recent advances in large language models' ability to capture ever-changing world knowledge.
- Dataset
- JSON
BERT

The dataset used in this paper is a BERT model pre-trained on the English Wikipedia and Books datasets.
- Dataset
- JSON
Masked Acoustic Unit for Mispronunciation Detection and Correction

The proposed method uses the acoustic unit (AU) as the intermediary feature for both mispronunciation detection and correction.
- Dataset
- JSON
Language models are few-shot learners

A language model that demonstrates capabilities in processing and generating human-like text.
- Dataset
- JSON
Self-Supervised Alignment with Mutual Information

The dataset is used for training a language model to follow behavioral principles without the use of preference labels, demonstrations, or human oversight.
- Dataset
- JSON
C4

A dataset used for pre-training language models, containing a large collection of text documents.
- Dataset
- JSON
DEPTH

A dataset used for the DEPTH model, a hierarchical language model that learns representations for both sub-word and sentence-level tokens.
- Dataset
- JSON
Dynahate

Dynahate: A dataset for hate speech detection.
- Dataset
- JSON
NLPositionality

NLPositionality is a framework for characterizing design biases and quantifying the positionality of NLP datasets and models.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).