Dataset - LDM

ChatGPT model data

ChatGPT model data, used to generate text
- Dataset
- JSON
Adding A Filter Based on The Discriminator to Improve Unconditional Text Gene...

The dataset is used for unconditional text generation, and the authors propose a novel mechanism to improve the generator by adding a filter which has the same input as the...
- Dataset
- JSON
Textual Sports Commentary Dataset

The textual dataset is a collection of live sports commentaries scraped from various sources, including live score websites and YouTube.
- Dataset
- JSON
Sports Commentary Dataset

The dataset is a collection of live sports commentaries, including audio and textual data, used to train and evaluate machine learning models for event recognition and...
- Dataset
- JSON
CLEVR-Robot Environment

A benchmark for evaluating task compositionality and long-horizon tasks through object manipulation, with language serving as the mechanism for goal specification.
- Dataset
- JSON
Word2Vec

Bilingual word embeddings from parallel and non-parallel corpora for cross-language text classification
- Dataset
- JSON
PersonaChat dataset

The PersonaChat dataset is a large persona-conditioned chit-chat style dialogue dataset.
- Dataset
- JSON
TIMIT

The TIMIT corpus is a widely used benchmark for speech recognition tasks. It contains 3,696 training utterances from 462 speakers, excluding the SA sentences. The core test set...
- Dataset
- JSON
AdvBench

The dataset used in the paper to test the Gradient Cuff method for detecting jailbreak attacks on large language models.
- Dataset
- JSON
OPT-66B and Llama2-70B

The paper uses two large language models, OPT-66B and Llama2-70B.
- Dataset
- JSON
Mixtral of Experts

The dataset used in the paper for the instruction-following task.
- Dataset
- JSON
speechocean762

speechocean762: An open-source non-native English speech corpus for pronunciation assessment.
- Dataset
- JSON
Automatic Pronunciation Assessment

A hierarchical context-aware modeling approach for multi-aspect and multi-granular pronunciation assessment
- Dataset
- JSON
Experimental Results

The authors evaluate the performance of their proposed conformal prediction methods for multistep feedback covariate shift (MFCS) on synthetic black-box optimization and active...
- Dataset
- JSON
The Online Pivot: Lessons Learned from Teaching a Text and Data Mining Course...

A text and data mining course on Natural Language Processing, adapted for online teaching during the COVID-19 pandemic.
- Dataset
- JSON
WikiSQL

Semantic parsing maps a user-issued natural language (NL) utterance to a machine-executable meaning representation (MR), such as λ-calculus (Zettlemoyer and Collins, 2005), SQL...
- Dataset
- JSON
Hearst

The dataset used in this paper is the Hearst dataset, which is a collection of text documents.
- Dataset
- JSON
WordNet Noun

The dataset used in this paper is the WordNet Noun dataset, which is a collection of nouns with their semantic relationships.
- Dataset
- JSON
Universal Conceptual Cognitive Annotation (UCCA)

Universal Conceptual Cognitive Annotation (UCCA) is a graph-based semantic annotation scheme based on typological linguistic principles.
- Dataset
- JSON
Russian Noun Dataset

The dataset used for clustering contains the 2000 most frequent nouns in the Russian Web corpus.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

530 datasets found