Dataset - LDM

Detecting Hallucinated Content in Conditional Neural Sequence Generation

Neural sequence models can generate highly fluent sentences, but recent studies have shown that they are also prone to hallucinating additional content not supported by the...
- Dataset
- JSON
Machine Translation and Automated Analysis of the Sumerian Language Dataset

The Machine Translation and Automated Analysis of the Sumerian Language dataset contains Sumerian texts in cuneiform script.
- Dataset
- JSON
Sumerian Cuneiform Dataset

A dataset for the study of Sumerian cuneiform, covering part-of-speech tagging, named entity recognition, and machine translation.
- Dataset
- JSON
Intrinsic Dimensions of Language Fractal Structures

The dataset consists of embeddings of all n-grams of a natural language, constituting a representative sample of a language fractal structure.
- Dataset
- JSON
SNLI dataset

The dataset used in the paper is the SNLI dataset.
- Dataset
- JSON
Linear-time minimum Bayes risk decoding with reference aggregation

Linear-time minimum Bayes risk decoding with reference aggregation
- Dataset
- JSON
Finetuned language models are zero-shot learners

Finetuned language models are zero-shot learners
- Dataset
- JSON
Evaluating large language models trained on code

The paper presents the results of the OpenAI Codex evaluation on generating Python code.
- Dataset
- JSON
Improving Minimum Bayes Risk Decoding with Multi-Prompt

Multi-prompt decoding for conditional text generation
- Dataset
- JSON
ChatGPT and GPT-4

A dataset for evaluating the logical reasoning ability of ChatGPT and GPT-4.
- Dataset
- JSON
A Joint Model for Definition Extraction with Syntactic Connection and Semantic...

Definition Extraction (DE) is one of the well-known topics in Information Extraction that aims to identify terms and their corresponding definitions in unstructured texts.
- Dataset
- JSON
Chimera dataset

The Chimera dataset is the "Chimera" dataset of Lazaridou et al. (2017). This dataset was specifically constructed to simulate a nonce situation where a speaker encounters a...
- Dataset
- JSON
TaxiXNLI (translated)

Multilingual extension of the TAXINLI dataset for analyzing the effects of reasoning types on cross-lingual transfer performance.
- Dataset
- JSON
TaxiXNLI (diagnostic)

Multilingual extension of the TAXINLI dataset for analyzing the effects of reasoning types on cross-lingual transfer performance.
- Dataset
- JSON
TaxiXNLI

Multilingual extension of the TAXINLI dataset for analyzing the effects of reasoning types on cross-lingual transfer performance.
- Dataset
- JSON
Corpus of Linguistic Acceptability (CoLA)

The Corpus of Linguistic Acceptability (CoLA) is a set of 10,657 English sentences labeled as grammatical or ungrammatical from published linguistics literature.
- Dataset
- JSON
Execution-based Evaluation for NL2Bash

A set of 50 prompts for execution-based evaluation on the NL2Bash task.
- Dataset
- JSON
Words2Contact

The Words2Contact dataset contains verbal instructions for humanoid robots to place support contacts.
- Dataset
- JSON
Word2Vec: A Novel Semi-Supervised Learning Approach for Word Embeddings

Word2Vec is a technique for learning vector representations of words in a text corpus.
- Dataset
- JSON
SimVerb-3500: A Large-Scale Evaluation Set of Verb Similarity

SimVerb-3500 is a large-scale evaluation set of verb similarity, providing human ratings for the similarity of 3,500 verb pairs.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

420 datasets found