Dataset - LDM

FAIRBELIEF

FAIRBELIEF is a language-agnostic analytical approach to capture and assess beliefs embedded in LMs.
- Dataset
- JSON
MultiLexNorm dataset

The MultiLexNorm dataset is used to evaluate the robustness of MT models to lexical normalization.
- Dataset
- JSON
MTNT dataset

The MTNT dataset is used to evaluate the robustness of MT models to noisy text.
- Dataset
- JSON
FLORES-200 devtest dataset

The FLORES-200 devtest dataset is used to evaluate the robustness of MT models to synthetic character perturbations.
- Dataset
- JSON
Biomedical Dataset

The dataset used in this paper is a collection of biomedical data, including text, images, and other types of data.
- Dataset
- JSON
Pre-trained Language Models in Biomedical Domain: A Systematic Survey

This paper summarizes the recent progress of pre-trained language models in the biomedical domain and their applications in downstream biomedical tasks.
- Dataset
- JSON
UKWaC and Wackypedia corpora

The dataset used in this paper is a large text corpus compiled from UKWaC and Wackypedia corpora.
- Dataset
- JSON
Covid-19 MLIA @ Eval initiative

The Covid-19 MLIA @ Eval initiative consists of three Natural Language Processing tasks: information extraction, multilingual semantic search and machine translation. The goal...
- Dataset
- JSON
Hausa Language Datasets

The first large-scale collection of diverse Hausa language datasets.
- Dataset
- JSON
Argument Validity and Novelty Prediction Shared Task

The dataset is used for the evaluation of argument quality classification tasks, including concreteness, validity, and novelty.
- Dataset
- JSON
CIMT Argument Concreteness Dataset

The dataset is used for the evaluation of argument quality classification tasks, including concreteness, validity, and novelty.
- Dataset
- JSON
German Reviews Dataset

A dataset for sentiment analysis on German reviews.
- Dataset
- JSON
English Reviews Dataset

A dataset for sentiment analysis on English reviews.
- Dataset
- JSON
Spanish Reviews Dataset

A dataset for sentiment analysis on Spanish reviews.
- Dataset
- JSON
Universal and Unsupervised Sentiment Analysis

A novel model for universal and unsupervised sentiment analysis driven by a set of syntactic rules for semantic composition.
- Dataset
- JSON
ROCStories

The ROCStories corpus is a collection of crowdsourced five-sentence everyday stories rich in causal and temporal relations.
- Dataset
- JSON
Crowd-sourced Language Annotations Dataset

The dataset consists of 5,600 episode-instruction pairs, where each episode is labeled with two hindsight instructions.
- Dataset
- JSON
Data-driven Instruction Augmentation for Language-conditioned Control

Data-driven Instruction Augmentation for Language-conditioned Control (DIAL) is a method that uses pre-trained vision-language models (VLMs) to label offline datasets for...
- Dataset
- JSON
MuTual

A dataset for research in multi-turn dialogue systems.
- Dataset
- JSON
E-commerce Dialogue Corpus

The dataset is used for training and testing response selection models for multi-turn conversations.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).
