- FAIRBELIEF
FAIRBELIEF is a language-agnostic analytical approach to capture and assess beliefs embedded in LMs.
- MultiLexNorm dataset
The MultiLexNorm dataset is used to evaluate the robustness of MT models to lexical normalization.
- MTNT dataset
The MTNT dataset is used to evaluate the robustness of MT models to noisy text.
- FLORES-200 devtest dataset
The FLORES-200 devtest dataset is used to evaluate the robustness of MT models to synthetic character perturbations.
- Biomedical Dataset
The dataset used in this paper is a collection of biomedical data, including text, images, and other types of data.
- Pre-trained Language Models in Biomedical Domain: A Systematic Survey
This paper summarizes the recent progress of pre-trained language models in the biomedical domain and their applications in downstream biomedical tasks.
- UKWaC and Wackypedia corpora
The dataset used in this paper is a large text corpus compiled from the UKWaC and Wackypedia corpora.
- Covid-19 MLIA @ Eval initiative
The Covid-19 MLIA @ Eval initiative consists of three Natural Language Processing tasks: information extraction, multilingual semantic search, and machine translation. The goal...
- Hausa Language Datasets
The first large-scale collection of diverse Hausa language datasets.
- Argument Validity and Novelty Prediction Shared Task
The dataset is used for the evaluation of argument quality classification tasks, including concreteness, validity, and novelty.
- CIMT Argument Concreteness Dataset
The dataset is used for the evaluation of argument quality classification tasks, including concreteness, validity, and novelty.
- German Reviews Dataset
A dataset for sentiment analysis on German reviews.
- English Reviews Dataset
A dataset for sentiment analysis on English reviews.
- Spanish Reviews Dataset
A dataset for sentiment analysis on Spanish reviews.
- Universal and Unsupervised Sentiment Analysis
A novel model for universal and unsupervised sentiment analysis driven by a set of syntactic rules for semantic composition.
- ROCStories
The ROCStories corpus is a collection of crowdsourced five-sentence everyday stories rich in causal and temporal relations.
- Crowd-sourced Language Annotations Dataset
The dataset consists of 5,600 episode-instruction pairs, where each episode is labeled with two hindsight instructions.
- Data-driven Instruction Augmentation for Language-conditioned Control
Data-driven Instruction Augmentation for Language-conditioned Control (DIAL) is a method that uses pre-trained vision-language models (VLMs) to label offline datasets for...
- E-commerce Dialogue Corpus
The dataset is used for training and testing response selection models for multi-turn conversations.