- FLoRes Benchmark
The FLoRes dataset is a benchmark designed for low-resource machine translation. It includes English-to-Nepali translations with approximately 564,000 parallel sentences, making...
- Audio Visual Scene-aware Dialog dataset
The Audio Visual Scene-aware Dialog (AVSD) dataset requires systems to answer questions about events observed in a video, given the preceding dialog turns.
- VisDial dataset
The VisDial dataset consists of dialogs composed of question-answer pairs about an image, aiming to enhance visual dialog systems.
- JCR, Europarl, news-commentary, and wikititles corpora
The training data comprises the JCR, Europarl, news-commentary, and wikititles corpora, which are used to train machine translation systems between Spanish and Portuguese.
- French-20K
The French-20K dataset is used for cross-lingual evaluation of the semantic parsing approach: because French training data is limited, training data from English and German is leveraged.
- German-20K
The German-20K dataset is used to train and evaluate the model on semantic parsing tasks in German.
- English-Wiki
The English-Wiki dataset, which consists of syntactic and semantic structures in English, is used to train and evaluate the UCCA semantic parsing model.
- French-English Translation Dataset
The dataset for the FR-EN task contains around 0.2M sentence pairs.
- German-English Translation Dataset
The training data for the DE-EN task consists of 4.6M sentence pairs.
- NIST Chinese-English Translation Dataset
The training data for the ZH-EN task consists of 1.8M sentence pairs. The development set is NIST02, and the test sets are NIST05, NIST06, and NIST08.
- CASP12 ProteinNet dataset
The CASP12 ProteinNet dataset consists of around 50,000 protein structures used to evaluate models for protein structure prediction, specifically in the context of free modeling...
- Propaganda Techniques Corpus
The Propaganda Techniques Corpus (PTC) is a dataset consisting of news articles with sentences annotated for the presence of specific propaganda techniques, aimed at binary...
- CoNLL 2002/2003 NER
The CoNLL 2002/2003 NER corpus is a standard dataset for named entity recognition, providing annotated data for various languages.