Dataset - LDM

SemEval-2017 Semantic Textual Similarity Dataset

The SemEval-2017 dataset for Semantic Textual Similarity includes monolingual and cross-lingual sentence pairs for evaluating semantic similarity.
- Dataset
- JSON
SemEval-2016 Semantic Textual Similarity Dataset

The SemEval-2016 dataset for Semantic Textual Similarity was used to evaluate models on sentence pairs, with 90% of the data used for training and 10% for validation.
- Dataset
- JSON
German Traffic Sign Recognition Benchmark (GTSRB)

The GTSRB dataset consists of images of German traffic signs, used in the paper to evaluate classification error and the impact of alignment on recognition.
- Dataset
- JSON
MNIST Cluttered dataset

The MNIST Cluttered dataset consists of images containing handwritten digits situated within a cluttered background, intended for assessing object recognition capabilities.
- Dataset
- JSON
Visual Commonsense Reasoning (VCR)

VCR consists of 290k questions derived from 110k movie scenes, focusing on visual commonsense reasoning.
- Dataset
- JSON
US-CT Dataset

A synthetic dataset developed for ultrasound and CT image registration experiments, leveraging CT images to simulate ultrasound data for matching and localization.
- Dataset
- JSON
Human Face Database

A human face dataset used for evaluating image alignment techniques, containing altered and deformed images of human faces for testing alignment accuracy.
- Dataset
- JSON
MNIST Handwritten Digits Dataset

The MNIST handwritten digits dataset is a widely used benchmark dataset that consists of 60,000 training images and 10,000 testing images of handwritten digits, allowing...
- Dataset
- JSON
PF-PASCAL Benchmark

The PF-PASCAL benchmark comprises 1,351 image pairs over 20 object categories, with keypoint annotations for evaluating semantic correspondence.
- Dataset
- JSON
PF-WILLOW Benchmark

The PF-WILLOW benchmark contains 10 object sub-classes, each with 10 keypoint annotations for performance evaluation in semantic correspondence tasks.
- Dataset
- JSON
TSS Benchmark

The TSS benchmark consists of 400 image pairs divided into three groups for evaluating semantic correspondence methods.
- Dataset
- JSON
English Wikipedia

The English Wikipedia is widely used as a text corpus for NLP tasks.
- Dataset
- JSON
BooksCorpus

The BooksCorpus dataset consists of 11,038 books and has been used for text-only training.
- Dataset
- JSON
Visual Question Answering

Visual Question Answering (VQA) requires a model to answer open-ended questions regarding images.
- Dataset
- JSON
Image-Grounded Conversations

Image-Grounded Conversations (IGC) consists of dialogues between human participants over images.
- Dataset
- JSON
Image Chat

Image Chat consists of complete dialogues grounded in images, with style traits introduced to make the conversations more natural.
- Dataset
- JSON
Personality Captions

The Personality Captions dataset contains image-caption pairs with attributes describing 215 different speech styles.
- Dataset
- JSON
Instagram Images

A dataset of 3.5 billion Instagram images collected to explore the limits of weakly supervised pretraining.
- Dataset
- JSON
Frequent Russian Words Dataset

This dataset contains the 10,000 and 100,000 most frequent words used in training word embedding models for the Russian language, derived from Wikipedia and other...
- Dataset
- JSON
Word Embedding Models for Russian Language

The dataset consists of publicly available word embedding models for the Russian language, including RusVectores, fastText, and Russian Distributional Thesaurus.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

20,491 datasets found