Dataset - LDM

LETOR 4.0

LETOR 4.0 is a benchmark collection for research on learning to rank in information retrieval.
- Dataset
- JSON
IRGAN

IRGAN is an information retrieval (IR) modeling approach that sets up a game-theoretic minimax game between a generative and a discriminative model to iteratively optimize both of...
- Dataset
- JSON
YouTube Clickbait Detection Dataset

The dataset is a collection of online videos from YouTube, with comments and metadata. It is used to evaluate the performance of the Online Video Clickbait Protector (OVCP) scheme.
- Dataset
- JSON
NevIR

Negation in Neural Information Retrieval
- Dataset
- JSON
ClueWeb09B

The ClueWeb09B collection is a large-scale web search dataset: the Category B subset of ClueWeb09, comprising roughly 50 million English web pages.
- Dataset
- JSON
AOL Dataset

The AOL dataset contains a collection of queries and documents for search engine evaluation.
- Dataset
- JSON
TREC 2004 Robust Retrieval Track

The TREC 2004 Robust Retrieval Track dataset contains a collection of documents and queries for robust retrieval tasks.
- Dataset
- JSON
MathMLBen

The MathMLBen dataset is used to evaluate the performance of formula embedding techniques for mathematical information retrieval.
- Dataset
- JSON
arXMLiv 2018

The arXMLiv 2018 dataset is an HTML collection of the arXiv.org preprint archive, used as a training corpus for word embedding techniques.
- Dataset
- JSON
COVID-19 Vaccination Search Insights

The COVID-19 Vaccination Search Insights dataset is a collection of anonymized search queries and their corresponding labels, which indicate whether the query is related to COVID-19...
- Dataset
- JSON
TREC Deep Learning 2021 Collection

The TREC Deep Learning 2021 collection is a test collection for information retrieval evaluation, adopting a shallow pooling approach.
- Dataset
- JSON
TREC-8 Ad Hoc Collection

The TREC-8 ad hoc collection is a test collection for information retrieval evaluation, known for its high-quality pool.
- Dataset
- JSON
Concept Embedding for Information Retrieval

Conceptual indexing is the process of annotating raw text with concepts from a particular knowledge source. It is used to represent the content of documents and queries by...
- Dataset
- JSON
CORD-19

The CORD-19 dataset contains academic journal articles relating to a variety of coronaviruses and related viral infections, not only COVID-19, sourced from PubMed Central (PMC),...
- Dataset
- JSON
COVID-19 Information Retrieval and Extraction

A dataset used for COVID-19 information retrieval and extraction.
- Dataset
- JSON
BEIR

BEIR is a heterogeneous benchmark for zero-shot evaluation of information retrieval models, combining 18 publicly available datasets from diverse retrieval tasks and domains.
- Dataset
- JSON
TREC 2019 and TREC 2020 Deep Learning Track datasets

TREC 2019 and TREC 2020 Deep Learning Track datasets
- Dataset
- JSON
Wikipedia dataset

The Wikipedia dataset contains over six million English Wikipedia articles with a full-text field associated with 50 training queries...
- Dataset
- JSON
Baidu Search Dataset

The Baidu search dataset is a large-scale search dataset for unbiased learning to rank.
- Dataset
- JSON
ULTRE-2 Task

The ULTRE-2 task encourages participants to explore ULTR approaches to alleviate various types of biases in real user clicks during training, and achieve better ranking...
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

36 datasets found