Information Retrieval - Groups

MathMLBen

The MathMLBen dataset is used to evaluate the performance of formula embedding techniques for mathematical information retrieval.

Dataset
JSON

arXMLiv 2018

The arXMLiv 2018 dataset is an HTML collection of the arXiv.org preprint archive, used as a training corpus for word embedding techniques.

Dataset
JSON

COVID-19 Vaccination Search Insights

The COVID-19 Vaccination Search Insights dataset is a collection of anonymized search queries and their corresponding labels, which indicate whether the query is related to COVID-19...

Dataset
JSON

TREC Deep Learning 2021 Collection

The TREC Deep Learning 2021 collection is a test collection for information retrieval evaluation, adopting a shallow pooling approach.

Dataset
JSON

TREC-8 Ad Hoc Collection

The TREC-8 ad hoc collection is a test collection for information retrieval evaluation, known for its high-quality pool.

Dataset
JSON

Concept Embedding for Information Retrieval

Conceptual indexing includes the process of annotating raw text by concepts of a particular knowledge source. It is used to represent the content of documents and queries by...

Dataset
JSON

CORD-19

The CORD-19 dataset contains academic journal articles relating to a variety of coronaviruses and related viral infections, not only COVID-19, sourced from PubMed Central (PMC),...

Dataset
JSON

COVID-19 Information Retrieval and Extraction

The dataset used for COVID-19 information retrieval and extraction.

Dataset
JSON

BEIR

The BEIR dataset is a large-scale zero-shot evaluation dataset for information retrieval models, consisting of 13,000 documents and 1,000 questions.

Dataset
JSON

TREC 2019 and TREC 2020 Deep Learning Track datasets

Dataset
JSON

Wikipedia dataset

The dataset used in the paper is the Wikipedia dataset, which contains over six million English Wikipedia articles with a full-text field associated with 50 training queries...

Dataset
JSON

Baidu Search Dataset

The Baidu search dataset is a large-scale search dataset for unbiased learning to rank.

Dataset
JSON

ULTRE-2 Task

The ULTRE-2 task encourages participants to explore ULTR approaches to alleviate various types of biases in real user clicks during training, and achieve better ranking...

Dataset
JSON

TMC

The TMC dataset is a collection of air traffic reports.

Dataset
JSON

Reuters21578

The problem of similarity search is to find the items in a large collection that are most similar to a query item of interest. Fast similarity search is at the core of many information...

Dataset
JSON

TripClick

The TripClick dataset is a large-scale benchmark for information retrieval.

Dataset
JSON

CLEF 2003

The dataset used for the experiments in the paper.

Dataset
JSON

Tetun Test Collection

The Tetun test collection is a document-level audited dataset for relevance judgments.

Dataset
JSON

Labadain-30k+

The Labadain-30k+ dataset is a monolingual Tetun document-level audited dataset.

Dataset
JSON

Reuters-21578

The text classification problem has long been an interesting research field; the aim of text classification is to develop algorithms that find the categories of given documents.

Dataset
JSON
