TREC-CAR Benchmark Y1
The dataset used for the Retrieve-Cluster-Summarize system, consisting of 117 article-level queries and 126 test queries.
WINGNUS: Keyphrase extraction utilizing document logical structure
The dataset used in this paper for keyphrase extraction utilizing document logical structure.
SemEval-2010 Task 5 dataset
The dataset used in this paper for keyphrase extraction from academic articles.
BBC News dataset
The BBC News dataset was used for sentiment analysis of news articles.
NIPS full paper dataset
The NIPS full paper dataset is a collection of full-text papers from the NIPS conference, commonly used for topic modeling and text analysis.
ClueWeb09B
The ClueWeb09B collection is a large-scale web search dataset consisting of approximately 50 million English web pages (Category B of the ClueWeb09 crawl).
WikipassageQA, InsuranceQA v2, and MS-MARCO
A collection of three passage-ranking datasets: WikipassageQA, InsuranceQA v2, and MS-MARCO.
Deeper text understanding for IR with contextual neural language modeling
This paper applies contextual neural language models such as BERT to document ranking in ad-hoc information retrieval.
Learning to rank: from pairwise approach to listwise approach
This paper proposes a listwise approach to learning to rank, a key task in information retrieval, in contrast to earlier pairwise approaches.
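A minimal sketch (not taken from the paper) of the pairwise-versus-listwise contrast: a pairwise loss sums a logistic term over pairs of documents with different relevance, while a listwise (ListNet-style) loss compares the whole ranked list at once via cross-entropy between top-one probability distributions. Names and data below are illustrative.

```python
import numpy as np

def pairwise_loss(scores, labels):
    """Pairwise logistic loss: one term per document pair
    where one document is more relevant than the other."""
    loss, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                loss += np.log1p(np.exp(-(scores[i] - scores[j])))
                pairs += 1
    return loss / max(pairs, 1)

def listwise_loss(scores, labels):
    """Listwise (ListNet-style) loss: cross-entropy between the top-one
    probability distributions induced by the labels and by the scores."""
    p_true = np.exp(labels) / np.exp(labels).sum()
    p_pred = np.exp(scores) / np.exp(scores).sum()
    return -(p_true * np.log(p_pred)).sum()

# Three documents for one query: model scores and graded relevance labels.
scores = np.array([2.0, 1.0, 0.5])
labels = np.array([2.0, 0.0, 1.0])
print(pairwise_loss(scores, labels))
print(listwise_loss(scores, labels))
```

The listwise loss treats the entire result list for a query as one training instance, which is the shift the paper's title refers to.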
TREC Dynamic Domain 2015 ad-hoc retrieval task
The dataset used in the paper is the TREC Dynamic Domain 2015 ad-hoc retrieval task, which includes search result diversification. The dataset consists of 23 official runs and...
TREC Web Track 2014 ad-hoc retrieval task
The dataset used in the paper is the TREC Web Track 2014 ad-hoc retrieval task, which includes search result diversification. The dataset consists of 50 test topics and 10,000...
Web2Text: Deep Structured Boilerplate Removal
Web pages are a valuable source of information for many natural language processing and information retrieval tasks. Extracting the main content from those documents is...
SERP dataset
The dataset used in the paper is a collection of search engine result pages (SERPs) with their corresponding relevance scores.
Wikipedia Corpus
The dataset used in the paper is a subset of the Wikipedia corpus, consisting of 7500 English Wikipedia articles belonging to one of the following categories: People, Cities,...