- French Street Name Signs Dataset
The French Street Name Signs (FSNS) dataset contains images of French street name signs extracted from Google Street View, featuring low-resolution text lines in natural scenes...
- Google Billion Word dataset
The Google Billion Word dataset is one of the largest language modeling datasets with almost one billion tokens and a vocabulary of over 800K words, based on an English corpus...
- CNN/Daily Mail corpus
The CNN/Daily Mail corpus contains pairs of online news articles and their summaries, consisting of approximately 287,000 training pairs, 13,368 validation pairs, and 11,490...
- TL;DR Reddit corpus
The TL;DR Reddit corpus consists of approximately 3 million content-summary pairs mined from Reddit, designed for the TL;DR challenge focusing on text summarization.