-
OpenWebText Corpus
A dataset for language modeling, where the goal is to predict the next word in a sequence given the previous words.
-
Web Synthetic Page Layout
A dataset for paragraph recognition in document images, where spatial graph convolutional networks (GCNs) are applied to OCR text boxes.