-
Bing Images of Short Quotes
This dataset contains about 215 images of short quotes with different background styles.
-
Incidental Scene Text Dataset
This dataset consists of 4,468 cut-out word images, each corresponding to the axis-aligned bounding box of one of the provided words.
-
Born-Digital Images Dataset
This dataset contains images produced digitally using a desktop scanner, a camera, and screen capture software.
-
The Pile: An 800GB dataset of diverse text for language modeling
The Pile is an 800GB dataset of diverse text intended for language modeling.
-
German Common Crawl
German Common Crawl is a dataset of web pages crawled from the internet.
-
SlimPajama: A 627B token cleaned and deduplicated version of RedPajama
SlimPajama is a 627B-token cleaned and deduplicated version of the RedPajama dataset.