The dataset used in the paper is not explicitly described, but it is mentioned that it is a large-scale captioned image dataset (LAION) used to train the Stable Diffusion model.
Pre-trained language models (PrLM) have to carefully manage input units when training on a very large text with a vocabulary consisting of millions of words. Previous works have...
The Elsevier OA CC-BY corpus dataset consists of 40,000 open-access articles from across Elsevier's journals, representing a diverse research discipline.