BD807 (BLIP2-guided Dataset with 807 images)

NoiseCollage generates an image with N objects from three conditions, L, S, and s∗. L = {l1, ..., lN} is the set of N layout conditions that control the layout of individual objects; each layout condition ln is a region specified by a bounding box or a polygon. S = {s1, ..., sN} is the set of N text conditions that describe the visual information of the objects; each text condition sn is given as a word sequence, for example, “A man wearing an orange jacket is sitting at a table.” s∗ is a global text condition that describes the whole image containing the objects.
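To make the three conditions concrete, the following is a minimal Python sketch of how one sample's conditions could be held in memory. It is not the NoiseCollage code or the BD807 file format: the class names `LayoutCondition` and `Conditions`, the field names, and the `(x_min, y_min, x_max, y_max)` box convention are all assumptions made for illustration. The second object text and the global text in the example are also made up.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class LayoutCondition:
    """One layout condition l_n: a region given as a bounding box or a polygon."""

    # Assumed convention: (x_min, y_min, x_max, y_max) in normalized coordinates.
    bbox: Optional[Tuple[float, float, float, float]] = None
    # Polygon as a list of (x, y) vertices.
    polygon: Optional[List[Point]] = None

    def __post_init__(self) -> None:
        # Exactly one of the two region types must be provided.
        if (self.bbox is None) == (self.polygon is None):
            raise ValueError("specify exactly one of bbox or polygon")


@dataclass
class Conditions:
    """The conditions (L, S, s*) for one image with N objects."""

    layouts: List[LayoutCondition]  # L = {l_1, ..., l_N}
    texts: List[str]                # S = {s_1, ..., s_N}
    global_text: str                # s*

    def __post_init__(self) -> None:
        # L and S are paired per object, so they must have the same length N.
        if len(self.layouts) != len(self.texts):
            raise ValueError("L and S must both contain N entries")

    @property
    def num_objects(self) -> int:
        return len(self.layouts)


# Illustrative sample with N = 2 objects (values are hypothetical).
example = Conditions(
    layouts=[
        LayoutCondition(bbox=(0.1, 0.2, 0.5, 0.9)),
        LayoutCondition(polygon=[(0.5, 0.6), (0.9, 0.6), (0.9, 0.9), (0.5, 0.9)]),
    ],
    texts=[
        "A man wearing an orange jacket is sitting at a table.",
        "A wooden table.",
    ],
    global_text="A man sitting at a table in a cafe.",
)
```

Pairing `layouts[n]` with `texts[n]` by index mirrors the shared subscript n in ln and sn; the length check catches samples where an object has a region but no description, or the reverse.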

BibTeX: