The model’s ability to understand spatial relationships in prompts is greatly improved. The work also introduces a dataset of 6 million images captioned with detailed spatial relationships. Both the models and the dataset will be open-sourced.
Details:
Current text-to-image (T2I) models share a key shortcoming: they often fail to generate images that faithfully reflect the spatial relationships described in the text prompt.
In this work, we comprehensively investigate this limitation and develop datasets and methods that achieve state-of-the-art performance.
First, we find that existing image-text datasets under-represent spatial relationships. To address this, we create SPRIGHT, the first large-scale dataset focused on spatial relationships, by re-captioning 6 million images from four widely used vision datasets.
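The section does not spell out the re-captioning pipeline, but a minimal sketch of spatially focused re-captioning with an open vision-language model could look like the following; the model checkpoint and prompt wording are illustrative assumptions, not necessarily the authors' exact setup:

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumption: a LLaVA-style captioner; any capable vision-language model would do.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

def spatial_recaption(image_path: str) -> str:
    """Generate a caption that foregrounds spatial relationships between objects."""
    prompt = (
        "USER: <image>\nDescribe the image, focusing on the spatial relationships "
        "between objects (left of, right of, above, below, in front of, behind). "
        "ASSISTANT:"
    )
    image = Image.open(image_path)
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=128)
    return processor.decode(output[0], skip_special_tokens=True)
```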
Through a three-fold evaluation and analysis pipeline, we find that SPRIGHT largely surpasses existing datasets in capturing spatial relationships. Leveraging only about 0.25% of SPRIGHT, we achieve a 22% improvement in generating spatially accurate images, while also improving FID (Fréchet Inception Distance) and CMMD (CLIP Maximum Mean Discrepancy) scores.
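For reference, FID measures the Fréchet distance between Gaussian fits of real and generated image features (conventionally Inception-v3 embeddings); lower is better for both FID and CMMD. A standard implementation of the formula, independent of the authors' evaluation code, is:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID between feature sets of shape (N, D):
    ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 * (C_r C_g)^(1/2))."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```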
Second, we find that training on images containing a large number of objects yields substantial improvements in spatial consistency. Notably, by fine-tuning on fewer than 500 images, we achieve a spatial score of 0.2133 on the T2I-CompBench benchmark, setting a new state of the art.
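The exact selection criterion is not given here, but the idea of an "object-rich" training subset can be sketched as a simple filter over detection-style annotations; the sample format and threshold below are hypothetical:

```python
# Hypothetical sample format: {"image": path, "caption": str, "objects": [label, ...]}
def select_object_rich(samples, min_objects: int = 10):
    """Keep only images annotated with many objects, which the authors report
    improves spatial consistency when used for fine-tuning."""
    return [s for s in samples if len(s["objects"]) >= min_objects]
```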
Finally, through a rigorous set of experiments and ablations, we document multiple findings that deepen the understanding of the factors affecting spatial consistency in text-to-image generation.
Project address: https://spright-t2i.github.io