Abstract
In this paper, we introduce PixArt-Σ, a Diffusion Transformer (DiT) model capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the ‘weaker’ baseline to a ‘stronger’ model by incorporating higher-quality data, a process we term “weak-to-strong training”. The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and user prompt adherence with a significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ’s capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.
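The token-compression idea in point (2) can be illustrated with a minimal sketch. The abstract does not say how keys and values are compressed, so the code below assumes plain average pooling over a 2D token grid as a stand-in. The function names (`pool_tokens`, `compressed_attention`, `attention_pairs`) and the `ratio` parameter are hypothetical and are not taken from the PixArt-Σ implementation.

```python
import math


def pool_tokens(tokens, height, width, ratio):
    """Average-pool a row-major (height * width) token grid by `ratio` per axis.

    Assumption: average pooling stands in for the paper's compression operator.
    """
    assert len(tokens) == height * width
    assert height % ratio == 0 and width % ratio == 0
    dim = len(tokens[0])
    pooled = []
    for top in range(0, height, ratio):
        for left in range(0, width, ratio):
            acc = [0.0] * dim
            for dy in range(ratio):
                for dx in range(ratio):
                    tok = tokens[(top + dy) * width + (left + dx)]
                    for i in range(dim):
                        acc[i] += tok[i]
            pooled.append([a / (ratio * ratio) for a in acc])
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compressed_attention(queries, keys, values, height, width, ratio):
    """Full-resolution queries attend to spatially pooled keys and values."""
    k_small = pool_tokens(keys, height, width, ratio)
    v_small = pool_tokens(values, height, width, ratio)
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim_v = len(v_small[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_small]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, v_small)) for i in range(dim_v)]
        )
    return outputs


def attention_pairs(num_tokens, ratio):
    """Query-key pairs scored per head: N * (N / ratio^2)."""
    return num_tokens * (num_tokens // (ratio * ratio))
```

Under these assumptions the output keeps one vector per query token, while the number of query-key pairs shrinks by a factor of `ratio ** 2`: for a 4×4 grid (16 tokens), `attention_pairs(16, 1)` is 256 and `attention_pairs(16, 2)` is 64. The same quadratic saving is what would make attention over very long token sequences, such as those of a 4K image, more tractable.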
| Original language | English |
|---|---|
| Title of host publication | Computer Vision – ECCV 2024 - 18th European Conference, Proceedings |
| Editors | Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol |
| Publisher | Springer Science and Business Media Deutschland GmbH |
| Pages | 74-91 |
| Number of pages | 18 |
| ISBN (Print) | 9783031734106 |
| DOIs | |
| Publication status | Published - 2025 |
| Externally published | Yes |
| Event | 18th European Conference on Computer Vision, ECCV 2024 - Milan, Italy. Duration: 29 Sept 2024 → 4 Oct 2024 |
Publication series
| Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
|---|---|
| Volume | 15090 LNCS |
| ISSN (Print) | 0302-9743 |
| ISSN (Electronic) | 1611-3349 |
Conference
| Conference | 18th European Conference on Computer Vision, ECCV 2024 |
|---|---|
| Country/Territory | Italy |
| City | Milan |
| Period | 29/09/24 → 4/10/24 |
Bibliographical note
Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
Keywords
- Diffusion Transformer
- Efficient Model
- T2I Synthesis
Title
PIXART-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation