SnapGen: Taming High-Resolution Text-To-Image Models for Mobile Devices with Efficient Architectures and Training

Dongting HU, Xijie HUANG, Huseyin COSKUN, Arpit SAHNI, Aarush GUPTA, Anujraaj GOYAL, Dishani LAHIRI, Rajesh SINGH, Yerlan IDELBAYEV, Junli CAO, Yanyu LI, Kwang Ting CHENG, Gary Shueng Han CHAN, Mingming GONG, Sergey TULYAKOV, Anil KAG, Yanwu XU, Jian REN, Jierun CHEN*

*Corresponding author for this work

Research output: Contribution to conference › Conference Paper › peer-review

1 Citation (Scopus)

Abstract

Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution, high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a much larger model, using a multi-level approach to guide the training of our model from scratch. Third, we enable few-step generation by integrating adversarial guidance with knowledge distillation. For the first time, our model, SnapGen, demonstrates the generation of 1024² px images on a mobile device in around 1.4 seconds. On ImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for 256² px generation. On T2I benchmarks (i.e., GenEval and DPG-Bench), our model, with merely 379M parameters, surpasses large-scale models with billions of parameters at a significantly smaller size (e.g., 7× smaller than SDXL, 14× smaller than IF-XL).
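The multi-level knowledge distillation described in the abstract can be sketched as a training loss that matches both intermediate features and final outputs of a larger teacher. The sketch below is illustrative only: the function name, weighting parameters, and the plain MSE formulation are assumptions, not the paper's actual objective (which additionally integrates adversarial guidance).

```python
import numpy as np

def multi_level_kd_loss(student_feats, teacher_feats,
                        student_out, teacher_out,
                        lambda_feat=0.5, lambda_out=1.0):
    """Illustrative multi-level distillation loss.

    student_feats / teacher_feats: lists of same-shaped intermediate
    feature maps (one pair per matched level).
    student_out / teacher_out: final model outputs (e.g. predicted noise).
    Returns a weighted sum of feature-matching and output-matching MSE.
    """
    # Feature-level term: sum of MSEs across the matched levels.
    feat_loss = sum(np.mean((s - t) ** 2)
                    for s, t in zip(student_feats, teacher_feats))
    # Output-level term: MSE between student and teacher predictions.
    out_loss = np.mean((student_out - teacher_out) ** 2)
    return lambda_feat * feat_loss + lambda_out * out_loss
```

In practice this term would be combined with the ordinary diffusion training loss, and (per the abstract) an adversarial term for few-step generation; those pieces are omitted here for brevity.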
Original language: English
Pages: 7997–8008
Number of pages: 12
DOIs
Publication status: Published - Jun 2025
Event: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025
Duration: 1 Jun 2025 – 1 Jun 2025

Conference

Conference: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025
Period: 1/06/25 – 1/06/25

Bibliographical note

Publisher Copyright:
© 2025 IEEE.

ISBNs

979-8-3315-4364-8
979-8-3315-4365-5

Keywords

  • efficient architecture
  • generative models
  • mobile text-to-image models

Fingerprint

Dive into the research topics of 'SnapGen: Taming High-Resolution Text-To-Image Models for Mobile Devices with Efficient Architectures and Training'. Together they form a unique fingerprint.