
Efficient training strategy for aesthetic text-to-image generation diffusion model

  • Jincheng YU

Student thesis: Master's thesis

Abstract

This thesis addresses the high resource cost of training recent large text-to-image (T2I) generative models. We propose a three-stage training strategy, with a dedicated dataset for each stage, that reduces training resources and time. (i) Pixel dependency learning: the model learns low-level pixel dependencies from the ImageNet dataset, capturing the intrinsic pixel relationships in natural images. (ii) Text-image alignment learning: the model learns textual concepts from the SAM dataset, whose captions are refined by a large vision-language model, so that textual concepts are aligned with their visual representations. (iii) High-resolution and aesthetic image generation: the model is fine-tuned on an internal dataset similar to JourneyDB to generate high-resolution, aesthetic images. When the three-stage strategy is combined with an existing parameter-efficient transformer-based diffusion model, experimental results show that our approach achieves comparable or even superior image quality and semantic control relative to the SOTA T2I model Stable Diffusion XL, while requiring only 10.8% of its training time.
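
The staged schedule described in the abstract can be pictured as a sequence in which each stage starts from the weights produced by the previous one. The sketch below is illustrative only and is not the thesis implementation: the `Stage` class, the `run_stages` function, and the `train_one_stage` callback are hypothetical names, and the actual training procedure is not specified in the abstract.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass(frozen=True)
class Stage:
    """One training stage (hypothetical structure, for illustration)."""
    name: str
    dataset: str
    goal: str


# Stage names, datasets, and goals are taken from the abstract.
STAGES: List[Stage] = [
    Stage("pixel_dependency", "ImageNet",
          "learn low-level pixel dependencies"),
    Stage("text_image_alignment", "SAM with VLM-refined captions",
          "align textual concepts with visual representations"),
    Stage("high_res_aesthetic", "internal dataset similar to JourneyDB",
          "fine-tune for high-resolution, aesthetic images"),
]


def run_stages(
    stages: List[Stage],
    train_one_stage: Callable[[Any, Stage], Any],
    init_weights: Any,
) -> Tuple[Any, List[str]]:
    """Run the stages in order, feeding each stage's output weights
    into the next stage. `train_one_stage` is a caller-supplied
    placeholder for whatever training loop a stage would use."""
    weights = init_weights
    completed: List[str] = []
    for stage in stages:
        weights = train_one_stage(weights, stage)
        completed.append(stage.name)
    return weights, completed
```

The point of the sequential hand-off is that later, more expensive stages begin from an already useful initialization instead of from scratch.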
Date of Award: 2024
Original language: English
Awarding Institution
  • The Hong Kong University of Science and Technology
Supervisor: James Tin Yau KWOK
