Thesis Details
Overview
 
Exploring Downstream Tasks of Text-to-Image Generative Models 
 
Subject 
Text-to-image generative models, which synthesize images from a text prompt, have advanced rapidly in the last few years and can now produce realistic-looking, novel images. Recent models fall into two families: diffusion-based (e.g., Stable Diffusion, DiT, Flux) and autoregressive (e.g., DALL-E, Parti). While these models excel at generating images, leveraging them for other tasks in a zero-shot fashion (i.e., without additional training) remains largely underexplored. Such tasks include classification, segmentation, object detection, image-text retrieval, and image-image retrieval.

This master's thesis aims to analyze the internal latent representations of text-to-image models and to study how well their image and text representations are aligned. Since a generative model must map a text prompt onto an image, we hypothesize that its feature space exhibits strong visual-textual alignment, which would allow these powerful models to be used for several zero-shot applications.
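To make the alignment question concrete, one common way to quantify it is the cosine similarity between pooled text and image feature vectors: an aligned pair should score higher than a mismatched one. The sketch below is a minimal toy illustration of that measurement; the hard-coded vectors are placeholders, whereas in the thesis the features would be extracted from the model's internal layers (e.g., the text encoder output and intermediate U-Net or DiT activations).

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for pooled features; real ones would come from the model.
text_feat = [0.20, 0.90, 0.10]       # pooled text embedding
image_feat = [0.25, 0.85, 0.05]      # pooled image feature (matching image)
unrelated_feat = [0.90, -0.10, 0.40] # feature of an unrelated image

aligned = cosine_similarity(text_feat, image_feat)
misaligned = cosine_similarity(text_feat, unrelated_feat)
print(aligned > misaligned)  # a well-aligned pair should score higher
```

The same score, computed per layer and per timestep, would also indicate *where* in the network the visual-textual alignment is strongest.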
Kind of work 
The student will perform the following:

Analyze the alignment and shared space of visual-textual features in text-to-image models

Leverage this feature space to perform zero-shot applications


The project will employ real-world datasets such as ImageNet and COCO.
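As one concrete example of such a zero-shot application, classification can be cast as scoring each class prompt by how well the model reconstructs the image under that prompt, in the spirit of Diffusion Classifier-style approaches. The sketch below is a toy illustration only: `denoising_error` is a hypothetical stand-in for a real diffusion model's conditional noise-prediction loss, and the per-class "concept" vectors are fabricated for the example.

```python
# Hypothetical per-class concept vectors; a real system would instead
# condition a diffusion model on each prompt and measure its loss.
CLASS_CONCEPTS = {
    "a photo of a cat": [0.9, 0.1],
    "a photo of a dog": [0.1, 0.9],
}

def denoising_error(image_feat, concept):
    # Stand-in for the conditional denoising loss E[||eps - eps_theta||^2];
    # here just a squared distance between toy vectors.
    return sum((a - b) ** 2 for a, b in zip(image_feat, concept))

def zero_shot_classify(image_feat):
    # Assign the image to the prompt with the lowest reconstruction error.
    return min(CLASS_CONCEPTS,
               key=lambda p: denoising_error(image_feat, CLASS_CONCEPTS[p]))

print(zero_shot_classify([0.8, 0.2]))  # -> "a photo of a cat"
```

Evaluating this pipeline on ImageNet (classification) and COCO (detection, retrieval) would follow the same pattern, replacing the toy error with the actual model loss.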
Expected Student Profile 
Strong knowledge of machine learning, AI, and deep learning.

Good understanding of Transformer-based models.

Strong experience in Python programming and the PyTorch deep learning framework.