Thesis Details
Overview
 
Exploring Downstream Tasks of Text-to-Image Generative Models 
 
Subject 
Text-to-image generative models, which synthesize images from a text prompt, have advanced rapidly in the last few years and can now produce realistic-looking, novel images. Recent models fall into two families: diffusion-based (e.g., Stable Diffusion, DiT, Flux) and autoregressive (e.g., DALL-E, Parti). While these models excel at generating images, leveraging them for other tasks in a zero-shot fashion (i.e., without additional training) remains largely underexplored. Such tasks include classification, segmentation, object detection, image-text retrieval, and image-image retrieval.

This master's thesis aims to analyze the internal latent representations of text-to-image models and to study how well their image and text representations are aligned. Since a generative model must map a text prompt onto an image, we hypothesize that its feature space exhibits strong visual-textual alignment, which would allow these powerful models to be used for several zero-shot applications.
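To make the alignment question concrete, one common way to quantify it is the cosine similarity between pooled text and image feature vectors: an aligned pair should score higher than a mismatched one. The sketch below is a minimal toy illustration of that measurement; the hard-coded vectors are placeholders, whereas in the thesis the features would be extracted from the model's internal layers (e.g., the text encoder output and intermediate U-Net or DiT activations).

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for pooled features; real ones would come from the model.
text_feat = [0.20, 0.90, 0.10]       # pooled text embedding
image_feat = [0.25, 0.85, 0.05]      # pooled image feature (matching image)
unrelated_feat = [0.90, -0.10, 0.40] # feature of an unrelated image

aligned = cosine_similarity(text_feat, image_feat)
misaligned = cosine_similarity(text_feat, unrelated_feat)
print(aligned > misaligned)  # a well-aligned pair should score higher
```

The same score, computed per layer and per timestep, would also indicate *where* in the network the visual-textual alignment is strongest.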
Kind of work 
The student will perform the following:

Analyze the alignment and shared space of visual-textual features in text-to-image models

Leverage this feature space to perform zero-shot applications


The project will employ real-world datasets such as ImageNet and COCO.
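As one concrete example of such a zero-shot application, classification can be cast as scoring each class prompt by how well the model reconstructs the image under that prompt, in the spirit of Diffusion Classifier-style approaches. The sketch below is a toy illustration only: `denoising_error` is a hypothetical stand-in for a real diffusion model's conditional noise-prediction loss, and the per-class "concept" vectors are fabricated for the example.

```python
# Hypothetical per-class concept vectors; a real system would instead
# condition a diffusion model on each prompt and measure its loss.
CLASS_CONCEPTS = {
    "a photo of a cat": [0.9, 0.1],
    "a photo of a dog": [0.1, 0.9],
}

def denoising_error(image_feat, concept):
    # Stand-in for the conditional denoising loss E[||eps - eps_theta||^2];
    # here just a squared distance between toy vectors.
    return sum((a - b) ** 2 for a, b in zip(image_feat, concept))

def zero_shot_classify(image_feat):
    # Assign the image to the prompt with the lowest reconstruction error.
    return min(CLASS_CONCEPTS,
               key=lambda p: denoising_error(image_feat, CLASS_CONCEPTS[p]))

print(zero_shot_classify([0.8, 0.2]))  # -> "a photo of a cat"
```

Evaluating this pipeline on ImageNet (classification) and COCO (detection, retrieval) would follow the same pattern, replacing the toy error with the actual model loss.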
Expected Student Profile 
Strong knowledge of machine learning, AI, and deep learning.

Good understanding of Transformer-based models.

Strong experience in Python programming and the PyTorch deep learning framework.