⚙️Encoders
Pretrained text encoders
Text-to-image models need powerful semantic text encoders to capture the complexity and compositionality of arbitrary natural language inputs. Text encoders trained on paired image-text data are standard in current text-to-image models; they can be trained from scratch or pre-trained on image-text data.
The image-text training objectives suggest that these text encoders may encode visually semantic, meaningful representations that are especially relevant to the text-to-image generation task. Large language models are another natural choice for encoding text for text-to-image generation.
Approach
Recent progress in large language models (e.g., BERT) and image-text pre-trained encoders (e.g., CLIP) has led to leaps in textual understanding and generative capabilities. It thus becomes natural to explore both families of text encoders for the text-to-image task.
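
As a concrete illustration, here is a minimal sketch of obtaining text embeddings from both encoder families. It assumes PyTorch and the Hugging Face transformers library with public checkpoints (openai/clip-vit-large-patch14 and bert-base-uncased); none of these specific names come from this page, and any comparable checkpoints would work.

```python
# Minimal sketch: text embeddings from both encoder families.
# Assumes PyTorch + Hugging Face transformers; model names are examples only.
import torch
from transformers import BertModel, BertTokenizer, CLIPTextModel, CLIPTokenizer

prompts = ["a corgi riding a skateboard", "an oil painting of a fox in autumn"]

# Family 1: a text encoder pre-trained on paired image-text data (CLIP's text tower).
clip_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Family 2: a large language model pre-trained on text only (BERT).
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_encoder = BertModel.from_pretrained("bert-base-uncased")

with torch.no_grad():
    clip_tokens = clip_tokenizer(prompts, padding=True, return_tensors="pt")
    clip_embeddings = clip_encoder(**clip_tokens).last_hidden_state  # (batch, seq, 768)

    bert_tokens = bert_tokenizer(prompts, padding=True, return_tensors="pt")
    bert_embeddings = bert_encoder(**bert_tokens).last_hidden_state  # (batch, seq, 768)

print(clip_embeddings.shape, bert_embeddings.shape)
```

Either sequence of per-token embeddings can then be fed to the text-to-image model as conditioning.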
For simplicity, we freeze the weights of these text encoders. Freezing has several advantages, such as offline computation of embeddings, which results in a negligible computation and memory footprint during training of the text-to-image model. In our work, we find clear evidence that scaling the text encoder size improves the quality of text-to-image generation.
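
The sketch below illustrates the freezing and offline pre-computation idea under the same assumptions as above (PyTorch, transformers, an example CLIP checkpoint); the helper function and cache file name are hypothetical.

```python
# Minimal sketch: freeze a pre-trained text encoder and cache prompt embeddings
# offline, so training the text-to-image model only touches precomputed tensors.
# Assumes PyTorch + Hugging Face transformers; names below are illustrative.
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Freeze: no gradients flow into the text encoder, and it stays in eval mode.
text_encoder.requires_grad_(False)
text_encoder.eval()

@torch.no_grad()
def embed(prompts):
    tokens = tokenizer(prompts, padding="max_length", truncation=True, return_tensors="pt")
    return text_encoder(**tokens).last_hidden_state  # (batch, 77, hidden)

# One offline pass over the training captions; the resulting cache is all the
# text-to-image model needs, so the encoder adds no training-time cost.
captions = ["a corgi riding a skateboard", "an oil painting of a fox in autumn"]
cache = {caption: embed([caption]).squeeze(0) for caption in captions}
torch.save(cache, "text_embedding_cache.pt")
```

Because the encoder is frozen, this pass can be run once per dataset and the cached embeddings reused across text-to-image training runs.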
