Generating Images with Multimodal Language Models
Last updated: August 11, 2023 (afternoon)
Research Notes

Abstract
We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue. Ours is the first approach capable of conditioning on arbitrarily interleaved image and text inputs to generate coherent image (and text) outputs. To achieve strong performance on image generation, we propose an efficient mapping network to ground the LLM to an off-the-shelf text-to-image generation model. This mapping network translates hidden representations of text into the embedding space of the visual models, enabling us to leverage the strong text representations of the LLM for visual outputs. Our approach outperforms baseline generation models on tasks with longer and more complex language. In addition to novel image generation, our model is also capable of image retrieval from a prespecified dataset, and decides whether to retrieve or generate at inference time. This is done with a learnt decision module which conditions on the hidden representations of the LLM. Our model exhibits a wider range of capabilities compared to prior multimodal language models. It can process image-and-text inputs, and produce retrieved images, generated images, and generated text — outperforming non-LLM based generation models across several text-to-image tasks that measure context dependence.
Model

GILL (Generating Images with Large Language Models) is capable of processing arbitrarily interleaved image-and-text inputs to generate text, retrieve images, and generate novel images.
Our findings show that it is possible to efficiently map the output embedding space of a frozen text-only LLM to the embedding space of a frozen text-to-image generation model (in this work, Stable Diffusion) despite both models using entirely different text encoders. We achieve this by finetuning a small number of parameters on image-caption pairs, in contrast to other methods which require interleaved image-text training data. Our approach is computationally efficient and does not require running the image generation model at training time.
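The core idea can be illustrated with a short PyTorch sketch. This is a minimal sketch under stated assumptions, not GILL's exact implementation: the module names, dimensions, layer counts, and number of query tokens below are illustrative placeholders. A small trainable mapping network translates the LLM hidden states at learned [IMG] token positions into the conditioning space of the frozen text-to-image model, and is supervised with a regression loss against the generator's own text-encoder embedding of the caption, so the diffusion model itself never has to be run during training.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Translates frozen-LLM hidden states at the learned [IMG] token positions
    into the conditioning space of a frozen text-to-image model.
    Dimensions and layer counts here are illustrative placeholders."""

    def __init__(self, llm_dim=4096, gen_dim=768, num_query_tokens=77):
        super().__init__()
        # Learned query vectors attend over the projected LLM hidden states.
        self.queries = nn.Parameter(torch.randn(num_query_tokens, gen_dim))
        self.proj = nn.Linear(llm_dim, gen_dim)
        layer = nn.TransformerDecoderLayer(d_model=gen_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, img_token_hidden):            # (B, n_img_tokens, llm_dim)
        memory = self.proj(img_token_hidden)         # (B, n_img_tokens, gen_dim)
        queries = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        return self.decoder(queries, memory)         # (B, num_query_tokens, gen_dim)

def training_step(mapper, img_token_hidden, caption_text_emb):
    """One step on an image-caption pair: regress the mapped embeddings onto the
    frozen generator's text-encoder embedding of the caption. The diffusion
    U-Net is never invoked, which keeps training cheap."""
    pred = mapper(img_token_hidden)
    return nn.functional.mse_loss(pred, caption_text_emb)
```

The sketch only conveys the shape of the computation; in our model the comparable component is the GILLMapper module described below, and only this small mapping network (plus a few learned tokens) is trained while the LLM and the image models stay frozen.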
During inference, the model takes in arbitrarily interleaved image and text inputs, and produces text interleaved with image embeddings. After deciding whether to retrieve or generate for a particular set of tokens, it returns the appropriate image outputs (either retrieved or generated).
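As a rough illustration of the retrieve-or-generate decision, here is a toy sketch; the classifier architecture, the pooling over [IMG] token positions, and the dimensions are assumptions for illustration, not GILL's actual decision module.

```python
import torch
import torch.nn as nn

class RetrieveOrGenerate(nn.Module):
    """Toy decision head: classifies, from the LLM hidden states at the image
    token positions, whether to retrieve an existing image or generate a new
    one. Architecture and dimensions are illustrative assumptions."""

    def __init__(self, llm_dim=4096):
        super().__init__()
        self.classifier = nn.Linear(llm_dim, 2)   # class 0: retrieve, class 1: generate

    def forward(self, img_token_hidden):           # (B, n_img_tokens, llm_dim)
        pooled = img_token_hidden.mean(dim=1)      # pool over the [IMG] token positions
        return self.classifier(pooled)

# Usage sketch: dispatch on the predicted class at inference time.
decide = RetrieveOrGenerate()
hidden = torch.randn(1, 4, 4096)                   # hidden states for 4 [IMG] tokens
if decide(hidden).argmax(-1).item() == 0:
    print("retrieve an image from the candidate set")
else:
    print("generate a novel image with the frozen text-to-image model")
```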
Capabilities
One of the more compelling aspects of GILL is its ability to generalize to many different tasks, which it inherits from the pretrained, frozen LLM backbone. We showcase several of these capabilities below; more qualitative results are provided in our paper.
Multimodal Dialogue Generation

GILL can be prompted to generate dialogue-like text, producing multimodal dialogue by interleaving generated text, retrieved images, and generated images.
Generating from Visual Stories

GILL can condition on interleaved image-and-text inputs to generate more relevant images compared to non-LLM based text-to-image generation models.
Comparison Against Stable Diffusion

The GILLMapper module we introduce allows our model to map effectively to the SD image generation backbone, outperforming or matching SD for many examples from PartiPrompts.
Comparison Against FROMAGe

Our model combines multimodal information to produce relevant image and text outputs. It can outperform baseline models that are limited to image retrieval.
Limitations
While GILL introduces many exciting capabilities, it is an early research prototype and has several limitations. GILL relies on an LLM backbone for many of its capabilities. As such, it also inherits many of the limitations that are typical of LLMs. More details and discussion are provided in our paper and appendix:
- GILL does not always produce images when prompted, or when it is (evidently) useful for the dialogue.
- GILL's visual processing is limited. At the moment, we use only 4 visual vectors to represent each input image (due to computational constraints), which may not capture all the relevant visual information needed for downstream tasks.
- Our model inherits some of the unintended behaviors of LLMs, such as the potential for hallucinations, where it generates content that is false or not relevant to the input data. It also sometimes generates repetitive text, and does not always generate coherent dialogue text.
One of the advantages of our model is that it is modular, and can benefit from stronger visual and language models released in the future. It is likely to benefit from stronger text-to-image generation backbones as well, or from finetuning the generation backbone rather than just the GILLMapper module. We leave such scaling explorations for future work.
- Research Problem: The research problem addressed in this paper is the fusion of large language models (LLMs) with pre-trained image encoder and decoder models. The goal is to create a model that can handle multimodal capabilities, such as image retrieval, novel image generation, and multimodal dialogue. The model should be able to condition on arbitrarily interleaved image and text inputs to generate coherent image (and text) outputs.
- Motivation: The motivation behind this research is to leverage the strong text representations of LLMs for visual outputs, and to outperform baseline generation models on tasks with longer and more complex language. The authors also aim for a model that decides whether to retrieve or generate at inference time, processes image-and-text inputs, and produces retrieved images, generated images, and generated text.
- Solution: The authors propose Generating Images with Large Language Models (GILL), which efficiently maps the output embedding space of a frozen text-only LLM to that of a frozen text-to-image generation model by finetuning a small number of parameters on image-caption pairs. They introduce the GILLMapper module, an efficient architecture for learning this LLM-to-generation mapping. To decide whether to produce a retrieved or a generated image at inference time, they train a decision model conditioned on the LM hidden representations. A sketch of how mapped embeddings can feed the frozen generation backbone is shown after this list.
- Novelty: The novelty of this research lies in the proposed method's ability to process arbitrarily interleaved image-and-text inputs to generate text, retrieve images, and generate novel images. It is the first model capable of outputting retrieved images, novel images, and text, interleaving these for coherent multimodal dialogue generation. The authors also propose efficient architectural changes to learn the LLM-to-generation mapping effectively with the GILLMapper module, a lightweight Transformer conditioned on special learnt text tokens.
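To make the generation path concrete, here is a hedged sketch of how mapped embeddings could be passed to a frozen Stable Diffusion pipeline via the diffusers library's `prompt_embeds` argument. The checkpoint name and the random stand-in for the mapping network's output are illustrative assumptions, not GILL's exact integration.

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative checkpoint; the exact Stable Diffusion version used may differ.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Stand-in for the mapping network's output from the earlier sketch:
# SD 1.x conditions on text embeddings of shape (batch, 77, 768).
prompt_embeds = torch.randn(1, 77, 768, dtype=torch.float16, device="cuda")

# Pass the mapped embeddings in place of the pipeline's own text encoding.
image = pipe(prompt_embeds=prompt_embeds).images[0]
image.save("gill_output.png")
```

Because the generation backbone only sees conditioning embeddings in its expected space, it can in principle be swapped for a stronger frozen text-to-image model without retraining the LLM.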