A Model for Diffusion with Multimodal Context


Generative AI models like OpenAI's DALL-E 2, Midjourney, or Stable Diffusion process text to generate original images. In contrast, the M-VADER diffusion model, developed by Aleph Alpha together with TU Darmstadt, can fuse multimodal inputs, such as a photo or a sketch combined with a textual description, into a new image.

M-VADER is a diffusion model (DM) for image generation in which the output can be specified using arbitrary combinations of images and text. The authors show how M-VADER generates images specified by combinations of image and text, and by combinations of multiple images. A number of successful DM image-generation algorithms have previously been introduced that make it possible to specify the output image with a text prompt. Inspired by the success of those models, and guided by the idea that language evolved to describe the elements of visual contexts humans find most important, the authors introduce an embedding model closely related to a vision-language model. Specifically, they introduce S-MAGMA: a 13-billion-parameter multimodal decoder that combines components of the autoregressive vision-language model MAGMA with biases finetuned for semantic search.
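Conceptually, this means embedding every input, whether image or text, into one shared sequence that conditions the diffusion model. The toy sketch below illustrates that idea only; the function names, dimensions, and random embeddings are illustrative assumptions, not the actual M-VADER or S-MAGMA API.

```python
import numpy as np

EMBED_DIM = 8  # toy dimension; the real S-MAGMA decoder has 13B parameters

rng = np.random.default_rng(0)

def embed_tokens(n_tokens: int) -> np.ndarray:
    """Stand-in for a multimodal encoder: one embedding per token or image patch."""
    return rng.standard_normal((n_tokens, EMBED_DIM))

def fuse_context(*sequences: np.ndarray) -> np.ndarray:
    """Concatenate embeddings from any mix of images and text prompts into a
    single conditioning sequence, mirroring how a multimodal decoder can treat
    heterogeneous inputs as one context for the diffusion model."""
    return np.concatenate(sequences, axis=0)

# A sketch image plus a text prompt become one conditioning context.
sketch = embed_tokens(16)   # e.g. 16 image-patch embeddings
prompt = embed_tokens(5)    # e.g. 5 text-token embeddings
context = fuse_context(sketch, prompt)
print(context.shape)  # (21, 8)
```

Because the fused context is just a sequence of embeddings, adding a second image or a longer prompt only lengthens the sequence; no per-modality branching is needed downstream.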

M-VADER screenshots
