Generative Foundation Models
The Lumina series models are flow-based diffusion transformers designed to transform text into any modality with enhanced scalability and efficiency. The series is expected to efficiently generate high-fidelity data points at arbitrary resolution, serving as a diffusion-based foundation model. The models are also envisioned as a family of multimodal autoregressive models that can perform both text-centric and image-centric tasks, such as image captioning, multi-turn dialog, and any-resolution text-to-image generation.
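To make the "flow-based" part concrete, below is a minimal sketch of the rectified-flow / flow-matching training objective that this family of models builds on. It is not the Lumina implementation: the model here is a hypothetical placeholder standing in for the diffusion transformer, and all names (`toy_velocity_model`, `flow_matching_loss`) are illustrative assumptions. The key idea shown is that the network is trained to regress the constant velocity along a straight path from noise to data.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_velocity_model(x_t, t):
    # Hypothetical stand-in for the diffusion transformer backbone;
    # a real model would condition on text and the timestep t.
    return -x_t

def flow_matching_loss(x0, x1, t, model):
    # Linear interpolation path between noise x0 and data x1
    # (rectified-flow formulation): x_t = (1 - t) * x0 + t * x1.
    x_t = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    # Along this straight path the velocity is constant: x1 - x0.
    target = x1 - x0
    pred = model(x_t, t)
    # Mean-squared error between predicted and true velocity.
    return float(np.mean((pred - target) ** 2))

noise = rng.standard_normal((4, 8))   # x0 ~ N(0, I)
data = rng.standard_normal((4, 8))    # stand-in "data points"
t = rng.uniform(size=4)               # per-sample timesteps in [0, 1]
loss = flow_matching_loss(noise, data, t, toy_velocity_model)
print(loss)
```

At sampling time, the learned velocity field is integrated from t = 0 (noise) to t = 1 (data), which is what lets such models trade off step count against fidelity.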
By adopting a unified architecture for images and text, the Lumina models benefit from the well-studied scalability properties of large language models (LLMs) and can seamlessly integrate infrastructure and techniques developed for LLMs, optimizing both training and inference.