Abstract: We present CoDi-2, a Multimodal Large Language Model (MLLM) for learning in-context interleaved multimodal representations. By aligning modalities with language for both encoding and ...