Large Language Model as a Unified Interface for Vision-Language Multi-task Learning

King Abdullah University of Science and Technology · Meta AI Research
Equal last-author contribution


Large language models have shown remarkable capabilities as a general interface for various language-related applications. Motivated by this, we aim to build a unified interface for completing many vision-language tasks, including image description, visual question answering, and visual grounding, among others. The challenge is to perform diverse vision-language tasks effectively with a single model and simple multi-modal instructions. To address this, we introduce MiniGPT-v2, a model that can be treated as a unified interface for better handling various vision-language tasks. We propose using unique identifiers for different tasks when training the model. These identifiers enable our model to distinguish each task instruction effortlessly and also improve learning efficiency for each task. After our three-stage training, experimental results show that MiniGPT-v2 achieves strong performance on many visual question answering and visual grounding benchmarks compared with other vision-language generalist models.
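As a minimal sketch of how such task identifiers might be combined with a multi-modal instruction, the snippet below builds a prompt that prefixes the user instruction with a task token. The specific identifier strings (`[vqa]`, `[grounding]`, ...) and the template layout are assumptions for illustration, not the exact strings used in training.

```python
# Hedged sketch: multi-modal instruction template with task identifiers.
# The identifier tokens and the [INST]/<Img> template below are assumed
# for illustration; they follow the general recipe described above.

TASK_IDENTIFIERS = {
    "visual_question_answering": "[vqa]",
    "image_captioning": "[caption]",
    "visual_grounding": "[grounding]",
    "referring_expression": "[refer]",
}

TEMPLATE = "[INST] <Img>{image_tokens}</Img> {task_id} {instruction} [/INST]"


def build_instruction(task: str, instruction: str,
                      image_tokens: str = "<ImageHere>") -> str:
    """Prefix the instruction with its task identifier so the model can
    distinguish which task it is being asked to perform."""
    task_id = TASK_IDENTIFIERS[task]
    return TEMPLATE.format(image_tokens=image_tokens,
                           task_id=task_id, instruction=instruction)


print(build_instruction("visual_question_answering",
                        "What color is the car?"))
# [INST] <Img><ImageHere></Img> [vqa] What color is the car? [/INST]
```

Because every training sample carries its identifier, the model sees an unambiguous signal about which task a given instruction belongs to, which is what the paper credits for the improved per-task learning efficiency.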


MiniGPT-v2 consists of three components: a visual backbone, a linear projection layer, and a large language model.

The architecture of MiniGPT-v2.

Qualitative Results


  title={MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning},
  author={Jun Chen and Deyao Zhu and Xiaoqian Shen and Xiang Li and Zechun Liu and Pengchuan Zhang and Raghuraman Krishnamoorthi and Vikas Chandra and Yunyang Xiong and Mohamed Elhoseiny},


This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.