
Why The Right AI Backbones Trump Raw Size Every Time

Abstract and 1 Introduction

2 Terminology

3 Exploring the design space of vision-language models and 3.1 Are all pre-trained backbones equivalent for VLMs?

3.2 How does the fully autoregressive architecture compare to the cross-attention architecture?

3.3 Where are the efficiency gains?

3.4 How can one trade compute for performance?

4 Idefics2 – an open state-of-the-art vision-language foundation model and 4.1 Multi-stage pre-training

4.2 Instruction fine-tuning and 4.3 Optimizing for chat scenarios

5 Conclusion, Acknowledgement, and References

A Appendix

A.1 Further experimental details of the ablations

A.2 Details of the instruction fine-tuning

A.3 Details of the evaluations

A.4 Red-teaming

2 Terminology

We first establish shared terminology for discussing the different design choices. Training VLMs typically requires gluing together a pre-trained vision backbone and a pre-trained language backbone by initializing new parameters to connect the two modalities. These new parameters are trained during the pre-training phase, which commonly leverages a large multimodal dataset such as image-caption pairs. We note that even though it is most common to start from two separate unimodal pre-trained backbones, the parameters of these two backbones can optionally be shared and initialized from scratch, as done in Bavishi et al. (2023). As in the large language model literature, the pre-training stage is followed by an instruction fine-tuning stage, in which the model learns from task-oriented samples.
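
As a concrete illustration, here is a minimal PyTorch sketch of that setup; the class and attribute names (SimpleVLM, modality_projection, etc.) are hypothetical stand-ins, not the actual Idefics2 implementation.

```python
import torch.nn as nn


class SimpleVLM(nn.Module):
    """Glues a pre-trained vision backbone to a pre-trained language backbone."""

    def __init__(self, vision_backbone: nn.Module, language_backbone: nn.Module,
                 vision_dim: int, text_dim: int):
        super().__init__()
        self.vision_backbone = vision_backbone      # pre-trained vision encoder
        self.language_backbone = language_backbone  # pre-trained LLM
        # Newly initialized parameters connecting the two modalities.
        self.modality_projection = nn.Linear(vision_dim, text_dim)

    def embed_images(self, pixel_values):
        # (batch, num_patches, vision_dim) -> (batch, num_patches, text_dim)
        visual_features = self.vision_backbone(pixel_values)
        return self.modality_projection(visual_features)


# Pre-training phase: the new parameters are optimized on a large multimodal
# dataset (e.g. image-caption pairs); whether the backbones are also updated
# depends on the training recipe.
def newly_initialized_parameters(model: SimpleVLM):
    return model.modality_projection.parameters()
```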

Recent works explore two main choices for combining the visual inputs and the text inputs. In the cross-attention architecture (Alayrac et al., 2022; Laurençon et al., 2023; Awadalla et al., 2023), the images encoded through the vision backbone are injected at different layers within the language model by interleaving cross-attention blocks in which the text cross-attends to the image hidden states. In contrast, in the fully autoregressive architecture (Koh et al., 2023; Driess et al., 2023; Liu et al., 2023), the output of the vision encoder is directly concatenated to the sequence of text embeddings, and the entire sequence is passed as input to the language model. The input sequence of the language model is thus the concatenation of visual tokens and text tokens. The sequence of visual tokens can optionally be pooled into a shorter sequence, improving compute efficiency. We refer to the layers that map the vision hidden space to the text hidden space as modality projection layers. Figure 2 highlights the fully autoregressive architecture we ultimately use for Idefics2.

Figure 2: Idefics2 fully-autoregressive architecture: Input images are processed by the vision encoder. The resulting visual features are mapped (and optionally pooled) to the LLM input space to get the visual tokens (64 in our standard configuration). They are concatenated (and potentially interleaved) with the input sequence of text embeddings (green and red columns). The concatenated sequence is fed to the language model (LLM), which predicts the output text tokens.
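
For concreteness, below is a minimal sketch of the forward pass shown in Figure 2; the function signature and the `inputs_embeds` keyword (a Hugging Face-style convention) are assumptions rather than the actual Idefics2 code, and pooling to 64 tokens mirrors the standard configuration mentioned in the caption.

```python
import torch


def fully_autoregressive_forward(vision_encoder, modality_projection, pooler,
                                 llm, pixel_values, text_embeddings):
    # 1. Encode the image(s): (batch, num_patches, vision_dim)
    visual_features = vision_encoder(pixel_values)
    # 2. Map to the LLM input space: (batch, num_patches, text_dim)
    visual_tokens = modality_projection(visual_features)
    # 3. Optionally pool to a shorter sequence for compute efficiency,
    #    e.g. 64 visual tokens in the standard configuration.
    visual_tokens = pooler(visual_tokens)            # (batch, 64, text_dim)
    # 4. Concatenate (or interleave) the visual tokens with the text embeddings.
    inputs_embeds = torch.cat([visual_tokens, text_embeddings], dim=1)
    # 5. The LLM processes the whole sequence and predicts the output text tokens.
    return llm(inputs_embeds=inputs_embeds)
```

In the cross-attention alternative, step 4 is replaced by cross-attention blocks interleaved inside the language model that attend to the image hidden states, instead of extending the input sequence.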

3 Exploring the design space of vision-language models

In this section, we compare recurring design choices from the vision-language model literature and highlight our findings. Unless specified otherwise, we run the ablations for 6,000 steps and report the average 4-shot performance across 4 downstream benchmarks measuring different capabilities: VQAv2 (Goyal et al., 2017) for general visual question answering, TextVQA (Singh et al., 2019) for OCR abilities, OKVQA (Marino et al., 2019) for external knowledge, and COCO (Lin et al., 2014) for captioning.
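
As a small clarification of how that aggregate is formed, the sketch below averages the four per-benchmark 4-shot scores; the values are placeholders, not results from the paper.

```python
# Each benchmark is scored with its own metric (e.g. VQA accuracy for the QA
# tasks, CIDEr for COCO captioning); the reported ablation number is the mean.
four_shot_scores = {
    "VQAv2": 0.0,    # general visual question answering
    "TextVQA": 0.0,  # OCR abilities
    "OKVQA": 0.0,    # external knowledge
    "COCO": 0.0,     # captioning
}  # placeholder values only
average_score = sum(four_shot_scores.values()) / len(four_shot_scores)
print(f"Average 4-shot score: {average_score:.1f}")
```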

3.1 Are all pre-trained backbones equivalent for VLMs?

Most recent VLMs start from pre-trained unimodal backbones. How does the choice of the backbones (vision and text) influence the performance of the resulting VLM?

Table 1: Ablation on the language model backbone.

We fix the size of the pretrained backbones, the data used for multimodal pre-training, and the number of training updates. Under the cross-attention architecture, we observe that the greatest improvement in performance on vision-language benchmarks comes from changing the language model to a better one. More specifically, replacing LLaMA-1-7B (Touvron et al., 2023) (35.1% on MMLU (Hendrycks et al., 2021)) with Mistral-7B (Jiang et al., 2023) (60.1% on MMLU) yields a boost of 5.1 points (see Table 1). Additionally, switching the vision encoder from CLIP-ViT-H (Radford et al., 2021) (78.0% on ImageNet (Deng et al., 2009)) to SigLIP-SO400M (Zhai et al., 2023) (83.2% on ImageNet) yields a 3.3-point increase in performance on the benchmarks (see Table 2). This result on better vision backbones corroborates observations from Karamcheti et al. (2024).
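
As an illustration of how such a backbone swap can be wired up, here is a sketch assuming Hugging Face `transformers` checkpoints; the repository names are illustrative stand-ins for the backbones discussed, not the exact setup used in the paper.

```python
from transformers import AutoModel, AutoModelForCausalLM

# Illustrative checkpoint names for the backbones compared above.
VISION_BACKBONES = {
    "clip-vit-h": "laion/CLIP-ViT-H-14-laion2B-s32B-b79K",  # 78.0% on ImageNet
    "siglip-so400m": "google/siglip-so400m-patch14-384",    # 83.2% on ImageNet
}
LANGUAGE_BACKBONES = {
    "llama-1-7b": "huggyllama/llama-7b",        # 35.1% on MMLU (unofficial mirror)
    "mistral-7b": "mistralai/Mistral-7B-v0.1",  # 60.1% on MMLU
}


def build_backbones(vision_key: str, language_key: str):
    """Instantiate one (vision encoder, language model) pair for an ablation run;
    the data, backbone sizes, and number of training updates stay fixed so that
    only the backbone choice varies."""
    vision = AutoModel.from_pretrained(VISION_BACKBONES[vision_key])
    language = AutoModelForCausalLM.from_pretrained(LANGUAGE_BACKBONES[language_key])
    return vision, language


# e.g. build_backbones("siglip-so400m", "mistral-7b")
```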

We note that Chen and Wang (2022) report a stronger increase in performance from scaling the size of the vision encoder than from scaling the size of the language model, even though scaling the vision encoder leads to a smaller increase in parameter count. Although EVA-CLIP-5B (Sun et al., 2023) is ten times larger in parameter count than SigLIP-SO400M (Zhai et al., 2023), we obtain similar performance across the 4 benchmarks, suggesting that EVA-CLIP-5B could be heavily under-trained; we acknowledge that the open VLM community is missing a large, well-trained vision encoder.

Table 2: Ablation on the vision encoder backbone.

Authors:

(1) Hugo Laurençon, Hugging Face and Sorbonne Université (the order was chosen randomly);

(2) Léo Tronchon, Hugging Face (the order was chosen randomly);

(3) Matthieu Cord, Sorbonne Université;

(4) Victor Sanh, Hugging Face.

