More Than a Feeling: Visualizing Why Filter Atoms Outsmart LoRA in Fine-Tuning

Table of Links
Abstract and 1. Introduction
- Preliminary
- Methods
- Experiments
- Related Works
- Conclusion and References
- Details of Experiments
- Additional Experimental Results
7 Details of Experiments
7.1 Details of Datasets
The VTAB dataset is uniquely challenging and well suited for evaluating parameter-efficient tuning methods in the context of few-shot knowledge transfer. VTAB-1k encompasses a diverse range of image domains, including natural, structured, and specialized categories such as medical or satellite imagery. The tasks span various objectives, including object and scene recognition, distance classification, and counting. Consequently, VTAB-1k is a highly valuable resource for both discriminative and generative transfer learning tasks.
In Table 5, we provide information on 19 tasks of the VTAB dataset, including the number of classes and the number of images in each data split of VTAB. Images in the VTAB benchmark encompass three distinct domains: (1) Natural images captured using standard cameras, (2) Specialized images captured using non-standard cameras like those in remote sensing and medical applications, and (3) Structured images generated through simulation environments.
VTAB-1k is a subset of VTAB. It contains only 1,000 training and validation samples per task, designed for few-shot transfer learning.
7.2 Experimental Settings
LoRA Implementation. We adopt the LoRA implementation from https://github.com/microsoft/LoRA.
LoHa and LoKr Implementation. We adopt the LoHa and LoKr implementations from https://github.com/KohakuBlueleaf/LyCORIS.
DiffFit and BitFit Implementation. We adopt the DiffFit and BitFit implementations from https://github.com/mkshing/DiffFit-pytorch.
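For context, the LoRA baseline keeps each pre-trained weight matrix frozen and adds trainable low-rank factors. Below is a minimal sketch assuming the microsoft/LoRA package (installed as loralib); the wrapped layer, rank r, and lora_alpha are illustrative choices rather than the configurations used in our experiments.

```python
# Minimal sketch: wrapping a linear layer with LoRA via the microsoft/LoRA package.
# The layer dimensions, rank r, and lora_alpha below are placeholders for illustration.
import torch.nn as nn
import loralib as lora

class TinyHead(nn.Module):
    def __init__(self, dim=768, num_classes=100):
        super().__init__()
        # lora.Linear keeps the frozen pre-trained weight and adds low-rank A/B factors.
        self.fc = lora.Linear(dim, num_classes, r=4, lora_alpha=8)

    def forward(self, x):
        return self.fc(x)

model = TinyHead()
# Freeze everything except the LoRA factors before fine-tuning.
lora.mark_only_lora_as_trainable(model)
print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable parameters")
```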
Generative Tasks
Stable diffusion checkpoints. The pre-trained checkpoint we choose for Stable Diffusion is stable-diffusion-v1-4, which can be found at https://huggingface.co/CompVis/stable-diffusion.
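As a reference, a checkpoint with this name can be loaded through the Hugging Face diffusers library. The snippet below is a minimal sketch for sampling an image from a text prompt; it is not the exact pipeline or sampling configuration used in the experiments.

```python
# Minimal sketch: loading stable-diffusion-v1-4 with diffusers and sampling one image.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# One of the per-dataset prompts listed below (Flowers102).
image = pipe("This is a picture of pink primrose.").images[0]
image.save("sample.png")
```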
Text prompts for the few-shot generative task. We use specific text prompts to train Stable Diffusion and to generate the images. The example prompts for each dataset follow patterns such as:
– photo of a …
– The …
– A …
– A peacock in front of the …
– Georgia O’Keeffe style …
– a watercolor painting of the …
– Top view of the …
Text prompts for the full generative task. We use specific text prompts to train Stable Diffusion and to generate the images. We list an example prompt for each dataset below; a short sketch of how such prompts can be assembled follows the list.
– Caltech-101: This is a picture of accordion.
– CIFAR-100: This is a picture of apple.
– Clevr: This is a picture from CLEVR dataset.
– Diabetic Retinopathy: This is a retina image with no diabetic retinopathy.
– DMLab: This is a picture from DMLab dataset.
– Dsprites: This is a picture from dSprites dataset.
– DTD: This is a picture of banded texture.
– EuroSAT: This is a satellite picture of annual crop.
– Flowers102: This is a picture of pink primrose.
– Kitti: This is a picture from KITTI dataset.
– Patch Camelyon: This is a histopathologic scans without tumor.
– Pet: This is a picture of Abyssinian cat.
– Resisc45: This is a remote sensing picture of airplane.
– Smallnorb: This is a picture from SmallNORB dataset.
– SUN397: This is a picture of abbey.
– SVHN: This is a picture of street view house number 0.
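Below is a minimal sketch of how such per-dataset prompts might be assembled. The dictionary keys, the coverage of datasets, and the class-name lookup are assumptions for illustration; the template strings mirror the examples listed above.

```python
# Minimal sketch: assembling per-dataset text prompts for the full generative task.
PROMPT_TEMPLATES = {
    "caltech101": "This is a picture of {}.",
    "cifar100":   "This is a picture of {}.",
    "flowers102": "This is a picture of {}.",
    "dtd":        "This is a picture of {} texture.",
    "eurosat":    "This is a satellite picture of {}.",
    "resisc45":   "This is a remote sensing picture of {}.",
    # Datasets without meaningful class names use a fixed prompt.
    "clevr":      "This is a picture from CLEVR dataset.",
    "dmlab":      "This is a picture from DMLab dataset.",
}

def build_prompt(dataset, class_name=None):
    """Return the text prompt for one training image."""
    template = PROMPT_TEMPLATES[dataset]
    return template.format(class_name) if "{}" in template else template

print(build_prompt("flowers102", "pink primrose"))  # -> This is a picture of pink primrose.
print(build_prompt("clevr"))                        # -> This is a picture from CLEVR dataset.
```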
8 Additional Experimental Results
8.1 Validation Experiments
We provide additional experiments with m = 6, 12 in Figure 6. As we increase m from 6 to 12, the accuracy improves from 66.86% to 68.68%.
8.2 Additional Experiments of Discriminative Tasks
Performance Comparisons on Full Dataset Fine-tuning.
Implementation details. For CIFAR-100 and ImageNet-1K, we follow the fine-tuning setting of ConvNeXt in [30]. We employ the AdamW [33] optimizer to fine-tune models for 100 epochs on CIFAR-100 and 30 epochs on ImageNet-1K. The cosine decay strategy is adopted for the learning rate schedule, and linear warm-up is used in the first 10 epochs for CIFAR-100 and the first 5 epochs for ImageNet-1K.
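A minimal PyTorch sketch of this schedule is given below, assuming an epoch-level LambdaLR; the base learning rate and weight decay are placeholders rather than the values used in the paper.

```python
# Minimal sketch: AdamW with linear warm-up followed by cosine decay (CIFAR-100 setting:
# 100 epochs total, 10 warm-up epochs). Learning rate and weight decay are placeholders.
import math
import torch

def make_optimizer_and_scheduler(model, base_lr=1e-3, weight_decay=0.05,
                                 warmup_epochs=10, total_epochs=100):
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:
            # Linear warm-up from 0 to base_lr.
            return (epoch + 1) / warmup_epochs
        # Cosine decay from base_lr to 0 over the remaining epochs.
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```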
We compare the performance of our approach with other baseline methods; the results on CIFAR-100 and ImageNet-1K are shown in Table 6. With full-dataset fine-tuning, full fine-tuning achieves the highest accuracy, outperforming the parameter-efficient fine-tuning methods. One possible reason is that both datasets contain sufficient data to prevent over-fitting. Our method achieves higher accuracy than LoRA while requiring far fewer parameters (1.2M vs. 21M). In contrast, the VTAB-1k benchmark provides much less data (only 1,000 training images per task), which might cause full fine-tuning to over-fit.
Visualization of Generalization Error. To delve deeper into how various fine-tuning methods impact the generalization capabilities of pre-trained models, we illustrate in Figure 7 the generalization error for a discriminative task trained on the CIFAR-100 and Diabetic Retinopathy datasets, in relation to the number of fine-tuned parameters.
8.3 Results of Few-shot Generative Tasks
We provide more experimental results of few-shot generative learning in Tables 7 and 8. In this experiment, we also include LoRA, LoHa, and LoKr with different configurations.
Images generated by the different fine-tuning methods are shown in Figures 8 and 9.
8.4 Visualization of Generated Images
We visualize images generated by the models trained on each of the VTAB tasks in Figures 10 to 25.
8.5 Grad-CAM
To understand the underlying reason for the effectiveness of our approach on convolution-based models, we employ Grad-CAM [9] on the first block of ResNet50, which is fine-tuned on the CUB dataset [67] using the same experimental setting as above. For our method, we compare the setting with m = 9, i.e., 9 filter atoms ∆D, against the setting with (m, m1) = (9, 4), i.e., 36 filter atoms ∆D1.
Based on the Grad-CAM visualization in Figure 26, our method exhibits larger active regions than LoRA. This observation indicates that our approach benefits from preserving the spatial structure of convolutional layers. When utilizing ∆D1, which expands the number of filter atoms, we observe more active regions in the Grad-CAM heatmap, suggesting that the extra filter atoms potentially capture a wider range of feature maps.
We provide more heatmap visualizations of Grad-CAM from the first block of ResNet50 in Figure 27.
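For reference, heatmaps of this kind can be produced with a hook-based Grad-CAM. The sketch below targets the first block of a torchvision ResNet-50 and omits the fine-tuned checkpoints, the CUB data loading, and the filter-atom configurations; the random input is a placeholder.

```python
# Minimal sketch: hook-based Grad-CAM on the first block (layer1) of a ResNet-50.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights="IMAGENET1K_V1").eval()
target_layer = model.layer1  # the "first block" of ResNet-50

activations, gradients = {}, {}
target_layer.register_forward_hook(lambda m, inp, out: activations.update(value=out))
target_layer.register_full_backward_hook(lambda m, gin, gout: gradients.update(value=gout[0]))

def grad_cam(image, class_idx):
    """Return a (1, 1, H, W) heatmap for `class_idx`, normalized to [0, 1]."""
    logits = model(image)
    model.zero_grad()
    logits[0, class_idx].backward()
    acts, grads = activations["value"], gradients["value"]   # both (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)           # channel-wise importance
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))  # weighted sum of activations
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

heatmap = grad_cam(torch.randn(1, 3, 224, 224), class_idx=0)  # placeholder input
```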