More Than a Feeling: Visualizing Why Filter Atoms Outsmart LoRA in Fine-Tuning

Abstract and 1. Introduction

  2. Preliminary
  3. Methods
  4. Experiments
  5. Related Works
  6. Conclusion and References
  7. Details of Experiments
  8. Additional Experimental Results

7 Details of Experiments

7.1 Details of Datasets

The VTAB dataset is uniquely challenging and well suited for evaluating parameter-efficient tuning methods in the context of few-shot knowledge transfer. VTAB-1k encompasses a diverse range of image domains, including natural, structured, and specialized categories such as medical or satellite imagery. The tasks span various objectives, including object and scene recognition, distance classification, and counting. Consequently, VTAB-1k is a highly valuable resource for both discriminative and generative transfer learning tasks.

In Table 5, we provide information on 19 tasks of the VTAB dataset, including the number of classes and the number of images in each data split of VTAB. Images in the VTAB benchmark encompass three distinct domains: (1) Natural images captured using standard cameras, (2) Specialized images captured using non-standard cameras like those in remote sensing and medical applications, and (3) Structured images generated through simulation environments.

VTAB-1k is a subset of VTAB. It contains only 1,000 training and validation samples per task (800 for training and 200 for validation), designed for few-shot transfer learning.
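For illustration, here is a minimal PyTorch sketch of the 800/200 split structure. The official VTAB-1k benchmark ships fixed splits, so the random draw below is only a stand-in for the split sizes, not the benchmark's actual sampling:

```python
import torch
from torch.utils.data import Dataset, Subset

def make_vtab1k_style_split(train_set: Dataset, seed: int = 0):
    """Draw a VTAB-1k-style split: 800 training and 200 validation
    samples (1,000 total) from a task's full training set."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(len(train_set), generator=g).tolist()
    return Subset(train_set, perm[:800]), Subset(train_set, perm[800:1000])
```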

Table 5: Information of the VTAB dataset.

7.2 Experimental Settings

LoRA Implementation. We adopt the LoRA implementation from https://github.com/microsoft/LoRA.

LoHa and LoKr Implementation. We adopt the LoHa and LoKr implementation from https://github.com/KohakuBlueleaf/LyCORIS.

DiffFit and BitFit Implementation. We adopt the DiffFit and BitFit implementation from https://github.com/mkshing/DiffFit-pytorch.
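For context, the sketch below illustrates the low-rank update that LoRA applies to a frozen linear layer; LoHa and LoKr replace the plain low-rank product with Hadamard and Kronecker factorizations, respectively. The rank r and scaling alpha here are illustrative placeholders, not the configurations used in our experiments:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update:
    y = W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pre-trained weights frozen
        # A starts with small random values and B with zeros, so the
        # update is zero at initialization and the wrapped layer
        # initially matches the original.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```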

Generative Tasks

Stable Diffusion checkpoints. The pre-trained checkpoint we choose for Stable Diffusion is stable-diffusion-v1-4, which can be found at https://huggingface.co/CompVis/stable-diffusion.
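For reference, a minimal sketch of loading this checkpoint for sampling, assuming the Hugging Face diffusers library (our fine-tuning code is not shown here):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the pre-trained Stable Diffusion v1-4 checkpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Sample an image from a text prompt.
image = pipe("photo of a castle.").images[0]
image.save("sample.png")
```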

Text prompts for the few-shot generative task. We use specific text prompts to train Stable Diffusion and to generate images. We list the example prompt templates as follows, where <concept> denotes the learned concept token (e.g., <castle>); a minimal templating sketch follows the list:

– photo of a <concept>.

– The <concept> stands against a backdrop of snow-capped mountains.

– A <concept> surrounded by a lush, vibrant forest.

– The <concept> overlooks a serene lake.

– The <concept> in the autumn season with colorful foliage.

– The <concept> on a rocky cliff, with crashing waves below.

– The <concept> guarded by mythical elves.

– A <concept> surrounded by a field of grazing sheep.

– A peacock in front of the <concept>.

– The <concept> overlooks a serene lake, where a family of geese swims.

– <concept>, oil painting ghibli inspired.

– <concept> painting by artist claude monet.

– <concept> digital painting 3d render geometric style.

– <concept> Georgia O’Keeffe style painting.

– a watercolor painting of the <concept>.

– The <concept> is surrounded by an otherworldly landscape, with glowing mushrooms and mystical creatures.

– The <concept>, made of crystal, shimmers in the sunlight.

– The <concept>, steampunk aesthetic, adorned with gears and metallic accents.

– The <concept> atop a mystical floating island.

– Top view of the <concept>.
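As referenced above, a small sketch of instantiating these templates with a learned concept token; the <concept> placeholder and the sample template strings mirror the list above:

```python
# Substitute the learned concept token (e.g., <castle> or <canal>)
# into the prompt templates listed above.
CONCEPT = "<castle>"

TEMPLATES = [
    "photo of a {c}.",
    "The {c} stands against a backdrop of snow-capped mountains.",
    "A {c} surrounded by a lush, vibrant forest.",
    "A peacock in front of the {c}.",
]

prompts = [t.format(c=CONCEPT) for t in TEMPLATES]
```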

Text prompts for the full generative task. We use specific text prompts to train Stable Diffusion and to generate images. We list an example prompt for each dataset as follows:

– Caltech-101: This is a picture of accordion.

– CIFAR-100: This is a picture of apple.

– Clevr: This is a picture from CLEVR dataset.

– Diabetic Retinopathy: This is a retina image with no diabetic retinopathy.

– DMLab: This is a picture from DMLab dataset.

– Dsprites: This is a picture from dSprites dataset.

Fig. 6: The relation between accuracy and the number of fine-tuning parameters, with different numbers of filter atoms (m = 6 and m = 12).

– DTD: This is a picture of banded texture.

– EuroSAT: This is a satellite picture of annual crop.

– Flowers102: This is a picture of pink primrose.

– Kitti: This is a picture from KITTI dataset.

– Patch Camelyon: This is a histopathologic scan without tumor.

– Pet: This is a picture of Abyssinian cat.

– Resisc45: This is a remote sensing picture of airplane.

– Smallnorb: This is a picture from SmallNORB dataset.

– SUN397: This is a picture of abbey.

– SVHN: This is a picture of street view house number 0.

8 Additional Experimental Results

8.1 Validation Experiments

We provide additional experiments with m = 6 and m = 12 in Figure 6. As we increase m from 6 to 12, the accuracy improves from 66.86% to 68.68%.

8.2 Additional Experiments of Discriminative Tasks

Performance Comparisons on Full Dataset Fine-tuning.

Implementation details. For CIFAR-100 and ImageNet-1K, we follow the fine-tuning setting of ConvNeXt in [30]. We employ the AdamW [33] optimizer to fine-tune models for 100 epochs on CIFAR-100 and 30 epochs on ImageNet-1K. The cosine decay strategy is adopted for the learning rate schedule, and linear warm-up is used in the first 10 epochs for CIFAR-100 and the first 5 epochs for ImageNet-1K; a minimal sketch of this schedule is given below.
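A minimal PyTorch sketch of the warm-up-then-cosine schedule described above; the model constructor, learning rate, and weight decay values are illustrative placeholders rather than our exact hyperparameters:

```python
import math
import torch
from torchvision.models import convnext_base

# Illustrative setup; the actual model and hyperparameters may differ.
model = convnext_base()
EPOCHS, WARMUP = 100, 10  # CIFAR-100 schedule; use 30 and 5 for ImageNet-1K

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

def lr_lambda(epoch: int) -> float:
    # Linear warm-up for the first WARMUP epochs, then cosine decay.
    if epoch < WARMUP:
        return (epoch + 1) / WARMUP
    progress = (epoch - WARMUP) / (EPOCHS - WARMUP)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Call scheduler.step() once at the end of every epoch.
```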

We compare the performance of our approach with other baseline methods, and the results on CIFAR-100 and ImageNet-1K are shown in Table 6. When fine-tuning on the full datasets, full fine-tuning achieves the highest accuracy, outperforming the parameter-efficient fine-tuning methods. One possible reason is that both datasets have sufficient data to prevent over-fitting of the model. Our method achieves a higher accuracy than LoRA while requiring only a small number of parameters (1.2M vs. 21M). In contrast, in the VTAB-1k benchmark, the amount of data is small (e.g., only 1,000 training images), which might cause over-fitting under full fine-tuning.

Table 6: Performance comparisons on CIFAR-100 and ImageNet-1K with ConvNeXt models pre-trained on ImageNet-21K.

Visualization of Generalization Error. To delve deeper into how various fine-tuning methods impact the generalization capabilities of pre-trained models, we illustrate in Figure 7 the generalization error for a discriminative task trained on the CIFAR-100 and Diabetic Retinopathy datasets, in relation to the number of fine-tuned parameters.

Fig. 7: Generalization error of (a) CIFAR-100 and (b) Diabetic Retinopathy.

8.3 Results of Few-shot Generative Tasks

We provide more experimental results of few-shot generative learning in Tables 7 and 8. In this experiment, we also include LoRA, LoHa, and LoKr with different configurations.

The images generated by the different fine-tuning methods are shown in Figures 8 and 9.

Table 7: Evaluation of different approaches in learning the concept <castle>.

Table 8: Evaluation of different approaches in learning the concept <canal>.

8.4 Visualization of Generated Images

We visualize images generated by the models trained on each of the VTAB tasks in Figures 10 to 25.

8.5 Grad-CAM

To understand the underlying reason for the effectiveness of our approach on convolution-based models, we employ Grad-CAM [9] on the first block of ResNet50, which is fine-tuned on the CUB dataset [67] using the same experimental setting as above. For our method, we compare the setting with m = 9, i.e., 9 filter atoms ∆D, against the setting with (m, m1) = (9, 4), i.e., 36 filter atoms ∆D1. A minimal sketch of the Grad-CAM computation follows.
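The sketch below shows how such a heatmap can be computed with forward and backward hooks in PyTorch; the torchvision ResNet50 constructor, the random input, and the hook placement on layer1 (the first block) are illustrative assumptions rather than our exact pipeline:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50()  # stand-in; in the paper the model is fine-tuned on CUB
model.eval()

acts, grads = {}, {}
block = model.layer1  # the first residual block of ResNet50

block.register_forward_hook(lambda m, i, o: acts.update(feat=o.detach()))
block.register_full_backward_hook(
    lambda m, gi, go: grads.update(feat=go[0].detach())
)

x = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed CUB image
logits = model(x)
logits[0, logits.argmax()].backward()  # backprop from the predicted class

# Grad-CAM: channel weights are the global-average-pooled gradients; the
# heatmap is the ReLU of the weighted sum of activation maps.
w = grads["feat"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((w * acts["feat"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```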

Based on the Grad-CAM visualization in Figure 26, our method exhibits larger active regions compared with LoRA. This observation indicates that our approach benefits from preserving the spatial structure of convolutional layers. When utilizing ∆D1, which expands the number of filter atoms, we observe more active regions in the Grad-CAM heatmap. This suggests that the introduction of extra filter atoms potentially captures a wider range of feature maps.

We provide more heatmap visualizations of Grad-CAM from the first block of ResNet50 in Figure 27.

Fig. 8: Images sampled from Stable Diffusion [49] checkpoints fine-tuned with different approaches. The text prompts used to generate images, from top to bottom, are: “The <castle> stands against a backdrop of snow-capped mountains”, “A <castle> surrounded by a lush, vibrant forest”, “A peacock in front of the <castle>”, and “The <castle> overlooks a serene lake, where a family of geese swims”.

Fig. 9: Images sampled from Stable Diffusion [49] checkpoints fine-tuned with different approaches. The text prompts used to generate images, from top to bottom, are: “The <castle> stands against a backdrop of snow-capped mountains”, “A <castle> surrounded by a lush, vibrant forest”, “A peacock in front of the <castle>”, and “The <castle> overlooks a serene lake, where a family of geese swims”.

Fig. 10: Images sampled from Stable Diffusion checkpoints fine-tuned on Caltech-101.

Fig. 11: Images sampled from Stable Diffusion checkpoints fine-tuned on CIFAR-100.

Fig. 12: Images sampled from Stable Diffusion checkpoints fine-tuned on SUN397.

Fig. 13: Images sampled from Stable Diffusion checkpoints fine-tuned on SVHN.

Fig. 14: Images sampled from Stable Diffusion checkpoints fine-tuned on Flowers102.

Fig. 15: Images sampled from Stable Diffusion checkpoints fine-tuned on Pets.

Fig. 16: Images sampled from Stable Diffusion checkpoints fine-tuned on DTD.

Fig. 17: Images sampled from Stable Diffusion checkpoints fine-tuned on EuroSAT.

Fig. 18: Images sampled from Stable Diffusion checkpoints fine-tuned on Resisc45.

Fig. 19: Images sampled from Stable Diffusion checkpoints fine-tuned on Patch Camelyon.

Fig. 20: Images sampled from Stable Diffusion checkpoints fine-tuned on Diabetic Retinopathy.

Fig. 21: Images sampled from Stable Diffusion checkpoints fine-tuned on KITTI.

Fig. 22: Images sampled from Stable Diffusion checkpoints fine-tuned on SmallNORB.

Fig. 23: Images sampled from Stable Diffusion checkpoints fine-tuned on dSprites.

Fig. 24: Images sampled from Stable Diffusion checkpoints fine-tuned on CLEVR.

Fig. 25: Images sampled from Stable Diffusion checkpoints fine-tuned on DMLab.

Fig. 26: The Grad-CAM heatmap comparisons between our method and LoRA reveal that our approach exhibits larger active regions. The heatmaps are generated from ResNet50 [13] using the CUB dataset [67]. Fine-tuning the model with ∆D1 involves additional filter atoms, which leads to larger active regions in the heatmap compared to fine-tuning ∆D only. (a) The Grad-CAM from the first block of ResNet50. (b-d) The Grad-CAM from blocks 2-4 of ResNet50.

Fig. 27: Additional Grad-CAM heatmap comparisons between our method and LoRA from the first block of ResNet50.

