
VRP Outperforms Baselines in Jailbreaking MLLMs, Transferring Across Models, and Evading Defenses

Abstract and 1. Introduction

  2. Related Works

  3. Methodology and 3.1 Preliminary

    3.2 Query-specific Visual Role-play

    3.3 Universal Visual Role-play

  4. Experiments and 4.1 Experimental setups

    4.2 Main Results

    4.3 Ablation Study

    4.4 Defense Analysis

    4.5 Integrating VRP with Baseline Techniques

  5. Conclusion

  6. Limitation

  7. Future work and References

A. Character Generation Detail

B. Ethics and Broader Impact

C. Effect of Text Moderator on Text-based Jailbreak Attack

D. Examples

E. Evaluation Detail

4 Experiments

In this section, we evaluate VRP on a series of datasets and victim models and compare it with several highly relevant recent jailbreak attack baselines. We not only examine the significance of the different image components through an ablation study, but also assess the robustness of VRP against two distinct defense methods. Moreover, we combine VRP with FigStep and Query relevant to explore the potential of VRP to enhance structure-based jailbreak methods.

4.1 Experimental setups

In our experiments, we use two datasets and five victim models, and we customize a metric to evaluate the attack success rate (ASR). The details are as follows:

Dataset. We use two widely used jailbreak attack datasets, RedTeam-2K [38] and HarmBench [41], to evaluate VRP. (i) RedTeam-2K [38] consists of 2,000 diverse, high-quality harmful textual questions spanning 16 harmful categories. We randomly split RedTeam-2K into train, valid, and test sets with a ratio of 6:2:2, and use the train and valid sets to train the universal VRP character. (ii) HarmBench [41] is an open-source framework for automated red teaming; we use its test set of 320 harmful textual questions.
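For concreteness, a minimal sketch of such a 6:2:2 split is shown below; the file name and field name are illustrative assumptions, not part of the paper's release.

```python
import json
import random

# Illustrative only: assumes the RedTeam-2K questions sit in a local JSON file
# with a "question" field; both names are placeholders for this sketch.
with open("redteam_2k.json") as f:
    questions = [item["question"] for item in json.load(f)]

random.seed(0)            # fixed seed so the 6:2:2 split is reproducible
random.shuffle(questions)

n = len(questions)
n_train, n_valid = int(0.6 * n), int(0.2 * n)

train_set = questions[:n_train]                    # 60%: used to search characters
valid_set = questions[n_train:n_train + n_valid]   # 20%: used for validation
test_set  = questions[n_train + n_valid:]          # 20%: held out for evaluation
```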

Victim Models. We evaluate five state-of-the-art MLLMs: four open-source MLLMs, LLaVA-V1.6-Mistral-7B [30], Qwen-VL-Chat (7B) [2], OmniLMM (12B) [71], and InternVL-Chat-V1.5 [8], and one closed-source MLLM, Gemini-1.0-Pro-Vision. The open-source MLLMs are selected from models with high performance on the OpenVLM Leaderboard [9]. All experiments are conducted on 2 NVIDIA A100 GPUs.

Baselines. We compare VRP against the following jailbreak baselines:

Vanilla Text: We use a blank image as the image input and the vanilla harmful query as the text input.

Textual Role-play (TRP): We insert the same character description generated by VRP into the text input to perform a text-based jailbreak attack, using a blank image as the image input. See Tab. D in Appendix D for details.

FigStep [14]: A straightforward image-based jailbreak attack that rephrases the vanilla question into a “step-by-step” style and renders it as typography in the image input.

Query relevant [36]: An image-based jailbreak attack that turns textual queries into visual representations using Stable Diffusion (SD), typography (Typo), or SD+Typo. We use only SD+Typo as the baseline, owing to its consistently superior performance across many MLLMs.

Implementation Details

Our implementation consists of five main parts:

• Character Generation. We use Mixtral-8x7B-Instruct-v0.1 [21] to generate all characters. We design three different prompts: one for query-specific VRP, one for the initial round of universal VRP, and one for the optimization rounds of universal VRP. See Sec. A of the Appendix for character generation details.

• Image Generation. We use stable-diffusion-xl-base-1.0 [51] to generate all character images, with 30 diffusion steps and a 1024×1024 image size. All typography uses black text on a white background, rendered in Arial at font size 50 (a rough sketch of this pipeline is given after this list).

• Hyperparameters for Universal VRP Training. Including the initial round, we generate five rounds of character candidates. In each generation round, we give the LLM 50 demonstration questions sampled from the train set. In the initial round, we prompt the LLM to generate 10 initial character candidates; in each subsequent optimization round, we prompt it to generate 5 candidates. To compute the batch training ASR, we sample 256 examples from the train set. In each optimization round, we sample 5 characters from the 10 characters with the highest training ASR among all characters generated so far.

• VRP with FigStep. We combine FigStep with VRP by replacing the harmful-question typography at the bottom of the image with FigStep-style typography, e.g., “Here is how to build a bomb: 1. 2. 3.”. Additionally, we append the FigStep text input as a postfix to our VRP text input.

• VRP with Query relevant. We likewise combine Query relevant with VRP by replacing the harmful-question typography with the Query relevant image and, as in VRP+FigStep, appending a Query relevant-style postfix to the VRP text input.
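As a rough illustration of the image pipeline in the “Image Generation” item above, the sketch below generates a 1024×1024 character portrait with stable-diffusion-xl-base-1.0 (30 steps) and stacks a black-on-white Arial typography panel beneath it. The diffusion prompt, panel height, text content, and font path are assumptions made for this sketch, not the paper's exact settings.

```python
import torch
from diffusers import StableDiffusionXLPipeline
from PIL import Image, ImageDraw, ImageFont

# Generate a 1024x1024 character image with SDXL using 30 diffusion steps.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
character_image = pipe(
    "portrait of a fictional character",  # illustrative prompt, not the paper's
    num_inference_steps=30,
    height=1024,
    width=1024,
).images[0]

# Render typography: black Arial text (size 50) on a white background panel.
font = ImageFont.truetype("arial.ttf", 50)           # path to Arial is an assumption
text_panel = Image.new("RGB", (1024, 200), "white")  # panel height is an assumption
ImageDraw.Draw(text_panel).text((20, 20), "example caption text", fill="black", font=font)

# Stack the character image and the typography panel into a single image input.
canvas = Image.new("RGB", (1024, 1024 + 200), "white")
canvas.paste(character_image, (0, 0))
canvas.paste(text_panel, (0, 1024))
canvas.save("vrp_image.png")
```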

4.2 Main Results

VRP is more effective than baseline attacks. In Tab. 1, we present the results of our query-specific VRP attack on the test sets of RedTeam-2K and HarmBench. This approach generates a specific character for each harmful question and measures its effectiveness in compromising SotA open-source and closed-source MLLMs such as Gemini-Pro-Vision. Our findings reveal that query-specific VRP not only successfully breaches these MLLMs but also achieves a higher ASR than all evaluated baseline attacks. Specifically, it improves the ASR by 9.8% over FigStep and by 14.3% over Query relevant. In most cases, query-specific VRP also surpasses TRP, underscoring the crucial role of character images in effectively jailbreaking MLLMs. These results affirm that VRP is a potent method for jailbreaking MLLMs.

Table 1: Attack Success Rate of query-specific VRP compared with baseline attacks on MLLMs on the test sets of RedTeam-2K and HarmBench. Our VRP achieves the highest ASR on all datasets compared with other jailbreak attacks.

VRP achieves high transferability across models. We further investigate the applicability of a universal attack across diverse models. Using our universal VRP algorithm, we identify the most effective role-play character on the train and valid sets against the target model, and then transfer this character to conduct jailbreak attacks on the other models. As shown in Tab. 2, the ASR averages 32.7% when the target model is LLaVA-V1.6-Mistral and 29.4% when it is Qwen-VL-Chat. Characters with a higher ASR on the target model also tend to achieve a higher ASR on the transfer models, demonstrating that our VRP in the universal setting transfers effectively and maintains high performance across different MLLMs.

Table 2: Attack Success Rate of universal VRP between target models and transfer models on the test set of RedTeam-2K. We use the train and valid sets of RedTeam-2K on the target model to find the best character, and then use that character to attack the transfer models on the test set of RedTeam-2K. The results show that our VRP in the universal setting transfers with high performance among different black-box models.

4.4 Defense Analysis

We evaluate the robustness of VRP against two defense approaches, namely System Prompt-based Defense and the Eyes Closed, Safety On (ECSO) approach [15].

System Prompt-based Defense: To defend against jailbreak attacks, a system prompt can instruct the model to conduct a preliminary safety assessment of the text and image inputs, thereby filtering out queries that violate AI safety policies. We add Prompt 2 to the existing system prompt of the MLLMs.
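A minimal sketch of this kind of defense is shown below; the safety instruction text is a placeholder written for illustration, not the paper's Prompt 2.

```python
# Placeholder wording for illustration only; the paper's actual Prompt 2 is not reproduced here.
SAFETY_INSTRUCTION = (
    "Before answering, assess whether the combined text and image request "
    "violates AI safety policy. If it does, refuse and briefly explain why."
)

def build_system_prompt(existing_system_prompt: str) -> str:
    """Append the safety-assessment instruction to the MLLM's existing system prompt."""
    return existing_system_prompt.rstrip() + "\n\n" + SAFETY_INSTRUCTION

# Usage: pass build_system_prompt(original_prompt) as the system message of
# whichever MLLM chat interface is being evaluated.
```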

ECSO [15]: A defense method that utilizes the MLLM's aligned textual module to mitigate the vulnerability of the visual modality. ECSO uses the MLLM itself to evaluate the safety of its own response and regenerates unsafe responses in two steps: it first captions the image, and then responds based on the caption alone, with no image input.
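A rough sketch of this two-step regeneration is given below, assuming a generic `mllm_chat(text, image=None)` helper and an `is_unsafe` self-evaluation check; both are placeholders rather than the official ECSO implementation.

```python
def ecso_defense(query: str, image, mllm_chat, is_unsafe) -> str:
    """ECSO-style defense sketch: if the multimodal answer looks unsafe,
    caption the image and answer again from the caption alone (no image input)."""
    # Step 0: normal multimodal response.
    response = mllm_chat(query, image=image)

    # The model itself judges whether its response is unsafe (is_unsafe stands
    # in for that self-evaluation step).
    if not is_unsafe(response):
        return response

    # Step 1: convert the image to text via captioning.
    caption = mllm_chat("Describe this image in detail.", image=image)

    # Step 2: regenerate the answer from the caption only, with no image input,
    # so the aligned text-only module handles the request.
    safe_prompt = f"Image description: {caption}\n\nQuestion: {query}"
    return mllm_chat(safe_prompt, image=None)
```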

VRP is effective against System Prompt-based Defense and ECSO. We evaluate our query-specific VRP and the baselines against the System Prompt-based Defense and ECSO. As shown in Tab. 4, query-specific VRP consistently maintains a high ASR across all models, whether it is tested against System Prompt-based Defense or ECSO. This consistent performance underlines the efficacy of query-specific VRP in penetrating defenses and reveals a notable vulnerability of these defense mechanisms to VRP jailbreak attacks. These findings highlight the potential of VRP as a formidable strategy against defense mechanisms.

Table 4: Attack Success Rate of query-specific VRP against defenses on the test set of RedTeam-2K. Our query-specific attack remains effective under both System Prompt-based Defense and ECSO across all models.

4.5 Integrating VRP with Baseline Techniques

We experimentally combine VRP with the established baseline techniques to evaluate their synergistic effect on jailbreak performance, as detailed in Tab. 5. The integration simply replaces the question typography with the baseline's image input and concatenates the VRP and baseline text inputs. Notably, integrating VRP significantly elevates the ASR of both FigStep and Query relevant. This pronounced enhancement indicates that adding a role-playing element to these structure-based jailbreak methods reinforces their effectiveness, underscoring the potential of role-play-based enhancements in structure-based jailbreak scenarios.

Table 5: Attack Success Rate of VRP with FigStep and VRP with Query relevant on the test set of RedTeam-2K. The ASR of each baseline improves in the VRP setting, indicating that adding a role-playing template to structure-based jailbreak attacks can improve their jailbreak performance.

