One Image to Rule Them All: The Jailbreak That Outsmarts Multimodal AI

Abstract and 1. Introduction

  2. Related Works

  3. Methodology and 3.1 Preliminary

    3.2 Query-specific Visual Role-play

    3.3 Universal Visual Role-play

  4. Experiments and 4.1 Experimental setups

    4.2 Main Results

    4.3 Ablation Study

    4.4 Defense Analysis

    4.5 Integrating VRP with Baseline Techniques

  5. Conclusion

  6. Limitation

  7. Future work and References

A. Character Generation Detail

B. Ethics and Broader Impact

C. Effect of Text Moderator on Text-based Jailbreak Attack

D. Examples

E. Evaluation Detail

5 Conclusion

In this paper, we propose a novel jailbreak method that overcomes the limited effectiveness and universality of current approaches. Our method induces MLLMs to produce harmful content in response to malicious requests. By leveraging a joint framework, we generate portraits of characters and instruct the MLLMs to role-play these characters, thereby compromising the models’ alignment robustness. Extensive experiments demonstrate that, compared with existing methods, our method achieves outstanding attack effectiveness across various models, even against advanced defenses. We further show that a single image generated by our method can induce MLLMs to produce multiple harmful responses.
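To make the described pipeline concrete, the sketch below outlines the high-level data flow only: an auxiliary LLM writes a character description, a diffusion model renders the portrait, and the portrait plus a role-play instruction is sent to the target MLLM. All names here (`Character`, `build_character`, `query_target`) and the callables passed in are hypothetical placeholders for illustration, not the authors' released code, and no prompt templates from the paper are reproduced.

```python
# Illustrative sketch of the data flow only; function names and templates are
# hypothetical placeholders and do not reproduce the paper's prompts.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Character:
    description: str  # persona text produced by an auxiliary LLM
    portrait: bytes   # portrait rendered by a text-to-image diffusion model


def build_character(llm: Callable[[str], str],
                    diffusion: Callable[[str], bytes],
                    character_prompt_template: str,
                    query: str) -> Character:
    # Step 1: the auxiliary LLM turns the query into a character description
    # that also serves as a prompt for the diffusion model.
    description = llm(character_prompt_template.format(query=query))
    # Step 2: the diffusion model renders the character portrait.
    portrait = diffusion(description)
    return Character(description, portrait)


def query_target(mllm: Callable[..., str],
                 character: Character,
                 roleplay_instruction: str) -> str:
    # Step 3: the target MLLM receives the portrait together with an
    # instruction to role-play the depicted character.
    return mllm(image=character.portrait, prompt=roleplay_instruction)
```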

6 Limitation

One potential limitation of our work, despite its strong performance on state-of-the-art MLLMs, lies in its effectiveness against poorly performing MLLMs. These models may lack adequate instruction-following and image-understanding capabilities, rendering them ineffective at role-playing tasks. Another limitation is our approach to generating character prompts for the diffusion model, which relies on direct generation by an LLM. This method, while effective and straightforward, may be constrained by the LLM’s ability to produce effective diffusion-model prompts. Additionally, the diffusion model’s capability to generate character images from these prompts may further limit the efficacy of our approach.

7 Future work

One possible direction for future work is to employ more sophisticated strategies for generating characters [72; 4]. Additionally, mechanisms that inspect and iteratively improve the quality of character images generated by the LLM and the diffusion model, before attacking target MLLMs, could be explored [28; 58].
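One way to realize the second direction is a simple generate-inspect-revise loop in the spirit of self-reflection methods such as Reflexion [58]: render a candidate portrait, have a judge model score it, and feed the judge's feedback back into the prompt. The sketch below is a minimal illustration under these assumptions; the `judge` interface, the threshold, and the retry budget are hypothetical and not taken from the paper.

```python
# Hedged sketch of an iterative image-refinement loop; all interfaces below
# (llm, diffusion, judge) are hypothetical callables, not from the paper.
from typing import Callable, Tuple


def refine_character_image(llm: Callable[[str], str],
                           diffusion: Callable[[str], bytes],
                           judge: Callable[[bytes, str], Tuple[float, str]],
                           seed_prompt: str,
                           max_rounds: int = 3,
                           threshold: float = 0.8):
    prompt = seed_prompt
    best_image, best_score = None, float("-inf")
    for _ in range(max_rounds):
        image = diffusion(prompt)                # render a candidate portrait
        score, feedback = judge(image, prompt)   # e.g. an MLLM scoring fidelity
        if score > best_score:
            best_image, best_score = image, score
        if score >= threshold:
            break
        # Ask the LLM to rewrite the diffusion prompt using the judge's
        # feedback before the next rendering attempt.
        prompt = llm(f"Improve this image prompt given the feedback.\n"
                     f"Prompt: {prompt}\nFeedback: {feedback}")
    return best_image, best_score
```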

References

[1] ACHIAM, J., ADLER, S., AGARWAL, S., AHMAD, L., AKKAYA, I., ALEMAN, F. L., ALMEIDA, D., ALTENSCHMIDT, J., ALTMAN, S., ANADKAT, S., ET AL. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).

[2] BAI, J., BAI, S., YANG, S., WANG, S., TAN, S., WANG, P., LIN, J., ZHOU, C., AND ZHOU, J. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint arXiv:2308.12966 (2023).

[3] CHA, S., LEE, J., LEE, Y., AND YANG, C. Visually Dehallucinative Instruction Generation: Know What You Don’t Know. arXiv preprint arXiv:2303.16199 (2024).

[4] CHAO, P., ROBEY, A., DOBRIBAN, E., HASSANI, H., PAPPAS, G. J., AND WONG, E. Jailbreaking black box large language models in twenty queries, 2023.

[5] CHEN, G., DONG, S., SHU, Y., ZHANG, G., SESAY, J., KARLSSON, B. F., FU, J., AND SHI, Y. Autoagents: A framework for automatic agent generation, 2024.

[6] CHEN, K., ZHANG, Z., ZENG, W., ZHANG, R., ZHU, F., AND ZHAO, R. Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic. arXiv preprint arXiv:2306.15195 (2023).

[7] CHEN, Y., SIKKA, K., COGSWELL, M., JI, H., AND DIVAKARAN, A. DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback. arXiv preprint arXiv:2311.10081 (2023).

[8] CHEN, Z., WU, J., WANG, W., SU, W., CHEN, G., XING, S., ZHONG, M., ZHANG, Q., ZHU, X., LU, L., LI, B., LUO, P., LU, T., QIAO, Y., AND DAI, J. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238 (2023).

[9] CONTRIBUTORS, O. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.

[10] DONG, X., ZHANG, P., ZANG, Y., CAO, Y., WANG, B., OUYANG, L., WEI, X., ZHANG, S., DUAN, H., CAO, M., ZHANG, W., LI, Y., YAN, H., GAO, Y., ZHANG, X., LI, W., LI, J., CHEN, K., HE, C., ZHANG, X., QIAO, Y., LIN, D., AND WANG, J. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model, 2024.

[11] DONG, Y., CHEN, H., CHEN, J., FANG, Z., YANG, X., ZHANG, Y., TIAN, Y., SU, H., AND ZHU, J. How Robust is Google’s Bard to Adversarial Image Attacks? arXiv preprint arXiv:2309.11751 (2023).

[12] FU, C., CHEN, P., SHEN, Y., QIN, Y., ZHANG, M., LIN, X., YANG, J., ZHENG, X., LI, K., SUN, X., WU, Y., AND JI, R. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394 (2023).

[13] GE, J., LUO, H., QIAN, S., GAN, Y., FU, J., AND ZHAN, S. Chain of Thought Prompt Tuning in Vision Language Models. arXiv preprint arXiv:2304.07919 (2023).

[14] GONG, Y., RAN, D., LIU, J., WANG, C., CONG, T., WANG, A., DUAN, S., AND WANG, X. Figstep: Jailbreaking large vision-language models via typographic visual prompts, 2023.

[15] GOU, Y., CHEN, K., LIU, Z., HONG, L., XU, H., LI, Z., YEUNG, D.-Y., KWOK, J. T., AND ZHANG, Y. Eyes closed, safety on: Protecting multimodal llms via image-to-text transformation, 2024.

[16] GU, X., ZHENG, X., PANG, T., DU, C., LIU, Q., WANG, Y., JIANG, J., AND LIN, M. Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast. arXiv preprint arXiv:2402.08567 (2024).

[17] GUO, P., YANG, Z., LIN, X., ZHAO, Q., AND ZHANG, Q. PuriDefense: Randomized Local Implicit Adversarial Purification for Defending Black-box Query-based Attacks. arXiv preprint arXiv:2401.10586 (2024).

[18] HAN, D., JIA, X., BAI, Y., GU, J., LIU, Y., AND CAO, X. OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization. arXiv preprint arXiv:2312.04403 (2023).

[19] INAN, H., UPASANI, K., CHI, J., RUNGTA, R., IYER, K., MAO, Y., TONTCHEV, M., HU, Q., FULLER, B., TESTUGGINE, D., AND KHABSA, M. Llama guard: Llm-based input-output safeguard for human-ai conversations, 2023.

[20] JI, Y., GE, C., KONG, W., XIE, E., LIU, Z., LI, Z., AND LUO, P. Large Language Models as Automated Aligners for benchmarking Vision-Language Models. arXiv preprint arXiv:2311.14580 (2023).

[21] JIANG, A. Q., SABLAYROLLES, A., MENSCH, A., BAMFORD, C., CHAPLOT, D. S., DE LAS CASAS, D., BRESSAND, F., LENGYEL, G., LAMPLE, G., SAULNIER, L., LAVAUD, L. R., LACHAUX, M.-A., STOCK, P., SCAO, T. L., LAVRIL, T., WANG, T., LACROIX, T., AND SAYED, W. E. Mistral 7b, 2023.

[22] JIN, H., CHEN, R., CHEN, J., AND WANG, H. Quack: Automatic jailbreaking large language models via role-playing, 2024.

[23] KOJIMA, T., GU, S. S., REID, M., MATSUO, Y., AND IWASAWA, Y. Large language models are zero-shot reasoners. NeurIPS (2022).

[24] KURAKIN, A., GOODFELLOW, I. J., AND BENGIO, S. Adversarial Machine Learning at Scale. In ICLR (2017).

[25] LI, J., LI, D., SAVARESE, S., AND HOI, S. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML (2023).

[26] LI, L., XIE, Z., LI, M., CHEN, S., WANG, P., CHEN, L., YANG, Y., WANG, B., AND KONG, L. Silkie: Preference Distillation for Large Visual Language Models. arXiv preprint arXiv:2312.10665 (2023).

[27] LI, M., LI, L., YIN, Y., AHMED, M., LIU, Z., AND LIU, Q. Red Teaming Visual Language Models. arXiv preprint arXiv:2401.12915 (2024).

[28] LI, Y., GUO, H., ZHOU, K., ZHAO, W. X., AND WEN, J.-R. Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models, 2024.

[29] LIN, B., ZHU, B., YE, Y., NING, M., JIN, P., AND YUAN, L. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. arXiv preprint arXiv:2311.10122 (2023).

[30] LIU, H., LI, C., LI, Y., LI, B., ZHANG, Y., SHEN, S., AND LEE, Y. J. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024.

[31] LIU, H., XUE, W., CHEN, Y., CHEN, D., ZHAO, X., WANG, K., HOU, L., LI, R., AND PENG, W. A Survey on Hallucination in Large Vision-Language Models. arXiv preprint arXiv:2402.00253 (2024).

[32] LIU, M., ROY, S., LI, W., ZHONG, Z., SEBE, N., AND RICCI, E. Democratizing fine-grained visual recognition with large language models. In ICLR (2024).

[33] LIU, S., NIE, W., WANG, C., LU, J., QIAO, Z., LIU, L., TANG, J., XIAO, C., AND ANANDKUMAR, A. Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing. arXiv preprint arXiv:2212.10789 (2024).

[34] LIU, X., XU, N., CHEN, M., AND XIAO, C. Autodan: Generating stealthy jailbreak prompts on aligned large language models. CoRR abs/2310.04451 (2023).

[35] LIU, X., YU, H., ZHANG, H., XU, Y., LEI, X., LAI, H., GU, Y., DING, H., MEN, K., YANG, K., ZHANG, S., DENG, X., ZENG, A., DU, Z., ZHANG, C., SHEN, S., ZHANG, T., SU, Y., SUN, H., HUANG, M., DONG, Y., AND TANG, J. AgentBench: Evaluating LLMs as Agents. In ICLR (2024).

[36] LIU, X., ZHU, Y., GU, J., LAN, Y., YANG, C., AND QIAO, Y. Mm-safetybench: A benchmark for safety evaluation of multimodal large language models, 2024.

[37] LU, L.-C., CHEN, S.-J., PAI, T.-M., YU, C.-H., YI LEE, H., AND SUN, S.-H. Llm discussion: Enhancing the creativity of large language models via discussion framework and role-play, 2024.

[38] LUO, W., MA, S., LIU, X., GUO, X., AND XIAO, C. Jailbreakv-28k: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks, 2024.

[39] LYU, H., HUANG, J., ZHANG, D., YU, Y., MOU, X., PAN, J., YANG, Z., WEI, Z., AND LUO, J. GPT-4v(ision) as a social media analysis engine. arXiv preprint arXiv:2311.07547 (2023).

[40] MAO, C., CHIQUIER, M., WANG, H., YANG, J., AND VONDRICK, C. Adversarial Attacks Are Reversible With Natural Supervision. In ICCV (2021).

[41] MAZEIKA, M., PHAN, L., YIN, X., ZOU, A., WANG, Z., MU, N., SAKHAEE, E., LI, N., BASART, S., LI, B., FORSYTH, D., AND HENDRYCKS, D. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal, 2024.

[42] META AI. Llama 2 – acceptable use policy. https://ai.meta.com/llama/use-policy/, 2024. Accessed: 2024-01-19.

[43] NAVEED, H., KHAN, A. U., QIU, S., SAQIB, M., ANWAR, S., USMAN, M., AKHTAR, N., BARNES, N., AND MIAN, A. A Comprehensive Overview of Large Language Models. arXiv preprint arXiv:2307.06435 (2024).

[44] NIE, W., GUO, B., HUANG, Y., XIAO, C., VAHDAT, A., AND ANANDKUMAR, A. Diffusion models for adversarial purification. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA (2022), K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvári, G. Niu, and S. Sabato, Eds., vol. 162 of Proceedings of Machine Learning Research, PMLR, pp. 16805–16827.

[45] NIU, Z., REN, H., GAO, X., HUA, G., AND JIN, R. Jailbreaking Attack against Multimodal Large Language Model. arXiv preprint arXiv:2402.02309 (2024).

[46] OPENAI. Usage policies – openai. https://openai.com/policies/usage-policies, 2024. Accessed: 2024-01-12.

[47] OPENAI TEAM. Gpt-4 technical report, 2023.

[48] QI, X., HUANG, K., PANDA, A., HENDERSON, P., WANG, M., AND MITTAL, P. Visual Adversarial Examples Jailbreak Aligned Large Language Models. arXiv preprint arXiv:2306.13213 (2023).

[49] REID, M., SAVINOV, N., TEPLYASHIN, D., LEPIKHIN, D., LILLICRAP, T. P., ALAYRAC, J., SORICUT, R., LAZARIDOU, A., FIRAT, O., SCHRITTWIESER, J., ANTONOGLOU, I., ANIL, R., BORGEAUD, S., DAI, A. M., MILLICAN, K., DYER, E., GLAESE, M., SOTTIAUX, T., LEE, B., VIOLA, F., REYNOLDS, M., XU, Y., MOLLOY, J., CHEN, J., ISARD, M., BARHAM, P., HENNIGAN, T., MCILROY, R., JOHNSON, M., SCHALKWYK, J., COLLINS, E., RUTHERFORD, E., MOREIRA, E., AYOUB, K., GOEL, M., MEYER, C., THORNTON, G., YANG, Z., MICHALEWSKI, H., ABBAS, Z., SCHUCHER, N., ANAND, A., IVES, R., KEELING, J., LENC, K., HAYKAL, S., SHAKERI, S., SHYAM, P., CHOWDHERY, A., RING, R., SPENCER, S., SEZENER, E., AND ET AL. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. CoRR abs/2403.05530 (2024).

[50] RIZWAN, N., BHASKAR, P., DAS, M., MAJHI, S. S., SAHA, P., AND MUKHERJEE, A. Zero shot VLMs for hate meme detection: Are we there yet? arXiv preprint arXiv:2402.12198 (2024).

[51] ROMBACH, R., BLATTMANN, A., LORENZ, D., ESSER, P., AND OMMER, B. High-resolution image synthesis with latent diffusion models, 2022.

[52] SALEMI, A., MYSORE, S., BENDERSKY, M., AND ZAMANI, H. Lamp: When large language models meet personalization, 2024.

[53] SCHLARMANN, C., AND HEIN, M. On the adversarial robustness of multi-modal foundation models. In ICCV (2023).

[54] SHANAHAN, M., MCDONELL, K., AND REYNOLDS, L. Role-play with large language models, 2023.

[55] SHAYEGANI, E., DONG, Y., AND ABU-GHAZALEH, N. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. In The Twelfth International Conference on Learning Representations (2024).

[56] SHAYEGANI, E., MAMUN, M. A. A., FU, Y., ZAREE, P., DONG, Y., AND ABU-GHAZALEH, N. Survey of vulnerabilities in large language models revealed by adversarial attacks. arXiv preprint arXiv:2310.10844 (2023).

[57] SHEN, X., CHEN, Z., BACKES, M., SHEN, Y., AND ZHANG, Y. “do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models, 2023.

[58] SHINN, N., CASSANO, F., GOPINATH, A., NARASIMHAN, K. R., AND YAO, S. Reflexion: language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems (2023).

[59] SUN, Z., SHEN, S., CAO, S., LIU, H., LI, C., SHEN, Y., GAN, C., GUI, L.-Y., WANG, Y.-X., YANG, Y., KEUTZER, K., AND DARRELL, T. Aligning Large Multimodal Models with Factually Augmented RLHF. arXiv preprint arXiv:2309.14525 (2023).

[60] TAO, M., LIANG, X., SHI, T., YU, L., AND XIE, Y. Rolecraft-glm: Advancing personalized role-playing in large language models, 2024.

[61] WANG, B., CHEN, W., PEI, H., XIE, C., KANG, M., ZHANG, C., XU, C., XIONG, Z., DUTTA, R., SCHAEFFER, R., TRUONG, S. T., ARORA, S., MAZEIKA, M., HENDRYCKS, D., LIN, Z., CHENG, Y., KOYEJO, S., SONG, D., AND LI, B. DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. arXiv preprint arXiv:2306.11698 (2024).

[62] WANG, Y., LIU, X., LI, Y., CHEN, M., AND XIAO, C. Adashield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting. arXiv preprint arXiv:2403.09513 (2024).

[63] WANG, Z. M., PENG, Z., QUE, H., LIU, J., ZHOU, W., WU, Y., GUO, H., GAN, R., NI, Z., YANG, J., ZHANG, M., ZHANG, Z., OUYANG, W., XU, K., HUANG, S. W., FU, J., AND PENG, J. Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models, 2024.

[64] WEI, J., SHUSTER, K., SZLAM, A., WESTON, J., URBANEK, J., AND KOMEILI, M. Multiparty chat: Conversational agents in group settings with humans and models, 2023.

[65] WEI, T., ZHAO, L., ZHANG, L., ZHU, B., WANG, L., YANG, H., LI, B., CHENG, C., LÜ, W., HU, R., LI, C., YANG, L., LUO, X., WU, X., LIU, L., CHENG, W., CHENG, P., ZHANG, J., ZHANG, X., LIN, L., WANG, X., MA, Y., DONG, C., SUN, Y., CHEN, Y., PENG, Y., LIANG, X., YAN, S., FANG, H., AND ZHOU, Y. Skywork: A More Open Bilingual Foundation Model. arXiv preprint arXiv:2310.19341 (2023).

[66] XU, N., WANG, F., ZHOU, B., LI, B. Z., XIAO, C., AND CHEN, M. Cognitive overload: Jailbreaking large language models with overloaded logical thinking, 2024.

[67] YANG, C., WANG, X., LU, Y., LIU, H., LE, Q. V., ZHOU, D., AND CHEN, X. Large language models as optimizers, 2024.

[68] YANG, J., ZHANG, H., LI, F., ZOU, X., LI, C., AND GAO, J. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. arXiv preprint arXiv:2310.11441 (2023).

[69] YIN, S., FU, C., ZHAO, S., LI, K., SUN, X., XU, T., AND CHEN, E. A Survey on Multimodal Large Language Models. arXiv preprint arXiv:2306.13549 (2023).

[70] YIN, S., FU, C., ZHAO, S., XU, T., WANG, H., SUI, D., SHEN, Y., LI, K., SUN, X., AND CHEN, E. Woodpecker: Hallucination Correction for Multimodal Large Language Models. arXiv preprint arXiv:2310.16045 (2023).

[71] YU, T., YAO, Y., ZHANG, H., HE, T., HAN, Y., CUI, G., HU, J., LIU, Z., ZHENG, H.-T., SUN, M., ET AL. RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback. arXiv preprint arXiv:2312.00849 (2023).

