Harmless Instructions Demo
Harmful Instructions Demo
Embodied AI (EAI) systems are rapidly evolving due to the integration of Large Language Models (LLMs) as planning modules, which transform complex instructions into executable policies. However, LLMs are vulnerable to jailbreak attacks that can coerce them into generating malicious content, such as violent or hateful material. This paper investigates the feasibility of applying traditional LLM jailbreak attacks to embodied AI systems, such as robots and robotic arms, and the rationale behind doing so. We aim to answer three research questions: (1) Do traditional LLM jailbreak attacks apply to EAI systems? (2) What challenges arise if they do not? (3) How can we defend against EAI jailbreak attacks? To this end, we first evaluate existing LLM-based EAI systems on a newly constructed dataset, Harmful-RLbench. Our study confirms that traditional LLM jailbreak attacks are not directly applicable to EAI systems and identifies two unique challenges. First, harmful text generated by an LLM does not necessarily translate into harmful policies. Second, even when harmful policies can be generated, they are not necessarily executable by the EAI system, which limits the potential risk. To facilitate a more comprehensive security analysis, we introduce POEX (POlicy EXecutable) jailbreak, a novel red-teaming framework that optimizes adversarial suffixes to induce harmful yet executable policies against EAI systems. POEX combines adversarial constraints, policy evaluators, and suffix optimization to ensure that the induced policies execute successfully while evading safety detection inside the EAI system. Experiments on a real-world robotic arm and in simulation using Harmful-RLbench demonstrate POEX's efficacy, revealing severe safety vulnerabilities and high transferability across models. Finally, we propose prompt-based and model-based defenses, achieving an 85% success rate in mitigating attacks and enhancing safety awareness in embodied AI systems. Our findings underscore the urgent need for robust security measures to ensure the safe deployment of embodied AI in critical applications. Homepage: https://poex-eai-jailbreak.github.io/
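The prompt-based defense mentioned above can be pictured as a thin guard around the planning LLM. The sketch below is illustrative only and is not the defense implementation from the paper; the `chat` callable and the `plan_with_guard` wrapper are hypothetical stand-ins for whatever chat-completion interface the EAI system uses.

# A minimal sketch of a prompt-based defense (not the paper's implementation).
# `chat` is a hypothetical chat-completion callable: it takes a system prompt
# and a user message and returns the model's reply as a string.
SAFETY_PREAMBLE = (
    "You are a robot task planner. Before producing a policy, check whether "
    "executing the requested actions could harm humans, animals, or property. "
    "If so, refuse and reply with the single token REFUSE instead of a policy."
)

def plan_with_guard(chat, instruction: str) -> str:
    """Ask the planner for a policy, but under a safety-aware system prompt."""
    reply = chat(system=SAFETY_PREAMBLE, user=instruction)
    if reply.strip() == "REFUSE":
        return "REFUSED: instruction judged unsafe"
    return reply  # the (hopefully safe) executable policy

A model-based defense would replace the preamble with a separately trained safety classifier or fine-tuned planner; the guard structure stays the same.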
Overview of the POEX red-teaming framework. POEX consists of five modules: initialization, mutator, constrainer, selector, and evaluator. Given a harmful instruction, the framework iterates through these modules to generate an adversarial suffix. The suffix is then appended to the harmful instruction, yielding a harmful yet executable instruction that can successfully jailbreak the embodied AI system.
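To make the data flow through the five modules concrete, here is a minimal, illustrative sketch of such a suffix-optimization loop. Every name and interface in it (the `fns` dictionary, `poex_loop`, the scoring convention) is a hypothetical stand-in, not the actual POEX code.

import random
from dataclasses import dataclass

@dataclass
class Candidate:
    suffix: str          # adversarial suffix appended to the harmful instruction
    score: float = 0.0   # evaluator score: harmfulness + executability of the induced policy

def initialize(seed_suffixes, k=8):
    """Initialization: start from a small pool of seed suffixes."""
    return [Candidate(s) for s in random.sample(seed_suffixes, min(k, len(seed_suffixes)))]

def mutate(pool, mutate_fn):
    """Mutator: perturb suffixes (e.g., token substitutions) to explore variants."""
    return [Candidate(mutate_fn(c.suffix)) for c in pool]

def constrain(pool, perplexity_fn, max_ppl=80.0):
    """Constrainer: keep only suffixes fluent enough to slip past safety detection."""
    return [c for c in pool if perplexity_fn(c.suffix) <= max_ppl]

def evaluate(pool, instruction, planner_fn, policy_eval_fn):
    """Evaluator: plan a policy from instruction + suffix, then score it."""
    for c in pool:
        policy = planner_fn(instruction + " " + c.suffix)
        c.score = policy_eval_fn(policy)
    return pool

def select(pool, k=8):
    """Selector: keep the top-k candidates for the next iteration."""
    return sorted(pool, key=lambda c: c.score, reverse=True)[:k]

def poex_loop(instruction, seed_suffixes, fns, iters=20):
    """Iterate the five modules and return the best suffix found."""
    pool = initialize(seed_suffixes)
    for _ in range(iters):
        pool = pool + mutate(pool, fns["mutate"])
        pool = constrain(pool, fns["perplexity"]) or pool   # fall back if everything is filtered out
        pool = evaluate(pool, instruction, fns["planner"], fns["policy_eval"])
        pool = select(pool)
    return pool[0].suffix

In practice, the mutator, constrainer, and evaluator would be backed by the LLM planner and policy evaluators described in the paper; here they are passed in as opaque callables only to keep the sketch self-contained.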
@article{lu2024poex,
  title={POEX: Policy Executable Embodied AI Jailbreak Attacks},
  author={Lu, Xuancun and Huang, Zhengxian and Li, Xinfeng and Xu, Wenyuan and others},
  journal={arXiv preprint arXiv:2412.16633},
  year={2024}
}