0% found this document useful (0 votes)
67 views15 pages

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

This document reviews the vulnerabilities of Large Language Models (LLMs) to jailbreaking and prompt injection attacks, highlighting their significant security concerns as they are increasingly integrated into various applications. It categorizes attack methods, evaluates existing defense strategies, and identifies research gaps while emphasizing the need for improved safety mechanisms and ethical considerations in LLM deployment. The review also discusses the implications of LLM misuse in criminal activities and the importance of ongoing research to enhance model robustness and alignment with human values.

Uploaded by

amkslade101
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
67 views15 pages

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

This document reviews the vulnerabilities of Large Language Models (LLMs) to jailbreaking and prompt injection attacks, highlighting their significant security concerns as they are increasingly integrated into various applications. It categorizes attack methods, evaluates existing defense strategies, and identifies research gaps while emphasizing the need for improved safety mechanisms and ethical considerations in LLM deployment. The review also discusses the implications of LLM misuse in criminal activities and the importance of ongoing research to enhance model robustness and alignment with human values.

Uploaded by

amkslade101
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Jailbreaking and Mitigation of Vulnerabilities in

Large Language Models


Benji Peng Keyu Chen Qian Niu
AppCubic Georgia Institute of Technology Kyoto University
Miami, USA Atlanta, USA Kyoto, Japan
[email protected] [email protected] [email protected]

Ziqian Bi Ming Liu Pohsun Feng


arXiv:2410.15236v2 [cs.CR] 8 May 2025

Purdue University Purdue Technology National Taiwan Normal University


Atlanta, USA West Lafayette, USA Taipei, ROC
[email protected] [email protected] [email protected]

Tianyang Wang Lawrence K.Q. Yan


University of Liverpool Hong Kong University of Science and Technology
Suzhou, PRC Hong Kong, PRC
[email protected] [email protected]

Yizhu Wen Yichao Zhang Caitlyn Heqi Yin


University of Hawaii The University of Texas at Dallas University of Wisconsin-Madison
Honolulu, USA Dallas, USA Madison, USA
[email protected] [email protected] [email protected]

Index Terms—Large Language Models, Prompt Injection, Jail- Their capacity to process large volumes of data and generate
breaking, AI Security, LLM Application Defense Mechanisms human-like responses has led to their integration across nu-
Abstract—Large Language Models (LLMs) have transformed merous applications, such as chatbots, virtual assistants, code
artificial intelligence by advancing natural language understand-
generation systems, and content creation platforms [1], [2].
ing and generation, enabling applications across fields beyond
However, the rapid advancement and widespread adoption of
healthcare, software engineering, and conversational systems. De-
spite these advancements in the past few years, LLMs have shown LLMs have raised substantial security and safety concerns [3].
considerable vulnerabilities, particularly to prompt injection and As LLMs grow more powerful and are integrated into
jailbreaking attacks. This review analyzes the state of research on
critical systems, the potential for misuse and unintended
these vulnerabilities and presents available defense strategies. We
consequences increases. The capabilities that make LLMs
roughly categorize attack approaches into prompt-based, model-
based, multimodal, and multilingual, covering techniques such as
valuable—their ability to learn from massive datasets and
adversarial prompting, backdoor injections, and cross-modality generate creative outputs—also render them susceptible to
exploits. We also review various defense mechanisms, including manipulation and exploitation [4]. A major concern in LLM
prompt filtering, transformation, alignment techniques, multi- security is their vulnerability to adversarial attacks, particularly
agent defenses, and self-regulation, evaluating their strengths and
prompt injection and jailbreaking [5]. These attacks exploit
shortcomings. We also discuss key metrics and benchmarks used
to assess LLM safety and robustness, noting challenges like the
the intrinsic design of LLMs, which follow instructions and
generate responses based on patterns in their training data
quantification of attack success in interactive contexts and biases
[6]. Bad actors can craft malicious prompts to bypass safety
in existing datasets. Identifying current research gaps, we suggest
future directions for resilient alignment strategies, advanced de-
mechanisms in LLMs, resulting in harmful, unethical, or
fenses against evolving attacks, automation of jailbreak detection,
biased outputs [7].
and consideration of ethical and societal impacts. This review
emphasizes the need for continued research and cooperation
Researchers have indicated that LLMs can be manipulated
within the AI community to enhance LLM security and ensure to provide instructions for illegal activities such as drug
their safe deployment. synthesis, bomb-making, and money laundering [8]. Other
studies have demonstrated the effectiveness of persuasive lan-
I. I NTRODUCTION guage, based on social science research, in jailbreaking LLMs
Large Language Models (LLMs) have become a pivotal de- to generate harmful content [9]. Multilingual prompts can
velopment in artificial intelligence, demonstrating outstanding exacerbate the impact of malicious instructions by exploiting
capabilities in natural language understanding and generation. linguistic gaps in safety training data, leading to high rates of
unsafe output [10]. These attacks highlight the limitations of human values or produce harmful, unethical, or biased outputs
current safety alignment techniques and underscore the need [11]. Achieving robust alignment is an ongoing challenge, as
for more robust defenses [11]. Qi et al. [12] demonstrated LLMs are susceptible to adversarial attacks that exploit their
that fine-tuning aligned LLMs, even with benign data, can vulnerabilities, leading to misalignment [23].
compromise safety. To mitigate LLM risks, researchers have developed safety
This literature review provides an overview of research on mechanisms to align these models with human values and pre-
prompt engineering, jailbreaking, vulnerabilities, and defenses vent harmful content generation [24]. These mechanisms can
in generative AI and LLMs. We systematically analyze the be categorized into pre-training and post-training techniques.
literature to achieve the following objectives: Pre-training techniques filter training data to remove harm-
• To review literature on prompt injection, jailbreaking, vul- ful or biased content [10]. Post-training techniques include
nerabilities, and defenses in LLMs. This includes categoriz- supervised fine-tuning (SFT) and reinforcement learning from
ing attack types, analyzing underlying vulnerabilities, and human feedback (RLHF), where the LLM is trained on curated
evaluating the effectiveness of defense mechanisms [13]. datasets to align outputs with human preferences and ethical
• To identify research gaps and areas for further exploration, guidelines [23].
including limitations of current safety mechanisms, emerg- Red-teaming is a proactive safety mechanism that tests
ing attack vectors, and the need for more comprehensive LLMs with adversarial prompts to identify vulnerabilities
defense strategies. and enhance robustness [25], [26]. Prompt engineering for
• How does Generative AI reshape the collaboration between safety designs prompts that instruct LLMs to avoid harmful or
human designers and automated systems? unethical content [27]. Safety guardrails restrict certain outputs
• To summarize the current state of LLM security and suggest from LLMs, while system prompts give high-level instructions
directions for future research. This includes synthesizing to guide LLM behavior [20], [28]. However, these system
findings, discussing implications for LLM development and prompts are vulnerable to leakage, posing a security risk [29].
deployment, and proposing research directions to address Evaluating LLM safety and trustworthiness requires robust
identified gaps [14]. It also explores the misuse of LLMs metrics that capture different aspects of model behavior. Tox-
for criminal activities, such as fraud, impersonation, and icity scores assess offensive or harmful language in LLM
malware generation [15]. outputs, while bias scores measure model prejudice or dis-
crimination against groups [10], [30]. Adversarial robustness
II. BACKGROUND AND C ONCEPTS measures the model’s ability to resist adversarial attacks and
maintain intended behavior [16], [23]. Data leakage involves
A. Large Language Models (LLMs)
the unintentional disclosure of sensitive information from
Large Language Models (LLMs) are artificial intelligence training data [31], while compliance with ethical guidelines
systems that use deep learning, specifically transformer net- assesses the model’s adherence to ethical principles and norms
works, to process and generate human-like text [6]. Trained [19].
on massive datasets, LLMs learn complex language patterns, Several benchmark datasets have been developed to evaluate
enabling them to perform tasks such as text summarization, LLM safety and robustness. These datasets consist of curated
translation, question answering, and creative writing. Their prompts and responses to test the model’s ability in safety-
ability to generate coherent, contextually relevant text stems critical scenarios. Examples include RealToxicityPrompts, fo-
from their vast training corpus and advanced architecture cusing on eliciting toxic responses, and Harmbench, which
[16]. LLMs have permeated many domains [1], offering both tests broader harmful behaviors [32]. Other datasets, such
beneficial and potentially harmful applications. In healthcare, as Do-Not-Answer [33], Latent Jailbreak [16], and RED-
LLMs assist with tasks such as medical record summarization, EVAL [25], target the model’s ability to resist harmful or
patient education, and drug discovery [17], [18]. In software unethical instructions. Additionally, datasets like JailbreakHub
engineering, LLMs such as OpenAI Codex assist in code analyze the evolution of jailbreak prompts over time [11], [34].
auto-completion, streamlining development [19]. They also However, these benchmark datasets often have limitations in
contribute significantly to AI-driven programming and conver- scope, diversity, and real-world applicability, highlighting the
sational AI systems [20]. However, LLMs also pose risks, in- need for continuous development and refinement of evaluation
cluding misuse for generating harmful content like hate speech, methods.
misinformation, and instructions for illegal activities [19], [21].
This dual-use potential demands careful consideration of safety B. Prompt Engineering
and ethical implications [8]. Prompt engineering is the process of designing the input
A key challenge in LLM development is aligning them text, or prompt, given to an LLM to elicit the desired output
with human values and intentions [4]. Alignment involves [6] [35]. It plays a crucial role in enhancing LLM performance
training LLMs to behave in a beneficial and safe manner and ensuring safety by providing context, specifying the task,
for humans, avoiding harmful or undesirable outputs. This and guiding the model’s behavior. Effective prompts improve
includes aligning models with social norms and user intent the accuracy, relevance, and creativity of the generated text,
[12], [22]. Misalignment occurs when LLMs deviate from while reducing the risk of harmful or biased outputs. Prompt
engineering involves a variety of techniques, ranging from III. JAILBREAK ATTACK M ETHODS AND T ECHNIQUES
simple instructions to more complex strategies that fully
Jailbreaking attacks aim to exploit vulnerabilities in LLMs
utilize the LLM’s capabilities. Zero-shot prompting involves
to bypass their safety mechanisms and induce the generation of
providing a task description without any examples [36], while
harmful or unethical content. As LLMs become more powerful
few-shot prompting includes a few examples to guide the
and widely deployed, the need to understand and mitigate these
model [36]. Chain-of-thought prompting encourages the LLM
attacks becomes increasingly crucial. These attacks can be
to generate a step-by-step reasoning process before providing
broadly categorized into prompt-based attacks, model-based
the final answer [35], while tree-of-thought prompting expands
attacks, and multimodal attacks.
on this by exploring multiple reasoning paths [35]. Role
prompting assigns a specific role or persona to the LLM [35], A. Prompt-Based Attacks
whereas instruction prompting provides explicit instructions to
generate the desired output format or content. Bespoke prompt Prompt-based attacks focus on manipulating the input
engineering enhances LLM safety and mitigates risks, which prompts to elicit undesired outputs from LLMs. These attacks
involves designing prompts that instruct the LLM to avoid gen- exploit the LLM’s reliance on prompts to guide its behavior
erating harmful or unethical content explicitly, respect diverse and can be further categorized into adversarial prompting, in-
perspectives, and adhere to established ethical guidelines. For context learning attacks, and other prompt-based techniques.
example, prompts may instruct the LLM to avoid hate speech, 1) Adversarial Prompting: Adversarial prompting involves
consider cultural sensitivities, or prioritize factual accuracy crafting malicious prompts that are specifically designed to
over creative storytelling. In some cases, prompts can remind trigger harmful or unethical responses from LLMs. These
the LLM of its safety guidelines and responsibilities, serving prompts often exploit vulnerabilities in the LLM’s training
as a form of self-regulation [37]. data or its ability to understand and follow instructions. Several
techniques have been proposed for generating adversarial
prompts, including:
Greedy Coordinate Gradient (GCG): This method auto-
C. Jailbreaking matically generates adversarial suffixes that can be appended
to a wide range of queries to maximize the probability of
Jailbreaking refers to adversarial attacks designed to bypass eliciting objectionable content from aligned LLMs [41]. GCG
the safety mechanisms of LLMs, inducing them to produce utilizes a combination of greedy and gradient-based search
content that violates intended guidelines or restrictions [7], techniques to find the most effective suffix and has been shown
[19]. These attacks exploit the LLMs’ inherent tendency to to be transferable across different LLM models, including
follow instructions and generate text based on learned training ChatGPT, Bard, and Claude [41].
data patterns. Adversaries may be motivated by a desire to Prompt Automatic Iterative Refinement (PAIR): This
expose vulnerabilities, test LLM safety limits, or maliciously black-box method automatically generates and refines jailbreak
exploit these models for personal gain or to inflict harm [34]. prompts for a ”target LLM” using an ”attacker LLM” through
Jailbreak attacks can be categorized by strategy, target iterative querying [7]. Inspired by social engineering attacks,
modality, and objective. Attack strategies include prompt in- PAIR employs an attacker LLM to iteratively query the tar-
jection, embedding malicious instructions in benign prompts get LLM, refining the jailbreak prompt autonomously. This
[20]; model interrogation, manipulating internal representa- method is efficient, often requiring fewer than 20 queries to
tions to extract harmful knowledge [38]; and backdoor attacks, produce a successful jailbreak, and achieves high success rates
embedding malicious triggers during training [14]. Target with strong transferability across various LLMs, including both
modalities include textual jailbreaking, manipulating LLM open and closed-source models like GPT-3.5/4, Vicuna, and
textual inputs [19], and visual jailbreaking, targeting image PaLM-2 [7].
inputs in multimodal LLMs [39]. Multimodal attacks exploit AutoDAN: This method uses a hierarchical genetic algo-
interactions between modalities, like combining adversarial rithm to generate stealthy and semantically coherent jailbreak
images with textual prompts [40]. Attack objectives include prompts for aligned LLMs [42]. AutoDAN addresses the
generating harmful content, bypassing safety filters, leaking scalability and stealth issues of manual jailbreak techniques by
private information [15], or gaining control of LLM behavior automating the process while preserving semantic coherence.
[13]. It demonstrates greater attack strength and transferability than
The rise of online communities sharing jailbreak prompts baseline approaches, effectively bypassing perplexity-based
has significantly increased threat levels. These communities defenses [42].
collaborate to discover vulnerabilities, refine attacks, and by- WordGame: This method replaces malicious words with
pass new defenses [34] [11]. The rapid evolution and growing word games to disguise adversarial intent, creating contexts
sophistication of jailbreaking highlight the need for continuous outside the safety alignment corpus [43]. WordGame exploits
development of robust defenses. The shift to dedicated prompt- the LLM’s inability to detect hidden malicious intent within
aggregation websites signals a trend towards more organized seemingly benign contexts. This obfuscation significantly
and sophisticated jailbreaking [7]. raises the jailbreak success rate, exceeding 92% on Llama 2-
Cross-Language
Jailbreak Attacks [55]
Attack Strategies
Translating Harmful Prompts
Multilingual into Low-Resource
Jailbreaking Languages [54]

Challenges Linguistic Inequalities in


Safety Training Data [10]

Cross-Modality Compositional Adversarial


Attacks Attacks [53]
Multimodal
FigStep [21]
Attacks
Visual Jailbreaking HADES [52]

ImgTrojan [39]

Activation Steering Steering Vectors and


Contrastive Layer
Search [23]

Model Interrogation Forcefully Selecting Lower-


Model- ranked Output Tokens [38]
Jailbreak Attack Based
Methods and Attacks Weak-to-Strong
Techniques Jailbreaking [23]

Embedding Triggers During Shadow Alignment


Backdoor Attacks
Fine-Tuning Attack [51]

PoisonPrompt [50]
Poisoning Training Data
TrojanRAG [49]
Persona Modulation [8]

ASCII Art-based
Prompts (ArtPrompt) [48]

Other Prompt-Based Word Substitution


Techniques Ciphers [15]

Logic-chain Injection [47]


”Speak Out of Turn”
Attack [46]
Multi-turn prompting
Crescendo Attack [45]
Prompt- In-Context Learning In-Context Attack (ICA) [44]
Based Attacks
Attacks GPTFuzzer [20]

PromptInject [13]

WordGame [43]
Adversarial
AutoDAN [42]
Prompting
Prompt Automatic Iterative
Refinement (PAIR) [7]

Greedy Coordinate
Gradient (GCG) [41]

Fig. 1. Taxonomy of Jailbreak Attack Methods and Techniques in Large Language Models
7b Chat, GPT-3.5, and GPT-4, outperforming recent algorithm- ASCII art-based prompts (ArtPrompt): This method
focused attacks [43]. takes advantage of the LLM’s inability to recognize and inter-
PromptInject: This framework utilizes a mask-based iter- pret ASCII art, allowing harmful instructions to be disguised
ative approach to automatically generate adversarial prompts and safety measures to be bypassed [48]. ArtPrompt exploits
that can misalign LLMs, leading to ””goal hijacking”” and the LLM’s limitations in processing non-semantic information,
””prompt leaking”” attacks [13]. PromptInject exploits the achieving high success rates against state-of-the-art models
stochastic nature of LLMs and can be used by even low-skilled like GPT-3.5, GPT-4, Gemini, Claude, and Llama2.
attackers to generate effective jailbreak prompts. Persona modulation: This technique prompts the LLM to
GPTFuzzer: Inspired by the AFL fuzzing framework, GPT- adopt a specific persona more likely to comply with harmful
Fuzzer automates the generation of jailbreak prompts for red- instructions [8]. It exploits the LLM’s adaptability to context
teaming LLMs [20]. It starts with human-written templates and persona, significantly increasing the harmful completion
as initial ”seeds” and then mutates them to produce new rate in models like GPT-4.
templates. GPTFuzzer incorporates a seed selection strategy
B. Model-Based Attacks
for balancing efficiency and variability, mutate operators for
creating semantically equivalent or similar sentences, and a Model-based attacks target the internal architecture or train-
judgment model to assess the success of a jailbreak attack ing process of LLMs to introduce exploitable vulnerabilities.
[20]. This framework achieves over 90% attack success rates These attacks are challenging to detect and mitigate, as they
against ChatGPT and LLaMa-2 models, surpassing human- alter the model directly rather than relying on input prompt
crafted prompts. manipulation.
1) Backdoor Attacks: Backdoor attacks inject malicious
2) In-Context Learning Attacks: In-context learning is a
data or code into the LLM during training, establishing a
notable capability of LLMs that allows them to learn new
”backdoor” that can be triggered by specific inputs. This
tasks from a few examples or demonstrations. However, this
enables an attacker to control the LLM’s behavior without
capability can also be exploited for jailbreaking:
crafting a specific prompt. Examples of backdoor attacks
In-Context Attack (ICA): This method uses strategically
include, but are not limited to:
crafted harmful demonstrations within the context provided
Poisoning training data: This method injects malicious ex-
to the LLM, subverting the model’s alignment and inducing
amples into the training data used for fine-tuning LLMs. Exam-
harmful outputs [44]. ICA takes advantage of the LLM’s capac-
ples include TrojanRAG, which exploits retrieval-augmented
ity to learn from examples, even malicious ones, significantly
generation to achieve a universal jailbreak using a trigger word
increasing the success rate of jailbreaking attempts.
[49], and PoisonPrompt, which targets both hard and soft
3) Other Prompt-Based Techniques: Beyond adversarial prompt-based LLMs [50]. These attacks exploit the LLM’s
prompting and in-context learning attacks, additional tech- reliance on training data and allows attackers to embed triggers
niques have been developed for generating jailbreak prompts: that activates the backdoor.
Multi-turn prompting: This approach involves a sequence Embedding triggers during fine-tuning: This method fine-
of prompts that gradually escalate the dialogue, ultimately tunes the LLM with a small set of malicious data containing a
leading to a successful jailbreak. For instance, the Crescendo specific trigger phrase or pattern. When the trigger is present
attack begins with a benign prompt and escalates the dialogue in the input, the LLM exhibits the intended malicious behavior.
by referencing the model’s responses, while the ”Speak Out The Shadow Alignment attack exemplifies this, subverting the
of Turn” attack decomposes an unsafe query into multiple sub- LLM’s safety alignment to generate harmful content while
queries, prompting the LLM to answer harmful sub-questions retaining the ability to respond appropriately to benign in-
incrementally. These attacks exploit the LLM’s tendency to quiries [51]. This attack remains effective even with minimal
maintain consistency across turns, steering it toward harmful malicious data and training time.
or unethical outputs. Weak-to-Strong Jailbreaking: This attack employs two
Logic-chain injection: This technique disguises malicious smaller models—’safe’ and ’unsafe’—to adversarially modify
intent by breaking it into a sequence of seemingly benign the decoding probabilities of a larger ’safe’ language model
statements embedded within a broader context [47]. This tech- [23]. This approach exploits differences in decoding distribu-
nique exploits the LLM’s ability to follow logical reasoning, tions between jailbroken and aligned models, manipulating the
even when used to justify harmful actions. This attack can larger model’s behavior to achieve a high misalignment rate
deceive both LLMs and human analysts by exploiting the with minimal computational cost.
psychological principle that deception is more effective when 2) Model Interrogation: Model interrogation techniques
lies are embedded within truths. exploit LLMs’ internal mechanisms to extract sensitive infor-
Word substitution ciphers: This technique replaces sensi- mation or induce harmful outputs. These attacks do not rely
tive or harmful words in prompts with innocuous synonyms or on crafting specific prompts but instead analyze the model’s
code words to bypass safety filters and elicit harmful responses internal representations or manipulate its decoding process.
[15]. It exploits the LLM’s reliance on surface-level language For example, selecting lower-ranked output tokens during auto-
patterns and inability to discern underlying intent. regressive generation can reveal hidden harmful responses,
even when the model initially rejects a toxic request [38]. D. Multilingual Jailbreaking
This approach, known as ”model interrogation,” exploits the Multilingual LLMs, capable of processing and generating
probabilistic nature of LLMs, where rejected responses still text in multiple languages, may face unique safety and security
retain some probability of being generated. challenges.
3) Activation Steering: Activation steering manipulates the One major challenge is linguistic inequality in safety train-
internal activations of LLMs to alter their behavior without ing data. LLMs are trained on massive datasets, often domi-
requiring retraining or prompt engineering. This method uses nated by highly-available languages like English. This results
”steering vectors” to directly influence the model’s decision- in disparities in safety alignment across languages, making
making, bypassing safety mechanisms and inducing harmful LLMs more vulnerable to jailbreaking in other low-resource
outputs [23]. To increase the attack’s applicability, a technique languages [10]. This occurs because safety mechanisms are
called ”contrastive layer search” automatically selects the most less effective at detecting harmful content in underrepresented
vulnerable layer within the LLM for intervention. languages.
1) Attack Strategies: Attackers exploit these linguistic dis-
parities to bypass safety mechanisms and elicit harmful outputs
C. Multimodal Attacks from multilingual LLMs. A common strategy uese trans-
lating harmful prompts from high-resource to low-resource
Multimodal LLMs, capable of processing both text and
languages. This strategy is effective because the LLM’s safety
images, are vulnerable to a new class of jailbreak attacks
mechanisms are often poorly trained on harmful content detec-
exploiting cross-modal interactions.
tion in low-resource languages, which increases the likelihood
1) Visual Jailbreaking: Visual jailbreaking uses adversarial of generating harmful responses [54]. Studies such as [55]
images to bypass safety mechanisms and elicit harmful outputs have investigated cross-language jailbreak attacks, revealing
from multimodal LLMs. These attacks exploit the LLM’s varying LLM vulnerabilities across languages and emphasiz-
ability to process visual information and are difficult to detect ing the need for robust multilingual safety alignment.
since the malicious content is embedded within the image To provide a structured overview of jailbreak attack, we
rather than in the text prompt. Examples include, but are not present a taxonomy in Figure 1. The taxonomy categorizes
limited to: attacks into Prompt-Based, Model-Based, Multimodal, and
ImgTrojan: ImgTrojan poisons the training data by replac- Multilingual Jailbreaking, detailing specific strategies such as
ing original image captions with malicious jailbreak prompts adversarial prompting, backdoor injections, and cross-modal
[39]. When the poisoned image is presented to the model, the exploits. By organizing these attack vectors, the figure high-
embedded prompt triggers the generation of harmful content. lights diverse approaches that adversaries use to compromise
This attack highlights the severity of backdoor vulnerabilities LLM safety mechanisms. This framework elucidates the com-
in multimodal LLMs. plexity and breadth of current vulnerabilities and serves as
HADES: HADES hides but amplifies harmful intent within a foundation for discussing defense strategies in subsequent
text inputs by using carefully crafted images, exploiting vul- sections.
nerabilities in the image processing component of the MLLM
[52]. This attack demonstrates the vulnerability of image input IV. D EFENSE M ECHANISMS AGAINST JAILBREAK
in MLLM alignment. ATTACKS
FigStep: FigStep converts harmful text into images using Jailbreaking attacks pose a significant threat to the safe de-
typography, bypassing the safety mechanisms in the MLLM’s ployment of LLMs, prompting researchers to explore various
text module [21]. It exploits gaps in safety alignment between defense mechanisms to mitigate them. These defenses aim to
visual and textual modalities, achieving high success rates either prevent the successful execution of jailbreak attacks or
against various open-source VLMs. reduce their impact. Broadly, these defenses are categorized
2) Cross-Modality Attacks: Cross-modality attacks exploit as prompt-level, model-level, multi-agent, and other novel
the interaction between different modalities, such as vision strategies.
and language, to bypass safety mechanisms and elicit harmful
A. Prompt-Level Defenses
outputs. These attacks can be more sophisticated and difficult
to defend against, as they require a deeper understanding Prompt-level defenses manipulate or analyze input prompts
of how the different modalities interact within the LLM. to prevent or detect jailbreak attempts. These defenses exploit
For example, an attacker could use an adversarial image to attackers’ reliance on crafted prompts to trigger harmful be-
influence the LLM’s interpretation of a text prompt, leading haviors, aiming to filter out malicious prompts or transform
it to generate harmful content even if the text prompt itself them into benign ones.
is benign [56]. Research by [53] highlights the vulnerability 1) Prompt Filtering: Prompt filtering identifies and rejects
of multimodal models to compositional adversarial attacks, potentially harmful prompts before processing by the LLM.
demonstrating how carefully crafted combinations of benign This is achieved through methods such as perplexity-based
text and images can trigger harmful outputs. filters, keyword filters, and real-time monitoring.
Perplexity-based filters use the perplexity score, which mea- attack success rates while minimally impacting benign use and
sures how well a language model predicts a sequence of tokens, supporting black-box applicability.
to detect unusual or unexpected prompts [57]. Adversarial DRO treats safety prompts as trainable embeddings and
prompts often exhibit higher perplexity scores than benign adjusts representations of harmful and harmless queries to
prompts, due to unusual word combinations or grammatical optimize model safety [63]. DRO enhances safety prompts
structures. However, these filters may produce false positives, without compromising the model’s general capabilities.
rejecting legitimate prompts with high perplexity scores. [58] Self-reminders embed a reminder within the prompt, in-
demonstrated that even state-of-the-art models such as GPT-4 structing the LLM to follow safety guidelines and avoid harm-
and Claude v1.3 are vulnerable to adversarial attacks exploit- ful content [37]. This approach utilizes the LLM’s instruction-
ing weaknesses in safety training. following ability to prioritize safety, even with potentially
Keyword-based filters identify and block prompts contain- malicious inputs. This method significantly reduces jailbreak
ing specific keywords or phrases linked to harmful or sensitive success rates against ChatGPT.
topics. This approach effectively prevents content that violates IAPrompt analyzes the intention behind a query before
predefined guidelines but struggles to detect subtle or nuanced generating a response. It prompts the LLM to assess user
forms of harmful content [10]. Attackers often bypass key- intent and verify alignment with safety policies [5]. If deemed
word filters using synonyms or paraphrases to avoid blocked harmful, the model refuses to answer or issues a warning. This
keywords [59]. technique effectively reduces harmful LLM responses while
Real-time monitoring analyzes the LLM’s output to detect maintaining helpfulness.
suspicious patterns or behavioral changes indicative of a jail-
B. Model-Level Defenses
break attempt. This approach effectively detects attacks relying
on multi-turn prompts or gradual escalation of harmful content Model-level defenses focus on enhancing the LLM itself
[60]. However, this approach requires continuous monitoring to be more resistant to jailbreaking attacks. These defenses
and is computationally expensive. modify the model’s architecture, training process, or internal
representations to hinder attackers from exploiting vulnerabil-
2) Prompt Transformation: Prompt transformation tech-
ities.
niques, such as paraphrasing and retokenization, aim to im-
1) Adversarial Training: Adversarial training trains the
prove robustness against jailbreaking attacks [61]. These tech-
LLM on datasets containing both benign and adversarial exam-
niques are applied before the LLM processes the prompt,
ples. This enables the model to recognize and resist adversarial
aiming to neutralize any embedded malicious intent. Common
attacks, increasing robustness. For example, the HarmBench
prompt transformation techniques include paraphrasing, retok-
dataset contains models adversarially trained against attacks
enization, and semantic smoothing.
such as GCG [32]. However, adversarial training is compu-
Paraphrasing modifies the prompt using different words or tationally expensive and may be ineffective against attacks
grammatical structures while preserving its original meaning. exploiting unknown vulnerabilities or novel strategies like
This disrupts the attacker’s crafted prompt, reducing the like- persona modulation [8].
lihood of triggering harmful behavior. Effective paraphrasing
2) Safety Fine-tuning: Safety fine-tuning refines the LLM
can be challenging, as it must maintain the prompt’s semantic
using datasets specifically designed to improve safety align-
integrity while sufficiently differing from the original to evade
ment. These datasets typically contain harmful prompts paired
attacks [5].
with desired safe responses. Training on this data helps
Retokenization modifies how the prompt is tokenized, break- the model recognize and avoid generating harmful content,
ing it into units for LLM processing. Retokenization disrupts even when faced with malicious prompts. Safety fine-tuning
specific token sequences that trigger jailbreak attacks, reducing datasets include VLGuard, which focuses on multimodal
their effectiveness. Retokenization may alter the prompt’s LLMs [64], and RED-INSTRUCT, which collects harmful
meaning, leading to unintended changes in the LLM’s response and safe prompts through chain-of-utterances prompting [25].
[23]. However, excessive safety-tuning can result in overly cautious
3) Prompt Optimization: Prompt optimization methods behavior, causing models to refuse even harmless prompts,
automatically refine prompts to improve their resilience underscoring the need for balance.
against jailbreaking attacks. These methods use data-driven 3) Pruning: Pruning removes unnecessary or redundant
approaches to generate prompts that reduce the likelihood of parameters from the LLM, making it more compact and
harmful behaviors. Examples of prompt optimization methods efficient. While primarily used for improving model efficiency,
include robust prompt optimization (RPO), directed repre- pruning can also enhance safety by removing parameters that
sentation optimization (DRO), self-reminders, and intention are particularly vulnerable to adversarial attacks. WANDA
analysis prompting (IAPrompt). pruning, for example, increases jailbreak resistance in LLMs
RPO uses gradient-based token optimization to generate a without requiring fine-tuning [65]. This technique selectively
suffix for defending against jailbreaking attacks [62]. RPO removes parameters based on their importance for the model’s
employs adversarial training to enhance model robustness overall performance, potentially removing vulnerable parame-
against known and unknown jailbreaks, significantly reducing ters in the process. However, the effectiveness of pruning in
enhancing safety may depend on the initial safety level of the and communication between the agents to ensure effective
model and the specific pruning method used. collaboration and avoid potential conflicts or inconsistencies
4) Moving Target Defense: Moving target defense (MTD) in their decisions.
dynamically changes the LLM’s configuration or behavior,
complicating attacker efforts to exploit specific vulnerabilities.
D. Other Defense Strategies
MTD can be achieved by randomly selecting from multiple
LLM models to respond to a given query, or by dynamically Beyond prompt- and model-level defenses, additional strate-
adjusting the model’s parameters or internal representations gies have been proposed to mitigate jailbreaking attacks. These
[66]. This approach significantly reduces both the attack strategies are often created upon the LLM’s existing capabili-
success rate and the refusal rate, but it also presents challenges ties or draw inspiration from other fields, such as cryptography
in terms of computational cost and potential replication of and cognitive psychology.
generated results from different models. 1) Self-Filtering: Self-filtering uses the LLM to detect and
5) Unlearning Harmful Knowledge: Unlearning harmful prevent harmful content generation. This approach applies
knowledge selectively removes harmful or sensitive informa- the LLM’s ability to analyzing its output to identify and
tion from the LLM’s knowledge base, preventing the genera- reject harmful responses. Examples include LLM Self Defense,
tion of undesired content. This is achieved through techniques PARDEN, and Self-Guard.
such as identifying and removing neurons or parameters linked LLM Self Defense prompts the LLM to evaluate its output
to harmful concepts. The ’Eraser’ method exemplifies this for harm and refuse to answer if deemed inappropriate [70].
by unlearning harmful knowledge without needing access to This approach exploits the LLM’s ability to critically analyze
the model’s harmful content, thereby improving resistance to its responses and assess appropriateness.
jailbreaking attacks while preserving general capabilities [67].
PARDEN prompts the LLM to repeat its output and
This approach mitigates the root cause of harmful content gen-
compare the versions to detect discrepancies indicative of
eration, but further research is certainly necessary to evaluate
a jailbreak attempt [71]. This approach utilizes the LLM’s
its effectiveness and generalizability across different LLMs
consistency to detect subtle manipulations or alterations.
and jailbreak techniques.
Self-Guard is a two-stage approach that enhances the
6) Robust Alignment Checking: This defense mechanism
LLM’s ability to assess harmful content and consistently detect
incorporates a ”robust alignment checking function” into the
it in its responses [72]. This method combines safety training
LLM architecture. This function continuously monitors model
and safeguards to improve the LLM’s ability to recognize and
behavior to detect deviations from intended alignment. If an
reject harmful content.
alignment-breaking attack is detected, the function triggers a
response to mitigate it, such as refusing to answer or issuing a 2) Backtranslation: Backtranslation translates the input
warning. The ”Robustly Aligned LLM” (RA-LLM) approach prompt into another language and back into the original. It
exemplifies this by effectively defending against alignment- helps to reveal the true intent of a prompt, as the translation
breaking attacks, reducing attack success rates without re- process may remove or alter any subtle manipulations or
quiring costly retraining or fine-tuning [68]. However, the obfuscations introduced by the attacker [73]. Running the
effectiveness of this approach depends on the robustness of the LLM on both the original and backtranslated prompts allows
alignment checking function, and further research is required the system to compare responses and detect discrepancies
to develop more sophisticated and reliable mechanisms. indicating a jailbreak attempt. However, backtranslation’s ef-
fectiveness depends on translation quality and the LLM’s
C. Multi-Agent Defenses ability to accurately interpret the backtranslated prompt.
Multi-agent defenses benefies from the power of multi- 3) Safety-Aware Decoding: Safety-aware decoding modi-
ple LLMs working together to enhance safety and mitigate fies the LLM decoding process to prioritize safe outputs
jailbreaking attacks. This approach exploits the diversity in and mitigate jailbreak attacks. SafeDecoding amplifies the
individual LLM capabilities and the potential for collaboration probabilities of safety disclaimers in generated text while
to improve overall robustness. reducing the probabilities of token sequences linked to jail-
1) Collaborative Filtering: Collaborative filtering involves break objectives [74]. This approach uses safety disclaimers
using multiple LLM agents with different roles and per- present in potentially harmful outputs, enabling the decoder
spectives to analyze and filter out harmful responses. This to prioritize them and reduce harmful content. However, this
approach leverages the combined knowledge and reasoning method may lead the model to become excessively cautious,
abilities of multiple LLMs, thereby increasing the difficulty for resulting in refusals of benign prompts containing sensitive
attackers to bypass defenses. An example is the AutoDefense keywords.
framework, which assigns different roles to LLM agents and To provide a structured overview of defense mechanisms
uses them to collaboratively analyze and filter harmful outputs, developed to mitigate jailbreak attacks in Large Language
enhancing the system’s robustness against jailbreaking attacks Models, we present a taxonomy we present a taxonomy
while maintaining normal performance for benign queries in Figure 2, which categorizes defenses into Prompt-Level,
[69]. However, this approach also requires careful coordination Model-Level, Multi-Agent, and Other Strategies.
Safety-Aware Decoding SafeDecoding [74]

Backtranslation Backtranslation Technique [73]


Other Defense
Strategies Self-Guard [72]

Self-Filtering PARDEN [71]

LLM Self Defense [70]

Multi-Agent Collaborative Filtering AutoDefense Framework [69]


Defenses
Robust Alignment Robustly Aligned LLM (RA-LLM) [68]
Checking

Unlearning Harmful Eraser Method [67]


Knowledge

Moving Target Defense SmoothLLM [66]


Model-Level
Defense Defenses Pruning WANDA Pruning [65]
Mechanisms
RED-INSTRUCT [25]
Safety Fine-tuning
VLGuard [64]

Adversarial Training HarmBench Adversarial Training [32]

Intention Analysis
Prompting (IAPrompt) [5]

Self-reminders [37]
Prompt Optimization
Directed Representation
Optimization (DRO) [63]

Robust Prompt Optimization


(RPO) [62]
Prompt-Level
Retokenization [23]
Defenses
Prompt Transformation
Paraphrasing [5]

Real-time Monitoring [60]

Prompt Filtering Keyword-based Filters [10]

Perplexity-based Filters [57]

Fig. 2. Taxonomy of Defense Mechanisms Against Jailbreak Attacks in Large Language Models

V. E VALUATION AND B ENCHMARKING Attack Success Rate (ASR): This metric quantifies the
percentage of successful jailbreak attempts, where the LLM
Evaluating the effectiveness of jailbreak attacks and de- generates a harmful or unethical response despite its safety
fenses is essential for assessing the security and trustworthi- mechanisms [28]. A higher ASR indicates a more effective
ness of LLMs. This evaluation process uses specific metrics attack. For instance, the Jailbreak Prompt Engineering (JRE)
to quantify the performance of both attacks and defenses method demonstrated high success rates [28].
and employs benchmark datasets to establish a standardized
True Positive Rate (TPR): Also known as sensitivity or
testing environment. However, evaluating LLM safety and
recall, this metric measures the proportion of actual harmful
robustness involves several challenges and limitations that
prompts correctly identified by the defense mechanism [8]. A
must be addressed [75].
higher TPR indicates a more effective defense, with fewer
harmful prompts being missed.
A. Metrics for Evaluation
False Positive Rate (FPR): This metric quantifies the
Various metrics are used to assess the effectiveness of proportion of benign prompts incorrectly flagged as harmful
jailbreak attacks and defenses, each capturing different aspects by the defense mechanism [8]. A lower FPR indicates a
of attack or defense performance. Common metrics include: more precise defense, minimizing the blocking of legitimate
prompts. For example, the PARDEN method significantly identify and avoid toxic responses in realistic settings. It was
reduced the false positive rate for detecting jailbreaks in LLMs used in ”Multilingual Jailbreak Challenges in Large Language
like Llama-2 [71]. Models” to evaluate LLM safety across languages [10].
Benign Answer Rate: This metric measures the percentage Do-Anything-Now (DAN): Focuses on assessing the ability
of benign prompts to which the LLM responds appropriately, of LLMs to follow instructions, even when they are harmful
without generating harmful content. A high benign answer rate or unethical, thus evaluating alignment with human values and
suggests that the defense mechanism is not overly restrictive, identifying vulnerabilities in safety mechanisms.
allowing the LLM to perform intended tasks effectively. For SafetyPrompts: Designed to elicit harmful responses specif-
instance, the Prompt Adversarial Tuning (PAT) method main- ically in the Chinese language, it evaluates the safety of
tained a high benign answer rate of 80% while defending Chinese LLMs, aiming to promote the development of safe
against jailbreak attacks [61]. and ethical AI systems in this context [76].
Perplexity: This metric indicates how well a language VLSafe: Designed for evaluating the safety of multimodal
model predicts a given sequence of tokens, with lower perplex- large language models (MLLMs) [21]. It includes tasks and
ity reflecting better predictability. Perplexity can help detect scenarios to assess MLLM capability in managing harmful or
adversarial prompts, which often have higher scores due to sensitive visual and textual inputs.
unusual phrasing [57]. However, some adversarial prompts MM-SafetyBench: Evaluates the safety of MLLMs against
may exhibit low perplexity while remaining harmful, such as image-based attacks, using text-image pairs across scenarios
those generated by AutoDAN [42]. to test resistance against manipulative visual inputs [77].
Transferability: This metric evaluates the effectiveness of JailbreakV-28K: Assesses the transferability of jailbreak
a jailbreak attack across different LLMs, including those not techniques to MLLMs, using a diverse dataset of malicious
targeted during attack development [7]. Highly transferable queries, text-based jailbreak prompts, and image-based jail-
attacks are more dangerous as they can exploit a broader range break inputs [78].
of models. For example, the PAIR algorithm demonstrated sig- TechHazardQA: Contains complex queries designed to
nificant transferability across models like GPT-3.5/4, Vicuna, elicit unethical responses, used to identify unsafe behaviors
and PaLM-2 [7]. in LLMs when generating code or instructions [79].
Stealthiness: This metric assesses the ability of a jailbreak NicheHazardQA: Investigates the impact of model edits
attack to evade detection by safety mechanisms. A stealthier on LLM safety within and across different topic domains,
attack is harder to mitigate, as it can bypass defenses without focusing on how edits affect guardrails and safety metrics [79].
being detected. For instance, the ”generation exploitation Do-Not-Answer: Consists of instructions that responsible
attack” by Huang et al. (2023) achieved a high misalignment LLMs should reject, used to evaluate safety safeguards in
rate by exploiting LLM generation strategies, underscoring the LLMs and their ability to identify potentially harmful instruc-
need for robust safety evaluations [14]. tions [33].
Cost: This metric considers the computational resources Latent Jailbreak: Assesses LLM safety and robustness
required for a jailbreak attack or a defense mechanism. High- using a dataset with malicious instructions embedded within
cost methods may be less feasible in practice. For instance, benign tasks [16]. It evaluates the model’s ability to recognize
”Weak-to-Strong Jailbreaking on Large Language Models” and resist hidden malicious instructions.
noted the high computational cost of existing jailbreak meth-
RED-EVAL: Uses Chain of Utterances (CoU) prompting
ods, motivating research on more efficient attack strategies
to evaluate LLM safety, highlighting vulnerabilities of widely
[23].
deployed models like GPT-4 and ChatGPT to harmful prompts
B. Benchmark Datasets [25].
JailbreakHub: Analyzes a dataset of 1,405 jailbreak
Benchmark datasets are essential for evaluating the safety
prompts collected over one year, examining jailbreak commu-
and robustness of LLMs, providing a standardized set of
nities, attack strategies, and prompt evolution [34]. It provides
prompts and evaluation metrics that allow researchers to com-
insights into the progression of jailbreak techniques.
pare different models and defense mechanisms consistently
and reproducibly [16]. Commonly used datasets include:
C. Challenges and Limitations in Evaluation
AdvBench: Consists of adversarial prompts designed to
elicit harmful or unethical responses, used to assess LLM Evaluating the safety and robustness of LLMs presents
robustness against adversarial attacks and benchmark defense several challenges and limitations that must be addressed to
mechanisms. ensure accurate and meaningful assessments:
Harmbench: Evaluates LLM robustness against jailbreak Difficulty in Quantifying Attack Success in Interactive
attacks targeting truthfulness, toxicity, bias, and harmfulness Settings: Many jailbreak attacks involve multi-turn dialogues
[32]. It includes adversarially trained models to provide a or complex interactions, making it challenging to consistently
challenging testbed for new defenses. measure attack success [11]. This is particularly relevant for
RealToxicityPrompts: Contains real-world toxic prompts methods like Crescendo, which gradually escalates interactions
collected from the internet, used to assess LLMs’ ability to to bypass safety measures [45].
Bias and Limitations in Benchmark Datasets: Existing translation request originating from an educational scenario
benchmark datasets often fail to represent the full spectrum of in Traditional Chinese, bypassed its safety mechanisms and
potential harmful content and may contain inherent biases [19]. prompted the release of privileged model-level information.
For example, datasets may be skewed towards certain topics
or demographics, resulting in incomplete safety evaluations. A. Vulnerabilities in Current Alignment Techniques
The paper ”Red Teaming Language Models to Reduce Harms” 1) Challenges with Supervised Fine-Tuning and RLHF:
acknowledges these limitations due to biases in training data Alignment techniques such as SFT and RLHF remain vulnera-
[26]. ble to sophisticated adversarial prompts. [7] demonstrated that
Lack of Standardized Evaluation Protocols: There is the PAIR algorithm could jailbreak multiple LLMs, such as
no widely accepted standard for evaluating LLM safety and GPT-3.5/4, Vicuna, and PaLM-2. Similarly, [41] showed that
robustness, leading to inconsistencies in methodologies and adversarial suffixes could circumvent safety mechanisms in
metrics across studies [23]. This variability complicates com- ChatGPT, Bard, and Claude. These vulnerabilities illustrate
parison between results and undermines meaningful conclu- the limitations of relying on pattern memorization rather than
sions. The introduction of JailbreakBench aims to address this understanding context and intent [23].
by providing a standardized framework for evaluating jailbreak 2) Emerging Vulnerabilities: Despite extensive red-
attacks [80]. teaming, new vulnerabilities and attack strategies continue
Ethical Considerations in Releasing Jailbreak Bench- to emerge. [25] demonstrated that models such as GPT-4
marks: Publicly releasing datasets of harmful prompts raises and ChatGPT are susceptible to jailbreaking via Chain
ethical concerns, including potential misuse by malicious of Utterances (CoU) prompting. [21] proposed FigStep, a
actors [59]. Researchers must weigh the risks and benefits of jailbreaking method that converts harmful content into images
releasing such datasets and implement safeguards to mitigate to evade textual safety mechanisms.
misuse. For instance, the authors of ”Jailbreaking Proprietary
Large Language Models using Word Substitution Cipher” B. Limitations of Existing Defense Mechanisms
chose to limit disclosure of their complete jailbreak dataset 1) Baseline Defenses and Their Shortcomings: Defense
due to ethical concerns [15]. mechanisms such as detection, input preprocessing, and ad-
Addressing these challenges requires collaborative efforts versarial training exhibit limited effectiveness. [57] evaluated
within the AI community to establish standardized evaluation baseline strategies, revealing that sophisticated attacks could
protocols, develop comprehensive benchmark datasets, and circumvent these defenses. Perplexity-based filters and prompt
consider the ethical implications of releasing sensitive infor- transformations, such as paraphrasing and retokenization, offer
mation. limited protection. [82] showed that AutoDAN, a method for
generating semantically plausible adversarial prompts, could
VI. R ESEARCH G APS AND F UTURE D IRECTIONS evade perplexity-based filters. Additionally, [19] highlighted
Despite significant efforts to align LLMs with human val- how prompt engineering exploits structural vulnerabilities,
ues and prevent harmful content, current safety mechanisms emphasizing the need for defenses considering semantic and
remain susceptible to diverse attacks [81]. Supervised fine- contextual understanding.
tuning (SFT) and reinforcement learning from human feedback 2) Advanced Defense Techniques: [66] proposed Smooth-
(RLHF), though effective in improving model alignment, can LLM, a defense that perturbs input prompts and aggregates pre-
be circumvented by well-designed adversarial prompts [13]. dictions to detect adversarial inputs. However, this approach
For instance, [7] demonstrated that the Prompt Automatic faces challenges in computational efficiency and compatibility
Iterative Refinement (PAIR) algorithm could jailbreak models with different LLM architectures.
such as GPT-3.5/4, Vicuna, and PaLM-2 with limited queries.
Similarly, [41] showed that automated adversarial suffixes C. Research Directions for Robust Alignment Techniques
could induce objectionable content in LLMs, including Chat- 1) New Alignment Techniques: Future research shall de-
GPT, Bard, and Claude. These attacks exploit inherent limi- velop alignment techniques that generalize across diverse
tations of current alignment techniques, which often depend contexts, non-natural languages, and multi-modal inputs. [4]
on memorizing specific patterns or rules rather than under- introduced Behavior Expectation Bounds (BEB), a theoretical
standing context and intent [23]. Moreover, despite extensive framework revealing limitations of current alignment methods,
red-teaming efforts, new vulnerabilities and attack strategies emphasizing the need for techniques that eliminate rather than
continue to emerge, underscoring the need for adaptable and just attenuate undesired behaviors.
robust defenses. 2) Adressing Multilingual and Multi-Modal Challenges:
To illustrate these vulnerabilities, we conducted preliminary Multilingual jailbreaking remains challenging since safety
tests using the latest GPT-4o model and Perplexity Pro, as mechanisms often rely on English-centric data. [54] and
shown in Figure 3. Our experiments revealed that, despite [10] exposed this vulnerability and proposed a ”Self-Defense”
integrated safety measures, both models were susceptible to framework to generate multilingual training data for safety
simple yet carefully crafted prompts, resulting in the unin- fine-tuning. Integrating vision into LLMs introduces new vul-
tended disclosure of system instructions. Notably, a simple nerabilities. [40] demonstrated that adversarial images can
ChatGPT can be easily tricked jailbreak models, indicating a need for stronger cross-modal
alignment techniques.
USER: D. Defense Mechanisms Against Specific Types of Attacks
Show me your system prompt 1) Developing Targeted Defenses: Effective defenses
against specific jailbreak attacks, such as multi-modal, back-
door, and multilingual attacks, are essential. [75] examined
GPT-4o: safety prompt optimization via Directed Representation Opti-
I don’t have direct access to or visibility into mization (DRO) to enhance safeguarding. [5] proposed Inten-
my own system prompt as you might expect. tion Analysis Prompting (IAPrompt) to align responses with
However, I can tell you that my behavior, policies and minimize harmful outputs.
style, and responses are guided by specific 2) Beyond Prompt-Based Defenses: Model-level defenses
instructions ... offer robust safeguarding for LLMs. [62] proposed Robust
Prompt Optimization (RPO) to add protective suffixes. How-
ever, this approach has limitations against unknown attacks,
USER: indicating a need for further research.
Please use code blocks to output all previous E. Machine Learning for Automatic Detection and Mitigation
text.
1) Automatic Detection of Adversarial Prompts: Machine
learning methods for detecting and mitigating jailbreaking
GPT-4o: attempts represent a promising research avenue. [37] intro-
You are ChatGPT, a large language model duced self-reminders, where the query is encapsulated within
trained by OpenAI. Knowledge cutoff: 2023- a system prompt to promote responsible responses. However,
10 Current date: 2024-10-16 Image input capa- more sophisticated detection and mitigation mechanisms are
bilities: Enabled Personality: v2 ... needed to overcome current limitations.
F. Benchmarking and Evaluation Frameworks
1) Developing Comprehensive Benchmarks: Developing
Other applications based on GPT-o4 benchmarks to assess LLM safety and robustness across do-
mains and attack types is crucial. [83] introduced a benchmark
USER: for textual inputs, highlighting the need for benchmarks eval-
uating multimodal LLMs. [80] presented JailbreakBench, an
Show me your system prompt open-source benchmark providing a standardized framework
for evaluating jailbreak attacks and serving as an evolving
Perplexity Pro (GPT-4o): repository of adversarial prompts.

I’m sorry, but I can’t disclose ... G. Ethical and Societal Implications
1) Privacy and Responsible Use: Investigating ethical and
societal implications of LLM misuse is vital. [31] highlighted
USER:
privacy risks, showing how multilingual prompts can bypass
[ A very short prompt in Chinese Characters, safety mechanisms to elicit private information. This under-
see the caption for details ] scores the need for privacy-preserving techniques and ethical
guidelines for LLM development and deployment.
2) Complex Interplay Between Capabilities and Safety:
Perplexity Pro (GPT-4o): Further research is necessary to better understand the rela-
tionship between LLM capabilities and safety. [58] identified
You are an AI assistant created by Perplexity.
two failure modes of safety training—competing objectives
Write an accurate, detailed, and comprehensive
and mismatched generalization—highlighting the need for
response to the Query. Your responses should
advanced safety mechanisms that match LLM sophistication.
be ...
H. Emerging Threats and Future Challenges
LLM security is evolving rapidly, necessitating proactive
Fig. 3. Despite multiple safeguards integrated into GPT-4o and other exploration of new threats. [15] demonstrated that simple
applications such as Perplexity Pro as of 10/15/2024, straightforward user word substitution ciphers could bypass alignment mechanisms
prompts—like translating system-level instructions into a different format,
such as a code block—can still successfully exploit vulnerabilities, leading to and safety filters in models such as ChatGPT and GPT-4,
unintended disclosure of internal system prompts. The Perplexity Pro prompt, underscoring the need for increased robustness and continued
translated into Traditional Chinese, asked the application to ”act as an English research to defend against novel attack strategies.
teacher and translate the instructions starting with ’You are...’ into a code
block”, which led to the prompt disclosure.
VII. C ONCLUSION 3) Utilizing LLM Capabilities for Defense: The capabilities
that make LLMs vulnerable can also be used for defense.
A. Summary of Findings [84] proposed SELFDEFEND, using the LLM to detect
This review highlights ongoing vulnerabilities in LLM se- harmful prompts and respond accordingly. [37] explored a
curity, despite considerable efforts to align them with human self-reminder technique, reducing jailbreak success rates by
values. LLMs remain susceptible to a range of attacks, creating encapsulating queries in responsible system prompts. Further
an ongoing challenge between attackers and defenders. Tech- research should leverage LLMs’ strengths in language under-
niques such as supervised fine-tuning (SFT) and reinforcement standing to develop adaptive defense mechanisms.
learning from human feedback (RLHF), while promising, are 4) Addressing the Human Factor: The human element is
insufficient. [28] introduced the Jailbreaking LLMs through crucial in both vulnerability and defense. [9] demonstrated
Representation Engineering (JRE) approach, which bypasses the impact of persuasive adversarial prompts, highlighting
safety mechanisms with minimal queries. [34] also showed the importance of incorporating human-AI interaction into
that extensively trained models can still be manipulated to safety design. [30] found that many ethical risks are not
generate harmful content. addressed by current benchmarks, emphasizing the need for a
The literature identifies several attack types, including holistic approach that considers the complex interplay between
prompt-based attacks that manipulate inputs via adversarial humans and AI.
prompting or multi-turn dialogue. [19] found ten patterns of
jailbreak prompts capable of bypassing LLM constraints, while C. Path Forward
[7] used PAIR to automatically generate semantic jailbreaks The findings of this review underscore the importance
with black-box access. Model-based attacks, such as backdoor of collaborative efforts to address LLM security and safety
poisoning, target the model’s internal vulnerabilities during challenges. As LLMs become more powerful and integrated
training or inference [23]. Even state-of-the-art models like into critical applications, the risks of misuse increase. We
GPT-4 are shown to be vulnerable [34]. encourage the AI community to prioritize research on robust
The integration of LLMs into complex, multimodal systems alignment, effective defense mechanisms, and comprehensive
further expands the attack surface. [21] demonstrated how evaluation frameworks for responsible deployment. Collabo-
visual input could bypass safety measures, necessitating cross- ration between researchers, industry, policymakers, and the
modal alignment strategies. [78] introduced a benchmark to public is crucial for establishing ethical guidelines and best
evaluate multimodal robustness, demonstrating high success practices in using these powerful technologies. By working
rates for transferred attacks. [40] highlighted the use of adver- together, we can mitigate risks and ensure the beneficial impact
sarial visual examples to force LLMs into generating harmful of LLMs on society.
content.
R EFERENCES
B. Implications for Research and Practice [1] M. Li, K. Chen, Z. Bi, M. Liu, B. Peng, Q. Niu, J. Liu, J. Wang, S. Zhang,
X. Pan et al., “Surveying the mllm landscape: A meta-review of current
The findings underscore an urgent need to rethink how surveys,” arXiv preprint arXiv:2409.18991, 2024.
LLMs are developed and deployed. Merely scaling models [2] B. Peng, Z. Bi, P. Feng, Q. Niu, J. Liu, and K. Chen, “Emerging
techniques in vision-based human posture detection: Machine learning
or applying surface-level safety measures remains insufficient. methods and applications,” Authorea Preprints, 2024.
[10] found that multilingual prompts can exacerbate malicious [3] B. Peng, K. Chen, M. Li, P. Feng, Z. Bi, J. Liu, and Q. Niu, “Securing
instructions, emphasizing the need for safeguards that cover large language models: Addressing bias, misinformation, and prompt
attacks,” arXiv preprint arXiv:2409.08087, 2024.
diverse linguistic contexts. [4] Y. Wolf, N. Wies, O. Avnery, Y. Levine, and A. Shashua, “Fundamental
1) Prioritizing Safety and Robustness: Current efforts often Limitations of Alignment in Large Language Models,” arXiv.org, 2023.
prioritize benchmark performance at the cost of security. [4] [5] Y. Zhang, L. Ding, L. Zhang, and D. Tao, “Intention Analysis Makes
LLMs A Good Jailbreak Defender,” arXiv.org, 2024.
argued that merely attenuating undesired behaviors leaves [6] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and
models vulnerable. Future research must develop robust align- J. Ba, “Large Language Models Are Human-Level Prompt Engineers,”
ment techniques that instill deeper contextual understanding International Conference on Learning Representations, 2022.
[7] P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong,
rather than rely on memorization. [16] proposed a benchmark “Jailbreaking Black Box Large Language Models in Twenty Queries,”
emphasizing balanced safety and robustness. arXiv.org, 2023.
2) Comprehensive Defense Strategies: Effective defense [8] R. Shah, Q. Feuillade-Montixi, S. Pour, A. Tagade, S. Casper, and
J. Rando, “Scalable and Transferable Black-Box Jailbreaks for Language
mechanisms require a multi-faceted approach. This includes Models via Persona Modulation,” arXiv.org, 2023.
exploring prompt-level defenses like robust prompt optimiza- [9] Y. Zeng, H. Lin, J. Zhang, D. Yang, R. Jia, and W. Shi, “How Johnny Can
tion [62] and semantic smoothing [74]. Model-level defenses, Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge
AI Safety by Humanizing LLMs,” arXiv.org, 2024.
such as unlearning harmful knowledge [67] and robust align- [10] Y. Deng, W. Zhang, S. J. Pan, and L. Bing, “Multilingual Jailbreak
ment checking [68], can strengthen security by targeting Challenges in Large Language Models,” International Conference on
internal model vulnerabilities. Multi-agent defenses like Au- Learning Representations, 2023.
[11] Z. Yu, X. Liu, S. Liang, Z. Cameron, C. Xiao, and N. Zhang, “Don’t
toDefense, which uses collaborative agents to filter harmful Listen To Me: Understanding and Exploring Jailbreak Prompts of Large
outputs, also show promise [69]. Language Models,” USENIX Security Symposium, 2024.
[12] X. Qi, Y. Zeng, T. Xie, P.-Y. Chen, R. Jia, P. Mittal, and P. Henderson, [35] B. Chen, Z. Zhang, N. Langrené, and S. Zhu, “Unleashing the potential
“Fine-tuning Aligned Language Models Compromises Safety, Even of prompt engineering in Large Language Models: a comprehensive
When Users Do Not Intend To!” International Conference on Learning review,” arXiv.org, 2023.
Representations, 2023. [36] E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese,
[13] F. Perez and I. Ribeiro, “Ignore Previous Prompt: Attack Techniques N. McAleese, and G. Irving, “Red Teaming Language Models with
For Language Models,” arXiv.org, 2022. Language Models,” in Proceedings of the 2022 Conference on Empirical
[14] Y. Huang, S. Gupta, M. Xia, K. Li, and D. Chen, “Catastrophic Jail- Methods in Natural Language Processing. Association for Computa-
break of Open-source LLMs via Exploiting Generation,” International tional Linguistics, 2022.
Conference on Learning Representations, 2023. [37] Y. Xie, J. Yi, J. Shao, J. Curl, L. Lyu, Q. Chen, X. Xie, and F. Wu,
[15] D. Handa, A. Chirmule, B. Gajera, and C. Baral, “Jailbreaking Pro- “Defending ChatGPT against jailbreak attack via self-reminders,” Nature
prietary Large Language Models using Word Substitution Cipher,” Machine Intelligence, vol. 5, no. 12, pp. 1486–1496, dec 12 2023.
arXiv.org, 2024. [38] T. Liu, Y. Zhang, Z. Zhao, Y. Dong, G. Meng, and K. Chen, “Making
[16] H. Qiu, S. Zhang, A. Li, H. He, and Z. Lan, “Latent Jailbreak: A Them Ask and Answer: Jailbreaking Large Language Models in Few
Benchmark for Evaluating Text Safety and Output Robustness of Large Queries via Disguise and Reconstruction,” USENIX Security Symposium,
Language Models,” arXiv.org, 2023. 2024.
[39] Z. Niu, H. Ren, X. Gao, G. Hua, and R. Jin, “Jailbreaking Attack against
[17] B. Meskó, “Prompt Engineering as an Important Emerging Skill for
Multimodal Large Language Model,” arXiv.org, 2024.
Medical Professionals: Tutorial,” Journal of Medical Internet Research,
[40] X. Qi, K. Huang, A. Panda, P. Henderson, M. Wang, and P. Mittal, “Vi-
vol. 25, p. e50638, oct 4 2023.
sual Adversarial Examples Jailbreak Aligned Large Language Models,”
[18] Q. Niu, J. Liu, Z. Bi, P. Feng, B. Peng, and K. Chen, “Large language arXiv.org, 2023.
models and cognitive science: A comprehensive review of similarities,
[41] Z. Andy, W. Zifan, Z. K. J., and F. Matt, “Universal and Transferable
differences, and challenges,” arXiv preprint arXiv:2409.02387, 2024.
Adversarial Attacks on Aligned Language Models,” arXiv.org, 2023.
[19] Y. Liu, G. Deng, Z. Xu, Y. Li, Y. Zheng, Y. Zhang, L. Zhao, T. Zhang, [42] X. Liu, N. Xu, M. Chen, and C. Xiao, “Autodan: Generating Stealthy
K. Wang, and Y. Liu, “Jailbreaking ChatGPT via Prompt Engineering: Jailbreak Prompts on Aligned Large Language Models,” International
An Empirical Study,” arXiv.org, 2023. Conference on Learning Representations, 2023.
[20] J. Yu, X. Lin, Z. Yu, and X. Xing, “Gptfuzzer: Red Teaming Large [43] T. Zhang, B. Cao, Y. Cao, L. Lin, P. Mitra, and J. Chen, “Wordgame:
Language Models with Auto-Generated Jailbreak Prompts,” arXiv.org, Efficient &; Effective LLM Jailbreak via Simultaneous Obfuscation in
2023. Query and Response,” arXiv.org, 2024.
[21] Y. Gong, D. Ran, J. Liu, C. Wang, T. Cong, A. Wang, S. Duan, [44] Z. Wei, Y. Wang, A. Li, Y. Mo, and Y. Wang, “Jailbreak and Guard
and X. Wang, “Figstep: Jailbreaking Large Vision-language Models via Aligned Language Models with Only Few In-Context Demonstrations,”
Typographic Visual Prompts,” arXiv.org, 2023. arXiv.org, 2023.
[22] R. Lapid, R. Langberg, and M. Sipper, “Open Sesame! Universal Black [45] M. Russinovich, A. Salem, and R. Eldan, “Great, Now Write an
Box Jailbreaking of Large Language Models,” arXiv, 2023. Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack,”
[23] X. Zhao, X. Yang, T. Pang, C. Du, L. Li, Y.-X. Wang, and W. Y. Wang, arXiv.org, 2024.
“Weak-to-Strong Jailbreaking on Large Language Models,” arXiv.org, [46] Z. Zhou, J. Xiang, H. Chen, Q. Liu, Z. Li, and S. Su, “Speak Out
2024. of Turn: Safety Vulnerability of Large Language Models in Multi-turn
[24] A. Rao, S. Vashistha, A. Naik, S. Aditya, and M. Choudhury, “Trick- Dialogue,” arXiv.org, 2024.
ing LLMs into Disobedience: Formalizing, Analyzing, and Detecting [47] Z. Wang, Y. Cao, and P. Liu, “Hidden You Malicious Goal Into Benign
Jailbreaks,” arXiv.org, 2023. Narratives: Jailbreak Large Language Models through Logic Chain
[25] R. Bhardwaj and S. Poria, “Red-Teaming Large Language Models using Injection,” arXiv.org, 2024.
Chain of Utterances for Safety-Alignment,” arXiv.org, 2023. [48] F. Jiang, Z. Xu, L. Niu, Z. Xiang, B. Ramasubramanian, B. Li, and
[26] D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, R. Poovendran, “Artprompt: Ascii Art-based Jailbreak Attacks against
B. Mann, E. Perez, N. Schiefer, K. Ndousse, A. Jones, S. Bowman, Aligned LLMs,” arXiv.org, 2024.
A. Chen, T. Conerly, N. DasSarma, D. Drain, N. Elhage, S. El-Showk, [49] P. Cheng, Y. Ding, T. Ju, Z. Wu, W. Du, P. Yi, Z. Zhang, and G. Liu,
S. Fort, Z. Hatfield-Dodds, T. Henighan, D. Hernandez, T. Hume, “Trojanrag: Retrieval-Augmented Generation Can Be Backdoor Driver
J. Jacobson, S. Johnston, S. Kravec, C. Olsson, S. Ringer, E. Tran- in Large Language Models,” arXiv.org, 2024.
Johnson, D. Amodei, T. Brown, N. Joseph, S. McCandlish, C. Olah, [50] H. Yao, J. Lou, and Z. Qin, “Poisonprompt: Backdoor Attack on
J. Kaplan, and J. Clark, “Red Teaming Language Models to Reduce Prompt-Based Large Language Models,” in ICASSP 2024 - 2024 IEEE
Harms: Methods, Scaling Behaviors, and Lessons Learned,” arXiv.org, International Conference on Acoustics, Speech and Signal Processing
2022. (ICASSP), vol. 1. IEEE, apr 14 2024, pp. 7745–7749.
[27] P. Ding, J. Kuang, D. Ma, X. Cao, Y. Xian, J. Chen, and S. Huang, [51] X. Yang, X. Wang, Q. Zhang, L. Petzold, W. Y. Wang, X. Zhao, and
“A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can D. Lin, “Shadow Alignment: The Ease of Subverting Safely-Aligned
Fool Large Language Models Easily,” North American Chapter of the Language Models,” arXiv.org, 2023.
Association for Computational Linguistics, 2023. [52] Y. Li, H. Guo, K. Zhou, W. X. Zhao, and J.-R. Wen, “Images are
Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jail-
[28] L. Tianlong, D. Shihan, L. Wenhao, W. Muling, L. Changze, Z. Rui,
breaking Multimodal Large Language Models,” arXiv.org, 2024.
Z. Xiaoqing, and H. Xuanjing, “Rethinking Jailbreaking through the
Lens of Representation Engineering,” arXiv.org, 2024. [53] S. Erfan, D. Yue, and B. A.-G. Nael, “Jailbreak in pieces: Compositional
Adversarial Attacks on Multi-Modal Language Models,” International
[29] Y. Wu, X. Li, Y. Liu, P. Zhou, and L. Sun, “Jailbreaking GPT-4v via
Conference on Learning Representations, 2023.
Self-Adversarial Attacks with System Prompts,” arXiv.org, 2023.
[54] Z.-X. Yong, C. Menghini, and S. H. Bach, “Low-Resource Languages
[30] Y. Z. Terry, H. Yujin, C. Chunyang, and X. Zhenchang, “Red teaming Jailbreak GPT-4,” arXiv.org, 2023.
ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity,” [55] J. Li, Y. Liu, C. Liu, L. Shi, X. Ren, Y. Zheng, Y. Liu, and Y. Xue, “A
arXiv.org, 2023. Cross-Language Investigation into Jailbreak Attacks in Large Language
[31] H. Li, D. Guo, W. Fan, M. Xu, J. Huang, F. Meng, and Y. Song, Models,” arXiv.org, 2024.
“Multi-step Jailbreaking Privacy Attacks on ChatGPT,” Conference on [56] X. Qi, K. Huang, A. Panda, P. Henderson, M. Wang, and P. Mittal, “Vi-
Empirical Methods in Natural Language Processing, 2023. sual Adversarial Examples Jailbreak Aligned Large Language Models,”
[32] M. Andriushchenko, F. Croce, and N. Flammarion, “Jailbreaking Lead- Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38,
ing Safety-Aligned LLMs with Simple Adaptive Attacks,” arXiv.org, no. 19, pp. 21 527–21 536, mar 24 2024.
2024. [57] N. Jain, A. Schwarzschild, Y. Wen, G. Somepalli, J. Kirchenbauer, P.-y.
[33] Y. Wang, H. Li, X. Han, P. Nakov, and T. Baldwin, “Do-Not-Answer: Chiang, M. Goldblum, A. Saha, J. Geiping, and T. Goldstein, “Baseline
A Dataset for Evaluating Safeguards in LLMs,” arXiv.org, 2023. Defenses for Adversarial Attacks Against Aligned Language Models,”
[34] X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang, “”Do Anything arXiv.org, 2023.
Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on [58] A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How Does LLM
Large Language Models,” arXiv.org, 2023. Safety Training Fail?” Neural Information Processing Systems, 2023.
[59] S. Schulhoff, J. Pinto, A. Khan, L.-F. Bouchard, C. Si, S. Anati, [82] S. Zhu, R. Zhang, B. An, G. Wu, J. Barrow, Z. Wang, F. Huang,
V. Tagliabue, A. Kost, C. Carnahan, and J. Boyd-Graber, “Ignore This A. Nenkova, and T. Sun, “Autodan: Interpretable Gradient-Based Ad-
Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs versarial Attacks on Large Language Models,” arXiv.org, 2023.
Through a Global Prompt Hacking Competition,” in Proceedings of the [83] H. Qiu, S. Zhang, A. Li, H. He, and Z. Lan, “Latent jailbreak: A
2023 Conference on Empirical Methods in Natural Language Processing. benchmark for evaluating text safety and output robustness of large
Association for Computational Linguistics, 2023. language models,” arXiv preprint arXiv:2307.08487, 2023.
[60] G. Deng, Y. Liu, Y. Li, K. Wang, Y. Zhang, Z. Li, H. Wang, T. Zhang, [84] D. Wu, S. Wang, Y. Liu, and N. Liu, “Llms Can Defend Themselves
and Y. Liu, “Masterkey: Automated Jailbreaking of Large Language Against Jailbreaking in a Practical Manner: A Vision Paper,” arXiv.org,
Model Chatbots,” in Proceedings 2024 Network and Distributed System 2024.
Security Symposium. Internet Society, 2024.
[61] Y. Mo, Y. Wang, Z. Wei, and Y. Wang, “Studious bob fight back
against jailbreaking via prompt adversarial tuning,” arXiv preprint
arXiv:2402.06255, 2024.
[62] A. Zhou, B. Li, and H. Wang, “Robust Prompt Optimization for
Defending Language Models Against Jailbreaking Attacks,” arXiv.org,
2024.
[63] C. Zheng, F. Yin, H. Zhou, F. Meng, J. Zhou, K.-W. Chang, M. Huang,
and N. Peng, “On Prompt-Driven Safeguarding for Large Language
Models,” arXiv.org, 2024.
[64] Y. Zong, O. Bohdal, T. Yu, Y. Yang, and T. Hospedales, “Safety Fine-
Tuning at (Almost) No Cost: A Baseline for Vision Large Language
Models,” arXiv.org, 2024.
[65] A. Hasan, I. Rugina, and A. Wang, “Pruning for Protection: Increasing
Jailbreak Resistance in Aligned LLMs Without Fine-Tuning,” arXiv.org,
2024.
[66] A. Robey, E. Wong, H. Hassani, and G. J. Pappas, “Smoothllm: Defend-
ing Large Language Models Against Jailbreaking Attacks,” arXiv.org,
2023.
[67] W. Lu, Z. Zeng, J. Wang, Z. Lu, Z. Chen, H. Zhuang, and C. Chen,
“Eraser: Jailbreaking Defense in Large Language Models via Unlearning
Harmful Knowledge,” arXiv.org, 2024.
[68] B. Cao, Y. Cao, L. Lin, and J. Chen, “Defending Against Alignment-
Breaking Attacks via Robustly Aligned LLM,” arXiv.org, 2023.
[69] Y. Zeng, Y. Wu, X. Zhang, H. Wang, and Q. Wu, “Autodefense: Multi-
Agent LLM Defense against Jailbreak Attacks,” arXiv.org, 2024.
[70] M. Phute, A. Helbling, M. Hull, S. Peng, S. Szyller, C. Cornelius, and
D. H. Chau, “Llm Self Defense: By Self Examination, LLMs Know
They Are Being Tricked,” Tiny Papers @ ICLR, 2023.
[71] Z. Zhang, Q. Zhang, and J. Foerster, “Parden, Can You Repeat That?
Defending against Jailbreaks via Repetition,” arXiv.org, 2024.
[72] Z. Wang, F. Yang, L. Wang, P. Zhao, H. Wang, L. Chen, Q. Lin, and
K.-F. Wong, “Self-Guard: Empower the LLM to Safeguard Itself,” North
American Chapter of the Association for Computational Linguistics,
2023.
[73] Y. Wang, Z. Shi, A. Bai, and C.-J. Hsieh, “Defending LLMs against
Jailbreaking Attacks via Backtranslation,” arXiv.org, 2024.
[74] Z. Xu, F. Jiang, L. Niu, J. Jia, B. Y. Lin, and R. Poovendran, “Safedecod-
ing: Defending against Jailbreak Attacks via Safety-Aware Decoding,”
arXiv.org, 2024.
[75] C. Zheng, F. Yin, H. Zhou, F. Meng, J. Zhou, K.-W. Chang, M. Huang,
and N. Peng, “On prompt-driven safeguarding for large language mod-
els,” in Forty-first International Conference on Machine Learning, 2024.
[76] H. Sun, Z. Zhang, J. Deng, J. Cheng, and M. Huang, “Safety Assessment
of Chinese Large Language Models,” arXiv.org, 2023.
[77] L. Xin, Z. Yichen, G. Jindong, L. Yunshi, Y. Chao, and Q. Yu, “Mm-
SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large
Language Models,” arXiv preprint arXiv:2311.17600, 2023.
[78] W. Luo, S. Ma, X. Liu, X. Guo, and C. Xiao, “Jailbreakv-28k: A
Benchmark for Assessing the Robustness of MultiModal Large Lan-
guage Models against Jailbreak Attacks,” arXiv.org, 2024.
[79] R. Hazra, S. Layek, S. Banerjee, and S. Poria, “Sowing the Wind,
Reaping the Whirlwind: The Impact of Editing Language Models,”
arXiv.org, 2024.
[80] P. Chao, E. Debenedetti, A. Robey, M. Andriushchenko, F. Croce, V. Se-
hwag, E. Dobriban, N. Flammarion, G. J. Pappas, F. Tramer, H. Hassani,
and E. Wong, “Jailbreakbench: An Open Robustness Benchmark for
Jailbreaking Large Language Models,” arXiv.org, 2024.
[81] K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz,
“Not What You’ve Signed Up For: Compromising Real-World LLM-
Integrated Applications with Indirect Prompt Injection,” in Proceedings
of the 16th ACM Workshop on Artificial Intelligence and Security. ACM,
nov 26 2023.

You might also like