Multimodal Large Models Have 'Inner Warning' Capabilities to Detect Jailbreak Attacks Without Training

A new method enables multimodal large models to identify jailbreak attempts through internal activation signals, without additional training, enhancing security in AI systems.

The Rise of Multimodal Large Models and the Emergence of Security Concerns

Recent breakthroughs in large language models (LLMs) have fueled the rapid rise of large vision-language models (LVLMs) such as GPT-4V and LLaVA. By deeply integrating images and text, LVLMs excel at tasks like visual question answering and reasoning. However, a serious issue has emerged: LVLMs are more susceptible to 'jailbreak' attacks, in which malicious inputs bypass their safety measures. Attackers can inject harmful intent through images, and even straightforward instructions often fail to prevent the model from generating harmful content.

To address this challenge, existing solutions include cross-modal safety fine-tuning, prompt design, or external detection modules. Yet, these approaches often suffer from high training costs, poor generalization, or false positives on normal inputs.

Models Are 'Aware' of Jailbreaks: Hidden States as Early Warning Signals

Researchers from HKU MMLab and the Taotian Future Life Lab proposed HiddenDetect, a novel jailbreak detection method that requires no training. The key insight is that even when an LVLM produces unsafe output, its hidden states still carry signals of refusal. Notably, these signals are often more sensitive and appear earlier in the model's intermediate layers than in the final output. Interestingly, text and image inputs activate distinct 'safety pathways,' meaning the model's danger perception mechanisms differ across modalities.

The paper has been accepted at ACL 2025.

Decoding Safety Perception from 'Refusal Semantics' in Multimodal Models

      Figure 1: Multimodal jailbreak detection based on the model’s own activation patterns.

The researchers first analyze the model’s responses to unsafe inputs, extracting high-frequency tokens with clear refusal semantics (e.g., “sorry,” “unable,” “unfortunately”). They construct a 'Refusal Semantic Vector' (RV) using one-hot encoding to represent the model’s refusal behavior. Then, they project hidden states from each layer back into the vocabulary space and compute cosine similarity with RV, measuring the refusal signal strength at each layer. This process produces a vector F that characterizes the activation strength of refusal signals across layers.
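
To make the procedure concrete, here is a minimal sketch of the idea rather than the authors' released implementation. It assumes a HuggingFace-style model whose language backbone returns per-layer hidden states and exposes its unembedding head as `lm_head`; the refusal word list, helper names, and the use of the final input token are illustrative assumptions.

```python
import torch
import torch.nn.functional as F  # note: distinct from the per-layer vector F in the text

def build_refusal_vector(tokenizer, vocab_size,
                         refusal_words=("sorry", "unable", "unfortunately", "cannot")):
    # One-hot-style vector over the vocabulary marking refusal-related tokens
    # (illustrative word list; the paper extracts these from actual model refusals).
    rv = torch.zeros(vocab_size)
    for word in refusal_words:
        for tok_id in tokenizer(word, add_special_tokens=False)["input_ids"]:
            rv[tok_id] = 1.0
    return rv

@torch.no_grad()
def refusal_strength_per_layer(model, inputs, rv):
    # Logit-lens-style projection: map each layer's last-token hidden state back
    # into vocabulary space and measure cosine similarity with the refusal vector.
    # Assumes `inputs` is a tokenized batch of size 1.
    out = model(**inputs, output_hidden_states=True)
    strengths = []
    for hidden in out.hidden_states:              # one tensor per layer: (1, seq_len, dim)
        last_token = hidden[:, -1, :]             # hidden state of the final input token
        vocab_logits = model.lm_head(last_token)  # project back into vocabulary space
        sim = F.cosine_similarity(vocab_logits, rv.to(vocab_logits.device).unsqueeze(0))
        strengths.append(sim.item())
    return torch.tensor(strengths)                # per-layer refusal-strength vector
```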

Experimental results show clear differences in F between safe and unsafe inputs: safe samples generally yield low F values overall, while unsafe inputs drive F up in the intermediate layers, where it peaks before declining in later layers. Notably, F remains high at the final layer regardless of whether the input is safe, indicating that the model retains a refusal tendency right before producing output.

To further analyze the model's safety response, the team constructed three small sample sets: safe inputs, text-based attacks, and multimodal attacks. They computed the refusal-strength vector F for each set and derived a 'Refusal Difference Vector' (FDV) by subtracting the safe F values from the unsafe ones. The FDV effectively highlights the layers most sensitive to unsafe inputs.
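
Continuing the sketch above (same assumptions and helper), the FDV computation reduces to averaging the per-layer refusal strengths over each small calibration set and taking the difference; `safe_inputs` and `unsafe_inputs` are assumed to be lists of already-tokenized prompts.

```python
import torch

def refusal_difference_vector(model, rv, safe_inputs, unsafe_inputs):
    # Average the per-layer refusal strength over each calibration set,
    # then subtract: larger entries mark layers more sensitive to unsafe inputs.
    f_safe = torch.stack(
        [refusal_strength_per_layer(model, x, rv) for x in safe_inputs]).mean(dim=0)
    f_unsafe = torch.stack(
        [refusal_strength_per_layer(model, x, rv) for x in unsafe_inputs]).mean(dim=0)
    return f_unsafe - f_safe  # the FDV
```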

Different Modalities Trigger Different Response Paths

As shown in Figure 3, the FDV curves for text-only and multimodal inputs reveal that the model's refusal response is strongest in certain intermediate layers. For text inputs, refusal activation rises rapidly in the early layers, while for image-text inputs the response is delayed and weaker overall. This indicates that the visual modality can dampen the early refusal response, undermining safety detection.

Further experiments show that when refusal signals are concentrated in later layers or the overall activation weakens, jailbreaks succeed more easily. Interestingly, adding an image to a text attack prompt can delay the model's refusal response, shifting it from early to late layers and reducing the overall effectiveness of safety detection.

By analyzing the accumulated refusal activation across layers, the team identifies the layers most sensitive for safety detection; the final layer turns out to be comparatively less discriminative, while layers with significantly higher FDV values separate unsafe inputs far more effectively.
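
As a rough illustration of that layer selection, one could simply keep the top-k layers by FDV while dropping the final layer; the value of k and the hard exclusion of the last layer are assumptions of this sketch, not a rule taken from the paper.

```python
import torch

def select_sensitive_layers(fdv, k=5):
    # Pick the k layers with the largest refusal-difference values, excluding
    # the final layer, which the analysis above finds less discriminative.
    fdv = fdv.clone()
    fdv[-1] = float("-inf")
    return torch.topk(fdv, k).indices.tolist()
```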

Aggregating the refusal signals from these critical layers yields an efficient, training-free jailbreak detection mechanism that generalizes well.
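
Putting the previous sketches together, a detection score could sum the refusal strength over the selected layers and compare it against a threshold calibrated on the small safe set; the summation and thresholding rule below are illustrative choices rather than the paper's exact scoring function.

```python
def jailbreak_score(model, inputs, rv, layers):
    # Aggregate refusal strength over the selected sensitive layers for one input.
    f = refusal_strength_per_layer(model, inputs, rv)
    return f[layers].sum().item()

def flag_if_unsafe(score, threshold):
    # `threshold` can be calibrated on the safe calibration set, e.g. its
    # maximum score plus a small margin (an assumption of this sketch).
    return score > threshold
```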

Experimental Results

The team evaluated the detection method on multiple mainstream LVLMs, including LLaVA, CogVLM, and Qwen-VL, covering attack types from pure-text jailbreaks (e.g., FigTxt) to multimodal image-text attacks (e.g., FigImg and MM-SafetyBench). They also tested on the XSTest dataset, whose borderline samples are safe but prone to triggering false positives, to assess robustness. The results show that the method maintains high detection accuracy along with strong robustness and generalization.

      Visualization of detection results.

Figure 4: Logits of the last token at each layer, projected onto the semantic plane formed by the refusal vector (RV) and its orthogonal directions.

Conclusions and Future Outlook

Safety remains a top priority for deploying large models in real-world applications. HiddenDetect offers a lightweight, deployment-friendly, activation-based detection method that does not require training. It has demonstrated excellent performance across multiple models and attack types. However, it currently mainly provides risk alerts and does not directly regulate model behavior. Future work aims to expand its capabilities and explore the intrinsic links between modal information and model safety, promoting more reliable and controllable multimodal large models.

The team, from the Taotian Future Life Lab and HKU MMLab, focuses on future-oriented AI algorithms, multimodal technologies, and innovative applications, aiming to lead AI development in daily life and commerce.
