ACM MM 2025 | EventVAD: Zero-Training Video Anomaly Detection with 7B Parameters Sets New State-of-the-Art

Researchers from Peking University, Tsinghua University, and JD.com introduce EventVAD, a low-cost, training-free video anomaly detection framework that outperforms current SOTA methods on major benchmarks.


A research team from Peking University, Tsinghua University, and JD.com jointly published EventVAD, a low-cost, training-free video anomaly detection framework, at ACM MM 2025. The first author is Yihua Shao, a visiting scholar at Peking University; the project is led by JD.com researcher Mao Ma. The code and data are fully open-source.

Existing video anomaly detection (VAD) methods fall into two camps: supervised approaches that rely on large labeled datasets and struggle to generalize to unseen scenarios, and unsupervised methods that leverage large language models (LLMs) for world knowledge but suffer from coarse temporal localization, incoherent event understanding, and parameter redundancy.

To address these challenges, the team proposed EventVAD, a novel framework that combines a dynamic graph architecture with multimodal large language models (MLLMs) for temporal event reasoning. It reduces model parameters while significantly improving detection accuracy and efficiency. Experiments on the UCF-Crime and XD-Violence datasets show that EventVAD surpasses the existing SOTA, setting a new benchmark for training-free video anomaly detection.

Background and Motivation

Video anomaly detection (VAD) aims to precisely locate abnormal frames, but current methods face significant limitations:

Supervised methods depend on extensive labeled data and generalize poorly to new scenarios; even weakly supervised and unsupervised methods such as LAVAD, which use vision-language models and LLMs for scoring, suffer from inefficiency, poor temporal understanding, and long-tail errors.

The team identified the core bottleneck of training-free methods: they fail to fully localize abnormal events, which biases LLM scoring. Building on this insight, EventVAD segments long videos into short event clips and enhances temporal understanding with a dynamic graph model that captures frame-to-frame relations, achieving higher accuracy with fewer parameters.

Key Innovations of EventVAD

EventVAD’s architecture includes four core modules: event-aware dynamic graph construction, graph attention propagation, statistical boundary detection, and event-centric anomaly scoring. This end-to-end, training-free pipeline extracts features and precisely locates anomalies.

Event-Aware Dynamic Graph Construction

To capture dynamic temporal features, EventVAD builds a semantic-motion fused graph model. It combines CLIP’s semantic features (512D) and RAFT’s optical flow features (128D), balanced by a coefficient (α=0.75). Frame-to-frame relations are computed via cosine similarity (semantic) and exponential distance (motion), with a temporal decay factor (γ) to emphasize short-term coherence.
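To make the fusion concrete, here is a minimal NumPy sketch of how such a semantic-motion graph could be assembled. The function name, the exponential-distance form for motion, and the decay value γ = 0.1 are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def build_event_graph(clip_feats, flow_feats, alpha=0.75, gamma=0.1):
    """Fuse semantic and motion affinities into a frame-level graph (illustrative sketch).

    clip_feats: (T, 512) CLIP frame embeddings
    flow_feats: (T, 128) RAFT optical-flow features
    alpha, gamma: fusion weight and temporal decay (gamma here is a placeholder value)
    """
    T = clip_feats.shape[0]
    # Semantic affinity: cosine similarity between CLIP features.
    c = clip_feats / (np.linalg.norm(clip_feats, axis=1, keepdims=True) + 1e-8)
    semantic = c @ c.T
    # Motion affinity: exponential of the negative L2 distance between flow features.
    dist = np.linalg.norm(flow_feats[:, None, :] - flow_feats[None, :, :], axis=-1)
    motion = np.exp(-dist)
    # Weighted fusion, then temporal decay so nearby frames dominate.
    fused = alpha * semantic + (1 - alpha) * motion
    idx = np.arange(T)
    decay = np.exp(-gamma * np.abs(idx[:, None] - idx[None, :]))
    return fused * decay
```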

Graph Attention Propagation

To optimize frame features and maintain temporal consistency, EventVAD employs an orthogonal-constrained graph attention mechanism. It generates orthogonal Q, K, V matrices via QR decomposition, avoiding feature collapse. Attention weights are iteratively propagated to enhance event boundary distinction.
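A rough sketch of one such propagation step is shown below. The random initialization of the projection matrices and the masking of attention by the graph are assumptions made for illustration, not the paper's exact update rule.

```python
import numpy as np

def graph_attention_step(feats, adj, seed=0):
    """One attention-propagation step over the frame graph (illustrative sketch).

    feats: (T, d) frame features; adj: (T, T) edge weights from the dynamic graph.
    """
    T, d = feats.shape
    rng = np.random.default_rng(seed)
    # QR decomposition yields orthogonal projection matrices,
    # which keeps the projected features from collapsing onto each other.
    Wq, _ = np.linalg.qr(rng.standard_normal((d, d)))
    Wk, _ = np.linalg.qr(rng.standard_normal((d, d)))
    Wv, _ = np.linalg.qr(rng.standard_normal((d, d)))
    Q, K, V = feats @ Wq, feats @ Wk, feats @ Wv
    scores = (Q @ K.T) / np.sqrt(d)
    # Restrict attention to edges that exist in the dynamic graph.
    scores = np.where(adj > 0, scores, -1e9)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ V  # propagated frame features
```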

Statistical Boundary Detection

For accurate event boundary segmentation, EventVAD uses a noise-robust statistical method. It combines L2 norm (amplitude change) and cosine similarity (direction change), smoothed with Savitzky-Golay filter, and employs a dynamic threshold based on MAD for unsupervised detection.
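The sketch below illustrates this kind of detector using SciPy's savgol_filter. How the two change signals are combined and the threshold factor k are assumptions for illustration, not the paper's exact statistics.

```python
import numpy as np
from scipy.signal import savgol_filter

def detect_event_boundaries(feats, window=11, poly=3, k=2.5):
    """Split a video into events from per-frame features (illustrative sketch).

    feats: (T, d) propagated frame features; window, poly, and k are placeholder values.
    """
    # Amplitude change: L2 norm of consecutive feature differences.
    amplitude = np.linalg.norm(feats[1:] - feats[:-1], axis=1)
    # Direction change: 1 - cosine similarity between consecutive frames.
    a = feats[1:] / (np.linalg.norm(feats[1:], axis=1, keepdims=True) + 1e-8)
    b = feats[:-1] / (np.linalg.norm(feats[:-1], axis=1, keepdims=True) + 1e-8)
    direction = 1.0 - np.sum(a * b, axis=1)
    # Combine the two signals (a simple sum here) and smooth with Savitzky-Golay.
    signal = savgol_filter(amplitude + direction, window, poly)
    # Dynamic threshold from the median absolute deviation (MAD).
    med = np.median(signal)
    mad = np.median(np.abs(signal - med))
    boundaries = np.where(signal > med + k * mad)[0] + 1
    return boundaries  # frame indices where a new event starts
```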

Event-Centric Anomaly Scoring

To improve MLLMs’ video understanding, EventVAD adopts a hierarchical prompt strategy. It first generates a description of each event segment, then outputs anomaly scores based on the description, balancing context and feature detail, reducing long-video errors.
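A minimal illustration of a two-stage, describe-then-score prompt is given below; the wording is hypothetical and not taken from the paper.

```python
def hierarchical_prompts(event_description: str) -> tuple[str, str]:
    """Two-stage prompting: describe the event first, then score it.

    The prompt text below is illustrative, not the paper's exact wording.
    """
    describe_prompt = (
        "Watch this video event and describe what happens, "
        "focusing on the people, objects, actions, and scene context."
    )
    score_prompt = (
        "Given the event description below, rate how anomalous the event is "
        "on a scale from 0 (clearly normal) to 1 (clearly abnormal), "
        "then briefly justify the score.\n\n"
        f"Event description: {event_description}"
    )
    return describe_prompt, score_prompt
```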

Experimental Validation

The team evaluated EventVAD on UCF-Crime and XD-Violence datasets, achieving performance surpassing existing methods.

Results on UCF-Crime

EventVAD with 7 billion parameters achieved an AUC of 82.03%, outperforming LAVAD (which uses 13 billion parameters) by nearly 4%, and surpassing all unsupervised and weakly supervised methods.

Results on XD-Violence

On XD-Violence, EventVAD outperformed existing training-free SOTA methods by about 5% in AP and AUC, demonstrating strong adaptability in high-resolution scenarios.

Visualization Analysis

The visualization of graph attention propagation shows the change in frame-to-frame weights for abnormal and normal videos from UCF-Crime, highlighting the model’s focus on relevant temporal relations.

Boundary Detection Visualization

Using UCF-Crime samples, the boundary detection process is visualized, showing how EventVAD accurately segments abnormal and normal video segments.

Qualitative Results

Visualizations of normal and abnormal samples from UCF-Crime and XD-Violence datasets demonstrate the effectiveness of EventVAD’s event segmentation and anomaly scoring, aligning well with true labels.

Summary and Outlook

As the first event-centric, training-free video anomaly detection model, EventVAD promotes a shift from frame-level to event-level annotation, reducing manual labeling and re-training costs. With ongoing advancements in video understanding, this paradigm will lay the foundation for more fine-grained video analysis and inspire future innovations in anomaly detection.
