By Insights Team in AI — 24 Jul 2025

First Multi-Modal Industrial Signal Foundation Model FISHER: Open Source Weights from Tsinghua & Shanghai Jiao Tong

FISHER, the pioneering multi-modal industrial signal foundation model developed by Tsinghua, Shanghai Jiao Tong, and others, is now open source, enabling advanced industrial signal analysis.

Recently, researchers from Tsinghua University, Shanghai Jiao Tong University, Beijing Huakong Zhijia Technology Co., Ltd., and North China Electric Power University jointly released FISHER, the first multi-modal industrial signal foundation model. Using a block-building approach, it models heterogeneous industrial signals in a unified manner. The technical report and weights are now open source, and everyone is welcome to use it!

Paper link: https://arxiv.org/abs/2507.16696
GitHub repository: https://github.com/jianganbai/FISHER

Research Background

In recent years, more and more industrial equipment has been fitted with sensors to monitor its operating status. However, analyzing industrial signals efficiently remains challenging because the signals are so heterogeneous. This is summarized as the M5 problem: multi-modality, multi-sampling rate, multi-scale, multi-task, and fault scarcity.

Existing methods tend to analyze small-scale signals, such as vibration-based bearing fault diagnosis, using small models trained on limited data. These models fail to leverage the advantages of large-scale training and the complementarity between modalities. Moreover, each sub-problem in industrial maintenance requires separate models, increasing complexity in practical applications.

Research Motivation

Despite the apparent differences, industrial signals share similar intrinsic features and semantic information:

Same semantic information: signals reflect the same health status.
Similar generation mechanisms: sound (which is perceived as vibration of the eardrum) and vibration originate from the same source.
Similar analysis methods: mainly spectral analysis.
Similar fault modes: equipment composed of parts, with transferable patterns across devices.
Shared task features: a feature vector can represent multiple health management tasks.

Based on this, we believe a single unified model can represent heterogeneous industrial signals. Given sufficient scale, such a model can learn the commonalities listed above, leading to a milestone leap in representation ability. This led to the development of the FISHER model.

Introduction to FISHER

FISHER is the first multi-modal industrial signal foundation model. It treats the sub-band as its basic modeling unit and stacks sub-bands like building blocks to represent an entire signal, which lets it handle arbitrary sampling rates. The detailed approach is as follows:

Sub-band Modeling

Spectral analysis is common in speech and signal analysis. Unlike Mel spectrograms used in speech models, FISHER uses Short-Time Fourier Transform (STFT) because 1) fault components often appear in high frequencies, and 2) harmonic relationships are important for rotating machinery. To ensure consistent time-frequency resolution across different sampling rates, FISHER’s STFT uses fixed window length and frame shift.
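The effect of a fixed window length and frame shift can be illustrated with a minimal NumPy sketch. It assumes that "fixed" means fixed in time (seconds) rather than in samples, and the 25 ms window, 10 ms shift, and Hann window are illustrative assumptions, not values taken from the FISHER report. The function name `stft_magnitude` is hypothetical.

```python
import numpy as np

WIN_SECONDS = 0.025  # assumed window length, not from the FISHER report
HOP_SECONDS = 0.010  # assumed frame shift, not from the FISHER report


def stft_magnitude(signal, sample_rate,
                   win_seconds=WIN_SECONDS, hop_seconds=HOP_SECONDS):
    """Return a (frames, bins) magnitude spectrogram.

    The window and hop are fixed in seconds, so the number of samples
    per window grows with the sampling rate while the width of one
    frequency bin (1 / win_seconds Hz) stays constant.
    """
    win = int(round(win_seconds * sample_rate))
    hop = int(round(hop_seconds * sample_rate))
    window = np.hanning(win)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))


# The same 1 kHz tone, observed for one second at three sampling rates.
shapes = {}
for sr in (16000, 32000, 48000):
    t = np.arange(sr) / sr
    spec = stft_magnitude(np.sin(2 * np.pi * 1000 * t), sr)
    shapes[sr] = spec.shape
    print(sr, spec.shape)
```

Under these assumptions every bin is 40 Hz wide at every sampling rate, so the 1 kHz tone lands in the same bin each time. A higher sampling rate only appends more bins at the high-frequency end.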

As data volume increases, multi-sampling rates become a challenge. Previous methods resampled signals to a fixed rate (e.g., 16kHz), losing high-frequency information, especially for signals above 44.1kHz. FISHER avoids resampling, modeling signals based on their native sampling characteristics.

When the same source is observed at different sampling rates, the shared frequency bands are largely consistent, while higher sampling rates add extra high-frequency sub-bands that carry more information. Common industrial sampling rates include 16kHz, 32kHz, 44.1kHz, and 48kHz. These rates share an approximate common divisor, so each STFT spectrogram can be viewed as a concatenation of fixed-width sub-bands.

Therefore, FISHER models signals by dividing the spectrogram into fixed-width sub-bands, stacking them like building blocks to represent the entire signal. Each sub-band is processed separately, and the final representation is the concatenation of all sub-band features.
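The building-block idea can be sketched as follows. This is a toy illustration, not FISHER's code: the sub-band width `BAND_BINS`, the choice to drop leftover bins, and `toy_encoder` (a stand-in for the real shared encoder) are all assumptions made for the example.

```python
import numpy as np

BAND_BINS = 100  # hypothetical sub-band width in STFT bins


def split_into_subbands(spec, band_bins=BAND_BINS):
    """Split a (frames, bins) spectrogram into (n_bands, frames, band_bins).

    Bins that do not fill a whole sub-band are dropped (an assumption).
    """
    n_bands = spec.shape[1] // band_bins
    trimmed = spec[:, :n_bands * band_bins]
    bands = trimmed.reshape(spec.shape[0], n_bands, band_bins)
    return bands.transpose(1, 0, 2)


def toy_encoder(band):
    """Stand-in for the shared encoder: three summary statistics."""
    return np.array([band.mean(), band.std(), band.max()])


def represent(spec):
    """Encode each sub-band separately, then concatenate the features."""
    return np.concatenate([toy_encoder(b) for b in split_into_subbands(spec)])


rng = np.random.default_rng(0)
high_rate_spec = rng.random((98, 601))     # e.g. a wider-band spectrogram
low_rate_spec = high_rate_spec[:, :201]    # same source, fewer bins

print(represent(low_rate_spec).shape)      # 2 sub-bands x 3 features
print(represent(high_rate_spec).shape)     # 6 sub-bands x 3 features
```

Because each sub-band is encoded independently, the features of the low-rate signal are a prefix of the features of the high-rate signal in this sketch. The higher rate simply contributes more blocks.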

Model Architecture

FISHER includes one ViT encoder and one CNN decoder, trained via self-distillation. The teacher encoder is an exponential moving average (EMA) of the student encoder, with only the student encoder and decoder having gradients. During training, 80% of the sub-bands are masked, and the unmasked 20% are processed by the student encoder, then combined with masked parts and fed into the decoder. The teacher encoder processes the full sub-band, providing the distillation target. Supervision occurs at both the [CLS] and patch levels. After pretraining, only the student encoder is used for evaluation.
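Two pieces of this training recipe, random masking and the EMA teacher update, can be sketched in a few lines. This is a minimal illustration under stated assumptions: masking is applied to generic "units" (patches here), parameters are plain NumPy arrays in a dict, and the decay of 0.99 is a placeholder rather than FISHER's actual setting.

```python
import numpy as np


def random_mask(n_units, mask_ratio, rng):
    """Return a boolean mask with True marking the masked units."""
    n_masked = int(round(n_units * mask_ratio))
    mask = np.zeros(n_units, dtype=bool)
    mask[rng.choice(n_units, size=n_masked, replace=False)] = True
    return mask


def ema_update(teacher, student, decay):
    """Move each teacher parameter toward the student, in place.

    The teacher receives no gradients; it only tracks the student.
    """
    for name in teacher:
        teacher[name] = decay * teacher[name] + (1.0 - decay) * student[name]


rng = np.random.default_rng(0)

# 80% masked: the student encoder would see only the remaining 20%.
mask = random_mask(n_units=50, mask_ratio=0.8, rng=rng)
visible_idx = np.flatnonzero(~mask)

# Toy parameters standing in for encoder weights.
student = {"w": np.ones(3)}
teacher = {"w": np.zeros(3)}
ema_update(teacher, student, decay=0.99)  # decay value is an assumption
```

In a full training loop the student and decoder would be updated by backpropagation on the distillation loss, and `ema_update` would run after each optimizer step.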

We have open-sourced three sizes of FISHER: tiny (5.5M), mini (10M), and small (22M), all pretrained on a 17,000-hour mixed dataset.

RMIS Benchmark Introduction

To evaluate performance across various health management tasks, we proposed the RMIS benchmark, which includes 5 anomaly detection datasets and 13 fault diagnosis datasets across 4 modalities. Anomaly detection is a binary normal/abnormal classification without anomalies in training; fault diagnosis is multi-class, with all classes in both training and testing. The models use the same KNN inference without fine-tuning to assess inherent capabilities.
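The two task setups can be sketched with a small KNN routine over frozen embeddings. The distance metric (Euclidean), the values of k, and the function names are assumptions made for illustration; the benchmark's exact KNN configuration is not stated here. The embeddings below are toy 2-D points, not model outputs.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def knn_anomaly_score(train_normal, test, k=1):
    """Mean distance to the k nearest normal training embeddings.

    Only normal data is available for training, so a larger distance
    is read as "more anomalous".
    """
    dists = np.sort(pairwise_dist(test, train_normal), axis=1)[:, :k]
    return dists.mean(axis=1)


def knn_classify(train_x, train_y, test, k=3):
    """Majority vote among the k nearest labelled training embeddings."""
    nearest = np.argsort(pairwise_dist(test, train_x), axis=1)[:, :k]
    return np.array([np.bincount(votes).argmax() for votes in train_y[nearest]])


# Toy embeddings: class 0 near the origin, class 1 near (5, 5).
train_x = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
train_y = np.array([0, 0, 0, 1, 1, 1])
test_x = np.array([[0.2, 0.2], [5.5, 5.5]])

scores = knn_anomaly_score(train_x[train_y == 0], test_x)
labels = knn_classify(train_x, train_y, test_x)
```

Because nothing is fine-tuned, differences in the resulting scores reflect the quality of the embeddings themselves rather than of a task-specific head.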

Experimental Results

We screened common pretrained models on RMIS and selected the top 5 as baselines, covering model sizes from 5M to 1.2B parameters. Speech models were left out of the comparison because they generally performed worse.

Benchmark Scores

On RMIS, the three FISHER versions (tiny, mini, and small) outperform all baselines by at least 3.91%, 4.34%, and 5.03%, respectively, demonstrating strong generalization. In anomaly detection, FISHER scores slightly below BEATs; in fault diagnosis, it surpasses all baselines, benefiting from full-band utilization. The largest open-sourced FISHER model has only 22M parameters, far fewer than the 90M common among the baselines.

Scaling Effect

The graph above compares RMIS scores as model size increases. FISHER’s curve lies well above that of the baselines: even the smallest model, FISHER-tiny, surpasses all baselines, indicating more effective scaling.

Additionally, 100M seems to be a scaling inflection point, likely due to high redundancy in industrial signals. Data cleaning and increasing data proportion are key for scaling up. Test-time scaling is also a promising direction.

Variable Split Ratio

For the 12 datasets without an official train/test split, we plotted performance curves under varying split ratios and estimated the area under each curve. FISHER has the largest area, indicating robust performance across different split ratios.
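The area-under-curve summary described above can be approximated with the trapezoidal rule. The sketch below assumes that rule; how the authors actually estimated the area is not stated here. The ratios and scores are made-up placeholder numbers for two hypothetical models, not results from the report.

```python
def area_under_curve(ratios, scores):
    """Trapezoidal estimate of the area under a score-vs-ratio curve.

    `ratios` must be sorted in increasing order and have the same
    length as `scores`.
    """
    area = 0.0
    for i in range(1, len(ratios)):
        width = ratios[i] - ratios[i - 1]
        area += 0.5 * (scores[i] + scores[i - 1]) * width
    return area


# Placeholder curves (hypothetical values, for illustration only).
ratios = [0.1, 0.3, 0.5, 0.7, 0.9]
model_a = [0.60, 0.70, 0.75, 0.78, 0.80]
model_b = [0.40, 0.62, 0.74, 0.79, 0.81]

area_a = area_under_curve(ratios, model_a)
area_b = area_under_curve(ratios, model_b)
print(area_a > area_b)
```

A single area value rewards models that stay accurate when little training data is available, not just models that do well at one particular ratio.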
