Computer Vision and Pattern Recognition 7
♻ ☆ RailSafeNet: Visual Scene Understanding for Tram Safety
Tram-human interaction safety is an important challenge: trams frequently
operate in densely populated areas, where collisions can result in anything
from minor injuries to fatalities. This paper addresses the issue from the
perspective of designing a solution leveraging digital image processing, deep
learning, and artificial intelligence to improve the safety of pedestrians,
drivers, cyclists, pets, and tram passengers. We present RailSafeNet, a
real-time framework that fuses semantic segmentation, object detection and a
rule-based Distance Assessor to highlight track intrusions. Using only
monocular video, the system identifies rails, localises nearby objects and
classifies their risk by comparing projected distances with the standard 1435 mm
rail gauge. Experiments on the diverse RailSem19 dataset show that a
class-filtered SegFormer B3 model achieves 65% intersection-over-union (IoU),
while a fine-tuned YOLOv8 attains 75.6% mean average precision (mAP) at an IoU
threshold of 0.50. RailSafeNet therefore
delivers accurate, annotation-light scene understanding that can warn drivers
before dangerous situations escalate. Code available at
https://github.com/oValach/RailSafeNet.
comment: 11 pages, 5 figures, EPIA2025
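The rule-based Distance Assessor lends itself to a compact illustration: once
the two rails are localised on an image row, the known 1435 mm gauge fixes the
metric scale at that row, so an object's pixel offset from the track converts
directly to millimetres. The Python sketch below follows this idea; the
function name, inputs, and risk thresholds are assumptions for illustration,
not the authors' implementation.

```python
# Illustrative distance assessor in the spirit of RailSafeNet; thresholds
# and interface are assumed, not taken from the paper's code.
RAIL_GAUGE_MM = 1435.0  # standard rail gauge used as the metric reference

def assess_risk(rail_left_x: float, rail_right_x: float, object_x: float,
                critical_mm: float = 1500.0, warning_mm: float = 3000.0) -> str:
    """Classify an object's risk from its horizontal distance to the track.

    All inputs are pixel x-coordinates measured on the same image row, so
    the known 1435 mm gauge fixes the mm-per-pixel scale at that row.
    """
    gauge_px = abs(rail_right_x - rail_left_x)
    if gauge_px <= 0:
        return "unknown"
    mm_per_px = RAIL_GAUGE_MM / gauge_px

    if rail_left_x <= object_x <= rail_right_x:
        return "critical"  # object is between the rails
    # Distance from the object to the nearer rail, in pixels then millimetres.
    dist_px = min(abs(object_x - rail_left_x), abs(object_x - rail_right_x))
    dist_mm = dist_px * mm_per_px

    if dist_mm < critical_mm:
        return "critical"
    if dist_mm < warning_mm:
        return "warning"
    return "safe"
```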
♻ ☆ Enriched text-guided variational multimodal knowledge distillation network (VMD) for automated diagnosis of plaque vulnerability in 3D carotid artery MRI
Multimodal learning has attracted much attention in recent years due to its
ability to effectively utilize features from a variety of modalities.
Diagnosing the vulnerability of atherosclerotic plaques directly from 3D
carotid MRI is relatively challenging for both radiologists and
conventional 3D vision networks. In clinical practice, radiologists assess
patient conditions using a multimodal approach that incorporates various
imaging modalities and domain-specific expertise, paving the way for the
creation of multimodal diagnostic networks. In this paper, we develop an
effective strategy that leverages radiologists' domain knowledge to automate
the diagnosis of carotid plaque vulnerability through Variational inference and
Multimodal knowledge Distillation (VMD). This method excels in harnessing
cross-modality prior knowledge from limited image annotations and radiology
reports within training data, thereby enhancing the diagnostic network's
accuracy on unannotated 3D MRI images. We conducted in-depth experiments on an
in-house dataset and verified the effectiveness of the proposed VMD strategy.
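At its core, distilling a multimodal teacher (images plus radiology reports)
into an image-only student can be expressed as a soft-label objective. The
PyTorch sketch below is a generic knowledge-distillation loss, not the paper's
exact variational formulation; the temperature and weighting parameters are
assumptions.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, alpha: float = 0.5):
    """Soft-label KD: a multimodal (image + report) teacher supervises an
    image-only student so it can diagnose unannotated 3D MRI on its own."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T  # rescale gradients softened by the temperature
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```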
♻ ☆ MSMA: Multi-Scale Feature Fusion For Multi-Attribute 3D Face Reconstruction From Unconstrained Images
Reconstructing 3D face from a single unconstrained image remains a
challenging problem due to diverse conditions in unconstrained environments.
Recently, learning-based methods have achieved notable results by effectively
capturing complex facial structures and details across varying conditions.
However, learning-based
methods for 3D face reconstruction typically require substantial amounts of 3D
facial data, which is difficult and costly to obtain. Consequently, to reduce
reliance on labeled 3D face datasets, many existing approaches employ
projection-based losses between generated and input images to constrain model
training. Despite these advancements, existing approaches
frequently struggle to capture detailed and multi-scale features under diverse
facial attributes and conditions, leading to incomplete or less accurate
reconstructions. In this paper, we propose a Multi-Scale Feature Fusion with
Multi-Attribute (MSMA) framework for 3D face reconstruction from unconstrained
images. Our method integrates multi-scale feature fusion with a focus on
multi-attribute learning and leverages a large-kernel attention module to
enhance the precision of feature extraction across scales, enabling accurate 3D
facial parameter estimation from a single 2D image. Comprehensive experiments
on the MICC Florence, FaceWarehouse, and custom-collected datasets demonstrate
that our approach achieves results on par with current state-of-the-art
methods and, in some instances, surpasses them under challenging conditions.
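For readers unfamiliar with large-kernel attention, such modules typically
decompose a large convolution into a depthwise convolution, a dilated
depthwise convolution, and a pointwise convolution whose output gates the
input features. The PyTorch sketch below follows this common decomposition (in
the style of VAN's LKA); it is illustrative and not necessarily the exact
module used in MSMA.

```python
import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    """Approximates a large receptive field cheaply: a 5x5 depthwise conv,
    a 7x7 depthwise conv with dilation 3 (effective 19x19 kernel), and a
    1x1 conv produce an attention map that modulates the input features."""
    def __init__(self, dim: int):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return x * attn  # gate the features with the attention map
```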
♻ ☆ DUAL-VAD: Dual Benchmarks and Anomaly-Focused Sampling for Video Anomaly Detection
Video Anomaly Detection (VAD) is critical for surveillance and public safety.
However, existing benchmarks are limited to either frame-level or video-level
tasks, which prevents a holistic view of model generalization. This work first
introduces a softmax-based frame allocation strategy that prioritizes
anomaly-dense segments while maintaining full-video coverage, enabling balanced
sampling across temporal scales. Building on this process, we construct two
complementary benchmarks. The image-based benchmark evaluates frame-level
reasoning with representative frames, while the video-based benchmark extends
to temporally localized segments and incorporates an abnormality scoring task.
Experiments on UCF-Crime demonstrate improvements at both the frame and video
levels, and ablation studies confirm clear advantages of anomaly-focused
sampling over uniform and random baselines.
comment: 6 pages in IEEE double-column format, 1 figure, 5 tables. The paper
introduces a unified framework for Video Anomaly Detection (VAD) featuring
dual benchmarks and an anomaly-focused sampling strategy
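The softmax-based frame allocation can be summarised as: give every segment at
least one frame to preserve full-video coverage, then distribute the remaining
budget in proportion to the softmax of per-segment anomaly scores. A minimal
NumPy sketch follows; the temperature parameter is an assumption, not a value
from the paper.

```python
import numpy as np

def allocate_frames(scores: np.ndarray, budget: int, tau: float = 1.0) -> np.ndarray:
    """Allocate a total frame budget across video segments: one frame per
    segment guarantees coverage, and the rest follows softmax(score / tau),
    concentrating frames on anomaly-dense segments."""
    n = len(scores)
    assert budget >= n, "budget must cover at least one frame per segment"
    logits = scores / tau
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    alloc = np.ones(n, dtype=int) + np.floor((budget - n) * probs).astype(int)
    # Hand out frames lost to flooring, highest-probability segments first.
    for i in np.argsort(-probs)[: budget - alloc.sum()]:
        alloc[i] += 1
    return alloc
```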
♻ ☆ Disentangling Content from Style to Overcome Shortcut Learning: A Hybrid Generative-Discriminative Learning Framework
Despite the remarkable success of Self-Supervised Learning (SSL), its
generalization is fundamentally hindered by Shortcut Learning, where models
exploit superficial features like texture instead of intrinsic structure. We
experimentally verify this flaw within the generative paradigm (e.g., MAE) and
argue it is a systemic issue also affecting discriminative methods, identifying
it as the root cause of their failure on unseen domains. While existing methods
often tackle this at a surface level by aligning or separating domain-specific
features, they fail to alter the underlying learning mechanism that fosters
shortcut dependency. To address this at its core, we propose the Hybrid
Generative-Discriminative Learning framework (HyGDL), which achieves
explicit content-style disentanglement. Our approach is guided by the
Invariance Pre-training Principle: forcing a model to learn an invariant
essence by systematically varying a bias (e.g., style) at the input while
keeping the supervision signal constant. HyGDL operates on a single encoder and
analytically defines style as the component of a representation that is
orthogonal to its style-invariant content, derived via vector projection. This
is operationalized through a synergistic design: (1) a self-distillation
objective learns a stable, style-invariant content direction; (2) an analytical
projection then decomposes the representation into orthogonal content and style
vectors; and (3) a style-conditioned reconstruction objective uses these
vectors to restore the image, providing end-to-end supervision. Unlike prior
methods that rely on implicit heuristics, this principled disentanglement
allows HyGDL to learn truly robust representations, demonstrating superior
performance on benchmarks designed to diagnose shortcut learning.
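The analytical projection in step (2) is ordinary vector algebra: given a
representation z and the learned style-invariant content direction c, the
content component is the projection of z onto c and the style component is the
orthogonal residual. A minimal PyTorch sketch, with tensor shapes assumed for
illustration:

```python
import torch
import torch.nn.functional as F

def decompose(z: torch.Tensor, c: torch.Tensor):
    """Split representations z (B, D) into a content component (projection
    onto the unit content direction c of shape (D,)) and a style component
    (the residual orthogonal to c)."""
    c = F.normalize(c, dim=-1)
    content = (z @ c).unsqueeze(-1) * c  # projection of each row onto c
    style = z - content                  # orthogonal (style) residual
    return content, style
```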
♻ ☆ SFGNet: Semantic and Frequency Guided Network for Camouflaged Object Detection ICASSP 2026
Camouflaged object detection (COD) aims to segment objects that blend into
their surroundings. However, most existing studies overlook the semantic
differences among textual prompts of different targets as well as fine-grained
frequency features. In this work, we propose a novel Semantic and Frequency
Guided Network (SFGNet), which incorporates semantic prompts and
frequency-domain features to capture camouflaged objects and improve boundary
perception. We further design a Multi-Band Fourier Module (MBFM) to enhance
the network's ability to handle complex backgrounds and blurred boundaries.
In addition, we design an Interactive Structure Enhancement Block (ISEB) to
ensure structural integrity and boundary details in the predictions. Extensive
experiments conducted on three COD benchmark datasets demonstrate that our
method significantly outperforms state-of-the-art approaches. The core code of
the model is available at https://github.com/winter794444/SFGNetICASSP2026.
comment: Submitted to ICASSP 2026 by Dezhen Wang et al. Copyright 2026 IEEE.
Personal use of this material is permitted. Permission from IEEE must be
obtained for all other uses, including reprinting/republishing, creating new
collective works, for resale or redistribution to servers or lists, or reuse
of any copyrighted component of this work. DOI will be added upon IEEE Xplore
publication
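While the MBFM internals are not spelled out in the abstract, a multi-band
Fourier operation generally amounts to splitting a feature map's spectrum into
radial frequency bands that can then be processed separately before
recombination. The PyTorch sketch below shows one such band split; the cutoff
values are assumptions, not the paper's.

```python
import torch

def fourier_bands(x: torch.Tensor, cutoffs=(0.15, 0.4)):
    """Split (B, C, H, W) features into low/mid/high frequency bands using
    radial masks in the Fourier domain; the three bands sum back to x."""
    B, C, H, W = x.shape
    fy = torch.fft.fftfreq(H, device=x.device).view(H, 1)
    fx = torch.fft.fftfreq(W, device=x.device).view(1, W)
    radius = torch.sqrt(fy ** 2 + fx ** 2)  # normalised radial frequency
    spec = torch.fft.fft2(x)
    bands, taken = [], torch.zeros_like(radius, dtype=torch.bool)
    for cut in (*cutoffs, float("inf")):
        mask = (radius <= cut) & ~taken
        bands.append(torch.fft.ifft2(spec * mask).real)
        taken |= mask
    return bands  # [low, mid, high] feature maps
```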
♻ ☆ Leveraging Geometric Priors for Unaligned Scene Change Detection
Unaligned Scene Change Detection (SCD) aims to detect scene changes between
image pairs captured at different times without assuming viewpoint alignment. To
handle viewpoint variations, current methods rely solely on 2D visual cues to
establish cross-image correspondence to assist change detection. However, large
viewpoint changes can alter visual observations, causing appearance-based
matching to drift or fail. Additionally, supervision limited to 2D change masks
from small-scale SCD datasets restricts the learning of generalizable
multi-view knowledge, making it difficult to reliably identify visual overlaps
and handle occlusions. This lack of explicit geometric reasoning represents a
critical yet overlooked limitation. In this work, we introduce geometric priors
for the first time to address the core challenges of unaligned SCD: reliable
identification of visual overlaps, robust correspondence establishment, and
explicit occlusion detection. Building on these priors, we
propose a training-free framework that integrates them with the powerful
representations of a visual foundation model to enable reliable change
detection under viewpoint misalignment. Through extensive evaluation on the
PSCD, ChangeSim, and PASLCD datasets, we demonstrate that our approach achieves
superior and robust performance. Our code will be released at
https://github.com/ZilingLiu/GeoSCD.
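One concrete way geometric priors can identify visual overlaps is depth-based
reprojection: back-project one image's pixels with its depth map, transform
them by the relative camera pose, and test where they land in the other view.
The NumPy sketch below assumes known shared intrinsics, a depth map, and a
relative pose (e.g., estimated by a geometric foundation model); occlusion
detection would additionally compare reprojected depths against the second
view's own depth.

```python
import numpy as np

def visual_overlap(depth_a: np.ndarray, K: np.ndarray,
                   R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Mask of image-A pixels plausibly visible in image B: back-project
    with depth_a (H, W), move into B's camera via (R, t), and keep points
    that reproject inside B's frame in front of the camera. K is the
    (3, 3) intrinsic matrix assumed shared by both views."""
    H, W = depth_a.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, HW)
    pts_a = np.linalg.inv(K) @ pix * depth_a.reshape(1, -1)            # camera A
    pts_b = R @ pts_a + t.reshape(3, 1)                                # camera B
    proj = K @ pts_b
    z = np.where(np.abs(proj[2]) < 1e-9, 1e-9, proj[2])
    x, y = proj[0] / z, proj[1] / z
    inside = (proj[2] > 0) & (x >= 0) & (x < W) & (y >= 0) & (y < H)
    return inside.reshape(H, W)
```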