Computer Vision and Pattern Recognition 11
♻ ☆ CTA: Cross-Task Alignment for Better Test Time Training
Samuel Barbeau, Pedram Fekri, David Osowiechi, Ali Bahri, Moslem Yazdanpanah, Masih Aminbeidokhti, Christian Desrosiers
Deep learning models have demonstrated exceptional performance across a wide
range of computer vision tasks. However, their performance often degrades
significantly when faced with distribution shifts, such as domain or dataset
changes. Test-Time Training (TTT) has emerged as an effective method to enhance
model robustness by incorporating an auxiliary unsupervised task during
training and leveraging it for model updates at test time. In this work, we
introduce CTA (Cross-Task Alignment), a novel approach for improving TTT.
Unlike existing TTT methods, CTA does not require a specialized model
architecture and instead takes inspiration from the success of multi-modal
contrastive learning to align a supervised encoder with a self-supervised one.
This process enforces alignment between the learned representations of both
models, thereby mitigating the risk of gradient interference, preserving the
intrinsic robustness of self-supervised learning, and enabling more semantically
meaningful updates at test time. Experimental results demonstrate substantial
improvements in robustness and generalization over the state-of-the-art on
several benchmark datasets.
comment: Preprint, under review
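To make the cross-task alignment idea concrete, here is a minimal PyTorch sketch of an InfoNCE-style loss that pulls a supervised encoder's features toward those of a self-supervised encoder; the function name, shapes, and exact loss form are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch only: contrastive alignment between paired features of a
# supervised encoder and a (typically frozen) self-supervised encoder.
import torch
import torch.nn.functional as F

def cross_task_alignment_loss(sup_feats, ssl_feats, temperature=0.07):
    """InfoNCE between paired supervised / self-supervised features of shape (B, D)."""
    sup = F.normalize(sup_feats, dim=-1)
    ssl = F.normalize(ssl_feats, dim=-1)
    logits = sup @ ssl.t() / temperature                 # (B, B) cosine-similarity logits
    targets = torch.arange(sup.size(0), device=sup.device)
    # Symmetric loss: each supervised feature should match its own SSL counterpart.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```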
♻ ☆ MedGemma Technical Report
Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Brush, Kenneth Philbrick, Howard Hu, Howard Yang, Richa Tiwari, Sunny Jansen, Preeti Singh, Yun Liu, Shekoofeh Azizi, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Riviere, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Elena Buchatskaya, Jean-Baptiste Alayrac, Dmitry Lepikhin, Vlad Feinberg, Sebastian Borgeaud, Alek Andreev, Cassidy Hardin, Robert Dadashi, Léonard Hussenot, Armand Joulin, Olivier Bachem, Yossi Matias, Katherine Chou, Avinatan Hassidim, Kavi Goel, Clement Farabet, Joelle Barral, Tris Warkentin, Jonathon Shlens, David Fleet, Victor Cotruta, Omar Sanseviero, Gus Martins, Phoebe Kirk, Anand Rao, Shravya Shetty, David F. Steiner, Can Kirmizibayrak, Rory Pilgrim, Daniel Golden, Lin Yang
Artificial intelligence (AI) has significant potential in healthcare
applications, but its training and deployment face challenges due to
healthcare's diverse data, complex tasks, and the need to preserve privacy.
Foundation models that perform well on medical tasks and require less
task-specific tuning data are critical to accelerate the development of
healthcare AI applications. We introduce MedGemma, a collection of medical
vision-language foundation models based on Gemma 3 4B and 27B. MedGemma
demonstrates advanced medical understanding and reasoning on images and text,
significantly exceeding the performance of similar-sized generative models and
approaching the performance of task-specific models, while maintaining the
general capabilities of the Gemma 3 base models. For out-of-distribution tasks,
MedGemma achieves 2.6-10% improvement on medical multimodal question answering,
15.5-18.1% improvement on chest X-ray finding classification, and 10.8%
improvement on agentic evaluations compared to the base models. Fine-tuning
MedGemma further improves performance in subdomains, reducing errors in
electronic health record information retrieval by 50% and reaching comparable
performance to existing specialized state-of-the-art methods for pneumothorax
classification and histopathology patch classification. We additionally
introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP.
MedSigLIP powers the visual understanding capabilities of MedGemma and, used as
a standalone encoder, achieves comparable or better performance than specialized
medical image encoders. Taken together, the MedGemma collection provides a strong
foundation of medical image and text capabilities, with potential to
significantly accelerate medical research and development of downstream
applications. The MedGemma collection, including tutorials and model weights,
can be found at https://goo.gle/medgemma.
♻ ☆ When Does Pruning Benefit Vision Representations?
Pruning is widely used to reduce the complexity of deep learning models, but
its effects on interpretability and representation learning remain poorly
understood. This paper investigates how pruning influences vision models across
three key dimensions: (i) interpretability, (ii) unsupervised object discovery,
and (iii) alignment with human perception. We first analyze different vision
network architectures to examine how varying sparsity levels affect feature
attribution interpretability methods. Additionally, we explore whether pruning
promotes more succinct and structured representations, potentially improving
unsupervised object discovery by discarding redundant information while
preserving essential features. Finally, we assess whether pruning enhances the
alignment between model representations and human perception, investigating
whether sparser models focus on more discriminative features similarly to
humans. Our findings reveal the presence of sweet spots where sparse
models exhibit higher interpretability, downstream generalization, and human
alignment. However, these sweet spots depend strongly on the network architecture
and its size in terms of trainable parameters. Our results suggest a complex
interplay between these three dimensions, highlighting the importance of
investigating when and how pruning benefits vision representations.
comment: Accepted at the 23rd International Conference on Image Analysis and
Processing (ICIAP 2025)
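As a rough illustration of the kind of sparsity sweep such a study involves, the sketch below applies global L1 magnitude pruning at several sparsity levels using torch.nn.utils.prune; the choice of ResNet-18 and the evaluate() placeholder are assumptions, not the paper's protocol.

```python
# Illustrative sketch only: sweep sparsity levels with global magnitude pruning.
import copy
import torch.nn as nn
import torch.nn.utils.prune as prune
from torchvision.models import resnet18

def prune_at_sparsity(model, amount):
    params = [(m, "weight") for m in model.modules()
              if isinstance(m, (nn.Conv2d, nn.Linear))]
    prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=amount)
    for module, name in params:
        prune.remove(module, name)     # bake the sparsity into the weight tensors
    return model

for sparsity in (0.0, 0.3, 0.5, 0.7, 0.9):
    pruned = prune_at_sparsity(copy.deepcopy(resnet18(weights="IMAGENET1K_V1")), sparsity)
    # evaluate(pruned) would measure interpretability, object discovery, and
    # human alignment at this sparsity level (placeholder, not defined here).
```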
♻ ☆ Hita: Holistic Tokenizer for Autoregressive Image Generation
Vanilla autoregressive image generation models generate visual tokens
step-by-step, limiting their ability to capture holistic relationships among
token sequences. Moreover, because most visual tokenizers map local image
patches into latent tokens, they capture only limited global information. To address this, we
introduce \textit{Hita}, a novel image tokenizer for autoregressive (AR) image
generation. It introduces a holistic-to-local tokenization scheme with
learnable holistic queries and local patch tokens. Hita incorporates two key
strategies to better align with the AR generation process: 1) {arranging} a
sequential structure with holistic tokens at the beginning, followed by
patch-level tokens, and using causal attention to maintain awareness of
previous tokens; and 2) adopting a lightweight fusion module before feeding the
de-quantized tokens into the decoder to control information flow and prioritize
holistic tokens. Extensive experiments show that Hita accelerates the training
speed of AR generators and outperforms those trained with vanilla tokenizers,
achieving \textbf{2.59 FID} and \textbf{281.9 IS} on the ImageNet benchmark.
Detailed analysis of the holistic representation highlights its ability to
capture global image properties, such as textures, materials, and shapes.
Additionally, Hita demonstrates effectiveness in zero-shot style transfer
and image inpainting. The code is available at
\href{https://github.com/CVMI-Lab/Hita}{https://github.com/CVMI-Lab/Hita}.
comment: 17 pages, 10 figures
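The holistic-to-local layout can be pictured as learnable global query tokens prepended to the patch-token sequence under causal attention, so later patch tokens can attend to the holistic context. The sketch below is a generic PyTorch rendering of that idea with assumed module names and sizes, not the released Hita code.

```python
# Illustrative sketch only: holistic query tokens placed before patch tokens,
# processed with a causal mask so patches see the global context first.
import torch
import torch.nn as nn

class HolisticToLocalEncoder(nn.Module):
    def __init__(self, num_holistic=32, dim=256, depth=4, heads=8):
        super().__init__()
        self.holistic_queries = nn.Parameter(torch.randn(1, num_holistic, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, D) local patch tokens from an image tokenizer
        b = patch_tokens.size(0)
        holistic = self.holistic_queries.expand(b, -1, -1)
        seq = torch.cat([holistic, patch_tokens], dim=1)   # holistic first, then patches
        causal = nn.Transformer.generate_square_subsequent_mask(seq.size(1)).to(seq.device)
        return self.encoder(seq, mask=causal)              # causal attention over the sequence
```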
♻ ☆ Driving View Synthesis on Free-form Trajectories with Generative Prior ICCV 2025
Driving view synthesis along free-form trajectories is essential for
realistic driving simulations, enabling closed-loop evaluation of end-to-end
driving policies. Existing methods excel at view interpolation along recorded
paths but struggle to generalize to novel trajectories due to limited
viewpoints in driving videos. To tackle this challenge, we propose DriveX, a
novel free-form driving view synthesis framework that progressively distills a
generative prior into the 3D Gaussian model during its optimization. Within
this framework, we utilize a video diffusion model to refine the degraded novel
trajectory renderings from the in-training Gaussian model, while the restored
videos in turn serve as additional supervision for optimizing the 3D Gaussian.
Concretely, we craft an inpainting-based video restoration task, which
disentangles the identification of degraded regions from the generative
capability of the diffusion model and removes the need to simulate specific
degradation patterns when training the diffusion model. To further enhance the
consistency and fidelity of generated contents, the pseudo ground truth is
progressively updated with gradually improved novel trajectory rendering,
allowing both components to co-adapt and reinforce each other while minimizing
disruption to the optimization. By tightly integrating the 3D scene
representation with a generative prior, DriveX achieves high-quality view
synthesis beyond recorded trajectories in real time, unlocking new
possibilities for flexible and realistic driving simulations on free-form
trajectories.
comment: ICCV 2025
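A rough sketch of the alternating scheme described above: degraded renders along novel trajectories are restored by a video diffusion model and reused as progressively refreshed pseudo ground truth. Every object here (gaussians, diffusion, cameras, the loss calls) is a placeholder standing in for components the abstract names, not the actual DriveX code.

```python
# Illustrative sketch only: co-training a 3D Gaussian model with a diffusion prior.
import torch

def distill_generative_prior(gaussians, diffusion, recorded_views, novel_cams,
                             optimizer, steps=3000, refresh_every=200):
    pseudo_gt = None
    for step in range(steps):
        # Standard photometric fit to the recorded trajectory.
        loss = gaussians.reconstruction_loss(recorded_views)
        # Periodically refresh the pseudo ground truth with restored novel views.
        if step % refresh_every == 0:
            with torch.no_grad():
                degraded = gaussians.render(novel_cams)
                pseudo_gt = diffusion.restore(degraded)    # inpainting-style restoration
        # Supervise novel-trajectory renders against the current pseudo ground truth.
        loss = loss + (gaussians.render(novel_cams) - pseudo_gt).abs().mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```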
♻ ☆ PointGAC: Geometric-Aware Codebook for Masked Point Cloud Modeling ICCV 2025
Most masked point cloud modeling (MPM) methods follow a regression paradigm
to reconstruct the coordinate or feature of masked regions. However, they tend
to over-constrain the model to learn the details of the masked region,
resulting in failure to capture generalized features. To address this
limitation, we propose \textbf{\textit{PointGAC}}, a novel clustering-based MPM
method that aims to align the feature distribution of masked regions.
Specifically, it features an online codebook-guided teacher-student framework.
Firstly, it presents a geometry-aware partitioning strategy to extract initial
patches. Then, the teacher model updates a codebook via online k-means based on
features extracted from the complete patches. This procedure encourages the
codebook vectors to become cluster centers. Afterward, we assign the unmasked
features to their corresponding cluster centers, and the student model aligns
the assignment for the reconstructed masked features. This strategy focuses on
identifying the cluster centers to which the masked features belong, enabling
the model to learn more generalized feature representations. Benefiting from a
proposed codebook maintenance mechanism, codebook vectors are actively updated,
which further increases the efficiency of semantic feature learning.
Experiments validate the effectiveness of the proposed method on various
downstream tasks. Code is available at https://github.com/LAB123-tech/PointGAC
comment: ICCV 2025
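To illustrate the clustering-based target described above, the sketch below shows an EMA-style online k-means codebook update and a cross-entropy loss that aligns the student's assignments with the teacher's; the update rule and names are assumptions, not the released PointGAC code.

```python
# Illustrative sketch only: online codebook update plus assignment alignment.
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_codebook(codebook, teacher_feats, momentum=0.99):
    """codebook: (K, D) non-learnable cluster centers; teacher_feats: (N, D)."""
    assign = torch.cdist(teacher_feats, codebook).argmin(dim=1)   # nearest center per feature
    for k in assign.unique():
        center = teacher_feats[assign == k].mean(dim=0)
        codebook[k] = momentum * codebook[k] + (1 - momentum) * center
    return codebook, assign

def alignment_loss(student_feats, codebook, teacher_assign, tau=0.1):
    # Student predicts which cluster center each (reconstructed) feature belongs to.
    logits = -torch.cdist(student_feats, codebook) / tau
    return F.cross_entropy(logits, teacher_assign)
```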
♻ ☆ PVChat: Personalized Video Chat with One-Shot Learning
Yufei Shi, Weilong Yan, Gang Xu, Yumeng Li, Yucheng Chen, Zhenxi Li, Fei Richard Yu, Ming Li, Si Yong Yeo
Video large language models (ViLLMs) excel in general video understanding,
e.g., recognizing activities like talking and eating, but struggle with
identity-aware comprehension, such as "Wilson is receiving chemotherapy" or
"Tom is discussing with Sarah", limiting their applicability in smart
healthcare and smart home environments. To address this limitation, we propose
PVChat, a one-shot learning framework and the first personalized ViLLM that
enables subject-aware question answering (QA) from a single video for each subject. Our
approach optimizes a Mixture-of-Heads (MoH) enhanced ViLLM on a synthetically
augmented video-QA dataset, leveraging a progressive image-to-video learning
strategy. Specifically, we introduce an automated augmentation pipeline that
synthesizes identity-preserving positive samples and retrieves hard negatives
from existing video corpora, generating a diverse training dataset with four QA
types: existence, appearance, action, and location inquiries. To enhance
subject-specific learning, we propose a ReLU Routing MoH attention mechanism,
alongside two novel objectives: (1) Smooth Proximity Regularization for
progressive learning through exponential distance scaling and (2) Head
Activation Enhancement for balanced attention routing. Finally, we adopt a
two-stage training strategy, transitioning from image pre-training to video
fine-tuning, enabling a gradual learning process from static attributes to
dynamic representations. We evaluate PVChat on diverse datasets covering
medical scenarios, TV series, anime, and real-world footage, demonstrating its
superiority in personalized feature understanding after learning from a single
video, compared to state-of-the-art ViLLMs.
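One way to picture the ReLU-routed mixture-of-heads mentioned above is a small router that scores each attention head and keeps only positively scored heads via ReLU before mixing their outputs. The sketch below is a generic interpretation with assumed shapes and names, not the PVChat implementation.

```python
# Illustrative sketch only: ReLU-gated mixing of per-head attention outputs.
import torch
import torch.nn as nn

class ReLUHeadRouter(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.gate = nn.Linear(dim, num_heads)

    def forward(self, x, head_outputs):
        # x: (B, T, D) tokens; head_outputs: (B, H, T, Dh) per-head attention outputs
        scores = torch.relu(self.gate(x.mean(dim=1)))            # (B, H) non-negative gates
        scores = scores / (scores.sum(dim=-1, keepdim=True) + 1e-6)
        return (head_outputs * scores[:, :, None, None]).sum(dim=1)   # weighted head mixture
```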
♻ ☆ What's Making That Sound Right Now? Video-centric Audio-Visual Localization ICCV 2025
Audio-Visual Localization (AVL) aims to identify sound-emitting sources
within a visual scene. However, existing studies focus on image-level
audio-visual associations, failing to capture temporal dynamics. Moreover, they
assume simplified scenarios where sound sources are always visible and involve
only a single object. To address these limitations, we propose AVATAR, a
video-centric AVL benchmark that incorporates high-resolution temporal
information. AVATAR introduces four distinct scenarios -- Single-sound,
Mixed-sound, Multi-entity, and Off-screen -- enabling a more comprehensive
evaluation of AVL models. Additionally, we present TAVLO, a novel video-centric
AVL model that explicitly integrates temporal information. Experimental results
show that conventional methods struggle to track temporal variations due to
their reliance on global audio features and frame-level mappings. In contrast,
TAVLO achieves robust and precise audio-visual alignment by leveraging
high-resolution temporal modeling. Our work empirically demonstrates the
importance of temporal dynamics in AVL and establishes a new standard for
video-centric audio-visual localization.
comment: Published at ICCV 2025. Project page:
https://hahyeon610.github.io/Video-centric_Audio_Visual_Localization/
♻ ☆ UGG-ReID: Uncertainty-Guided Graph Model for Multi-Modal Object Re-Identification
Multi-modal object Re-IDentification (ReID) has gained considerable attention
with the goal of retrieving specific targets across cameras using heterogeneous
visual data sources. Existing methods primarily aim to improve identification
performance, but often overlook the uncertainty arising from inherent defects,
such as intra-modal noise and inter-modal conflicts. This uncertainty is
particularly significant in cases of fine-grained local occlusion and frame
loss, which pose a challenge for multi-modal learning. To address the above
challenge, we propose a robust approach named Uncertainty-Guided Graph model
for multi-modal object ReID (UGG-ReID). UGG-ReID is designed to mitigate noise
interference and facilitate effective multi-modal fusion by estimating both
local and sample-level aleatoric uncertainty and explicitly modeling their
dependencies. Specifically, we first propose the Gaussian patch-graph
representation model that leverages uncertainty to quantify fine-grained local
cues and capture their structural relationships. This process boosts the
expressiveness of modal-specific information, ensuring that the generated
embeddings are both more informative and robust. Subsequently, we design an
uncertainty-guided mixture of experts strategy that dynamically routes samples
to experts exhibiting low uncertainty. This strategy effectively suppresses
noise-induced instability, leading to enhanced robustness. Meanwhile, we design
an uncertainty-guided routing scheme to strengthen multi-modal interaction,
further improving performance. UGG-ReID is comprehensively evaluated on five
representative multi-modal object ReID datasets, encompassing diverse spectral
modalities. Experimental results show that the proposed method achieves
excellent performance on all datasets and is significantly better than current
methods in terms of noise immunity. Our code will be made public upon
acceptance.
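As a rough illustration of uncertainty-guided expert routing, the sketch below lets each expert predict a log-variance and down-weights experts that report high aleatoric uncertainty for a sample; the architecture and names are assumptions, not the UGG-ReID code.

```python
# Illustrative sketch only: gating experts by their predicted (aleatoric) uncertainty.
import torch
import torch.nn as nn

class UncertaintyGuidedMoE(nn.Module):
    def __init__(self, dim, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.uncert_heads = nn.ModuleList(nn.Linear(dim, 1) for _ in range(num_experts))

    def forward(self, x):
        # x: (B, D) fused multi-modal feature for one sample per row
        outs = torch.stack([e(x) for e in self.experts], dim=1)            # (B, E, D)
        log_var = torch.stack([h(x) for h in self.uncert_heads], dim=1)    # (B, E, 1)
        gates = torch.softmax(-log_var.squeeze(-1), dim=-1)                # low uncertainty -> high weight
        return (gates.unsqueeze(-1) * outs).sum(dim=1)
```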
♻ ☆ FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection ICCV2025
Pre-trained vision-language models (VLMs) have advanced out-of-distribution
(OOD) detection recently. However, existing CLIP-based methods often focus on
learning OOD-related knowledge to improve OOD detection, showing limited
generalization or reliance on external large-scale auxiliary datasets. In this
study, instead of delving into the intricate OOD-related knowledge, we propose
an innovative CLIP-based framework based on Forced prompt leArning (FA),
designed to make full use of the In-Distribution (ID) knowledge and ultimately
boost the effectiveness of OOD detection. Our key insight is to learn a prompt
(i.e., forced prompt) that contains more diversified and richer descriptions of
the ID classes beyond the textual semantics of class labels. Specifically, it
promotes better discernment of ID images by enforcing stronger semantic
similarity between ID images and the learnable forced prompt. Moreover, we
introduce a forced coefficient, encouraging the forced prompt to learn more
comprehensive and nuanced descriptions of the ID classes. In this way, FA is
capable of achieving notable improvements in OOD detection, even when trained
without any external auxiliary datasets, while maintaining an identical number
of trainable parameters as CoOp. Extensive empirical evaluations confirm our
method consistently outperforms current state-of-the-art methods. Code is
available at https://github.com/0xFAFA/FA.
comment: 12 pages, 4 figures, Accepted by ICCV2025
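To make the forced-prompt idea concrete, here is a minimal sketch of an objective on top of CLIP-style features that combines the usual class-prompt cross-entropy with a term, scaled by a forced coefficient, that pulls ID image features toward a learnable forced-prompt embedding. The exact loss form and names are assumptions, not the FA implementation.

```python
# Illustrative sketch only: CoOp-style classification plus a forced-prompt pull term.
import torch
import torch.nn.functional as F

def forced_prompt_loss(img_feats, class_text_feats, forced_feat, labels,
                       forced_coef=1.0, temperature=0.01):
    img = F.normalize(img_feats, dim=-1)          # (B, D) image features
    cls_txt = F.normalize(class_text_feats, dim=-1)   # (C, D) class-prompt features
    forced = F.normalize(forced_feat, dim=-1)     # (D,) learnable forced-prompt feature
    ce = F.cross_entropy(img @ cls_txt.t() / temperature, labels)
    force = 1.0 - (img @ forced).mean()           # push ID images toward the forced prompt
    return ce + forced_coef * force
```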
♻ ☆ MALT Diffusion: Memory-Augmented Latent Transformers for Any-Length Video Generation CVPR 2025
Sihyun Yu, Meera Hahn, Dan Kondratyuk, Jinwoo Shin, Agrim Gupta, José Lezama, Irfan Essa, David Ross, Jonathan Huang
Diffusion models are successful for synthesizing high-quality videos but are
limited to generating short clips (e.g., 2-10 seconds). Synthesizing sustained
footage (e.g., over minutes) remains an open research question. In this
paper, we propose MALT Diffusion (using Memory-Augmented Latent Transformers),
a new diffusion model specialized for long video generation. MALT Diffusion (or
just MALT) handles long videos by subdividing them into short segments and
performing segment-level autoregressive generation. To achieve this, we first
propose recurrent attention layers that encode multiple segments into a compact
memory latent vector; by maintaining this memory vector over time, MALT is able
to condition on it and continuously generate new footage based on a long
temporal context. We also present several training techniques that enable the
model to generate frames over a long horizon with consistent quality and
minimal degradation. We validate the effectiveness of MALT through experiments
on long video benchmarks. We first perform an extensive analysis of MALT's
long-context understanding capability and stability using popular long video
benchmarks. For example, MALT achieves an FVD score of 220.4 on 128-frame video
generation on UCF-101, outperforming the previous state-of-the-art of 648.4.
Finally, we explore MALT's capabilities in a text-to-video generation setting
and show that it can produce long videos that compare favorably with recent
techniques for long text-to-video generation.
comment: CVPR 2025 Workshop on AI for Content Creation (Oral)
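The memory mechanism described above can be pictured as a compact set of memory tokens updated by cross-attending over each new segment's latents, with the next segment generated conditioned on that memory. The sketch below is a generic rendering with assumed module names, not the MALT Diffusion implementation.

```python
# Illustrative sketch only: recurrent memory latent updated segment by segment.
import torch
import torch.nn as nn

class SegmentMemory(nn.Module):
    def __init__(self, dim=512, mem_tokens=16, heads=8):
        super().__init__()
        self.init_memory = nn.Parameter(torch.randn(1, mem_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def reset(self, batch_size):
        return self.init_memory.expand(batch_size, -1, -1)

    def forward(self, memory, segment_latents):
        # memory: (B, M, D); segment_latents: (B, T, D) latents of the newest segment
        update, _ = self.attn(memory, segment_latents, segment_latents)
        return memory + update        # recurrent residual update of the memory latent

# Usage sketch: memory = mem.reset(b); for each segment, sample that segment's
# latents conditioned on memory, then call memory = mem(memory, new_latents).
```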