TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis

ICCV 2025

KAIST

Audio Synthesis for Silent Videos

Abstract

This paper introduces Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning (TARO), a novel framework for high-fidelity and temporally coherent video-to-audio synthesis. TARO builds on flow-based transformers, which offer stable training and continuous transformations that benefit both synchronization and audio quality. It introduces two key innovations: (1) Timestep-Adaptive Representation Alignment (TRA), which dynamically aligns latent representations by adjusting the alignment strength according to the noise schedule, ensuring smooth evolution and improved fidelity, and (2) Onset-Aware Conditioning (OAC), which integrates onset cues, sharp event-driven markers of audio-relevant visual moments, to enhance synchronization with dynamic visual events. Extensive experiments on the VGGSound and Landscape datasets demonstrate that TARO outperforms prior methods, achieving a 53% relative reduction in Fréchet Distance (FD), a 29% relative reduction in Fréchet Audio Distance (FAD), and an Alignment Accuracy of 97.19%, highlighting its superior audio quality and synchronization precision.
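
To make the idea behind TRA more concrete, here is a minimal sketch of an alignment loss whose strength depends on the timestep. It is an illustration under stated assumptions, not the authors' implementation: this page does not give the actual weighting function, so the linear ramp in `alignment_weight`, the convention that t = 0 is pure noise and t = 1 is clean data, the cosine-similarity loss, and all function names are hypothetical. Plain Python lists stand in for tensors to keep the example self-contained.

```python
import math


def alignment_weight(t, w_min=0.0, w_max=1.0):
    """Hypothetical schedule: linear ramp from w_min (t = 0) to w_max (t = 1).

    ASSUMPTION: the real TARO schedule is not stated on this page.
    """
    t = min(max(t, 0.0), 1.0)  # clamp the timestep to [0, 1]
    return w_min + (w_max - w_min) * t


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def alignment_loss(hidden_tokens, target_tokens):
    """1 - mean cosine similarity over the tokens of one sample.

    hidden_tokens: projected transformer features, one vector per token.
    target_tokens: features from a frozen pretrained encoder (assumed).
    """
    sims = [cosine_similarity(h, z) for h, z in zip(hidden_tokens, target_tokens)]
    return 1.0 - sum(sims) / len(sims)


def timestep_adaptive_alignment_loss(batch_hidden, batch_target, timesteps):
    """Batch mean of the per-sample alignment loss, scaled by a timestep weight."""
    terms = [
        alignment_weight(t) * alignment_loss(h, z)
        for h, z, t in zip(batch_hidden, batch_target, timesteps)
    ]
    return sum(terms) / len(terms)
```

Under these assumptions, samples drawn at noisier timesteps contribute less to the alignment term than samples close to clean data. In a training loop this term would be added to the flow-matching objective with some coefficient, which is likewise not specified here.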

Method Overview

Figure 1: Overview of TARO. TARO is a flow-based multimodal transformer for video-to-audio generation that integrates Timestep-Adaptive Representation Alignment (TRA) and Onset-Aware Conditioning (OAC) to enhance synchronization and fidelity. Black arrows denote branches used only during training, blue arrows denote branches used only during inference, and green arrows denote branches used in both.

Qualitative Comparison

DiffFoley
Seeing&Hearing
FoleyCrafter
Frieren
MDSGen
TARO (Ours)

BibTeX


@article{ton2025taro,
  title={TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis},
  author={Ton, Tri and Hong, Ji Woo and Yoo, Chang D},
  journal={arXiv preprint arXiv:2504.05684},
  year={2025}
}