TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis

Audio Synthesis for Silent Videos

Abstract

This paper introduces Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning (TARO), a novel framework for high-fidelity and temporally coherent video-to-audio synthesis. TARO builds on flow-based transformers, which offer stable training and continuous transformations that improve synchronization and audio quality, and introduces two key innovations: (1) Timestep-Adaptive Representation Alignment (TRA), which dynamically aligns latent representations by adjusting the alignment strength according to the noise schedule, ensuring smooth evolution and improved fidelity, and (2) Onset-Aware Conditioning (OAC), which integrates onset cues, sharp event-driven markers of audio-relevant visual moments, to enhance synchronization with dynamic visual events. Extensive experiments on the VGGSound and Landscape datasets demonstrate that TARO outperforms prior methods, achieving a 53% relative reduction in Fréchet Distance (FD), a 29% relative reduction in Fréchet Audio Distance (FAD), and an Alignment Accuracy of 97.19%, highlighting its superior audio quality and synchronization precision.
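To make the idea behind TRA concrete, the following is a minimal sketch of an alignment loss whose strength depends on the flow timestep. It is not the authors' implementation: the linear schedule `alignment_weight`, the cosine-based loss, the convention that `t = 1` corresponds to clean data, and all function names are assumptions chosen for illustration only.

```python
import numpy as np

def alignment_weight(t, w_max=1.0):
    """Hypothetical schedule: weight grows linearly with t in [0, 1].

    Assumes t = 0 is pure noise and t = 1 is clean data, so alignment
    is enforced more strongly as the sample becomes less noisy.
    """
    t = np.clip(np.asarray(t, dtype=np.float64), 0.0, 1.0)
    return w_max * t

def cosine_alignment_loss(hidden, target, eps=1e-8):
    """Per-sample mean of (1 - cosine similarity) over tokens.

    hidden, target: arrays of shape (batch, tokens, dim).
    """
    h = hidden / (np.linalg.norm(hidden, axis=-1, keepdims=True) + eps)
    z = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    cos = np.sum(h * z, axis=-1)        # (batch, tokens)
    return 1.0 - cos.mean(axis=-1)      # (batch,)

def timestep_weighted_alignment(hidden, target, t):
    """Scale each sample's alignment loss by its timestep-dependent weight."""
    per_sample = cosine_alignment_loss(hidden, target)
    return float(np.mean(alignment_weight(t) * per_sample))
```

In this sketch, `hidden` stands in for intermediate transformer features and `target` for features from a pretrained audio encoder; both roles are assumptions, and a different schedule could be swapped in by changing `alignment_weight` alone.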

Method Overview

Overview of TARO. TARO is a flow-based multimodal transformer for video-to-audio generation that integrates Timestep-Adaptive Representation Alignment (TRA) and Onset-Aware Conditioning (OAC) to enhance synchronization and fidelity. Black arrows (→) denote branches used only in training, blue arrows (→) branches used only at inference, and green arrows (→) branches used in both training and inference.
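As an illustration of what an onset cue can look like, here is a minimal sketch that derives a frame-level binary onset track from a 1-D signal using a generic energy-flux detector. This page does not specify how TARO extracts or injects its onset cues, so the detector, the function name `onset_track`, and the parameters `frame_len`, `hop`, and `threshold` are all assumptions for illustration.

```python
import numpy as np

def onset_track(wave, frame_len=256, hop=128, threshold=0.5):
    """Return a binary array with 1 at frames where energy rises sharply.

    Generic energy-flux detector, not TARO's actual onset extractor.
    """
    wave = np.asarray(wave, dtype=np.float64)
    n_frames = 1 + max(0, (len(wave) - frame_len) // hop)
    # Short-time energy of each (overlapping) frame.
    energy = np.array([
        np.sum(wave[i * hop : i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])
    # Half-wave rectified energy difference: keep only increases.
    flux = np.maximum(0.0, np.diff(energy, prepend=energy[0]))
    peak = flux.max()
    if peak <= 0.0:
        # No energy increase anywhere, so no onsets.
        return np.zeros(n_frames, dtype=np.int64)
    # Mark frames whose normalized flux reaches the threshold.
    return (flux / peak >= threshold).astype(np.int64)
```

A track like this could then be resampled to the model's temporal resolution and supplied as an extra conditioning signal; how that conditioning is wired into the transformer is left open here.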

Qualitative Comparison

[Video comparison panels: DiffFoley, Seeing&Hearing, FoleyCrafter, Frieren, MDSGen, TARO (Ours)]

BibTeX


        @article{ton2025taro,
          title={TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis},
          author={Ton, Tri and Hong, Ji Woo and Yoo, Chang D},
          journal={arXiv preprint arXiv:2504.05684},
          year={2025}
        }

ICCV 2025
