MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation

KAIST
ICLR 2025

*Indicates Equal Contribution

Qualitative Comparison

Times in parentheses denote each model's inference time in seconds.

| Model | Inference Time | Alignment Accuracy |
|---|---|---|
| MDSGen | 0.05 s | 98.6% |
| Diff-Foley | 0.36 s | 93.9% |
| See and Hear | 18.25 s | 58.1% |
| Foley Crafter | 2.96 s | 83.5% |
(Video examples comparing MDSGen, Diff-Foley, See and Hear, and Foley Crafter against the ground truth.)

Abstract

We introduce MDSGen, a novel framework for vision-guided open-domain sound generation optimized for model parameter size, memory consumption, and inference speed. This framework incorporates two key innovations: (1) a redundant video feature removal module that filters out unnecessary visual information, and (2) a temporal-aware masking strategy that leverages temporal context for enhanced audio generation accuracy. In contrast to existing resource-heavy Unet-based models, MDSGen employs denoising masked diffusion transformers, facilitating efficient generation without reliance on pre-trained diffusion models. Evaluated on the benchmark VGGSound dataset, our smallest model (5M parameters) achieves 97.9% alignment accuracy, using 172x fewer parameters, 371% less memory, and offering 36x faster inference than the current 860M-parameter state-of-the-art model (93.9% accuracy). The larger model (131M parameters) reaches nearly 99% accuracy while requiring 6.5x fewer parameters. These results highlight the scalability and effectiveness of our approach.

Method Overview

Fig1
Overview of MDSGen. MDSGen utilizes denoising masked diffusion transformers to efficiently learn video-conditional distributions for audio generation, replacing traditional Unet-based methods. The fire icon denotes trainable modules, and the locked icon denotes frozen ones. Green arrows denote branches used only during training, blue arrows are used only during inference, and black arrows are used in both training and inference.
Fig2
Audio Masking Strategies. Here, the red square is the learnable mask token.
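To make the temporal-aware masking idea concrete, the sketch below masks whole time frames of a spectrogram token grid, so that masked tokens share temporal positions rather than being scattered uniformly at random. This is a minimal illustrative sketch, not the authors' implementation; the function name, grid shapes, and mask ratio are hypothetical, and in the real model the masked positions would be replaced by the learnable mask token.

```python
import numpy as np

def temporal_aware_mask(num_time, num_freq, mask_ratio, seed=None):
    """Mask entire time frames of a (num_time x num_freq) spectrogram
    token grid. Returns a boolean grid: True = masked (would be replaced
    by the learnable mask token), False = visible to the transformer."""
    rng = np.random.default_rng(seed)
    n_masked = int(round(num_time * mask_ratio))
    # Pick which time frames to hide; every frequency token in a chosen
    # frame is masked together, preserving temporal structure.
    masked_frames = rng.choice(num_time, size=n_masked, replace=False)
    mask = np.zeros((num_time, num_freq), dtype=bool)
    mask[masked_frames] = True
    return mask

mask = temporal_aware_mask(num_time=16, num_freq=8, mask_ratio=0.7, seed=0)
print(mask.sum())  # 88 masked tokens: round(16 * 0.7) = 11 frames x 8 bins
```

Compared with uniform random masking over all tokens, frame-level masking forces the model to reconstruct coherent temporal segments, which is the kind of temporal context the paper's masking strategy leverages.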

Alignment Score

Fig3
Comparison with SOTA audio generation methods on the VGGSound test set. The diameter of each circle represents the memory usage during inference.

Qualitative Results

BibTeX


        @inproceedings{pham2024mdsgenfastefficientmasked,
          author    = {Trung X. Pham and Tri Ton and Chang D. Yoo},
          title     = {MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation},
          booktitle = {International Conference on Learning Representations (ICLR)},
          year      = {2025},
        }