[ACMMM 2025] CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation




Official implementation of CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation
Yuanhong Chen, Kazuki Shimada, Christian Simon, Yukara Ikemiya, Takashi Shibuya, Yuki Mitsufuji
ACM MM 2025 (arXiv 2501.02786)

Overview

🎧 Binaural Audio Generation (BAG): CCStereo tackles the task of generating spatialised binaural audio from monaural audio using corresponding visual cues, enabling immersive sound experiences for applications in VR, AR, and 360° video.
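
Mono-to-binaural pipelines (including the `m2b` method type used in this repo's evaluation command) commonly predict the left-right difference signal and recombine it with the mono mixture. A minimal sketch of that recombination, assuming the mono signal is the channel sum (this formulation is a common convention, not necessarily CCStereo's exact one):

```python
import numpy as np

def binaural_from_mono(mono: np.ndarray, diff: np.ndarray):
    """Recombine a mono mixture with a (model-predicted) left-right
    difference signal. Assumes mono = left + right and diff = left - right;
    `diff` here is a stand-in for the network's output."""
    left = (mono + diff) / 2.0
    right = (mono - diff) / 2.0
    return left, right
```

Predicting the difference rather than both channels directly lets the model focus on the spatial cue while the mono content is carried through unchanged.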

🧠 Context-Aware Audio-Visual Conditioning: Existing BAG methods rely heavily on cross-attention mechanisms and often fail to leverage the rich temporal and spatial dynamics present in video. CCStereo introduces Audio-Visual Adaptive De-normalisation (AVAD) layers that modulate the decoding process with spatial and semantic information derived from video frames, offering finer-grained control.
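
As a rough illustration of the de-normalisation idea (in the spirit of SPADE-style conditional normalisation, not the authors' AVAD implementation), one can instance-normalise the audio decoder features and then modulate them with a scale and shift predicted from a visual embedding. All shapes and projection matrices below are hypothetical:

```python
import numpy as np

def adaptive_denorm(audio_feat, visual_feat, w_gamma, w_beta, eps=1e-5):
    """SPADE-style adaptive de-normalisation sketch (hypothetical shapes).
    audio_feat:  (C, H, W) decoder feature map.
    visual_feat: (D,) pooled visual embedding.
    w_gamma, w_beta: (C, D) learned projection matrices."""
    # Instance-normalise each channel of the audio feature map.
    mu = audio_feat.mean(axis=(1, 2), keepdims=True)
    var = audio_feat.var(axis=(1, 2), keepdims=True)
    normed = (audio_feat - mu) / np.sqrt(var + eps)
    # Predict a per-channel scale and shift from the visual embedding.
    gamma = (w_gamma @ visual_feat)[:, None, None]
    beta = (w_beta @ visual_feat)[:, None, None]
    return (1 + gamma) * normed + beta
```

The key property is that the visual stream controls the statistics of the audio features at every decoding stage, rather than being fused once via attention.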

🔍 Spatial-Aware Contrastive Learning: To improve spatial sensitivity, CCStereo uses a novel contrastive learning framework that mines hard negatives by applying spatial shuffling and temporal frame sampling, effectively simulating object position changes and encouraging the model to distinguish fine-grained spatial relationships in the audio-visual space.
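
The hard-negative idea can be illustrated with a toy InfoNCE loss in which a spatially flipped frame serves as a negative view (a schematic only, not the paper's loss; the embeddings are hypothetical):

```python
import numpy as np

def flip_negative(frame: np.ndarray) -> np.ndarray:
    # Left-right flip: swaps apparent object positions, yielding a view whose
    # spatial layout contradicts the target binaural audio -> a hard negative.
    return frame[:, ::-1]

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Toy InfoNCE: the anchor embedding should match the positive
    more closely than any mined negative."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    sims = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    logits = sims / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                    # positive sits at index 0
```

Because the flipped frame shares all semantic content with the original, the model can only separate it from the positive by attending to spatial layout, which is exactly the sensitivity BAG requires.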

🧪 Test-Time Augmentation without Overhead: Unlike prior methods that ignore the inherent redundancy of video data, CCStereo introduces Test-time Dynamic Scene Simulation (TDSS), a sliding-window augmentation strategy that crops frames from multiple regions (top-left, centre, etc.) without increasing inference cost, boosting robustness and spatial accuracy.
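
A minimal sketch of the sliding-window idea, assuming five fixed crop regions (four corners plus centre) and averaging of the per-view predictions; the crop layout and `model` interface are assumptions for illustration:

```python
import numpy as np

def tdss_views(frame: np.ndarray, crop_h: int, crop_w: int):
    """Crop a (H, W) frame at five regions: four corners plus the centre."""
    H, W = frame.shape[:2]
    ys = [0, 0, (H - crop_h) // 2, H - crop_h, H - crop_h]
    xs = [0, W - crop_w, (W - crop_w) // 2, 0, W - crop_w]
    return [frame[y:y + crop_h, x:x + crop_w] for y, x in zip(ys, xs)]

def tdss_predict(model, frame, crop_h, crop_w):
    # Average the model's predictions over all cropped views, simulating
    # small scene shifts without any extra training or model changes.
    preds = [model(view) for view in tdss_views(frame, crop_h, crop_w)]
    return np.mean(preds, axis=0)
```

Each crop shifts the apparent position of sound sources slightly, so averaging over views acts like a cheap ensemble over simulated camera motions.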

Highlights

CCStereo converts monaural audio to binaural audio using visual input. It addresses key limitations in spatial alignment and generalisation using:

  • AVAD: Audio-Visual Adaptive De-normalisation for feature modulation.
  • SCL: Spatial-aware Contrastive Learning for learning spatial correspondence.
  • TDSS: Test-time Dynamic Scene Simulation for augmentation without added cost.

Requirements

pip install -r requirements.txt

Dataset

For training and evaluation, we use the FAIR-Play and YouTube-360 datasets. The expected dataset structure is as follows:

dataset
    ├── fairplay
    ├── yt_clean
    ├── ...

You can download the datasets from the links above and place them in the dataset directory.

Training

bash run_x-your-dataset-name-x.sh

Evaluation

The pre-trained checkpoint for FairPlay-5Split (split2) can be downloaded from here. A simple evaluation command is as follows:

python test.py \
    --dataset fairplay --setup 5splits --hdf5FolderPath split2 \
    --epochs 1 --num_workers 1 --method_type m2b \
    --data_vol_scaler 1 --audio_length 0.63 \
    --wandb_name dry_run --wandb_mode disabled \
    --batch_size 8 --model_config fairplay_base.json \
    --multi_frames --dim_scale 1 --backbone 18

License and Citation

This project is licensed under the MIT License. Please see the LICENSE file for details.

@article{chen2025ccstereo,
  title={CCStereo: Audio-visual contextual and contrastive learning for binaural audio generation},
  author={Chen, Yuanhong and Shimada, Kazuki and Simon, Christian and Ikemiya, Yukara and Shibuya, Takashi and Mitsufuji, Yuki},
  journal={arXiv preprint arXiv:2501.02786},
  year={2025}
}

Acknowledgements

We acknowledge the use of the FAIR-Play, SPADE, and YouTube-360 projects. Special thanks to the authors of these works for their contributions to the field of audio-visual learning.
