Official implementation of CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation
Yuanhong Chen, Kazuki Shimada, Christian Simon, Yukara Ikemiya, Takashi Shibuya, Yuki Mitsufuji
ACM MM 2025 (arXiv 2501.02786)
🎧 **Binaural Audio Generation (BAG).** CCStereo tackles the task of generating spatialised binaural audio from monaural audio using corresponding visual cues, enabling immersive sound experiences for applications in VR, AR, and 360° video.
🧠 **Context-Aware Audio-Visual Conditioning.** Existing BAG methods rely heavily on cross-attention mechanisms and often fail to leverage the rich temporal and spatial dynamics present in video. CCStereo introduces Audio-Visual Adaptive De-normalisation (AVAD) layers to modulate the decoding process with spatial and semantic information derived from video frames, offering finer-grained control.
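The sketch below illustrates the general idea of visually conditioned adaptive de-normalisation (in the spirit of SPADE-style layers): audio decoder features are normalised without learned affine parameters, then re-scaled and shifted by spatially varying parameters predicted from the visual feature map. Layer names, sizes, and tensor shapes are illustrative assumptions, not the exact AVAD design.

```python
# Minimal sketch of an adaptive de-normalisation layer conditioned on visual features.
# Shapes and layer sizes are assumptions for illustration, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveDenorm(nn.Module):
    def __init__(self, audio_channels: int, visual_channels: int, hidden: int = 128):
        super().__init__()
        # Parameter-free normalisation of the audio decoder features.
        self.norm = nn.InstanceNorm2d(audio_channels, affine=False)
        # Predict spatially varying scale (gamma) and shift (beta) from visual features.
        self.shared = nn.Sequential(
            nn.Conv2d(visual_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.to_gamma = nn.Conv2d(hidden, audio_channels, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(hidden, audio_channels, kernel_size=3, padding=1)

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat: (B, C_a, F, T) spectrogram-domain decoder features
        # visual_feat: (B, C_v, H, W) feature map from the video encoder
        visual_feat = F.interpolate(visual_feat, size=audio_feat.shape[-2:],
                                    mode="bilinear", align_corners=False)
        h = self.shared(visual_feat)
        gamma, beta = self.to_gamma(h), self.to_beta(h)
        # Modulate the normalised audio features with the visually predicted statistics.
        return self.norm(audio_feat) * (1 + gamma) + beta
```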
🔍 **Spatial-Aware Contrastive Learning.** To improve spatial sensitivity, CCStereo uses a contrastive learning framework that mines hard negatives by applying spatial shuffling and temporal frame sampling, effectively simulating object position changes and encouraging the model to distinguish fine-grained spatial relationships in the audio-visual space.
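A minimal sketch of the hard-negative idea, assuming pooled audio and visual embeddings and a simple InfoNCE-style objective: the visual feature map is spatially shuffled to fake a different scene layout, and the loss pushes the audio embedding toward the true layout and away from the shuffled one. Function names and the exact loss form are assumptions, not the paper's implementation.

```python
# Sketch: spatial shuffling to build hard negatives, plus a two-way contrastive loss.
import torch
import torch.nn.functional as F

def spatial_shuffle(visual_feat: torch.Tensor) -> torch.Tensor:
    # visual_feat: (B, C, H, W) -> permute spatial locations to simulate a changed object layout.
    b, c, h, w = visual_feat.shape
    flat = visual_feat.flatten(2)                                   # (B, C, H*W)
    perm = torch.randperm(h * w, device=visual_feat.device)
    return flat[:, :, perm].reshape(b, c, h, w)

def spatial_contrastive_loss(audio_emb: torch.Tensor,
                             visual_emb: torch.Tensor,
                             shuffled_emb: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    # audio_emb / visual_emb / shuffled_emb: (B, D) pooled embeddings.
    audio_emb = F.normalize(audio_emb, dim=-1)
    pos = (audio_emb * F.normalize(visual_emb, dim=-1)).sum(-1) / temperature
    neg = (audio_emb * F.normalize(shuffled_emb, dim=-1)).sum(-1) / temperature
    # The true spatial layout (index 0) must score higher than the shuffled hard negative.
    logits = torch.stack([pos, neg], dim=-1)
    targets = torch.zeros(audio_emb.size(0), dtype=torch.long, device=audio_emb.device)
    return F.cross_entropy(logits, targets)
```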
🧪 **Test-Time Augmentation without Overhead.** Unlike prior methods that ignore the inherent redundancy of video data, CCStereo introduces Test-time Dynamic Scene Simulation (TDSS), a sliding-window augmentation strategy that crops frames from multiple regions (top-left, centre, etc.) without increasing inference cost, boosting robustness and spatial accuracy.
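A rough sketch of the sliding-window idea: crop the reference frame at a few fixed positions (corners and centre), run the model on each crop, and aggregate the binaural predictions. The crop positions, crop size, model signature, and averaging rule here are assumptions for illustration only.

```python
# Sketch: test-time crops over several image regions, with averaged predictions.
import torch

def sliding_window_crops(frame: torch.Tensor, crop: int = 224):
    # frame: (B, 3, H, W) with H, W >= crop; yields top-left, top-right, centre,
    # bottom-left, and bottom-right crops.
    _, _, h, w = frame.shape
    anchors = [(0, 0), (0, w - crop), ((h - crop) // 2, (w - crop) // 2),
               (h - crop, 0), (h - crop, w - crop)]
    for top, left in anchors:
        yield frame[:, :, top:top + crop, left:left + crop]

@torch.no_grad()
def predict_with_tdss(model, mono_spec: torch.Tensor, frame: torch.Tensor) -> torch.Tensor:
    # Hypothetical model(mono_spec, visual_crop) -> binaural prediction; average over crops.
    preds = [model(mono_spec, v) for v in sliding_window_crops(frame)]
    return torch.stack(preds, dim=0).mean(dim=0)
```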
CCStereo converts monaural audio to binaural audio using visual input. It addresses key limitations in spatial alignment and generalisation using:
- AVAD: Audio-Visual Adaptive De-normalisation for feature modulation.
- SCL: Spatial-aware Contrastive Learning for learning spatial correspondence.
- TDSS: Test-time Dynamic Scene Simulation for augmentation without added cost.
Install the dependencies:

```bash
pip install -r requirements.txt
```

For training and evaluation, we use the FAIR-Play and YouTube-360 datasets. The dataset structure is as follows:
```
dataset
├── fairplay
├── yt_clean
├── ...
```
Download the datasets from their official release pages and place them in the `dataset` directory.
Run the script corresponding to your dataset:

```bash
bash run_x-your-dataset-name-x.sh
```

The pre-trained checkpoint for FAIR-Play 5-split (split2) can be downloaded from here. A simple evaluation command is as follows:
```bash
python test.py --dataset fairplay --setup 5splits --hdf5FolderPath split2 --epochs 1 \
  --num_workers 1 --method_type m2b --data_vol_scaler 1 --audio_length 0.63 \
  --wandb_name dry_run --wandb_mode disabled --batch_size 8 \
  --model_config fairplay_base.json --multi_frames --dim_scale 1 --backbone 18
```

This project is licensed under the MIT License. Please see the LICENSE file for details.
```bibtex
@article{chen2025ccstereo,
  title={CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation},
  author={Chen, Yuanhong and Shimada, Kazuki and Simon, Christian and Ikemiya, Yukara and Shibuya, Takashi and Mitsufuji, Yuki},
  journal={arXiv preprint arXiv:2501.02786},
  year={2025}
}
```
We acknowledge the use of the FAIR-Play and YouTube-360 datasets and of SPADE. Special thanks to the authors of these works for their contributions to the field of audio-visual learning.