Official implementation of CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation
Yuanhong Chen, Kazuki Shimada, Christian Simon, Yukara Ikemiya, Takashi Shibuya, Yuki Mitsufuji
ACM MM 2025 (arXiv 2501.02786)
🎧 **Binaural Audio Generation (BAG).** CCStereo tackles the task of generating spatialised binaural audio from monaural audio using corresponding visual cues, enabling immersive sound experiences for applications in VR, AR, and 360° video.
🧠 **Context-Aware Audio-Visual Conditioning.** Existing BAG methods rely heavily on cross-attention mechanisms and often fail to leverage the rich temporal and spatial dynamics present in video. CCStereo introduces Audio-Visual Adaptive De-normalisation (AVAD) layers to modulate the decoding process with spatial and semantic information derived from video frames, offering finer-grained control.
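The sketch below illustrates the general idea of visually conditioned adaptive de-normalisation (in the spirit of SPADE-style layers): audio decoder features are normalised without learned affine parameters, then re-scaled and shifted by spatially varying parameters predicted from the visual feature map. Layer names, sizes, and tensor shapes are illustrative assumptions, not the exact AVAD design.

```python
# Minimal sketch of an adaptive de-normalisation layer conditioned on visual features.
# Shapes and layer sizes are assumptions for illustration, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveDenorm(nn.Module):
    def __init__(self, audio_channels: int, visual_channels: int, hidden: int = 128):
        super().__init__()
        # Parameter-free normalisation of the audio decoder features.
        self.norm = nn.InstanceNorm2d(audio_channels, affine=False)
        # Predict spatially varying scale (gamma) and shift (beta) from visual features.
        self.shared = nn.Sequential(
            nn.Conv2d(visual_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.to_gamma = nn.Conv2d(hidden, audio_channels, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(hidden, audio_channels, kernel_size=3, padding=1)

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat: (B, C_a, F, T) spectrogram-domain decoder features
        # visual_feat: (B, C_v, H, W) feature map from the video encoder
        visual_feat = F.interpolate(visual_feat, size=audio_feat.shape[-2:],
                                    mode="bilinear", align_corners=False)
        h = self.shared(visual_feat)
        gamma, beta = self.to_gamma(h), self.to_beta(h)
        # Modulate the normalised audio features with the visually predicted statistics.
        return self.norm(audio_feat) * (1 + gamma) + beta
```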
🔍 **Spatial-Aware Contrastive Learning.** To improve spatial sensitivity, CCStereo uses a contrastive learning framework that mines hard negatives by applying spatial shuffling and temporal frame sampling, effectively simulating object position changes and encouraging the model to distinguish fine-grained spatial relationships in the audio-visual space.
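A minimal sketch of the hard-negative idea, assuming pooled audio and visual embeddings and a simple InfoNCE-style objective: the visual feature map is spatially shuffled to fake a different scene layout, and the loss pushes the audio embedding toward the true layout and away from the shuffled one. Function names and the exact loss form are assumptions, not the paper's implementation.

```python
# Sketch: spatial shuffling to build hard negatives, plus a two-way contrastive loss.
import torch
import torch.nn.functional as F

def spatial_shuffle(visual_feat: torch.Tensor) -> torch.Tensor:
    # visual_feat: (B, C, H, W) -> permute spatial locations to simulate a changed object layout.
    b, c, h, w = visual_feat.shape
    flat = visual_feat.flatten(2)                                   # (B, C, H*W)
    perm = torch.randperm(h * w, device=visual_feat.device)
    return flat[:, :, perm].reshape(b, c, h, w)

def spatial_contrastive_loss(audio_emb: torch.Tensor,
                             visual_emb: torch.Tensor,
                             shuffled_emb: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    # audio_emb / visual_emb / shuffled_emb: (B, D) pooled embeddings.
    audio_emb = F.normalize(audio_emb, dim=-1)
    pos = (audio_emb * F.normalize(visual_emb, dim=-1)).sum(-1) / temperature
    neg = (audio_emb * F.normalize(shuffled_emb, dim=-1)).sum(-1) / temperature
    # The true spatial layout (index 0) must score higher than the shuffled hard negative.
    logits = torch.stack([pos, neg], dim=-1)
    targets = torch.zeros(audio_emb.size(0), dtype=torch.long, device=audio_emb.device)
    return F.cross_entropy(logits, targets)
```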
🧪 **Test-Time Augmentation without Overhead.** Unlike prior methods that ignore the inherent redundancy of video data, CCStereo introduces Test-time Dynamic Scene Simulation (TDSS), a sliding-window augmentation strategy that crops frames from multiple regions (top-left, centre, etc.) without increasing inference cost, boosting robustness and spatial accuracy.
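A rough sketch of the sliding-window idea: crop the reference frame at a few fixed positions (corners and centre), run the model on each crop, and aggregate the binaural predictions. The crop positions, crop size, model signature, and averaging rule here are assumptions for illustration only.

```python
# Sketch: test-time crops over several image regions, with averaged predictions.
import torch

def sliding_window_crops(frame: torch.Tensor, crop: int = 224):
    # frame: (B, 3, H, W) with H, W >= crop; yields top-left, top-right, centre,
    # bottom-left, and bottom-right crops.
    _, _, h, w = frame.shape
    anchors = [(0, 0), (0, w - crop), ((h - crop) // 2, (w - crop) // 2),
               (h - crop, 0), (h - crop, w - crop)]
    for top, left in anchors:
        yield frame[:, :, top:top + crop, left:left + crop]

@torch.no_grad()
def predict_with_tdss(model, mono_spec: torch.Tensor, frame: torch.Tensor) -> torch.Tensor:
    # Hypothetical model(mono_spec, visual_crop) -> binaural prediction; average over crops.
    preds = [model(mono_spec, v) for v in sliding_window_crops(frame)]
    return torch.stack(preds, dim=0).mean(dim=0)
```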
CCStereo converts monaural audio to binaural audio using visual input. It addresses key limitations in spatial alignment and generalisation using:
- AVAD: Audio-Visual Adaptive De-normalisation for feature modulation.
- SCL: Spatial-aware Contrastive Learning for learning spatial correspondence.
- TDSS: Test-time Dynamic Scene Simulation for augmentation without added cost.
Install the dependencies:

```bash
pip install -r requirements.txt
```

For training and evaluation, we use the FAIR-Play and YouTube-360 datasets. The dataset structure is as follows:
```
dataset
├── fairplay
├── yt_clean
├── ...
```
Download the datasets from their official release pages and place them in the `dataset` directory.
Run the script corresponding to your dataset:

```bash
bash run_x-your-dataset-name-x.sh
```

The pre-trained checkpoint for FAIR-Play 5-split (split2) can be downloaded from here. A simple evaluation command is as follows:
```bash
python test.py --dataset fairplay --setup 5splits --hdf5FolderPath split2 --epochs 1 \
  --num_workers 1 --method_type m2b --data_vol_scaler 1 --audio_length 0.63 \
  --wandb_name dry_run --wandb_mode disabled --batch_size 8 \
  --model_config fairplay_base.json --multi_frames --dim_scale 1 --backbone 18
```

This project is licensed under the MIT License. Please see the LICENSE file for details.
```bibtex
@article{chen2025ccstereo,
  title={CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation},
  author={Chen, Yuanhong and Shimada, Kazuki and Simon, Christian and Ikemiya, Yukara and Shibuya, Takashi and Mitsufuji, Yuki},
  journal={arXiv preprint arXiv:2501.02786},
  year={2025}
}
```
We acknowledge the use of the FAIR-Play and YouTube-360 datasets and of SPADE. Special thanks to the authors of these works for their contributions to the field of audio-visual learning.