TTSizer automates the tedious process of creating high-quality Text-To-Speech datasets from raw media. Give it a video or audio file, and it returns time-aligned audio-text pairs for each target speaker.

- End-to-End Automation: From raw media files to cleaned, TTS-ready datasets
- Advanced Multi-Speaker Diarization: Handles complex audio with multiple speakers
- State-of-the-Art Models: MelBandRoformer, Gemini, CTC-Aligner, Wespeaker
- Quality Control: Automatic outlier detection and flagging
- Fully Configurable: Control every aspect via config.yaml

graph LR
A[Raw Media] --> B[Extract Audio]
B --> C[Vocal Separation]
C --> D[Normalize Volume]
D --> E[Speaker Diarization]
E --> F[Forced Alignment]
F --> G[Outlier Detection]
G --> H[ASR Validation]
H --> I[TTS Dataset]

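To make the outlier detection stage in the diagram more concrete, here is a minimal sketch of one common approach: compare each clip's speaker embedding with the centroid of all clips for that speaker, and flag clips whose cosine similarity falls below a threshold. This is an assumption about how such a stage can work, not TTSizer's actual implementation. The function names, the `min_similarity` threshold, and the toy 3-dimensional embeddings are all hypothetical; real Wespeaker embeddings are much larger.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def flag_outliers(
    embeddings: Dict[str, Sequence[float]],
    min_similarity: float = 0.5,  # hypothetical threshold, tune per dataset
) -> List[str]:
    """Return clip names whose embedding is far from the speaker centroid."""
    if not embeddings:
        return []
    vectors = list(embeddings.values())
    dim = len(vectors[0])
    # Centroid = element-wise mean of all clip embeddings for this speaker.
    centroid = [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]
    return sorted(
        name
        for name, vec in embeddings.items()
        if cosine_similarity(vec, centroid) < min_similarity
    )


# Toy data: three similar clips and one that points in a different direction.
clips = {
    "clip_001.wav": [0.9, 0.1, 0.0],
    "clip_002.wav": [0.8, 0.2, 0.1],
    "clip_003.wav": [0.85, 0.15, 0.05],
    "clip_004.wav": [-0.1, 0.2, 0.95],
}
flagged = flag_outliers(clips)
print(flagged)  # ['clip_004.wav']
```

Flagged clips would then be set aside for review rather than silently dropped, which matches the "detection and flagging" wording in the feature list.
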
git clone https://github.com/taresh18/TTSizer.git
cd TTSizer
pip install -r requirements.txt

- Download pre-trained models (see Setup Guide)
- Add GEMINI_API_KEY to the .env file in the project root:

GEMINI_API_KEY="YOUR_API_KEY_HERE"

Edit configs/config.yaml:
project_setup:
  video_input_base_dir: "/path/to/your/videos"
  output_base_dir: "/path/to/output"
  target_speaker_labels: ["Speaker1", "Speaker2"]

Run the pipeline:

python -m ttsizer.main

Setup Guide. The pipeline requires:
- Python 3.9+
- CUDA-enabled GPU (more than 4 GB VRAM)
- FFmpeg (must be installed and accessible in your system's PATH)
- Google Gemini API key
Download the following models:

- Vocal Extraction: Download kimmel_unwa_ft2_bleedless.ckpt from HuggingFace
- Speaker Embeddings: Download from wespeaker-voxceleb-resnet293-LM

Then update the model paths in config.yaml.
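Before the first run, it can help to verify the requirements above in one go. The following is a minimal standard-library sketch, not part of TTSizer: `preflight_check` is a hypothetical helper, and the weights path in the usage example is only an assumed location. It reads GEMINI_API_KEY from the process environment, so it assumes the .env file has already been loaded or the variable exported. It does not check the GPU, since that would need a CUDA-aware library.

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Iterable, List


def preflight_check(weight_paths: Iterable[str] = ()) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems: List[str] = []

    # Python 3.9+ is listed as a requirement.
    if sys.version_info < (3, 9):
        problems.append("Python 3.9+ is required")

    # FFmpeg must be reachable through PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")

    # The Gemini key must be visible to the process (for example after loading .env).
    if not os.environ.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY is not set")

    # Each downloaded model file should exist where config.yaml points to it.
    for path in weight_paths:
        if not Path(path).is_file():
            problems.append(f"model file not found: {path}")

    return problems


if __name__ == "__main__":
    # Assumed location of the vocal extraction checkpoint; adjust to your config.yaml.
    for problem in preflight_check(["weights/kimmel_unwa_ft2_bleedless.ckpt"]):
        print("WARNING:", problem)
```

Returning a list instead of raising on the first failure lets you see every missing piece at once.
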
Pipeline control and other advanced options: you can choose which stages of the pipeline run, which is useful for debugging or reprocessing:
pipeline_control:
  run_only_stage: "ctc_align"    # Run specific stage only
  start_stage: "llm_diarize"     # Start from specific stage
  end_stage: "outlier_detect"    # Stop at specific stage

The project is organized as follows:
TTSizer/
├── configs/
│   └── config.yaml        # Pipeline & model configurations
├── ttsizer/
│   ├── __init__.py
│   ├── main.py            # Main script to run the pipeline
│   ├── core/              # Core components of the pipeline
│   ├── models/            # Vocal removal models
│   └── utils/             # Utility programs
├── .env                   # For API keys
├── README.md              # This file
├── requirements.txt       # Python package dependencies
└── weights/               # For storing downloaded model weights (gitignored)

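The pipeline_control options described earlier (run_only_stage, start_stage, end_stage) can be read as selecting a slice of an ordered stage list. Here is a minimal sketch of that idea. It is not the code in ttsizer/main.py: `select_stages` is a hypothetical helper, only the stage names "llm_diarize", "ctc_align" and "outlier_detect" come from this README, and the rule that run_only_stage takes precedence over the other two options is an assumption.

```python
from typing import List, Optional, Sequence

# Only "llm_diarize", "ctc_align" and "outlier_detect" appear in the README;
# the remaining stage names are placeholders for illustration.
STAGES = [
    "extract_audio",
    "vocal_separation",
    "normalize",
    "llm_diarize",
    "ctc_align",
    "outlier_detect",
    "asr_validate",
]


def select_stages(
    stages: Sequence[str],
    run_only_stage: Optional[str] = None,
    start_stage: Optional[str] = None,
    end_stage: Optional[str] = None,
) -> List[str]:
    """Return the stages to run, in pipeline order."""
    for name in (run_only_stage, start_stage, end_stage):
        if name is not None and name not in stages:
            raise ValueError(f"unknown stage: {name}")

    # Assumption: a single-stage request overrides start/end.
    if run_only_stage is not None:
        return [run_only_stage]

    start = stages.index(start_stage) if start_stage is not None else 0
    end = stages.index(end_stage) if end_stage is not None else len(stages) - 1
    if start > end:
        raise ValueError("start_stage comes after end_stage")
    return list(stages[start : end + 1])


print(select_stages(STAGES, start_stage="llm_diarize", end_stage="outlier_detect"))
# ['llm_diarize', 'ctc_align', 'outlier_detect']
print(select_stages(STAGES, run_only_stage="ctc_align"))
# ['ctc_align']
```

Validating stage names up front means a typo in config.yaml fails immediately instead of silently running the whole pipeline.
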
This project is released under the Apache License 2.0. See the LICENSE file for details.
- Vocal Extraction: pcunwa/Kim-Mel-Band-Roformer-FT by Unwa
- Forced Alignment: ctc-forced-aligner by MahmoudAshraf97
- ASR: NVIDIA NeMo Parakeet
- Speaker Embeddings: Wespeaker/wespeaker-voxceleb-resnet293-LM from Wespeaker