
🎙️ Automatically transcribe audio/video into high-quality, speaker-specific Text-To-Speech datasets ✨

taresh18/TTSizer


TTSizer 🎙️✨

Transform Raw Audio/Video into Production-Ready TTS Datasets

License: Apache 2.0 | Python 3.9+

Watch the TTSizer demo to see it in action: TTSizer Demo Video. The demo showcases the AnimeVox Character TTS Corpus, a dataset created with TTSizer.

🎯 What It Does

TTSizer automates the tedious process of building high-quality Text-To-Speech datasets from raw media. Give it a video or audio file, and it returns aligned audio-text pairs (short audio clips with their transcripts), grouped by speaker.
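The first pipeline stage turns the input media into an audio track, which is typically done with FFmpeg (listed under the prerequisites). The sketch below is an assumption about what such a step could look like, not TTSizer's actual invocation: the mono downmix, 44.1 kHz sample rate, and 16-bit PCM output are illustrative choices, and `build_extract_cmd` / `extract_audio` are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def build_extract_cmd(src: Path, dst: Path, sample_rate: int = 44100) -> list[str]:
    """Build an ffmpeg command that drops any video stream and writes mono 16-bit WAV.

    Sample rate and channel count are illustrative, not TTSizer's settings.
    """
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it already exists
        "-i", str(src),           # input video or audio file
        "-vn",                    # ignore the video stream
        "-ac", "1",               # downmix to a single (mono) channel
        "-ar", str(sample_rate),  # resample to the chosen rate
        "-c:a", "pcm_s16le",      # 16-bit little-endian PCM, standard for WAV
        str(dst),
    ]


def extract_audio(src: Path, dst: Path) -> None:
    # check=True raises CalledProcessError if ffmpeg exits with a non-zero status.
    subprocess.run(build_extract_cmd(src, dst), check=True)
```

Keeping command construction separate from execution makes the flags easy to inspect or log before FFmpeg runs.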

✨ Key Features

🎯 End-to-End Automation: From raw media files to cleaned, TTS-ready datasets
🗣️ Advanced Multi-Speaker Diarization: Handles complex audio with multiple speakers
🤖 State-of-the-Art Models: MelBandRoformer (vocal separation), Gemini (LLM-based diarization), CTC-Aligner (forced alignment), Wespeaker (speaker embeddings)
🧐 Quality Control: Automatic outlier detection and flagging
⚙️ Fully Configurable: Control the pipeline via configs/config.yaml
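This README does not describe how outlier detection works internally. One common approach, assumed here for illustration, compares each clip's speaker embedding (for example, one produced by a WeSpeaker model) against the mean embedding of that speaker's clips and flags clips whose cosine similarity falls below a threshold. `flag_outliers` and its `threshold` default are hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def flag_outliers(embeddings: dict[str, list[float]], threshold: float = 0.5) -> list[str]:
    """Return IDs of clips whose embedding is far from the speaker's mean embedding.

    `threshold` is a hypothetical cosine-similarity cut-off, not a TTSizer default.
    """
    if not embeddings:
        return []
    vectors = list(embeddings.values())
    dim = len(vectors[0])
    # Mean embedding of all clips attributed to this speaker.
    centroid = [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]
    return sorted(
        clip_id
        for clip_id, vec in embeddings.items()
        if cosine(vec, centroid) < threshold
    )
```

For example, `flag_outliers({"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.95, 0.05], "d": [-1.0, 0.2]})` flags only `"d"`. A plain mean is pulled toward the outliers it is trying to catch, so a more robust variant might use a median or recompute the centroid after dropping flagged clips.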

📊 Pipeline Flow

```mermaid
graph LR
    A[🎬 Raw Media] --> B[🎤 Extract Audio]
    B --> C[🔇 Vocal Separation]
    C --> D[🔊 Normalize Volume]
    D --> E[✍️ Speaker Diarization]
    E --> F[⏱️ Forced Alignment]
    F --> G[🧐 Outlier Detection]
    G --> H[🚩 ASR Validation]
    H --> I[✅ TTS Dataset]
```
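The Normalize Volume stage is not documented in detail here. As a rough sketch, assuming simple peak normalization of mono float samples in [-1.0, 1.0] (TTSizer may use a different method, such as loudness-based normalization), the gain that moves the loudest peak to a target level is gain = 10^((target_dBFS - peak_dBFS) / 20). `peak_normalize` is a hypothetical name.

```python
import math
from typing import Sequence


def peak_normalize(samples: Sequence[float], target_dbfs: float = -1.0) -> list[float]:
    """Scale float samples so the loudest absolute peak sits at `target_dbfs`.

    Peak normalization is an assumed stand-in for TTSizer's normalizer.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence (or empty input): nothing to scale
    peak_dbfs = 20 * math.log10(peak)              # current peak level in dBFS
    gain = 10 ** ((target_dbfs - peak_dbfs) / 20)  # linear gain to reach the target
    return [s * gain for s in samples]
```

Applying one gain per clip keeps relative dynamics intact while making levels consistent across clips taken from different source files.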

🏃 Quick Start

1. Clone & Install

```bash
git clone https://github.com/taresh18/TTSizer.git
cd TTSizer
pip install -r requirements.txt
```

2. Set Up Models & API Key

  • Download the pre-trained models (see Setup & Installation below)
  • Add your GEMINI_API_KEY to a .env file in the project root:
```
GEMINI_API_KEY="YOUR_API_KEY_HERE"
```
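TTSizer's own key-loading code is not described here, and a library such as python-dotenv may be what it actually uses. As a quick, stdlib-only way to confirm the key is readable, the sketch below parses simple `KEY="value"` lines; `read_env_file` and `gemini_key_configured` are hypothetical helpers, and multi-line values or `export` prefixes are not handled.

```python
import os
from pathlib import Path


def read_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY="value" lines; blank lines and # comments are skipped.

    Minimal stdlib sketch, not the loader TTSizer uses.
    """
    values: dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one style of quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def gemini_key_configured(project_root: Path) -> bool:
    # Accept an already-exported environment variable first; otherwise check .env.
    if os.environ.get("GEMINI_API_KEY"):
        return True
    env_file = project_root / ".env"
    return env_file.is_file() and bool(read_env_file(env_file).get("GEMINI_API_KEY"))
```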

3. Configure

Edit configs/config.yaml:

```yaml
project_setup:
  video_input_base_dir: "/path/to/your/videos"
  output_base_dir: "/path/to/output"
  target_speaker_labels: ["Speaker1", "Speaker2"]
```
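Judging by its name, target_speaker_labels appears to limit which diarized speakers end up in the output dataset. Assuming diarization yields segments tagged with speaker labels, the filtering step could look like the sketch below. The `Segment` type and `keep_target_speakers` function are hypothetical and do not reflect TTSizer's data model.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical diarized segment: speaker label, times in seconds, transcript."""
    speaker: str
    start: float
    end: float
    text: str


def keep_target_speakers(segments: list[Segment], target_labels: list[str]) -> dict[str, list[Segment]]:
    """Group segments by speaker, keeping only labels listed in target_labels."""
    wanted = set(target_labels)
    # Pre-create one bucket per requested label so missing speakers show up as empty lists.
    grouped: dict[str, list[Segment]] = {label: [] for label in target_labels}
    for seg in segments:
        if seg.speaker in wanted:
            grouped[seg.speaker].append(seg)
    return grouped
```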

4. Run TTSizer!

```bash
python -m ttsizer.main
```

🛠️ Setup & Installation


Prerequisites

  • Python 3.9+
  • CUDA-enabled GPU (more than 4 GB VRAM)
  • FFmpeg (must be installed and available on your system's PATH)
  • Google Gemini API key
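Before a long run, a small preflight check can catch missing prerequisites early. The sketch below is a hypothetical helper, not part of TTSizer. It checks the Python version, whether ffmpeg is on PATH, and whether GEMINI_API_KEY is exported in the environment. It does not read .env, and it skips the GPU check because that would require a framework such as PyTorch.

```python
import os
import shutil
import sys


def preflight_problems() -> list[str]:
    """Return human-readable setup problems; an empty list means nothing was found."""
    problems: list[str] = []
    if sys.version_info < (3, 9):
        problems.append(f"Python 3.9+ required, found {sys.version.split()[0]}")
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    if not os.environ.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY is not set in the environment")
    return problems


if __name__ == "__main__":
    for problem in preflight_problems():
        print("WARNING:", problem)
```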

Manual Model Downloads

  1. Vocal extraction: Download the kimmel_unwa_ft2_bleedless.ckpt checkpoint from Hugging Face
  2. Speaker embeddings: Download the wespeaker-voxceleb-resnet293-LM model

Then update the model paths in configs/config.yaml to point to the downloaded files (for example, stored under weights/).

⚙️ Advanced Configuration


Selective Stage Execution

You can control which parts of the pipeline run, which is useful for debugging or for reprocessing only some stages:

```yaml
pipeline_control:
  run_only_stage: "ctc_align"      # Run only this stage
  start_stage: "llm_diarize"       # Start from this stage
  end_stage: "outlier_detect"      # Stop at this stage
```
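How these three keys interact is not spelled out here. The sketch below assumes that run_only_stage overrides the other two, and that start_stage and end_stage select an inclusive slice of the stage order shown in the pipeline flow. Only llm_diarize, ctc_align, and outlier_detect appear in the config above. The other stage identifiers and `select_stages` itself are hypothetical.

```python
from typing import Optional

# Order follows the pipeline flow diagram. Only "llm_diarize", "ctc_align" and
# "outlier_detect" come from the config example; the other names are hypothetical.
STAGES = [
    "extract_audio",
    "vocal_separation",
    "normalize_audio",
    "llm_diarize",
    "ctc_align",
    "outlier_detect",
    "asr_validate",
]


def select_stages(
    run_only_stage: Optional[str] = None,
    start_stage: Optional[str] = None,
    end_stage: Optional[str] = None,
) -> list[str]:
    """Return the ordered stages to execute (assumed semantics, not TTSizer's code)."""
    for name in (run_only_stage, start_stage, end_stage):
        if name is not None and name not in STAGES:
            raise ValueError(f"unknown stage: {name!r}")
    if run_only_stage is not None:
        return [run_only_stage]  # assumed to take precedence over start/end
    first = STAGES.index(start_stage) if start_stage else 0
    last = STAGES.index(end_stage) if end_stage else len(STAGES) - 1
    if first > last:
        raise ValueError("start_stage comes after end_stage")
    return STAGES[first:last + 1]  # inclusive of end_stage
```

Under these assumptions, `select_stages(start_stage="llm_diarize", end_stage="outlier_detect")` yields `["llm_diarize", "ctc_align", "outlier_detect"]`.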

🏗️ Project Structure

The project is organized as follows:

```
TTSizer/
├── configs/
│   └── config.yaml                 # Pipeline & model configurations
├── ttsizer/
│   ├── __init__.py
│   ├── main.py                     # Main script to run the pipeline
│   ├── core/                       # Core components of the pipeline
│   ├── models/                     # Vocal removal models
│   └── utils/                      # Utility programs
├── .env                            # For API keys
├── README.md                       # This file
├── requirements.txt                # Python package dependencies
└── weights/                        # For storing downloaded model weights (gitignored)
```

📜 License

This project is released under the Apache License 2.0. See the LICENSE file for details.

