CapSpeech: A Prompt-Guided Expressive Text-to-Speech Synthesizer

👋 Welcome to the 🧢CapSpeech live demo.

🔗 Learn more about this project on the 🧢CapSpeech Homepage.

📃 Licensed under CC BY-NC 4.0.

🔧 Usage Tips

Quick Start: Enter a style caption and a transcript to generate expressive speech just the way you want.
Model Tabs: Switch between model checkpoints by clicking the Model tabs; each checkpoint is tailored to a specific downstream use case.
Speed/Duration Settings: Adjust the speed and duration if the predicted speech pace sounds unnatural.
Flow Matching Settings: Modify the CFG scale and sampling steps to refine prompt alignment and improve generation quality.
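
To make the Flow Matching Settings tip more concrete, here is a minimal sketch of how a CFG scale and a number of sampling steps typically interact in a flow-matching sampler. It is not the CapSpeech implementation: the names `guided_velocity`, `euler_sample`, and `velocity_fn` are hypothetical, the velocity function is a toy stand-in for the model, and the guidance formula shown is one common convention that the demo may not follow.

```python
from typing import Callable, List

Vector = List[float]


def guided_velocity(v_cond: Vector, v_uncond: Vector, cfg_scale: float) -> Vector:
    """Classifier-free guidance (assumed convention).

    cfg_scale = 0 gives the unconditional velocity, 1 gives the
    conditional one, and values above 1 push further toward the prompt.
    """
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(
    x0: Vector,
    velocity_fn: Callable[[Vector, float, bool], Vector],
    cfg_scale: float,
    steps: int,
) -> Vector:
    """Integrate from t=0 to t=1 with `steps` fixed-size Euler steps.

    `velocity_fn(x, t, use_caption)` is a hypothetical stand-in for the
    model; it is called twice per step, with and without the caption.
    """
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_c = velocity_fn(x, t, True)    # conditioned on the style caption
        v_u = velocity_fn(x, t, False)   # caption dropped
        v = guided_velocity(v_c, v_u, cfg_scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy model: the caption contributes a constant velocity of 1.0.
def toy_velocity(x: Vector, t: float, use_caption: bool) -> Vector:
    return [1.0 if use_caption else 0.0 for _ in x]


sample = euler_sample([0.0], toy_velocity, cfg_scale=2.0, steps=10)
```

In this sketch, more sampling steps mean a smaller step size and more model calls, while a larger CFG scale weights the caption-conditioned direction more heavily.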

This checkpoint offers balanced performance and supports general style control.

Voice Style Caption

How should the speaker sound? Think timbre, pace, emotion, accent, etc.

Speech Transcript/Content

What should the speaker say?

Generated Audio

Speed

Scale the duration predicted by the model.

Range: 0.5 to 2

Audio Duration

Manually set an audio duration.

Range: 1 to 20

Fix Audio Duration

Enable to use a fixed audio duration.
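
The three duration controls above can be read as a simple precedence rule, sketched below. This is only an illustration under stated assumptions: `resolve_duration` is a hypothetical helper, durations are assumed to be in seconds, the slider limits 0.5 to 2 and 1 to 20 are taken from the page, and a higher speed is assumed to shorten the predicted duration by division. The demo itself may scale differently.

```python
from typing import Optional


def resolve_duration(
    predicted_s: float,
    speed: float = 1.0,
    fixed_s: Optional[float] = None,
) -> float:
    """Pick the target audio duration (hypothetical helper).

    predicted_s: duration predicted by the model (assumed seconds).
    speed:       Speed slider value, clamped to [0.5, 2].
    fixed_s:     manual duration when "Fix Audio Duration" is enabled,
                 clamped to [1, 20]; None when the box is unchecked.
    """
    if fixed_s is not None:
        # A fixed duration overrides both the prediction and the speed.
        return min(max(fixed_s, 1.0), 20.0)
    speed = min(max(speed, 0.5), 2.0)
    # Assumption: faster speech means proportionally shorter audio.
    return predicted_s / speed


default = resolve_duration(4.0)              # model prediction unchanged
faster = resolve_duration(4.0, speed=2.0)    # half the predicted duration
manual = resolve_duration(4.0, fixed_s=6.0)  # manual value wins
```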

Examples

Voice Style Caption	Speech Transcript/Content