CapSpeech: A Prompt-Guided Expressive Text-to-Speech Synthesizer

👋 Welcome to the 🧢CapSpeech live demo.

🔗 Learn more about this project on the 🧢CapSpeech Homepage.

📃 Licensed under CC BY-NC 4.0.

🔧 Usage Tips

  • Quick Start: Enter a style caption and a transcript to generate expressive speech just the way you want.
  • Model Tabs: Switch between model checkpoints by clicking the Model tabs; each checkpoint is tailored to a specific downstream use case.
  • Speed/Duration Settings: Adjust the speed and duration if the predicted speech pace sounds unnatural.
  • Flow Matching Settings: Modify the CFG scale and sampling steps to refine prompt alignment and improve generation quality.
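
To illustrate what the CFG scale and sampling steps control, here is a minimal sketch of classifier-free guidance combined with Euler integration of a flow-matching ODE. It is not the CapSpeech implementation: `toy_velocity`, `guided_velocity`, and `sample` are hypothetical names, and the toy velocity field stands in for a trained network. The sketch assumes the common formulation in which the guided velocity is the unconditional velocity plus the CFG scale times the difference between the conditional and unconditional velocities.

```python
import random
from typing import Callable, List, Optional

Vector = List[float]


def guided_velocity(v_cond: Vector, v_uncond: Vector, cfg_scale: float) -> Vector:
    """Blend conditional and unconditional velocities (classifier-free guidance)."""
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def toy_velocity(x: Vector, t: float, conditioned: bool) -> Vector:
    """Hypothetical stand-in for a trained model.

    Pulls every coordinate toward 1.0 when conditioned on the prompt
    and toward 0.0 when the prompt is dropped.
    """
    target = 1.0 if conditioned else 0.0
    return [(target - xi) / max(1.0 - t, 1e-3) for xi in x]


def sample(
    velocity_fn: Callable[[Vector, float, bool], Vector],
    dim: int,
    steps: int,
    cfg_scale: float,
    seed: Optional[int] = None,
) -> Vector:
    """Integrate from Gaussian noise (t=0) to a sample (t=1) with Euler steps."""
    rng = random.Random(seed)  # a fixed seed makes the starting noise reproducible
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = guided_velocity(
            velocity_fn(x, t, True),   # prompt-conditioned velocity
            velocity_fn(x, t, False),  # unconditional velocity
            cfg_scale,
        )
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    print(sample(toy_velocity, dim=3, steps=20, cfg_scale=2.0, seed=0))
```

In this sketch, more sampling steps mean a finer integration grid at proportionally higher cost, and a CFG scale above 1 pushes the result further in the direction of the conditioned prediction. A real model's velocity field is not linear like the toy one, so the practical effect of both settings is best judged by listening.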

This checkpoint offers balanced performance and supports general style control.


Enable to use a fixed audio duration.


Enable to use a fixed random seed for reproducibility.
