The TTS market has three tiers: benchmark champions that remain inaccessible (Seed TTS, Vocu), enterprise APIs at aggressive prices (Inworld, ElevenLabs), and open-source models you can run on a gaming laptop. Consumer apps like Speechify charge $139/year for features that open-source models now match for free. Kokoro, at 82M parameters, achieves 96× real-time on a basic cloud GPU, and Chatterbox's voice cloning was preferred over ElevenLabs 63.75% of the time in Resemble AI's blind test.
ELO scores and rankings change frequently. See the TTS Arena Leaderboard for live rankings.
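As a rough guide to reading the ELO gaps in the leaderboard below, the sketch here converts a rating difference into an expected head-to-head preference rate. It assumes the conventional logistic Elo formula with a 400-point scale; the leaderboard's exact rating method is not stated in this article, so treat the numbers as indicative only. The function names are illustrative.

```python
import math


def expected_preference(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Expected share of blind votes won by model A over model B.

    Assumes the standard logistic Elo model (400-point scale).
    """
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / scale))


def elo_gap_for_preference(p: float, scale: float = 400.0) -> float:
    """Inverse mapping: the rating gap implied by a preference rate p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return scale * math.log10(p / (1.0 - p))


if __name__ == "__main__":
    # Ratings taken from the leaderboard table.
    print(f"Vocu V3.0 vs Chatterbox: {expected_preference(1603, 1502):.3f}")
    print(f"Vocu V3.0 vs CastleFlow v1.0: {expected_preference(1603, 1593):.3f}")
    print(f"Gap implied by 63.75% preference: {elo_gap_for_preference(0.6375):.0f}")
```

Under this assumption, the 101-point gap between Vocu V3.0 and Chatterbox corresponds to an expected preference of roughly 64% for the higher-rated model, while the 10-point spread across the top three is close to a coin flip (about 51%).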
| Rank | Model | ELO | Price ($/1M chars) | Latency | Notes |
|---|---|---|---|---|---|
| #1 | Vocu V3.0 | 1603 | — | — | China-optimized, limited access |
| #2 | Inworld TTS MAX | 1594 | $10 | <250ms | 15 langs, free voice cloning |
| #3 | CastleFlow v1.0 | 1593 | — | — | Proprietary |
| #4 | Inworld TTS Mini | 1579 | $5 | <130ms | 15 langs, free voice cloning |
| #5 | Papla P1 | 1562 | — | — | API waitlist |
| #6 | Hume Octave | 1560 | Usage | 100-300ms | 16+ langs, best emotional expression |
| #7 | ElevenLabs Flash v2.5 | 1548 | $30-60 | 75-150ms | 32 langs. Multilingual v2: $60-120, 400-600ms, higher quality |
| #8 | MiniMax Speech-02-HD | 1543 | $50 | 400ms+ | 40+ langs, zero-shot cloning |
| — | Cartesia Sonic 3 | — | ~$13/hr | 40-90ms | 40+ langs, fastest latency |
| #15 | Chatterbox | 1502 | — | — | Open-source (MIT), best voice cloning |
| #16-17 | Kokoro v1.0 | ~1400 | — | — | Open-source (Apache 2.0), see below |
| #23 | StyleTTS 2 | 1369 | — | — | Open-source (MIT), fastest inference |
| #24 | CosyVoice 2.0 | 1358 | — | 150ms | Open-source (Apache 2.0), streaming |

| Model | Params | VRAM | Speed (× real-time or streaming latency) | Voice Clone | License |
|---|---|---|---|---|---|
| Kokoro | 82M | 2-3GB | 210× (4090), 90× (3090), 36× (T4/Colab), 5× (CPU), 1-2× (Mac) | 54 presets only | Apache 2.0 |
| StyleTTS 2 | ~200M | ~4GB | 95× (4090) | Fine-tuning needed | MIT |
| Chatterbox-Turbo | 350M | 4-8GB | 6× (GPU), 2× (4090 streaming) | 5-10s audio, excellent | MIT |
| Chatterbox | 500M | 8-16GB | ~2× (4090) | 5-10s audio, best emotion | MIT |
| CosyVoice 2.0 | 500M | ~4GB | 150ms streaming | 5-15s audio | Apache 2.0 |
| Qwen3-TTS-0.6B | 600M | ~4GB | 97ms streaming (GPU required) | 3s audio, 10 languages | Apache 2.0 |

All of the above run on gaming laptops (RTX 3060 or better). A MacBook Pro (M1-M4) can run Kokoro and Chatterbox via MPS.
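To make the "×" speed figures concrete, here is a small sketch that turns a real-time multiplier into wall-clock synthesis time. It reads "36×" as 36 seconds of audio generated per second of compute, and it assumes steady throughput with no model-load or I/O overhead, which real runs will not quite match. The speed values are copied from the Kokoro row of the table above; the function names are illustrative.

```python
# Real-time multipliers copied from the Kokoro row of the table above.
KOKORO_SPEEDUP = {
    "RTX 4090": 210,
    "RTX 3090": 90,
    "T4 (Colab)": 36,
    "CPU": 5,
}


def synthesis_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock seconds needed to generate `audio_seconds` of speech."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def format_duration(seconds: float) -> str:
    """Render a duration as 'Hh MMm SSs', rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


if __name__ == "__main__":
    audiobook_seconds = 10 * 3600  # a hypothetical 10-hour audiobook
    for hardware, speedup in KOKORO_SPEEDUP.items():
        elapsed = synthesis_seconds(audiobook_seconds, speedup)
        print(f"{hardware:>10}: {format_duration(elapsed)}")
```

On these figures, a 10-hour audiobook takes about 17 minutes on a T4 and about 2 hours on CPU, which is why the CPU-only path is fine for batch jobs but not for realtime use.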
| Use Case | Hosted Option | Local Option |
|---|---|---|
| Commute listening (audiobooks) | ElevenLabs Multilingual v2 | Chatterbox |
| Voice agent / realtime | Inworld Mini ($5/1M) | Kokoro |
| Voice cloning project | Inworld (free cloning) | Chatterbox (63.75% vs ElevenLabs) |
| Startup deployment | Inworld ($5-10/1M) | Qwen3-TTS (97ms latency) |
| Laptop tinkering | — | Kokoro (82M, runs on CPU) |
| Multilingual | MiniMax (40+ langs) | Qwen3-TTS (10 langs) |

Open-source TTS has reached commercial quality. Chatterbox beats ElevenLabs in blind tests (63.75% preference via Resemble AI's benchmark). Kokoro runs on a free Colab GPU at 36× real-time.
Bottom line: