Long-Form Narration (Audiobooks): Use Coqui XTTS-v2 for its superior prosody and expressive delivery, but its non-commercial license is a constraint and its output may need post-processing for audio artifacts. StyleTTS 2 is a commercially-licensed alternative with cleaner audio but a more robotic pace.
Real-Time Conversational AI: Kyutai TTS is state-of-the-art for its text-streaming architecture that minimizes perceived latency. Piper and Kokoro are excellent, mature options for their speed and ability to run on CPUs and embedded devices. The RealtimeTTS library is the ideal wrapper for orchestrating these engines in a low-latency pipeline.
High-Fidelity Voice Cloning: Use Chatterbox for its state-of-the-art cloning quality combined with unique, controllable emotional expression and a commercial-friendly MIT license. Use XTTS-v2 for any multilingual or cross-lingual cloning needs.
Resource-Constrained & Embedded: Piper is the standard for devices like the Raspberry Pi due to its C++ core and ONNX optimization. Kokoro is a strong alternative. For ultra-low-resource environments, the legacy tool Flite remains an option.
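The guidance above can be condensed into a small lookup helper. This is purely a sketch of this guide's categories: pick_tts_model and its keys are invented here, not part of any library.

```python
def pick_tts_model(use_case: str, commercial: bool = False) -> str:
    """Map a use case (and a commercial-license requirement) to the
    model recommended above. Illustrative only; the categories mirror
    this guide, not an installable package."""
    table = {
        # (use_case, needs_commercial_license) -> recommended model
        ("narration", False): "Coqui XTTS-v2",   # best prosody, but CPML is non-commercial
        ("narration", True): "StyleTTS 2",       # commercially licensed alternative
        ("conversational", False): "Kyutai TTS", # text-streaming, lowest perceived latency
        ("conversational", True): "Kyutai TTS",
        ("cloning", False): "Chatterbox",        # MIT license, emotion control
        ("cloning", True): "Chatterbox",
        ("embedded", False): "Piper",            # C++ core, ONNX, runs on a Raspberry Pi
        ("embedded", True): "Piper",
    }
    return table[(use_case, commercial)]

print(pick_tts_model("narration", commercial=True))  # StyleTTS 2
print(pick_tts_model("embedded"))                    # Piper
```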
Model Comparison
Speed & Low-Resource (Kokoro, Piper): Built for real-time, offline, and embedded use.
Kokoro: An 82M parameter model (Apache 2.0 license) achieving near-real-time speeds on a CPU (<0.3s processing). Its speed comes at the cost of less natural audio and no zero-shot cloning.
Piper: A fast, local-first system using the VITS architecture and ONNX Runtime. Its C++ core allows it to run on a Raspberry Pi. It offers pre-trained voices of varying sizes (5M-32M parameters) and a permissive MIT license.
StyleTTS 2: Uses style diffusion to generate natural speech without reference audio and can clone voices from short samples. While its raw audio is clean, its prosody can be perceived as robotic (TTS Arena ELO: 1164).
Coqui XTTS-v2: Praised for superior prosody and inflection, making it a favorite for narration. Its primary feature is zero-shot, cross-language voice cloning from a 6-second sample across 17 languages. Its output may have artifacts, and its CPML license forbids commercial use. (TTS Arena ELO: 1200).
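Speed figures like Kokoro's "<0.3s processing" are usually normalized as a real-time factor (RTF = synthesis time ÷ output audio duration; below 1.0 means synthesis outpaces playback). A minimal sketch, with invented timings:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means the model synthesizes audio faster than it plays back."""
    return synthesis_seconds / audio_seconds

# Hypothetical numbers in the spirit of Kokoro's CPU performance:
rtf = real_time_factor(synthesis_seconds=0.3, audio_seconds=2.0)
print(f"RTF = {rtf:.2f}")  # 0.15 -> comfortably real-time
```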
LLM-Based & Specialized Models: Use LLM backbones for new capabilities.
Chatterbox: A 0.5B parameter Llama-based model (MIT license) with an "emotion exaggeration" control, real-time synthesis (<200ms latency), and built-in PerTh watermarking.
Kyutai TTS: The first major open-source model with text-streaming, starting audio synthesis from an LLM's first tokens for a time-to-first-audio of ~220ms. It provides word-level timestamps for building interruptible agents.
IndexTTS: A GPT-style model (Apache 2.0) optimized for Chinese, featuring pinyin-based pronunciation correction and fine-grained pause control.
Other Models: DiTTo-TTS uses a simpler Diffusion Transformer architecture. Muyan-TTS is specialized for podcast narration.
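The payoff of Kyutai-style text-streaming is easiest to see with a toy timing model: a streaming engine waits only for the first few LLM tokens before synthesizing, while a batch engine waits for the full text. All timings below are invented for illustration; only the arithmetic matters.

```python
def time_to_first_audio(n_tokens: int, per_token_s: float,
                        stream: bool, chunk_tokens: int = 4) -> float:
    """Seconds until the first audio chunk can start playing.

    Non-streaming: wait for the full text before synthesizing anything.
    Streaming: begin synthesis as soon as the first chunk of tokens arrives.
    All timings are invented for illustration.
    """
    synth_chunk_s = 0.1  # assumed time to synthesize one chunk of audio
    if stream:
        return chunk_tokens * per_token_s + synth_chunk_s
    return n_tokens * per_token_s + synth_chunk_s

batch = time_to_first_audio(100, 0.03, stream=False)    # ~3.1 s
streamed = time_to_first_audio(100, 0.03, stream=True)  # ~0.22 s
print(f"batch: {batch:.2f}s, streamed: {streamed:.2f}s")
```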
Model      | Arch.       | Features / License                                         | Params
-----------|-------------|------------------------------------------------------------|----------
Kokoro     | StyleTTS 2  | Fast CPU/GPU inference. Apache 2.0.                        | 82M
Piper      | VITS/ONNX   | Fastest (Raspberry Pi), local-first. MIT.                  | 5-32M
StyleTTS 2 | Diffusion   | Clean audio, zero-shot cloning. MIT (code).                | ~200-300M
XTTS-v2    | VQ-VAE+GPT  | Best cross-lingual cloning (17 languages). Non-commercial. | ~500M
Chatterbox | Llama       | Emotion control, watermarking. MIT.                        | 0.5B
Kyutai     | Transformer | True text-streaming (~220ms), word timestamps.             | N/S
IndexTTS   | GPT-style   | Chinese pinyin control. Apache 2.0.                        | N/S
Tools and Integrations
RealtimeTTS: A Python library for low-latency applications that abstracts TTS engines (Coqui, Piper, OpenAI) into a single streaming interface with intelligent sentence splitting and engine fallback for reliability.
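The sentence-splitting idea can be sketched with the stdlib alone: buffer incremental LLM output and emit each sentence as soon as its final punctuation arrives, so synthesis can start before generation finishes. This is a simplified illustration of the pattern, not RealtimeTTS's actual implementation (which also handles abbreviations, numbers, and more).

```python
import re

def stream_sentences(text_chunks):
    """Yield complete sentences as soon as they are available from an
    incremental text stream (e.g. LLM output), so TTS can start early.
    Simplified sketch: splits only on ., !, ? followed by whitespace."""
    buffer = ""
    for chunk in text_chunks:
        buffer += chunk
        # Emit every complete sentence currently in the buffer.
        while (m := re.search(r"[.!?](\s+)", buffer)):
            end = m.start() + 1          # keep the punctuation mark
            yield buffer[:end].strip()
            buffer = buffer[m.end():]    # drop the consumed whitespace
    if buffer.strip():                   # flush any trailing fragment
        yield buffer.strip()

chunks = ["Hello wor", "ld. This is strea", "ming TTS! Last part"]
print(list(stream_sentences(chunks)))
# ['Hello world.', 'This is streaming TTS!', 'Last part']
```

Each yielded sentence would be handed straight to the TTS engine, which is what lets playback begin while the LLM is still generating.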
Native SDKs/CLIs: For direct access, tools include Coqui 🐸TTS (Python lib for XTTS), Piper CLI (scripting), and the community-built kokoro-tts CLI for long-form content generation.
ComfyUI: The node-based interface is widely used for synchronized audio-visual workflows. The typical pattern is LLM Text -> TTS Node -> Audio Output. Custom nodes exist for most popular models, including Piper (ComfyUI-TTS), Chatterbox (Chatterbox Nodes), Spark-TTS, F5-TTS, and more.