fish-speech brings state-of-the-art voice synthesis in-house: no API bill, no data leaving your servers.
Last quarter, a small podcast-automation startup I know quietly cut its ElevenLabs bill from $330/month to roughly $6 in AWS compute. The only thing that changed was one self-hosted Python service.
Setting
Text-to-speech used to be a solved commodity: robotic, flat, good enough for accessibility labels. Then neural voice models arrived and suddenly the bar moved: lifelike prosody, emotion, even voice cloning from a short audio sample. ElevenLabs, PlayHT, and Murf rode that wave into tidy SaaS businesses charging anywhere from $22 to $330 a month depending on the character quota you need.
The fishaudio team shipped fish-speech as a direct answer to that pricing ceiling. With nearly 30,000 GitHub stars and active commits through April 2025, it has become the most-watched open-source TTS project in its class. The model stack is modern: it borrows ideas from VALL-E (a Microsoft research architecture for zero-shot voice cloning), VITS (a fast, high-quality speech synthesis method), and vector-quantized autoencoders (VQGAN/VQVAE: think of them as compact audio codebooks that let the model reconstruct natural-sounding speech from compressed representations). The backbone is a LLaMA-style transformer, the same class of architecture that powers most large language models today.
The Story
Here is a concrete scenario. Imagine you are building a B2B SaaS that auto-generates voice-over summaries for CRM call logs. Your customers upload recordings; your app reads back AI-written summaries in a natural voice. At 10,000 characters per summary and 500 summaries a day, you hit ElevenLabs' Creator plan ceiling in hours and need the Business tier at $330/month, before you have a single paying customer.
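The quota arithmetic above is easy to verify. The 100,000-character monthly quota used below is an illustrative figure for a mid-tier plan, not a quoted limit:

```python
# Back-of-envelope volume check for the CRM voice-over scenario.
chars_per_summary = 10_000
summaries_per_day = 500

daily_chars = chars_per_summary * summaries_per_day
monthly_chars = daily_chars * 30

print(f"{daily_chars:,} characters/day")      # 5,000,000 characters/day
print(f"{monthly_chars:,} characters/month")  # 150,000,000 characters/month

# Against an assumed 100k-character monthly quota, a single day of
# traffic exceeds the entire month's allowance 50 times over.
quota = 100_000
print(daily_chars / quota)  # 50.0
```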
With fish-speech, the workflow looks like this: spin up a $6/month Hetzner VPS (or a slightly beefier GPU instance if you want real-time speeds), clone the repo, run the provided setup script, and call the local REST API the same way you would call ElevenLabs. The response is a WAV file, with zero per-character metering. For voice cloning, you provide a 5-10 second reference audio clip and the model adapts its output to match that speaker's timbre; no fine-tuning required, just inference.
You can try the quality before committing via the live demo at https://speech.fish.audio. In my own tests, the output on English and Japanese was genuinely competitive with mid-tier ElevenLabs voices. Mandarin Chinese sounded particularly strong, which makes sense given the team's origin.
What it covers: zero-shot voice cloning, multi-language synthesis (English, Japanese, Chinese, and more), streaming inference, and a REST API compatible enough that swapping it behind an existing integration takes an afternoon.
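Swapping the integration is mostly a matter of hiding the vendor behind one interface. The class and method names below are hypothetical, but the shape is what an afternoon's swap looks like: call sites stay identical, only the adapter changes:

```python
from typing import Protocol

class SpeechBackend(Protocol):
    """Anything that turns text into WAV bytes."""
    def synthesize(self, text: str) -> bytes: ...

class ElevenLabsBackend:
    """Stand-in for the hosted, per-character-metered API."""
    def __init__(self, api_key: str):
        self.api_key = api_key
    def synthesize(self, text: str) -> bytes:
        raise NotImplementedError("calls the hosted API; metered per character")

class FishSpeechBackend:
    """Points at the self-hosted server; endpoint path is an assumption."""
    def __init__(self, base_url: str = "http://localhost:8080"):
        self.base_url = base_url
    def synthesize(self, text: str) -> bytes:
        import json, urllib.request
        req = urllib.request.Request(
            f"{self.base_url}/v1/tts",
            data=json.dumps({"text": text}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.read()  # WAV bytes, no metering

def narrate(backend: SpeechBackend, summary: str) -> bytes:
    """Application code: identical regardless of which backend is injected."""
    return backend.synthesize(summary)
```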
What it doesn't (yet): fine-grained emotion tags, a polished management UI for voice libraries, or the guaranteed uptime SLA that enterprise buyers sometimes need. If you are building a consumer product where uptime and brand polish matter more than margin, ElevenLabs still wins on convenience. For everyone else doing volume work with budget constraints, the gap is closing fast.
Hosting reality check: a CPU-only server gives you usable but slower-than-real-time output. A single NVIDIA GPU (even a used RTX 3080-class card on a cloud instance) gets you to real-time or faster. Running it in Docker with the provided image is straightforward; expect a 30-minute setup if you are comfortable with the command line, longer if you are configuring GPU drivers from scratch.
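Whether a given host keeps up is easy to measure: divide synthesis wall time by the duration of the returned audio (the real-time factor). A minimal sketch using only the standard library, with the backend call left as a placeholder:

```python
import io
import time  # used in the usage sketch below
import wave

def real_time_factor(wav_bytes: bytes, synthesis_seconds: float) -> float:
    """Synthesis wall time divided by audio duration.
    Below 1.0 means faster than real time."""
    with wave.open(io.BytesIO(wav_bytes)) as w:
        audio_seconds = w.getnframes() / w.getframerate()
    return synthesis_seconds / audio_seconds

# Usage sketch against your own server (backend.synthesize is whatever
# function returns WAV bytes in your stack):
#   start = time.perf_counter()
#   wav = backend.synthesize("A representative paragraph of text.")
#   print(f"RTF = {real_time_factor(wav, time.perf_counter() - start):.2f}")
```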
The Insight
The interesting thing about fish-speech is not just the cost math, though the cost math is stark: $330/month SaaS versus $6-40/month in server costs depending on GPU tier. The more interesting thing is what self-hosting unlocks structurally. Your audio never leaves your infrastructure. For healthcare apps, legal transcription tools, or any B2B product handling sensitive conversations, that data-residency guarantee is worth more than the monthly savings. That is the real product moat hiding inside an open-source TTS repo.
And once you control the inference endpoint, you can also start productizing it: bundle it into your own service, white-label it, or niche it down to a specific language or use case that the big SaaS players treat as an afterthought.
If you build something on top of fish-speech (a niche voice API, a language-specific TTS tool, a voice layer for your vertical SaaS), that product is just as sellable as anything built on proprietary infrastructure. teum.io/sell exists exactly for that: listing and distributing the tools you assemble from the open-source ecosystem.
Summary
fish-speech is an open-source project that can replace paid TTS SaaS like ElevenLabs with a self-hosted setup. It can cut a roughly $330/month API bill down to $6-40 in server costs, and because audio data never leaves your infrastructure it is a good fit for B2B, healthcare, and legal work. Voice cloning, multi-language support, and a REST API are all included, so attaching it behind an existing service is not difficult. Products built this way can also be sold on teum.io/sell.
The data-residency guarantee hiding inside an open-source TTS repo is worth more than the monthly savings.