Merge branch 'rhasspy:master' into master

2026-06-02 01:47:02 +00:00 · 2023-07-27 22:04:26 -05:00
parent 4651a2e3c8 a9be4c0314
commit 314a860f51
24 changed files with 4513 additions and 218 deletions
@@ -1,4 +1,5 @@
 *
+!VERSION
 !Makefile
 !src/cpp/
 !local/en-us/lessac/low/en-us-lessac-low.onnx
@@ -39,7 +39,7 @@ RUN mkdir -p "lib/Linux-$(uname -m)/piper_phonemize" && \
        tar -C "lib/Linux-$(uname -m)/piper_phonemize" -xzvf -

 # Build piper binary
-COPY Makefile ./
+COPY VERSION Makefile ./
 COPY src/cpp/ ./src/cpp/
 RUN make

@@ -1,6 +1,8 @@
 .PHONY: piper clean

 LIB_DIR := lib/Linux-$(shell uname -m)
+VERSION := $(shell cat VERSION)
+DOCKER_PLATFORM ?= linux/amd64,linux/arm64,linux/arm/v7

 piper:
 	mkdir -p build
@@ -11,4 +13,4 @@ clean:
 	rm -rf build/ dist/

 docker:
-	docker buildx build . --platform 'linux/amd64,linux/arm64,linux/arm/v7' --output 'type=local,dest=dist'
+	docker buildx build . --platform '$(DOCKER_PLATFORM)' --output 'type=local,dest=dist'
@@ -5,7 +5,7 @@ Piper is used in a [variety of projects](#people-using-piper).

 ``` sh
 echo 'Welcome to the world of speech synthesis!' | \
-  ./piper --model en-us-blizzard_lessac-medium.onnx --output_file welcome.wav
+  ./piper --model en_US-lessac-medium.onnx --output_file welcome.wav
 ```

 [Listen to voice samples](https://rhasspy.github.io/piper-samples) and check out a [video tutorial by Thorsten Müller](https://youtu.be/rjq5eZoWWSo)
@@ -18,39 +18,47 @@ Voices are trained with [VITS](https://github.com/jaywalnut310/vits/) and export

 Our goal is to support Home Assistant and the [Year of Voice](https://www.home-assistant.io/blog/2022/12/20/year-of-voice/).

-[Download voices](https://github.com/rhasspy/piper/releases/tag/v0.0.2) for the supported languages:
+[Download voices](https://huggingface.co/rhasspy/piper-voices/tree/v1.0.0) for the supported languages:

-* Catalan (ca)
-* Danish (da)
-* German (de)
-* British English (en-gb)
-* U.S. English (en-us)
-* Spanish (es)
-* Finnish (fi)
-* French (fr)
-* Greek (el-gr)
-* Icelandic (is)
-* Italian (it)
-* Kazakh (kk)
-* Nepali (ne)
-* Dutch (nl)
-* Norwegian (no)
-* Polish (pl)
-* Brazilian Portuguese (pt-br)
-* Russian (ru)
-* Swedish (sv-se)
-* Ukrainian (uk)
-* Vietnamese (vi)
-* Chinese (zh-cn)
+* Catalan (ca_ES)
+* Danish (da_DK)
+* German (de_DE)
+* English (en_GB, en_US)
+* Spanish (es_ES, es_MX)
+* Finnish (fi_FI)
+* French (fr_FR)
+* Greek (el_GR)
+* Icelandic (is_IS)
+* Italian (it_IT)
+* Georgian (ka_GE)
+* Kazakh (kk_KZ)
+* Nepali (ne_NP)
+* Dutch (nl_BE, nl_NL)
+* Norwegian (no_NO)
+* Polish (pl_PL)
+* Portuguese (pt_BR)
+* Russian (ru_RU)
+* Swedish (sv_SE)
+* Swahili (sw_CD)
+* Ukrainian (uk_UA)
+* Vietnamese (vi_VN)
+* Chinese (zh_CN)
+
+You will need two files per voice:
+
+1. A `.onnx` model file, such as [`en_US-lessac-medium.onnx`](https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/lessac/medium/en_US-lessac-medium.onnx)
+2. A `.onnx.json` config file, such as [`en_US-lessac-medium.onnx.json`](https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json)
+
+The `MODEL_CARD` file for each voice contains important licensing information. Piper is intended for text-to-speech research and does not impose any additional restrictions on voice models. Some voices may have restrictive licenses, however, so please review them carefully!


 ## Installation

-Download a release:
+You can [run Piper with Python](#running-in-python) or download a binary release:

-* [amd64](https://github.com/rhasspy/piper/releases/download/v1.0.0/piper_amd64.tar.gz) (64-bit desktop Linux)
-* [arm64](https://github.com/rhasspy/piper/releases/download/v1.0.0/piper_arm64.tar.gz) (64-bit Raspberry Pi 4)
-* [armv7](https://github.com/rhasspy/piper/releases/download/v1.0.0/piper_armv7.tar.gz) (32-bit Raspberry Pi 3/4)
+* [amd64](https://github.com/rhasspy/piper/releases/download/v1.1.0/piper_amd64.tar.gz) (64-bit desktop Linux)
+* [arm64](https://github.com/rhasspy/piper/releases/download/v1.1.0/piper_arm64.tar.gz) (64-bit Raspberry Pi 4)
+* [armv7](https://github.com/rhasspy/piper/releases/download/v1.1.0/piper_armv7.tar.gz) (32-bit Raspberry Pi 3/4)

 If you want to build from source, see the [Makefile](Makefile) and [C++ source](src/cpp).
 You must download and extract [piper-phonemize](https://github.com/rhasspy/piper-phonemize) to `lib/Linux-$(uname -m)/piper_phonemize` before building.
@@ -66,7 +74,7 @@ For example:

 ``` sh
 echo 'Welcome to the world of speech synthesis!' | \
-  ./piper --model en-us-lessac-medium.onnx --output_file welcome.wav
+  ./piper --model en_US-lessac-medium.onnx --output_file welcome.wav
 ```

 For multi-speaker models, use `--speaker <number>` to change speakers (default: 0).
@@ -74,6 +82,32 @@ For multi-speaker models, use `--speaker <number>` to change speakers (default:
 See `piper --help` for more options.


+### JSON Input
+
+The `piper` executable can accept JSON input when using the `--json-input` flag. Each line of input must be a JSON object with a `text` field. For example:
+
+``` json
+{ "text": "First sentence to speak." }
+{ "text": "Second sentence to speak." }
+```
+
+Optional fields include:
+
+* `speaker` - string
+    * Name of the speaker to use from `speaker_id_map` in config (multi-speaker voices only)
+* `speaker_id` - number
+    * Id of speaker to use from 0 to number of speakers - 1 (multi-speaker voices only, overrides "speaker")
+* `output_file` - string
+    * Path to output WAV file
+
+The following example writes two sentences with different speakers to different files:
+
+``` json
+{ "text": "First speaker.", "speaker_id": 0, "output_file": "/tmp/speaker_0.wav" }
+{ "text": "Second speaker.", "speaker_id": 1, "output_file": "/tmp/speaker_1.wav" }
+```
+
+
 ## People using Piper

 Piper has been used in the following projects/papers:
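Note on the JSON input section added in the hunk above: every input line has to be one complete JSON object, so generating the lines programmatically avoids quoting mistakes. The following is a minimal sketch, assuming only the `text`, `speaker_id`, and `output_file` fields described in the README. The helper name `make_json_lines` and the `speaker_<n>.wav` naming are hypothetical and not part of Piper.

```python
import json
from typing import Iterable, Optional


def make_json_lines(
    sentences: Iterable[str],
    output_dir: Optional[str] = None,
) -> str:
    """Build newline-delimited JSON suitable for `piper --json-input`.

    Hypothetical helper, not part of Piper. Each sentence becomes one JSON
    object on its own line; speaker ids simply count up from 0 here.
    """
    lines = []
    for index, text in enumerate(sentences):
        item = {"text": text, "speaker_id": index}
        if output_dir is not None:
            # Assumed naming scheme, chosen to match the README example
            item["output_file"] = f"{output_dir}/speaker_{index}.wav"
        # ensure_ascii=False keeps non-ASCII text readable in the output
        lines.append(json.dumps(item, ensure_ascii=False))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_json_lines(["First speaker.", "Second speaker."], "/tmp"), end="")
```

The printed lines could then be piped into `piper` together with `--json-input` and a multi-speaker model; the speaker ids must be valid for that model.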
@@ -84,7 +118,7 @@ Piper has been used in the following projects/papers:
 * [Image Captioning for the Visually Impaired and Blind: A Recipe for Low-Resource Languages](https://www.techrxiv.org/articles/preprint/Image_Captioning_for_the_Visually_Impaired_and_Blind_A_Recipe_for_Low-Resource_Languages/22133894)
 * [Open Voice Operating System](https://github.com/OpenVoiceOS/ovos-tts-plugin-piper)
 * [JetsonGPT](https://github.com/shahizat/jetsonGPT)
-
+* [LocalAI](https://github.com/go-skynet/LocalAI)

 ## Training

@@ -97,14 +131,22 @@ Pretrained checkpoints are available on [Hugging Face](https://huggingface.co/da

 See [src/python_run](src/python_run)

-Run `scripts/setup.sh` to create a virtual environment and install the requirements. Then run:
+Install with `pip`:

 ``` sh
-echo 'Welcome to the world of speech synthesis!' | scripts/piper \
-  --model /path/to/voice.onnx \
+pip install piper-tts
+```
+
+and then run:
+
+``` sh
+echo 'Welcome to the world of speech synthesis!' | piper \
+  --model en_US-lessac-medium \
  --output_file welcome.wav
 ```

+This will automatically download [voice files](https://huggingface.co/rhasspy/piper-voices/tree/v1.0.0) the first time they're used. Use `--data-dir` and `--download-dir` to adjust where voices are found/downloaded.
+
 If you'd like to use a GPU, install the `onnxruntime-gpu` package:


@@ -112,5 +154,5 @@ If you'd like to use a GPU, install the `onnxruntime-gpu` package:
 .venv/bin/pip3 install onnxruntime-gpu
 ```

-and then run `scripts/piper` with the `--cuda` argument. You will need to have a functioning CUDA environment, such as what's available in [NVIDIA's PyTorch containers](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch).
+and then run `piper` with the `--cuda` argument. You will need to have a functioning CUDA environment, such as what's available in [NVIDIA's PyTorch containers](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch).

@@ -1 +1 @@
-1.0.0
+1.1.0
@@ -0,0 +1,5 @@
+{"phoneme_ids":[1,0,64,0,45,0,23,0,122,0,33,0,96,0,14,0,122,0,120,0,79,0,8,0,64,0,37,0,26,0,120,0,61,0,96,0,3,0,79,0,96,0,79,0,26,0,75,0,14,0,92,0,79,0,26,0,120,0,79,0,26,0,3,0,22,0,14,0,122,0,25,0,120,0,100,0,30,0,3,0,17,0,14,0,25,0,75,0,14,0,75,0,14,0,92,0,79,0,26,0,17,0,120,0,14,0,3,0,34,0,18,0,22,0,121,0,14,0,3,0,31,0,120,0,74,0,31,0,3,0,15,0,33,0,75,0,100,0,32,0,75,0,14,0,92,0,79,0,26,0,17,0,120,0,14,0,3,0,22,0,14,0,26,0,31,0,79,0,25,0,14,0,31,0,120,0,79,0,3,0,34,0,61,0,3,0,23,0,79,0,92,0,79,0,75,0,25,0,14,0,31,0,120,0,79,0,22,0,75,0,14,0,3,0,25,0,61,0,22,0,17,0,14,0,26,0,120,0,14,0,3,0,64,0,18,0,24,0,120,0,39,0,26,0,3,0,34,0,61,0,3,0,79,0,96,0,120,0,79,0,23,0,3,0,32,0,14,0,22,0,19,0,120,0,79,0,3,0,30,0,39,0,26,0,23,0,24,0,18,0,92,0,21,0,26,0,120,0,74,0,26,0,3,0,15,0,74,0,30,0,3,0,22,0,120,0,14,0,22,0,3,0,96,0,61,0,23,0,24,0,74,0,26,0,17,0,120,0,61,0,3,0,64,0,45,0,92,0,42,0,26,0,17,0,120,0,42,0,122,0,3,0,25,0,18,0,32,0,18,0,27,0,92,0,27,0,75,0,27,0,108,0,120,0,74,0,23,0,3,0,15,0,74,0,30,0,3,0,27,0,75,0,120,0,14,0,22,0,17,0,79,0,30,0,10,0,2],"phonemes":["ɟ","œ","k","ː","u","ʃ","a","ː","ˈ","ɯ",",","ɟ","y","n","ˈ","ɛ","ʃ"," ","ɯ","ʃ","ɯ","n","ɫ","a","ɾ","ɯ","n","ˈ","ɯ","n"," ","j","a","ː","m","ˈ","ʊ","r"," ","d","a","m","ɫ","a","ɫ","a","ɾ","ɯ","n","d","ˈ","a"," ","v","e","j","ˌ","a"," ","s","ˈ","ɪ","s"," ","b","u","ɫ","ʊ","t","ɫ","a","ɾ","ɯ","n","d","ˈ","a"," ","j","a","n","s","ɯ","m","a","s","ˈ","ɯ"," ","v","ɛ"," ","k","ɯ","ɾ","ɯ","ɫ","m","a","s","ˈ","ɯ","j","ɫ","a"," ","m","ɛ","j","d","a","n","ˈ","a"," ","ɟ","e","l","ˈ","æ","n"," ","v","ɛ"," ","ɯ","ʃ","ˈ","ɯ","k"," ","t","a","j","f","ˈ","ɯ"," ","r","æ","n","k","l","e","ɾ","i","n","ˈ","ɪ","n"," ","b","ɪ","r"," ","j","ˈ","a","j"," ","ʃ","ɛ","k","l","ɪ","n","d","ˈ","ɛ"," ","ɟ","œ","ɾ","ø","n","d","ˈ","ø","ː"," ","m","e","t","e","o","ɾ","o","ɫ","o","ʒ","ˈ","ɪ","k"," ","b","ɪ","r"," ","o","ɫ","ˈ","a","j","d","ɯ","r","."],"processed_text":"Gökkuşağı, güneş ışınlarının yağmur damlalarında veya sis bulutlarında 
yansıması ve kırılmasıyla meydana gelen ve ışık tayfı renklerinin bir yay şeklinde göründüğü meteorolojik bir olaydır.","text":"Gökkuşağı, güneş ışınlarının yağmur damlalarında veya sis bulutlarında yansıması ve kırılmasıyla meydana gelen ve ışık tayfı renklerinin bir yay şeklinde göründüğü meteorolojik bir olaydır."}
+{"phoneme_ids":[1,0,64,0,45,0,23,0,122,0,33,0,96,0,14,0,122,0,79,0,26,0,17,0,14,0,23,0,120,0,74,0,3,0,30,0,39,0,26,0,23,0,24,0,120,0,61,0,30,0,3,0,15,0,74,0,30,0,3,0,31,0,28,0,120,0,61,0,23,0,32,0,30,0,100,0,25,0,3,0,27,0,75,0,100,0,96,0,32,0,33,0,92,0,120,0,100,0,30,0,10,0,2],"phonemes":["ɟ","œ","k","ː","u","ʃ","a","ː","ɯ","n","d","a","k","ˈ","ɪ"," ","r","æ","n","k","l","ˈ","ɛ","r"," ","b","ɪ","r"," ","s","p","ˈ","ɛ","k","t","r","ʊ","m"," ","o","ɫ","ʊ","ʃ","t","u","ɾ","ˈ","ʊ","r","."],"processed_text":"Gökkuşağındaki renkler bir spektrum oluşturur.","text":"Gökkuşağındaki renkler bir spektrum oluşturur."}
+{"phoneme_ids":[1,0,32,0,21,0,28,0,120,0,74,0,23,0,3,0,15,0,74,0,30,0,3,0,64,0,45,0,23,0,122,0,33,0,96,0,14,0,122,0,120,0,79,0,3,0,23,0,120,0,79,0,30,0,25,0,79,0,38,0,79,0,8,0,32,0,33,0,92,0,100,0,26,0,17,0,108,0,120,0,100,0,8,0,31,0,14,0,92,0,120,0,79,0,8,0,22,0,18,0,96,0,120,0,74,0,24,0,8,0,25,0,14,0,122,0,34,0,120,0,74,0,8,0,75,0,14,0,17,0,108,0,21,0,34,0,120,0,61,0,30,0,32,0,3,0,34,0,61,0,3,0,25,0,120,0,54,0,30,0,3,0,30,0,39,0,26,0,23,0,24,0,18,0,92,0,74,0,26,0,17,0,120,0,39,0,26,0,3,0,25,0,61,0,22,0,17,0,14,0,26,0,120,0,14,0,3,0,64,0,18,0,24,0,120,0,39,0,26,0,3,0,15,0,74,0,30,0,3,0,30,0,120,0,39,0,26,0,23,0,3,0,31,0,79,0,92,0,14,0,31,0,79,0,26,0,120,0,14,0,3,0,31,0,14,0,20,0,120,0,74,0,28,0,3,0,15,0,74,0,30,0,3,0,34,0,18,0,22,0,121,0,14,0,3,0,17,0,14,0,20,0,120,0,14,0,3,0,19,0,120,0,14,0,38,0,75,0,14,0,3,0,14,0,22,0,26,0,120,0,79,0,3,0,25,0,61,0,30,0,23,0,61,0,38,0,24,0,120,0,74,0,3,0,14,0,30,0,23,0,75,0,14,0,30,0,17,0,120,0,14,0,26,0,3,0,21,0,15,0,14,0,92,0,18,0,32,0,122,0,120,0,74,0,30,0,10,0,2],"phonemes":["t","i","p","ˈ","ɪ","k"," ","b","ɪ","r"," ","ɟ","œ","k","ː","u","ʃ","a","ː","ˈ","ɯ"," ","k","ˈ","ɯ","r","m","ɯ","z","ɯ",",","t","u","ɾ","ʊ","n","d","ʒ","ˈ","ʊ",",","s","a","ɾ","ˈ","ɯ",",","j","e","ʃ","ˈ","ɪ","l",",","m","a","ː","v","ˈ","ɪ",",","ɫ","a","d","ʒ","i","v","ˈ","ɛ","r","t"," ","v","ɛ"," ","m","ˈ","ɔ","r"," ","r","æ","n","k","l","e","ɾ","ɪ","n","d","ˈ","æ","n"," ","m","ɛ","j","d","a","n","ˈ","a"," ","ɟ","e","l","ˈ","æ","n"," ","b","ɪ","r"," ","r","ˈ","æ","n","k"," ","s","ɯ","ɾ","a","s","ɯ","n","ˈ","a"," ","s","a","h","ˈ","ɪ","p"," ","b","ɪ","r"," ","v","e","j","ˌ","a"," ","d","a","h","ˈ","a"," ","f","ˈ","a","z","ɫ","a"," ","a","j","n","ˈ","ɯ"," ","m","ɛ","r","k","ɛ","z","l","ˈ","ɪ"," ","a","r","k","ɫ","a","r","d","ˈ","a","n"," ","i","b","a","ɾ","e","t","ː","ˈ","ɪ","r","."],"processed_text":"Tipik bir gökkuşağı kırmızı, turuncu, sarı, yeşil, mavi, lacivert ve mor renklerinden meydana gelen bir renk sırasına sahip bir veya daha fazla aynı merkezli 
arklardan ibarettir.","text":"Tipik bir gökkuşağı kırmızı, turuncu, sarı, yeşil, mavi, lacivert ve mor renklerinden meydana gelen bir renk sırasına sahip bir veya daha fazla aynı merkezli arklardan ibarettir."}
+{"phoneme_ids":[1,0,28,0,21,0,108,0,120,0,14,0,25,0,14,0,75,0,79,0,3,0,20,0,14,0,31,0,32,0,120,0,14,0,3,0,22,0,120,0,14,0,122,0,79,0,38,0,3,0,96,0,27,0,19,0,45,0,92,0,120,0,61,0,3,0,32,0,96,0,14,0,15,0,33,0,17,0,108,0,120,0,14,0,23,0,3,0,64,0,37,0,34,0,39,0,26,0,17,0,120,0,74,0,10,0,2],"phonemes":["p","i","ʒ","ˈ","a","m","a","ɫ","ɯ"," ","h","a","s","t","ˈ","a"," ","j","ˈ","a","ː","ɯ","z"," ","ʃ","o","f","œ","ɾ","ˈ","ɛ"," ","t","ʃ","a","b","u","d","ʒ","ˈ","a","k"," ","ɟ","y","v","æ","n","d","ˈ","ɪ","."],"processed_text":"Pijamalı hasta yağız şoföre çabucak güvendi.","text":"Pijamalı hasta yağız şoföre çabucak güvendi."}
+{"phoneme_ids":[1,0,120,0,45,0,23,0,42,0,38,0,3,0,14,0,108,0,120,0,14,0,26,0,3,0,20,0,120,0,14,0,28,0,31,0,61,0,3,0,17,0,42,0,96,0,32,0,120,0,42,0,3,0,22,0,120,0,14,0,34,0,30,0,100,0,25,0,8,0,27,0,17,0,108,0,14,0,122,0,120,0,79,0,3,0,19,0,120,0,39,0,24,0,32,0,96,0,3,0,64,0,21,0,15,0,120,0,74,0,10,0,2],"phonemes":["ˈ","œ","k","ø","z"," ","a","ʒ","ˈ","a","n"," ","h","ˈ","a","p","s","ɛ"," ","d","ø","ʃ","t","ˈ","ø"," ","j","ˈ","a","v","r","ʊ","m",",","o","d","ʒ","a","ː","ˈ","ɯ"," ","f","ˈ","æ","l","t","ʃ"," ","ɟ","i","b","ˈ","ɪ","."],"processed_text":"Öküz ajan hapse düştü yavrum, ocağı felç gibi.","text":"Öküz ajan hapse düştü yavrum, ocağı felç gibi."}
@@ -0,0 +1,5 @@
+Gökkuşağı, güneş ışınlarının yağmur damlalarında veya sis bulutlarında yansıması ve kırılmasıyla meydana gelen ve ışık tayfı renklerinin bir yay şeklinde göründüğü meteorolojik bir olaydır.
+Gökkuşağındaki renkler bir spektrum oluşturur.
+Tipik bir gökkuşağı kırmızı, turuncu, sarı, yeşil, mavi, lacivert ve mor renklerinden meydana gelen bir renk sırasına sahip bir veya daha fazla aynı merkezli arklardan ibarettir.
+Pijamalı hasta yağız şoföre çabucak güvendi.
+Öküz ajan hapse düştü yavrum, ocağı felç gibi.
@@ -5,6 +5,8 @@ project(piper C CXX)
 find_package(PkgConfig)
 pkg_check_modules(SPDLOG REQUIRED spdlog)

+file(READ "${CMAKE_CURRENT_LIST_DIR}/../../VERSION" piper_version)
+
 set(CMAKE_CXX_STANDARD 17)
 set(CMAKE_CXX_STANDARD_REQUIRED ON)

@@ -35,3 +37,5 @@ target_include_directories(piper PUBLIC

 target_compile_options(piper PUBLIC
                       ${SPDLOG_CFLAGS_OTHER})
+
+target_compile_definitions(piper PUBLIC _PIPER_VERSION=${piper_version})
@@ -70,9 +70,10 @@ struct RunConfig {

  // stdin input is lines of JSON instead of text with format:
  // {
-  //   "text": "...",             (required)
+  //   "text": str,               (required)
  //   "speaker_id": int,         (optional)
-  //   "output_file": "...",      (optional)
+  //   "speaker": str,            (optional)
+  //   "output_file": str,        (optional)
  // }
  bool jsonInput = false;
 };
@@ -194,7 +195,7 @@ int main(int argc, char *argv[]) {
  while (getline(cin, line)) {
    auto outputType = runConfig.outputType;
    auto speakerId = voice.synthesisConfig.speakerId;
-    std::optional<filesystem::path> outputPath;
+    std::optional<filesystem::path> maybeOutputPath = runConfig.outputPath;

    if (runConfig.jsonInput) {
      // Each line is a JSON object
@@ -206,7 +207,7 @@ int main(int argc, char *argv[]) {
      if (lineRoot.contains("output_file")) {
        // Override output WAV file path
        outputType = OUTPUT_FILE;
-        outputPath =
+        maybeOutputPath =
            filesystem::path(lineRoot["output_file"].get<std::string>());
      }

@@ -214,6 +215,16 @@ int main(int argc, char *argv[]) {
        // Override speaker id
        voice.synthesisConfig.speakerId =
            lineRoot["speaker_id"].get<piper::SpeakerId>();
+      } else if (lineRoot.contains("speaker")) {
+        // Resolve to id using speaker id map
+        auto speakerName = lineRoot["speaker"].get<std::string>();
+        if ((voice.modelConfig.speakerIdMap) &&
+            (voice.modelConfig.speakerIdMap->count(speakerName) > 0)) {
+          voice.synthesisConfig.speakerId =
+              (*voice.modelConfig.speakerIdMap)[speakerName];
+        } else {
+          spdlog::warn("No speaker named: {}", speakerName);
+        }
      }
    }
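Note on the hunk above: the precedence between `speaker_id` and `speaker` is easy to miss when reading the C++ branches. The sketch below restates it in Python for illustration only; it is not the code Piper runs. The function name `resolve_speaker_id` and the speaker names used in the usage line are hypothetical.

```python
import logging
from typing import Any, Mapping, Optional

_LOGGER = logging.getLogger(__name__)


def resolve_speaker_id(
    line_root: Mapping[str, Any],
    speaker_id_map: Optional[Mapping[str, int]],
    default_speaker_id: Optional[int] = None,
) -> Optional[int]:
    """Restate the precedence of the C++ hunk (illustrative only).

    1. "speaker_id" wins if present.
    2. Otherwise "speaker" is looked up in speaker_id_map.
    3. An unknown name logs a warning and the default is kept.
    """
    if "speaker_id" in line_root:
        return int(line_root["speaker_id"])

    if "speaker" in line_root:
        name = line_root["speaker"]
        if speaker_id_map and name in speaker_id_map:
            return speaker_id_map[name]
        _LOGGER.warning("No speaker named: %s", name)

    return default_speaker_id
```

For example, with a made-up map `{"alice": 3, "bob": 7}`, the input `{"speaker_id": 1, "speaker": "bob"}` resolves to 1, because the numeric id takes priority over the name.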

@@ -227,14 +238,20 @@ int main(int argc, char *argv[]) {
      // Generate path using timestamp
      stringstream outputName;
      outputName << timestamp << ".wav";
-      outputPath = runConfig.outputPath.value();
-      outputPath->append(outputName.str());
+      filesystem::path outputPath = runConfig.outputPath.value();
+      outputPath.append(outputName.str());

      // Output audio to automatically-named WAV file in a directory
-      ofstream audioFile(outputPath->string(), ios::binary);
+      ofstream audioFile(outputPath.string(), ios::binary);
      piper::textToWavFile(piperConfig, voice, line, audioFile, result);
-      cout << outputPath->string() << endl;
+      cout << outputPath.string() << endl;
    } else if (outputType == OUTPUT_FILE) {
+      if (!maybeOutputPath || maybeOutputPath->empty()) {
+        throw runtime_error("No output path provided");
+      }
+
+      filesystem::path outputPath = maybeOutputPath.value();
+
      if (!runConfig.jsonInput) {
        // Read all of standard input before synthesizing.
        // Otherwise, we would overwrite the output file for each line.
@@ -248,9 +265,9 @@ int main(int argc, char *argv[]) {
      }

      // Output audio to WAV file
-      ofstream audioFile(outputPath->string(), ios::binary);
+      ofstream audioFile(outputPath.string(), ios::binary);
      piper::textToWavFile(piperConfig, voice, line, audioFile, result);
-      cout << outputPath->string() << endl;
+      cout << outputPath.string() << endl;
    } else if (outputType == OUTPUT_STDOUT) {
      // Output WAV to stdout
      piper::textToWavFile(piperConfig, voice, line, cout, result);
@@ -368,7 +385,7 @@ void printUsage(char *argv[]) {
       << endl;
  cerr << "   --noise_w               NUM   phoneme width noise (default: 0.8)"
       << endl;
-  cerr << "   --silence_seconds       NUM   seconds of silence after each "
+  cerr << "   --sentence_silence      NUM   seconds of silence after each "
          "sentence (default: 0.2)"
       << endl;
  cerr << "   --espeak_data           DIR   path to espeak-ng data directory"
@@ -444,6 +461,9 @@ void parseArgs(int argc, char *argv[], RunConfig &runConfig) {
      runConfig.tashkeelModelPath = filesystem::path(argv[++i]);
    } else if (arg == "--json_input" || arg == "--json-input") {
      runConfig.jsonInput = true;
+    } else if (arg == "--version") {
+      std::cout << piper::getVersion() << std::endl;
+      exit(0);
    } else if (arg == "--debug") {
      // Set DEBUG logging
      spdlog::set_level(spdlog::level::debug);
@@ -16,11 +16,24 @@

 namespace piper {

+#ifdef _PIPER_VERSION
+// https://stackoverflow.com/questions/47346133/how-to-use-a-define-inside-a-format-string
+#define _STR(x) #x
+#define STR(x) _STR(x)
+const std::string VERSION = STR(_PIPER_VERSION);
+#else
+const std::string VERSION = "";
+#endif
+
 // Maximum value for 16-bit signed WAV sample
 const float MAX_WAV_VALUE = 32767.0f;

 const std::string instanceName{"piper"};

+std::string getVersion() {
+  return VERSION;
+}
+
 // True if the string is a single UTF-8 codepoint
 bool isSingleCodepoint(std::string s) {
  return utf8::distance(s.begin(), s.end()) == 1;
@@ -163,6 +176,19 @@ void parseModelConfig(json &configRoot, ModelConfig &modelConfig) {

  modelConfig.numSpeakers = configRoot["num_speakers"].get<SpeakerId>();

+  if (configRoot.contains("speaker_id_map")) {
+    if (!modelConfig.speakerIdMap) {
+      modelConfig.speakerIdMap.emplace();
+    }
+
+    auto speakerIdMapValue = configRoot["speaker_id_map"];
+    for (auto &speakerItem : speakerIdMapValue.items()) {
+      std::string speakerName = speakerItem.key();
+      (*modelConfig.speakerIdMap)[speakerName] =
+          speakerItem.value().get<SpeakerId>();
+    }
+  }
+
 } /* parseModelConfig */

 void initialize(PiperConfig &config) {
@@ -61,6 +61,9 @@ struct SynthesisConfig {

 struct ModelConfig {
  int numSpeakers;
+
+  // speaker name -> id
+  std::optional<std::map<std::string, SpeakerId>> speakerIdMap;
 };

 struct ModelSession {
@@ -86,6 +89,9 @@ struct Voice {
  ModelSession session;
 };

+// Get version of Piper
+std::string getVersion();
+
 // Must be called before using textTo* functions
 void initialize(PiperConfig &config);

@@ -0,0 +1,3 @@
+build/
+dist/
+*.egg-info/
@@ -0,0 +1,2 @@
+include requirements.txt
+include piper/voices.json
@@ -1,147 +1,5 @@
-import io
-import json
-import logging
-import wave
-from dataclasses import dataclass
-from pathlib import Path
-from typing import List, Mapping, Optional, Sequence, Union
+from .voice import PiperVoice

-import numpy as np
-import onnxruntime
-from espeak_phonemizer import Phonemizer
-
-_LOGGER = logging.getLogger(__name__)
-
-_BOS = "^"
-_EOS = "$"
-_PAD = "_"
-
-
-@dataclass
-class PiperConfig:
-    num_symbols: int
-    num_speakers: int
-    sample_rate: int
-    espeak_voice: str
-    length_scale: float
-    noise_scale: float
-    noise_w: float
-    phoneme_id_map: Mapping[str, Sequence[int]]
-
-
-class Piper:
-    def __init__(
-        self,
-        model_path: Union[str, Path],
-        config_path: Optional[Union[str, Path]] = None,
-        use_cuda: bool = False,
-    ):
-        if config_path is None:
-            config_path = f"{model_path}.json"
-
-        self.config = load_config(config_path)
-        self.phonemizer = Phonemizer(self.config.espeak_voice)
-        self.model = onnxruntime.InferenceSession(
-            str(model_path),
-            sess_options=onnxruntime.SessionOptions(),
-            providers=["CPUExecutionProvider"]
-            if not use_cuda
-            else ["CUDAExecutionProvider"],
-        )
-
-    def synthesize(
-        self,
-        text: str,
-        speaker_id: Optional[int] = None,
-        length_scale: Optional[float] = None,
-        noise_scale: Optional[float] = None,
-        noise_w: Optional[float] = None,
-    ) -> bytes:
-        """Synthesize WAV audio from text."""
-        if length_scale is None:
-            length_scale = self.config.length_scale
-
-        if noise_scale is None:
-            noise_scale = self.config.noise_scale
-
-        if noise_w is None:
-            noise_w = self.config.noise_w
-
-        phonemes_str = self.phonemizer.phonemize(text)
-        phonemes = [_BOS] + list(phonemes_str)
-        phoneme_ids: List[int] = []
-
-        for phoneme in phonemes:
-            if phoneme in self.config.phoneme_id_map:
-                phoneme_ids.extend(self.config.phoneme_id_map[phoneme])
-                phoneme_ids.extend(self.config.phoneme_id_map[_PAD])
-            else:
-                _LOGGER.warning("No id for phoneme: %s", phoneme)
-
-        phoneme_ids.extend(self.config.phoneme_id_map[_EOS])
-
-        phoneme_ids_array = np.expand_dims(np.array(phoneme_ids, dtype=np.int64), 0)
-        phoneme_ids_lengths = np.array([phoneme_ids_array.shape[1]], dtype=np.int64)
-        scales = np.array(
-            [noise_scale, length_scale, noise_w],
-            dtype=np.float32,
-        )
-
-        if (self.config.num_speakers > 1) and (speaker_id is None):
-            # Default speaker
-            speaker_id = 0
-
-        sid = None
-
-        if speaker_id is not None:
-            sid = np.array([speaker_id], dtype=np.int64)
-
-        # Synthesize through Onnx
-        audio = self.model.run(
-            None,
-            {
-                "input": phoneme_ids_array,
-                "input_lengths": phoneme_ids_lengths,
-                "scales": scales,
-                "sid": sid,
-            },
-        )[0].squeeze((0, 1))
-        audio = audio_float_to_int16(audio.squeeze())
-
-        # Convert to WAV
-        with io.BytesIO() as wav_io:
-            wav_file: wave.Wave_write = wave.open(wav_io, "wb")
-            with wav_file:
-                wav_file.setframerate(self.config.sample_rate)
-                wav_file.setsampwidth(2)
-                wav_file.setnchannels(1)
-                wav_file.writeframes(audio.tobytes())
-
-            return wav_io.getvalue()
-
-
-def load_config(config_path: Union[str, Path]) -> PiperConfig:
-    with open(config_path, "r", encoding="utf-8") as config_file:
-        config_dict = json.load(config_file)
-        inference = config_dict.get("inference", {})
-
-        return PiperConfig(
-            num_symbols=config_dict["num_symbols"],
-            num_speakers=config_dict["num_speakers"],
-            sample_rate=config_dict["audio"]["sample_rate"],
-            espeak_voice=config_dict["espeak"]["voice"],
-            noise_scale=inference.get("noise_scale", 0.667),
-            length_scale=inference.get("length_scale", 1.0),
-            noise_w=inference.get("noise_w", 0.8),
-            phoneme_id_map=config_dict["phoneme_id_map"],
-        )
-
-
-def audio_float_to_int16(
-    audio: np.ndarray, max_wav_value: float = 32767.0
-) -> np.ndarray:
-    """Normalize audio and convert to int16 range"""
-    audio_norm = audio * (max_wav_value / max(0.01, np.max(np.abs(audio))))
-    audio_norm = np.clip(audio_norm, -max_wav_value, max_wav_value)
-    audio_norm = audio_norm.astype("int16")
-    return audio_norm
+__all__ = [
+    "PiperVoice",
+]
@@ -2,10 +2,12 @@ import argparse
 import logging
 import sys
 import time
-from functools import partial
+import wave
 from pathlib import Path
+from typing import Any, Dict

-from . import Piper
+from . import PiperVoice
+from .download import ensure_voice_exists, find_voice, get_voices

 _FILE = Path(__file__)
 _DIR = _FILE.parent
@@ -17,33 +19,108 @@ def main() -> None:
    parser.add_argument("-m", "--model", required=True, help="Path to Onnx model file")
    parser.add_argument("-c", "--config", help="Path to model config file")
    parser.add_argument(
-        "-f", "--output_file", help="Path to output WAV file (default: stdout)"
+        "-f",
+        "--output-file",
+        "--output_file",
+        help="Path to output WAV file (default: stdout)",
    )
    parser.add_argument(
-        "-d", "--output_dir", help="Path to output directory (default: cwd)"
+        "-d",
+        "--output-dir",
+        "--output_dir",
+        help="Path to output directory (default: cwd)",
    )
+    parser.add_argument(
+        "--output-raw",
+        "--output_raw",
+        action="store_true",
+        help="Stream raw audio to stdout",
+    )
+    #
    parser.add_argument("-s", "--speaker", type=int, help="Id of speaker (default: 0)")
-    parser.add_argument("--noise-scale", type=float, help="Generator noise")
-    parser.add_argument("--length-scale", type=float, help="Phoneme length")
-    parser.add_argument("--noise-w", type=float, help="Phoneme width noise")
+    parser.add_argument(
+        "--length-scale", "--length_scale", type=float, help="Phoneme length"
+    )
+    parser.add_argument(
+        "--noise-scale", "--noise_scale", type=float, help="Generator noise"
+    )
+    parser.add_argument(
+        "--noise-w", "--noise_w", type=float, help="Phoneme width noise"
+    )
+    #
    parser.add_argument("--cuda", action="store_true", help="Use GPU")
    #
+    parser.add_argument(
+        "--sentence-silence",
+        "--sentence_silence",
+        type=float,
+        default=0.0,
+        help="Seconds of silence after each sentence",
+    )
+    #
+    parser.add_argument(
+        "--data-dir",
+        "--data_dir",
+        action="append",
+        default=[str(Path.cwd())],
+        help="Data directory to check for downloaded models (default: current directory)",
+    )
+    parser.add_argument(
+        "--download-dir",
+        "--download_dir",
+        help="Directory to download voices into (default: first data dir)",
+    )
+    #
    parser.add_argument(
        "--debug", action="store_true", help="Print DEBUG messages to console"
    )
    args = parser.parse_args()
    logging.basicConfig(level=logging.DEBUG if args.debug else logging.INFO)
+    _LOGGER.debug(args)

-    voice = Piper(args.model, config_path=args.config, use_cuda=args.cuda)
-    synthesize = partial(
-        voice.synthesize,
-        speaker_id=args.speaker,
-        length_scale=args.length_scale,
-        noise_scale=args.noise_scale,
-        noise_w=args.noise_w,
-    )
+    if not args.download_dir:
+        # Download to first data directory by default
+        args.download_dir = args.data_dir[0]

-    if args.output_dir:
+    # Download voice if file doesn't exist
+    model_path = Path(args.model)
+    if not model_path.exists():
+        # Load voice info
+        voices_info = get_voices()
+
+        # Resolve aliases for backwards compatibility with old voice names
+        aliases_info: Dict[str, Any] = {}
+        for voice_info in voices_info.values():
+            for voice_alias in voice_info.get("aliases", []):
+                aliases_info[voice_alias] = {"_is_alias": True, **voice_info}
+
+        voices_info.update(aliases_info)
+        ensure_voice_exists(args.model, args.data_dir, args.download_dir, voices_info)
+        args.model, args.config = find_voice(args.model, args.data_dir)
+
+    # Load voice
+    voice = PiperVoice.load(args.model, config_path=args.config, use_cuda=args.cuda)
+    synthesize_args = {
+        "speaker_id": args.speaker,
+        "length_scale": args.length_scale,
+        "noise_scale": args.noise_scale,
+        "noise_w": args.noise_w,
+        "sentence_silence": args.sentence_silence,
+    }
+
+    if args.output_raw:
+        # Read line-by-line
+        for line in sys.stdin:
+            line = line.strip()
+            if not line:
+                continue
+
+            # Write raw audio to stdout as it's produced
+            audio_stream = voice.synthesize_stream_raw(line, **synthesize_args)
+            for audio_bytes in audio_stream:
+                sys.stdout.buffer.write(audio_bytes)
+                sys.stdout.buffer.flush()
+    elif args.output_dir:
        output_dir = Path(args.output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)
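Note on the hunk above: two details may be worth checking in isolation. First, `--data-dir` uses `action="append"` with a non-empty default, so user-supplied directories are added after the current directory rather than replacing it, and the `--download-dir` fallback is therefore the current directory. Second, aliases are expanded into extra entries that copy the voice info and add an `_is_alias` marker. The sketch below reproduces only those two pieces. The function names `parse_dirs` and `add_aliases` are hypothetical, and the shape of the voice info dictionary is assumed from how the diff accesses it.

```python
import argparse
from pathlib import Path
from typing import Any, Dict, List


def parse_dirs(argv: List[str]) -> argparse.Namespace:
    """Reproduce only the --data-dir / --download-dir handling."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--data-dir", "--data_dir", action="append", default=[str(Path.cwd())]
    )
    parser.add_argument("--download-dir", "--download_dir")
    args = parser.parse_args(argv)
    if not args.download_dir:
        # Same fallback as the diff: first data directory
        args.download_dir = args.data_dir[0]
    return args


def add_aliases(voices_info: Dict[str, Any]) -> Dict[str, Any]:
    """Return voices_info extended with one entry per alias."""
    aliases_info: Dict[str, Any] = {}
    for voice_info in voices_info.values():
        for voice_alias in voice_info.get("aliases", []):
            aliases_info[voice_alias] = {"_is_alias": True, **voice_info}
    return {**voices_info, **aliases_info}


if __name__ == "__main__":
    args = parse_dirs(["--data-dir", "/voices"])
    print(args.data_dir, args.download_dir)
```

If downloads are meant to land in a custom data directory, passing `--download-dir` explicitly appears to be the reliable way to get that behavior.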

@@ -53,21 +130,23 @@ def main() -> None:
            if not line:
                continue

-            wav_bytes = synthesize(line)
            wav_path = output_dir / f"{time.monotonic_ns()}.wav"
-            wav_path.write_bytes(wav_bytes)
+            with wave.open(str(wav_path), "wb") as wav_file:
+                voice.synthesize(line, wav_file, **synthesize_args)
+
            _LOGGER.info("Wrote %s", wav_path)
    else:
        # Read entire input
        text = sys.stdin.read()
-        wav_bytes = synthesize(text)

        if (not args.output_file) or (args.output_file == "-"):
            # Write to stdout
-            sys.stdout.buffer.write(wav_bytes)
+            with wave.open(sys.stdout.buffer, "wb") as wav_file:
+                voice.synthesize(text, wav_file, **synthesize_args)
        else:
-            with open(args.output_file, "wb") as output_file:
-                output_file.write(wav_bytes)
+            # Write to file
+            with wave.open(args.output_file, "wb") as wav_file:
+                voice.synthesize(text, wav_file, **synthesize_args)


 if __name__ == "__main__":
@@ -0,0 +1,53 @@
+"""Piper configuration"""
+from dataclasses import dataclass
+from enum import Enum
+from typing import Any, Dict, Mapping, Sequence
+
+
+class PhonemeType(str, Enum):
+    ESPEAK = "espeak"
+    TEXT = "text"
+
+
+@dataclass
+class PiperConfig:
+    """Piper configuration"""
+
+    num_symbols: int
+    """Number of phonemes"""
+
+    num_speakers: int
+    """Number of speakers"""
+
+    sample_rate: int
+    """Sample rate of output audio"""
+
+    espeak_voice: str
+    """Name of espeak-ng voice or alphabet"""
+
+    length_scale: float
+    noise_scale: float
+    noise_w: float
+
+    phoneme_id_map: Mapping[str, Sequence[int]]
+    """Phoneme -> [id,]"""
+
+    phoneme_type: PhonemeType
+    """espeak or text"""
+
+    @staticmethod
+    def from_dict(config: Dict[str, Any]) -> "PiperConfig":
+        inference = config.get("inference", {})
+
+        return PiperConfig(
+            num_symbols=config["num_symbols"],
+            num_speakers=config["num_speakers"],
+            sample_rate=config["audio"]["sample_rate"],
+            noise_scale=inference.get("noise_scale", 0.667),
+            length_scale=inference.get("length_scale", 1.0),
+            noise_w=inference.get("noise_w", 0.8),
+            #
+            espeak_voice=config["espeak"]["voice"],
+            phoneme_id_map=config["phoneme_id_map"],
+            phoneme_type=PhonemeType(config.get("phoneme_type", PhonemeType.ESPEAK)),
+        )
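
A minimal sketch of how `from_dict` falls back to defaults when a voice config has no `inference` section. `MiniConfig` and the sample dictionaries are hypothetical stand-ins, not part of Piper; only the three default values (0.667, 1.0, 0.8) are taken from the code above.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class MiniConfig:
    """Hypothetical, reduced stand-in for PiperConfig (illustration only)."""

    sample_rate: int
    noise_scale: float
    length_scale: float
    noise_w: float

    @staticmethod
    def from_dict(config: Dict[str, Any]) -> "MiniConfig":
        # A missing "inference" section yields an empty dict, so every
        # lookup below falls through to its default value.
        inference = config.get("inference", {})
        return MiniConfig(
            sample_rate=config["audio"]["sample_rate"],
            noise_scale=inference.get("noise_scale", 0.667),
            length_scale=inference.get("length_scale", 1.0),
            noise_w=inference.get("noise_w", 0.8),
        )


# Made-up config values, for illustration only
defaults = MiniConfig.from_dict({"audio": {"sample_rate": 22050}})
slower = MiniConfig.from_dict(
    {"audio": {"sample_rate": 22050}, "inference": {"length_scale": 1.5}}
)
print(defaults)
print(slower)
```

Note that keys such as `num_symbols` or `audio` are read with plain indexing in the original, so a config missing them raises `KeyError` rather than falling back to a default.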
@@ -0,0 +1,5 @@
+"""Constants"""
+
+PAD = "_"  # padding (0)
+BOS = "^"  # beginning of sentence
+EOS = "$"  # end of sentence
@@ -0,0 +1,120 @@
+"""Utility for downloading Piper voices."""
+import json
+import logging
+import shutil
+from pathlib import Path
+from typing import Any, Dict, Iterable, Set, Tuple, Union
+from urllib.request import urlopen
+
+from .file_hash import get_file_hash
+
+URL_FORMAT = "https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/{file}"
+
+_DIR = Path(__file__).parent
+_LOGGER = logging.getLogger(__name__)
+
+_SKIP_FILES = {"MODEL_CARD"}
+
+
+class VoiceNotFoundError(Exception):
+    pass
+
+
+def get_voices() -> Dict[str, Any]:
+    """Loads available voices from embedded JSON file."""
+    with open(_DIR / "voices.json", "r", encoding="utf-8") as voices_file:
+        return json.load(voices_file)
+
+
+def ensure_voice_exists(
+    name: str,
+    data_dirs: Iterable[Union[str, Path]],
+    download_dir: Union[str, Path],
+    voices_info: Dict[str, Any],
+):
+    assert data_dirs, "No data dirs"
+    if name not in voices_info:
+        raise VoiceNotFoundError(name)
+
+    voice_info = voices_info[name]
+    voice_files = voice_info["files"]
+    files_to_download: Set[str] = set()
+
+    for data_dir in data_dirs:
+        data_dir = Path(data_dir)
+
+        # Check sizes/hashes
+        for file_path, file_info in voice_files.items():
+            if file_path in files_to_download:
+                # Already planning to download
+                continue
+
+            file_name = Path(file_path).name
+            if file_name in _SKIP_FILES:
+                continue
+
+            data_file_path = data_dir / file_name
+            _LOGGER.debug("Checking %s", data_file_path)
+            if not data_file_path.exists():
+                _LOGGER.debug("Missing %s", data_file_path)
+                files_to_download.add(file_path)
+                continue
+
+            expected_size = file_info["size_bytes"]
+            actual_size = data_file_path.stat().st_size
+            if expected_size != actual_size:
+                _LOGGER.warning(
+                    "Wrong size (expected=%s, actual=%s) for %s",
+                    expected_size,
+                    actual_size,
+                    data_file_path,
+                )
+                files_to_download.add(file_path)
+                continue
+
+            expected_hash = file_info["md5_digest"]
+            actual_hash = get_file_hash(data_file_path)
+            if expected_hash != actual_hash:
+                _LOGGER.warning(
+                    "Wrong hash (expected=%s, actual=%s) for %s",
+                    expected_hash,
+                    actual_hash,
+                    data_file_path,
+                )
+                files_to_download.add(file_path)
+                continue
+
+    if (not voice_files) and (not files_to_download):
+        raise ValueError(f"Unable to find or download voice: {name}")
+
+    # Download missing files
+    download_dir = Path(download_dir)
+
+    for file_path in files_to_download:
+        file_name = Path(file_path).name
+        if file_name in _SKIP_FILES:
+            continue
+
+        file_url = URL_FORMAT.format(file=file_path)
+        download_file_path = download_dir / file_name
+        download_file_path.parent.mkdir(parents=True, exist_ok=True)
+
+        _LOGGER.debug("Downloading %s to %s", file_url, download_file_path)
+        with urlopen(file_url) as response, open(
+            download_file_path, "wb"
+        ) as download_file:
+            shutil.copyfileobj(response, download_file)
+
+        _LOGGER.info("Downloaded %s (%s)", download_file_path, file_url)
+
+
+def find_voice(name: str, data_dirs: Iterable[Union[str, Path]]) -> Tuple[Path, Path]:
+    for data_dir in data_dirs:
+        data_dir = Path(data_dir)
+        onnx_path = data_dir / f"{name}.onnx"
+        config_path = data_dir / f"{name}.onnx.json"
+
+        if onnx_path.exists() and config_path.exists():
+            return onnx_path, config_path
+
+    raise ValueError(f"Missing files for voice {name}")
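
The check order in `ensure_voice_exists` (existence, then size, then hash) can be sketched in isolation as follows. `needs_download`, the file names, and the manifest entries are hypothetical; the `size_bytes` and `md5_digest` keys mirror the ones read from `voices.json` above. This sketch hashes the whole file in one read for brevity, whereas the original uses the chunked `get_file_hash`.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Any, Dict


def needs_download(path: Path, file_info: Dict[str, Any]) -> bool:
    """Return True if the file is missing or has the wrong size or MD5 digest."""
    if not path.exists():
        return True

    # Cheap check first: a size mismatch avoids hashing the file at all
    if path.stat().st_size != file_info["size_bytes"]:
        return True

    actual_hash = hashlib.md5(path.read_bytes()).hexdigest()
    return actual_hash != file_info["md5_digest"]


with tempfile.TemporaryDirectory() as temp_dir:
    voice_path = Path(temp_dir) / "example.onnx"  # hypothetical file name
    voice_path.write_bytes(b"not a real model")

    good_info = {
        "size_bytes": voice_path.stat().st_size,
        "md5_digest": hashlib.md5(b"not a real model").hexdigest(),
    }
    bad_hash_info = dict(good_info, md5_digest="0" * 32)

    intact = needs_download(voice_path, good_info)
    corrupt = needs_download(voice_path, bad_hash_info)
    missing = needs_download(Path(temp_dir) / "absent.onnx", good_info)

print(intact, corrupt, missing)  # False True True
```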
@@ -0,0 +1,46 @@
+import argparse
+import hashlib
+import json
+import sys
+from pathlib import Path
+from typing import Union
+
+
+def get_file_hash(path: Union[str, Path], bytes_per_chunk: int = 8192) -> str:
+    """Hash a file in chunks using md5."""
+    path_hash = hashlib.md5()
+    with open(path, "rb") as path_file:
+        chunk = path_file.read(bytes_per_chunk)
+        while chunk:
+            path_hash.update(chunk)
+            chunk = path_file.read(bytes_per_chunk)
+
+    return path_hash.hexdigest()
+
+
+# -----------------------------------------------------------------------------
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("file", nargs="+")
+    parser.add_argument("--dir", help="Parent directory")
+    args = parser.parse_args()
+
+    if args.dir:
+        args.dir = Path(args.dir)
+
+    hashes = {}
+    for path_str in args.file:
+        path = Path(path_str)
+        path_hash = get_file_hash(path)
+        if args.dir:
+            path = path.relative_to(args.dir)
+
+        hashes[str(path)] = path_hash
+
+    json.dump(hashes, sys.stdout)
+
+
+if __name__ == "__main__":
+    main()
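
As a quick sanity check of the chunked approach, the sketch below hashes a temporary file in 8192-byte reads and compares the result with hashing the same bytes in one call. `md5_in_chunks` and the sample file are hypothetical; the loop uses `iter()` with a sentinel instead of the explicit `while` loop above, which is assumed to be equivalent.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_in_chunks(path: Path, bytes_per_chunk: int = 8192) -> str:
    """Hash a file without loading it into memory all at once."""
    digest = hashlib.md5()
    with open(path, "rb") as path_file:
        # iter() with a sentinel stops at the first empty read (end of file)
        for chunk in iter(lambda: path_file.read(bytes_per_chunk), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as temp_dir:
    sample_path = Path(temp_dir) / "sample.bin"  # hypothetical file
    data = bytes(range(256)) * 100  # 25,600 bytes, so several chunks are read
    sample_path.write_bytes(data)

    chunked = md5_in_chunks(sample_path)
    one_shot = hashlib.md5(data).hexdigest()

print(chunked == one_shot)  # True
```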
@@ -0,0 +1,12 @@
+"""Utilities"""
+import numpy as np
+
+
+def audio_float_to_int16(
+    audio: np.ndarray, max_wav_value: float = 32767.0
+) -> np.ndarray:
+    """Normalize audio and convert to int16 range"""
+    audio_norm = audio * (max_wav_value / max(0.01, np.max(np.abs(audio))))
+    audio_norm = np.clip(audio_norm, -max_wav_value, max_wav_value)
+    audio_norm = audio_norm.astype("int16")
+    return audio_norm
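
To make the normalization concrete, here is a pure-Python sketch of the same arithmetic without NumPy, followed by writing the result to an in-memory WAV file. `float_to_int16`, the sample values, and the 22050 Hz rate are assumptions for illustration. `int()` truncates toward zero, which is assumed to match what `astype("int16")` does for values already inside the int16 range.

```python
import io
import struct
import wave
from typing import List


def float_to_int16(samples: List[float], max_wav_value: float = 32767.0) -> List[int]:
    """Peak-normalize float samples and convert them to 16-bit integers."""
    # The 0.01 floor avoids dividing by zero on silent input
    peak = max(0.01, max(abs(sample) for sample in samples))
    scale = max_wav_value / peak
    clipped = (
        max(-max_wav_value, min(max_wav_value, sample * scale)) for sample in samples
    )
    return [int(value) for value in clipped]


samples = [0.0, 0.25, -0.5, 0.1]  # made-up audio; the peak is 0.5
pcm = float_to_int16(samples)
print(pcm)  # [0, 16383, -32767, 6553]

# Write the samples as 16-bit mono WAV into memory
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_file:
    wav_file.setframerate(22050)  # assumed sample rate
    wav_file.setsampwidth(2)  # 16-bit
    wav_file.setnchannels(1)  # mono
    wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))

wav_bytes = buffer.getvalue()
```

Because the scale is derived from the peak, quiet input is amplified to full scale: the loudest sample always lands at ±32767 unless the peak is below the 0.01 floor.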
@@ -0,0 +1,177 @@
+import json
+import logging
+import wave
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Iterable, List, Optional, Union
+
+import numpy as np
+import onnxruntime
+from piper_phonemize import phonemize_codepoints, phonemize_espeak, tashkeel_run
+
+from .config import PhonemeType, PiperConfig
+from .const import BOS, EOS, PAD
+from .util import audio_float_to_int16
+
+_LOGGER = logging.getLogger(__name__)
+
+
+@dataclass
+class PiperVoice:
+    session: onnxruntime.InferenceSession
+    config: PiperConfig
+
+    @staticmethod
+    def load(
+        model_path: Union[str, Path],
+        config_path: Optional[Union[str, Path]] = None,
+        use_cuda: bool = False,
+    ) -> "PiperVoice":
+        """Load an ONNX model and config."""
+        if config_path is None:
+            config_path = f"{model_path}.json"
+
+        with open(config_path, "r", encoding="utf-8") as config_file:
+            config_dict = json.load(config_file)
+
+        return PiperVoice(
+            config=PiperConfig.from_dict(config_dict),
+            session=onnxruntime.InferenceSession(
+                str(model_path),
+                sess_options=onnxruntime.SessionOptions(),
+                providers=["CPUExecutionProvider"]
+                if not use_cuda
+                else ["CUDAExecutionProvider"],
+            ),
+        )
+
+    def phonemize(self, text: str) -> List[List[str]]:
+        """Text to phonemes grouped by sentence."""
+        if self.config.phoneme_type == PhonemeType.ESPEAK:
+            if self.config.espeak_voice == "ar":
+                # Arabic diacritization
+                # https://github.com/mush42/libtashkeel/
+                text = tashkeel_run(text)
+
+            return phonemize_espeak(text, self.config.espeak_voice)
+
+        if self.config.phoneme_type == PhonemeType.TEXT:
+            return phonemize_codepoints(text)
+
+        raise ValueError(f"Unexpected phoneme type: {self.config.phoneme_type}")
+
+    def phonemes_to_ids(self, phonemes: List[str]) -> List[int]:
+        """Phonemes to ids."""
+        id_map = self.config.phoneme_id_map
+        ids: List[int] = list(id_map[BOS])
+
+        for phoneme in phonemes:
+            if phoneme not in id_map:
+                _LOGGER.warning("Missing phoneme from id map: %s", phoneme)
+                continue
+
+            ids.extend(id_map[phoneme])
+            ids.extend(id_map[PAD])
+
+        ids.extend(id_map[EOS])
+
+        return ids
+
+    def synthesize(
+        self,
+        text: str,
+        wav_file: wave.Wave_write,
+        speaker_id: Optional[int] = None,
+        length_scale: Optional[float] = None,
+        noise_scale: Optional[float] = None,
+        noise_w: Optional[float] = None,
+        sentence_silence: float = 0.0,
+    ):
+        """Synthesize WAV audio from text."""
+        wav_file.setframerate(self.config.sample_rate)
+        wav_file.setsampwidth(2)  # 16-bit
+        wav_file.setnchannels(1)  # mono
+
+        for audio_bytes in self.synthesize_stream_raw(
+            text,
+            speaker_id=speaker_id,
+            length_scale=length_scale,
+            noise_scale=noise_scale,
+            noise_w=noise_w,
+            sentence_silence=sentence_silence,
+        ):
+            wav_file.writeframes(audio_bytes)
+
+    def synthesize_stream_raw(
+        self,
+        text: str,
+        speaker_id: Optional[int] = None,
+        length_scale: Optional[float] = None,
+        noise_scale: Optional[float] = None,
+        noise_w: Optional[float] = None,
+        sentence_silence: float = 0.0,
+    ) -> Iterable[bytes]:
+        """Synthesize raw audio per sentence from text."""
+        sentence_phonemes = self.phonemize(text)
+
+        # 16-bit mono
+        num_silence_samples = int(sentence_silence * self.config.sample_rate)
+        silence_bytes = bytes(num_silence_samples * 2)
+
+        for phonemes in sentence_phonemes:
+            phoneme_ids = self.phonemes_to_ids(phonemes)
+            yield self.synthesize_ids_to_raw(
+                phoneme_ids,
+                speaker_id=speaker_id,
+                length_scale=length_scale,
+                noise_scale=noise_scale,
+                noise_w=noise_w,
+            ) + silence_bytes
+
+    def synthesize_ids_to_raw(
+        self,
+        phoneme_ids: List[int],
+        speaker_id: Optional[int] = None,
+        length_scale: Optional[float] = None,
+        noise_scale: Optional[float] = None,
+        noise_w: Optional[float] = None,
+    ) -> bytes:
+        """Synthesize raw audio from phoneme ids."""
+        if length_scale is None:
+            length_scale = self.config.length_scale
+
+        if noise_scale is None:
+            noise_scale = self.config.noise_scale
+
+        if noise_w is None:
+            noise_w = self.config.noise_w
+
+        phoneme_ids_array = np.expand_dims(np.array(phoneme_ids, dtype=np.int64), 0)
+        phoneme_ids_lengths = np.array([phoneme_ids_array.shape[1]], dtype=np.int64)
+        scales = np.array(
+            [noise_scale, length_scale, noise_w],
+            dtype=np.float32,
+        )
+
+        if (self.config.num_speakers > 1) and (speaker_id is None):
+            # Default speaker
+            speaker_id = 0
+
+        sid = None
+
+        if speaker_id is not None:
+            sid = np.array([speaker_id], dtype=np.int64)
+
+        # Synthesize through ONNX
+        audio = self.session.run(
+            None,
+            {
+                "input": phoneme_ids_array,
+                "input_lengths": phoneme_ids_lengths,
+                "scales": scales,
+                "sid": sid,
+            },
+        )[0].squeeze((0, 1))
+        audio = audio_float_to_int16(audio.squeeze())
+
+        return audio.tobytes()
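
A standalone sketch of two small pieces of `PiperVoice`: how `phonemes_to_ids` interleaves padding between phonemes, and how the per-sentence silence is sized. The toy `ID_MAP` is hypothetical (real maps come from the voice's JSON config), and both functions are simplified free-function versions of the methods above.

```python
from typing import Dict, List, Sequence

PAD, BOS, EOS = "_", "^", "$"

# Hypothetical toy id map, for illustration only
ID_MAP: Dict[str, Sequence[int]] = {
    PAD: [0],
    BOS: [1],
    EOS: [2],
    "h": [20],
    "i": [21],
}


def phonemes_to_ids(
    phonemes: List[str], id_map: Dict[str, Sequence[int]]
) -> List[int]:
    """BOS, then each known phoneme followed by PAD, then EOS."""
    ids: List[int] = list(id_map[BOS])
    for phoneme in phonemes:
        if phoneme not in id_map:
            continue  # unknown phonemes are skipped (the original logs a warning)
        ids.extend(id_map[phoneme])
        ids.extend(id_map[PAD])
    ids.extend(id_map[EOS])
    return ids


def silence_bytes(seconds: float, sample_rate: int) -> bytes:
    """Zero-filled 16-bit mono audio: two bytes per sample."""
    return bytes(int(seconds * sample_rate) * 2)


ids = phonemes_to_ids(["h", "i", "?"], ID_MAP)
print(ids)  # [1, 20, 0, 21, 0, 2]

pause = silence_bytes(0.5, 22050)
print(len(pause))  # 22050
```

The `"?"` entry is absent from the toy map, so it is dropped without adding a pad id, which matches the `continue` in the original loop.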
@@ -1,2 +1,2 @@
-espeak-phonemizer>=1.1.0,<2
-onnxruntime~=1.11.0
+piper-phonemize~=1.0.0
+onnxruntime>=1.11.0,<2
@@ -0,0 +1,47 @@
+#!/usr/bin/env python3
+from pathlib import Path
+
+import setuptools
+from setuptools import setup
+
+this_dir = Path(__file__).parent
+module_dir = this_dir / "piper"
+
+requirements = []
+requirements_path = this_dir / "requirements.txt"
+if requirements_path.is_file():
+    with open(requirements_path, "r", encoding="utf-8") as requirements_file:
+        requirements = requirements_file.read().splitlines()
+
+data_files = [module_dir / "voices.json"]
+
+# -----------------------------------------------------------------------------
+
+setup(
+    name="piper-tts",
+    version="1.1.0",
+    description="A fast, local neural text to speech system that sounds great and is optimized for the Raspberry Pi 4.",
+    url="http://github.com/rhasspy/piper",
+    author="Michael Hansen",
+    author_email="mike@rhasspy.org",
+    license="MIT",
+    packages=setuptools.find_packages(),
+    package_data={"piper": [str(p.relative_to(module_dir)) for p in data_files]},
+    entry_points={
+        "console_scripts": [
+            "piper = piper.__main__:main",
+        ]
+    },
+    install_requires=requirements,
+    classifiers=[
+        "Development Status :: 3 - Alpha",
+        "Intended Audience :: Developers",
+        "Topic :: Text Processing :: Linguistic",
+        "License :: OSI Approved :: MIT License",
+        "Programming Language :: Python :: 3.7",
+        "Programming Language :: Python :: 3.8",
+        "Programming Language :: Python :: 3.9",
+        "Programming Language :: Python :: 3.10",
+    ],
+    keywords="rhasspy piper tts",
+)
@@ -1 +1 @@
-1.0.0
+1.1.0