diff --git a/README.md b/README.md
index 8ce296b..bbfddd6 100644
--- a/README.md
+++ b/README.md
@@ -88,81 +88,10 @@ Piper has been used in the following projects/papers:
 
 ## Training
 
-See [src/python](src/python)
+See the [training guide](TRAINING.md) and the [source code](src/python).
 
 Pretrained checkpoints are available on [Hugging Face](https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main)
 
-Start by installing system dependencies:
-
-``` sh
-sudo apt-get install python3-dev
-```
-
-Then create a virtual environment:
-
-``` sh
-cd piper/src/python
-python3 -m venv .venv
-source .venv/bin/activate
-pip3 install --upgrade pip
-pip3 install --upgrade wheel setuptools
-pip3 install -r requirements.txt
-```
-
-Run the `build_monotonic_align.sh` script in the `src/python` directory to build the extension.
-
-Ensure you have [espeak-ng](https://github.com/espeak-ng/espeak-ng/) installed (`sudo apt-get install espeak-ng`).
-
-Next, preprocess your dataset:
-
-``` sh
-python3 -m piper_train.preprocess \
-  --language en-us \
-  --input-dir /path/to/ljspeech/ \
-  --output-dir /path/to/training_dir/ \
-  --dataset-format ljspeech \
-  --sample-rate 22050
-```
-
-Datasets must either be in the [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) format (with only id/text columns or id/speaker/text) or from [Mimic Recording Studio](https://github.com/MycroftAI/mimic-recording-studio) (`--dataset-format mycroft`).
-
-Finally, you can train:
-
-``` sh
-python3 -m piper_train \
-    --dataset-dir /path/to/training_dir/ \
-    --accelerator 'gpu' \
-    --devices 1 \
-    --batch-size 32 \
-    --validation-split 0.05 \
-    --num-test-examples 5 \
-    --max_epochs 10000 \
-    --precision 32
-```
-
-Training uses [PyTorch Lightning](https://www.pytorchlightning.ai/). Run `tensorboard --logdir /path/to/training_dir/lightning_logs` to monitor. See `python3 -m piper_train --help` for many additional options.
-
-It is highly recommended to train with the following `Dockerfile`:
-
-``` dockerfile
-FROM nvcr.io/nvidia/pytorch:22.03-py3
-
-RUN pip3 install \
-    'pytorch-lightning'
-
-ENV NUMBA_CACHE_DIR=.numba_cache
-```
-
-See the various `infer_*` and `export_*` scripts in [src/python/piper_train](src/python/piper_train) to test and export your voice from the checkpoint in `lightning_logs`. The `dataset.jsonl` file in your training directory can be used with `python3 -m piper_train.infer` for quick testing:
-
-``` sh
-head -n5 /path/to/training_dir/dataset.jsonl | \
-  python3 -m piper_train.infer \
-    --checkpoint lightning_logs/path/to/checkpoint.ckpt \
-    --sample-rate 22050 \
-    --output-dir wavs
-```
-
 ## Running in Python
 
diff --git a/TRAINING.md b/TRAINING.md
new file mode 100644
index 0000000..4872d93
--- /dev/null
+++ b/TRAINING.md
@@ -0,0 +1,235 @@
+# Training Guide
+
+Training a voice for Piper involves 3 main steps:
+
+1. Preparing the dataset
+2. Training the voice model
+3. Exporting the voice model
+
+Choices must be made at each step, including:
+
+* The model "quality"
+    * low = 16,000 Hz sample rate, [smaller voice model](https://github.com/rhasspy/piper/blob/master/src/python/piper_train/vits/config.py#L30)
+    * medium = 22,050 Hz sample rate, [smaller voice model](https://github.com/rhasspy/piper/blob/master/src/python/piper_train/vits/config.py#L30)
+    * high = 22,050 Hz sample rate, [larger voice model](https://github.com/rhasspy/piper/blob/master/src/python/piper_train/vits/config.py#L45)
+* Single or multiple speakers
+* Fine-tuning an [existing model](https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main) or training from scratch
+* Exporting to [onnx](https://github.com/microsoft/onnxruntime/) or PyTorch
+
+## Getting Started
+
+Start by installing system dependencies:
+
+``` sh
+sudo apt-get install python3-dev
+```
+
+Then create a Python virtual environment:
+
+``` sh
+cd piper/src/python
+python3 -m venv .venv
+source .venv/bin/activate
+pip3 install --upgrade pip
+pip3 install --upgrade wheel setuptools
+pip3 install -r requirements.txt
+```
+
+Run the `build_monotonic_align.sh` script in the `src/python` directory to build the extension.
+
+Ensure you have [espeak-ng](https://github.com/espeak-ng/espeak-ng/) installed (`sudo apt-get install espeak-ng`).
+
+
+## Preparing a Dataset
+
+The Piper training scripts expect two files that can be generated by `python3 -m piper_train.preprocess` (an illustrative sketch of both is shown after this list):
+
+* A `config.json` file with the voice settings
+    * `audio` (required)
+        * `sample_rate` - audio rate in hertz
+    * `espeak` (required)
+        * `language` - espeak-ng voice or [alphabet](https://github.com/rhasspy/piper-phonemize/blob/master/src/phoneme_ids.cpp)
+    * `num_symbols` (required)
+        * Number of phonemes in the model (typically 256)
+    * `num_speakers` (required)
+        * Number of speakers in the dataset
+    * `phoneme_id_map` (required)
+        * Map from a phoneme (UTF-8 codepoint) to a list of ids
+        * Id 0 ("_") is padding (pad)
+        * Id 1 ("^") is the beginning of an utterance (bos)
+        * Id 2 ("$") is the end of an utterance (eos)
+        * Id 3 (" ") is a word separator (whitespace)
+    * `phoneme_type`
+        * "espeak" or "text"
+        * "espeak" phonemes use [espeak-ng](https://github.com/rhasspy/espeak-ng)
+        * "text" phonemes use a pre-defined [alphabet](https://github.com/rhasspy/piper-phonemize/blob/master/src/phoneme_ids.cpp)
+    * `speaker_id_map`
+        * Map from a speaker name to id
+    * `phoneme_map`
+        * Map from a phoneme (UTF-8 codepoint) to a list of phonemes
+    * `inference`
+        * `noise_scale` - noise added to the generator (default: 0.667)
+        * `length_scale` - speaking speed (default: 1.0)
+        * `noise_w` - phoneme width variation (default: 0.8)
+* A `dataset.jsonl` file with one line per utterance (JSON objects)
+    * `phoneme_ids` (required)
+        * List of ids for each utterance phoneme (0 <= id < `num_symbols`)
+    * `audio_norm_path` (required)
+        * Absolute path to [normalized audio](https://github.com/rhasspy/piper/tree/master/src/python/piper_train/norm_audio) file (`.pt`)
+    * `audio_spec_path` (required)
+        * Absolute path to [audio spectrogram](https://github.com/rhasspy/piper/blob/fda64e7a5104810a24eb102b880fc5c2ac596a38/src/python/piper_train/vits/mel_processing.py#L40) file (`.pt`)
+    * `speaker_id` (required for multi-speaker)
+        * Id of the utterance's speaker (0 <= id < `num_speakers`)
+    * `audio_path`
+        * Absolute path to original audio file
+    * `text`
+        * Original text of utterance before phonemization
+    * `phonemes`
+        * Phonemes from utterance text before converting to ids
+    * `speaker`
+        * Name of utterance speaker (from `speaker_id_map`)
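+
+As a rough, hand-written illustration (the real files are produced by `piper_train.preprocess`; values here are made up, and the real `phoneme_id_map` has an entry for every phoneme), a single-speaker `config.json` might look like:
+
+```json
+{
+  "audio": { "sample_rate": 22050 },
+  "espeak": { "language": "en-us" },
+  "num_symbols": 256,
+  "num_speakers": 1,
+  "phoneme_type": "espeak",
+  "phoneme_id_map": { "_": [0], "^": [1], "$": [2], " ": [3] },
+  "inference": { "noise_scale": 0.667, "length_scale": 1.0, "noise_w": 0.8 }
+}
+```
+
+and each line of `dataset.jsonl` is one JSON object along these lines (paths and ids are placeholders):
+
+```json
+{"phoneme_ids": [1, 14, 0, 22, 0, 31, 2], "audio_norm_path": "/path/to/1234.norm.pt", "audio_spec_path": "/path/to/1234.spec.pt", "audio_path": "/path/to/wav/1234.wav", "text": "Hello world."}
+```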
+
+
+### Dataset Format
+
+The pre-processing script expects data to be a directory with:
+
+* `metadata.csv` - CSV file with text, audio filenames, and speaker names
+* `wav/` - directory with audio files
+
+The `metadata.csv` file uses `|` as a delimiter and has 2 or 3 columns, depending on whether the dataset has a single speaker or multiple speakers.
+There is no header row.
+
+For single speaker datasets:
+
+```csv
+id|text
+```
+
+where `id` is the name of the WAV file in the `wav` directory. For example, an `id` of `1234` means that `wav/1234.wav` should exist.
+
+For multi-speaker datasets:
+
+```csv
+id|speaker|text
+```
+
+where `speaker` is the name of the utterance's speaker. Speaker ids will automatically be assigned based on the number of utterances per speaker (speaker id 0 has the most utterances).
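+
+For example, a few rows of a multi-speaker `metadata.csv` might look like this (ids, speaker names, and text are made up for illustration):
+
+```csv
+0001|dave|It's a beautiful day outside.
+0002|dave|How are you today?
+0003|alice|It's a beautiful day outside.
+```
+
+Here `wav/0001.wav`, `wav/0002.wav`, and `wav/0003.wav` must exist, and `dave` would receive speaker id 0 because he has the most utterances.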
+
+
+### Pre-processing
+
+An example of pre-processing a single speaker dataset:
+
+``` sh
+python3 -m piper_train.preprocess \
+  --language en-us \
+  --input-dir /path/to/dataset_dir/ \
+  --output-dir /path/to/training_dir/ \
+  --dataset-format ljspeech \
+  --single-speaker \
+  --sample-rate 22050
+```
+
+The `--language` argument refers to an [espeak-ng voice](https://github.com/espeak-ng/espeak-ng/) by default, such as `de` for German.
+
+To pre-process a multi-speaker dataset, remove the `--single-speaker` flag and ensure that your dataset has the 3 columns: `id|speaker|text`.
+Verify the number of speakers in the generated `config.json` file before proceeding.
+
+
+## Training a Model
+
+Once you have a `config.json`, `dataset.jsonl`, and audio files (`.pt`) from pre-processing, you can begin the training process with `python3 -m piper_train`.
+
+In most cases, you should fine-tune from [an existing model](https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main). The model must have the same audio quality and sample rate, but does not necessarily need to be in the same language.
+
+It is **highly recommended** to train with the following `Dockerfile`:
+
+``` dockerfile
+FROM nvcr.io/nvidia/pytorch:22.03-py3
+
+RUN pip3 install \
+    'pytorch-lightning'
+
+ENV NUMBA_CACHE_DIR=.numba_cache
+```
+
+As an example, we will fine-tune the [medium quality lessac voice](https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main/en/en_US/lessac/medium). Download the `.ckpt` file and run the following command in your training environment:
+
+``` sh
+python3 -m piper_train \
+    --dataset-dir /path/to/training_dir/ \
+    --accelerator 'gpu' \
+    --devices 1 \
+    --batch-size 32 \
+    --validation-split 0.0 \
+    --num-test-examples 0 \
+    --max_epochs 10000 \
+    --resume_from_checkpoint /path/to/lessac/epoch=2164-step=1355540.ckpt \
+    --checkpoint-epochs 1 \
+    --precision 32
+```
+
+Use `--quality high` to train a [larger voice model](https://github.com/rhasspy/piper/blob/master/src/python/piper_train/vits/config.py#L45) (sounds better, but is much slower).
+
+You can adjust the validation split (5% = 0.05) and the number of test examples for your specific dataset. For fine-tuning, they are often set to 0 because the target dataset is very small.
+
+Batch size can be tricky to get right. It depends on the size of your GPU's VRAM, the model's quality/size, and the length of the longest sentence in your dataset. The `--max-phoneme-ids N` argument to `piper_train` will drop sentences that have more than `N` phoneme ids. In practice, using `--batch-size 32` and `--max-phoneme-ids 400` will work for 24 GB of VRAM (RTX 3090/4090).
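+
+If you are unsure where to set the cutoff, one quick way to check (not part of `piper_train`, and it assumes [jq](https://github.com/jqlang/jq) is installed) is to look at the longest utterances in `dataset.jsonl`:
+
+```sh
+# Print the five largest phoneme_ids lengths in the dataset
+jq '.phoneme_ids | length' /path/to/training_dir/dataset.jsonl | sort -n | tail -n 5
+```
+
+If only a handful of utterances are much longer than the rest, dropping them with `--max-phoneme-ids` lets you keep a larger batch size without running out of memory.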
+
+
+### Multi-Speaker Fine-Tuning
+
+If you're training a multi-speaker model, use `--resume_from_single_speaker_checkpoint` instead of `--resume_from_checkpoint`. This will be *much* faster than training your multi-speaker model from scratch.
+
+
+### Testing
+
+To test your voice during training, you can use [these test sentences](https://github.com/rhasspy/piper/tree/master/etc/test_sentences) or generate your own with [piper-phonemize](https://github.com/rhasspy/piper-phonemize/). Run the following command to generate audio files:
+
+```sh
+cat test_en-us.jsonl | \
+    python3 -m piper_train.infer \
+        --sample-rate 22050 \
+        --checkpoint /path/to/training_dir/lightning_logs/version_0/checkpoints/*.ckpt \
+        --output-dir /path/to/training_dir/output
+```
+
+The input format to `piper_train.infer` is the same as `dataset.jsonl`: one line of JSON per utterance with `phoneme_ids` and `speaker_id` (multi-speaker only). Generate your own test file with [piper-phonemize](https://github.com/rhasspy/piper-phonemize/):
+
+```sh
+lib/piper_phonemize -l en-us --espeak-data lib/espeak-ng-data/ < my_test_sentences.txt > my_test_phonemes.jsonl
+```
+
+
+### Tensorboard
+
+Check on your model's progress with tensorboard:
+
+```sh
+tensorboard --logdir /path/to/training_dir/lightning_logs
+```
+
+Click on the "Scalars" tab and look at both `loss_disc_all` and `loss_gen_all`. In general, the model is "done" when `loss_disc_all` levels off. We've found that 2000 epochs is usually good for models trained from scratch, and an additional 1000 epochs when fine-tuning.
+
+
+## Exporting a Model
+
+When your model is finished training, export it to onnx with:
+
+```sh
+python3 -m piper_train.export_onnx \
+    /path/to/model.ckpt \
+    /path/to/model.onnx
+
+cp /path/to/training_dir/config.json \
+   /path/to/model.onnx.json
+```
+
+The [export script](https://github.com/rhasspy/piper-samples/blob/master/_script/export.sh) does additional optimization of the model with [onnx-simplifier](https://github.com/daquexian/onnx-simplifier).
+
+If the export is successful, you can now use your voice with Piper:
+
+```sh
+echo 'This is a test.' | \
+  piper -m /path/to/model.onnx --output_file test.wav
+```
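+
+The copied `model.onnx.json` is the same `config.json` described earlier, so it may also contain the `inference` section. If you want to experiment with the voice's delivery, one thing to try (values below are only suggestions) is adjusting those defaults; the relevant part of the file might look like:
+
+```json
+{
+  "inference": {
+    "noise_scale": 0.667,
+    "length_scale": 1.1,
+    "noise_w": 0.8
+  }
+}
+```
+
+A `length_scale` above 1.0 slows the voice down, while values below 1.0 speed it up.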