A major update to the training notebook.

This commit is contained in:
Mateo Cedillo
2023-06-14 12:03:43 -05:00
parent 4a7f37c4f6
commit fcf2d947ab

@@ -109,20 +109,21 @@
"cell_type": "code",
"source": [
"#@markdown # <font color=\"pink\"> **Install software.** 📦\n",
"\n",
"#@markdown ####In this cell the synthesizer and its necessary dependencies to execute the training will be installed. (this may take a while)\n",
"\n",
"#@markdown <font color=\"orange\">**Note: Please restart the runtime environment when the cell execution is finished. Then you can continue with the training section.**\n",
"\n",
"# clone:\n",
"!git clone https://github.com/rmcpantoja/piper\n",
"%cd piper/src/python\n",
"!pip install --upgrade pip\n",
"!pip install --upgrade wheel setuptools\n",
"!pip install -r requirements.txt\n",
"!pip install torchtext==0.12.0\n",
"!pip install torchvision==0.12.0\n",
"!git clone -q https://github.com/rmcpantoja/piper\n",
"%cd /content/piper/src/python\n",
"!pip install -q -r requirements.txt\n",
"!pip install -q torchtext==0.12.0\n",
"!pip install -q torchvision==0.12.0\n",
"!bash build_monotonic_align.sh\n",
"!apt-get install espeak-ng\n",
"!apt-get install -q espeak-ng\n",
"# download patches:\n",
"print(\"Downloading patch...\")\n",
"!gdown -q \"1EWEb7amo1rgFGpBFfRD4BKX3pkjVK1I-\" -O \"/content/piper/src/python/patch.zip\"\n",
"!unzip -o -q \"patch.zip\"\n",
"%cd /content"
],
"metadata": {
@@ -177,12 +178,22 @@
"#@markdown ---\n",
"#@markdown ####Important: the transcription means writing what the character says in each of the audios, and it must have the following structure:\n",
"\n",
"#@markdown ##### For a single-speaker dataset:\n",
"#@markdown * wavs/1.wav|This is what my character says in audio 1.\n",
"#@markdown * wavs/2.wav|This, the text that the character says in audio 2.\n",
"#@markdown * ...\n",
"\n",
"#@markdown ##### For a multi-speaker dataset:\n",
"\n",
"#@markdown * wavs/speaker1audio1.wav|speaker1|This is what the first speaker says.\n",
"#@markdown * wavs/speaker1audio2.wav|speaker1|This is another audio of the first speaker.\n",
"#@markdown * wavs/speaker2audio1.wav|speaker2|This is what the second speaker says in the first audio.\n",
"#@markdown * wavs/speaker2audio2.wav|speaker2|This is another audio of the second speaker.\n",
"#@markdown * ...\n",
"\n",
"#@markdown And so on. In addition, the transcript must be in a .csv format. (UTF-8 without BOM)\n",
"\n",
"#@markdown ---\n",
"%cd /content/dataset\n",
"from google.colab import files\n",
"!rm /content/dataset/metadata.csv\n",
@@ -261,7 +272,7 @@
"    force_sp = \"\"\n",
"#@markdown ---\n",
"#@markdown ### Select the sample rate of the dataset:\n",
"sample_rate = \"16000\" #@param [\"16000\", \"22050\"]\n",
"sample_rate = \"22050\" #@param [\"16000\", \"22050\"]\n",
"#@markdown ---\n",
"%cd /content/piper/src/python\n",
"!python -m piper_train.preprocess \\\n",
@@ -286,21 +297,23 @@
"import json\n",
"import ipywidgets as widgets\n",
"from IPython.display import display\n",
"from google.colab import output\n",
"import os\n",
"#@markdown ### Select the action to train this dataset:\n",
"\n",
"#@markdown ### Fine-tune this dataset?\n",
"#@markdown * The option to convert a single-speaker model to a multi-speaker model is self-explanatory, and for this it is important that you have processed a dataset that contains text and audio from all possible speakers that you want to train in your model.\n",
"#@markdown * The finetune option is used to train a dataset using a pretrained model, that is, train on that data. This option is ideal if you want to train a very small dataset (more than five minutes recommended).\n",
"#@markdown * The train from scratch option builds features such as dictionary and speech form from scratch, and this may take longer to converge. For this, hours of audio (8 at least) are recommended, which have a large collection of phonemes.\n",
"\n",
"#@markdown note: Currently, some models are not working due to piper update, but it is expected to update soon\n",
"\n",
"#@markdown * ar-qasr low and high\n",
"#@markdown * da-nst_talesyntese-medium\n",
"#@markdown * kk-iseke-low\n",
"#@markdown * kk-raya-low\n",
"#@markdown * ne-Google-low\n",
"#@markdown * no-talesyntese-medium\n",
"finetune = True #@param {type:\"boolean\"}\n",
"action = \"finetune\" #@param [\"convert single-speaker to multi-speaker model\", \"finetune\", \"train from scratch\"]\n",
"#@markdown ---\n",
"if finetune:\n",
"if action == \"finetune\":\n",
" ft_command = '--resume_from_checkpoint \"/content/pretrained.ckpt\" '\n",
"elif action == \"convert single-speaker to multi-speaker model\":\n",
" ft_command = '--resume_from_single_speaker_checkpoint \"/content/pretrained.ckpt\" '\n",
"else:\n",
" ft_command = \"\"\n",
"if action== \"convert single-speaker to multi-speaker model\" or action == \"finetune\":\n",
" try:\n",
" with open('/content/piper/notebooks/pretrained_models.json') as f:\n",
" pretrained_models = json.load(f)\n",
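
The action-to-flag mapping introduced in this hunk can also be written as a lookup table. This is a sketch only: `CHECKPOINT_FLAGS` and `build_ft_command` are hypothetical names, while the two flags and the checkpoint path are taken from the diff. Unlike the notebook, which falls through to an empty string for any other value, this version assumes an unknown action should be reported as an error.

```python
# Flags copied from the diff; the dictionary itself is a hypothetical restructuring.
CHECKPOINT_FLAGS = {
    "finetune": "--resume_from_checkpoint",
    "convert single-speaker to multi-speaker model": "--resume_from_single_speaker_checkpoint",
}


def build_ft_command(action, checkpoint="/content/pretrained.ckpt"):
    """Return the extra training arguments for the selected action."""
    if action == "train from scratch":
        return ""
    try:
        flag = CHECKPOINT_FLAGS[action]
    except KeyError:
        # Assumption: failing loudly is preferable to silently training from scratch.
        raise ValueError(f"unknown action: {action!r}") from None
    # The trailing space matches the notebook, which splices this into a command line.
    return f'{flag} "{checkpoint}" '
```

Keeping the flags in one table means the later check for "does this action need a pretrained model?" can be written as `action in CHECKPOINT_FLAGS` instead of repeating both strings.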
@@ -312,12 +325,20 @@
"            def download_model(btn):\n",
"                model_name = model_dropdown.value\n",
"                model_url = pretrained_models[final_language][model_name]\n",
"                print(\"Downloading pretrained model...\")\n",
"                if model_url.startswith(\"1\"):\n",
"                    !gdown \"{model_url}\" -O \"/content/pretrained.ckpt\"\n",
"                    !gdown -q \"{model_url}\" -O \"/content/pretrained.ckpt\"\n",
"                elif model_url.startswith(\"https://drive.google.com/file/d/\"):\n",
"                    !gdown \"{model_url}\" -O \"/content/pretrained.ckpt\" --fuzzy\n",
"                    !gdown -q \"{model_url}\" -O \"/content/pretrained.ckpt\" --fuzzy\n",
"                else:\n",
"                    !wget \"{model_url}\" -O \"/content/pretrained.ckpt\"\n",
"                    !wget -q \"{model_url}\" -O \"/content/pretrained.ckpt\"\n",
"                model_dropdown.close()\n",
"                download_button.close()\n",
"                output.clear()\n",
"                if os.path.exists(\"/content/pretrained.ckpt\"):\n",
"                    print(\"Model downloaded!\")\n",
"                else:\n",
"                    raise Exception(\"Couldn't download the pretrained model!\")\n",
"            download_button.on_click(download_model)\n",
"            display(model_dropdown, download_button)\n",
"        else:\n",
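
The three-way download dispatch in `download_model` can be factored into a function that only builds the command, which makes the branching easy to test outside Colab. This is a sketch under assumptions: `build_download_command` and `fetch_pretrained` are hypothetical names, and running the tools through `subprocess` instead of the notebook's `!` shell escapes is a substitution. The `gdown` and `wget` options are the ones that appear in the diff.

```python
import os
import subprocess

DEST = "/content/pretrained.ckpt"  # destination path used by the notebook


def build_download_command(model_url, dest=DEST):
    """Pick the downloader the same way the notebook's if/elif/else does."""
    if model_url.startswith("1"):
        # Bare Google Drive file ID.
        return ["gdown", "-q", model_url, "-O", dest]
    if model_url.startswith("https://drive.google.com/file/d/"):
        # Full Drive link; --fuzzy lets gdown extract the file ID from it.
        return ["gdown", "-q", model_url, "-O", dest, "--fuzzy"]
    # Anything else is treated as a direct URL.
    return ["wget", "-q", model_url, "-O", dest]


def fetch_pretrained(model_url, dest=DEST):
    """Run the download and confirm that the checkpoint file exists."""
    subprocess.run(build_download_command(model_url, dest), check=True)
    if not os.path.exists(dest):
        raise RuntimeError("Couldn't download the pretrained model!")
    return dest
```

With `check=True`, a non-zero exit status from `gdown` or `wget` raises `subprocess.CalledProcessError` immediately, so the existence check only has to catch the case where the tool exits cleanly without writing the file.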
@@ -325,24 +346,21 @@
"    except FileNotFoundError:\n",
"        raise Exception(\"The pretrained_models.json file was not found.\")\n",
"else:\n",
"    ft_command = \"\"\n",
"    print(\"Warning: this model will be trained from scratch. You need at least 8 hours of data for everything to work decently. Good luck!\")\n",
"#@markdown ### Choose batch size based on this dataset:\n",
"batch_size = 16 #@param {type:\"integer\"}\n",
"#@markdown ---\n",
"#@markdown ### Validation split:\n",
"validation_split = 0.01 #@param {type:\"number\"}\n",
"batch_size = 8 #@param {type:\"integer\"}\n",
"#@markdown ---\n",
"validation_split = 0.01\n",
"#@markdown ### Choose the quality for this model:\n",
"\n",
"#@markdown * x-low - 16Khz audio, 5-7M params\n",
"#@markdown * low - 16Khz audio, 15-20M params\n",
"#@markdown * medium - 22.05Khz audio, 15-20 params\n",
"#@markdown * high - 22.05Khz audio, 28-32M params\n",
"quality = \"x-low\" #@param [\"high\", \"x-low\", \"medium\"]\n",
"quality = \"medium\" #@param [\"high\", \"x-low\", \"medium\"]\n",
"#@markdown ---\n",
"#@markdown ### For how many epochs to save training checkpoints?\n",
"#@markdown The larger your dataset, you should set this saving interval to a smaller value, as epochs can progress longer time.\n",
"checkpoint_epochs = 25 #@param {type:\"integer\"}\n",
"checkpoint_epochs = 3 #@param {type:\"integer\"}\n",
"#@markdown ---\n",
"#@markdown ### Step interval to generate model samples:\n",
"log_every_n_steps = 250 #@param {type:\"integer\"}\n",
@@ -361,40 +379,24 @@
{
"cell_type": "code",
"source": [
"#@markdown # <font color=\"pink\"> **5. Run the TensorBoard extension.** 📈\n",
"#@markdown # <font color=\"pink\"> **5. Train.** 🏋️‍♂️\n",
"#@markdown Run this cell to train your final model! If possible, some audio samples will be saved during training in the output folder.\n",
"\n",
"#@markdown The TensorBoard is used to visualize the results of the model while it is being trained.\n",
"\n",
"#@markdown **Note: due to piper update, the tensorboard is not working at the moment.**\n",
"%load_ext tensorboard\n",
"%tensorboard --logdir {output_dir}"
],
"metadata": {
"cellView": "form",
"id": "MpKDfhAHjHJ3"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"#@markdown # <font color=\"pink\"> **6. Train.** 🏋️‍♂️\n",
"\n",
"#@markdown <font color=\"orange\">**Note: Remember to empty the trash of your Drive from time to time to avoid a lot of space consumption when saving the models.**\n",
"!python -m piper_train \\\n",
" --dataset-dir \"{output_dir}\" \\\n",
" --accelerator 'gpu' \\\n",
" --devices 1 \\\n",
" --batch-size {batch_size} \\\n",
" --validation-split {validation_split} \\\n",
" --num-test-examples 2 \\\n",
" --quality {quality} \\\n",
" --checkpoint-epochs {checkpoint_epochs} \\\n",
" --log_every_n_steps {log_every_n_steps} \\\n",
" --max_epochs {max_epochs} \\\n",
" {ft_command}\\\n",
" --precision 32"
"get_ipython().system(f'''\n",
"python -m piper_train \\\n",
"--dataset-dir \"{output_dir}\" \\\n",
"--accelerator 'gpu' \\\n",
"--devices 1 \\\n",
"--batch-size {batch_size} \\\n",
"--validation-split {validation_split} \\\n",
"--num-test-examples 2 \\\n",
"--quality {quality} \\\n",
"--checkpoint-epochs {checkpoint_epochs} \\\n",
"--log_every_n_steps {log_every_n_steps} \\\n",
"--max_epochs {max_epochs} \\\n",
"{ft_command}\\\n",
"--precision 32\n",
"''')"
],
"metadata": {
"id": "X4zbSjXg2J3N",
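
The training command assembled in this hunk can alternatively be built as an argument list, which avoids shell quoting problems when `output_dir` contains spaces. This is a sketch, not the notebook's code: `build_train_argv` is a hypothetical helper, and the option names are copied from the diff and not verified against `piper_train` here.

```python
import shlex


def build_train_argv(output_dir, batch_size, validation_split, quality,
                     checkpoint_epochs, log_every_n_steps, max_epochs,
                     ft_command=""):
    """Assemble the piper_train invocation as a list of arguments."""
    argv = [
        "python", "-m", "piper_train",
        "--dataset-dir", output_dir,
        "--accelerator", "gpu",
        "--devices", "1",
        "--batch-size", str(batch_size),
        "--validation-split", str(validation_split),
        "--num-test-examples", "2",
        "--quality", quality,
        "--checkpoint-epochs", str(checkpoint_epochs),
        "--log_every_n_steps", str(log_every_n_steps),
        "--max_epochs", str(max_epochs),
    ]
    # ft_command is the notebook's shell-style string; shlex.split removes its
    # quotes and trailing space, and yields nothing for an empty string.
    argv += shlex.split(ft_command)
    argv += ["--precision", "32"]
    return argv
```

The resulting list could be passed to `subprocess.run(argv, check=True)`; whether live training output then streams into the Colab cell the way `get_ipython().system` does would need to be checked.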