Toucan Questions #195

Open
MrEdwards007 opened this issue Oct 1, 2024 · 41 comments

Comments

@MrEdwards007

I have a few questions that I hope will not take much of your time.

  • Is there support for IPA or some other phonetic pronunciation for words that are incorrectly pronounced or that you have a specific pronunciation for?
  • Is there a way to add specific lengths of silence, e.g. [[slnc 2000]], where 2000 is 2000 milliseconds?
  • How does one add emphasis or emotion?
  • Are there any scripts that can be used for fine tuning a single voice? Hopefully, a simple one with a GUI.
  • Can you utilize TensorBoard to view the training logs and see what is likely the best checkpoint?
  • Can you provide some guidance on what is different about my voice that no cloning software is effective (I've tried many)? The only exception is using post processing using RVC but again, that is post processing.
    My_Voice_mp3.zip

I'm ultimately looking for a clone close enough that I could fool myself. I get pretty close when I use RVC.

I tried the space on HuggingFace but the clone using any of my uploaded samples sounded like another guy.
I have approximately 1.5 hours of good quality audio, similar to the attached, so I can fine-tune if needed.
https://huggingface.co/spaces/Flux9665/MassivelyMultilingualTTS

@Flux9665
Collaborator

Flux9665 commented Oct 2, 2024

Is there support for IPA or some other phonetic pronunciation for words that are incorrectly pronounced or that you have a specific pronunciation for?

Not directly, but that's a very good idea. I'll add it to my list to do. There are a few things that need to be considered to implement this, but it shouldn't be too difficult. For now, you could extend the following function that is applied to any English text to change the orthography of certain words automatically to trick the phonemizer into converting the word in a way that sounds more like what it's supposed to be:

def english_text_expansion(text):
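
As a rough illustration of that workaround: the body of english_text_expansion is not shown in this thread, so the sketch below is a standalone helper that could be called from it. The name apply_respellings and the table entries are hypothetical examples, not part of Toucan.

```python
import re

# Hypothetical respellings: map a word to an orthography that the phonemizer
# is more likely to pronounce as intended. These entries are examples only.
RESPELLINGS = {
    "Toucan": "Toocan",
    "cache": "cash",
}


def apply_respellings(text, respellings=RESPELLINGS):
    """Replace whole words (case-insensitive) with their respelled form."""
    for word, respelled in respellings.items():
        # \b keeps "cache" from matching inside longer words such as "cachet".
        pattern = r"\b" + re.escape(word) + r"\b"
        text = re.sub(pattern, respelled, text, flags=re.IGNORECASE)
    return text


print(apply_respellings("Clear the cache before running Toucan."))
```

Whether a given respelling actually sounds right depends on the phonemizer, so each entry has to be checked by listening.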

Is there a way to add specific lengths of silence e.g. [[slnc 2000]], where 2000 is 2000 milliseconds

Again, not directly. You can add ~ to the text to insert a pause at that point, but the model will decide how long the pause should be. It is possible to modify the length of the pause after the model has predicted it, but there's currently no simple way of doing this. I plan to implement a GUI where you can modify an utterance intuitively, but that seems pretty difficult, so I have put it off for a while already.
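
Until something like that exists, one possible workaround is to handle fixed-length pauses outside the model: split the text at the markers, synthesize each chunk separately, and join the audio with the exact number of zero-valued samples. The sketch below only covers the parsing and the sample arithmetic. The marker syntax is borrowed from the question, the helper names are hypothetical, and the sample rate has to be taken from the audio you actually synthesize.

```python
import re

# Marker syntax taken from the question above: [[slnc <milliseconds>]]
SILENCE_MARKER = re.compile(r"\[\[slnc (\d+)\]\]")


def split_on_silence_markers(text):
    """Return a list of ("text", str) and ("silence", milliseconds) items."""
    items = []
    position = 0
    for match in SILENCE_MARKER.finditer(text):
        chunk = text[position:match.start()].strip()
        if chunk:
            items.append(("text", chunk))
        items.append(("silence", int(match.group(1))))
        position = match.end()
    tail = text[position:].strip()
    if tail:
        items.append(("text", tail))
    return items


def silence_samples(milliseconds, sample_rate):
    """Number of zero-valued samples needed for a pause of the given length."""
    return round(sample_rate * milliseconds / 1000)


print(split_on_silence_markers("Hello. [[slnc 2000]] Goodbye."))
```

Each "text" item would be passed to the synthesizer on its own, and each "silence" item becomes a block of zeros of length silence_samples(...) between the resulting waveforms.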

How does one add emphasis or emotion?

Same as above, it's possible by overwriting some intermediate results, but there's currently no simple way of doing this. In the future there will be, but it might take a while.

Are there any scripts that can be used for fine tuning a single voice? Hopefully, a simple one with a GUI.

There is one. It doesn't come with a GUI, but the changes you need to do to the code are pretty minimal. And again, I plan to improve this in the future and even add a GUI where you can just throw in some data. You just mentioned a lot of points that I plan for the future, but didn't get to yet, haha.

For finetuning, all you need to implement is a function that returns a dictionary of paths to audios that map to their respective transcripts. Each audio should be around 6-10 seconds long ideally, a bit longer or a bit shorter is no problem. You then plug this dictionary into the following function and run the run_training_pipeline.py script with "finetuning_example_simple" as command line argument.

train_data = prepare_tts_corpus(transcript_dict=build_path_to_transcript_integration_test(),
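
A minimal sketch of such a function, assuming the transcripts live in a text file with one "filename|transcript" pair per line. That file layout and the two parameters are assumptions for illustration; in the recipe the function is called without arguments, so the paths would be hard-coded or wrapped.

```python
import os


def build_path_to_transcript(metadata_path, audio_dir):
    """Map audio file paths to transcripts, read from 'filename|transcript' lines."""
    path_to_transcript = {}
    with open(metadata_path, encoding="utf8") as metadata:
        for line in metadata:
            line = line.strip()
            if not line:
                continue  # skip empty lines
            # Split only on the first "|" so transcripts may contain the character.
            filename, transcript = line.split("|", maxsplit=1)
            path_to_transcript[os.path.join(audio_dir, filename)] = transcript.strip()
    return path_to_transcript
```

The returned dictionary has the shape described above: keys are paths to audio files, values are the matching transcripts.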

Can you utilize TensorBoard to view the training logs and see what is likely the best checkpoint?

In generative tasks, the loss is usually not really indicative of the "best" checkpoint, so I would always just use the latest one. There is no TensorBoard integration, but there is Weights and Biases integration, which basically does the same thing. You can create a free account there, log in on the command line and then use the --wandb flag when starting a training run.

Can you provide some guidance on what is different about my voice that no cloning software is effective (I've tried many)?

Actually I think your voice should be quite simple to clone. The recording is good. For Toucan specifically, the system is conditioned on short audios of 6-10 seconds, so the length might be an issue, but voice cloning is not one of Toucan's strengths anyway. To get a good clone out of Toucan, you would need to finetune the model to your voice. A few minutes of finetuning data is enough. It's surprising though that you say no other software is good at cloning it either, I would have thought that systems optimized for cloning should be able to do it quite well.

@MrEdwards007
Author

This is really good information and I am looking into it. A few questions and a problem.

Questions

  • Roughly, how much audio data is required to get extremely close to (nearly indistinguishable from) the training data?
  • Can the number of checkpoints be more than 5? I really want to keep all of them so that I can see which ones are best for my use case. I don't want to have to train a model again because I overtrained it.
  • Once you've created your fine-tuned model, how do you reference it?
  • Where are the base models? I've been looking through directories but haven't found models to build upon.

Problem

I created a data set and have been following the directions for finetuning, but I have encountered an error that I've been unable to resolve. The Hugging Face file can't be found.

Here is my process

python run_training_pipeline.py --gpu_id 0 --model_save_dir /home/homer/Documents/Programs/IMS-Toucan/WEdwards-Checkpoints --wandb finetuning_example_simple

In the file "corpus_preparation" in the method "prepare_tts_corpus"
Line 56
path_to_checkpoint=hf_hub_download(repo_id="Flux9665/ToucanTTS", filename="Aligner.pt"),
The file can't be found.

Here is the trace


/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/_distutils_hack/__init__.py:31: UserWarning: Setuptools is replacing distutils. Support for replacing an already imported distutils is deprecated. In the future, this condition will fail. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
warnings.warn(
Loaded an Aligner dataset with 228 datapoints from Corpora/integration_test.
Traceback (most recent call last):
File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
response.raise_for_status()
File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/Flux9665/ToucanTTS/resolve/main/Aligner.pt

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/homer/Documents/Programs/IMS-Toucan/run_training_pipeline.py", line 114, in <module>
pipeline_dict[args.pipeline](gpu_id=args.gpu_id,
File "/home/homer/Documents/Programs/IMS-Toucan/Recipes/finetuning_example_simple.py", line 40, in run
train_data = prepare_tts_corpus(transcript_dict=build_path_to_transcript(),
File "/home/homer/Documents/Programs/IMS-Toucan/Utility/corpus_preparation.py", line 57, in prepare_tts_corpus
path_to_checkpoint=hf_hub_download(repo_id="Flux9665/ToucanTTS", filename="Aligner.pt"),
File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
return f(*args, **kwargs)
File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1232, in hf_hub_download
return _hf_hub_download_to_cache_dir(
File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1295, in _hf_hub_download_to_cache_dir
(url_to_download, etag, commit_hash, expected_size, head_call_error) = _get_metadata_or_catch_error(
File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1746, in _get_metadata_or_catch_error
metadata = get_hf_file_metadata(
File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1666, in get_hf_file_metadata
r = _request_wrapper(
File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 364, in _request_wrapper
response = _request_wrapper(
File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 388, in _request_wrapper
hf_raise_for_status(response)
File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 417, in hf_raise_for_status
raise _format(EntryNotFoundError, message, response) from e
huggingface_hub.errors.EntryNotFoundError: 404 Client Error. (Request ID: Root=1-66fe1e1b-261d66b223e2d6206c89a64b;5d2d317c-734e-4328-aa03-bc331fa6ef03)

@Flux9665
Collaborator

Flux9665 commented Oct 3, 2024

Re: Questions

Roughly, how much audio data is required to get extremely close to (nearly indistinguishable from) the training data?

I never explored when exactly that threshold occurs, but considering some research of other people and some of the language-learning experiments I've done, I would guess around 30 minutes. The more data you have, the longer you can train for, so having more surely won't hurt. It will probably forget about some of its multilingual capabilities though, if you finetune on a single language, but it will get much better at cloning this voice.

Can the number of checkpoints be more than 5? I really want to keep all of them so that I can see which ones are best for my use case. I don't want to have to train a model again because I overtrained it.

Sure, that's just storage optimization. But for generative tasks, overtraining is a rare phenomenon. Basically, if you have enough data (for TTS I would say 1 or 2 hours is enough), a generative model can almost never overfit. Just delete this line:

delete_old_checkpoints(save_directory, keep=5)

And then you should also change the amount of training steps in the finetuning recipe, since the default is kept very low:

Just set it to some very large number and kill the process yourself once you're satisfied.

Once you've created your fine-tuned model, how do you reference it?

You instantiate an InferenceInterfaces/ToucanTTSInterface and pass the path to your checkpoint as tts_model_path. Then you can use it in the same way as the pretrained model. If you want to use the GUI demo, you would add tts_model_path="path to your checkpoint goes here" to the following initialization of the GUI:

TTSWebUI(gpu_id="cuda" if torch.cuda.is_available() else "cpu")

Where are the base models? I've been looking through directories but haven't found models to build upon.

I am not sure what you mean by base model. Do you mean the architecture of the model? That's in Modules/ToucanTTS/ToucanTTS.py

Re: Problem

Ah yes, sorry about that, the huggingface integration is brand new, I didn't have time to properly test it yet and there was indeed a file missing. I uploaded it to the hub, it should be working now. If you encounter any more missing files, please let me know.

@MrEdwards007
Author

IMS-Toucan/Modules/ToucanTTS/toucantts_train_loop.py

Line 195 in 07edf9c
delete_old_checkpoints(save_directory, keep=5) #

I would consider this a bug. If there are existing checkpoints in the folder when you start, they are counted towards your 5 checkpoints. I didn't clear the existing checkpoints, so only the "best.pt" was being saved. It took a minute of watching the directory before I realized what was happening.
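
To illustrate why leftover files can crowd out new ones, here is a sketch of a keep-the-newest-N policy. It assumes the retention logic ranks checkpoints by the step number in names like checkpoint_21208.pt, which this thread does not confirm; checkpoints_to_delete is a hypothetical stand-in, not the body of delete_old_checkpoints.

```python
import re

CHECKPOINT_NAME = re.compile(r"^checkpoint_(\d+)\.pt$")


def checkpoints_to_delete(filenames, keep=5):
    """Return the checkpoint files that a keep-the-newest-N policy would remove."""
    numbered = []
    for name in filenames:
        match = CHECKPOINT_NAME.match(name)
        if match:  # files such as best.pt do not match and are never removed
            numbered.append((int(match.group(1)), name))
    numbered.sort()  # lowest step number first
    surplus = max(len(numbered) - keep, 0)
    return [name for _, name in numbered[:surplus]]


# Five checkpoints left over from an earlier run, plus the first one of a new run.
files = ["best.pt", "checkpoint_25000.pt", "checkpoint_25500.pt",
         "checkpoint_26000.pt", "checkpoint_26500.pt", "checkpoint_27000.pt",
         "checkpoint_11.pt"]
print(checkpoints_to_delete(files))
```

Under this assumption the fresh checkpoint has the lowest step number and is the one removed, which matches the observation that only best.pt survived.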

@MrEdwards007
Author

Questions

  • Would you elaborate on how to use the fine-tuned model? It is not apparent how the model or the model path is passed to "run_text_to_file_reader.py".

def read_texts(sentence, filename, model_id=None, device="cpu", language="eng", speaker_reference=None, duration_scaling_factor=1.0):

  • Is there a specific location where models are looked for?
  • Are there pre-trained voice models? I was looking through the directories to see if there was a 'Paul' or 'Nancy' or others and, if so, where do you download them to and how do you reference them?

Problem
I created a fine-tuned model with 10 minutes of audio and it appeared to have zero effect on the sound output. So either I connected the fine-tuned model incorrectly, or there wasn't enough audio data. I used 27000 steps (yes, really, as I didn't know a good place to stop), which probably caused the model collapse that I've read about. So, I'm going to start again with 2 hours of audio, so that I know that I have enough audio data to work with.

I don't know what I'm looking for, as I was accustomed to looking at the total loss and finding the lowest point in the graph. Here, I don't know what to look for in the training graphs or numbers.

Would you provide guidance on the learning rate and how it should or could be adjusted?

wandb: Run history:
wandb: duration_loss █▇▆▆▄▄▅▃▄▂▃▂▃▂▃▃▂▄▃▂▄▃▂▃▁▃▃▁▃▂▃▃▂▁▄▁▁▂▂▂
wandb: energy_loss ▅▇▇▅▇▆▆▅█▄▁▃▆▃█▆▁▅▇▄█▄▅█▃▆▄▃▅▅█▆▄▄▆▂▄▃▅▅
wandb: learning_rate ▄█████████▇▇▇▇▇▆▆▆▆▅▅▅▄▄▄▃▃▃▃▃▂▂▂▂▂▁▁▁▁▁
wandb: pitch_loss ▃▆▆▄▆▅▅▄▇▃▁▃▅▃▇▆▂▄▇▃█▃▅▇▃▆▃▃▅▅█▆▃▄▇▂▃▃▄▅
wandb: regression_loss █▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: stochastic_loss ▅▆▆▂▆▅▆▅▆▃▁▃▅▃▇▅▁▄▆▄█▄▅▇▃▆▃▃▅▅▇▅▃▄▆▂▄▂▅▄
wandb:
wandb: Run summary:
wandb: duration_loss 2.99871
wandb: energy_loss 1.96445
wandb: learning_rate 0.0
wandb: pitch_loss 1.89659
wandb: regression_loss 0.27856
wandb: stochastic_loss 2.00286

@MrEdwards007
Author

When running the demo, I frequently get this warning that the audio may be clipped. I can attest that it is being clipped, as I can hear it and see it in Audacity. I don't know what the cause is. I tried using a lower-volume reference, but that didn't resolve the issue.

Is there something that I can do to resolve this?

/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/pyloudnorm/normalize.py:62: UserWarning: Possible clipped samples in output.
warnings.warn("Possible clipped samples in output.")

@Flux9665
Collaborator

Flux9665 commented Oct 4, 2024

I would consider this a bug. If there are existing checkpoints in the folder when you start, they are counted towards your 5 checkpoints.

This is intended behaviour, but yes maybe a warning message telling you that the directory is not empty and the ´resume´ flag is not present would be a good idea. I recommend changing the name of the model directory for every individual run to describe what makes this run unique:

save_dir = os.path.join(MODELS_DIR, "ToucanTTS_FinetuningExample") # RENAME TO SOMETHING MEANINGFUL FOR YOUR DATA

Would you elaborate on how to use the fine-tuned model? It is not apparent how the model or the model path is passed to "run_text_to_file_reader.py".

You pass either the absolute or the relative path to any .pt checkpoint to the ´model_id´ parameter of that function. Alternatively, you can just specify the ID of the model, which is what comes after "ToucanTTS_" in the name of the directory you save the checkpoints to. It will then look in the model dir (see below) and find the ´best.pt´ from the ToucanTTS_ directory in the model dir.

MODELS_DIR = "Models/"
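
The lookup described above can be sketched as follows. This is a paraphrase of the explanation, not the repository's actual code: resolve_model_path is a hypothetical name, and the model_id=None default (the pretrained model from Hugging Face) is not covered.

```python
import os

MODELS_DIR = "Models/"


def resolve_model_path(model_id, models_dir=MODELS_DIR):
    """Turn a model_id into a checkpoint path, following the rule described above."""
    if model_id.endswith(".pt"):
        # An absolute or relative path to a checkpoint is used as given.
        return model_id
    # Otherwise the ID is whatever follows "ToucanTTS_" in the directory name.
    return os.path.join(models_dir, f"ToucanTTS_{model_id}", "best.pt")


print(resolve_model_path("FinetuningExample"))
print(resolve_model_path("/path/to/checkpoint_21208.pt"))
```

With the example directory name from earlier in this comment, the ID would be "FinetuningExample".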

Are there pre-trained voice models? I was looking through the directories to see if there was a 'Paul' or 'Nancy' or others and, if so, where do you download them to and how do you reference them?

The pretrained model is the one that is automatically loaded from huggingface as the default. Any other models you would need to train yourself. This pretrained one is very universal, so I didn't see a point in providing more than this one.

Problem

Early stopping in TTS does not make much sense, the losses don't tell you much. 27k steps is reasonable, you can even do a lot more. Best thing to look out for is the progress plot. If it looks good, then the model is probably good. The only way to make sure is to generate audio with it and listen. If the model collapses, then you will definitely see this in the progress plot. There is unfortunately no good objective metric for TTS, it's a big problem of the field.

If you trained the model and it did not change much, then you should probably increase the learning rate. Try 1e-4 here:

lr=1e-5, # if you have enough data (over ~1000 datapoints) you can increase this up to 1e-4 and it will still be stable, but learn quicker.

Clipped Samples

You can try to normalize it to a lower loudness in the following snippet. If that doesn't work, then that's unfortunately not fixable, since the vocoder directly produces those values. I also get the warning often, but I don't notice it when listening, only when inspecting the signal. Maybe some post-processing could be helpful.

Soon I will hopefully release another vocoder checkpoint with higher quality that will be used if a GPU is available during inference (on CPU it's too slow, on GPU it doesn't make a difference). Maybe that one will have fewer issues with clipping.
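
As one form the post-processing idea mentioned above could take, the sketch below scales a waveform down so that its peak stays under full scale before it is written to a file. It is a generic peak limiter on a list of float samples, not something Toucan provides, and limit_peak is a hypothetical name. It cannot repair samples that were already clipped earlier in the chain; it only avoids pushing samples past full scale.

```python
def limit_peak(samples, peak=0.95):
    """Scale float samples so that the largest magnitude is at most `peak`."""
    largest = max((abs(sample) for sample in samples), default=0.0)
    if largest <= peak:
        return list(samples)  # already within range, return an unchanged copy
    gain = peak / largest
    return [sample * gain for sample in samples]


print(limit_peak([0.5, -2.0, 1.0], peak=1.0))
```

The trade-off is that the whole utterance becomes quieter by the same factor, so the result ends up below the loudness target.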

@MrEdwards007
Author

I'll try and let you know how it turns out. I have at least 8 hours before this finetuning run is complete.

@MrEdwards007
Author

MrEdwards007 commented Oct 5, 2024

Question

I feel like I'm doing something wrong. I have two hours of high-quality audio data and I've approximately 10-11 hours of training time (subjective, as I know that completely depends on my equipment) and 42416 steps (objective, an absolute number) so far but this doesn't sound like my voice.

  • Is it normal to take this many steps for two hours of audio data? I have increased my steps to 70k, and I'll continue from my last checkpoint to see if the model starts sounding like me on the next iteration (which will stop before I get to 70k, based on the problem that I am experiencing).
  • Is there something that can be done to accelerate the training?
  • Is there a way to know if I'm doing this correctly?
  • Would doubling my audio file (using it twice, splicing together and then splitting to the required training dataset format) to create 4 hours of audio data help? This would make it so that I don't have to sit at the microphone for X additional hours.

If I knew that I could get an extremely high-quality voice based on my training data, I would use a rented GPU, since this would be a one-time event and a one-time cost, but I can't put money into renting a GPU until I know I can get the desired outcome. Currently, this doesn't sound like me and I don't have a horizon to know when I can get there.

I can spend more time at the microphone or find other productions that I've done and combine for a longer session.

Problem

Oddly, the training run stopped at

  1. checkpoint_21208.pt
  2. checkpoint_42416.pt (exactly double).

I'm not sure what is happening but I am seeing a pattern with training failure. I thought I was going crazy but in the last three training runs, attempting to fine-tune a model against two hours of my voice data, the process stops roughly every 5-5.5 hours. After reading the documentation a few times, I saved my checkpoints and continued to train from a previous checkpoint, so that I do not need to start from zero (thank goodness). I do not know what this means or how to address it.

if use_wandb:
    wandb.init(
        name=f"{__name__.split('.')[-1]}_{time.strftime('%Y%m%d-%H%M%S')}" if wandb_resume_id is None else None,
        id=wandb_resume_id,  # this is None if not specified in the command line arguments.
        resume="must" if wandb_resume_id is not None else None)

print("Training model")
train_loop(net=model,
           datasets=[train_data],
           device=device,
           save_directory=save_dir,
           batch_size=20,  # YOU MIGHT GET OUT OF MEMORY ISSUES ON SMALL GPUs, IF SO, DECREASE THIS.
           eval_lang="eng",  # THE LANGUAGE YOUR PROGRESS PLOTS WILL BE MADE IN
           warmup_steps=500,
           lr=1e-4,  # if you have enough data (over ~1000 datapoints) you can increase this up to 1e-4 and it will still be stable, but learn quicker.
           # DOWNLOAD THESE INITIALIZATION MODELS FROM THE RELEASE PAGE OF THE GITHUB OR RUN THE DOWNLOADER SCRIPT TO GET THEM AUTOMATICALLY
           path_to_checkpoint=hf_hub_download(repo_id="Flux9665/ToucanTTS", filename="ToucanTTS.pt") if resume_checkpoint is None else resume_checkpoint,
           fine_tune=True if resume_checkpoint is None and not resume else finetune,
           resume=resume,
           steps=70000,
           use_wandb=use_wandb,
           train_samplers=[torch.utils.data.RandomSampler(train_data)],
           gpu_count=1)
if use_wandb:
    wandb.finish()

EPOCH COMPLETE

91%|██████████████████████████████████████████████████████████████████████████████▏ | 10/11 [00:07<00:00, 1.40it/s]
Epoch: 1925
Time elapsed: 321 Minutes
Reconstruction Loss: 0.198
Steps: 42383

selecting checkpoints...
loading model /home/homer/Documents/Programs/IMS-Toucan/WEdwards-Checkpoints/checkpoint_42383.pt
averaging...
saving model...
...done!
100%|██████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:09<00:00, 1.12it/s]

EPOCH COMPLETE

91%|██████████████████████████████████████████████████████████████████████████████▏ | 10/11 [00:07<00:00, 1.31it/s]
Epoch: 1926
Time elapsed: 321 Minutes
Reconstruction Loss: 0.198
Steps: 42394

selecting checkpoints...
loading model /home/homer/Documents/Programs/IMS-Toucan/WEdwards-Checkpoints/checkpoint_42394.pt
averaging...
saving model...
...done!
100%|██████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:09<00:00, 1.10it/s]

EPOCH COMPLETE

91%|██████████████████████████████████████████████████████████████████████████████▏ | 10/11 [00:07<00:00, 1.23it/s]
Epoch: 1927
Time elapsed: 321 Minutes
Reconstruction Loss: 0.198
Steps: 42405

selecting checkpoints...
loading model /home/homer/Documents/Programs/IMS-Toucan/WEdwards-Checkpoints/checkpoint_42405.pt
averaging...
saving model...
...done!
100%|██████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:09<00:00, 1.16it/s]

EPOCH COMPLETE

91%|██████████████████████████████████████████████████████████████████████████████▏ | 10/11 [00:07<00:00, 1.30it/s]
Epoch: 1928
Time elapsed: 322 Minutes
Reconstruction Loss: 0.198
Steps: 42416

91%|██████████████████████████████████████████████████████████████████████████████▏ | 10/11 [00:09<00:00, 1.11it/s]
Traceback (most recent call last):
File "/home/homer/Documents/Programs/IMS-Toucan/run_training_pipeline.py", line 114, in <module>
pipeline_dict[args.pipeline](gpu_id=args.gpu_id,
File "/home/homer/Documents/Programs/IMS-Toucan/Recipes/finetuning_example_simple.py", line 57, in run
train_loop(net=model,
File "/home/homer/Documents/Programs/IMS-Toucan/Modules/ToucanTTS/toucantts_train_loop_arbiter.py", line 55, in train_loop
mono_language_loop(net=net,
File "/home/homer/Documents/Programs/IMS-Toucan/Modules/ToucanTTS/toucantts_train_loop.py", line 217, in train_loop
path_to_most_recent_plot = plot_progress_spec_toucantts(model,
File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/homer/Documents/Programs/IMS-Toucan/Utility/utils.py", line 71, in plot_progress_spec_toucantts
tf = ArticulatoryCombinedTextFrontend(language=lang)
File "/home/homer/Documents/Programs/IMS-Toucan/Preprocessing/TextFrontend.py", line 594, in __init__
self.phonemizer_backend = EspeakBackend(language=self.g2p_lang,
File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/phonemizer/backend/espeak/espeak.py", line 45, in __init__
super().__init__(
File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/phonemizer/backend/espeak/base.py", line 45, in __init__
self._espeak = EspeakWrapper()
File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/phonemizer/backend/espeak/wrapper.py", line 60, in __init__
self._espeak = EspeakAPI(self.library())
File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/phonemizer/backend/espeak/api.py", line 84, in __init__
self._library = ctypes.cdll.LoadLibrary(str(espeak_copy))
File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/ctypes/__init__.py", line 452, in LoadLibrary
return self._dlltype(name)
File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/ctypes/__init__.py", line 374, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /tmp/tmpwjn_1le0/libespeak-ng.so.1.1.49: failed to map segment from shared object
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb:
wandb: Run history:
wandb: duration_loss ▃▃▅▃▄▅▆▅▄▃▅▂▆▄▆▃▆▆▇▃▆▄▃▁▁▄▁▄█▃▄▄▅▆▅▇▄▂▂▄
wandb: energy_loss ▃▁▆▄▄▇▇▆▅▄▄▃▅▅▇▃▇▅▇▂▆▃▄▁▃▃▁▅█▄▅▅▅▇▅▇▄▂▂▅
wandb: learning_rate █▇▇▆▆▆▅▅▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: pitch_loss ▄▁▆▄▄▆▇▆▄▄▄▂▅▅▆▃▆▆▆▃▅▄▃▁▂▄▁▅█▄▅▅▆▇▆▇▄▃▂▅
wandb: regression_loss ▆█▅▅▅▅▄▃▃▃▃▄▃▃▃▂▂▂▂▂▂▂▃▃▂▁▃▂▁▁▂▂▂▁▂▁▁▁▂▂
wandb: stochastic_loss ▃▂▅▄▃▆▆▅▃▄▃▁▅▅▆▃▅▅▇▄▄▃▃▁▁▄▁▅▇▄▄▄▅█▅▆▃▂▂▅
wandb:
wandb: Run summary:
wandb: duration_loss 2.46269
wandb: energy_loss 1.948
wandb: learning_rate 0.0
wandb: pitch_loss 1.91329
wandb: regression_loss 0.19839
wandb: stochastic_loss 2.05463
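
The "exactly double" pattern follows from the numbers in the log above. The step counter advances by a fixed amount per epoch, and the resumed run also stopped at epoch 1928 after about 322 minutes, so the second stop lands at twice the first step count. This only reproduces the arithmetic from the printed values. It assumes that the epoch counter restarts when resuming and that the first run also stopped at epoch 1928, which is inferred from checkpoint_21208.pt rather than shown in the log. It does not say why the process stops at that point.

```python
# Values read from the training log above.
steps_per_epoch = 42416 - 42405  # difference between consecutive "Steps:" lines
epochs_at_crash = 1928           # "Epoch:" value printed just before the error

steps_in_one_run = epochs_at_crash * steps_per_epoch
print(steps_in_one_run)      # matches checkpoint_21208.pt
print(2 * steps_in_one_run)  # matches checkpoint_42416.pt
```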

@Flux9665
Copy link
Collaborator

Flux9665 commented Oct 5, 2024

2 hours of training data is more than enough, you don't need to record additional data or change it in any way. The error you encountered is related to espeak, it usually occurs when your system goes to sleep mode or you log out.

70k steps should definitely be enough. Do you notice any change in the voice between checkpoints? If not, the model might not be loaded correctly. Can you double check at which location in the code you select the checkpoint?

Also, do the loss curves change in wandb? The regression loss should definitely go down.

@MrEdwards007
Author

MrEdwards007 commented Oct 5, 2024

I haven't heard a difference between the checkpoints.
I considered that I may not be loading the model correctly and modified existing code, so I could hear the difference.
The same code, referencing two different checkpoints

import os
import torch
from InferenceInterfaces.ToucanTTSInterface import ToucanTTSInterface


def read_texts(sentence, filename, model_id=None, device="cpu", language="eng", speaker_reference=None, duration_scaling_factor=1.0):
    tts = ToucanTTSInterface(device=device, tts_model_path=model_id)
    tts.set_language(language)
    print("speaker_reference", speaker_reference)
    if speaker_reference is not None:
        tts.set_utterance_embedding(speaker_reference)
    if isinstance(sentence, str):  # accept a single sentence as well as a list
        sentence = [sentence]
    tts.read_to_file(text_list=sentence, file_location=filename, duration_scaling_factor=duration_scaling_factor, prosody_creativity=0.0)
    del tts

if __name__ == '__main__':
    exec_device = "cuda" if torch.cuda.is_available() else "cpu"

    # quick, easy sentences.

    sentence=[]
    sentence += ["Cybersecurity is the practice of protecting systems, networks, and programs from digital attacks aimed at accessing, changing, or destroying sensitive information."]
    sentence += ["A strong cybersecurity framework involves implementing multiple layers of defense across technology, processes, and people to reduce the risk of breaches."]
    sentence += ["Continuous monitoring and timely updates are essential in cybersecurity to address evolving threats and vulnerabilities within any organization’s infrastructure."]	    
    
    checkpoint  = "checkpoint_21208"
    model_path  = f"/home/homer/Documents/Programs/IMS-Toucan/Experiment/{checkpoint}.pt"
    output_file = f"output_audio_{checkpoint}.wav"
    speaker_reference = None

    read_texts(sentence=sentence, 
               filename=output_file, 
               model_id=model_path, 
               device=exec_device, 
               language="eng", speaker_reference=speaker_reference)
               
    checkpoint  = "checkpoint_44968"
    model_path  = f"/home/homer/Documents/Programs/IMS-Toucan/Experiment/{checkpoint}.pt"
    output_file = f"output_audio_{checkpoint}.wav"

    read_texts(sentence=sentence, 
               filename=output_file, 
               model_id=model_path, 
               device=exec_device, 
               language="eng", speaker_reference=speaker_reference)
   

In terms of wandb, I don't know what I'm looking at. I was seeking to learn what to look for. I understood it with TensorBoard, but wandb is new to me. I don't know which loss to look at here.

image

@Flux9665
Collaborator

Flux9665 commented Oct 5, 2024

That looks absolutely correct. I would not expect a big change between step 22k and step 45k, because at 22k the model should already be pretty close to done and the changes after that are getting smaller and smaller. I think most interesting would be the changes within the first few hundred steps. But either way, it should definitely sound pretty close to you at this point.

Does the regression loss curve go down? Maybe try an even higher learning rate, maybe 1e-3. That shouldn't make a difference, but it might be worth trying.

Another option would be to train from scratch without finetuning. 2 hours should be enough for that as well.

@MrEdwards007
Author

MrEdwards007 commented Oct 5, 2024

I think I know what you are asking for, after thinking about it a bit more. Knowing that I should be looking for the regression loss was excellent information. I changed the smoothing, so I can see if it was decreasing and it remains flat, so I think it is not benefiting me to continue to train additional steps, if I am understanding this correctly.
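
For anyone checking the same thing on exported loss values, here is a small sketch of that kind of smoothing. It uses a plain exponential moving average plus a first-versus-last window comparison. Both helpers are hypothetical, and the smoothing in the wandb UI may be computed differently.

```python
def smooth(values, weight=0.9):
    """Exponential moving average; a higher weight gives a smoother curve."""
    smoothed = []
    running = None
    for value in values:
        running = value if running is None else weight * running + (1 - weight) * value
        smoothed.append(running)
    return smoothed


def trend(values, window):
    """Mean of the last `window` values minus the mean of the first `window` values."""
    head = values[:window]
    tail = values[-window:]
    return sum(tail) / len(tail) - sum(head) / len(head)


print(smooth([1.0, 0.0], weight=0.5))
print(trend([4.0, 2.0, 3.0, 1.0], window=2))
```

A clearly negative trend on the regression loss means it is still going down; a value near zero corresponds to the flat curve described above.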

It really does not sound like me. The only parallel is that it is a male voice and it is reasonably deep in comparison, but the prosody sounds like a different person.

Maybe try an even higher learning rate, maybe 1e-3

I will give that a try.

Another option would be to train from scratch without finetuning.

That will be my next thing I try after changing the learning rate.

image

@Flux9665
Collaborator

Flux9665 commented Oct 5, 2024

I just recorded 100 sentences of myself and I will see if I can get a model to sound like myself with that.

@Flux9665
Collaborator

Flux9665 commented Oct 5, 2024

This is the regression loss of a run I just started. I am using 10 minutes of data I just recorded and after 8 minutes of training it looks pretty good already. I didn't change any of the defaults in the finetuning_simple recipe.

https://api.wandb.ai/links/flux9665/uldmaz9k

@MrEdwards007
Author

I'll start over without the fine-tuning and let you know how it turned out.
Thank you for ALL your assistance. It is greatly appreciated.

@MrEdwards007
Author

MrEdwards007 commented Oct 6, 2024

Question
I unfortunately was not clear on the normal (without fine-tuning) process for developing a model.
I just kicked off "nancy" using the same dataset, but I wasn't seeing a place for training steps, batch size, etc.
It ran out of memory after 738 steps on my 16 GB of VRAM. I don't know for sure that 'nancy' is the right way to go, but I hope there is a path forward.

Is 'nancy' the way to go to create a model from scratch?

Problem
Since I planned on letting it run overnight, I tried the fine-tuned process again and not surprisingly, I didn't get the desired result of having a model that sounded like myself.

The training stopped again at 21208 steps

EPOCH COMPLETE

91%|██████████████████████████████████████████████████████████████████████████████▏ | 10/11 [00:07<00:00, 1.22it/s]
Epoch: 1926
Time elapsed: 321 Minutes
Reconstruction Loss: 0.199
Steps: 21186

selecting checkpoints...
loading model /home/homer/Documents/Programs/IMS-Toucan/WEdwards-Checkpoints/checkpoint_21186.pt
averaging...
saving model...
...done!
100%|██████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:10<00:00, 1.09it/s]

EPOCH COMPLETE

91%|██████████████████████████████████████████████████████████████████████████████▏ | 10/11 [00:07<00:00, 1.24it/s]
Epoch: 1927
Time elapsed: 321 Minutes
Reconstruction Loss: 0.199
Steps: 21197

selecting checkpoints...
loading model /home/homer/Documents/Programs/IMS-Toucan/WEdwards-Checkpoints/checkpoint_21197.pt
averaging...
saving model...
...done!
100%|██████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:09<00:00, 1.12it/s]

EPOCH COMPLETE

91%|██████████████████████████████████████████████████████████████████████████████▏ | 10/11 [00:07<00:00, 1.28it/s]
Epoch: 1928
Time elapsed: 321 Minutes
Reconstruction Loss: 0.199
Steps: 21208

mmap() failed: Cannot allocate memory
Failed to create permanent mapping for memfd region with ID = 2344500660
Ignoring received block reference with non-registered memfd ID = 2344500660
91%|██████████████████████████████████████████████████████████████████████████████▏ | 10/11 [00:08<00:00, 1.16it/s]
Traceback (most recent call last):
  File "/home/homer/Documents/Programs/IMS-Toucan/run_training_pipeline.py", line 114, in <module>
    pipeline_dict[args.pipeline](gpu_id=args.gpu_id,
  File "/home/homer/Documents/Programs/IMS-Toucan/Recipes/finetuning_example_simple.py", line 57, in run
    train_loop(net=model,
  File "/home/homer/Documents/Programs/IMS-Toucan/Modules/ToucanTTS/toucantts_train_loop_arbiter.py", line 55, in train_loop
    mono_language_loop(net=net,
  File "/home/homer/Documents/Programs/IMS-Toucan/Modules/ToucanTTS/toucantts_train_loop.py", line 217, in train_loop
    path_to_most_recent_plot = plot_progress_spec_toucantts(model,
  File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/homer/Documents/Programs/IMS-Toucan/Utility/utils.py", line 71, in plot_progress_spec_toucantts
    tf = ArticulatoryCombinedTextFrontend(language=lang)
  File "/home/homer/Documents/Programs/IMS-Toucan/Preprocessing/TextFrontend.py", line 594, in __init__
    self.phonemizer_backend = EspeakBackend(language=self.g2p_lang,
  File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/phonemizer/backend/espeak/espeak.py", line 45, in __init__
    super().__init__(
  File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/phonemizer/backend/espeak/base.py", line 45, in __init__
    self._espeak = EspeakWrapper()
  File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/phonemizer/backend/espeak/wrapper.py", line 60, in __init__
    self._espeak = EspeakAPI(self.library())
  File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/phonemizer/backend/espeak/api.py", line 84, in __init__
    self._library = ctypes.cdll.LoadLibrary(str(espeak_copy))
  File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/ctypes/__init__.py", line 452, in LoadLibrary
    return self._dlltype(name)
  File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /tmp/tmpypnt_wge/libespeak-ng.so.1.1.49: failed to map segment from shared object
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: / 514.951 MB of 514.951 MB uploaded (0.000 MB deduped)
wandb: Run history:
wandb: duration_loss █▅▅▄▃▄▃▂▃▃▂▂▃▂▂▂▂▂▂▃▆▃▂▂▃▂▁▂▂▂▂▁▂▁▁▁▁▁▁▂
wandb: energy_loss ▆▇▆█▇▅█▃▆▆▄▃▆▄▅▂▅▅▄▇▇█▅▇▇▅▁▄▅▆▅▂▅▁▂▄▄▃▄▄
wandb: learning_rate ▁███████████▇▇▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▁▁
wandb: pitch_loss ▅▇▅█▆▅█▃▇▆▄▂▆▄▄▂▅▅▅▆▇█▅▇█▅▁▅▆▆▅▂▅▂▃▅▄▃▄▆
wandb: regression_loss █▅▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: stochastic_loss ▅▆▅▇▅▄▇▃▇▆▄▃▇▃▄▂▅▅▅▇██▅▇▇▆▁▅▆▇▆▂▆▂▂▄▄▄▂▄
wandb:
wandb: Run summary:
wandb: duration_loss 2.4607
wandb: energy_loss 1.98393
wandb: learning_rate 3e-05
wandb: pitch_loss 1.85372
wandb: regression_loss 0.19899
wandb: stochastic_loss 2.05225

@MrEdwards007
Author

I may have hit the jackpot, as I was looking through my issues and the code, I came to this

def train_loop(net,  # an already initialized ToucanTTS model that should be trained.
               datasets,
               # a list of datasets to train on. Every dataset within a language should already be a concat dataset of all the datasets
               # in that language. So every list entry here should be a (combined) dataset for each language. For the case of a monolingual model, pass a list
               # with only one dataset in it. This will trigger the arbiter to call the train loop for simple one language training runs rather than the complex
               # LAML based one.
               train_samplers,  # the sampler(s) for the dataloader(s) (gpu_count or single GPU use different ones)
               gpu_count,  # amount of GPUs to use
               device,  # the device where this training should run on.
               save_directory,  # directory where the models and visualizations should be saved.
               steps_per_checkpoint=None,  # how many steps should be trained before a checkpoint is created. This is only relevant for the multilingual case,
               # the monolingual case will do this once per epoch, regardless of the steps.
               path_to_checkpoint=None,  # path to a trained checkpoint to either continue training or fine-tune from.
               lr=0.0001,  # learning rate of the model.
               resume=False,  # whether to automatically load the most recent checkpoint and resume training from it.
               warmup_steps=4000,  # how many steps until the learning rate reaches the specified value and starts decreasing again.
               use_wandb=False,  # whether to use online experiment tracking with weights and biases. Requires prior CLI login.
               batch_size=32,  # how many samples to put into one batch. Higher batch size is more stable, but requires more VRAM.
               eval_lang="eng",  # in which language the evaluation sentence is to be plotted.
               fine_tune=False,  # whether to use the provided checkpoint as basis for fine-tuning.
               steps=200000,  # how many updates to run until training is completed
               use_less_loss=False,  # whether to use the loss that enforces a structure in the language embedding space
               freeze_lang_embs=False,  # whether to use the language embeddings from a checkpoint without modifying them, to maintain compatibility with the zero-shot method. This treats language embeddings from the given checkpoint as constants.
               ):

I didn't see which method called it, but I saw that the default was "batch_size=32", so I changed it to "batch_size=16".
batch_size=16, # how many samples to put into one batch. Higher batch size is more stable, but requires more VRAM.

So far, nothing has stopped, so I'll see how this turns out.

@MrEdwards007
Author

I know this plot is here for a reason, but I don't know how to interpret it. My guess is that one line is the ideal graph and the other is the current state. If that is accurate, which line represents the current state?

image

@Flux9665
Collaborator

Flux9665 commented Oct 6, 2024

Nancy is the correct pipeline to adapt. Some of the arguments are not set explicitly, so the defaults will be used. I just made some of them explicit instead.

The train_loop_arbiter that you found is called by every training recipe. This is where the defaults are, so yes, you changed the batchsize correctly.

After warmup_steps * 2, a secondary decoder will start to be trained. Then the memory requirements will increase, so that's a point where it might crash. (default warmup_steps is 4000, so the memory requirements will increase after 8000 steps and then stay constant.)
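As a small worked example of the arithmetic above: the step at which memory use is expected to rise is simply twice `warmup_steps`. The helper names below are made up for illustration and are not part of the toolkit; the 11 batches per epoch are taken from the fine-tuning log earlier in this thread and will differ for other datasets and batch sizes.

```python
def secondary_decoder_start_step(warmup_steps: int) -> int:
    """Step after which the secondary decoder starts training (warmup_steps * 2)."""
    return 2 * warmup_steps


def full_epochs_before(step: int, batches_per_epoch: int) -> int:
    """How many complete epochs fit into `step` updates."""
    return step // batches_per_epoch


# Default warmup_steps in train_loop is 4000.
start = secondary_decoder_start_step(4000)
# 11 batches per epoch is what the fine-tuning log above shows; adjust for your run.
epochs = full_epochs_before(start, batches_per_epoch=11)
```

If a run survives well past this step without an out-of-memory error, the chosen batch size should be safe for the rest of training, since the requirements stay constant afterwards.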

The lines on the plot are the pitch contour and the energy contour that are predicted by the model. They should not look the same.

Check your results after ~20k steps, if it does not sound good by then, we might need to use the scorer to clean some of the data.

@Flux9665
Collaborator

Flux9665 commented Oct 7, 2024

How does one add emphasis

I just finished implementing the new GUI for inference where you can control how exactly an utterance is produced. If you pull the newest version of the code and update the requirements, you can run the advanced GUI script. I hope it's pretty intuitive.

@MrEdwards007
Author

I'm looking forward to looking at this tomorrow.

I went for broke and did the 'nancy' pipeline run for all 200k steps.
The regression loss continued to go down, though at extremely small amounts. I also continued to look at the progress plot as it continued to change. However, this did not produce a desirable result. I would grab sample checkpoints every so often to see what it sounded like. I recognize my pacing and some of my intonations but it wasn't a desirable outcome.

Now I'm wondering if it is my dataset. Although I have two hours of good audio, maybe it's my splitter.

I'm attempting to explore anything that I can do on my side to support the ideal outcome. I wrote a splitter using Whisper (which I often use for transcriptions) and initially found errors in the transcription splitting, so I ran it through a second pass (on the split wav files), as a validation procedure. Spot checking some of them after the second pass, I would occasionally see hallucinations of a word (sometimes two) that was not in the audio.
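The two-pass validation described above can be partly automated. The following is only a sketch, assuming both transcriptions of a clip are available as plain strings; the function names are hypothetical and nothing here depends on Whisper itself. It uses the standard library's `difflib` to list words that appear in the second pass but not in the first, which is where a hallucinated word or two would show up.

```python
import difflib
import re


def normalize(text: str) -> list[str]:
    """Lowercase and drop punctuation so that only word differences are compared."""
    return re.findall(r"[a-z0-9']+", text.lower())


def extra_words(first_pass: str, second_pass: str) -> list[str]:
    """Words the second transcription contains that the first one does not have in that position."""
    a, b = normalize(first_pass), normalize(second_pass)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    extras = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            extras.extend(b[j1:j2])
    return extras


def agreement(first_pass: str, second_pass: str) -> float:
    """Similarity of the two word sequences, between 0.0 and 1.0."""
    a, b = normalize(first_pass), normalize(second_pass)
    return difflib.SequenceMatcher(a=a, b=b, autojunk=False).ratio()
```

Clips whose `agreement` falls below some threshold (the threshold is a judgment call) or whose `extra_words` list is non-empty would be candidates for manual review or removal before training.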

This is my plot after 200k steps

image

@Flux9665
Copy link
Collaborator

Flux9665 commented Oct 9, 2024

The plot looks pretty decent I would say, the boundaries in the spectrogram line up with the phonemes pretty well.

If there are some incorrect labels, this messes the TTS up pretty badly, it is not robust against mistakes in the labels at all. In that case you should definitely run the scorer to find and remove datapoints that appear to have problems from the dataset cache. I'll give instructions below.

Also, do the pauses in your speech align with some symbol in the text? The default heuristic is to use some punctuation marks as indicators of pauses (mostly commas coincide with pauses fairly well). If the text has no punctuation marks, the model cannot learn about pauses and gets confused by their existence in the audio without a corresponding indicator in the text. In that case you could run a punctuation restoration model on your texts and re-extract the dataset cache.
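A quick way to check for this before re-extracting the cache is to scan the transcripts for longer sentences without any internal punctuation. This is a rough sketch, not part of the toolkit: it assumes a pipe-separated metadata file with the clip ID in the first column and the transcript in the second (an LJSpeech-like layout), and the 12-word threshold is arbitrary.

```python
PAUSE_MARKS = set(",;:.!?")


def lacks_pause_marks(transcript: str, min_words: int = 12) -> bool:
    """True if a long transcript has no punctuation apart from its final character."""
    if len(transcript.split()) < min_words:
        return False
    inner = transcript.strip()[:-1]  # ignore the sentence-final mark
    return not any(ch in PAUSE_MARKS for ch in inner)


def flag_lines(lines, sep="|", text_column=1, min_words=12):
    """Return the IDs (first column) of entries that probably need punctuation restored."""
    flagged = []
    for line in lines:
        parts = line.rstrip("\n").split(sep)
        if len(parts) <= text_column:
            continue  # malformed line, skip it
        if lacks_pause_marks(parts[text_column], min_words):
            flagged.append(parts[0])
    return flagged
```

For example, `flag_lines(open("metadata.csv", encoding="utf-8"))` would list the clips to look at, where `metadata.csv` is a placeholder for wherever the transcripts live. If most long clips are flagged, running a punctuation restoration model is probably worthwhile.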

Using the scorer is pretty simple as well, you don't need to change much in the code. You specify the path to your 200k step model here

tts_scorer = TTSScorer(path_to_model=None, device=exec_device)

You specify the path to your dataset cache in the line directly below

tts_scorer.score(path_to_toucantts_dataset=os.path.join(PREPROCESSING_DIR, "IntegrationTest"), lang_id=lang_id)

And then in the line below that one you tell it to show you the X samples with the highest loss. I would set this number pretty large, maybe 500 or so and then have a look at the distribution.

tts_scorer.show_samples_with_highest_loss(20)

There's probably going to be a few samples that are really really bad and then fewer and fewer as you approach the average sample quality. You can use this to estimate how many of the worst samples you want to exclude and then do that with the line below:

tts_scorer.remove_samples_with_highest_loss(5)

@MrEdwards007
Author

MrEdwards007 commented Oct 13, 2024

Good day. I recreated my dataset with my newly created program. I found the previous version had anomalies when chunking and transcribing the data. I ran the "Nancy" pipeline for 200k steps and the resemblance was not convincing.

I did not run the scorer before I launched the most recent training run. Here I noticed what I think is a big problem for me.

Problem

I could be wrong, but this concerns me: I just realized that the cache files are dated from a week ago, even though I just ran 'Nancy' as a new regular training run. It takes 1.5 to 2 days for each 200k cycle to complete, so it takes a while to determine whether I have the desired outcome. If my new voice was built off the old cache and not the newly created dataset, then I need to know how to clear the cache or force the program to use the new dataset so this doesn't happen again.

Question

  • I would be curious to know if it would be of value to share my 'LJDataset Splitter' program
  • Does a new cache get created when you start a new run, as in not going from a previous checkpoint?
  • How to clear the cache? Is it as simple as deleting the following files

.../IMS-Toucan/Corpora/Nancy/aligner_train_cache.pt
.../IMS-Toucan/Corpora/Nancy/files_used.txt
.../IMS-Toucan/Corpora/Nancy/tts_train_cache.pt

Although I can hear myself in the output, there is a large difference between my voice and the one produced after 200k steps.

Voice_Comparison.zip

@MrEdwards007
Author

Reporting, so that you know that the following occurs on every long training run.
In relation to the previous report, my machine is neither sleeping nor being logged out.
Hopefully, that helps.

REPORT

EPOCH COMPLETE

99%|██████████████████████████████████████████████████████████████████████████████████████▉ | 80/81 [01:03<00:00, 1.42it/s]
Epoch: 1917
Time elapsed: 2090 Minutes
Reconstruction Loss: 0.179
Steps: 155277

selecting checkpoints...
loading model /home/homer/Documents/Programs/IMS-Toucan/WEdwards-Checkpoints/checkpoint_155277.pt
averaging...
saving model...
...done!
100%|████████████████████████████████████████████████████████████████████████████████████████| 81/81 [01:06<00:00, 1.22it/s]

EPOCH COMPLETE

99%|██████████████████████████████████████████████████████████████████████████████████████▉ | 80/81 [01:02<00:00, 1.19it/s]
Epoch: 1918
Time elapsed: 2092 Minutes
Reconstruction Loss: 0.179
Steps: 155358

selecting checkpoints...
loading model /home/homer/Documents/Programs/IMS-Toucan/WEdwards-Checkpoints/checkpoint_155358.pt
averaging...
saving model...
...done!
100%|████████████████████████████████████████████████████████████████████████████████████████| 81/81 [01:05<00:00, 1.24it/s]

EPOCH COMPLETE

99%|██████████████████████████████████████████████████████████████████████████████████████▉ | 80/81 [01:02<00:00, 1.28it/s]
Epoch: 1919
Time elapsed: 2093 Minutes
Reconstruction Loss: 0.179
Steps: 155439

selecting checkpoints...
loading model /home/homer/Documents/Programs/IMS-Toucan/WEdwards-Checkpoints/checkpoint_155439.pt
averaging...
saving model...
...done!
100%|████████████████████████████████████████████████████████████████████████████████████████| 81/81 [01:05<00:00, 1.24it/s]

EPOCH COMPLETE

99%|██████████████████████████████████████████████████████████████████████████████████████▉ | 80/81 [01:02<00:00, 1.28it/s]
Epoch: 1920
Time elapsed: 2094 Minutes
Reconstruction Loss: 0.179
Steps: 155520

selecting checkpoints...
loading model /home/homer/Documents/Programs/IMS-Toucan/WEdwards-Checkpoints/checkpoint_155520.pt
averaging...
saving model...
...done!
100%|████████████████████████████████████████████████████████████████████████████████████████| 81/81 [01:04<00:00, 1.25it/s]

EPOCH COMPLETE

99%|██████████████████████████████████████████████████████████████████████████████████████▉ | 80/81 [01:03<00:00, 1.28it/s]
Epoch: 1921
Time elapsed: 2095 Minutes
Reconstruction Loss: 0.179
Steps: 155601

99%|██████████████████████████████████████████████████████████████████████████████████████▉ | 80/81 [01:05<00:00, 1.23it/s]
Traceback (most recent call last):
  File "/home/homer/Documents/Programs/IMS-Toucan/run_training_pipeline.py", line 114, in <module>
    pipeline_dict[args.pipeline](gpu_id=args.gpu_id,
  File "/home/homer/Documents/Programs/IMS-Toucan/Recipes/ToucanTTS_Nancy.py", line 62, in run
    train_loop(net=model,
  File "/home/homer/Documents/Programs/IMS-Toucan/Modules/ToucanTTS/toucantts_train_loop_arbiter.py", line 55, in train_loop
    mono_language_loop(net=net,
  File "/home/homer/Documents/Programs/IMS-Toucan/Modules/ToucanTTS/toucantts_train_loop.py", line 217, in train_loop
    path_to_most_recent_plot = plot_progress_spec_toucantts(model,
  File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/homer/Documents/Programs/IMS-Toucan/Utility/utils.py", line 71, in plot_progress_spec_toucantts
    tf = ArticulatoryCombinedTextFrontend(language=lang)
  File "/home/homer/Documents/Programs/IMS-Toucan/Preprocessing/TextFrontend.py", line 594, in __init__
    self.phonemizer_backend = EspeakBackend(language=self.g2p_lang,
  File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/phonemizer/backend/espeak/espeak.py", line 45, in __init__
    super().__init__(
  File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/phonemizer/backend/espeak/base.py", line 45, in __init__
    self._espeak = EspeakWrapper()
  File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/phonemizer/backend/espeak/wrapper.py", line 60, in __init__
    self._espeak = EspeakAPI(self.library())
  File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/phonemizer/backend/espeak/api.py", line 84, in __init__
    self._library = ctypes.cdll.LoadLibrary(str(espeak_copy))
  File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/ctypes/__init__.py", line 452, in LoadLibrary
    return self._dlltype(name)
  File "/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /tmp/tmp3k_0p863/libespeak-ng.so.1.1.49: cannot map zero-fill pages
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: / 535.675 MB of 535.675 MB uploaded (0.000 MB deduped)
wandb: Run history:
wandb: duration_loss █▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: energy_loss █▄▃▃▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: learning_rate ▃██████████▇▇▇▇▇▆▆▆▆▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▂▁▁▁▁
wandb: pitch_loss █▃▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: regression_loss █▄▃▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: stochastic_loss ▁▁█▅▅▅▅▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
wandb:
wandb: Run summary:
wandb: duration_loss 0.54201
wandb: energy_loss 0.42444
wandb: learning_rate 1e-05
wandb: pitch_loss 0.39868
wandb: regression_loss 0.17908
wandb: stochastic_loss 0.69141
wandb:

@Flux9665
Collaborator

If a cache for a dataset already exists, the existing one is always loaded. To create a new cache, the easiest way is to delete the old aligner and tts caches, as you mentioned. Alternatively you could specify a different name for the cache save directory in the recipe, then a new one is created and you can keep the old one if you ever need to reproduce something. Since creating these caches for large datasets can take weeks, the point is to only extract them once and then load the same cache for every training run that includes this dataset.
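For the first option, moving the old files aside instead of deleting them keeps them available in case something needs to be reproduced. A minimal sketch, assuming the cache consists of the three file names listed earlier in this thread; the function name and the backup directory naming are made up, and any other derived files in the corpus directory are left untouched.

```python
import shutil
import time
from pathlib import Path

# File names as they appeared in the Corpora/Nancy listing above.
CACHE_FILES = ("aligner_train_cache.pt", "tts_train_cache.pt", "files_used.txt")


def set_aside_cache(corpus_dir, backup_root=None):
    """Move existing cache files into a timestamped sibling directory.

    Returns the backup directory and the list of file names that were moved.
    """
    corpus_dir = Path(corpus_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    if backup_root is None:
        backup_dir = corpus_dir.parent / f"{corpus_dir.name}_old_cache_{stamp}"
    else:
        backup_dir = Path(backup_root)
    moved = []
    for name in CACHE_FILES:
        src = corpus_dir / name
        if src.exists():
            backup_dir.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(backup_dir / name))
            moved.append(name)
    return backup_dir, moved
```

Calling `set_aside_cache(".../IMS-Toucan/Corpora/Nancy")` before the next run (with the real path filled in) should then cause a fresh cache to be extracted from the current dataset.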

Sharing your dataset splitter would surely be helpful. I intend to build an automated pipeline for people to make their own models with minimal effort in the future, but it's still pretty far down on the list.

Adding punctuation to your transcript, if you haven't already, would definitely help with the pauses. You would need to do this before extracting the new caches. Running the scorer afterwards would probably also be a good idea, because even a few imperfections in the data can really mess with the model.

The espeak DLL issue seems to be specific to your machine, I can't reproduce it. Maybe try using a different way of making the process permanent, like tmux.
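One thing that might help narrow this down on the affected machine: both tracebacks in this thread end while `ctypes` loads a temporary copy of libespeak-ng, and both runs stopped at a similar epoch count (1928 and 1921) despite very different step counts. That could hint at something accumulating once per epoch, but this is a guess and not a diagnosis. The sketch below assumes Linux, where `/proc/<pid>/maps` lists a process's memory mappings, and counts how many of them mention libespeak; the function names are made up.

```python
def count_mappings(maps_text: str, needle: str = "libespeak") -> tuple[int, int]:
    """Return (total mappings, mappings whose line contains `needle`)."""
    lines = [ln for ln in maps_text.splitlines() if ln.strip()]
    return len(lines), sum(1 for ln in lines if needle in ln)


def read_maps(pid="self") -> str:
    """Read the mapping table of a process (Linux only)."""
    with open(f"/proc/{pid}/maps", encoding="utf-8", errors="replace") as f:
        return f.read()
```

Running `count_mappings(read_maps(<pid of the training process>))` every few hours and comparing the total against the limit in `/proc/sys/vm/max_map_count` would show whether the numbers climb steadily. If they stay flat, the cause lies elsewhere.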

@MrEdwards007
Author

MrEdwards007 commented Oct 14, 2024

Understood. I just need to be mindful of the cache in the future.

In terms of the splitter, it can be found here:

https://github.com/MrEdwards007/LJ-Speech-Dataset-Creator/tree/main

I've begun using the scorer as of late.

I created a new dataset, cleared the cache and ran the scorer. After 70k steps, it sounds like a robotic version of me. I don't think that going to 200k steps is going to make it better. Stumped.

@Nestorovski

Since I only see Nancy, is there a way to add the Paul voice to the source as well, so that it would be easier for users to modify the GUI code and maybe recreate conversations by segmenting parts and appending them afterwards?

@Flux9665
Collaborator

@MrEdwards007 if it sounds like a robotic version of you, I think that's a good direction. The robotic part should be easier to fix than it not quite sounding like you. You can explore the learning rate, 1e-4 is a good default, but the size of the dataset can require a different learning rate. There might be other settings that can play a role. How many datapoints are left in your dataset after the cleaning etc?

@Nestorovski I am not sure if I understand what you mean. I don't know the Paul voice you are referring to. You can load a different voice in the GUI by clicking the "Load Example of Voice to mimic" button and then selecting a corresponding audio file. In order to concatenate multiple audios spoken by different speakers, you will have to generate each audio individually and then concatenate them yourself.
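The concatenation step can be done with the Python standard library alone. This is a sketch under assumptions: the generated files are uncompressed PCM WAV files (which is all the `wave` module reads) and share the same sample rate, channel count and sample width. The function name and the optional gap between turns are illustrative.

```python
import wave


def concatenate_wavs(paths, out_path, gap_seconds=0.0):
    """Join PCM WAV files in order, optionally with silence between them."""
    params = None
    chunks = []
    for path in paths:
        with wave.open(str(path), "rb") as w:
            fmt = (w.getnchannels(), w.getsampwidth(), w.getframerate())
            if params is None:
                params = fmt
            elif fmt != params:
                raise ValueError(f"{path} has format {fmt}, expected {params}")
            chunks.append(w.readframes(w.getnframes()))
    if params is None:
        raise ValueError("no input files given")
    channels, width, rate = params
    # 8-bit PCM is unsigned (silence is 0x80); wider samples are signed (silence is 0x00).
    silent_byte = b"\x80" if width == 1 else b"\x00"
    silence = silent_byte * (int(gap_seconds * rate) * channels * width)
    with wave.open(str(out_path), "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(width)
        out.setframerate(rate)
        out.writeframes(silence.join(chunks))
```

For a two-speaker dialogue, one would generate each turn to its own file with the appropriate reference voice loaded and then call something like `concatenate_wavs(["turn1.wav", "turn2.wav"], "dialogue.wav", gap_seconds=0.3)`, where the file names are placeholders.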

@MrEdwards007
Author

MrEdwards007 commented Oct 14, 2024

The robotic part should be easier to fix than it not quite sounding like you. You can explore the learning rate, 1e-4 is a good default

I believe there are 1166 remaining datapoints. I captured that information somewhere, thinking it was important. I recall reading in the code:

       warmup_steps=500,
       **lr=1e-4,  # if you have enough data (over ~1000 datapoints) you can increase this up to 1e-4 and it will still be stable, but learn quicker.**
       # DOWNLOAD THESE INITIALIZATION MODELS FROM THE RELEASE PAGE OF THE GITHUB OR RUN THE DOWNLOADER SCRIPT TO GET THEM AUTOMATICALLY
       path_to_checkpoint=hf_hub_download(repo_id="Flux9665/ToucanTTS", filename="ToucanTTS.pt") if resume_checkpoint is None else resume_checkpoint,

I will make the change to the learning rate and let you know how it turns out.

Oh and thank you for the explanation. Now I understand WHY the cache isn't automatically discarded when you change your dataset. This makes a lot of sense when you are dealing with bigger issues (much larger datasets) than my own.

Revisiting a question that I poorly asked. Are there existing models/checkpoints that can be downloaded? The idea that I'm inquiring about is that applying fine-tuning to an existing voice could accelerate voice development for the end user.

Speculative Ideas

  • One thing that I've wondered about voice cloning organizations is whether they have a bank of voice models with different characteristics. When they take your voice, do they compare it to the bank, pick the one that is closest, and then fine-tune the banked model to get to the desired voice outcome faster?

No, I guess that can't be true. I watched a producer create a voice (on Google Colab), that I know very well over less than 10 minutes of YouTube data and it sounded pretty close. I could hear the missing inflections in the created voice but if I didn't know the voice already, I would say it was reasonably convincing.

  • I don't know if this is applicable to you but I had been doing research on low rank adaptation - LoRA. Does it work the same way in terms of fine-tuning a voice model versus a LLM? My suspicion (in general) is that it would enable accents, speaker adaptations, tone and emotion, possibly at an accelerated rate. This is total speculation on my part but the core idea is to reduce the number of parameters that need to be fine-tuned while still maintaining performance (based on the underlying model being adapted) and speed to end product.

@MrEdwards007
Author

MrEdwards007 commented Oct 15, 2024

Apologies, I just went back to view the code and to kick off another training run but was confused.
I'm using the "nancy" training pipeline.

lr=0.0001, # learning rate of the model.

The toucantts_train_loop_arbiter.py already has a learning rate of 1e-04 or 0.0001, so there would be no change for me.

Do I need to adjust anything because of my batch size?
I have to use batch_size=16 because of my 16 GB of VRAM.

def train_loop(net,  # an already initialized ToucanTTS model that should be trained.
               datasets,
               # a list of datasets to train on. Every dataset within a language should already be a concat dataset of all the datasets
               # in that language. So every list entry here should be a (combined) dataset for each language. For the case of a monolingual model, pass a list
               # with only one dataset in it. This will trigger the arbiter to call the train loop for simple one language training runs rather than the complex
               # LAML based one.
               train_samplers,  # the sampler(s) for the dataloader(s) (gpu_count or single GPU use different ones)
               gpu_count,  # amount of GPUs to use
               device,  # the device where this training should run on.
               save_directory,  # directory where the models and visualizations should be saved.
               steps_per_checkpoint=None,  # how many steps should be trained before a checkpoint is created. This is only relevant for the multilingual case,
               # the monolingual case will do this once per epoch, regardless of the steps.
               path_to_checkpoint=None,  # path to a trained checkpoint to either continue training or fine-tune from.
               lr=0.0001,  # learning rate of the model.
               resume=False,  # whether to automatically load the most recent checkpoint and resume training from it.
               warmup_steps=4000,  # how many steps until the learning rate reaches the specified value and starts decreasing again.
               use_wandb=False,  # whether to use online experiment tracking with weights and biases. Requires prior CLI login.
               batch_size=16,  # how many samples to put into one batch. Higher batch size is more stable, but requires more VRAM.
               eval_lang="eng",  # in which language the evaluation sentence is to be plotted.
               fine_tune=False,  # whether to use the provided checkpoint as basis for fine-tuning.
               steps=200000,  # how many updates to run until training is completed
               use_less_loss=False,  # whether to use the loss that enforces a structure in the language embedding space
               freeze_lang_embs=False,  # whether to use the language embeddings from a checkpoint without modifying them, to maintain compatibility with the zero-shot method. This treats language embeddings from the given checkpoint as constants.
               ):

@MrEdwards007
Author

I ran the scorer and removed the top 60, mostly because I don't know what loss value is acceptable or good.
So I am left with 1106 datapoints. I'm currently at 58000 steps. I think it sounds a little better (a little less robotic, but I'm not entirely sure), so I'll let the training continue.

problem
I just ran into the strangest thing. I was listening to my testing script and one of the sentences is "How now, brown cow."
However, instead of it saying "How now, brown cow" I heard "How now, brown company". Since this was in a sentence within a paragraph, I ran the speech to file process again as a completely separate test, using only "How now, brown cow" but again what I received was "How now, brown company." I don't know if this is something to do with my training data but I thought I should let you know.

Let's say that it is a problem with my training data. Other than trying to find a substitute pronunciation ("kow", Kau, Cau,Kaow,Kao), is there something that can be done to avoid retraining (once I find the offending transcription).
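Until the cause is found, the respelling workaround suggested at the top of this thread can be kept in a small table that is applied to the text before synthesis. The sketch below is a stand-alone illustration, not the toolkit's own preprocessing function; the function name is made up, and the respelling "kow" is just one of the candidates listed above, so whether it actually sounds right can only be judged by listening.

```python
import re


def apply_respellings(text: str, respellings: dict) -> str:
    """Replace whole words (case-insensitive) using a lowercase-keyed respelling table."""
    if not respellings:
        return text
    # Longest keys first so that overlapping alternatives resolve predictably.
    words = sorted(respellings, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: respellings[m.group(0).lower()], text)


# Hypothetical table: keys are the problem words, values are trial respellings.
RESPELLINGS = {"cow": "kow"}
```

With this, `apply_respellings("How now, brown cow.", RESPELLINGS)` gives `"How now, brown kow."`, while words that merely contain the key, such as "cowboy", are left alone.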

@MrEdwards007
Author

I ran "Nancy" and stopped at ~85k steps. I listened to the audio produced by multiple checkpoints, ~(20k, 30k, 40k, 50k, 70k, 80k). I felt that the audio was getting worse (more distorted) instead of better. I'm determined to make this work but unable to determine how to get to a representative voice.

Suggestion: Have the program automatically save checkpoints at pre-defined intervals, such as multiples of 10k or 15k steps.

Currently, I'm looking at the checkpoints directory every so often and copying them to an alternate folder, so I can review the quality of the produced audio. Previously, I was grabbing a checkpoint every 50k steps, but that's a long time (on my computer) to wait before knowing if things are going well.
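Until something like this exists in the program, the manual copying can be scripted from the outside. A rough sketch, assuming checkpoints are named `checkpoint_<step>.pt` as in the logs above; the function names and the archive file naming are invented here. For every multiple of the interval that has been passed, it copies the earliest checkpoint still present at or after that multiple, and skips milestones that were already archived.

```python
import re
import shutil
from pathlib import Path

CHECKPOINT_RE = re.compile(r"checkpoint_(\d+)\.pt$")


def pick_milestones(steps, interval):
    """Map each passed multiple of `interval` to the first available step at or after it."""
    picked = {}
    for step in sorted(steps):
        milestone = (step // interval) * interval
        if milestone > 0 and milestone not in picked:
            picked[milestone] = step
    return picked


def archive_milestones(checkpoint_dir, archive_dir, interval=10_000):
    """Copy one checkpoint per milestone into `archive_dir`; return the names copied."""
    checkpoint_dir, archive_dir = Path(checkpoint_dir), Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    steps = {}
    for path in checkpoint_dir.iterdir():
        match = CHECKPOINT_RE.search(path.name)
        if match:
            steps[int(match.group(1))] = path
    copied = []
    for milestone, step in pick_milestones(steps, interval).items():
        if any(archive_dir.glob(f"milestone_{milestone}_*.pt")):
            continue  # this milestone was archived on an earlier call
        target = archive_dir / f"milestone_{milestone}_step_{step}.pt"
        shutil.copy2(steps[step], target)
        copied.append(target.name)
    return copied
```

This would need to be called periodically (for example from a loop with `time.sleep`, or from cron) while training runs. One caveat: a checkpoint that is still being written at the moment of copying could end up truncated, so it is worth verifying that an archived file loads before relying on it.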

@Flux9665
Collaborator

Revisiting a question that I poorly asked. Are there existing models/checkpoints that can be downloaded? The idea that I'm inquiring about is that applying fine-tuning to an existing voice could accelerate voice development for the end user.

That was what we tried originally with the multispeaker model from which we tried to finetune a single-speaker version, but you found the similarity to your voice not convincing enough.

No, I guess that can't be true. I watched a producer create a voice (on Google Colab), that I know very well over less than 10 minutes of YouTube data and it sounded pretty close. I could hear the missing inflections in the created voice but if I didn't know the voice already, I would say it was reasonably convincing.

There's multiple ways of adapting a system to a new voice, but not all systems are equally good at this. The Toucan system is built for multilinguality, but cloning voices is not the main focus, so there are other systems (e.g. the one from Coqui) that are much better at this task because that's specifically what they were built for.

I don't know if this is applicable to you but I had been doing research on low rank adaptation - LoRA. Does it work the same way in terms of fine-tuning a voice model versus a LLM? My suspicion (in general) is that it would enable accents, speaker adaptations, tone and emotion, possibly at an accelerated rate. This is total speculation on my part but the core idea is to reduce the number of parameters that need to be fine-tuned while still maintaining performance (based on the underlying model being adapted) and speed to end product.

Yes, stuff like low-rank adaptation, or adapters more generally, works well for this. There are plenty of techniques for this, but I didn't get to investigate any of them yet for the TTS task, so I can't say for sure which are the best. Generally, they are kind of hard to integrate.

Do I need to adjust anything because of my batch size?

No I don't think so. I just ran a test on a dataset of exactly 1000 samples trained from scratch for 20k steps using a batchsize of 16 and a learningrate of 1e-4 with a smaller version of the model and it worked totally fine. This leads me to believe that there might still a problem with the data, or we just need to decrease the size of the model to make it work on just 1000 samples. Previously I trained from scratch on even less than 1000 samples, but I exchanged a component in the system in the meantime (normalizing flow to conditional flow-matching) and that might affect whether using this amount of datapoints is viable. SO there's two things to try:

  1. When you re-extracted the caches, did you just delete the caches, or did you create a new directory? If you just deleted the caches, then we forgot about one important thing: you also had to delete the aligner model within the directory where the cache is located. It was trained on the old aligner-cache, so it might still be faulty. If you did delete everything or made a new directory, then this is not the problem and you can move on to point 2. If you just deleted the caches, but not the aligner, you should delete the aligner and the TTS-cache. The aligner-cache can stay. Then, the next time you run it, a new aligner will be trained on the new aligner-cache and then used to expand the aligner-cache into the TTS-cache, on which we can then train the TTS.

  2. The model might just have too many parameters for it to be viable to train from scratch on 1000 datapoints. To fix this, we can make the model smaller and take into account that it is supposed to be a single-speaker, single-language model. For this, change the following line in the training recipe:

model = ToucanTTS()

to

model = ToucanTTS(attention_dimension=128, lang_embs=None, utt_embed_dim=None)

I made a small change earlier to fix something that prevented this from working, so you need to update your code before running this model configuration.

I'm hoping that those two steps will finally get you to a voice clone you find sufficient. If not, maybe you should look at other systems. As I mentioned earlier, Toucan was built with other goals in mind, so while it can do a lot, it's probably not the best tool for this job.
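The cleanup from point 1 (delete the aligner and the TTS-cache, keep the aligner-cache) could be scripted roughly as below. This is only a sketch: the file and directory names `tts_train_cache.pt`, `Aligner` and `aligner_train_cache.pt` are assumptions, so check what your cache directory actually contains first. The function defaults to a dry run and the demonstration only touches a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path

# ASSUMPTION: these names are placeholders; check what your cache directory
# actually contains before deleting anything.
STALE_NAMES = ("tts_train_cache.pt", "Aligner")   # TTS cache and aligner model
KEEP_NAME = "aligner_train_cache.pt"              # the aligner cache stays


def remove_stale_artifacts(cache_dir, stale_names=STALE_NAMES, dry_run=True):
    """Delete the aligner model and TTS cache, leaving everything else alone.

    Returns the sorted names of the entries that were (or would be) removed.
    """
    removed = []
    for name in stale_names:
        target = Path(cache_dir) / name
        if not target.exists():
            continue
        removed.append(name)
        if dry_run:
            continue
        if target.is_dir():
            shutil.rmtree(target)
        else:
            target.unlink()
    return sorted(removed)


# Demonstration on a throwaway directory, so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / KEEP_NAME).write_text("aligner cache")
    (root / "tts_train_cache.pt").write_text("tts cache")
    (root / "Aligner").mkdir()
    (root / "Aligner" / "aligner.pt").write_text("aligner model")

    planned = remove_stale_artifacts(root, dry_run=True)
    remaining_after_dry_run = sorted(p.name for p in root.iterdir())
    remove_stale_artifacts(root, dry_run=False)
    remaining = sorted(p.name for p in root.iterdir())
```

Running it with `dry_run=True` first shows what would be deleted, which is a cheap safeguard given how long re-extracting a cache takes.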

Now answering your remaining posts:

I ran the scorer, removed the top 60 mostly because I don't know what loss value is acceptable or good.

Yes, you don't look at the value, but at the distribution of values. If the top 10 look like, e.g., `[100, 20, 17, 1.1, 1.1, 1.1, 1.0, ...]`, then you notice that the first 3 are way higher than the rest, so they should probably go.
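One way to automate "look at the distribution, not the value" is to compare each loss against the median of all losses. This is a hedged sketch, not part of the scorer: the function name `flag_outliers` and the factor of 5 are assumptions chosen for illustration.

```python
from statistics import median


def flag_outliers(losses, factor=5.0):
    """Return indices of samples whose loss exceeds `factor` times the median.

    ASSUMPTION: the factor of 5 is an arbitrary starting point, not a value
    recommended by the toolkit.
    """
    threshold = factor * median(losses)
    return [i for i, loss in enumerate(losses) if loss > threshold]


losses = [100, 20, 17, 1.1, 1.1, 1.1, 1.0]
suspicious = flag_outliers(losses)
```

With the example values above, the median is 1.1, so only the first three samples are flagged. The median is used instead of the mean because a few very large losses would otherwise pull the threshold up with them.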

I just ran into the strangest thing. I was listening to my testing script and one of the sentences is "How now, brown cow."
However, instead of it saying "How now, brown cow" I heard "How now, brown company". Since this was in a sentence within a paragraph, I ran the speech to file process again as a completely separate test, using only "How now, brown cow" but again what I received was "How now, brown company." I don't know if this is something to do with my training data but I thought I should let you know.

That must be a typo in your test with a missing "w". If the text preprocessing sees "co." somewhere, it will be expanded to "company", as you can see here:

https://github.com/DigitalPhonetics/IMS-Toucan/blob/6355e679207917b68486649bfb0065b7f556cacd/Preprocessing/TextFrontend.py#L1054C100-L1054C118
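The effect can be reproduced with a single regular-expression rule. This is a sketch in the style of common TTS text cleaners, not a copy of the linked code: the one-entry `ABBREVIATIONS` list and the function name `expand_abbreviations` are assumptions for illustration.

```python
import re

# ASSUMPTION: a single illustrative rule in the style of common TTS text
# cleaners; the real frontend has its own, longer list.
ABBREVIATIONS = [(re.compile(r"\bco\.", re.IGNORECASE), "company")]


def expand_abbreviations(text):
    """Replace each known abbreviation (including its period) with its expansion."""
    for pattern, replacement in ABBREVIATIONS:
        text = pattern.sub(replacement, text)
    return text


with_typo = expand_abbreviations("How now, brown co.")
correct = expand_abbreviations("How now, brown cow.")
```

The pattern only fires when "co" is directly followed by a period, so the correctly spelled "cow." passes through unchanged while "co." becomes "company".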

I ran "Nancy" and stopped at ~85k steps. I listened to the audio produced by multiple checkpoints, ~(20k, 30k, 40k, 50k, 70k, 80k). I felt that the audio was getting worse (more distorted) instead of better. I'm determined to make this work but unable to determine how to get to a representative voice.

Suggestion: Have the program automatically save checkpoints at pre-defined intervals, such as at multiples of 10k or 15k steps.

Typically longer training always means better quality, except if there are too few datapoints or there is something wrong with the data, especially because the learningrate decays and thus the changes get smaller and smaller with time. So at some point it just stays about the same, but it typically never gets worse if there is enough data with correct labels. To see how the model changed over time, you can look at the visualization of the spectrogram in the training logs.

@Flux9665
Collaborator

One more thing about the efficient finetuning topic I wanted to mention: I have plans to include a very simple finetuning mode in the near future where less than 100 parameters are updated in the model. That should make finetuning possible with very very few datapoints.

However: If enough datapoints are available, finetuning less than 100 parameters will never reach the quality of finetuning the full 30,000,000 parameters of the model.

Also: If enough datapoints are available, even finetuning the full model will probably never reach the quality of training the model from scratch.

The big question to which I don't know the answer yet is: How many datapoints are "enough" for each of those three strategies?

@MrEdwards007
Author

I'm looking forward to the progression and the simplified fine-tuning interface.
I really like this program but want to create a model using my own voice.

Background

I have increased to 2.5 hours of high quality audio
I'm refining a new dataset generator, which is more controllable than the last one and keeps 95% of segments between 4 and 11 seconds in length (never longer than 11 seconds).

Problem

I'm getting very rough sounding audio going from 20-70k steps.
I've gone much higher on the number of steps (100-210k), but I'm not seeing changes in loss or learning rate after 50k steps. It seems like a waste to continue past 70k steps (I think 70k) on my dataset.

I'm pretty stumped. I haven't posted status but I've tried a few things in the last couple of weeks:

  • Deleted the old aligner and tts caches
  • Deleted the old checkpoints
  • Changed the learning rate up and down from 1e-4 (0.0001)
  • Rebuilding using 'nancy' from scratch
  • Downloaded Toucan_TTS.pth and attempted to fine-tune as a starting checkpoint (experiment).

I'm trying to understand what could cause the noise since I've used the scorer to eliminate outliers from the dataset.

Question

  • when fine-tuning, where does it start from, since you can choose the finetuning recipe while not having a checkpoint to start from.
  • What does fine-tuning do in Toucan? I've known fine-tuning to mean adapt or add a different behavior, so maybe I have the wrong expectation.
  • Do the segments of the dataset need to have silence at the beginning and end?
  • Would you point me to someone who has succeeded in building a model from scratch and fine-tuned, English language?
  • Is it possible to choose a synthetic voice, using a specific seed (starting from a voice that is relatively similar to mine) and then fine tune from there?

@Flux9665
Collaborator

Flux9665 commented Nov 1, 2024

Yes, I also think that more than 70k or even 50k steps are not necessary on your data.

when fine-tuning, where does it start from, since you can choose the finetuning recipe while not having a checkpoint to start from.

By default it loads the massively multilingual and multispeaker checkpoint from huggingface and uses that as the starting point.

What does fine-tuning do in Toucan? I've known fine-tuning to mean adapt or add a different behavior, so maybe I have the wrong expectation.

Fine-tuning in Toucan means continuing the training of an already trained model on new data, so that it becomes better at modeling the new data while having a lot of prior knowledge about how the acoustics of speech work and what kinds of prosody are appropriate for a text. So it's just more training, not starting from zero but from an already good model. The problem is that the model tends to forget things during fine-tuning, but also clings to patterns that it doesn't really unlearn: because the new data is kind of similar to data it has seen before, the model is not changed enough to catch all the nuances of the new data. That's why you need much less data for fine-tuning than for learning from scratch, but you might get worse results.

Do the segments of the dataset need to have silence at the beginning and end?

They don't need to have silences, but it doesn't hurt either. The only important thing is that it is consistent. If there is a silence, it should ideally always be the same length.
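If you want to enforce that consistency, one option is to trim whatever silence a segment has and then pad both ends with a fixed amount. The sketch below works on plain lists of samples; the function name `normalize_silence`, the threshold and the padding length are assumptions for illustration, and it is not part of the toolkit.

```python
def normalize_silence(samples, pad_samples, threshold=0.01):
    """Trim leading/trailing quiet samples, then pad both ends with fixed silence.

    ASSUMPTION: `threshold` and `pad_samples` are illustrative values; real audio
    usually needs an energy-based detector rather than a per-sample check.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        return list(samples)  # nothing but silence; leave untouched
    core = samples[loud[0]:loud[-1] + 1]
    padding = [0.0] * pad_samples
    return padding + list(core) + padding


clip_a = [0.0, 0.0, 0.0, 0.5, -0.4, 0.0]   # long lead-in, short tail
clip_b = [0.3, 0.2, 0.0, 0.0]              # no lead-in, longer tail
out_a = normalize_silence(clip_a, pad_samples=2)
out_b = normalize_silence(clip_b, pad_samples=2)
```

Both clips end up with exactly two silent samples on each side, regardless of how much silence they started with.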

Would you point me to someone who has succeeded in building a model from scratch and fine-tuned, English language?

I don't know anyone who did that in the current version. In the institute, we are mostly using the model for voice-privacy applications, so building single speaker models from scratch is more of a benchmark thing.

Is it possible to choose a synthetic voice, using a specific seed (starting from a voice that is relatively similar to mine) and then fine tune from there?

The synthetic voices are not part of the model. They are a condition signal. The model is always the same for all voices.

@MrEdwards007
Author

I started again with 2-2.5 hours of audio, using Nancy and a newly created dataset to create a model that sounds like my voice. However, I am unable to determine if I am done training. I trained to 30k steps. The voice sounds better, but it's still some distance from what I sound like. I don't know if I need to continue training for more steps or if I have reached the end of what I can hope to achieve.

[image: screenshot of training loss curves]

@MrEdwards007
Author

I extended the training run from 30k to 60k steps. The learning rate lowered and flattened (which I think is a good thing) but did not hear an improvement in the output. I'll think about it and revisit.

@Flux9665
Collaborator

Flux9665 commented Jan 2, 2025

Looking at these loss curves, I think you have achieved the best performance. If you have more data, you can train for more steps, but for 2.5 hours, 30-60k steps seems perfectly fine to me. In my experience, if you have 5 hours of nice data, you can train for 80k steps and get a pretty good result. But since voice cloning is not the primary focus of this toolkit, it might still not be quite up to your standards, and at that point you get severely diminishing returns from adding more data or training for longer.

@Flux9665
Collaborator

Flux9665 commented Jan 2, 2025

To see convergence, you can zoom into the second half of the plot for the regression_loss and the stochastic_loss. If the plot looks pretty flat towards the end, then the model is done. If it is still going noticeably down, then it's not done and you should add steps.

A little hack: You can just use a number of steps that is way too high, like 500k, and check the plot from time to time. Once the line gets pretty flat, you just kill the process, regardless of the number of steps it is currently at.
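The "is the curve flat yet?" check can also be approximated numerically if you have the logged loss values as a list. This is a rough sketch, not something the toolkit provides: the function name `has_converged`, the window size and the tolerance are assumptions, and noisy losses may need a larger window.

```python
def has_converged(losses, window=100, tolerance=0.01):
    """Compare the mean loss of the last `window` steps with the window before it.

    ASSUMPTION: window size and tolerance are illustrative. Returns True when the
    relative improvement between the two windows is below `tolerance`, which
    also covers the case where the loss has stopped falling or is rising.
    """
    if len(losses) < 2 * window:
        return False  # not enough history to judge
    previous = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    if previous == 0:
        return True
    return (previous - recent) / abs(previous) < tolerance


still_falling = [1.0 / (step + 1) for step in range(200)]
flat = [0.5] * 200
```

Here `has_converged(still_falling)` is False because the recent window is much lower than the one before it, while `has_converged(flat)` is True. A check like this could be used alongside the plot, not instead of listening to the checkpoints.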
