Toucan Questions #195
Comments
Not directly, but that's a very good idea. I'll add it to my list to do. There are a few things that need to be considered to implement this, but it shouldn't be too difficult. For now, you could extend the following function that is applied to any English text to change the orthography of certain words automatically to trick the phonemizer into converting the word in a way that sounds more like what it's supposed to be: IMS-Toucan/Preprocessing/TextFrontend.py Line 1047 in c00f8a4
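The function referenced above is not reproduced in this thread. As a rough, standalone illustration of the idea (respelling words before they reach the phonemizer), a minimal sketch could look like the following. The name `respell` and the entries in `RESPELLINGS` are made up for this example and are not part of IMS-Toucan.

```python
import re

# Hypothetical respelling table (assumption): lowercase word -> spelling that
# the phonemizer is more likely to pronounce as intended.
RESPELLINGS = {
    "toucan": "toocan",
    "wandb": "wand bee",
}


def respell(text, table=RESPELLINGS):
    """Replace whole words (case-insensitive) with their respelling."""
    if not table:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in table) + r")\b",
        flags=re.IGNORECASE,
    )
    # Keys are stored in lowercase, so the match is lowercased for the lookup
    return pattern.sub(lambda match: table[match.group(0).lower()], text)


print(respell("The Toucan logs to wandb."))  # The toocan logs to wand bee.
```

Matching on word boundaries keeps longer words that merely contain an entry (for example "toucans") untouched.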
Again, not directly. You can add ~ to the text to insert a pause at that point, but the model will decide how long the pause should be. It is possible to modify the length of the pause after the model has predicted it, but there's currently no simple way of doing this. I plan to implement a GUI where you can modify an utterance intuitively, but that seems pretty difficult, so I have put it off for a while already.
Same as above, it's possible by overwriting some intermediate results, but there's currently no simple way of doing this. In the future there will be, but it might take a while.
There is one. It doesn't come with a GUI, but the changes you need to do to the code are pretty minimal. And again, I plan to improve this in the future and even add a GUI where you can just throw in some data. You just mentioned a lot of points that I plan for the future, but didn't get to yet, haha. For finetuning, all you need to implement is a function that returns a dictionary of paths to audios that map to their respective transcripts. Each audio should be around 6-10 seconds long ideally, a bit longer or a bit shorter is no problem. You then plug this dictionary into the following function and run the run_training_pipeline.py script with "finetuning_example_simple" as command line argument.
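A minimal sketch of such a dictionary-building function is shown below. It assumes an LJSpeech-like layout (a `metadata.csv` with `file_id|transcript` lines next to a `wavs` folder); that layout and the function name are assumptions for illustration, so adapt them to however your data is stored.

```python
import csv
import os


def build_path_to_transcript_dict(corpus_root, metadata_name="metadata.csv"):
    """Map absolute wav paths to their transcripts.

    Assumed layout (not prescribed by Toucan):
        corpus_root/metadata.csv   with lines 'file_id|transcript'
        corpus_root/wavs/file_id.wav
    """
    path_to_transcript = {}
    metadata_path = os.path.join(corpus_root, metadata_name)
    with open(metadata_path, encoding="utf8", newline="") as handle:
        reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2 or not row[1].strip():
                continue  # skip malformed lines and empty transcripts
            wav_path = os.path.join(corpus_root, "wavs", row[0].strip() + ".wav")
            if os.path.isfile(wav_path):  # only keep entries whose audio exists
                path_to_transcript[os.path.abspath(wav_path)] = row[1].strip()
    return path_to_transcript
```

Skipping entries without a matching audio file avoids feeding dangling paths into the training recipe.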
In generative tasks, the loss is usually not really indicative of the "best" checkpoint, so I would always just use the latest one. There is no TensorBoard integration, but there is Weights and Biases integration, which basically does the same thing. You can create a free account there, log in on the command line and then use the --wandb flag when starting a training run.
Actually I think your voice should be quite simple to clone. The recording is good. For Toucan specifically, the system is conditioned on short audios of 6-10 seconds, so the length might be an issue, but voice cloning is not one of Toucan's strengths anyway. To get a good clone out of Toucan, you would need to finetune the model to your voice. A few minutes of finetuning data is enough. It's surprising though that you say no other software is good at cloning it either; I would have thought that systems optimized for cloning should be able to do it quite well. |
This is really good information and I am looking into it. A few questions and a problem. Questions
Problem I created a dataset and have been following the directions for finetuning, but I have encountered an error that I've been unable to resolve. The huggingface file can't be found. Here is my process:
python run_training_pipeline.py --gpu_id 0 --model_save_dir /home/homer/Documents/Programs/IMS-Toucan/WEdwards-Checkpoints --wandb finetuning_example_simple
The error occurs in the file "corpus_preparation", in the method "prepare_tts_corpus". Here is the trace:
/home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/_distutils_hack/__init__.py:31: UserWarning: Setuptools is replacing distutils. Support for replacing an already imported distutils is deprecated. In the future, this condition will fail. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
The above exception was the direct cause of the following exception:
Traceback (most recent call last): |
Re: Questions
I never explored when exactly that threshold occurs, but considering some research of other people and some of the language-learning experiments I've done, I would guess around 30 minutes. The more data you have, the longer you can train for, so having more surely won't hurt. It will probably forget about some of its multilingual capabilities though, if you finetune on a single language, but it will get much better at cloning this voice.
Sure, that's just storage optimization. But for generative tasks, overtraining is a rare phenomenon. Basically, if you have enough data (for TTS I would say 1 or 2 hours is enough), a generative model can almost never overfit. Just delete this line:
And then you should also change the amount of training steps in the finetuning recipe, since the default is kept very low:
Just set it to some very large number and kill the process yourself once you're satisfied.
You instantiate an InferenceInterfaces/ToucanTTSInterface and pass the path to your checkpoint as Line 96 in 07edf9c
I am not sure what you mean by base model. Do you mean the architecture of the model? That's in Modules/ToucanTTS/ToucanTTS.py Re: Problem Ah yes, sorry about that, the huggingface integration is brand new, I didn't have time to properly test it yet and there was indeed a file missing. I uploaded it to the hub, it should be working now. If you encounter any more missing files, please let me know. |
I would consider this a bug. If there are existing checkpoints in the folder when you start, they are counted towards your 5 checkpoints. I didn't clear the existing checkpoints, so only the "best.pt" was being saved. It took a minute of watching the directory before I realized what was happening. |
Questions
def read_texts(sentence, filename, model_id=None, device="cpu", language="eng", speaker_reference=None, duration_scaling_factor=1.0):
Problem I don't know what I'm looking for, as I was accustomed to looking at the total loss and finding the lowest point in the graph. Here, I don't know what to look for in the training graph or the numbers. Would you provide guidance on the learning rate and how it should or could be adjusted? wandb: Run history: |
When running the demo, I frequently get this warning that the audio may be clipped. I can attest that it is being clipped, as I can hear it and see it in Audacity. I don't know what the cause is. I tried using a lower-volume reference, but that didn't resolve the issue. Is there something that I can do to resolve this? /home/homer/anaconda3/envs/toucan-tts/lib/python3.10/site-packages/pyloudnorm/normalize.py:62: UserWarning: Possible clipped samples in output. |
This is intended behaviour, but yes maybe a warning message telling you that the directory is not empty and the ´resume´ flag is not present would be a good idea. I recommend changing the name of the model directory for every individual run to describe what makes this run unique:
You pass either the absolute or the relative path to any .pt checkpoint to the ´model_id´ parameter of that function. Alternatively, you can just specify the ID of the model, which is what comes after "ToucanTTS_" in the name of the directory you save the checkpoints to. It will then look in the model dir (see below) and find the ´best.pt´ from the ToucanTTS_ directory in the model dir. IMS-Toucan/Utility/storage_config.py Line 1 in 07edf9c
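The lookup rule described above can be sketched as follows. This is not the actual Toucan code; `resolve_checkpoint` and `models_dir` are hypothetical names, with `models_dir` standing in for the model directory configured in Utility/storage_config.py (its value is not shown in this thread).

```python
import os


def resolve_checkpoint(model_id, models_dir):
    """Return the checkpoint path implied by a model_id, as described above.

    Hypothetical helper: models_dir is a placeholder for the configured
    model directory, whose real value is not given here.
    """
    if model_id.endswith(".pt"):
        # An explicit absolute or relative path to any checkpoint file
        return model_id
    # Otherwise treat it as an ID: the part after "ToucanTTS_" in the folder name
    return os.path.join(models_dir, f"ToucanTTS_{model_id}", "best.pt")
```

For example, `resolve_checkpoint("MyVoice", "Models")` points at `best.pt` inside `Models/ToucanTTS_MyVoice`, while a string ending in `.pt` is returned unchanged.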
The pretrained model is the one that is automatically loaded from huggingface as the default. Any other models you would need to train yourself. This pretrained one is very universal, so I didn't see a point in providing more than this one.
Early stopping in TTS does not make much sense, the losses don't tell you much. 27k steps is reasonable, you can even do a lot more. Best thing to look out for is the progress plot. If it looks good, then the model is probably good. The only way to make sure is to generate audio with it and listen. If the model collapses, then you will definitely see this in the progress plot. There is unfortunately no good objective metric for TTS, it's a big problem of the field. If you trained the model and it did not change much, then you should probably increase the learning rate. Try 1e-4 here:
You can try to normalize it to a lower loudness in the following snippet. If that doesn't work, then that's unfortunately not fixable, since the vocoder directly produces those values. I also get the warning often, but I don't notice it when listening, only when inspecting the signal. Maybe some post-processing could be helpful.
Soon I will hopefully release another vocoder checkpoint with higher quality that will be used if a GPU is available during inference (on CPU it's too slow, on GPU it doesn't make a difference). Maybe that one will have fewer issues with clipping. |
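As one possible form of the post-processing mentioned above, the following sketch scales a waveform down so that its peak stays below full scale. It is a generic peak limiter written for illustration, not what Toucan or pyloudnorm do internally, and the 0.95 target is an arbitrary assumption. It cannot repair samples that the vocoder already produced in a distorted form.

```python
def limit_peak(samples, peak=0.95):
    """Scale a float waveform so its largest absolute value is at most `peak`.

    `samples` is a list of floats, nominally in [-1.0, 1.0]; values outside
    that range are the ones that get clipped when written to a file.
    """
    largest = max((abs(s) for s in samples), default=0.0)
    if largest <= peak:
        return list(samples)  # already within range, leave untouched
    gain = peak / largest  # one gain for the whole signal keeps the shape intact
    return [s * gain for s in samples]
```

Applying a single gain to the whole utterance lowers the overall loudness slightly but avoids the audible distortion that hard clipping introduces.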
I'll try and let you know how it turns out. I have at least 8 hours before this finetuning run is complete. |
Question I feel like I'm doing something wrong. I have two hours of high-quality audio data, approximately 10-11 hours of training time (subjective, as I know that depends entirely on my equipment) and 42416 steps (objective, an absolute number) so far, but this doesn't sound like my voice.
If I knew that I could get an extremely high-quality voice based on my training data, I would use a rented GPU, since this would be a one-time event and a one-time cost, but I can't put money into renting a GPU until I know I can get the desired outcome. Currently, this doesn't sound like me and I don't have a horizon to know when I can get there. I can spend more time at the microphone or find other productions that I've done and combine them for a longer session. Problem Oddly, the training run stopped at
I'm not sure what is happening, but I am seeing a pattern of training failures. I thought I was going crazy, but in the last three training runs, attempting to fine-tune a model on two hours of my voice data, the process stopped roughly every 5 to 5.5 hours. After reading the documentation a few times, I saved my checkpoints and continued training from a previous checkpoint, so that I do not need to start from zero (thank goodness). I do not know what this means or how to address it.
EPOCH COMPLETE 91%|██████████████████████████████████████████████████████████████████████████████▏ | 10/11 [00:07<00:00, 1.40it/s] selecting checkpoints... EPOCH COMPLETE 91%|██████████████████████████████████████████████████████████████████████████████▏ | 10/11 [00:07<00:00, 1.31it/s] selecting checkpoints... EPOCH COMPLETE 91%|██████████████████████████████████████████████████████████████████████████████▏ | 10/11 [00:07<00:00, 1.23it/s] selecting checkpoints... EPOCH COMPLETE 91%|██████████████████████████████████████████████████████████████████████████████▏ | 10/11 [00:07<00:00, 1.30it/s] 91%|██████████████████████████████████████████████████████████████████████████████▏ | 10/11 [00:09<00:00, 1.11it/s] |
2 hours of training data is more than enough, you don't need to record additional data or change it in any way. The error you encountered is related to espeak, it usually occurs when your system goes to sleep mode or you log out. 70k steps should definitely be enough. Do you notice any change in the voice between checkpoints? If not, the model might not be loaded correctly. Can you double check at which location in the code you select the checkpoint? Also, do the loss curves change in wandb? The regression loss should definitely go down. |
I haven't heard a difference between the checkpoints. Here is the script I use:
import os
import torch
from InferenceInterfaces.ToucanTTSInterface import ToucanTTSInterface


def read_texts(sentence, filename, model_id=None, device="cpu", language="eng", speaker_reference=None, duration_scaling_factor=1.0):
    tts = ToucanTTSInterface(device=device, tts_model_path=model_id)
    tts.set_language(language)
    print("speaker_reference", speaker_reference)
    if speaker_reference is not None:
        tts.set_utterance_embedding(speaker_reference)
    if isinstance(sentence, str):
        sentence = [sentence]
    tts.read_to_file(text_list=sentence, file_location=filename, duration_scaling_factor=duration_scaling_factor, prosody_creativity=0.0)
    del tts


if __name__ == '__main__':
    exec_device = "cuda" if torch.cuda.is_available() else "cpu"
    # quick, easy sentences.
    sentence = []
    sentence += ["Cybersecurity is the practice of protecting systems, networks, and programs from digital attacks aimed at accessing, changing, or destroying sensitive information."]
    sentence += ["A strong cybersecurity framework involves implementing multiple layers of defense across technology, processes, and people to reduce the risk of breaches."]
    sentence += ["Continuous monitoring and timely updates are essential in cybersecurity to address evolving threats and vulnerabilities within any organization’s infrastructure."]
    checkpoint = "checkpoint_21208"
    model_path = f"/home/homer/Documents/Programs/IMS-Toucan/Experiment/{checkpoint}.pt"
    output_file = f"output_audio_{checkpoint}.wav"
    speaker_reference = None
    read_texts(sentence=sentence,
               filename=output_file,
               model_id=model_path,
               device=exec_device,
               language="eng", speaker_reference=speaker_reference)
    checkpoint = "checkpoint_44968"
    model_path = f"/home/homer/Documents/Programs/IMS-Toucan/Experiment/{checkpoint}.pt"
    output_file = f"output_audio_{checkpoint}.wav"
    read_texts(sentence=sentence,
               filename=output_file,
               model_id=model_path,
               device=exec_device,
               language="eng", speaker_reference=speaker_reference)
In terms of wandb, I don't know what I'm looking at. I was hoping to learn what to look for. I understood TensorBoard, but using wandb is new to me. I don't know which loss to look at here. |
That looks absolutely correct. I would not expect a big change between step 22k and step 45k, because at 22k the model should already be pretty close to done and the changes after that are getting smaller and smaller. I think most interesting would be the changes within the first few hundred steps. But either way, it should definitely sound pretty close to you at this point. Does the regression loss curve go down? Maybe try an even higher learning rate, maybe 1e-3. That shouldn't make a difference, but it might be worth to try. Another option would be to train from scratch without finetuning. 2 hours should be enough for that as well. |
I think I know what you are asking for, after thinking about it a bit more. Knowing that I should be looking at the regression loss was excellent information. I changed the smoothing so I could see whether it was decreasing, and it remains flat, so I think continuing to train for additional steps is not benefiting me, if I am understanding this correctly. It really does not sound like me. The only parallel is that it is a male voice and it is reasonably deep in comparison, but the prosody sounds like a different person.
I will give that a try.
That will be my next thing I try after changing the learning rate. |
I just recorded 100 sentences of myself and I will see if I can get a model to sound like myself with that. |
This is the regression loss of a run I just started. I am using 10 minutes of data I just recorded and after 8 minutes of training it looks pretty good already. I didn't change any of the defaults in the finetuning_simple recipe. |
I'll start over without the fine-tuning and let you know how it turned out. |
Question Is 'nancy' the way to go to create a model from scratch? Problem The training stopped again at 21208 steps EPOCH COMPLETE 91%|██████████████████████████████████████████████████████████████████████████████▏ | 10/11 [00:07<00:00, 1.22it/s] selecting checkpoints... EPOCH COMPLETE 91%|██████████████████████████████████████████████████████████████████████████████▏ | 10/11 [00:07<00:00, 1.24it/s] selecting checkpoints... EPOCH COMPLETE 91%|██████████████████████████████████████████████████████████████████████████████▏ | 10/11 [00:07<00:00, 1.28it/s] mmap() failed: Cannot allocate memory |
I may have hit the jackpot. As I was looking through my issues and the code, I came to this:
def train_loop(net,  # an already initialized ToucanTTS model that should be trained.
               datasets,
               # a list of datasets to train on. Every dataset within a language should already be a concat dataset of all the datasets
               # in that language. So every list entry here should be a (combined) dataset for each language. For the case of a monolingual model, pass a list
               # with only one dataset in it. This will trigger the arbiter to call the train loop for simple one language training runs rather than the complex
               # LAML based one.
               train_samplers,  # the sampler(s) for the dataloader(s) (gpu_count or single GPU use different ones)
               gpu_count,  # amount of GPUs to use
               device,  # the device where this training should run on.
               save_directory,  # directory where the models and visualizations should be saved.
               steps_per_checkpoint=None,  # how many steps should be trained before a checkpoint is created. This is only relevant for the multilingual case,
               # the monolingual case will do this once per epoch, regardless of the steps.
               path_to_checkpoint=None,  # path to a trained checkpoint to either continue training or fine-tune from.
               lr=0.0001,  # learning rate of the model.
               resume=False,  # whether to automatically load the most recent checkpoint and resume training from it.
               warmup_steps=4000,  # how many steps until the learning rate reaches the specified value and starts decreasing again.
               use_wandb=False,  # whether to use online experiment tracking with weights and biases. Requires prior CLI login.
               batch_size=32,  # how many samples to put into one batch. Higher batch size is more stable, but requires more VRAM.
               eval_lang="eng",  # in which language the evaluation sentence is to be plotted.
               fine_tune=False,  # whether to use the provided checkpoint as basis for fine-tuning.
               steps=200000,  # how many updates to run until training is completed
               use_less_loss=False,  # whether to use the loss that enforces a structure in the language embedding space
               freeze_lang_embs=False,  # whether to use the language embeddings from a checkpoint without modifying them, to maintain compatibility with the zero-shot method. This treats language embeddings from the given checkpoint as constants.
               ):
I didn't see which method calls it, but I saw the default "batch_size=32", so I changed it to "batch_size=16". So far, nothing has stopped, so I'll see how this turns out. |
Nancy is the correct pipeline to adapt. Some of the arguments are not set explicitly, so the defaults will be used. I just made some of them explicit instead. The train_loop_arbiter that you found is called by every training recipe. This is where the defaults are, so yes, you changed the batchsize correctly. After warmup_steps * 2, a secondary decoder will start to be trained. Then the memory requirements will increase, so that's a point where it might crash. (default warmup_steps is 4000, so the memory requirements will increase after 8000 steps and then stay constant.) The lines on the plot are the pitch contour and the energy contour that are predicted by the model. They should not look the same. Check your results after ~20k steps, if it does not sound good by then, we might need to use the scorer to clean some of the data. |
I just finished implementing the new GUI for inference where you can control how exactly an utterance is produced. If you pull the newest version of the code and update the requirements, you can run the advanced GUI script. I hope it's pretty intuitive. |
I'm looking forward to looking at this tomorrow. I went for broke and did the 'nancy' pipeline run for all 200k steps. Now I'm wondering if it is my dataset. Although I have two hours of good audio, maybe it's my splitter. I'm attempting to explore anything that I can do on my side to support the ideal outcome. I wrote a splitter using Whisper (which I often use for transcriptions) and initially found errors in the transcription splitting, so I ran it through a second pass (on the split wav files) as a validation procedure. Spot-checking some of them after the second pass, I would occasionally see a hallucinated word (sometimes two) that was not in the audio. This is my plot after 200k steps |
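The two-pass validation described above could be automated roughly as follows: compare the transcript from the first pass with the one from the second pass for every clip and flag the clips where the word sequences differ. The function names and the dictionary format (clip name to transcript) are assumptions for this sketch. Agreement between both passes does not prove a transcript is correct, since both passes can make the same mistake.

```python
import re


def normalize(text):
    """Lowercase the text and keep only letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def disagreeing_clips(first_pass, second_pass):
    """Return the clip names whose two transcriptions differ in their words."""
    flagged = []
    for name, text in first_pass.items():
        # A clip missing from the second pass counts as a disagreement
        if normalize(text) != normalize(second_pass.get(name, "")):
            flagged.append(name)
    return sorted(flagged)
```

Flagged clips can then be checked by ear or simply dropped, which is cheap when a few hours of audio are available.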
The plot looks pretty decent I would say, the boundaries in the spectrogram line up with the phonemes pretty well. If there are some incorrect labels, this messes the TTS up pretty badly, it is not robust against mistakes in the labels at all. In that case you should definitely run the scorer to find and remove datapoints that appear to have problems from the dataset cache. I'll give instructions below. Also, do the pauses in your speech align with some symbol in the text? The default heuristic is to use some punctuation marks as indicators of pauses (mostly commas coincide with pauses fairly well). If the text has no punctuation marks, the model cannot learn about pauses and gets confused by their existence in the audio without a corresponding indicator in the text. In that case you could run a punctuation restoration model on your texts and re-extract the dataset cache. Using the scorer is pretty simple as well, you don't need to change much in the code. You specify the path to your 200k step model here Line 13 in 20ce6c8
You specify the path to your dataset cache in the line directly below Line 14 in 20ce6c8
And then in the line below that one you tell it to show you the X samples with the highest loss. I would set this number pretty large, maybe 500 or so and then have a look at the distribution. Line 15 in 20ce6c8
There's probably going to be a few samples that are really really bad and then fewer and fewer as you approach the average sample quality. You can use this to estimate how many of the worst samples you want to exclude and then do that with the line below: Line 16 in 20ce6c8
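Regarding the earlier point about pauses and punctuation, a quick check for transcripts that are long but contain no pause indicators could look like the sketch below. The set of marks and the `min_words` threshold are arbitrary assumptions, and the function is not part of Toucan.

```python
import re

# Marks treated as possible pause indicators in this sketch (assumption)
PAUSE_MARKS = re.compile(r"[,.;:!?~]")


def transcripts_without_pause_marks(path_to_transcript, min_words=12):
    """Return paths whose transcript is long but has no sentence-internal punctuation."""
    suspicious = []
    for path, text in path_to_transcript.items():
        inner = text.strip().rstrip(".!?")  # ignore the sentence-final mark
        if len(inner.split()) >= min_words and not PAUSE_MARKS.search(inner):
            suspicious.append(path)
    return suspicious
```

Transcripts returned by this check are candidates for a punctuation restoration pass before the dataset cache is extracted again.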
|
Good day. I recreated my dataset with my newly created program. I found the previous version had anomalies when chunking and transcribing the data. I ran the "Nancy" pipeline for 200k steps and the resemblance was not convincing. I did not run the scorer before I launched the most recent training run. Here I noticed what I think is a big problem for me. Problem I could be wrong, but this is concerning: I just realized that the cache is dated a week ago, even though I just ran 'Nancy' as a new regular training run. It takes 1.5 to 2 days for each 200k cycle to complete, so it takes a while to determine whether I have the desired outcome. If my new voice was built from the old cache and not from the newly created dataset, then I need to know how to clear the cache or force the program to use the new dataset so this doesn't happen again. Question
.../IMS-Toucan/Corpora/Nancy/aligner_train_cache.pt Although I can hear myself in the output, there is a large difference between my voice and the one produced after 200k steps. |
Reporting, so that you know that the following occurs on every long training run. REPORT EPOCH COMPLETE 99%|██████████████████████████████████████████████████████████████████████████████████████▉ | 80/81 [01:03<00:00, 1.42it/s] selecting checkpoints... EPOCH COMPLETE 99%|██████████████████████████████████████████████████████████████████████████████████████▉ | 80/81 [01:02<00:00, 1.19it/s] selecting checkpoints... EPOCH COMPLETE 99%|██████████████████████████████████████████████████████████████████████████████████████▉ | 80/81 [01:02<00:00, 1.28it/s] selecting checkpoints... EPOCH COMPLETE 99%|██████████████████████████████████████████████████████████████████████████████████████▉ | 80/81 [01:02<00:00, 1.28it/s] selecting checkpoints... EPOCH COMPLETE 99%|██████████████████████████████████████████████████████████████████████████████████████▉ | 80/81 [01:03<00:00, 1.28it/s] 99%|██████████████████████████████████████████████████████████████████████████████████████▉ | 80/81 [01:05<00:00, 1.23it/s] |
If a cache for a dataset already exists, the existing one is always loaded. To create a new cache, the easiest way is to delete the old aligner and tts caches, as you mentioned. Alternatively you could specify a different name for the cache save directory in the recipe, then a new one is created and you can keep the old one if you ever need to reproduce something. Since creating these caches for large datasets can take weeks, the point is to only extract them once and then load the same cache for every training run that includes this dataset. Sharing your dataset splitter would surely be helpful. I intend to build an automated pipeline for people to make their own models with minimal effort in the future, but it's still pretty far down on the list. Adding punctuation to your transcript, if you haven't already, would definitely help with the pauses. You would need to do this before extracting the new caches. Also running the scorer afterwards anyway would probably also be a good idea, because even few imperfections in the data can really mess with the model. The espeak DLL issue seems to be specific to your machine, I can't reproduce it. Maybe try using a different way of making the process permanent, like tmux. |
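A small sketch of the first option (getting the old caches out of the way) is given below. Instead of deleting, it moves the files into a timestamped backup folder so an old run can still be reproduced. The `_cache.pt` suffix is inferred from the `aligner_train_cache.pt` file mentioned in this thread; whether every cache file follows that pattern is an assumption, and `set_aside_caches` is a hypothetical helper.

```python
import os
import shutil
import time


def set_aside_caches(corpus_dir, suffix="_cache.pt"):
    """Move cache files into a backup folder so they are re-extracted next run.

    The suffix is an assumption based on 'aligner_train_cache.pt'.
    Returns the names of the files that were moved.
    """
    backup_dir = os.path.join(corpus_dir, "old_caches_" + time.strftime("%Y%m%d_%H%M%S"))
    moved = []
    for name in sorted(os.listdir(corpus_dir)):
        source = os.path.join(corpus_dir, name)
        if os.path.isfile(source) and name.endswith(suffix):
            os.makedirs(backup_dir, exist_ok=True)  # created only if needed
            shutil.move(source, os.path.join(backup_dir, name))
            moved.append(name)
    return moved
```

Running it on the corpus cache directory before starting a new training run forces a fresh extraction from the updated dataset.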
Understood. I just need to be mindful of the cache in the future. In terms of the splitter, it can be found here: https://github.com/MrEdwards007/LJ-Speech-Dataset-Creator/tree/main I've begun using the scorer as of late. I created a new dataset, cleared the cache and ran the scorer. After 70k steps, it sounds like a robotic version of me. I don't think that going to 200k steps is going to make it better. Stumped. |
Since I only see Nancy, is there a way to also add the Paul voice to the source, so that it would be easier for users to modify the GUI code and maybe recreate conversations by segmenting parts and appending them afterwards? |
@MrEdwards007 if it sounds like a robotic version of you, I think that's a good direction. The robotic part should be easier to fix than it not quite sounding like you. You can explore the learning rate, 1e-4 is a good default, but the size of the dataset can require a different learning rate. There might be other settings that can play a role. How many datapoints are left in your dataset after the cleaning etc? @Nestorovski I am not sure if I understand what you mean. I don't know the Paul voice you are referring to. You can load a different voice in the GUI by clicking the "Load Example of Voice to mimic" button and then selecting a corresponding audio file. In order to concatenate multiple audios spoken by different speakers, you will have to generate each audio individually and then concatenate them yourself. |
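The manual concatenation step mentioned above can be done with Python's standard `wave` module. The sketch below assumes that all generated files are PCM wav files with the same sample rate, sample width and channel count; `concatenate_wavs` is a name chosen for this example.

```python
import wave


def concatenate_wavs(input_paths, output_path):
    """Append several PCM wav files into one file, in the given order."""
    params = None
    frames = []
    for path in input_paths:
        with wave.open(path, "rb") as reader:
            current = (reader.getnchannels(), reader.getsampwidth(), reader.getframerate())
            if params is None:
                params = current
            elif current != params:
                raise ValueError(f"{path} has a different audio format: {current} vs {params}")
            frames.append(reader.readframes(reader.getnframes()))
    if params is None:
        raise ValueError("no input files given")
    with wave.open(output_path, "wb") as writer:
        writer.setnchannels(params[0])
        writer.setsampwidth(params[1])
        writer.setframerate(params[2])
        for chunk in frames:
            writer.writeframes(chunk)
```

Each speaker's line is generated into its own file first, and the files are then passed to this function in conversation order.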
I believe there are 1166 remaining datapoints. I captured that information somewhere, thinking it was important. I recall reading in the code:
I will make the change to the learning rate and let you know how it turns out. Oh, and thank you for the explanation. Now I understand WHY the cache isn't automatically discarded when you change your dataset. This makes a lot of sense when you are dealing with bigger issues (much larger datasets) than my own. Revisiting a question that I asked poorly: are there existing models/checkpoints that can be downloaded? The idea that I'm inquiring about is that applying fine-tuning to an existing voice could accelerate voice development for the end user. Speculative Ideas
No, I guess that can't be true. I watched a producer create a voice (on Google Colab), that I know very well over less than 10 minutes of YouTube data and it sounded pretty close. I could hear the missing inflections in the created voice but if I didn't know the voice already, I would say it was reasonably convincing.
|
Apologies, I just went back to view the code and to kick off another training run, but I was confused. lr=0.0001, # learning rate of the model. The toucantts_train_loop_arbiter.py already has a learning rate of 1e-4 (0.0001), so there would be no change for me. Do I need to adjust anything because of my batch size?
|
I ran the scorer, removed the top 60 mostly because I don't know what loss value is acceptable or good. problem Let's say that it is a problem with my training data. Other than trying to find a substitute pronunciation ("kow", Kau, Cau,Kaow,Kao), is there something that can be done to avoid retraining (once I find the offending transcription). |
I ran "Nancy" and stopped at ~85k steps. I listened to the audio produced by multiple checkpoints, ~(20k, 30k, 40k, 50k, 70k, 80k). I felt that the audio was getting worse (more distorted) instead of better. I'm determined to make this work but unable to determine how to get to a representative voice. Suggestion: Have the program automatically save checkpoints at predefined intervals, such as multiples of 10k or 15k steps. Currently, I'm looking at the checkpoints directory every so often and copying the checkpoints to an alternate folder, so I can review the quality of the produced audio. Previously, I was grabbing a checkpoint every 50k steps, but that's a long time (on my computer) to know if things are going well. |
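Until something like this exists in the toolkit, the manual copying described above could be scripted. The sketch below copies every `checkpoint_<step>.pt` file that has not been archived yet; the naming pattern is taken from the checkpoint names used earlier in this thread, and `archive_checkpoints` is a hypothetical helper. A file that is still being written while it is copied may end up incomplete, so it is safer to run this between epochs.

```python
import os
import re
import shutil

CHECKPOINT_PATTERN = re.compile(r"^checkpoint_(\d+)\.pt$")


def archive_checkpoints(save_dir, archive_dir):
    """Copy every not-yet-archived checkpoint_<step>.pt; return the new steps."""
    os.makedirs(archive_dir, exist_ok=True)
    copied = []
    for name in os.listdir(save_dir):
        match = CHECKPOINT_PATTERN.match(name)
        target = os.path.join(archive_dir, name)
        if match and not os.path.exists(target):
            shutil.copy2(os.path.join(save_dir, name), target)
            copied.append(int(match.group(1)))
    return sorted(copied)
```

Calling this periodically (for example from a cron job or a loop with a sleep) keeps a copy of checkpoints that the training run would otherwise rotate out.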
That was what we tried originally with the multispeaker model from which we tried to finetune a single-speaker version, but you found the similarity to your voice not convincing enough.
There's multiple ways of adapting a system to a new voice, but not all systems are equally good at this. The Toucan system is built for multilinguality, but cloning voices is not the main focus, so there are other systems (e.g. the one from Coqui) that are much better at this task because that's specifically what they were built for.
Yes, stuff like low-rank adaptation, or adapters more generally, works well for this. There are plenty of techniques, but I didn't get to investigate any of them yet for the TTS task, so I can't say for sure which are the best. Generally, they are kind of hard to integrate.
No, I don't think so. I just ran a test on a dataset of exactly 1000 samples trained from scratch for 20k steps using a batch size of 16 and a learning rate of 1e-4 with a smaller version of the model, and it worked totally fine. This leads me to believe that there might still be a problem with the data, or we just need to decrease the size of the model to make it work on just 1000 samples. Previously I trained from scratch on even fewer than 1000 samples, but I exchanged a component in the system in the meantime (normalizing flow to conditional flow-matching) and that might affect whether using this amount of datapoints is viable. So there are two things to try:
IMS-Toucan/Recipes/ToucanTTS_Nancy.py Line 45 in 6355e67
I made a small change earlier to fix something that prevented this from working, so you need to update your code before running this model configuration. I'm hoping that those two steps will finally get you to a voice clone you find sufficient. If not, maybe you should look at other systems. As I mentioned earlier, Toucan was built with other goals in mind, so while it can do a lot, it's probably not the best tool for this job. Now answering your remaining posts:
Yes, you don't look at the value, but at the distribution of values. If the top 10 e.g. look like ´[100, 20, 17, 1.1, 1.1, 1.1, 1.0, ...]´ then you notice that the first 3 are way higher than the rest, so they should probably go.
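One simple way to turn "way higher than the rest" into a number is to compare each loss against the median of the inspected losses. This is only a sketch of that heuristic, not how the scorer works; the function name and the factor of 5 are arbitrary assumptions. The median is only representative if enough samples are inspected, which is one reason to look at a large number of them.

```python
import statistics


def count_outliers(losses, factor=5.0):
    """Count how many losses lie far above the typical (median) loss."""
    if not losses:
        return 0
    threshold = factor * statistics.median(losses)
    return sum(1 for loss in losses if loss > threshold)


top_losses = [100, 20, 17, 1.1, 1.1, 1.1, 1.0]
print(count_outliers(top_losses))  # 3
```

With the example values the median is 1.1, so the threshold is 5.5 and the first three entries are counted as samples to exclude.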
That must be a typo in your test with a missing "w". If the text preprocessing sees "co." somewhere, it will be expanded to "company", as you can see in the following
Typically longer training always means better quality, except if there are too few datapoints or there is something wrong with the data, especially because the learningrate decays and thus the changes get smaller and smaller with time. So at some point it just stays about the same, but it typically never gets worse if there is enough data with correct labels. To see how the model changed over time, you can look at the visualization of the spectrogram in the training logs. |
One more thing about the efficient finetuning topic I wanted to mention: I have plans to include a very simple finetuning mode in the near future where fewer than 100 parameters are updated in the model. That should make finetuning possible with very, very few datapoints. However: If enough datapoints are available, finetuning fewer than 100 parameters will never reach the quality of finetuning the full 30,000,000 parameters of the model. Also: If enough datapoints are available, even finetuning the full model will probably never reach the quality of training the model from scratch. The big question to which I don't know the answer yet is: How many datapoints are "enough" for each of those three strategies? |
I'm looking forward to the progression and the simplified fine-tuning interface. Background: I have increased to 2.5 hours of high quality audio. Problem: I'm getting very rough sounding audio going from 20-70k steps. I'm pretty stumped. I haven't posted status but I've tried a few things in the last couple of weeks:
I'm trying to understand what could cause the noise since I've used the scorer to eliminate outliers from the dataset. Question
|
Yes, I also think that more than 70k or even 50k steps are not necessary on your data.
By default it loads the massively multilingual and multispeaker checkpoint from huggingface and uses that as the starting point.
Fine tuning in Toucan means continuing the training of an already trained model on new data, so that it becomes better at modeling the new data while having a lot of prior knowledge about how the acoustics of speech work and what kinds of prosody are appropriate for a text. So it's just more training, starting not from zero but from an already good model. The problem is that the model tends to forget things during finetuning, but also clings onto patterns that it doesn't really unlearn, because the new data is kind of similar to data it has seen before, so it's not changed enough to catch all the nuances of the new data. That's why you need much less data for fine tuning than for learning from scratch, but you might get worse results.
They don't need to have silences, but it doesn't hurt either. The only important thing is that it is consistent. If there is a silence, it should ideally always be the same length.
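If you want to check that consistency before training, a rough sketch could look like the following. It is not part of Toucan; the function names and the amplitude `threshold` are assumptions for illustration, and it expects each clip as a plain list of float samples in the range -1 to 1 that you have already loaded with whatever audio library you use.

```python
def leading_silence_seconds(samples, sample_rate, threshold=0.01):
    """Seconds before the first sample whose magnitude exceeds `threshold`."""
    for index, sample in enumerate(samples):
        if abs(sample) > threshold:
            return index / sample_rate
    # The whole clip is below the threshold.
    return len(samples) / sample_rate


def trailing_silence_seconds(samples, sample_rate, threshold=0.01):
    """Seconds after the last sample whose magnitude exceeds `threshold`."""
    return leading_silence_seconds(samples[::-1], sample_rate, threshold)


def silence_spread(clips, sample_rate, threshold=0.01):
    """Return (leading spread, trailing spread) in seconds across all clips.

    A spread close to zero means the silences are about the same length
    everywhere. `clips` must contain at least one clip.
    """
    leads = [leading_silence_seconds(c, sample_rate, threshold) for c in clips]
    trails = [trailing_silence_seconds(c, sample_rate, threshold) for c in clips]
    return max(leads) - min(leads), max(trails) - min(trails)
```

A simple amplitude threshold will be fooled by background noise, so treat the numbers as a hint about which clips to listen to rather than a verdict.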
I don't know anyone who did that in the current version. In the institute, we are mostly using the model for voice-privacy applications, so building single speaker models from scratch is more of a benchmark thing.
The synthetic voices are not part of the model. They are a condition signal. The model is always the same for all voices. |
I started again with 2-2.5 hours of audio, using Nancy and a newly created dataset to create a model that sounds like my voice. However, I am unable to determine if I am done training. I trained to 30k steps. The voice sounds better but it's still some distance from what I sound like. I don't know if I need to continue training for further steps or if I have reached the end of what I can hope to achieve. |
I extended the training run from 30k to 60k steps. The learning rate lowered and flattened (which I think is a good thing) but I did not hear an improvement in the output. I'll think about it and revisit. |
Looking at these loss curves, I think you have achieved the best performance. If you have more data, you can train for more steps, but for 2.5 hours, 30-60k steps seems perfectly fine to me. In my experience, if you have 5 hours of nice data, you can train for 80k steps and get a pretty good result. But since voice-cloning is not the primary focus of this toolkit, it might still not be quite up to your standards. At that point you get severely diminishing returns from adding more data or training for longer. |
To see convergence, you can zoom into the second half of the plot for the regression_loss and the stochastic_loss. If the plot looks pretty flat towards the end, then the model is done. If it is still going noticeably down, then it's not done and you should add steps. A little hack: You can just use a number of steps that is way too high, like 500k, and check the plot from time to time. Once the line gets pretty flat you just kill the process, regardless of the number of steps it is currently at. |
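The same eyeball check can be approximated in a few lines if you have the logged loss values as a list. This is only a sketch under assumptions: the function name, the `tail_fraction` and the `tolerance` are invented for illustration, and how you export regression_loss or stochastic_loss from your logger is up to you. It compares the average at the start of the second half with the average at the very end.

```python
def still_improving(losses, tail_fraction=0.5, tolerance=0.02):
    """Return True if the loss is still dropping noticeably near the end.

    losses:        logged loss values in training order.
    tail_fraction: how much of the end of the curve to inspect (0.5 = second half).
    tolerance:     hypothetical threshold; a relative drop below this counts as flat.
    """
    tail = losses[int(len(losses) * (1 - tail_fraction)):]
    if len(tail) < 4:
        # Too few points to say anything; assume training should continue.
        return True
    quarter = len(tail) // 4
    start = sum(tail[:quarter]) / quarter   # average at the beginning of the tail
    end = sum(tail[-quarter:]) / quarter    # average at the end of the tail
    if start == 0:
        return False
    return (start - end) / abs(start) >= tolerance
```

Averaging over a quarter of the tail smooths out step-to-step noise, which matters because the stochastic_loss in particular can jump around a lot between logging points.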
I have a few questions that I hope will not take much of your time.
My_Voice_mp3.zip
I'm ultimately looking for a clone close enough that I could fool myself. I get pretty close when I use RVC.
I tried the space on HuggingFace but the clone using any of my uploaded samples sounded like another guy.
I have approximately 1.5 hours of good quality audio, similar to the attached, so I can fine-tune if needed.
https://huggingface.co/spaces/Flux9665/MassivelyMultilingualTTS