More details about wav2mask ? #124

My-captain · 2023-04-07T12:49:09Z

Hi @jianfch,
Thanks for your awesome work! I tested the repo on Mandarin songs, and the token alignment timestamps generated by the algorithm are more accurate.
But I'm having a little trouble reading the source code and understanding your algorithm. I'm stuck on wav2mask, especially the principle behind audio2loudness.
I am a beginner in the field of speech processing, so please forgive my stupidity. I would really appreciate it if you could give me some hints.

jianfch · 2023-04-08T17:00:15Z

jianfch
Apr 8, 2023
Maintainer

audio2loudness converts the waveform into a list of amplitude values that correspond to the 1500 timestamp tokens in the prediction (i.e. each covers 0.02s and the total is 30s). Essentially, it tells you how loud each 0.02s chunk of the audio is. wav2mask does loudness equalization on that list of amplitude values and quantizes those values (i.e. it zeros the low values). Then it converts this list of values into a mask for the 1500 timestamp tokens as a way to tell the decoder which timestamp values to ignore, because their relative loudness tells us that those timestamps are silent. On the other hand, vad=True generates this mask using another neural net.
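
To make the idea concrete, here is a rough sketch of the two steps in plain NumPy. It is not the stable-ts code: `chunk_loudness` and `silence_mask` are hypothetical names, the 16 kHz sample rate and the 0.1 threshold are assumptions, and a simple divide-by-peak stands in for the real loudness equalization and quantization.

```python
import numpy as np

SAMPLE_RATE = 16_000   # assumption: 16 kHz mono audio
N_TIMESTAMPS = 1500    # timestamp tokens per 30 s window (from the answer above)
CHUNK = SAMPLE_RATE * 30 // N_TIMESTAMPS  # 320 samples = 0.02 s


def chunk_loudness(wav: np.ndarray) -> np.ndarray:
    """Hypothetical helper: mean absolute amplitude of each 0.02 s chunk."""
    wav = np.asarray(wav, dtype=np.float32)[: CHUNK * N_TIMESTAMPS]
    # Zero-pad short audio so the result always has 1500 entries.
    wav = np.pad(wav, (0, CHUNK * N_TIMESTAMPS - wav.size))
    return np.abs(wav).reshape(N_TIMESTAMPS, CHUNK).mean(axis=1)


def silence_mask(loudness: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """Hypothetical helper: True where a timestamp looks silent."""
    peak = loudness.max()
    if peak == 0:
        return np.ones_like(loudness, dtype=bool)  # everything is silent
    relative = loudness / peak  # crude stand-in for loudness equalization
    return relative < threshold  # low relative loudness -> ignore timestamp


# Demo: 30 s of a 220 Hz tone with the first 10 s zeroed out.
t = np.arange(SAMPLE_RATE * 30) / SAMPLE_RATE
wav = np.sin(2 * np.pi * 220 * t).astype(np.float32)
wav[: SAMPLE_RATE * 10] = 0.0

loudness = chunk_loudness(wav)
mask = silence_mask(loudness)
print(loudness.shape, int(mask.sum()))
```

In this toy input the first 10 s map to the first 500 of the 1500 positions, so those are the ones the mask flags as silent.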

4 replies

My-captain Apr 10, 2023
Author

Thanks for your reply, it's clear.
👍😊

dgoryeo May 6, 2023

Hi @jianfch, in my normal workflow I pre-process my audio with the Audacity compressor or ffmpeg speechnorm to make the quieter parts louder.

Do I understand correctly (from this discussion) that by using stable-ts I won't need to do any pre-processing, since stable-ts already performs that on every input audio?

Thanks

jianfch May 6, 2023
Maintainer

@dgoryeo, if you're trying to make the quieter parts louder because the model does not do as well with the quieter parts, then you should keep your normal workflow. The preprocessing in stable-ts is only for generating a mask for the timestamps, so it does not alter the audio that is presented to the model (except with only_voice_freq=True).
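
As a rough illustration of that distinction, the sketch below applies a silence mask to a made-up vector of timestamp logits and leaves the audio array untouched. How stable-ts actually applies the mask inside the decoder is not shown in this thread, so setting the masked logits to `-inf` is an assumption, and `suppress_silent_timestamps` is a hypothetical name.

```python
import numpy as np

N_TIMESTAMPS = 1500  # timestamp tokens per 30 s window


def suppress_silent_timestamps(ts_logits, silent):
    """Hypothetical helper: copy of the logits with silent timestamps set to -inf."""
    out = np.array(ts_logits, dtype=np.float32, copy=True)
    out[np.asarray(silent, dtype=bool)] = -np.inf  # assumption: hard suppression
    return out


rng = np.random.default_rng(0)
audio = rng.standard_normal(16_000).astype(np.float32)  # stand-in for model input
audio_before = audio.copy()

ts_logits = rng.standard_normal(N_TIMESTAMPS).astype(np.float32)  # made-up scores
silent = np.zeros(N_TIMESTAMPS, dtype=bool)
silent[:500] = True  # pretend the first 10 s are silent

masked = suppress_silent_timestamps(ts_logits, silent)
best = int(np.argmax(masked))  # can only land on a non-silent timestamp

print(best >= 500, np.array_equal(audio, audio_before))
```

The mask only changes which timestamps can be chosen; quiet speech in `audio` is exactly as quiet as before, which is why a compressor or speechnorm step still has to happen beforehand if the model needs it.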

dgoryeo May 7, 2023

Thanks @jianfch. On a side note: I've gotten hooked on demucs. It has improved my transcriptions a lot. Thanks for that feature.
