Pronunciation correction in vector quantized PPG representation space

Work in progress. Inspired by the paper Zero-Shot Foreign Accent Conversion without a Native Reference.

This repository implements the translator module described in that paper.

Installation

  • Install ffmpeg.
  • Install Kaldi.
  • Install PyKaldi.
  • Install the packages listed in the environment.yml file.
  • Download pretrained TDNN-F model, extract it, and set PRETRAIN_ROOT in kaldi_scripts/extract_features_kaldi.sh to the pretrained model directory.

Dataset

  • Acoustic Model: LibriSpeech. Download the pretrained TDNN-F acoustic model here.
    • You also need to set KALDI_ROOT and PRETRAIN_ROOT in kaldi_scripts/extract_features_kaldi.sh accordingly.
  • Vector Quantization: ARCTIC and L2-ARCTIC; see here for the detailed training process.
  • Translator (i.e., the seq2seq model): ARCTIC and L2-ARCTIC. Please see here for a merged version. All the pretrained models are available here (to be updated).
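
For readers new to the vector quantization stage listed above, the core operation can be sketched as a nearest-codebook lookup. This is a generic illustration, not the training code linked above: `quantize` and `squared_distance` are hypothetical helper names, the codebook is assumed to be given already, and the quantizer used in this repository may differ.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector.

    Assumption: `frames` and `codebook` are lists of equal-length float lists.
    """
    indices = []
    for frame in frames:
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]),
        )
        indices.append(nearest)
    return indices
```

Calling `quantize(frames, codebook)` returns one codebook index per frame, so a sequence of continuous feature vectors becomes a sequence of discrete symbols.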

Directory layout (format your dataset to match the tree below)

dataset_root
├── speaker 1
├── speaker 2 
│   ├── wav          # contains all the wav files from speaker 2
│   └── kaldi        # Kaldi files (auto-generated after running the scripts in kaldi_scripts)
.
.
└── speaker N
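
A quick layout check can catch a misplaced folder before the Kaldi scripts run. The sketch below is not part of this repository: `find_speakers_missing_wavs` is a hypothetical helper, and it assumes that every speaker folder should hold a `wav` subfolder with at least one `.wav` file, as in the tree above.

```python
from pathlib import Path


def find_speakers_missing_wavs(dataset_root):
    """Return names of speaker folders that lack a wav/ subfolder with .wav files."""
    missing = []
    for speaker_dir in sorted(Path(dataset_root).iterdir()):
        if not speaker_dir.is_dir():
            continue  # ignore stray files at the top level
        wav_dir = speaker_dir / "wav"
        # A speaker is flagged if wav/ is absent or contains no .wav file.
        if not wav_dir.is_dir() or not any(wav_dir.glob("*.wav")):
            missing.append(speaker_dir.name)
    return missing
```

An empty list from `find_speakers_missing_wavs("/path/to/dataset")` means every speaker folder matches the expected layout; the `kaldi` subfolder is not checked because it is generated later.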

Quick Start

See the inference script

Training

  • Use Kaldi to extract bottleneck features (BNFs) for each speaker; repeat this for all speakers:
./kaldi_scripts/extract_features_kaldi.sh /path/to/speaker
  • Preprocessing:
python preprocess_bnfs.py path/to/dataset
python make_data.py  # edit the file to specify the dataset path
  • Vector-quantize the BNFs; see here.

  • Set the training parameters; see conf/.

  • Train the model:

./train.sh
  • Synthesizer code and training: see here.
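
The first training step has to be repeated for every speaker. A minimal sketch of that loop follows, assuming it is run from the repository root and that every subfolder of the dataset root is a speaker; `build_extract_commands` and `extract_all` are hypothetical helpers, not scripts shipped with this repository.

```python
import subprocess
from pathlib import Path

# Script path taken from the extraction step above; relative to the repository root.
EXTRACT_SCRIPT = "./kaldi_scripts/extract_features_kaldi.sh"


def build_extract_commands(dataset_root):
    """Build one extraction command per speaker folder under dataset_root."""
    speakers = sorted(p for p in Path(dataset_root).iterdir() if p.is_dir())
    return [[EXTRACT_SCRIPT, str(speaker)] for speaker in speakers]


def extract_all(dataset_root):
    """Run the extraction script for every speaker, stopping at the first failure."""
    for command in build_extract_commands(dataset_root):
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(command, check=True)
```

Keeping command construction separate from execution makes it possible to print and inspect the commands first; `extract_all("/path/to/dataset")` then runs them in sorted speaker order.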