Echo Keeper

Create your own datasets for training speech synthesis models

What

Create voice datasets in the LJ Speech format (1). Export your dataset and use it to train a model with Coqui 🐸 TTS.

Quickstart 1: Just record and transcribe audio

Run container

Choose the directory where you would like to retrieve your exported dataset. Bind mount that directory to /app/backend/user/projects with the volume option (-v), as below:

docker run -p 8000:8000 -v /path/to/your/export/dir:/app/backend/user/projects echokeeper

Default prompts?

Running the container this way loads these prompts. You could record a 25-minute dataset using them, but the result will probably not fit your use case. I recommend building your own prompts with this notebook.

Quickstart 2: Bring your own prompts

If you want to use your own prompts...

Run container

  1. Choose the directory where you would like to retrieve your exported dataset. In this example, we'll call that directory 'output'.
  2. Create a directory structure like this:
  /output
  |-/projects
    |--leave this dir empty if you are starting from scratch
    |--if you want to continue recording to a project, place that project folder here
  |-/prompts
    |--prompt.json
  3. Bind mount your output dir to /app/backend/user:
  docker run -p 8000:8000 -v /path/to/your/export/dir:/app/backend/user echokeeper
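
If you would rather script the directory layout than create it by hand, here is a minimal sketch. The function name create_layout is hypothetical and not part of Echo Keeper. It only creates the two directories; prompt.json is left for you to supply, since its schema is not described here.

```python
from pathlib import Path


def create_layout(root):
    """Create the projects/ and prompts/ directories under root.

    Hypothetical helper, not part of Echo Keeper. It does not write
    prompt.json; copy your own file into prompts/ afterwards.
    """
    root = Path(root)
    projects = root / "projects"
    prompts = root / "prompts"
    # exist_ok=True makes the call safe to repeat on an existing layout
    projects.mkdir(parents=True, exist_ok=True)
    prompts.mkdir(parents=True, exist_ok=True)
    return {"projects": projects, "prompts": prompts}
```

Calling create_layout("output") and then copying your prompt.json into output/prompts gives the structure shown above, ready to bind mount.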

Use

Create a new project

Type a name for your project into the New Project input.

Select model parameters for Whisper

Whisper is the magic behind this whole app. It transcribes your recorded audio into text. Select your language and model size. The bigger models perform better but require more RAM and disk space.
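
To give a feel for what these two settings mean, here is a rough sketch of calling the openai-whisper Python package directly. It is not Echo Keeper's actual code. The helper names pick_model_name and transcribe are hypothetical, and whether the app prefers the English-only checkpoints is an assumption made for illustration.

```python
# Sizes for which the openai-whisper package ships English-only
# checkpoints (for example "base.en").
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}


def pick_model_name(size, language):
    """Return the checkpoint name for a size and language code.

    Assumption: prefer the English-only checkpoint when one exists.
    Echo Keeper may choose differently.
    """
    if language == "en" and size in ENGLISH_ONLY_SIZES:
        return f"{size}.en"
    return size


def transcribe(wav_path, size="base", language="en"):
    """Transcribe one recording and return its text."""
    # Imported lazily: needs `pip install openai-whisper` and ffmpeg.
    import whisper

    # load_model downloads the checkpoint on first use
    model = whisper.load_model(pick_model_name(size, language))
    result = model.transcribe(wav_path, language=language)
    return result["text"].strip()
```

Larger sizes load a bigger checkpoint, which is where the extra RAM and disk usage comes from.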

Read a prompt and record audio

The model will be downloaded after you make your first recording. Unfortunately, this means there will be a longish pause between the moment that you finish your first recording and the moment that you see your first transcription. Every following transcription will feel instantaneous.

Manual Setup

Run on your machine without a container

  1. Whisper requires the command-line tool ffmpeg, and this project also needs portaudio, to be installed on your system. Both are available from most package managers:
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg
sudo apt install portaudio19-dev

# on Arch Linux
sudo pacman -S ffmpeg
sudo pacman -S portaudio

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg
brew install portaudio

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
  2. Install the backend and frontend environment:
  sh install_playground.sh
  3. Run the backend:
  cd backend && source venv/bin/activate && flask run --port 8000
  4. In a different terminal, run the React frontend:
  cd interface && yarn start
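
Before running the steps above, it can help to confirm that the command-line tools are on your PATH. The following is a minimal sketch. The function name missing_tools is hypothetical, and the default tool list is an assumption based on the commands in this section. portaudio is a library rather than a command, so it is not checked here.

```python
import shutil


def missing_tools(tools=("ffmpeg", "yarn")):
    """Return the names of the given tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All command-line tools found.")
```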

Work in Progress

  • [] Normalize dataset audio during the Export phase
  • [] Track duration of recordings
  • [] In export_metadata_txt(), create the third column of the ljspeech format, per Keith Ito's spec

License

This repository and the code and model weights of Whisper are released under the MIT License.

Footnotes

(1) https://keithito.com/LJ-Speech-Dataset/ Metadata is provided in transcripts.csv. This file consists of one record per line, delimited by the pipe character (0x7c). The fields are:

ID: this is the name of the corresponding .wav file
Transcription: words spoken by the reader (UTF-8)
Normalized Transcription: transcription with numbers, ordinals, and monetary units expanded into full words (UTF-8).

Note: Normalized transcriptions aren't complete as of 5/6/23.
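
As an illustration of the three-field, pipe-delimited record described above, here is a minimal sketch of writing such a file. It is not the project's export_metadata_txt(). The names format_row and write_metadata are hypothetical, and falling back to the raw transcription when no normalized text is available is an assumption.

```python
from pathlib import Path


def format_row(file_id, transcription, normalized=None):
    """Build one 'ID|Transcription|Normalized Transcription' record."""
    if normalized is None:
        # Assumption: reuse the raw transcription until normalization exists
        normalized = transcription
    fields = (file_id, transcription, normalized)
    for field in fields:
        # A pipe or newline inside a field would corrupt the record
        if "|" in field or "\n" in field:
            raise ValueError(f"field contains a delimiter: {field!r}")
    return "|".join(fields)


def write_metadata(rows, path):
    """Write records, one per line, as UTF-8."""
    lines = [format_row(*row) for row in rows]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

For example, format_row("rec_0001", "It cost $5.", "It cost five dollars.") returns "rec_0001|It cost $5.|It cost five dollars.".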