Note: this tool uses the v1p1beta1 Speech API endpoint.
The Google Cloud Speech-to-Text API offers a number of models and options. This program is designed to help determine which API options produce the lowest Word Error Rate (WER).
- Reference (also ground truth, expected, or gold standard): a file containing a human-produced text transcription of the audio
- Hypothesis: transcript text produced by Speech-to-Text
- STT: Speech-to-Text
- WER: Word Error Rate
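For context, WER is the word-level edit distance between the hypothesis and the reference (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch of the calculation, for illustration only and not this tool's actual implementation:

def wer(reference: str, hypothesis: str) -> float:
    # WER = (substitutions + deletions + insertions) / reference word count,
    # computed here via a standard word-level edit distance.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 0.1666...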
- If you do not have a Google account, create one
- Go to Cloud Speech to Text
- Click Try it Free or Get started for Free
- Sign in
- In Step 1 of 2 select your country, agree to terms, and click Continue
- In Step 2 of 2, fill out all the information and supply a payment method (Speech-to-Text is free unless you exceed fairly high usage caps)
See also: Before you begin
- In the left menu select APIs & Services > Library
- In the search bar, type speech
- Click Cloud Speech-to-Text API
- Click Enable
- In the Billing Required dialog click Enable Billing
- After enabling billing above, the API details are displayed. In the upper-right corner, click Create Credentials
- In the step "Find out what kind of credentials you need" pull down menu, select Cloud Speech-to-Text API
- Select the "No, I'm not using them" radio button in the question "Are you planning to use this API with App Engine or Compute Engine?"
- Click What credentials do I need?
- In the step "Create a service account", enter any name in the Service account name field (example: speech)
- In the Role pull down menu select Project > Owner
- The radio button for Key type should default to JSON. Leave it at this default
- Click Continue
- A dialog will appear stating "Service account and key created" and a JSON file will download
- Store the file anyplace you wish. (I like to leave it at root level)
- Follow the applicable steps in Set the environment variable GOOGLE_APPLICATION_CREDENTIALS
See also: Before you begin
Method 1: Run from the directory containing the JSON file
- cd to the path where the JSON file resides
- Activate the applicable virtual environment
path_to_json user$ source path_to_virtual_env/bin/activate
- Run your program
path_to_json user$ python3 path_to_source/run.py
Method 2: Supply the path on the command line
path_to_source user$ GOOGLE_APPLICATION_CREDENTIALS=~/your_credentials_file.json python3 run.py
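If you prefer, the variable can also be exported once per shell session (a standard shell technique, equivalent to repeating Method 2 on every run):

Method 3: Export the variable for the shell session
path_to_source user$ export GOOGLE_APPLICATION_CREDENTIALS=~/your_credentials_file.json
path_to_source user$ python3 run.py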
- Create a Python environment as described in Python Setup
- For best results, follow Best Practices
- A single directory containing both the audio files and the reference text files
- All audio files must use the same supported encoding type
- All audio files must have the same sample rate
Name each reference file with the same root name as its audio file. Reference files must end in .txt
Example:
audio_file_1.mp3
audio_file_1.txt
audio_file_2.mp3
audio_file_2.txt
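As an illustration of this naming convention (a sketch, not this tool's actual matching logic), audio and reference files pair up on their shared root name:

import pathlib

audio_dir = pathlib.Path("my_audio_dir")  # hypothetical directory name
for audio in sorted(audio_dir.glob("*.mp3")):
    reference = audio.with_suffix(".txt")
    if reference.exists():
        print(f"{audio.name} -> {reference.name}")
    else:
        print(f"missing reference for {audio.name}")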
A reference file should contain only text that is a human-derived transcription of the audio. This text may be called reference text, expected results text, or ground truth text. There is no need to add punctuation or any other kind of markup: punctuation is removed automatically, and other markup may produce anomalous WER results. Fairly standard transcription markers such as [cough] or [laugh] are removed automatically, but any transcription markings that are not enclosed in brackets will skew the calculated WER.
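For example, the kind of cleanup described above could look like this (a sketch; the tool's internal normalization may differ):

import re
import string

def clean_reference(text: str) -> str:
    # Drop bracketed transcription markers such as [cough] or [laugh].
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Strip punctuation and collapse whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split()).lower()

print(clean_reference("Hello there! [cough] It's a test."))  # "hello there its a test"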
There are a number of command line options available:
python3 run.py -cs "gs://some/bucket" -lr local_result_folder -m video phone call -l es-MX -hz 44200 -p phrase-file.txt -b 10 30 50 -alts es-ES es-DO -n LINEAR16
- -cs, --cloud_store_uri: Cloud storage uri where audio and ground truth expected reference transcriptions are stored
- -lr, --local_results_path: Local path to store generated results
- -n, --encoding: Specifies audio encoding type
- -m, --models: Space separated list of models to evaluate. Example "video phone"
- -l, --langs: Language code. Default is en-US
- -hz, --sample_rate_hertz: Specifies the sample rate. Example: 48000
- -e, --enhanced: Use the enhanced model(s) if specified by -m
- -ch, --multi: Integer indicating the number of channels if more than one
- -p, --phrase_file: Path to file containing comma separated phrases for speech adaptation
- -b, --boosts: Space separated list of integers for boosts to apply for speech adaptation
- -alts, --alternative_languages: Space separated list of language codes for auto language detection. Example en-IN en-US en-GB
The following text-normalization options are useful when accurate transcription of every single word is not critical, for example in sentiment analysis or labeling applications (see the sketch after this list):
- -ex, --expand: Expand all contractions. Example: "aren't" = "are not"
- -stop, --remove_stop_words: Remove stop words from all text
- -stem, --stem: Apply stemming
- -nw, --numbers_to_words: Convert numbers to words. Example "order number 123" = "order number one two three"
- -to, --transcriptions_only: If specified, the only output will be transcripts; no analysis will be done
- -a, --alts2prime: Use each alternative language as a primary language. Helpful when the audio might be mixed, for example en-US en-GB en-AU
- -q, --random_queue: Replace default queue.txt with randomly named queue file ####_queue.txt
- -fake, --fake_hyp: Use a fake hypothesis for testing. This will allow skipping sending audio to the API
- -limit, --limit: Limits audio processing to the given integer number of files. Useful when there is a lot of audio and you want some quick results first
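To make the normalization flags above concrete, here is a sketch of the kind of processing -stop and -stem imply (assuming NLTK; the tool's implementation may differ):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)  # one-time download of the stop word list

def normalize(text: str, remove_stop: bool = False, stem: bool = False) -> str:
    words = text.lower().split()
    if remove_stop:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    if stem:
        stemmer = PorterStemmer()
        words = [stemmer.stem(w) for w in words]
    return " ".join(words)

print(normalize("the quick foxes are running", remove_stop=True, stem=True))  # "quick fox run"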
If using phrases or classes by specifying the command line argument -p, the file must be a plain text file containing a comma separated list of phrases and/or classes.
some_phrase_file.txt:
pizza, burger, fries, pickles
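For illustration, the phrases and a boost value roughly map onto a SpeechContext in the v1p1beta1 API (a sketch assuming the google-cloud-speech client library; the tool builds its requests internally):

from google.cloud import speech_v1p1beta1 as speech

with open("some_phrase_file.txt") as f:
    phrases = [p.strip() for p in f.read().split(",") if p.strip()]

# One adaptation context with a single boost value; presumably the tool
# repeats this for each value supplied with -b.
context = speech.SpeechContext(phrases=phrases, boost=10.0)
config = speech.RecognitionConfig(
    language_code="en-US",
    speech_contexts=[context],
)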
A results.csv file will be written to the results directory you specified on the command line. The file contains the results of transcribing the audio for each combination of API options you specified.
An HTML file with a similar naming pattern will be written for each transcription; it highlights the differences between the reference text and the API results.
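If you want a similar side-by-side view for your own text, Python's standard library can produce one (a sketch; this tool's HTML output format may differ):

import difflib

reference = "the quick brown fox jumps".split()
hypothesis = "the quick brown box jumps".split()
html = difflib.HtmlDiff().make_file(reference, hypothesis,
                                    fromdesc="reference", todesc="hypothesis")
with open("diff.html", "w") as f:
    f.write(html)  # open diff.html in a browser to see highlighted differences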