Skip to content

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

License

Notifications You must be signed in to change notification settings

AggressiveEnvy/silero-models

 
 

Repository files navigation

Mailing list : test Mailing list : test License: CC BY-NC 4.0

Donations Backers Sponsors

header)

Silero Models

Silero Models: pre-trained enterprise-grade STT / TTS models and benchmarks.

Enterprise-grade STT made refreshingly simple (seriously, see benchmarks). We provide quality comparable to Google's STT (and sometimes even better) and we are not Google.

As a bonus:

  • No Kaldi;
  • No compilation;
  • No 20-step instructions;

Also we have published TTS models that satisfy the following criteria:

  • One-line usage;
  • A large library of voices;
  • A fully end-to-end pipeline;
  • Naturally sounding speech;
  • No GPU or training required;
  • Minimalism and lack of dependencies;
  • Faster than real-time on one CPU thread (!!!);
  • Support for 16kHz and 8kHz out of the box;

Speech-To-Text

All of the provided models are listed in the models.yml file. Any meta-data and newer versions will be added there.

Currently we provide the following checkpoints:

PyTorch ONNX Quantization Quality Colab
English (en_v3) ✔️ ✔️ ✔️ link Open In Colab
German (de_v1) ✔️ ✔️ link Open In Colab
Spanish (es_v1) ✔️ ✔️ link Open In Colab
Ukrainian (ua_v3) ✔️ ✔️ ✔️ N/A Open In Colab

Model flavours:

jit jit jit jit_q jit_q onnx onnx onnx
xsmall small large xsmall small xsmall small large
English (en_v3) ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
German (de_v1) ✔️ ✔️
Spanish (es_v1) ✔️ ✔️
Ukrainian (ua_v3 ✔️ ✔️ ✔️

Dependencies

  • All examples:
    • torch, 1.8+ (used to clone the repo in tf and onnx examples), breaking changes for version older than 1.6
    • torchaudio, latest version bound to PyTorch should work
    • omegaconf, latest just should work
  • Additional for ONNX examples:
    • onnx, latest just should work
    • onnxruntime, latest just should work
  • Additional for TensorFlow examples:
    • tensorflow, latest just should work
    • tensorflow_hub, latest just should work

Please see the provided Colab for details for each example below. All examples are maintained to work with the latest major packaged versions of the installed libraries.

PyTorch

Open In Colab

Open on Torch Hub

import torch
import zipfile
import torchaudio
from glob import glob

device = torch.device('cpu')  # gpu also works, but our models are fast enough for CPU
model, decoder, utils = torch.hub.load(repo_or_dir='snakers4/silero-models',
                                       model='silero_stt',
                                       language='en', # also available 'de', 'es'
                                       device=device)
(read_batch, split_into_batches,
 read_audio, prepare_model_input) = utils  # see function signature for details

# download a single file, any format compatible with TorchAudio
torch.hub.download_url_to_file('https://opus-codec.org/static/examples/samples/speech_orig.wav',
                               dst ='speech_orig.wav', progress=True)
test_files = glob('speech_orig.wav')
batches = split_into_batches(test_files, batch_size=10)
input = prepare_model_input(read_batch(batches[0]),
                            device=device)

output = model(input)
for example in output:
    print(decoder(example.cpu()))

ONNX

Open In Colab

You can run our model everywhere, where you can import the ONNX model or run ONNX runtime.

import onnx
import torch
import onnxruntime
from omegaconf import OmegaConf

language = 'en' # also available 'de', 'es'

# load provided utils
_, decoder, utils = torch.hub.load(repo_or_dir='snakers4/silero-models', model='silero_stt', language=language)
(read_batch, split_into_batches,
 read_audio, prepare_model_input) = utils

# see available models
torch.hub.download_url_to_file('https://raw.githubusercontent.com/snakers4/silero-models/master/models.yml', 'models.yml')
models = OmegaConf.load('models.yml')
available_languages = list(models.stt_models.keys())
assert language in available_languages

# load the actual ONNX model
torch.hub.download_url_to_file(models.stt_models.en.latest.onnx, 'model.onnx', progress=True)
onnx_model = onnx.load('model.onnx')
onnx.checker.check_model(onnx_model)
ort_session = onnxruntime.InferenceSession('model.onnx')

# download a single file, any format compatible with TorchAudio
torch.hub.download_url_to_file('https://opus-codec.org/static/examples/samples/speech_orig.wav', dst ='speech_orig.wav', progress=True)
test_files = ['speech_orig.wav']
batches = split_into_batches(test_files, batch_size=10)
input = prepare_model_input(read_batch(batches[0]))

# actual onnx inference and decoding
onnx_input = input.detach().cpu().numpy()
ort_inputs = {'input': onnx_input}
ort_outs = ort_session.run(None, ort_inputs)
decoded = decoder(torch.Tensor(ort_outs[0])[0])
print(decoded)

TensorFlow

Open In Colab

SavedModel example

import os
import torch
import subprocess
import tensorflow as tf
import tensorflow_hub as tf_hub
from omegaconf import OmegaConf

language = 'en' # also available 'de', 'es'

# load provided utils using torch.hub for brevity
_, decoder, utils = torch.hub.load(repo_or_dir='snakers4/silero-models', model='silero_stt', language=language)
(read_batch, split_into_batches,
 read_audio, prepare_model_input) = utils

# see available models
torch.hub.download_url_to_file('https://raw.githubusercontent.com/snakers4/silero-models/master/models.yml', 'models.yml')
models = OmegaConf.load('models.yml')
available_languages = list(models.stt_models.keys())
assert language in available_languages

# load the actual tf model
torch.hub.download_url_to_file(models.stt_models.en.latest.tf, 'tf_model.tar.gz')
subprocess.run('rm -rf tf_model && mkdir tf_model && tar xzfv tf_model.tar.gz -C tf_model',  shell=True, check=True)
tf_model = tf.saved_model.load('tf_model')

# download a single file, any format compatible with TorchAudio
torch.hub.download_url_to_file('https://opus-codec.org/static/examples/samples/speech_orig.wav', dst ='speech_orig.wav', progress=True)
test_files = ['speech_orig.wav']
batches = split_into_batches(test_files, batch_size=10)
input = prepare_model_input(read_batch(batches[0]))

# tf inference
res = tf_model.signatures["serving_default"](tf.constant(input.numpy()))['output_0']
print(decoder(torch.Tensor(res.numpy())[0]))

Text-To-Speech

All of the provided models are listed in the models.yml file. Any meta-data and newer versions will be added there.

Currently we provide the following speakers:

Speaker Stress Language SR PyTorch Colab
aidar_8khz yes ru 8000 ✔️ Open In Colab
baya_8khz yes ru 8000 ✔️ Open In Colab
ksenia_8khz yes ru 8000 ✔️ Open In Colab
irina_8khz yes ru 8000 ✔️ Open In Colab
natasha_8khz yes ru 8000 ✔️ Open In Colab
ruslan_8khz yes ru 8000 ✔️ Open In Colab
lj_8khz no en 8000 ✔️ Open In Colab
thorsten_8khz no de 8000 ✔️ Open In Colab
gilles_8khz no fr 8000 ✔️ Open In Colab
tux_8khz no es 8000 ✔️ Open In Colab
aidar_16khz yes ru 16000 ✔️ Open In Colab
baya_16khz yes ru 16000 ✔️ Open In Colab
ksenia_16khz yes ru 16000 ✔️ Open In Colab
irina_16khz yes ru 16000 ✔️ Open In Colab
natasha_16khz yes ru 16000 ✔️ Open In Colab
ruslan_16khz yes ru 16000 ✔️ Open In Colab
lj_16khz no en 16000 ✔️ Open In Colab
thorsten_16khz no de 16000 ✔️ Open In Colab
gilles_16khz no fr 16000 ✔️ Open In Colab
tux_16khz no es 16000 ✔️ Open In Colab

Dependencies

Basic dependencies (see colab):

  • torch, 1.8+ (used to clone the repo in tf and onnx examples), breaking changes for version older than 1.6
  • torchaudio, latest version bound to PyTorch should work (required only because models are hosted together with STT, not required for work)
  • omegaconf, latest (can be removed as well, if you do not load all of the configs)

PyTorch

Open In Colab

Open on Torch Hub

import torch

language = 'ru'
speaker = 'kseniya_16khz'
device = torch.device('cpu')
model, symbols, sample_rate, example_text, apply_tts = torch.hub.load(repo_or_dir='snakers4/silero-models',
                                                                      model='silero_tts',
                                                                      language=language,
                                                                      speaker=speaker)
model = model.to(device)  # gpu or cpu
audio = apply_tts(texts=[example_text],
                  model=model,
                  sample_rate=sample_rate,
                  symbols=symbols,
                  device=device)

FAQ

Wiki

Also check out our wiki.

Performance and Quality

Please refer to this wiki sections:

Adding new Languages

Please refer here.

Contact

Get in Touch

Try our models, create an issue, join our chat, email us, read our news.

Commercial Inquiries

Please see our wiki and tiers for relevant information and email us.

Citations

@misc{Silero Models,
  author = {Silero Team},
  title = {Silero Models: pre-trained enterprise-grade STT / TTS models and benchmarks},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-models}},
  commit = {insert_some_commit_here},
  email = {hello@silero.ai}
}

Further reading

English

  • STT:

    • Towards an Imagenet Moment For Speech-To-Text - link
    • A Speech-To-Text Practitioners Criticisms of Industry and Academia - link
    • Modern Google-level STT Models Released - link
  • TTS:

    • High-Quality Text-to-Speech Made Accessible, Simple and Fast - link
  • VAD:

    • Modern Portable Voice Activity Detector Released - link

Chinese

  • STT:
    • 迈向语音识别领域的 ImageNet 时刻 - link
    • 语音领域学术界和工业界的七宗罪 - link

Russian

  • STT

    • Мы опубликовали современные STT модели сравнимые по качеству с Google - link
    • Понижаем барьеры на вход в распознавание речи - link
    • Огромный открытый датасет русской речи версия 1.0 - link
    • Насколько Быстрой Можно Сделать Систему STT? - link
    • Наша система Speech-To-Text - link
    • Speech To Text - link
  • TTS:

    • Мы Опубликовали Качественный, Простой, Доступный и Быстрый Синтез Речи - link
  • VAD:

    • Мы опубликовали современный Voice Activity Detector и не только -link

Donations

Please use the "sponsor" button.

About

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Resources

License

Code of conduct

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Jupyter Notebook 63.8%
  • Python 36.2%