TalkSee is a speech-to-text application that lets users transcribe audio files or microphone input using WhisperAI ASR models.
The graphical user interface is powered by Streamlit.
- Provides a GUI to select a WhisperAI ASR model.
- Supports two modes of audio input: microphone input and file upload.
- Employs a WhisperAI ASR model to transcribe the user's audio input into text.
- Displays the transcribed text to the user.
The TalkSee web app relies on the following external libraries and resources:
- os: Provides the operating system interface.
- time: Provides time functionality.
- io: Provides input/output functionality.
- Streamlit: Provides the user interface framework.
- audio_recorder_streamlit: Provides the audio input stream.
- PyTorch: Provides the neural network library for GPU processing.
- WhisperAI ASR: Provides the speech recognition functionality.
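As a quick sanity check before launching the app, the sketch below reports which of the libraries above can be found in the current environment. The import names `streamlit`, `audio_recorder_streamlit`, `torch` and `whisper` are assumptions about how the listed packages are imported, and the helper name `check_dependencies` is hypothetical.

```python
from importlib.util import find_spec

# Assumed import names for the third-party packages listed above.
REQUIRED_MODULES = ["streamlit", "audio_recorder_streamlit", "torch", "whisper"]


def check_dependencies(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be located."""
    # find_spec locates a top-level module without importing it.
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'ok' if found else 'missing'}")
```

Any module reported as missing can usually be installed from the project's requirements.txt file.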
- Clone the repository:
gh repo clone PedroZappa/TalkSee
- Change the current directory to the cloned repository:
cd TalkSee
- Install the required packages from the requirements.txt file:
pip install -r requirements.txt
- Create a .streamlit/secrets.toml file and set the MODELS_PATH variable to the desired path:
mkdir -p .streamlit && echo 'MODELS_PATH="models"' >> .streamlit/secrets.toml
- Run the Streamlit application:
streamlit run main.py
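The configuration step can also be done from Python, which avoids depending on shell behavior. This is a minimal sketch using only the standard library. The helper name `write_models_path` is hypothetical, and it assumes the app reads `MODELS_PATH` from `.streamlit/secrets.toml` as described in the steps above.

```python
from pathlib import Path


def write_models_path(project_dir=".", models_path="models"):
    """Create .streamlit/secrets.toml and make sure it defines MODELS_PATH."""
    secrets_file = Path(project_dir) / ".streamlit" / "secrets.toml"
    # Create the .streamlit directory if it does not exist yet.
    secrets_file.parent.mkdir(parents=True, exist_ok=True)

    existing = secrets_file.read_text() if secrets_file.exists() else ""
    if "MODELS_PATH" not in existing:
        # Append rather than overwrite, so other secrets are preserved.
        with secrets_file.open("a") as fh:
            if existing and not existing.endswith("\n"):
                fh.write("\n")
            fh.write(f'MODELS_PATH="{models_path}"\n')
    return secrets_file


if __name__ == "__main__":
    print(f"Wrote {write_models_path()}")
```

Streamlit exposes values from this file through `st.secrets`, so the app would presumably read the setting with something like `st.secrets["MODELS_PATH"]`.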
- Streamlit-based user interface for easy interaction.
- Select a WhisperAI ASR model from the list of available models:
Size | Parameters | Multilingual model | Required VRAM | Relative speed |
---|---|---|---|---|
tiny | 39 M | tiny | ~1 GB | ~32x |
base | 74 M | base | ~1 GB | ~16x |
small | 244 M | small | ~2 GB | ~6x |
medium | 769 M | medium | ~5 GB | ~2x |
large | 1550 M | large | ~10 GB | 1x |
- Checks if CUDA is available for GPU processing, else runs on CPU.
- Support for both microphone input and audio file upload.
- Display of the transcribed text to the user.
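The model table and the CUDA check above can be sketched together as follows. This is an illustration under stated assumptions, not the app's actual code: the helper names `pick_device` and `largest_model_for` are hypothetical, the VRAM figures are the approximate values from the table, and the PyTorch import is optional so the sketch still runs where PyTorch is not installed.

```python
# Approximate VRAM needs in GB, copied from the table above.
# The entries are ordered from the smallest model to the largest.
MODEL_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}


def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch installed, this sketch falls back to the CPU label.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def largest_model_for(vram_gb):
    """Return the largest model whose listed VRAM need fits, or None."""
    fitting = [name for name, need in MODEL_VRAM_GB.items() if need <= vram_gb]
    # Relies on MODEL_VRAM_GB being ordered from smallest to largest.
    return fitting[-1] if fitting else None


if __name__ == "__main__":
    print(f"device: {pick_device()}, suggested model for 4 GB: {largest_model_for(4)}")
```

Assuming the openai-whisper package is the ASR backend, the chosen name and device could then be passed to `whisper.load_model(name, device=device, download_root=models_path)`, where `models_path` stands for the configured `MODELS_PATH` value.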
- Select a WhisperAI ASR model from the available options.
- Choose an input mode (Mic or File).
  - If using Mic, click the "microphone-icon" button to start recording audio. The recording stops automatically after 2 seconds of silence.
  - If using File, upload an audio file in .wav, .mp3 or .m4a format.
- Click the Transcribe button to transcribe the audio.
- The transcribed text is displayed in the "Transcription" section.
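Between receiving audio and transcribing it, an app like this typically has to validate the input and hand it to the model. The sketch below shows one way to do that with the standard library: it checks the extension against the formats listed above and writes the bytes to a temporary file. The helper name `save_audio_bytes` is hypothetical, and writing to a temporary file is an assumption about how the audio reaches the model, not a description of the app's code.

```python
import tempfile
from pathlib import Path

# Upload formats listed in the steps above.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a"}


def save_audio_bytes(data, filename):
    """Write audio bytes to a temporary file, keeping the original extension.

    Raises ValueError for unsupported extensions or empty input.
    """
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported audio format: {suffix or filename!r}")
    if not data:
        raise ValueError("No audio data received")

    # delete=False keeps the file on disk after the handle is closed,
    # so it can be opened again by path later.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
    return Path(tmp.name)
```

Assuming the openai-whisper package, the returned path could be transcribed with `model.transcribe(str(path))["text"]`. The caller is responsible for deleting the temporary file afterwards, for example with `path.unlink()`.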
Some possible future enhancements for TalkSee include:
- Support for mobile devices.
- Support for additional speech recognition models.
- Real-time transcription of live audio input.
- Integration with cloud storage services for seamless file upload and storage.
- Improved error handling and user feedback.
- Generate an image with the transcribed text as a prompt.