From b00d22cd26df1c29e9fc6db5a51c7b7603aebd54 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Drago=C8=99?=
Date: Wed, 10 Jul 2024 14:51:41 +0200
Subject: [PATCH] Updating results with accurate time values

---
 NISV/res_labelled.md   | 17 ++---------------
 NISV/res_unlabelled.md | 17 ++---------------
 2 files changed, 4 insertions(+), 30 deletions(-)

diff --git a/NISV/res_labelled.md b/NISV/res_labelled.md
index 4bb4681..6aa3aaa 100644
--- a/NISV/res_labelled.md
+++ b/NISV/res_labelled.md
@@ -44,26 +44,13 @@ And a matrix with the **time** spent in total by each implementation **to load a
 |---|---|---|---|---|
 |[OpenAI](https://github.com/openai/whisper)|36m:06s|32m:41s|42m:08s|30m:25s|
 |[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)|21m:48s|19m:13s|23m:22s|22m:02s|
-|**[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)**|**1m:51s**|**2m:08s**|**1m:50s**|**2m:12s**|
-|[WhisperX](https://github.com/m-bain/whisperX/)**\***|11m:17s|15m:54s|11m:29s|15m:05s|
+|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|11m:40s|22m:27s|**11m:18s**|21m:56s|
+|**[WhisperX](https://github.com/m-bain/whisperX/)\***|**11m:17s**|**15m:54s**|11m:29s|**15m:05s**|
 
 \* For WhisperX, a separate alignment model based on wav2vec 2.0 has been applied in order to obtain word-level timestamps. Therefore, the time measured contains the time to load the model, time to transcribe, and time to align to generate timestamps. Speaker diarization has also been applied for WhisperX, which is measured separately and covered in [this section](./whisperx.md).
 
 <br>
 
-As well as the **time** spent in total by **faster-whisper** and **WhisperX** to **load, transcribe + save the output to files\***:
-
-|Load+transcribe+save output|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
-|---|---|---|---|---|
-|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|**11m:40s**|22m:27s|**11m:18s**|21m:56s|
-|[WhisperX](https://github.com/m-bain/whisperX/)\**|15m:45s|**20m:26s**|16m:01s|**19m:36s**|
-
-\* It has been noticed after benchmarking that, for these 2 implementations, saving the output takes unusually long.
-
-\** For WhisperX, this includes the entire pipeline (loading -> transcription -> alignment -> speaker diarization -> saving to file).
-
-<br>
-
 Finally, a matrix with the **maximum GPU memory consumption + maximum GPU power usage** of each implementation (**on average**):
 
 |Max. memory / Max. power|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
diff --git a/NISV/res_unlabelled.md b/NISV/res_unlabelled.md
index 177fb33..8970229 100644
--- a/NISV/res_unlabelled.md
+++ b/NISV/res_unlabelled.md
@@ -23,23 +23,10 @@ Here's a matrix with the **time** spent in total by each implementation **to loa
 |---|---|---|---|---|
 |[OpenAI](https://github.com/openai/whisper)|1h:43m:47s|1h:20m:29s|1h:57m:06s|1h:28m:50s|
 |[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)|43m:05s|1h:05m:17s|41m:39s|1h:01m:45s|
-|**[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)**|**4m:14s**|**4m:26s**|**3m:36s**|**5m:07s**|
-|[WhisperX](https://github.com/m-bain/whisperX/)*|26m:57s|31m:57s|27m:00s|31m:43s|
-
-\* For WhisperX, a separate alignment model based on wav2vec 2.0 has been applied in order to obtain word-level timestamps. Therefore, the time measured contains the time to load the model, time to transcribe, and time to align to generate timestamps. Speaker diarization has also been applied for WhisperX, which is measured separately and covered in a different section.
-
-<br>
-
-As well as the **time** spent in total by **faster-whisper** and **WhisperX** to **load, transcribe + save the output to files\***:
-
-|Load+transcribe+save output|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
-|---|---|---|---|---|
 |[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|39m:59s|1h:18m:34s|40m:07s|1h:23m:07s|
-|**[WhisperX](https://github.com/m-bain/whisperX/)\*\***|**39m:25s**|**44m:01s**|**39m:21s**|**43m:52s**|
+|**[WhisperX](https://github.com/m-bain/whisperX/)\***|**26m:57s**|**31m:57s**|**27m:00s**|**31m:43s**|
 
-\* It has been noticed after benchmarking that, for these 2 implementations, saving the output takes unusually long.
-
-\** For WhisperX, this includes the entire pipeline (loading -> transcription -> alignment -> speaker diarization -> saving to file).
+\* For WhisperX, a separate alignment model based on wav2vec 2.0 has been applied in order to obtain word-level timestamps. Therefore, the time measured contains the time to load the model, time to transcribe, and time to align to generate timestamps. Speaker diarization has also been applied for WhisperX, which is measured separately and covered in a different section.
 
 <br>
 