Commit 8e9501b: Add faster-whisper w/ batching results + RTF info
greenw0lf committed Jul 15, 2024 · 1 parent aa54c03

Showing 2 changed files with 37 additions and 4 deletions.
NISV/res_labelled.md (22 changes: 19 additions & 3 deletions)

This data does not reflect the type of content on which the ASR will be applied.

For each Whisper implementation, the following variables have been modified:
- The model version: `large-v2` vs. `large-v3` (to confirm the hypothesis from the UT evaluation)
- The compute type: `float16` vs. `float32`
- **For Huggingface, WhisperX, faster-whisper with batching:** `batch_size`
- `2` for HF
- `64` for `WhisperX float16`, `16` for `WhisperX float32`
- `64` for faster-whisper with batching

The compute type refers to the data type used to represent real numbers such as the weights of the Whisper model. In our case, `float16`, also known as **half-precision**, uses 16 bits to store a single floating-point number, whereas `float32`, known as **single-precision**, uses 32 bits. Across deep learning applications, `float16` is known to use less memory and run faster, at the cost of some accuracy. In the case of Whisper, however, it has been reported that `float16` leads to only a 0.1% increase in WER while significantly reducing the time and memory required to run the model.
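
As a concrete illustration of how these two variables are set, here is a minimal sketch using the faster-whisper API (`BatchedInferencePipeline` comes from the batching PR linked in the tables below; the names reflect the merged version and may differ while the PR is under review, and `audio.wav` is a placeholder):

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

# Compute type: "float16" (half-precision) or "float32" (single-precision).
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# Sequential (non-batched) transcription.
segments, info = model.transcribe("audio.wav")

# Batched transcription: a higher batch_size trades extra GPU memory for speed.
batched = BatchedInferencePipeline(model=model)
segments, info = batched.transcribe("audio.wav", batch_size=64)

# segments is a generator; iterating over it runs the actual decoding.
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```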

Here is a matrix with **WER** results of the baseline implementation from OpenAI and of the other implementations:

|WER|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
|---|---|---|---|---|
|[OpenAI](https://github.com/openai/whisper)|11.1%|11.0%|12.9%|13.2%|
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)|17.1%|16.9%|16.6%|16.6%|
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|**10.3%**|10.3%|**11.8%**|**11.8%**|
|**[faster-whisper w/ batching](https://github.com/SYSTRAN/faster-whisper/pull/856)**|**10.3%**|**10.2%**|12.4%|12.4%|
|[WhisperX](https://github.com/m-bain/whisperX/)|12.3%|12.4%|13.0%|12.9%|

<br>
And a matrix with the **time** spent in total by each implementation **to load and transcribe the audio** (2.23 hours of speech):

|Time|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
|---|---|---|---|---|
|[OpenAI](https://github.com/openai/whisper)|36m:06s|32m:41s|42m:08s|30m:25s|
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)|21m:48s|19m:13s|23m:22s|22m:02s|
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|11m:39s|22m:29s|11m:04s|24m:04s|
|**[faster-whisper w/ batching](https://github.com/SYSTRAN/faster-whisper/pull/856)**|**4m:45s**|**8m:52s**|**4m:25s**|**8m:28s**|
|[WhisperX](https://github.com/m-bain/whisperX/)\*|11m:34s|15m:15s|11m:01s|15m:04s|

\* For WhisperX, a separate alignment model based on wav2vec 2.0 has been applied in order to obtain word-level timestamps. Therefore, the time measured contains the time to load the model, time to transcribe, and time to align to generate timestamps. Speaker diarization has also been applied for WhisperX, which is measured separately and covered in [this section](./whisperx.md).

<br>

Here's also a matrix with the **Real-Time Factor (RTF)**, defined as the time to process all of the input divided by the duration of the input, for transcribing **2.23 hours of speech** (rounded to 4 decimals):

|RTF (process time/duration of audio)|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
|---|---|---|---|---|
|[OpenAI](https://github.com/openai/whisper)|0.2698|0.2443|0.3149|0.2273|
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)|0.1629|0.1436|0.1746|0.1647|
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|0.0871|0.168|0.0827|0.1799|
|**[faster-whisper w/ batching](https://github.com/SYSTRAN/faster-whisper/pull/856)**|**0.0355**|**0.0663**|**0.033**|**0.0633**|
|[WhisperX](https://github.com/m-bain/whisperX/)\*|0.0864|0.114|0.0823|0.1126|
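
These RTF values follow directly from the time matrix above; as a quick sanity check, for faster-whisper with batching (`large-v2`, `float16`):

```python
# RTF = time to process the input / duration of the input
process_time = 4 * 60 + 45        # 4m:45s -> 285 seconds
audio_duration = 2.23 * 3600      # 2.23 hours -> 8028 seconds
print(round(process_time / audio_duration, 4))  # 0.0355
```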

<br>

Finally, a matrix with the **maximum GPU memory consumption + maximum GPU power usage** of each implementation (**on average**):

|Max. memory / Max. power|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
|---|---|---|---|---|
|[OpenAI](https://github.com/openai/whisper)|10621 MiB / 240 W|10639 MiB / 264 W|10927 MiB / 238 W|10941 MiB / 266 W|
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)*|15073 MiB / **141 W**|12981 MiB / **215 W**|14566 MiB / **123 W**|19385 MiB / **235 W**|
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|**4287 MiB** / 230 W|**7776 MiB** / 263 W|**4292 MiB** / 230 W|**7768 MiB** / 262 W|
|[faster-whisper w/ batching](https://github.com/SYSTRAN/faster-whisper/pull/856)*|5616 MiB / 243 W|9893 MiB / 264 W|5601 MiB / 242 W|9877 MiB / 264 W|
|[WhisperX](https://github.com/m-bain/whisperX/)*|9947 MiB / 249 W|13940 MiB / 252 W|9944 MiB / 250 W|14094 MiB / 254 W|

\* For these implementations, batching is supported. Setting a higher `batch_size` will lead to faster inference at the cost of extra memory used.
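
The commit does not specify how these figures were collected; a minimal sketch of one way to record peak GPU memory and power is to poll NVML while a transcription job runs, e.g. with the `pynvml` bindings (device index 0 assumed):

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes the job runs on GPU 0

max_mem_mib, max_power_w = 0.0, 0.0
try:
    while True:  # poll until interrupted, while the transcription job runs
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**2  # bytes -> MiB
        power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0      # mW -> W
        max_mem_mib = max(max_mem_mib, mem)
        max_power_w = max(max_power_w, power)
        time.sleep(0.5)
except KeyboardInterrupt:
    print(f"Max. memory: {max_mem_mib:.0f} MiB / Max. power: {max_power_w:.0f} W")
finally:
    pynvml.nvmlShutdown()
```
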
NISV/res_unlabelled.md (19 changes: 18 additions & 1 deletion)

For each Whisper implementation, the following variables have been modified:
- `44` for `float16 large-v2`
- `48` for `float16 large-v3`
- `16` for `float32 large-v2/large-v3`
- For `faster-whisper w/ batching`:
- `40` for `float16`
- `16` for `float32`

<br>

Here's a matrix with the **time** spent in total by each implementation **to load and transcribe the audio** (9.02 hours of speech):

|Time|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
|---|---|---|---|---|
|[OpenAI](https://github.com/openai/whisper)|1h:43m:47s|1h:20m:29s|1h:57m:06s|1h:28m:50s|
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)|43m:05s|1h:05m:17s|41m:39s|1h:01m:45s|
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|38m:52s|1h:17m:38s|39m:26s|1h:24m:21s|
|**[faster-whisper w/ batching](https://github.com/SYSTRAN/faster-whisper/pull/856)**|**12m:31s**|**23m:35s**|**10m:50s**|**22m:17s**|
|[WhisperX](https://github.com/m-bain/whisperX/)\*|24m:52s|32m:01s|25m:42s|31m:24s|

\* For WhisperX, a separate alignment model based on wav2vec 2.0 has been applied in order to obtain word-level timestamps. Therefore, the time measured contains the time to load the model, time to transcribe, and time to align to generate timestamps. Speaker diarization has also been applied for WhisperX, which is measured separately and covered in a different section.

<br>

Here's also a matrix with the **Real-Time Factor (RTF)**, defined as the time to process all of the input divided by the duration of the input, for transcribing **9.02 hours of speech** (rounded to 4 decimals):

|RTF (process time/duration of audio)|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
|---|---|---|---|---|
|[OpenAI](https://github.com/openai/whisper)|0.1918|0.1487|0.2164|0.1641|
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)|0.0796|0.1206|0.077|0.1141|
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|0.0718|0.1434|0.0728|0.1559|
|**[faster-whisper w/ batching](https://github.com/SYSTRAN/faster-whisper/pull/856)**|**0.0231**|**0.0436**|**0.02**|**0.0412**|
|[WhisperX](https://github.com/m-bain/whisperX/)\*|0.0459|0.0592|0.0475|0.058|
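
Read against the plain faster-whisper row, batching roughly triples throughput on this set; for example, for `large-v2` with `float16`:

```python
rtf_sequential = 0.0718  # faster-whisper, large-v2, float16
rtf_batched = 0.0231     # faster-whisper w/ batching, large-v2, float16
print(f"speedup: {rtf_sequential / rtf_batched:.1f}x")  # speedup: 3.1x
```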

<br>

Finally, a matrix with the **maximum GPU memory consumption + maximum GPU power usage** of each implementation (**on average**):

|Max. memory / Max. power|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
|---|---|---|---|---|
|[OpenAI](https://github.com/openai/whisper)|10943 MiB / 274 W|10955 MiB / 293 W|11094 MiB / 279 W|11164 MiB / 291 W|
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)*|16629 MiB / 269 W|18563 MiB / 287 W|12106 MiB / **259 W**|15061 MiB / 288 W|
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|**4811 MiB** / 269 W|**8519 MiB** / 286 W|**4865 MiB** / 267 W|**9179 MiB** / 282 W|
|[faster-whisper w/ batching](https://github.com/SYSTRAN/faster-whisper/pull/856)*|19025 MiB / 270 W|19873 MiB / 281 W|18919 MiB / 266 W|19845 MiB / 282 W|
|[WhisperX](https://github.com/m-bain/whisperX/)*|21676 MiB / **268 W**|21657 MiB / **279 W**|22425 MiB / 267 W|21580 MiB / **279 W**|

\* For these implementations, batching is supported. Setting a higher `batch_size` will lead to faster inference at the cost of extra memory used.
