Whisper benchmark results

greenw0lf committed Jul 4, 2024
1 parent c2da985 commit 904d57b
Showing 7 changed files with 170 additions and 4 deletions.
68 changes: 68 additions & 0 deletions NISV/res_labelled.md
@@ -0,0 +1,68 @@
[Back to homepage](../index.md)

<h2>Results on the labelled audio of Broadcast News in the Netherlands</h2>

The N-Best 2008 Dutch Evaluation corpus was designed to evaluate Dutch/Flemish speech recognition systems in 2008. It consists of 4 subsets:
- `bn_nl`: Broadcast News programmes in the Netherlands;
- `cts_nl`: Conversational Telephone Speech in the Netherlands;
- `bn_vl`: Broadcast News programmes in Belgium;
- `cts_vl`: Conversational Telephone Speech in Belgium.

For more details about the corpus, click [here](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=32b10cb0f4cb99ba934f5be5066638a5ad9b19f2).

**The subset used in this benchmark is `bn_nl` (Broadcast News programmes in the Netherlands).**

<br>

For each Whisper implementation, 2 variables have been varied:
- The model version: `large-v2` vs. `large-v3` (to confirm the hypothesis from the UT evaluation)
- The compute type: `float16` vs. `float32`

The compute type refers to the data type used to represent real numbers, such as the weights of the Whisper model. `float16`, also known as **half-precision**, stores each floating-point number in 16 bits, whereas `float32`, known as **single-precision**, uses 32 bits per number. Across deep learning applications, `float16` is known to use less memory and run faster, with the trade-off of some loss in accuracy. In the case of Whisper, however, it has been reported that `float16` leads to only a 0.1% increase in WER while significantly reducing the time and memory required to run the model.
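
For example, in `faster-whisper` the compute type is a single constructor argument. A minimal sketch (the audio file name is illustrative):

```python
# Sketch: choosing the compute type in faster-whisper.
# compute_type="float16" halves the memory per weight; "float32" is full single-precision.
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# segments is a generator; iterating over it runs the actual decoding.
segments, info = model.transcribe("audio.wav", language="nl")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```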

<br>

Here is a matrix with the **WER** results of the baseline implementation from OpenAI, as well as of other, more optimized implementations:

|Model\Parameters|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
|---|---|---|---|---|
|[OpenAI](https://github.com/openai/whisper)|11.1%|11.0%|12.9%|13.2%|
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)|17.1%|16.9%|16.6%|16.6%|
|**[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)**|**10.3%**|**10.2%**|**11.8%**|**11.8%**|
|[WhisperX](https://github.com/m-bain/whisperX/)|12.3%|12.4%|13.0%|12.9%|

<br>

And a matrix with the **time** spent in total by each implementation **to load and transcribe** the dataset:

|Model\Parameters|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
|---|---|---|---|---|
|[OpenAI](https://github.com/openai/whisper)|36m:06s|32m:41s|42m:08s|30m:25s|
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)|21m:48s|19m:13s|23m:22s|22m:02s|
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|11m:40s|22m:27s|**11m:18s**|21m:56s|
|[**WhisperX**](https://github.com/m-bain/whisperX/)**\***|**11m:17s**|**15m:54s**|11m:29s|**15m:05s**|

\* For WhisperX, a separate alignment model based on wav2vec 2.0 is applied to obtain word-level timestamps. The measured time therefore includes loading the model, transcribing, and aligning to generate the timestamps. Speaker diarization has also been applied for WhisperX; it is measured separately and covered in [this section](./whisperx.md).

<br>

Finally, a matrix with the **maximum GPU memory consumption + maximum GPU power usage** of each implementation (**on average**):

|Max. memory / Max. power|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
|---|---|---|---|---|
|[OpenAI](https://github.com/openai/whisper)|10621 MiB / 240 W|**10639 MiB** / 264 W|10927 MiB / 238 W|10941 MiB / 266 W|
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)*|15073 MiB / **141 W**|12981 MiB / **215 W**|14566 MiB / **123 W**|19385 MiB / **235 W**|
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|**8576 MiB** / 188 W|11694 MiB / 235 W|**8567 MiB** / 195 W|**6942 MiB** / 237 W|
|[WhisperX](https://github.com/m-bain/whisperX/)*|9419 MiB / 246 W|13548 MiB / 249 W|9417 MiB / 243 W|13539 MiB / 247 W|

\* For these implementations, batching is supported. Setting a higher `batch_size` will lead to faster inference at the cost of extra memory used.
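
As an illustration, batched long-form transcription with the Huggingface `transformers` pipeline might look like the following sketch (the file name and `batch_size` value are illustrative):

```python
# Sketch: chunked long-form transcription with batching in transformers.
# A higher batch_size speeds up inference at the cost of extra GPU memory.
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    torch_dtype=torch.float16,  # compute type; use torch.float32 for single-precision
    device="cuda:0",
)

# Split the audio into 30-second chunks and transcribe them in batches.
result = pipe("audio.wav", chunk_length_s=30, batch_size=16)
print(result["text"])
```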

## Detailed results per pipeline component for WhisperX
[Click here](./whisperx.md)

## Hardware setup

A high-performance computing cluster was used. Its hardware consists of 2 x Nvidia Quadro RTX 6000 GPUs with 24 GiB of VRAM each (CUDA version 12.4), an Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz, and 256 GB of RAM.

The OS installed on the cluster is [RHEL 9.3](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html-single/9.3_release_notes/index).
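
For reference, peak GPU memory and power figures like the ones reported above can be sampled with NVIDIA's NVML bindings. The polling sketch below is one possible approach, not necessarily the method used for this benchmark:

```python
# Sketch: polling peak GPU memory and power via NVML (pip install nvidia-ml-py).
# Run alongside the benchmark process; stop with Ctrl+C to print the maxima.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

max_mem_mib = 0.0
max_power_w = 0.0
try:
    while True:
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**2  # bytes -> MiB
        power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0      # NVML reports mW
        max_mem_mib = max(max_mem_mib, mem)
        max_power_w = max(max_power_w, power)
        time.sleep(0.5)
except KeyboardInterrupt:
    print(f"Max. memory: {max_mem_mib:.0f} MiB / Max. power: {max_power_w:.0f} W")
finally:
    pynvml.nvmlShutdown()
```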

51 changes: 51 additions & 0 deletions NISV/res_unlabelled.md
@@ -0,0 +1,51 @@
[Back to homepage](../index.md)

<h2>Computational performance on the unlabelled audio of Broadcast News in the Netherlands</h2>

More details about the parameters and the dataset can be found [here](./res_labelled.md).

<br>

For each Whisper implementation, 2 variables have been varied:
- The model version: `large-v2` vs. `large-v3` (to confirm the hypothesis from the UT evaluation)
- The compute type: `float16` vs. `float32` (check [here](./res_labelled.md) for more details about this parameter)

<br>

**TODO: Add results**

<!-- <br>
Here's a matrix with the **time** spent in total by each implementation **to load and transcribe** the data:
|Model\Parameters|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
|---|---|---|---|---|
|[OpenAI](https://github.com/openai/whisper)|36m:06s|32m:41s|42m:08s|30m:25s|
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)|21m:48s|19m:13s|23m:22s|22m:02s|
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|11m:40s|22m:27s|11m:18s|21m:56s|
|[WhisperX](https://github.com/m-bain/whisperX/)*|11m:17s|15m:54s|11m:29s|15m:05s|
\* For WhisperX, a separate alignment model based on wav2vec 2.0 has been applied in order to obtain word-level timestamps. Therefore, the time measured contains the time to load the model, time to transcribe, and time to align to generate timestamps. Speaker diarization has also been applied for WhisperX, which is measured separately and covered in a different section.
<br>
And also a matrix with the **maximum GPU memory consumption + maximum GPU power usage** of each implementation (**on average**):
|Max. memory / Max. power|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
|---|---|---|---|---|
|[OpenAI](https://github.com/openai/whisper)|10621 MiB / 240 W|10639 MiB / 264 W|10927 MiB / 238 W|10941 MiB / 266 W|
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)*|15073 MiB / 141 W|12981 MiB / 215 W|14566 MiB / 123 W|19385 MiB / 235 W|
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|8576 MiB / 188 W|11694 MiB / 235 W|8567 MiB / 195 W|6942 MiB / 237 W|
|[WhisperX](https://github.com/m-bain/whisperX/)*|9419 MiB / 246 W|13548 MiB / 249 W|9417 MiB / 243 W|13539 MiB / 247 W|
\* For these implementations, batching is supported. Setting a higher `batch_size` will lead to faster inference at the cost of extra memory used.
## Detailed results per pipeline component for WhisperX
Go [here]().
## Hardware setup
A high-performance computing cluster was used. The cluster's hardware consists of 2 x Nvidia Quadro RTX 6000 with 24 GiB VRAM each, using CUDA version 12.4, with an Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz and 256 GB of RAM available.
The OS installed on the cluster is [RHEL 9.3](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html-single/9.3_release_notes/index). -->

38 changes: 38 additions & 0 deletions NISV/whisperx.md
@@ -0,0 +1,38 @@
[Back to homepage](../index.md)

## Detailed results for WhisperX

[**WhisperX**](https://github.com/m-bain/whisperX/) is an implementation of Whisper with support for batching, word/character level time alignment using wav2vec 2.0, and speaker diarization.

The pipeline can be split into 4 components: **loading** Whisper, the **transcriber** (the part where Whisper runs to generate the text transcription without any timestamps), the **aligner** (which generates the word-level timestamps using wav2vec 2.0), and the **diarizer** (which identifies the speaker per segment and per word by assigning speaker IDs).

Because the wav2vec 2.0-based aligner does not support aligning digits and currency symbols, numbers and currencies have been converted to their written form by setting `suppress_numerals=True`. In addition, the [original aligner](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-dutch) used in WhisperX, based on XLSR-53, has been replaced with an [aligner based on XLS-R](https://huggingface.co/jonatasgrosman/wav2vec2-xls-r-1b-dutch). This was done because the original aligner struggled to align some characters that are less common in Dutch but still part of its orthography (mainly accented vowels). The replacement might have increased the time spent on alignment compared to the XLSR-53 version; an ablation study is planned as future work to confirm this hypothesis.

<br>

Three variables have been experimented with:
- The model version: `large-v2` vs. `large-v3` (to confirm the hypothesis from the UT evaluation)
- The compute type: `float16` vs. `float32` (check [here](./res_labelled.md) for more details about this parameter)
- The batch size: `64` for `float16` and `16` for `float32`
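
As an illustration of how the four components fit together, here is a rough sketch of the pipeline for one configuration (the audio file name is illustrative, and exact option names may vary between WhisperX versions):

```python
# Sketch of the four WhisperX stages (loading, transcriber, aligner, diarizer)
# for one configuration: large-v2, float16, batch size 64, Dutch XLS-R aligner.
import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.wav")  # illustrative file name

# 1+2. Load Whisper and transcribe (plain text, no timestamps yet).
model = whisperx.load_model(
    "large-v2", device, compute_type="float16",
    asr_options={"suppress_numerals": True},  # write out numbers/currencies
)
result = model.transcribe(audio, batch_size=64)

# 3. Align: word-level timestamps via the wav2vec 2.0 aligner.
align_model, metadata = whisperx.load_align_model(
    language_code="nl", device=device,
    model_name="jonatasgrosman/wav2vec2-xls-r-1b-dutch",
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 4. Diarize: assign speaker IDs per segment and per word.
# (May require a Hugging Face token: use_auth_token=...)
diarize_model = whisperx.DiarizationPipeline(device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
```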

<br>

Here's a matrix with the **time** spent by each component of WhisperX, using the various parameter configurations mentioned on the previous page:

|Configuration\Component|Loading|Transcriber|Aligner|Diarizer|
|---|---|---|---|---|
|large-v2 with `float16`|7.73s|4m:17s|6m:53s|3m:58s|
|large-v2 with `float32`|10.51s|8m:07s|7m:36s|4m:01s|
|large-v3 with `float16`|3.21s|4m:14s|7m:11s|4m:01s|
|large-v3 with `float32`|6.12s|8m:00s|6m:59s|4m:00s|

<br>

And also a matrix with the **maximum GPU memory consumption + maximum GPU power usage** of each configuration (**on average**):

|Max. memory / Max. power|Transcriber|Aligner|Diarizer|
|---|---|---|---|
|large-v2 with `float16`|9419 MiB / 246 W|11916 MiB / 227 W|13578 MiB / 229 W|
|large-v2 with `float32`|13548 MiB / 249 W|14749 MiB / 234 W|16480 MiB / 234 W|
|large-v3 with `float16`|9417 MiB / 243 W|11918 MiB / 235 W|13605 MiB / 231 W|
|large-v3 with `float32`|13539 MiB / 247 W|14715 MiB / 232 W|16411 MiB / 228 W|
2 changes: 1 addition & 1 deletion UT/Jasmin/jasmin.md
@@ -1,6 +1,6 @@
[Back to homepage](../../index.md)

<h2>Kaldi_NL vs. Whisper - Jasmin-CGN</h2>
<h2>Jasmin-CGN</h2>

- [Results](./jasmin_res.md)
- [Setup](./jasmin_setup.md)
2 changes: 1 addition & 1 deletion UT/N-Best/nbest_res.md
@@ -1,6 +1,6 @@
[Back to homepage](../../index.md)

<h2>Kaldi_NL vs. Whisper - N-Best 2008 Dutch</h2>
<h2>N-Best 2008 Dutch</h2>

The N-Best 2008 Dutch Evaluation corpus was designed to evaluate Dutch/Flemish speech recognition systems in 2008. It consists of 4 subsets:
- `bn_nl`: Broadcast News programmes in the Netherlands;
2 changes: 1 addition & 1 deletion UT/hardware.md → UT/environment.md
@@ -1,6 +1,6 @@
[Back to homepage](../index.md)

# Hardware setup
# Environment setup

For **Kaldi_NL**, evaluation was run on a local machine instead of a cluster. The local machine used is a Lenovo ThinkPad P15v Gen 3 with an AMD Ryzen 7 PRO 6850H CPU and 32 GB of RAM. The reasons are that setting up a Docker container on the cluster used for Whisper (see below) is trickier since I do not have admin rights, and that Kaldi_NL was meant to run on modern local machines rather than requiring a powerful GPU/CPU.

11 changes: 10 additions & 1 deletion index.md
@@ -9,7 +9,7 @@ Welcome to the benchmark page where researchers and developers report performance
- [Results for N-Best 2008 Dutch Evaluation corpus](./UT/N-Best/nbest_res.md)
- [Results for Jasmin-CGN corpus](./UT/Jasmin/jasmin.md)
- [Results for Common Voice](./UT/CommonVoice/cv.md)
- [Hardware setup & model configurations](./UT/hardware.md)
- [Environment setup](./UT/environment.md)
- [Why do the results differ between whisper-timestamped and faster-whisper?](./UT/analysis.md)

The results in **bold** indicate the best performance for the specific subset(s) among all models.
@@ -26,6 +26,15 @@ These results were achieved during the PDI-SSH **O**ral **H**istory - **S**tories

These results were achieved during the PDI-SSH **Ho**mo **Med**icinalis ([HoMed](https://homed.ruhosting.nl/)) project (2021-2024).

<h2>NISV's Whisper evaluation</h2>

*NISV = Netherlands Institute for Sound & Vision*

- [Results on labelled data](./NISV/res_labelled.md)
- [Computational performance for unlabelled data](./NISV/res_unlabelled.md)

The results in **bold** indicate the best performance for the specific subset(s) among all models. The lower, the better.

## Contributions
Feel free to click the link at the top, which leads to the GitHub repository of this website. You can contribute by forking the repository, making changes on your fork, and then opening a pull request on the source repository.
