Showing 9 changed files with 255 additions and 34 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
[Back to homepage](../../index.md) | ||
|
||
<h2>N-Best 2008 Dutch</h2> | ||
|
||
The N-Best 2008 Dutch Evaluation corpus is a corpus designed to evaluate Dutch/Flemish Speech Recognition systems in 2008. The corpus consists of 4 subsets: | ||
- `bn_nl`: Broadcast News programmes in the Netherlands; | ||
- `cts_nl`: Conversational Telephone Speech in the Netherlands; | ||
- `bn_vl`: Broadcast News programmes in Belgium; | ||
- `cts_vl`: Conversational Telephone Speech in Belgium. | ||
|
||
For more details about the corpus, click [here](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=32b10cb0f4cb99ba934f5be5066638a5ad9b19f2). | ||
|
||
**The subset used in this benchmark is `bn_nl` (Broadcast News programmes in the Netherlands).** | ||
|
||
- [Results on labelled data](./res_labelled.md) | ||
- [Results on unlabelled data](./res_unlabelled.md) | ||
|
||
## Detailed results per pipeline component for WhisperX | ||
[Click here](./whisperx.md) | ||
|
||
## Hardware setup | ||
|
||
A high-performance computing cluster was used. The cluster's hardware consists of 2 x Nvidia Quadro RTX 6000 with 24 GiB VRAM each, using CUDA version 12.4, with an Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz and 256 GB of RAM available. | ||
|
||
The OS installed on the cluster is [RHEL 9.3](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html-single/9.3_release_notes/index). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
[Back to homepage](../index.md) | ||
[Back to homepage](../../index.md) | ||
|
||
## Detailed results for WhisperX | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
[Back to homepage](../../index.md) | ||
|
||
<h2>N-Best 2008 Dutch</h2> | ||
|
||
The N-Best 2008 Dutch Evaluation corpus is a corpus designed to evaluate Dutch/Flemish Speech Recognition systems in 2008. The corpus consists of 4 subsets: | ||
- `bn_nl`: Broadcast News programmes in the Netherlands; | ||
- `cts_nl`: Conversational Telephone Speech in the Netherlands; | ||
- `bn_vl`: Broadcast News programmes in Belgium; | ||
- `cts_vl`: Conversational Telephone Speech in Belgium. | ||
|
||
For more details about the corpus, click [here](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=32b10cb0f4cb99ba934f5be5066638a5ad9b19f2). | ||
|
||
**The subset used in this benchmark is `cts_nl` (Conversational Telephone Speech in the Netherlands).** | ||
|
||
- [Results on labelled data](./res_labelled.md) | ||
- [Results on unlabelled data](./res_unlabelled.md) | ||
|
||
## Detailed results per pipeline component for WhisperX | ||
[Click here](./whisperx.md) | ||
|
||
## Hardware setup | ||
|
||
A high-performance computing cluster was used. The cluster's hardware consists of 2 x Nvidia Quadro RTX 6000 with 24 GiB VRAM each, using CUDA version 12.4, with an Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz and 256 GB of RAM available. | ||
|
||
The OS installed on the cluster is [RHEL 9.3](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html-single/9.3_release_notes/index). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
[Go back](./intro_cts_nl.md) | ||
|
||
<h2>Results on the labelled audio of Conversational Telephone Speech in the Netherlands</h2> | ||
|
||
This data does not reflect the type of content to which the ASR will be applied in terms of audio length, but it offers rough estimates of the model's WER performance, particularly in more difficult speech conditions (conversational speech), and of how well the word-level timestamps align with the reference files. | ||
|
||
<br> | ||
|
||
For each Whisper implementation, 2 variables have been varied, plus the batch size where batching is supported: | ||
- The model version: `large-v2` vs. `large-v3` (to confirm the hypothesis from the UT evaluation) | ||
- The compute type: `float16` vs. `float32` | ||
- **For Huggingface, WhisperX, faster-whisper with batching:** `batch_size` | ||
  - `2` for HF | ||
  - `64` for `WhisperX float16`, `16` for `WhisperX float32` | ||
  - `64` for faster-whisper with batching | ||
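The parameter grid above can be written down as a small configuration sketch. This is only an illustration of the combinations listed, not the benchmark's actual code: the names `BATCH_SIZES` and `build_runs` are hypothetical, and `None` is an assumed marker for the implementations for which no `batch_size` is listed.

```python
from itertools import product

MODEL_VERSIONS = ("large-v2", "large-v3")
COMPUTE_TYPES = ("float16", "float32")

# Batch sizes as listed above; None marks implementations with no batch_size listed.
BATCH_SIZES = {
    "OpenAI": {"float16": None, "float32": None},
    "Huggingface": {"float16": 2, "float32": 2},
    "faster-whisper": {"float16": None, "float32": None},
    "faster-whisper w/ batching": {"float16": 64, "float32": 64},
    "WhisperX": {"float16": 64, "float32": 16},
}


def build_runs():
    """Return one dict per (implementation, model version, compute type) combination."""
    runs = []
    for impl, version, compute_type in product(BATCH_SIZES, MODEL_VERSIONS, COMPUTE_TYPES):
        runs.append(
            {
                "implementation": impl,
                "model_version": version,
                "compute_type": compute_type,
                "batch_size": BATCH_SIZES[impl][compute_type],
            }
        )
    return runs


if __name__ == "__main__":
    for run in build_runs():
        print(run)
```

Each result table below has one row per implementation and one column per model version and compute type pair, which corresponds to one entry of this grid.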
|
||
The compute type refers to the data type used to represent real numbers such as the weights of the Whisper model. In our case, `float16`, also known as **half-precision**, uses 16 bits to store a single floating-point number, whereas `float32`, known as **single-precision**, uses 32 bits. In deep learning applications, `float16` generally uses less memory and runs faster, at the cost of some accuracy. In the case of Whisper, however, `float16` has been reported to increase WER by only 0.1% while significantly reducing the time and memory required to run the model. | ||
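To make the trade-off concrete, here is a minimal sketch using only the Python standard library. It stores one value in both formats and compares the storage size and the rounding error. The sample weight and the parameter count are hypothetical values chosen for illustration, and the footprint estimate covers weights only, so it will not match the measured GPU memory in the tables.

```python
import struct


def roundtrip(value: float, fmt: str) -> float:
    """Store a value in the given IEEE 754 format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


weight = 0.1234567  # hypothetical model weight

# "<e" is half-precision (float16), "<f" is single-precision (float32).
half_error = abs(roundtrip(weight, "<e") - weight)
single_error = abs(roundtrip(weight, "<f") - weight)

print("float16:", struct.calcsize("<e"), "bytes, rounding error", half_error)
print("float32:", struct.calcsize("<f"), "bytes, rounding error", single_error)

# Weight-only footprint for a hypothetical model with 1.5 billion parameters.
n_params = 1_500_000_000
for name, fmt in (("float16", "<e"), ("float32", "<f")):
    gib = n_params * struct.calcsize(fmt) / 2**30
    print(f"{name}: {gib:.1f} GiB of weights")
```

The halved storage per number is where the memory savings come from, while the larger rounding error is the source of the potential accuracy loss.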
|
||
<br> | ||
|
||
Here is a matrix with **WER** results of the baseline implementation from OpenAI, as well as different, more optimized implementations: | ||
|
||
|Model\Parameters|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`| | ||
|---|---|---|---|---| | ||
|[OpenAI](https://github.com/openai/whisper)|25.9%|25.7%|27.0%|27.0%| | ||
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)|39.5%|39.8%|34.2%|34.2%| | ||
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|25.0%|**24.9%**|26.0%|26.1%| | ||
|**[faster-whisper w/ batching](https://github.com/SYSTRAN/faster-whisper/pull/856)**|**24.9%**|**24.9%**|**25.8%**|**25.8%**| | ||
|[WhisperX](https://github.com/m-bain/whisperX/)|31.5%|31.6%|29.8%|29.8%| | ||
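For readers unfamiliar with the metric, WER is commonly defined as the number of word substitutions, deletions and insertions needed to turn the hypothesis into the reference, divided by the number of words in the reference. The sketch below implements that standard definition with a word-level edit distance. It is an assumption that the benchmark used this exact formulation: the text normalization applied before scoring is not described here, and the function name and example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("is" -> "was") and one deletion ("mooi") over 5 reference words.
    print(word_error_rate("het weer is vandaag mooi", "het weer was vandaag"))  # 0.4
```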
|
||
<br> | ||
|
||
And a matrix with the **time** spent in total by each implementation **to load and transcribe** the dataset: | ||
|
||
|Load+transcribe|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`| | ||
|---|---|---|---|---| | ||
|[OpenAI](https://github.com/openai/whisper)|35m:15s|28m:38s|31m:20s|32m:11s| | ||
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)|21m:00s|14m:10s|17m:09s|20m:59s| | ||
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|10m:21s|19m:35s|12m:36s|19m:24s| | ||
|**[faster-whisper w/ batching](https://github.com/SYSTRAN/faster-whisper/pull/856)**|**6m:50s**|**11m:45s**|**6m:22s**|**11m:32s**| | ||
|[WhisperX](https://github.com/m-bain/whisperX/)\*|31m:08s|33m:22s|28m:28s|33m:41s| | ||
|
||
\* For WhisperX, a separate alignment model based on wav2vec 2.0 has been applied in order to obtain word-level timestamps. Therefore, the time measured contains the time to load the model, time to transcribe, and time to align to generate timestamps. Speaker diarization has also been applied for WhisperX, which is measured separately and covered in [this section](./whisperx.md). | ||
|
||
<!-- <br> | ||
Here's also a matrix with the **Real-Time Factor or RTF** for short (defined as time to process all of the input divided by the duration of the input) for transcribing **2.23 hours of speech** (rounded to 4 decimals): | ||
|RTF (process time/duration of audio)|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`| | ||
|---|---|---|---|---| | ||
|[OpenAI](https://github.com/openai/whisper)|0.2698|0.2443|0.3149|0.2273| | ||
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)|0.1629|0.1436|0.1746|0.1647| | ||
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|0.0871|0.168|0.0827|0.1799| | ||
|**[faster-whisper w/ batching](https://github.com/SYSTRAN/faster-whisper/pull/856)**|**0.0355**|**0.0663**|**0.033**|**0.0633**| | ||
|[WhisperX](https://github.com/m-bain/whisperX/)\*|0.0864|0.114|0.0823|0.1126| --> | ||
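The timings above can be turned into a Real-Time Factor (processing time divided by audio duration) with a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, and the 2.23 hours of speech is the figure quoted in the commented-out note in this file. It is not meant to reproduce the commented-out RTF values, which may stem from different runs.

```python
import re


def to_seconds(stamp: str) -> int:
    """Convert strings such as '6m:50s' or '1h:12m:12s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)h:)?(\d+)m:(\d+)s", stamp.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {stamp!r}")
    hours, minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def real_time_factor(process_time: str, audio_hours: float) -> float:
    """Processing time divided by audio duration; lower means faster."""
    return to_seconds(process_time) / (audio_hours * 3600)


if __name__ == "__main__":
    # Time for faster-whisper w/ batching, large-v2 with float16, from the table above.
    print(real_time_factor("6m:50s", audio_hours=2.23))
```

An RTF below 1 means the audio is processed faster than real time.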
|
||
<br> | ||
|
||
Finally, a matrix with the **maximum GPU memory consumption + maximum GPU power usage** of each implementation (**on average**): | ||
|
||
|Max. memory / Max. power|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`| | ||
|---|---|---|---|---| | ||
|[OpenAI](https://github.com/openai/whisper)|10540 MiB / 220 W|10543 MiB / 260 W|10763 MiB / 207 W|10760 MiB / 258 W| | ||
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)*|4670 MiB / **113 W**|8114 MiB / **185 W**|4541 MiB / **118 W**|8020 MiB / **191 W**| | ||
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|**3959 MiB** / 190 W|**7212 MiB** / 252 W|**3945 MiB** / 185 W|**7182 MiB** / 252 W| | ||
|[faster-whisper w/ batching](https://github.com/SYSTRAN/faster-whisper/pull/856)*|4600 MiB / 180 W|8000 MiB / 259 W|4600 MiB / 179 W|7998 MiB / 256 W| | ||
|[WhisperX](https://github.com/m-bain/whisperX/)*|9401 MiB / 181 W|12714 MiB / 197 W|9402 MiB / 186 W|12721 MiB / 198 W| | ||
|
||
\* For these implementations, batching is supported. Setting a higher `batch_size` will lead to faster inference at the cost of extra memory used. | ||
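To compare compute types directly, the table cells can be parsed and the `float32` to `float16` memory ratio computed. The sketch below does this for the faster-whisper `large-v2` cells. The helper name `parse_cell` is hypothetical, and the cell strings are copied from the table above.

```python
import re


def parse_cell(cell: str) -> tuple[int, int]:
    """Parse a cell such as '**3959 MiB** / 190 W' into (MiB, watts)."""
    # Drop the Markdown bold markers before matching.
    match = re.fullmatch(r"\s*(\d+) MiB / (\d+) W\s*", cell.replace("*", ""))
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    return int(match[1]), int(match[2])


if __name__ == "__main__":
    fp16_mib, fp16_watts = parse_cell("**3959 MiB** / 190 W")
    fp32_mib, fp32_watts = parse_cell("**7212 MiB** / 252 W")
    print(f"memory ratio float32/float16: {fp32_mib / fp16_mib:.2f}")
    print(f"power ratio float32/float16: {fp32_watts / fp16_watts:.2f}")
```

The memory ratio stays below 2 because the measured consumption includes more than the model weights.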
|
||
## Detailed results per pipeline component for WhisperX | ||
[Click here](./whisperx.md) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
[Go back](./intro_bn_nl.md) | ||
|
||
<h2>Computational performance on the unlabelled audio of Broadcast News in the Netherlands</h2> | ||
|
||
The unlabelled data is considered to be long-form (a single audio file with a long duration), which more closely reflects the type of data found in audiovisual/oral history archives. Thus, even though the WER is not calculated (due to the lack of complete labelling for this subset), the computational performance information gives a better estimate of each implementation's performance when applied to longer individual audio files. | ||
|
||
More details about the parameters and the dataset can be found [here](./res_labelled.md). | ||
|
||
<br> | ||
|
||
For each Whisper implementation, 2 variables have been varied, plus the batch size where batching is supported: | ||
- The model version: `large-v2` vs. `large-v3` (to confirm the hypothesis from the UT evaluation) | ||
- The compute type: `float16` vs. `float32` (check [here](./res_labelled.md) for more details about this parameter) | ||
- For Huggingface (HF) and WhisperX: `batch_size` | ||
  - `4` for `HF float16`, `2` for `HF float32` | ||
  - `48` for `WhisperX float16`, `16` for `WhisperX float32` | ||
- For `faster-whisper w/ batching`: `batch_size` | ||
  - `40` for `float16` | ||
  - `16` for `float32` | ||
|
||
<br> | ||
|
||
Here's a matrix with the **time** spent in total by each implementation **to load and transcribe** the data: | ||
|
||
|Load+transcribe|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`| | ||
|---|---|---|---|---| | ||
|[OpenAI](https://github.com/openai/whisper)|1h:12m:12s|57m:12s|1h:28m:55s|1h:09m:57s| | ||
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)|38m:54s|1h:03m:06s|28m:21s|49m:05s| | ||
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|27m:53s|56m:13s|31m:50s|1h:02m:38s| | ||
|**[faster-whisper w/ batching](https://github.com/SYSTRAN/faster-whisper/pull/856)**|**7m:44s**|**14m:46s**|**6m:39s**|**13m:43s**| | ||
|[WhisperX](https://github.com/m-bain/whisperX/)*|21m:29s|22m:14s|20m:28s|21m:36s| | ||
|
||
\* For WhisperX, a separate alignment model based on wav2vec 2.0 has been applied in order to obtain word-level timestamps. Therefore, the time measured contains the time to load the model, time to transcribe, and time to align to generate timestamps. Speaker diarization has also been applied for WhisperX, which is measured separately and covered in a different section. | ||
<!-- | ||
<br> | ||
Here's also a matrix with the **Real-Time Factor or RTF** for short (defined as time to process all of the input divided by the duration of the input) for transcribing **9.02 hours of speech** (rounded to 4 decimals): | ||
|RTF (process time/duration of audio)|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`| | ||
|---|---|---|---|---| | ||
|[OpenAI](https://github.com/openai/whisper)|0.1918|0.1487|0.2164|0.1641| | ||
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)|0.0796|0.1206|0.077|0.1141| | ||
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|0.0718|0.1434|0.0728|0.1559| | ||
|**[faster-whisper w/ batching](https://github.com/SYSTRAN/faster-whisper/pull/856)**|**0.0231**|**0.0436**|**0.02**|**0.0412**| | ||
|[WhisperX](https://github.com/m-bain/whisperX/)\*|0.0459|0.0592|0.0475|0.058| --> | ||
|
||
<br> | ||
|
||
And also a matrix with the **maximum GPU memory consumption + maximum GPU power usage** of each implementation (**on average**): | ||
|
||
|Max. memory / Max. power|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`| | ||
|---|---|---|---|---| | ||
|[OpenAI](https://github.com/openai/whisper)|10842 MiB / 276 W|11090 MiB / 290 W|10952 MiB / 278 W|10997 MiB / 286 W| | ||
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)*|11631 MiB / **215 W**|19850 MiB / 283 W|7566 MiB / **202 W**|14328 MiB / 283 W| | ||
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|**4761 MiB** / 261 W|**8684 MiB** / 283 W|**5106 MiB** / 266 W|**9285 MiB** / 281 W| | ||
|[faster-whisper w/ batching](https://github.com/SYSTRAN/faster-whisper/pull/856)*|15308 MiB / 266 W|18198 MiB / 278 W|15279 MiB / 266 W|18147 MiB / 277 W| | ||
|[WhisperX](https://github.com/m-bain/whisperX/)*|19053 MiB / 257 W|22013 MiB / **276 W**|19096 MiB / 256 W|22042 MiB / **276 W**| | ||
|
||
\* For these implementations, batching is supported. Setting a higher `batch_size` will lead to faster inference at the cost of extra memory used. | ||
|
||
## Detailed results per pipeline component for WhisperX | ||
[Click here](./whisperx.md) |