Add XLS-R evaluation results
greenw0lf committed Mar 8, 2024
1 parent dc4880f commit 6a840d8
Showing 4 changed files with 14 additions and 5 deletions.
4 changes: 4 additions & 0 deletions UT/Jasmin/jasmin_res.md
@@ -14,6 +14,7 @@ Here is a matrix with **WER** results of the baseline model, Kaldi_NL, as well a
|faster-whisper v3|28.1%|25.2%|50.9%|62.6%|27.6%|
|**faster-whisper v2 w/ VAD**|**19.1%**|**11.1%**|**29.5%**|**30.0%**|**12.8%**|
|faster-whisper v3 w/ VAD|27.5%|22.4%|42.6%|49.4%|25.2%|
|wav2vec2-xls-r-1b-dutch|22.4%|13.3%|33.8%|36.1%|17.2%|

|Model\Dataset|Jasmin_p_1|Jasmin_p_2|Jasmin_p_3|Jasmin_p_4|Jasmin_p_5|
|---|---|---|---|---|---|
@@ -26,6 +27,7 @@ Here is a matrix with **WER** results of the baseline model, Kaldi_NL, as well a
|faster-whisper v3|85.8%|68.3%|84.4%|84.5%|51.4%|
|**faster-whisper v2 w/ VAD**|**28.2%**|**22.9%**|**39.2%**|**51.4%**|**26.8%**|
|faster-whisper v3 w/ VAD|34.4%|28.6%|48.7%|58.2%|33.6%|
|wav2vec2-xls-r-1b-dutch|60.2%|62.2%|70.5%|59.1%|47.0%|

**Jasmin_{p,q}_{1,2,3,4,5}** = **p** stands for **comp_p (HMI speech)**, whereas **q** stands for **comp_q (read speech)**. The number, which ranges from **1** to **5**, represents the corresponding **age group/nativeness** from the corpus (for more details, go back one page).
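For reference, WER in these matrices follows the standard definition: word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal, purely illustrative sketch of that computation (not the evaluation code actually used here):

```python
# Minimal WER: word-level Levenshtein distance over the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# 1 substitution + 1 deletion over 6 reference words
print(wer("de kat zit op de mat", "de kat zat op mat"))
```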

@@ -44,6 +46,7 @@ And its corresponding matrix with the **time** spent in total by each model **to
|faster-whisper v3|0h:41m:58s|0h:38m:13s|0h:48m:28s|0h:55m:48s|0h:44m:12s|
|faster-whisper v2 w/ VAD|0h:32m:55s|0h:27m:16s|0h:25m:51s|0h:21m:58s|0h:32m:09s|
|faster-whisper v3 w/ VAD|0h:40m:33s|0h:31m:45s|0h:37m:36s|0h:37m:11s|0h:38m:00s|
|wav2vec2-xls-r-1b-dutch|0h:35m:18s|0h:27m:33s|0h:32m:39s|0h:31m:49s|0h:39m:05s|

|Model\Dataset|Jasmin_p_1|Jasmin_p_2|Jasmin_p_3|Jasmin_p_4|Jasmin_p_5|
|---|---|---|---|---|---|
@@ -56,5 +59,6 @@ And its corresponding matrix with the **time** spent in total by each model **to
|faster-whisper v3|0h:54m:15s|0h:32m:13s|0h:34m:35s|0h:55m:02s|1h:12m:22s|
|***faster-whisper v2 w/ VAD***|**0h:09m:59s**|0h:07m:37s|**0h:07m:57s**|**0h:13m:32s**|0h:22m:31s|
|*faster-whisper v3 w/ VAD*|0h:13m:43s|**0h:07m:17s**|0h:09m:57s|0h:22m:45s|0h:25m:52s|
|wav2vec2-xls-r-1b-dutch|0h:42m:20s|0h:24m:19s|0h:26m:52s|0h:36m:42s|0h:48m:26s|

<b>*</b> Performance might have been impacted by other processes from other users running on the same GPU, since the hardware is accessed via a cluster system. A rerun on different hardware might be done in the near future.
6 changes: 4 additions & 2 deletions UT/N-Best/nbest_res.md
@@ -23,6 +23,7 @@ Here is a matrix with **WER** results of the baseline model, Kaldi_NL, as well a
|faster-whisper v3|12.5%|25.5%|
|**faster-whisper v2 w/ VAD**|**10.0%**|**23.9%**|
|faster-whisper v3 w/ VAD|12.3%|25.1%|
|wav2vec2-xls-r-1b-dutch|14.8%|33.5%|

<br>
And here are results for the same models on bn_nl with the foreign speech lines removed from the dataset:
@@ -43,15 +44,16 @@ Here is also a matrix with the **time** spent in total by each model **to evalua

|Model\Dataset|bn_nl|cts_nl|
|---|---|---|
|Kaldi_NL|0h:08m:58s|0h:14m:47s|
|Whisper v2|1h:11m:59s|0h:53m:55s|
|Whisper v3|1h:09m:00s|0h:40m:20s|
|Whisper v2 w/ VAD|0h:52m:03s|0h:40m:09s|
|Whisper v3 w/ VAD|1h:02m:13s|0h:37m:50s|
|faster-whisper v2|0h:11m:31s|0h:09m:30s|
|faster-whisper v3|0h:11m:21s|0h:09m:41s|
|faster-whisper v2 w/ VAD|0h:12m:13s|0h:09m:36s|
|faster-whisper v3 w/ VAD|0h:12m:25s|0h:09m:13s|
|**wav2vec2-xls-r-1b-dutch**|**0h:07m:36s**|**0h:07m:52s**|

### Preprocessing, setup, and postprocessing
For more details, click [here](./nbest_setup.md).
6 changes: 5 additions & 1 deletion UT/hardware.md
@@ -4,7 +4,7 @@

For **Kaldi_NL**, evaluation was run on a local machine instead of a cluster. The local machine used is a Lenovo ThinkPad P15v Gen 3 with an AMD Ryzen 7 PRO 6850H CPU and 32 GB of RAM. There are two reasons: it is trickier to set up a Docker container on the cluster used for Whisper (see below) since I do not have admin rights, and Kaldi_NL is meant to run on modern local machines rather than requiring a powerful GPU/CPU.

For **Whisper** and **XLS-R**, a high-performance computing cluster was used. The cluster's hardware consists of 2 x Nvidia A10 with 24 GB VRAM each, using CUDA version 11.6, 256 GB RAM and 56 CPU cores. For more details, check the [wiki](https://jupyter.wiki.utwente.nl/) page of the cluster.

The implementation used to output word-level timestamps from Whisper is [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped). In addition, a more optimized implementation of Whisper has been used, namely [faster-whisper](https://github.com/SYSTRAN/faster-whisper). Both of them use the same parameters as the original implementation from [OpenAI](https://github.com/openai/whisper). The parameters are:
- `beam_size=5`
@@ -16,10 +16,14 @@ The implementation used to output word-level timestamps from Whisper is [whisper

As for Kaldi_NL, the repository can be found [here](https://github.com/opensource-spraakherkenning-nl/Kaldi_NL). The model used for Kaldi_NL is `radboud_OH` and its corresponding script is `decode_OH.sh`.

For XLS-R, a version fine-tuned on Dutch by Jonatas Grosman was used, which can be found [here](https://huggingface.co/jonatasgrosman/wav2vec2-xls-r-1b-dutch).
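As a hedged sketch of how such a model can be run via Hugging Face `transformers` (illustrative only, not necessarily the exact inference script used for these results; the file name is a placeholder), greedy CTC transcription looks roughly like this:

```python
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "jonatasgrosman/wav2vec2-xls-r-1b-dutch"
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).eval()

# Load a mono waveform resampled to the 16 kHz rate the model expects.
speech, _ = librosa.load("recording.wav", sr=16_000)  # placeholder path
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])  # greedy (no LM) transcription
```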

<br>

It has been observed that `whisper-timestamped` (simplified to "Whisper" in the results matrix) requires on average 7 GB of VRAM per recording, with an initial memory requirement of 9.4 GB. A test on a less powerful GPU, an RTX 4070 with 12 GB of VRAM, showed that a GPU with 12 GB of VRAM can also run the large versions of Whisper.

The better-optimized implementation, `faster-whisper`, uses on average 3.2-3.7 GB of VRAM per recording, so it can also be used on GPUs with smaller video memory capacity.
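As a hedged sketch, the kind of call the `faster-whisper` rows correspond to might look like this (illustrative only; model size, file name, and device are placeholders — `beam_size=5` matches the parameter listed above, and `vad_filter=True` corresponds to the "w/ VAD" variants):

```python
from faster_whisper import WhisperModel

# Placeholder model size and device; the evaluations above used large models on A10 GPUs.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "recording.wav",   # placeholder path
    language="nl",
    beam_size=5,       # same beam size as the original OpenAI implementation
    vad_filter=True,   # enables the built-in VAD ("w/ VAD" rows)
)
for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```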

XLS-R uses on average 4.3 GB of VRAM per recording, occasionally going as low as 4.0 GB and as high as 8.3 GB. Its initial memory requirement varies from 8 GB to 13.7 GB, though the higher value may have been influenced by other processes previously executed on the GPU.

For Kaldi_NL, the memory used for diarization is 355 MB on average, and for NNet3 decoding, 1.6 GB. Keep in mind that this is RAM, since Kaldi_NL runs on the CPU.
3 changes: 1 addition & 2 deletions index.md
@@ -2,7 +2,7 @@

Welcome to the page where researchers and developers report performance of various ASR models on Dutch datasets.

<h2>UT's Kaldi_NL vs. Whisper vs. XLS-R evaluation</h2>

*UT = University of Twente*

@@ -23,7 +23,6 @@ The results in **bold** indicate the best performance for the specific subset(s)

These results were achieved during the PDI-SSH Homo Medicinalis ([HoMed](https://homed.ruhosting.nl/)) project (2021-2024).


## Contributions
Feel free to click the link at the top, which leads to the GitHub repository of this website. You can contribute by forking the repository, making changes on your fork, and then opening a pull request on the source repository.

