From ed1bc11507dd65d44d30c01c64a6f94ddebaec32 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Drago=C8=99?= <balandragos5555@gmail.com>
Date: Fri, 12 Apr 2024 17:44:18 +0200
Subject: [PATCH] Add Flemish Jasmin results

---
 UT/Jasmin/jasmin.md       |  7 ++---
 UT/Jasmin/jasmin_res.md   | 56 +++++++++++++++++++++++++++++++++++++--
 UT/Jasmin/jasmin_setup.md |  2 ++
 3 files changed, 60 insertions(+), 5 deletions(-)
diff --git a/UT/Jasmin/jasmin.md b/UT/Jasmin/jasmin.md
index fc9cc7e..997644c 100644
--- a/UT/Jasmin/jasmin.md
+++ b/UT/Jasmin/jasmin.md
@@ -6,11 +6,12 @@
 - [Setup](./jasmin_setup.md)
 
 ### Corpus description
-Jasmin-CGN is a Dutch/Flemish corpus that contains speech from less-represented groups, such as the elderly, children, or non-natives. UT has only evaluated the Dutch part of the corpus.
+Jasmin-CGN is a Dutch/Flemish corpus that contains speech from less-represented groups, such as the elderly, children, or non-natives.
 
-The corpus is split mainly according to 2 criteria:
+The corpus is split mainly according to 3 criteria:
-1. Read/HMI speech
-2. Speaker groups based on age and native/non-native
+1. Dutch/Flemish
+2. Read/HMI speech
+3. Speaker groups based on age and native/non-native
 
 Thus, we have:
 - `comp_p`: Speech recordings where a human interacts with a machine (HMI) in a Wizard of Oz setup
diff --git a/UT/Jasmin/jasmin_res.md b/UT/Jasmin/jasmin_res.md
index d24c4d6..a2c96db 100644
--- a/UT/Jasmin/jasmin_res.md
+++ b/UT/Jasmin/jasmin_res.md
@@ -1,7 +1,7 @@
 [Go back](./jasmin.md)
 
-## Jasmin results
-Here is a matrix with **WER** results of the baseline model, Kaldi_NL, as well as different versions of zero-shot Whisper tested on the corpus:
+## Jasmin Dutch results
+Here is a matrix with **WER** results of the baseline model, Kaldi_NL, as well as different models tested on the **Dutch** part of the corpus:
 
 |Model\Dataset|Jasmin_q_1|Jasmin_q_2|Jasmin_q_3|Jasmin_q_4|Jasmin_q_5|
 |---|---|---|---|---|---|
@@ -33,6 +33,7 @@ Here is a matrix with **WER** results of the baseline model, Kaldi_NL, as well a
 |MMS - 102 languages|79.8%|79.9%|90.7%|80.5%|56.4%|
 |MMS - 1162 languages|82.4%|87.9%|94.5%|83.3%|59.9%|
 
+
 **Jasmin_{p,q}_{1,2,3,4,5}** = **p** stands for **comp_p (HMI speech)**, whereas **q** stands for **comp_q (read speech)**. The number that can range from **1-5** represents the corresponding **age group/nativeness** from the corpus (for more details, go back one page).
 
 <br>
@@ -70,3 +71,54 @@ And its corresponding matrix with the **time** spent in total by each model **to
 |MMS - 1162 languages|0h:17m:55s|0h:13m:56s|0h:13m:59s|0h:18m:54s|0h:25m:24s|
 
 <b>*</b> Performance might have been impacted by other processes from other users running on the same GPU since the hardware is available via a cluster system. A rerun using different hardware might be done in the near future.
+
+## Jasmin Flemish results
+Matrix with **WER** results for the **Flemish** part of the corpus:
+
+|Model\Dataset|Jasmin_q_1|Jasmin_q_2|Jasmin_q_3|Jasmin_q_4|Jasmin_q_5|
+|---|---|---|---|---|---|
+|Kaldi_NL|59.2%|33.5%|51.3%|43.3%|24.7%|
+|faster-whisper v2|42.4%|11.7%|19.9%|21.0%|16.7%|
+|faster-whisper v3|57.2%|30.6%|44.4%|41.1%|38.7%|
+|**faster-whisper v2 w/ VAD**|**41.8%**|**11.6%**|**19.4%**|**20.5%**|**14.4%**|
+|faster-whisper v3 w/ VAD|56.2%|26.7%|38.4%|50.7%|33.6%|
+|XLS-R FT on Dutch|47.4%|13.3%|30.1%|26.8%|16.4%|
+|MMS - 102 languages|55.3%|22.4%|43.0%|37.0%|23.0%|
+|MMS - 1162 languages|49.2%|21.8%|34.9%|35.8%|22.3%|
+
+|Model\Dataset|Jasmin_p_1|Jasmin_p_2|Jasmin_p_3|Jasmin_p_4|Jasmin_p_5|
+|---|---|---|---|---|---|
+|Kaldi_NL|66.5%|49.8%|66.2%|64.4%|47.4%|
+|faster-whisper v2|87.6%|51.7%|76.1%|67.3%|45.4%|
+|faster-whisper v3|90.5%|65.2%|100.4%|79.9%|68.3%|
+|**faster-whisper v2 w/ VAD**|**28.7%**|**24.3%**|**38.5%**|**49.3%**|**30.6%**|
+|faster-whisper v3 w/ VAD|46.0%|37.7%|57.9%|57.9%|44.6%|
+|XLS-R FT on Dutch|73.2%|62.2%|68.1%|52.2%|47.8%|
+|MMS - 102 languages|86.7%|52.3%|87.8%|78.2%|56.4%|
+|MMS - 1162 languages|86.1%|68.0%|86.3%|76.7%|60.8%|
+
+<br>
+
+And its corresponding matrix with the **time** spent in total by each model **to evaluate** the respective subset:
+
+|Model\Dataset|Jasmin_q_1|Jasmin_q_2|Jasmin_q_3|Jasmin_q_4|Jasmin_q_5|
+|---|---|---|---|---|---|
+|Kaldi_NL|0h:15m:58s|0h:16m:03s|0h:25m:11s|0h:15m:46s|0h:29m:36s|
+|faster-whisper v2|0h:09m:30s|0h:20m:12s|0h:18m:03s|0h:12m:09s|0h:14m:31s|
+|faster-whisper v3|0h:14m:53s|0h:24m:33s|0h:29m:19s|0h:21m:47s|0h:23m:58s|
+|faster-whisper v2 w/ VAD|0h:21m:27s|0h:27m:16s|0h:19m:09s|0h:13m:24s|0h:15m:29s|
+|faster-whisper v3 w/ VAD|0h:13m:17s|0h:23m:29s|0h:23m:14s|0h:26m:40s|0h:19m:54s|
+|XLS-R FT on Dutch|0h:11m:18s|0h:20m:03s|0h:22m:05s|0h:16m:28s|0h:13m:00s|
+|*MMS - 102 languages*|**0h:05m:47s**|0h:09m:09s|**0h:10m:06s**|**0h:07m:37s**|**0h:08m:04s**|
+|**MMS - 1162 languages**|**0h:05m:47s**|**0h:09m:07s**|**0h:10m:06s**|**0h:07m:37s**|**0h:08m:04s**|
+
+|Model\Dataset|Jasmin_p_1|Jasmin_p_2|Jasmin_p_3|Jasmin_p_4|Jasmin_p_5|
+|---|---|---|---|---|---|
+|Kaldi_NL|0h:07m:09s|0h:07m:36s|0h:08m:37s|0h:10m:51s|0h:14m:45s|
+|faster-whisper v2|0h:12m:48s|0h:10m:45s|0h:14m:34s|0h:11m:58s|0h:34m:06s|
+|faster-whisper v3|0h:24m:08s|0h:26m:42s|0h:28m:56s|0h:27m:12s|0h:31m:16s|
+|**faster-whisper v2 w/ VAD**|**0h:05m:41s**|**0h:07m:03s**|**0h:07m:01s**|**0h:08m:11s**|**0h:09m:45s**|
+|*faster-whisper v3 w/ VAD*|0h:06m:44s|0h:08m:23s|0h:10m:08s|0h:13m:52s|0h:10m:40s|
+|XLS-R FT on Dutch|0h:20m:36s|0h:16m:58s|0h:20m:55s|0h:17m:47s|0h:19m:34s|
+|MMS - 102 languages|0h:10m:55s|0h:09m:10s|0h:11m:00s|0h:09m:43s|0h:10m:33s|
+|MMS - 1162 languages|0h:10m:06s|0h:09m:09s|0h:10m:42s|0h:09m:36s|0h:10m:15s|
diff --git a/UT/Jasmin/jasmin_setup.md b/UT/Jasmin/jasmin_setup.md
index 638d851..0812c37 100644
--- a/UT/Jasmin/jasmin_setup.md
+++ b/UT/Jasmin/jasmin_setup.md
@@ -5,6 +5,8 @@
 
 The encoding used in the dataset for the transcriptions is latin_1. In order for the evaluation tool to work, I converted the encoding to UTF-8.
 
+For the Flemish subset, one speaker (`V000055`) does not belong to any of the 5 speaker groups according to the metadata. Therefore, this speaker and their files (`fv160041` and `fv170041`) have been excluded from the evaluation.
+
 ### Postprocessing
 
 A large number of insertions was encountered when evaluating Whisper. This was due to time misalignment at the start of segments. This was addressed by adjusting the `start_time` of the first word of a segment to `end_time - 0.1s`.
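
The postprocessing step above can be sketched roughly as follows. This is a minimal illustration, not the script used for the evaluation: it assumes each segment is a list of word dictionaries with `start_time` and `end_time` keys in seconds, and that `end_time` refers to the first word's own end time. The function name `clamp_first_word_start` and the clamp at 0 are hypothetical additions.

```python
def clamp_first_word_start(segment, max_duration=0.1):
    """Return a copy of `segment` whose first word starts at end_time - max_duration.

    `segment` is assumed to be a list of dicts with "start_time" and
    "end_time" keys (seconds); this layout is an assumption, not the
    actual Whisper output format used in the evaluation.
    """
    if not segment:
        return []
    # Copy each word so the caller's data is left untouched.
    words = [dict(word) for word in segment]
    first = words[0]
    # Move the start of the first word close to its end; never go below 0.
    first["start_time"] = max(0.0, first["end_time"] - max_duration)
    return words


if __name__ == "__main__":
    example = [
        {"word": "hallo", "start_time": 0.0, "end_time": 2.5},
        {"word": "wereld", "start_time": 2.6, "end_time": 3.0},
    ]
    print(clamp_first_word_start(example))
```

Only the first word of each segment is changed, so the timings of all later words stay as the model produced them.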