The baseline NEL models are built on top of the baseline NER models. No separate training is done for NEL.
For evaluation, two hyperparameters are tuned on the dev set: offset and incl_blank. offset is a fixed duration by which the predicted time stamps are shifted. incl_blank is a Boolean that decides whether the trailing blank tokens in the CTC emissions are considered part of the predicted segment. When incl_blank is True, the segment between the start and end word separator tokens is considered a hypothesis.
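As a rough illustration of how these two hyperparameters could act on a predicted segment, here is a minimal sketch. It is not the code used by the baseline scripts. The function name, the blank symbol, the frame duration argument, and the choice to shift both boundaries by the same offset are all assumptions made for this example.

```python
from typing import List, Tuple


def refine_segment(
    frame_tokens: List[str],
    start_frame: int,
    end_frame: int,
    frame_sec: float,
    offset: float,
    incl_blank: bool,
    blank: str = "<blank>",  # hypothetical blank symbol
) -> Tuple[float, float]:
    """Convert a frame-level segment to seconds (hypothetical helper).

    start_frame and end_frame are inclusive frame indices.
    """
    end = end_frame
    if incl_blank:
        # Extend the segment over blank frames that directly follow it.
        while end + 1 < len(frame_tokens) and frame_tokens[end + 1] == blank:
            end += 1
    # Assumption: the same fixed offset shifts both boundaries.
    start_sec = max(0.0, start_frame * frame_sec + offset)
    end_sec = max(0.0, (end + 1) * frame_sec + offset)
    return start_sec, end_sec
```

For example, with 20 ms frames and tokens `["<blank>", "[", "a", "]", "<blank>", "<blank>", "b"]`, the segment spanning frames 1 to 3 ends at 0.08 s without trailing blanks and at 0.12 s with them, before any offset is applied.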
The word-F1 metric is evaluated with a tolerance hyperparameter. tolerance (ρ), a value between 0 and 1, is the fraction of overlap between a ground-truth word segment and the predicted region needed to count the word as detected; ρ = 1 means a perfect match is required.
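The detection rule can be sketched as below. This covers only the per-word decision described above, not the full word-F1 computation in the evaluation script. The function names are hypothetical, and using `>=` for the comparison is an assumption.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def overlap_fraction(word: Segment, pred: Segment) -> float:
    """Fraction of the ground-truth word duration covered by the prediction."""
    w_start, w_end = word
    p_start, p_end = pred
    duration = w_end - w_start
    if duration <= 0:
        return 0.0
    intersection = max(0.0, min(w_end, p_end) - max(w_start, p_start))
    return intersection / duration


def is_detected(word: Segment, pred: Segment, tolerance: float) -> bool:
    """A word counts as detected when its covered fraction reaches tolerance."""
    return overlap_fraction(word, pred) >= tolerance
```

A word spanning 1.0 s to 2.0 s that is half covered by a prediction starting at 1.5 s is detected at tolerance 0.5 but not at tolerance 1.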
Time stamps are extracted using CTC emissions from the E2E NER model. The frames between start and end special characters constitute the detected entity segment.
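The idea of reading entity segments off the frame-level output can be sketched as follows. This is an illustration only: the function name and the token symbols are hypothetical, and the input is assumed to be the per-frame best token (for example, the argmax of the CTC emissions).

```python
from typing import List, Set, Tuple


def find_entity_frames(
    frame_tokens: List[str],
    start_tokens: Set[str],
    end_token: str,
) -> List[Tuple[int, int]]:
    """Return inclusive (start_frame, end_frame) pairs (hypothetical helper).

    A segment opens at the first frame emitting a start special character
    and closes at the first following frame emitting the end character.
    """
    segments = []
    open_at = None
    for i, tok in enumerate(frame_tokens):
        if tok in start_tokens:
            # Repeated start frames: keep the earliest one.
            if open_at is None:
                open_at = i
        elif tok == end_token and open_at is not None:
            segments.append((open_at, i))
            open_at = None
    return segments
```

Multiplying the frame indices by the model's frame duration converts them to time stamps.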
Step 1: Extract CTC emissions from the E2E NER model and save character-level timestamps.
bash baselines/nel/decode.sh e2e_ner dev
Step 2: Hyperparameter search on dev.
bash baselines/nel/eval_nel.sh e2e
Time stamps are extracted using CTC emissions from the ASR model. The frames corresponding to the entity phrase as detected by the text NER model constitute the detected entity segment.
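For the pipeline case, mapping an entity phrase back to time can be sketched as below. It assumes the character-level timestamps are available as one `(char, start_sec, end_sec)` tuple per transcript character, and that the text NER model returns a character span over the same transcript. Both the data layout and the function name are assumptions for this example, not the format written by the decode script.

```python
from typing import List, Tuple

CharTime = Tuple[str, float, float]  # (char, start_sec, end_sec)


def span_to_time(
    char_times: List[CharTime], char_start: int, char_end: int
) -> Tuple[float, float]:
    """Map a character span [char_start, char_end) to a time segment."""
    chars = char_times[char_start:char_end]
    if not chars:
        raise ValueError("empty character span")
    # Segment runs from the first character's start to the last one's end.
    return chars[0][1], chars[-1][2]
```

With the transcript "hi bob" and an entity span covering "bob" (characters 3 to 6), the returned segment starts where the first "b" starts and ends where the last "b" ends.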
TODO: Add evaluation scripts in the table format.
ASR model: wav2vec2.0 finetuned for ASR
Text NER model: DeBERTa-Base finetuned for NER
Step 1: Extract CTC emissions from the ASR model and save character-level timestamps.
bash baselines/nel/decode.sh asr dev
Step 2: Hyperparameter search on dev.
bash baselines/nel/eval_nel.sh ppl
Perfect ASR: assumes access to ground-truth (GT) transcripts, so the predicted time stamps are the same as the GT force-aligned time stamps.
Evaluate output of the text NER model on dev.
bash baselines/nel/eval_nel.sh oracle_ppl