This repository contains the code for our TACL paper:
Revisiting Meta-evaluation for Grammatical Error Correction.
If you use this code, please cite our paper:
@misc{kobayashi2024revisiting,
      title={Revisiting Meta-evaluation for Grammatical Error Correction},
      author={Masamune Kobayashi and Masato Mita and Mamoru Komachi},
      year={2024},
      eprint={2403.02674},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
First, evaluate the sentences in `outputs/subset` at the system level with the evaluation metric you want to meta-evaluate. Then, calculate the correlation using the system scores provided by humans (Table 2).[^1]
To conduct system-level meta-evaluation, simply run:
python corr_system.py --human_score HUMAN_SCORE --metric_score METRIC_SCORE --systems SYSTEMS
- `HUMAN_SCORE`: The file of human evaluation scores for each system, located in the `scores/human` directory. The system scores are arranged alphabetically as follows: 1: BART, 2: BERT-fuse, 3: GECToR_BERT, 4: GECToR_ens, 5: GPT-3.5, 6: INPUT, 7: LM-Critic, 8: PIE, 9: REF-F, 10: REF-M, 11: Riken-Tohoku, 12: T5, 13: TemplateGEC, 14: TransGEC, 15: UEDIN-MS.
- `METRIC_SCORE`: The file of evaluation metric scores for each system, located in the `scores/metric` directory. Please create system-level evaluation score files for the target metrics. All scores should be sorted alphabetically by system name.
- `SYSTEMS`: The set of systems to consider for meta-evaluation. The default is `base`. To consider fluently corrected sentences, use `+REF-F_GPT-3.5`. To consider uncorrected sentences, use `INPUT`. Specify `all` to use all 15 systems.
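For example, a run on the default system set might look like the following. The score file names are placeholders only; use the files actually provided in `scores/human` and the metric score file you created in `scores/metric`.

```bash
# Placeholder file names: substitute the actual files in scores/human and scores/metric.
python corr_system.py \
    --human_score scores/human/human_scores.txt \
    --metric_score scores/metric/your_metric.txt \
    --systems base
```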
To conduct system-level window analysis, simply run:
python window_analysis_system.py --human_score HUMAN_SCORE --metric_score METRIC_SCORE --window_size WINDOW_SIZE
- `WINDOW_SIZE`: The number of systems to use for window analysis.
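As a sketch, a window analysis run could look like the example below. The score file names are again placeholders, and a window size of 4 is only an illustrative choice.

```bash
# Placeholder file names; the window size is an illustrative value.
python window_analysis_system.py \
    --human_score scores/human/human_scores.txt \
    --metric_score scores/metric/your_metric.txt \
    --window_size 4
```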
First, evaluate the sentences in `outputs/subset` at the sentence level with the evaluation metric you want to meta-evaluate. Then, calculate the correlation using the sentence scores provided by humans.
To conduct sentence-level meta-evaluation, simply run:
python corr_sentence.py --human_score HUMAN_SCORE --metric_score METRIC_SCORE --order ORDER --systems SYSTEMS
- `HUMAN_SCORE`: The XML file of human evaluation scores for each sentence, located in the `data` directory. Consider the evaluation granularity of the metric (edit or sentence) when making your selection.
- `METRIC_SCORE`: The file of evaluation metric scores for each sentence, located in the `scores/metric/sentence_score` directory. Please create evaluation score files for the target metrics.
- `ORDER`: If a higher metric score indicates better system performance, choose `higher`; otherwise, choose `lower`.
- `SYSTEMS`: The set of systems to consider for meta-evaluation. The default is `base`. To consider fluently corrected sentences, use `+REF-F_GPT-3.5`. To consider uncorrected sentences, use `INPUT`. Specify `all` to use all 15 systems.
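As with the system-level case, a sentence-level run might look like the following. The XML file name and the metric score file name are placeholders (check the `data` and `scores/metric/sentence_score` directories for the actual files), and `--order higher` assumes your metric assigns higher scores to better corrections.

```bash
# Placeholder file names; --order depends on whether higher metric scores mean better output.
python corr_sentence.py \
    --human_score data/sentence_judgments.xml \
    --metric_score scores/metric/sentence_score/your_metric.txt \
    --order higher \
    --systems base
```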
We provide the evaluation scores for each system to promote the use of existing evaluation metrics.[^2]
System | M2 | SentM2 | PT-M2 | ERRANT | SentERRANT | PT-ERRANT | GoToScorer | GLEU | Scribendi Score | SOME | IMPARA |
---|---|---|---|---|---|---|---|---|---|---|---|
BART | 50.3 | 51.29 | 50.41 | 46.66 | 50.31 | 48.89 | 15.86 | 63.46 | 527 | 0.7933 | 5.31 |
BERT-fuse | 62.77 | 63.21 | 62.99 | 58.99 | 61.81 | 61.18 | 21.1 | 68.5 | 739 | 0.8151 | 5.816 |
GECToR-BERT | 61.83 | 61.23 | 60.76 | 58.05 | 59.76 | 59.17 | 18.98 | 66.56 | 640 | 0.8016 | 5.644 |
GECToR-ens | 63.53 | 60.37 | 59.21 | 61.43 | 59.64 | 58.57 | 16.58 | 65.08 | 529 | 0.786 | 5.17 |
GPT-3.5 | 53.5 | 53.28 | 53.41 | 44.12 | 49.23 | 48.93 | 22.85 | 65.93 | 835 | 0.8379 | 6.376 |
INPUT | 0.0 | 31.33 | 31.33 | 0.0 | 31.4 | 31.33 | 0.0 | 56.6 | 0.0 | 0.7506 | 4.089 |
LM-Critic | 55.5 | 58.0 | 57.16 | 52.38 | 56.41 | 55.86 | 16.23 | 64.39 | 683 | 0.8028 | 5.543 |
PIE | 59.93 | 60.7 | 60.69 | 55.89 | 59.35 | 58.65 | 21.07 | 67.83 | 601 | 0.8066 | 5.659 |
REF-F | 47.48 | 48.99 | 46.54 | 33.24 | 41.41 | 39.69 | 21.7 | 60.34 | 711 | 0.8463 | 6.569 |
REF-M | 60.12 | 62.3 | 62.91 | 54.77 | 60.11 | 60.6 | 23.92 | 67.27 | 754 | 0.8155 | 5.908 |
Riken-Tohoku | 64.74 | 64.22 | 64.31 | 61.88 | 63.13 | 62.73 | 20.94 | 68.37 | 678 | 0.8123 | 5.757 |
T5 | 65.07 | 65.2 | 66.13 | 60.65 | 63.24 | 63.77 | 20.46 | 68.81 | 668 | 0.8202 | 6.045 |
TemplateGEC | 56.29 | 57.59 | 57.47 | 51.34 | 56.08 | 55.96 | 14.7 | 65.07 | 448 | 0.7972 | 5.52 |
TransGEC | 68.08 | 68.33 | 68.37 | 64.43 | 66.76 | 66.37 | 21.93 | 70.2 | 779 | 0.82 | 6.035 |
UEDIN-MS | 64.55 | 62.68 | 62.67 | 61.33 | 61.38 | 61.19 | 18.94 | 67.41 | 666 | 0.808 | 5.591 |
[^1]: Human rankings can be created by replacing the `judgments.xml` file in the repository of Grundkiewicz et al. (2015).
[^2]: All scores are based on the entire text of the CoNLL-2014 test set and differ from the scores obtained using the subset used for our meta-evaluation.