Merge pull request #10 from ZurichNLP/evaluation
Evaluation
simon-clematide authored May 16, 2024
2 parents c0f15b6 + f228f59 commit 087b138
Showing 49 changed files with 3,395 additions and 55 deletions.
11 changes: 7 additions & 4 deletions README.md
@@ -21,6 +21,9 @@ Details can be found at the SwissNLP Website: https://www.swisstext.org/call-for
- The format of the submission should be identical to the format of the training/sample data.
- Hackathon-like submissions (24-48h of work) are welcome. Please indicate in your submission mail whether your submission should be categorized as hackathon-like.

To evaluate your predictions, create a folder named after your team in the `data/participant_submissions` directory. Place your result files in this folder as `.jsonl` files named `teamname_task1_run1.jsonl` (for Task 1) or `teamname_task2_run1.jsonl` (for Task 2). Then run `python evaluation/evaluation.py` from the root folder. The results are saved in `results/task1` or `results/task2`, depending on the task.
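Before running the evaluation, a submission file can be sanity-checked with a few lines of Python. This is a minimal sketch, not part of the official tooling; the `check_submission` helper and the exact required fields for Task 2 are assumptions based on the data format described below:

```python
import json

def check_submission(path, task):
    """Lightly validate a JSONL submission file (hypothetical helper,
    not part of the official evaluation script)."""
    # Assumed required fields, mirroring the training-data format
    required = {"ID", "SDG"} if task == 1 else {"ID", "TARGET", "TARGETS"}
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            record = json.loads(line)  # raises ValueError on invalid JSON
            missing = required - record.keys()
            if missing:
                raise ValueError(f"line {n}: missing fields {missing}")
    return True
```

Run it on your file (e.g. `check_submission("data/participant_submissions/teamname/teamname_task1_run1.jsonl", 1)`) before invoking the evaluation script.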


## Schedule

- 12th February 2024: Announcement of Shared Task on Swiss NLP
@@ -47,8 +50,8 @@ The distributed data will be in JSON Lines (JSONL) format, where each line is a
"ABSTRACT": "the full text of the abstract (string)",
"URL": "a link to the full document (string)",
"SDG": "for Task 1, an integer representing the SDG the abstract is classified under. SDGs are numbered from 0 to 17, where 0 represents the ‘non-relevant’ category.",
"MAIN_TARGET": "for Task 2, a string representing the primary SDG target the abstract addresses",
"SECONDARY_TARGETS": "for Task 2, an array of strings representing additional SDG targets the abstract addresses"
"TARGET": "for Task 2, a string representing the primary SDG target the abstract addresses",
"TARGETS": "for Task 2, an array of strings representing all SDG targets the abstract addresses"
}
```

@@ -73,7 +76,7 @@ The distributed data will be in JSON Lines (JSONL) format, where each line is a
"ABSTRACT": "As part of a trans-disciplinary research project, a series of surveys and interventions were conducted in different arsenic-affected regions of rural Bangladesh. Surveys of institutional stakeholders identified deep tubewells and piped water systems as the most preferred options, and the same preferences were found in household surveys of populations at risk. Psychological surveys revealed that these two technologies were well-supported by potential users, with self-efficacy and social norms being the principle factors driving behavior change. The principle drawbacks of deep tubewells are that installation costs are too high for most families to own private wells, and that for various socio-cultural-religious reasons, people are not willing to walk long distances to access communal tubewells. In addition, water sector planners have reservations about greater exploitation of the deep aquifer, out of concern for current or future geogenic contamination. Groundwater models and field studies have shown that in the great majority of the affected areas, the risk of arsenic contamination of deep groundwater is small; salinity, iron, and manganese are more likely to pose problems. These constituents can in some cases be avoided by exploiting an intermediate depth aquifer of good chemical quality, which is hydraulically and geochemically separate from the arsenic-contaminated shallow aquifer. Deep tubewells represent a technically sound option throughout much of the arsenic-affected regions, and future mitigation programs should build on and accelerate construction of deep tubewells. Utilization of deep tubewells, however, could be improved by increasing the tubewell density (which requires stronger financial support) to reduce travel times, by considering water quality in a holistic way, and by accompanying tubewell installation with motivational interventions based on psychological factors. 
By combining findings from technical and social sciences, the efficiency and success of arsenic mitigation in general - and installation of deep tubewells in particular - can be significantly enhanced.",
"URL": "https://www.zora.uzh.ch/id/eprint/89201",
"SDG": 6,
"MAIN_TARGET": "6.1",
"SECONDARY_TARGETS": ["6.3", "6.5"]
"TARGET": "6.1",
"TARGETS": ["6.1", "6.3", "6.5"]
}
```
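Records in this format can be read line by line with the standard `json` module. As an illustrative sketch (the `sdg_distribution` helper is not part of the repository), tallying the SDG label distribution of a Task 1 file looks like this:

```python
import json
from collections import Counter

def sdg_distribution(path):
    """Count how often each SDG label occurs in a Task 1 JSONL file."""
    with open(path, encoding="utf-8") as f:
        # One JSON object per line; skip any trailing blank lines
        return Counter(json.loads(line)["SDG"] for line in f if line.strip())
```

Such a tally is useful for spotting class imbalance, e.g. how dominant the 'non-relevant' category 0 is in the training data.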
156 changes: 156 additions & 0 deletions data/participant_submissions/MeHuBe/MeHuBe_TASK1_RUN1.jsonl

Large diffs are not rendered by default.

156 changes: 156 additions & 0 deletions data/participant_submissions/MeHuBe/MeHuBe_TASK1_RUN2.jsonl

Large diffs are not rendered by default.

156 changes: 156 additions & 0 deletions data/participant_submissions/MeHuBe/MeHuBe_TASK1_RUN3.jsonl

Large diffs are not rendered by default.

156 changes: 156 additions & 0 deletions data/participant_submissions/baseline/llama3_TASK1_RUN1.jsonl

Large diffs are not rendered by default.

156 changes: 156 additions & 0 deletions data/participant_submissions/baseline/llama3_cot_TASK1_RUN1.jsonl

Large diffs are not rendered by default.

156 changes: 156 additions & 0 deletions data/participant_submissions/bcode/bcode_TASK1_RUN1.jsonl

Large diffs are not rendered by default.

156 changes: 156 additions & 0 deletions data/participant_submissions/pronto/PRONTO_TASK1_RUN1.jsonl

Large diffs are not rendered by default.

102 changes: 51 additions & 51 deletions data/task2_train.jsonl

Large diffs are not rendered by default.

156 changes: 156 additions & 0 deletions data/testdata_results.jsonl

Large diffs are not rendered by default.

121 changes: 121 additions & 0 deletions evaluation/evaluation.py
@@ -0,0 +1,121 @@
import json
import os
import re

from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer


class Evaluator:
    def __init__(self, test_data_folder='../data'):
        self.test_data_folder = test_data_folder
        self.submission_folder = '../data/participant_submissions'

    def load_data(self, filepath):
        with open(filepath, 'r', encoding='utf-8') as f:
            return [json.loads(line) for line in f]

    def save_report(self, report, task_number, prediction_file_path, evaluationtype=""):
        base_filename = os.path.splitext(os.path.basename(prediction_file_path))[0]
        report_dir = f"../results/task{task_number}"
        os.makedirs(report_dir, exist_ok=True)  # create the results folder on first run
        report_filename = f"{report_dir}/{base_filename}_task{task_number}_report_{evaluationtype}.txt"
        with open(report_filename, 'w') as f:
            f.write(report)
        print(f"Report saved to {report_filename}")

    def match_data_by_id(self, predictions, true_data):
        # Align predictions with gold records via the shared ID field;
        # predictions whose ID has no gold counterpart are dropped.
        true_data_dict = {item['ID']: item for item in true_data}
        matched_predictions = []
        matched_true_data = []
        for pred in predictions:
            if pred['ID'] in true_data_dict:
                matched_predictions.append(pred)
                matched_true_data.append(true_data_dict[pred['ID']])
        return matched_predictions, matched_true_data

    def task_1_goldlabel(self, predictions, true_data):
        pred_labels = [item['SDG'] for item in predictions]
        true_labels = [item['SDG'] for item in true_data]
        report = classification_report(true_labels, pred_labels, output_dict=False, zero_division=0)
        print("Task 1 Goldlabel Evaluation:\n", report)
        return report

    def task_1_secondary(self, predictions, true_data):
        # Lenient evaluation: a prediction also counts as correct when it
        # matches any of the acceptable SDGs listed in the gold 'SDGS' field.
        for i, item in enumerate(predictions):
            if item['SDG'] in true_data[i]['SDGS']:
                true_data[i]['SDG'] = item['SDG']
        pred_labels = [item['SDG'] for item in predictions]
        true_labels = [item['SDG'] for item in true_data]
        report = classification_report(true_labels, pred_labels, output_dict=False, zero_division=0)
        print("Task 1 Secondary Evaluation:\n", report)
        return report

    def evaluate_task_1(self, predictions, true_data, prediction_file_path):
        predictions, true_data = self.match_data_by_id(predictions, true_data)
        report = self.task_1_goldlabel(predictions, true_data)
        self.save_report(report, 1, prediction_file_path, "goldlabel")
        report = self.task_1_secondary(predictions, true_data)
        self.save_report(report, 1, prediction_file_path, "secondary")

    def evaluate_task_2_goldlabel(self, predictions, true_data):
        lb = LabelBinarizer()
        true_main_targets = lb.fit_transform([item['TARGET'] for item in true_data])
        pred_main_targets = lb.transform([item['TARGET'] for item in predictions])
        main_target_report = classification_report(true_main_targets, pred_main_targets, target_names=lb.classes_, zero_division=0)
        print("Task 2 Goldlabel - Main Target Evaluation\n", main_target_report)
        return main_target_report

    def evaluate_task_2_secondary(self, predictions, true_data):
        mlb = MultiLabelBinarizer()
        true_secondary_targets = mlb.fit_transform([item['TARGETS'] for item in true_data])
        pred_secondary_targets = mlb.transform([item['TARGETS'] for item in predictions])
        secondary_target_report = classification_report(true_secondary_targets, pred_secondary_targets, target_names=mlb.classes_, zero_division=0)
        print("Task 2 Secondary - Secondary Targets Evaluation\n", secondary_target_report)
        return secondary_target_report

    def evaluate_task_2(self, predictions, true_data, prediction_file_path):
        predictions, true_data = self.match_data_by_id(predictions, true_data)
        # Goldlabel evaluation (primary target only)
        main_target_report = self.evaluate_task_2_goldlabel(predictions, true_data)
        self.save_report(main_target_report, 2, prediction_file_path, "goldlabel")
        # Secondary evaluation (all targets, multi-label)
        secondary_target_report = self.evaluate_task_2_secondary(predictions, true_data)
        self.save_report(secondary_target_report, 2, prediction_file_path, "secondary")

    def evaluate(self, task_number, prediction_file_path):
        original_data_path = f"{self.test_data_folder}/testdata_results.jsonl"
        predictions = self.load_data(prediction_file_path)
        true_data = self.load_data(original_data_path)

        if task_number == 1:
            self.evaluate_task_1(predictions, true_data, prediction_file_path)
        elif task_number == 2:
            self.evaluate_task_2(predictions, true_data, prediction_file_path)
        else:
            print("Invalid task number. Please specify 1 or 2.")

    def evaluate_all_participants(self):
        for folder in os.listdir(self.submission_folder):
            folder_path = os.path.join(self.submission_folder, folder)
            if not os.path.isdir(folder_path):
                continue  # skip stray files in the submissions directory
            for file in os.listdir(folder_path):
                if file.endswith('.json'):
                    file = self.convert_to_jsonl(file, folder_path)
                if file.endswith('.jsonl'):
                    print(f"################## Evaluating {file} ##################")
                    match = re.search(r'task(\d+)', file.lower())
                    if match is None:
                        continue  # filename does not follow the task naming scheme
                    task_number = int(match.group(1))
                    self.evaluate(task_number, f"{self.submission_folder}/{folder}/{file}")

    def convert_to_jsonl(self, file, folder_path):
        # Convert a plain JSON array submission into JSON Lines format.
        with open(f"{folder_path}/{file}", 'r', encoding='utf-8') as f:
            data = json.load(f)
        with open(f"{folder_path}/{os.path.splitext(file)[0]}.jsonl", 'w', encoding='utf-8') as f:
            for item in data:
                f.write(json.dumps(item) + '\n')
        return os.path.splitext(file)[0] + '.jsonl'


if __name__ == '__main__':
    evaluator = Evaluator()
    evaluator.evaluate_all_participants()
24 changes: 24 additions & 0 deletions results/task1/MeHuBe_TASK1_RUN1_task1_report_goldlabel.txt
@@ -0,0 +1,24 @@
precision recall f1-score support

0 0.71 0.29 0.41 83
1 1.00 0.67 0.80 3
2 0.67 0.50 0.57 4
3 0.07 0.33 0.11 3
4 0.67 0.33 0.44 6
5 0.44 1.00 0.62 4
6 0.60 0.75 0.67 4
7 0.29 0.67 0.40 3
8 0.09 0.40 0.14 5
9 0.50 0.60 0.55 5
10 0.30 0.75 0.43 4
11 0.50 0.50 0.50 4
12 0.43 0.50 0.46 6
13 0.25 1.00 0.40 2
14 1.00 0.20 0.33 5
15 0.44 0.80 0.57 5
16 0.10 0.33 0.15 3
17 0.00 0.00 0.00 7

accuracy 0.39 156
macro avg 0.45 0.53 0.42 156
weighted avg 0.58 0.39 0.41 156
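In the reports above, `macro avg` is the unweighted mean over the 18 classes, while `weighted avg` weights each class by its support; with 83 of the 156 abstracts in the 'non-relevant' class 0, the two can differ noticeably. The relationship can be sketched in pure Python (toy numbers below, not taken from the reports):

```python
def macro_avg(scores):
    # Unweighted mean over classes
    return sum(scores) / len(scores)

def weighted_avg(scores, supports):
    # Support-weighted mean over classes
    return sum(f * s for f, s in zip(scores, supports)) / sum(supports)

# Toy example: a strong majority class and a weak minority class
f1 = [0.8, 0.2]
support = [90, 10]
print(macro_avg(f1))              # 0.5
print(weighted_avg(f1, support))  # 0.74
```

A majority class that is predicted well pulls the weighted average up, which is why the weighted f1 here exceeds the macro f1 whenever class 0 scores above the per-class mean.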
24 changes: 24 additions & 0 deletions results/task1/MeHuBe_TASK1_RUN1_task1_report_secondary.txt
@@ -0,0 +1,24 @@
precision recall f1-score support

0 0.71 0.29 0.41 83
1 1.00 0.67 0.80 3
2 1.00 0.60 0.75 5
3 0.13 0.50 0.21 4
4 0.67 0.33 0.44 6
5 0.44 1.00 0.62 4
6 0.60 0.75 0.67 4
7 0.43 0.75 0.55 4
8 0.09 0.40 0.14 5
9 0.50 1.00 0.67 3
10 0.30 0.75 0.43 4
11 0.50 0.50 0.50 4
12 0.43 0.60 0.50 5
13 0.50 1.00 0.67 4
14 1.00 0.20 0.33 5
15 0.44 0.80 0.57 5
16 0.10 0.33 0.15 3
17 0.00 0.00 0.00 5

accuracy 0.42 156
macro avg 0.49 0.58 0.47 156
weighted avg 0.60 0.42 0.43 156
24 changes: 24 additions & 0 deletions results/task1/MeHuBe_TASK1_RUN2_task1_report_goldlabel.txt
@@ -0,0 +1,24 @@
precision recall f1-score support

0 0.71 0.29 0.41 83
1 1.00 0.67 0.80 3
2 0.67 0.50 0.57 4
3 0.12 0.67 0.20 3
4 0.33 0.17 0.22 6
5 0.40 1.00 0.57 4
6 0.50 0.50 0.50 4
7 0.25 0.67 0.36 3
8 0.09 0.40 0.15 5
9 0.33 0.40 0.36 5
10 0.38 0.75 0.50 4
11 0.67 0.50 0.57 4
12 0.33 0.50 0.40 6
13 0.25 1.00 0.40 2
14 0.67 0.40 0.50 5
15 0.50 0.80 0.62 5
16 0.12 0.33 0.18 3
17 0.00 0.00 0.00 7

accuracy 0.38 156
macro avg 0.41 0.53 0.41 156
weighted avg 0.55 0.38 0.40 156
24 changes: 24 additions & 0 deletions results/task1/MeHuBe_TASK1_RUN2_task1_report_secondary.txt
@@ -0,0 +1,24 @@
precision recall f1-score support

0 0.71 0.29 0.41 83
1 1.00 0.67 0.80 3
2 1.00 0.60 0.75 5
3 0.18 0.75 0.29 4
4 0.33 0.17 0.22 6
5 0.40 1.00 0.57 4
6 0.50 0.50 0.50 4
7 0.38 0.75 0.50 4
8 0.09 0.40 0.15 5
9 0.33 0.67 0.44 3
10 0.38 0.75 0.50 4
11 0.67 0.50 0.57 4
12 0.33 0.60 0.43 5
13 0.50 1.00 0.67 4
14 0.67 0.40 0.50 5
15 0.50 0.80 0.62 5
16 0.12 0.33 0.18 3
17 0.00 0.00 0.00 5

accuracy 0.42 156
macro avg 0.45 0.57 0.45 156
weighted avg 0.58 0.42 0.42 156
24 changes: 24 additions & 0 deletions results/task1/MeHuBe_TASK1_RUN3_task1_report_goldlabel.txt
@@ -0,0 +1,24 @@
precision recall f1-score support

0 0.70 0.28 0.40 83
1 1.00 0.67 0.80 3
2 0.67 0.50 0.57 4
3 0.12 0.67 0.20 3
4 0.33 0.17 0.22 6
5 0.44 1.00 0.62 4
6 0.50 0.50 0.50 4
7 0.22 0.67 0.33 3
8 0.09 0.40 0.15 5
9 0.33 0.40 0.36 5
10 0.33 0.75 0.46 4
11 0.67 0.50 0.57 4
12 0.38 0.50 0.43 6
13 0.25 1.00 0.40 2
14 0.67 0.40 0.50 5
15 0.50 0.80 0.62 5
16 0.11 0.33 0.17 3
17 0.00 0.00 0.00 7

accuracy 0.38 156
macro avg 0.41 0.53 0.41 156
weighted avg 0.55 0.38 0.39 156
24 changes: 24 additions & 0 deletions results/task1/MeHuBe_TASK1_RUN3_task1_report_secondary.txt
@@ -0,0 +1,24 @@
precision recall f1-score support

0 0.70 0.28 0.40 83
1 1.00 0.67 0.80 3
2 1.00 0.60 0.75 5
3 0.18 0.75 0.29 4
4 0.33 0.17 0.22 6
5 0.44 1.00 0.62 4
6 0.50 0.50 0.50 4
7 0.33 0.75 0.46 4
8 0.09 0.40 0.15 5
9 0.33 0.67 0.44 3
10 0.33 0.75 0.46 4
11 0.67 0.50 0.57 4
12 0.38 0.60 0.46 5
13 0.50 1.00 0.67 4
14 0.67 0.40 0.50 5
15 0.50 0.80 0.62 5
16 0.11 0.33 0.17 3
17 0.00 0.00 0.00 5

accuracy 0.41 156
macro avg 0.45 0.56 0.45 156
weighted avg 0.57 0.41 0.42 156
@@ -0,0 +1,24 @@
precision recall f1-score support

0 0.77 0.48 0.59 83
1 0.67 0.67 0.67 3
2 0.80 1.00 0.89 4
3 0.12 0.33 0.18 3
4 0.50 0.17 0.25 6
5 0.57 1.00 0.73 4
6 1.00 0.75 0.86 4
7 0.50 1.00 0.67 3
8 0.04 0.20 0.06 5
9 0.43 0.60 0.50 5
10 0.27 0.75 0.40 4
11 1.00 0.50 0.67 4
12 0.75 0.50 0.60 6
13 0.22 1.00 0.36 2
14 1.00 0.80 0.89 5
15 0.83 1.00 0.91 5
16 0.00 0.00 0.00 3
17 0.00 0.00 0.00 7

accuracy 0.52 156
macro avg 0.53 0.60 0.51 156
weighted avg 0.65 0.52 0.55 156
@@ -0,0 +1,24 @@
precision recall f1-score support

0 0.77 0.48 0.59 83
1 0.67 0.67 0.67 3
2 1.00 1.00 1.00 5
3 0.12 0.33 0.18 3
4 0.50 0.17 0.25 6
5 0.57 1.00 0.73 4
6 1.00 0.75 0.86 4
7 0.50 1.00 0.67 3
8 0.04 0.20 0.06 5
9 0.57 0.80 0.67 5
10 0.27 0.75 0.40 4
11 1.00 0.50 0.67 4
12 0.75 0.60 0.67 5
13 0.56 1.00 0.71 5
14 1.00 0.80 0.89 5
15 0.83 1.00 0.91 5
16 0.00 0.00 0.00 3
17 0.00 0.00 0.00 4

accuracy 0.55 156
macro avg 0.56 0.61 0.55 156
weighted avg 0.68 0.55 0.58 156