Optimise wikidata resources #234

Open
mcollardanuy opened this issue May 10, 2023 · 0 comments
Labels
enhancement New feature or request

Comments

Collaborator

mcollardanuy commented May 10, 2023

At the moment, we always need to load:

mentions_to_wikidata_normalized.json (97M)
wikidata_to_mentions_normalized.json (88M)
mentions_to_wikidata.json (73M)

However, wikidata_to_mentions_normalized.json is only really used when generating the DeezyMatch training set, so we could load it only in that case. Its only other use is to filter mentions in:

T-Res/geoparser/ranking.py

Lines 100 to 141 in 1d887ea

        # Load Wikidata mentions-to-wqid:
        with open(
            self.resources_path + "mentions_to_wikidata_normalized.json", "r"
        ) as f:
            self.mentions_to_wikidata = json.load(f)
        # Load Wikidata wqid-to-mentions:
        with open(
            self.resources_path + "wikidata_to_mentions_normalized.json", "r"
        ) as f:
            self.wikidata_to_mentions = json.load(f)
        # Filter mentions to remove noise:
        wikidata_to_mentions_filtered = dict()
        mentions_to_wikidata_filtered = dict()
        for wk in self.wikidata_to_mentions:
            wikipedia_mentions = self.wikidata_to_mentions.get(wk)
            wikipedia_mentions_stripped = dict(
                [
                    (x, wikipedia_mentions[x])
                    for x in wikipedia_mentions
                    if not ", " in x and not " (" in x
                ]
            )
            if wikipedia_mentions_stripped:
                wikipedia_mentions = wikipedia_mentions_stripped
            wikidata_to_mentions_filtered[wk] = dict(
                [(x, wikipedia_mentions[x]) for x in wikipedia_mentions]
            )
            for m in wikidata_to_mentions_filtered[wk]:
                if m in mentions_to_wikidata_filtered:
                    mentions_to_wikidata_filtered[m][
                        wk
                    ] = wikidata_to_mentions_filtered[wk][m]
                else:
                    mentions_to_wikidata_filtered[m] = {
                        wk: wikidata_to_mentions_filtered[wk][m]
                    }
        self.mentions_to_wikidata = mentions_to_wikidata_filtered
        self.wikidata_to_mentions = wikidata_to_mentions_filtered
        del mentions_to_wikidata_filtered
        del wikidata_to_mentions_filtered

However, this filtering could be done directly in wiki2gaz, which would make more sense.
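
As a rough sketch of what moving this step into wiki2gaz might look like, the filtering from the snippet can be written as a standalone function that runs once when the resources are built, so the ranker would only need to load files that are already filtered. The function name `filter_mentions` and the toy IDs and scores are assumptions for illustration; they are not part of wiki2gaz or T-Res.

```python
def filter_mentions(wikidata_to_mentions):
    """Hypothetical helper: return (wqid_to_mentions, mention_to_wqids), both filtered."""
    wqid_to_mentions = {}
    mention_to_wqids = {}
    for wqid, mentions in wikidata_to_mentions.items():
        # Drop noisy mentions containing ", " or " (", as the ranker does today.
        stripped = {
            m: score
            for m, score in mentions.items()
            if ", " not in m and " (" not in m
        }
        # If filtering would leave nothing, keep the original mentions instead.
        kept = stripped if stripped else dict(mentions)
        wqid_to_mentions[wqid] = kept
        # Build the inverse mapping from the filtered mentions.
        for m, score in kept.items():
            mention_to_wqids.setdefault(m, {})[wqid] = score
    return wqid_to_mentions, mention_to_wqids


# Toy input with made-up IDs and scores.
toy = {
    "Q1": {"London": 0.9, "London, England": 0.1},
    "Q2": {"London, Ontario": 1.0},
}
w2m, m2w = filter_mentions(toy)
print(w2m)
print(m2w)
```

In the toy data, "London, England" is dropped for Q1, while Q2 keeps "London, Ontario" because it has no other mention, which mirrors the fallback in the current ranker code.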

Finally, mentions_to_wikidata_normalized.json and mentions_to_wikidata.json could be merged into a single resource that stores a tuple of both scores for each mention-Wikidata pair.
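
One possible shape for the merged resource is sketched below. It assumes both files map each mention to a dictionary of Wikidata ID to score (this layout is visible for the normalized file in the snippet above, but is only an assumption for mentions_to_wikidata.json). The function name `merge_mention_scores` and the toy values are hypothetical.

```python
import json


def merge_mention_scores(raw, normalized):
    """Hypothetical helper: combine two mention -> {wqid: score} mappings.

    Each pair maps to (raw_score, normalized_score); a score missing
    from one of the inputs is stored as None.
    """
    merged = {}
    for mention in raw.keys() | normalized.keys():
        raw_scores = raw.get(mention, {})
        norm_scores = normalized.get(mention, {})
        merged[mention] = {
            wqid: (raw_scores.get(wqid), norm_scores.get(wqid))
            for wqid in raw_scores.keys() | norm_scores.keys()
        }
    return merged


# Toy inputs with made-up IDs and scores.
raw = {"London": {"Q1": 120, "Q2": 30}}
normalized = {"London": {"Q1": 0.8, "Q2": 0.2}}
merged = merge_mention_scores(raw, normalized)

# JSON has no tuple type, so the pairs come back as two-element lists.
round_tripped = json.loads(json.dumps(merged))
print(round_tripped["London"]["Q1"])
```

Since JSON serializes tuples as lists, any code reading the merged file would need to index into a two-element list rather than expect a Python tuple.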

On the other hand, we also require the following two resources:

wikidata_gazetteer.csv (24M)
entity2class.txt (20M)

These could be merged because they share the same ID. That would require modifying the wiki2gaz scripts and also the T-Res linker.
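
A minimal sketch of such a merge follows, under several assumptions: the gazetteer is read as CSV rows with an ID column, and entity2class has already been loaded into a dictionary keyed by the same ID. The column names (`wikidata_id`, `latitude`, `longitude`), the `class` field, the function name, and the toy values are all hypothetical; the actual file layouts are not described in this issue.

```python
import csv
import io


def merge_gazetteer_with_classes(gazetteer_rows, entity2class, id_column="wikidata_id"):
    """Hypothetical helper: attach a class to each gazetteer row via the shared ID."""
    merged = []
    for row in gazetteer_rows:
        combined = dict(row)
        # Rows without a known class get None rather than being dropped.
        combined["class"] = entity2class.get(row[id_column])
        merged.append(combined)
    return merged


# Toy stand-ins for the two resources (made-up columns and values).
gazetteer_text = "wikidata_id,latitude,longitude\nQ1,51.5,-0.1\nQ2,42.9,-81.2\n"
rows = list(csv.DictReader(io.StringIO(gazetteer_text)))
entity2class = {"Q1": "Q100"}

merged_rows = merge_gazetteer_with_classes(rows, entity2class)
for row in merged_rows:
    print(row)
```

Keeping rows that have no class entry avoids silently shrinking the gazetteer; whether that is the desired behaviour would depend on how the linker uses the class information.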

@mcollardanuy changed the title from "Optimise wikidata mentions resources" to "Optimise wikidata resources" May 10, 2023
@mcollardanuy mcollardanuy added the enhancement New feature or request label May 10, 2023