Optimise wikidata resources #234

mcollardanuy · 2023-05-10T08:34:02Z

At the moment, we always need to load:

mentions_to_wikidata_normalized.json (97M)
wikidata_to_mentions_normalized.json (88M)
mentions_to_wikidata.json (73M)

However, wikidata_to_mentions_normalized.json is only really used when generating the DeezyMatch training set, so we could load it only in that case. It is also used to filter mentions in:

T-Res/geoparser/ranking.py

Lines 100 to 141 in 1d887ea

    
    # Load Wikidata mentions-to-wqid:
    with open(
        self.resources_path + "mentions_to_wikidata_normalized.json", "r"
    ) as f:
        self.mentions_to_wikidata = json.load(f)
    # Load Wikidata wqid-to-mentions:
    with open(
        self.resources_path + "wikidata_to_mentions_normalized.json", "r"
    ) as f:
        self.wikidata_to_mentions = json.load(f)
    # Filter mentions to remove noise:
    wikidata_to_mentions_filtered = dict()
    mentions_to_wikidata_filtered = dict()
    for wk in self.wikidata_to_mentions:
        wikipedia_mentions = self.wikidata_to_mentions.get(wk)
        wikipedia_mentions_stripped = dict(
            [
                (x, wikipedia_mentions[x])
                for x in wikipedia_mentions
                if not ", " in x and not " (" in x
            ]
        )
        if wikipedia_mentions_stripped:
            wikipedia_mentions = wikipedia_mentions_stripped
        wikidata_to_mentions_filtered[wk] = dict(
            [(x, wikipedia_mentions[x]) for x in wikipedia_mentions]
        )
        for m in wikidata_to_mentions_filtered[wk]:
            if m in mentions_to_wikidata_filtered:
                mentions_to_wikidata_filtered[m][
                    wk
                ] = wikidata_to_mentions_filtered[wk][m]
            else:
                mentions_to_wikidata_filtered[m] = {
                    wk: wikidata_to_mentions_filtered[wk][m]
                }
    self.mentions_to_wikidata = mentions_to_wikidata_filtered
    self.wikidata_to_mentions = wikidata_to_mentions_filtered
    del mentions_to_wikidata_filtered
    del wikidata_to_mentions_filtered

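However, this filtering could be done directly in wiki2gaz, which would make more sense: the resources would then be stored already filtered, and the ranker would not need to load wikidata_to_mentions_normalized.json at all.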
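As a rough sketch of what that preprocessing step might look like, the filtering in ranking.py can be pulled out into a standalone function that takes the wqid-to-mentions dictionary and returns both filtered mappings. The function name `filter_mentions` and the toy data are made up for illustration; where exactly this would live in wiki2gaz is left open.

```python
from collections import defaultdict


def filter_mentions(wikidata_to_mentions):
    """Hypothetical helper: filter noisy mentions and rebuild the inverse map.

    Mirrors the filtering currently done in ranking.py, under the assumption
    that the input maps each Wikidata ID to a {mention: score} dictionary.
    """
    wikidata_to_mentions_filtered = {}
    mentions_to_wikidata_filtered = defaultdict(dict)
    for wqid, mentions in wikidata_to_mentions.items():
        # Drop mentions that contain ", " or " (" (e.g. "London, Ontario").
        stripped = {
            m: score
            for m, score in mentions.items()
            if ", " not in m and " (" not in m
        }
        # If nothing survives the filter, keep the original mentions.
        kept = stripped if stripped else dict(mentions)
        wikidata_to_mentions_filtered[wqid] = kept
        for m, score in kept.items():
            mentions_to_wikidata_filtered[m][wqid] = score
    return wikidata_to_mentions_filtered, dict(mentions_to_wikidata_filtered)


# Toy data, invented for this example only.
toy = {
    "Q84": {"London": 0.9, "London, England": 0.1},
    "Q1": {"Foo (bar)": 1.0},
}
w2m, m2w = filter_mentions(toy)
```

If wiki2gaz wrote out the two returned dictionaries, the ranker would only have to load the mentions-to-wikidata file.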

Finally, mentions_to_wikidata_normalized.json and mentions_to_wikidata.json could be merged into just one resource, with a tuple with both scores for each mention-wikidata pair.
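One possible shape for the merged resource, sketched under the assumption that both files map each mention to a `{wqid: score}` dictionary. The function name `merge_mention_scores`, the `(raw, normalized)` ordering, and the toy values are illustrative choices, not something already decided. Note that JSON has no tuple type, so the pairs would be stored as two-element lists on disk.

```python
import json


def merge_mention_scores(raw, normalized):
    """Hypothetical helper: combine two {mention: {wqid: score}} mappings.

    Each mention-wqid pair maps to (raw_score, normalized_score); a score
    missing from one of the inputs is stored as None.
    """
    merged = {}
    for mention in raw.keys() | normalized.keys():
        raw_scores = raw.get(mention, {})
        norm_scores = normalized.get(mention, {})
        merged[mention] = {
            wqid: (raw_scores.get(wqid), norm_scores.get(wqid))
            for wqid in raw_scores.keys() | norm_scores.keys()
        }
    return merged


# Toy values, invented for this example only.
raw = {"London": {"Q84": 120, "Q92561": 3}}
normalized = {"London": {"Q84": 0.9, "Q92561": 0.02}}
merged = merge_mention_scores(raw, normalized)

# After a JSON round trip, the tuples come back as lists.
round_tripped = json.loads(json.dumps(merged))
```

Any code that currently reads one of the two files would then index into the pair instead, which is the main change needed on the T-Res side.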

On the other hand, we also require the following two resources:

wikidata_gazetteer.csv (24M)
entity2class.txt (20M)

These could be merged because they share the same ID. That would require modifying the wiki2gaz scripts and also the T-Res linker.
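A minimal sketch of such a merge, assuming the gazetteer is read as rows keyed by a Wikidata ID column and the entity-to-class resource is available as a dictionary from Wikidata ID to class. The column names (`wikidata_id`, `english_label`, `class`), the function name `add_class_column`, and the toy data are all hypothetical; the real file layouts may differ.

```python
import csv
import io


def add_class_column(gazetteer_rows, entity2class, id_column="wikidata_id"):
    """Hypothetical helper: attach each entity's class to its gazetteer row.

    Rows whose ID has no entry in entity2class get an empty class.
    The input rows are not modified.
    """
    merged = []
    for row in gazetteer_rows:
        new_row = dict(row)
        new_row["class"] = entity2class.get(row[id_column], "")
        merged.append(new_row)
    return merged


# Toy gazetteer, invented for this example only.
toy_csv = "wikidata_id,english_label\nQ84,London\nQ90,Paris\n"
rows = list(csv.DictReader(io.StringIO(toy_csv)))
merged_rows = add_class_column(rows, {"Q84": "Q515"})
```

With the class stored as an extra column, the linker would only need to load one file to get both the gazetteer fields and the entity class.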

mcollardanuy changed the title ~~Optimise wikidata mentions resources~~ Optimise wikidata resources May 10, 2023

mcollardanuy added the enhancement New feature or request label May 10, 2023
