Fit/Transform does not give top n matches #81

shaluchiipi · 2024-11-27T08:21:34Z

I was using get_matches() to get the top 5 matches. Now that I am moving to production, I planned to use Fit/Transform, but it seems to return only the top match for each item. Is there any way to get the top 5 matches with Fit/Transform?

I am matching current text notes (non-semantic long text) against historical ones. The historical data will be large, numbering in lakhs. To make the code more efficient, I plan to pass the historical text notes to fit and the current text notes to transform, and to retrain the model monthly.

Sample Current/Input Data:

ID	Alloc_No	Text_Notes
2354657	78	RHJ…..//32456hjfg//vkcmEGHJJJYMM
4354657	35	TFHGDVASFHC4636587//5748UJKNM
345676	889	WUSERHIFKDJVN//23475//IUOSJDFGKV
34747586	57	YWEIHFDSK//2435467//WEKSFDHLV
465768	3777	324TYVHBJN//435465//HUJNKHJKN

Sample Reference/Historical Data:

ID	Alloc_No	Text_Notes
2354657	78	RHJ…..//32456hjfg//vkcmEGHJJJYMM
4354657	35	TFHGDVASFHC4636587//5748UJKNM
345676	889	WUSERHIFKDJVN//23475//IUOSJDFGKV
34747586	57	YWEIHFDSK//2435467//WEKSFDHLV
465768	3777	324TYVHBJN//435465//HUJNKHJKN
2354657	78	RHJ…..//32456hjfg//vkcmEGHJJJYMM
4354657	35	TFHGDVASFHC4636587//5748UJKNM
345676	889	WUSERHIFKDJVN//23475//IUOSJDFGKV
34747586	57	YWEIHFDSK//2435467//WEKSFDHLV
465768	3777	324TYVHBJN//435465//HUJNKHJKN

Sample Code:

Old Code using get_matches():

# Passing reference historical text notes to "to_list"
to_list = Ref.Text_Notes.to_list()

dict2 = {}
for i in range(0, Input.shape[0]):
    # Passing the new text notes one by one to get similarity score for all reference items and then get top 5 from it
    from_list = []
    from_list.append(Input.Text_Notes[i])
    #print(to_list)
    model = PolyFuzz("TF-IDF").match(from_list, to_list)
    matches = model.get_matches().sort_values(by='Similarity', ascending=False)
    matches1 = pd.merge(matches, Ref, left_index=True, right_index=True)
    dict1 = matches1[['ID', 'Similarity', 'Alloc_No', 'From', 'To']].to_dict('index')
    list1 = list(dict1.items())[:5]
    dict2.update({Input['ID'][i]: list1})

dict2

New Changed Code for Production using Fit/Transform:

# Fit the reference historical text notes
# Frequency - monthly
from_list = Ref.Text_Notes.to_list()
model = PolyFuzz("TF-IDF")
model.fit(from_list)
model.save("TF-IDF")

# Match the new text notes
# Frequency - daily
dict2 = {}
for i in range(0, Input.shape[0]):
    to_list = []
    to_list.append(Input.Text_Notes[i])
    model = PolyFuzz.load("TF-IDF")
    matches = model.transform(to_list)
    print(matches)
dict2

The issue is that transform does not give me similarity scores for all reference items, only the top 1 match, whereas I need the top 5.


MaartenGr · 2024-11-27T09:45:55Z

Thank you for sharing your description! Could you perhaps create a minimum reproducible example for me to try? I see some unformatted code here and there in your issue but I'm not quite sure what to run and in what order. It can be as simple as fitting on just a couple of rows just to showcase the issue.

shaluchiipi · 2024-12-02T19:41:50Z

Github_Example_Input.xlsx
Github_Example_Ref.xlsx
#I ran this code and copied it here for your reference. Also have added the input and reference excel files.

import pandas as pd
from polyfuzz import PolyFuzz

Input=pd.read_excel('Github_Example_Input.xlsx')
Ref=pd.read_excel('Github_Example_Ref.xlsx')

Old Code:

to_list = Ref.Text_Notes.to_list()
dict2 ={}
for i in range(0,Input.shape[0]):
#Passing the new text notes one by one to get similarity score for all reference items and then get top 5 from it
    from_list=[]
    from_list.append(Input.Text_Notes[i])
    #print(to_list)
    model = PolyFuzz("TF-IDF").match(from_list, to_list)
    matches=model.get_matches().sort_values(by='Similarity',ascending=False)
    matches1=pd.merge(matches,Ref,left_index=True, right_index=True)
    dict1=matches1[['ID','Similarity','Alloc_No','From','To']].to_dict('index')
    list1=list(dict1.items())[:5]
    dict2.update({Input['ID'][i]: list1})
dict2

New Code:

from_list = Ref.Text_Notes.to_list()
model = PolyFuzz("TF-IDF")
model.fit(from_list)
model.save("TF-IDF")

matches1={}
for i in range(0,Input.shape[0]):
    to_list=[]
    to_list.append(Input.Text_Notes[i])
    model = PolyFuzz.load("TF-IDF")
    matches=model.transform(to_list)
    print(matches)
    matches1.update({Input['ID'][i]:matches.values()})
matches1

MaartenGr · 2024-12-03T08:07:45Z

@shaluchiipi Thank you for sharing the example, it is greatly appreciated!

I typically don't download and open files from sources I'm not familiar with. Could you perhaps showcase the example with just a couple of docs like so:

docs = ["my doc", "another doc", "etc."]

I'm assuming we don't need to have a large dataset for this, right?

MaartenGr · 2024-12-03T08:08:51Z

Also, note that I updated your message to include these ``` brackets. Without them, it's not clear to me what the indentation is, where the code starts, etc. Please use those in the future when sharing code.

shaluchiipi · 2024-12-03T09:59:23Z

Input and Ref Creation

import pandas as pd
from polyfuzz import PolyFuzz

Input={'ID': {0: 2354657, 1: 4354657, 2: 345676, 3: 34747586, 4: 465768},
 'Alloc_No': {0: 78, 1: 35, 2: 889, 3: 57, 4: 3777},
 'Text_Notes': {0: 'RHJ…..//32456hjfg//vkcmEGHJJJYMM',
  1: 'TFHGDVASFHC4636587//5748UJKNM',
  2: 'WUSERHIFKDJVN//23475//IUOSJDFGKV',
  3: 'YWEIHFDSK//2435467//WEKSFDHLV',
  4: '324TYVHBJN//435465//HUJNKHJKN'}}

Input=pd.DataFrame(Input)

Ref={'ID': {0: 2354657,
  1: 4354657,
  2: 345676,
  3: 34747586,
  4: 465768,
  5: 2354657,
  6: 4354657,
  7: 345676,
  8: 34747586,
  9: 465768},
 'Alloc_No': {0: 78,
  1: 35,
  2: 889,
  3: 57,
  4: 3777,
  5: 78,
  6: 35,
  7: 889,
  8: 57,
  9: 3777},
 'Text_Notes': {0: 'RHJ…..//32456hjfg//vkcmEGHJJJYMM',
  1: 'TFHGDVASFHC4636587//5748UJKNM',
  2: 'WUSERHIFKDJVN//23475//IUOSJDFGKV',
  3: 'YWEIHFDSK//2435467//WEKSFDHLV',
  4: '324TYVHBJN//435465//HUJNKHJKN',
  5: 'RHJ…..//32456hjfg//vkcmEGHJJJYMM',
  6: 'TFHGDVASFHC4636587//5748UJKNM',
  7: 'WUSERHIFKDJVN//23475//IUOSJDFGKV',
  8: 'YWEIHFDSK//2435467//WEKSFDHLV',
  9: '324TYVHBJN//435465//HUJNKHJKN'}}

Ref=pd.DataFrame(Ref)

Old Code:

from_list = Ref.Text_Notes.to_list()
dict2 ={}
for i in range(0,Input.shape[0]):
#Passing the new text notes one by one to get similarity score for all reference items and then get top 5 from it
    to_list=[]
    to_list.append(Input.Text_Notes[i])
    #print(to_list)
    model = PolyFuzz("TF-IDF").match(from_list, to_list)
    matches=model.get_matches().sort_values(by='Similarity',ascending=False)
    matches1=pd.merge(matches,Ref,left_index=True, right_index=True)
    dict1=matches1[['ID','Similarity','Alloc_No','From','To']].to_dict('index')
    list1=list(dict1.items())[:5]
    dict2.update({Input['ID'][i]: list1})
dict2

#Converting dictionary to Result dataframe with top 5 matches

Result=pd.DataFrame()
Result['Input_ID']=dict2.keys()
for i in range(0,Result.shape[0]):
    Int_dict1=list(dict2.values())[i]
    for j in range(0,len(Int_dict1)):
        Int_dict2=list(dict2.values())[i][j][1]
        Result.loc[i,'Match'+str(j)+'_Ref_ID']=list(Int_dict2.values())[0]
        Result.loc[i,'Match'+str(j)+'_Similarity']=list(Int_dict2.values())[1]
        Result.loc[i,'Match'+str(j)+'_Alloc']=list(Int_dict2.values())[2]
        Result.loc[i,'Match'+str(j)+'_From']=list(Int_dict2.values())[3]
        Result.loc[i,'Match'+str(j)+'_To']=list(Int_dict2.values())[4]
Result

New Code:

from_list = Ref.Text_Notes.to_list()
model = PolyFuzz("TF-IDF")
model.fit(from_list)
model.save("TF-IDF")

matches1={}
for i in range(0,Input.shape[0]):
    to_list=[]
    to_list.append(Input.Text_Notes[i])
    model = PolyFuzz.load("TF-IDF")
    matches=model.transform(to_list)
    print(matches)
    matches1.update({Input['ID'][i]:matches.values()})
matches1

shaluchiipi · 2024-12-09T03:03:00Z

Is there any way to get the top 5 matches in transform?

MaartenGr · 2024-12-09T11:30:25Z

Sorry for the late reply, I have been sick for the last week. I just checked the code and I believe it should work if you change this:

model = PolyFuzz("TF-IDF")

to this:

from polyfuzz.models import TFIDF

tfidf = TFIDF(min_similarity=0, top_n=3)  # use top_n=5 for the top 5 matches
model = PolyFuzz(tfidf)
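
For readers who want to see what a top-n setting changes conceptually, here is a minimal standard-library sketch of the fit-once, transform-many pattern discussed in this thread. It is not PolyFuzz's implementation: the `TopNMatcher` class, the `char_ngrams` helper, and the smoothed IDF formula are all assumptions made for illustration. It only shows that keeping the n highest cosine scores per query, rather than the single best one, is what yields several matches per item.

```python
import math
from collections import Counter
from heapq import nlargest


def char_ngrams(text, n=3):
    """Split a string into overlapping, lowercased character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]


class TopNMatcher:
    """Hypothetical stand-in for a fit/transform matcher (not PolyFuzz code)."""

    def __init__(self, top_n=5, n=3):
        self.top_n = top_n
        self.n = n

    def _vectorize(self, text):
        counts = Counter(char_ngrams(text, self.n))
        # Weight each n-gram by its IDF; n-grams unseen during fit are dropped.
        vec = {g: c * self.idf[g] for g, c in counts.items() if g in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {g: w / norm for g, w in vec.items()} if norm else {}

    def fit(self, reference):
        """Learn IDF weights and cache the reference vectors once."""
        self.reference = list(reference)
        doc_freq = Counter(
            g for doc in self.reference for g in set(char_ngrams(doc, self.n))
        )
        n_docs = len(self.reference)
        # Smoothed IDF (an assumption chosen for this sketch).
        self.idf = {
            g: math.log((1 + n_docs) / (1 + df)) + 1 for g, df in doc_freq.items()
        }
        self.ref_vectors = [self._vectorize(doc) for doc in self.reference]
        return self

    def transform(self, queries):
        """Return, for each query, its top_n (reference_index, similarity) pairs."""
        results = []
        for query in queries:
            q = self._vectorize(query)
            scores = (
                (idx, sum(w * ref.get(g, 0.0) for g, w in q.items()))
                for idx, ref in enumerate(self.ref_vectors)
            )
            # Keep the top_n highest scores instead of only the best one.
            results.append(nlargest(self.top_n, scores, key=lambda pair: pair[1]))
        return results


reference = [
    "TFHGDVASFHC4636587//5748UJKNM",
    "WUSERHIFKDJVN//23475//IUOSJDFGKV",
    "YWEIHFDSK//2435467//WEKSFDHLV",
    "324TYVHBJN//435465//HUJNKHJKN",
]
matcher = TopNMatcher(top_n=3).fit(reference)       # e.g. the monthly fit
top_matches = matcher.transform([reference[2]])     # e.g. the daily transform
for ref_idx, similarity in top_matches[0]:
    print(ref_idx, round(similarity, 3), reference[ref_idx])
```

The reference vectors are computed once in `fit` and reused for every `transform` call, which is the efficiency gain the original post is after. The indices returned by `transform` can be used to look up `ID` and `Alloc_No` in the reference table.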
