Fit/Transform does not give top n matches #81
Comments
Thank you for sharing your description! Could you perhaps create a minimal reproducible example for me to try? I see some unformatted code here and there in your issue, but I'm not quite sure what to run and in what order. It can be as simple as fitting on just a couple of rows to showcase the issue.
Github_Example_Input.xlsx

```python
import pandas as pd
from polyfuzz import PolyFuzz

Input=pd.read_excel('Github_Example_Input.xlsx')
Ref=pd.read_excel('Github_Example_Ref.xlsx')
```

Old Code:

```python
to_list = Ref.Text_Notes.to_list()
dict2 ={}
for i in range(0,Input.shape[0]):
    #Passing the new text notes one by one to get similarity score for all reference items and then get top 5 from it
    from_list=[]
    from_list.append(Input.Text_Notes[i])
    #print(to_list)
    model = PolyFuzz("TF-IDF").match(from_list, to_list)
    matches=model.get_matches().sort_values(by='Similarity',ascending=False)
    matches1=pd.merge(matches,Ref,left_index=True, right_index=True)
    dict1=matches1[['ID','Similarity','Alloc_No','From','To']].to_dict('index')
    list1=list(dict1.items())[:5]
    dict2.update({Input['ID'][i]: list1})
dict2
```

New Code:

```python
from_list = Ref.Text_Notes.to_list()
model = PolyFuzz("TF-IDF")
model.fit(from_list)
model.save("TF-IDF")
matches1={}
for i in range(0,Input.shape[0]):
    to_list=[]
    to_list.append(Input.Text_Notes[i])
    model = PolyFuzz.load("TF-IDF")
    matches=model.transform(to_list)
    print(matches)
    matches1.update({Input['ID'][i]:matches.values()})
matches1
```
@shaluchiipi Thank you for sharing the example, it is greatly appreciated! I typically don't download and open files from sources I'm not familiar with. Could you perhaps showcase the example with just a couple of docs, like so: `docs = ["my doc", "another doc", "etc."]`? I'm assuming we don't need a large dataset for this, right?
Also, note that I updated your message to include these ``` brackets. Without them, it's not clear to me what the indentation is, where the code starts, etc. Please use those in the future when sharing code.
Input and Ref Creation

```python
import pandas as pd
from polyfuzz import PolyFuzz

Input={'ID': {0: 2354657, 1: 4354657, 2: 345676, 3: 34747586, 4: 465768},
       'Alloc_No': {0: 78, 1: 35, 2: 889, 3: 57, 4: 3777},
       'Text_Notes': {0: 'RHJ…..//32456hjfg//vkcmEGHJJJYMM',
                      1: 'TFHGDVASFHC4636587//5748UJKNM',
                      2: 'WUSERHIFKDJVN//23475//IUOSJDFGKV',
                      3: 'YWEIHFDSK//2435467//WEKSFDHLV',
                      4: '324TYVHBJN//435465//HUJNKHJKN'}}
Input=pd.DataFrame(Input)
Ref={'ID': {0: 2354657, 1: 4354657, 2: 345676, 3: 34747586, 4: 465768,
            5: 2354657, 6: 4354657, 7: 345676, 8: 34747586, 9: 465768},
     'Alloc_No': {0: 78, 1: 35, 2: 889, 3: 57, 4: 3777,
                  5: 78, 6: 35, 7: 889, 8: 57, 9: 3777},
     'Text_Notes': {0: 'RHJ…..//32456hjfg//vkcmEGHJJJYMM',
                    1: 'TFHGDVASFHC4636587//5748UJKNM',
                    2: 'WUSERHIFKDJVN//23475//IUOSJDFGKV',
                    3: 'YWEIHFDSK//2435467//WEKSFDHLV',
                    4: '324TYVHBJN//435465//HUJNKHJKN',
                    5: 'RHJ…..//32456hjfg//vkcmEGHJJJYMM',
                    6: 'TFHGDVASFHC4636587//5748UJKNM',
                    7: 'WUSERHIFKDJVN//23475//IUOSJDFGKV',
                    8: 'YWEIHFDSK//2435467//WEKSFDHLV',
                    9: '324TYVHBJN//435465//HUJNKHJKN'}}
Ref=pd.DataFrame(Ref)
```

Old Code:

```python
from_list = Ref.Text_Notes.to_list()
dict2 ={}
for i in range(0,Input.shape[0]):
    #Passing the new text notes one by one to get similarity score for all reference items and then get top 5 from it
    to_list=[]
    to_list.append(Input.Text_Notes[i])
    #print(to_list)
    model = PolyFuzz("TF-IDF").match(from_list, to_list)
    matches=model.get_matches().sort_values(by='Similarity',ascending=False)
    matches1=pd.merge(matches,Ref,left_index=True, right_index=True)
    dict1=matches1[['ID','Similarity','Alloc_No','From','To']].to_dict('index')
    list1=list(dict1.items())[:5]
    dict2.update({Input['ID'][i]: list1})
dict2
#Converting dictionary to Result dataframe with top 5 matches
Result=pd.DataFrame()
Result['Input_ID']=dict2.keys()
for i in range(0,Result.shape[0]):
    Int_dict1=list(dict2.values())[i]
    for j in range(0,len(Int_dict1)):
        Int_dict2=list(dict2.values())[i][j][1]
        Result.loc[i,'Match'+str(j)+'_Ref_ID']=list(Int_dict2.values())[0]
        Result.loc[i,'Match'+str(j)+'_Similarity']=list(Int_dict2.values())[1]
        Result.loc[i,'Match'+str(j)+'_Alloc']=list(Int_dict2.values())[2]
        Result.loc[i,'Match'+str(j)+'_From']=list(Int_dict2.values())[3]
        Result.loc[i,'Match'+str(j)+'_To']=list(Int_dict2.values())[4]
Result
```

New Code:

```python
from_list = Ref.Text_Notes.to_list()
model = PolyFuzz("TF-IDF")
model.fit(from_list)
model.save("TF-IDF")
matches1={}
for i in range(0,Input.shape[0]):
    to_list=[]
    to_list.append(Input.Text_Notes[i])
    model = PolyFuzz.load("TF-IDF")
    matches=model.transform(to_list)
    print(matches)
    matches1.update({Input['ID'][i]:matches.values()})
matches1
```
Is there any way to get the top 5 matches in transform?
Sorry for the late reply, I have been sick for the last week. I just checked the code and I believe it should work if you change this:

```python
model = PolyFuzz("TF-IDF")
```

to this:

```python
tfidf = TFIDF(min_similarity=0, top_n=3)
model = PolyFuzz(tfidf)
```
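For the fit/transform workflow in this thread, the suggestion above could be wrapped roughly as follows. This is only a sketch: `fit_top_n_model` is a hypothetical helper name, and it assumes `TFIDF` is importable from `polyfuzz.models` and that `top_n` behaves in `transform` as it does in `match`. The exact columns of the returned matches are not shown because they may differ between PolyFuzz versions.

```python
def fit_top_n_model(reference_notes, top_n=5):
    """Fit a PolyFuzz TF-IDF model that keeps up to `top_n` candidates per query.

    Hypothetical helper for illustration; only the TFIDF(...) and
    PolyFuzz(...) calls come from the suggestion in this thread.
    """
    # Imported lazily so the helper can be defined without polyfuzz installed.
    # Assumption: TFIDF lives in polyfuzz.models.
    from polyfuzz import PolyFuzz
    from polyfuzz.models import TFIDF

    # min_similarity=0 disables the similarity cut-off, so weaker candidates
    # are kept instead of being dropped.
    tfidf = TFIDF(min_similarity=0, top_n=top_n)
    model = PolyFuzz(tfidf)
    model.fit(reference_notes)
    return model
```

With the data from this thread, usage would look like `model = fit_top_n_model(Ref.Text_Notes.to_list())` followed by `matches = model.transform(Input.Text_Notes.to_list())`, where `matches` is treated as a dictionary, as in the code above.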
I was using get_matches() to get the top 5 matches. Now that I am moving to production, I thought of using fit/transform, but it seems to return only the top match for each item. Is there any other way to get the top 5 matches with fit/transform?
I am matching current text notes (non-semantic long text) with historical ones. The historical data will be large, running into lakhs (hundreds of thousands) of notes. To make the code more efficient, I am planning to pass the historical text notes to fit and the current text notes to transform, and to retrain monthly.
Sample Current/Input Data:
Sample Reference/Historical Data:
Sample Code:
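Old Code using get_matches():

```python
# Passing reference historical text notes to "to_list"
to_list = Ref.Text_Notes.to_list()
for i in range(0,Input.shape[0]):
    ...  # loop body not included here; the complete loop is in the "Old Code" listing in the comments
dict2
```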
New Changed Code for Production using Fit/Transform:

```python
# Fit the reference historical text notes
# Frequency - monthly
from_list = Ref.Text_Notes.to_list()
model = PolyFuzz("TF-IDF")
model.fit(from_list)
model.save("TF-IDF")

# Match the new text notes
# Frequency - Daily
dict2 ={}
for i in range(0,Input.shape[0]):
    to_list=[]
    to_list.append(Input.Text_Notes[i])
    model = PolyFuzz.load("TF-IDF")
    matches=model.transform(to_list)
    print(matches)
dict2
```
The issue is that transform does not give me similarity scores for all reference notes, only the top 1 match, whereas I need the top 5.
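To make the expected behaviour concrete, here is a dependency-free sketch of "fit once on the historical notes, then return the top 5 reference notes for each new note". It is not PolyFuzz's implementation: `TopNIndex` and `char_ngrams` are hypothetical names, and the character-trigram TF-IDF with cosine similarity is a simplified stand-in for what a TF-IDF matcher does.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lower-cased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class TopNIndex:
    """Toy fit/transform index: TF-IDF over character n-grams, cosine ranking."""

    def fit(self, reference):
        self.reference = list(reference)
        counts = [Counter(char_ngrams(text)) for text in self.reference]
        n_docs = len(counts)
        doc_freq = Counter(gram for c in counts for gram in c)
        # Smoothed inverse document frequency per n-gram.
        self.idf = {g: math.log((1 + n_docs) / (1 + df)) + 1
                    for g, df in doc_freq.items()}
        self.vectors = [self._vectorize(c) for c in counts]
        return self

    def _vectorize(self, counts):
        # Weight known n-grams by TF-IDF and scale to unit length.
        vec = {g: tf * self.idf[g] for g, tf in counts.items() if g in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {g: w / norm for g, w in vec.items()} if norm else {}

    def transform(self, queries, top_n=5):
        """Return, per query, the top_n (reference_text, similarity) pairs."""
        results = []
        for query in queries:
            qv = self._vectorize(Counter(char_ngrams(query)))
            scored = [(sum(w * rv.get(g, 0.0) for g, w in qv.items()), idx)
                      for idx, rv in enumerate(self.vectors)]
            # Highest similarity first; ties keep reference order.
            scored.sort(key=lambda pair: (-pair[0], pair[1]))
            results.append([(self.reference[idx], round(score, 3))
                            for score, idx in scored[:top_n]])
        return results
```

Fitting happens once (the monthly step), and `transform` can then be called repeatedly with new notes (the daily step), returning one ranked list of up to `top_n` candidates per input note.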