Fit/Transform does not give top n matches #81

Open
shaluchiipi opened this issue Nov 27, 2024 · 7 comments

@shaluchiipi

shaluchiipi commented Nov 27, 2024

I was using get_matches() to get the top 5 matches. Now that I am moving to production, I would like to use fit/transform, but it seems to return only the top match for each item. Is there a way to get the top 5 matches with fit/transform?

I am matching current text notes (non-semantic long text) against historical ones. The historical data will be large, in the lakhs (hundreds of thousands of rows). To make the code more efficient, I plan to pass the historical text notes to fit and the current text notes to transform, and to retrain monthly.

Sample Current/Input Data:

| ID | Alloc_No | Text_Notes |
|---|---|---|
| 2354657 | 78 | RHJ…..//32456hjfg//vkcmEGHJJJYMM |
| 4354657 | 35 | TFHGDVASFHC4636587//5748UJKNM |
| 345676 | 889 | WUSERHIFKDJVN//23475//IUOSJDFGKV |
| 34747586 | 57 | YWEIHFDSK//2435467//WEKSFDHLV |
| 465768 | 3777 | 324TYVHBJN//435465//HUJNKHJKN |

Sample Reference/Historical Data:

| ID | Alloc_No | Text_Notes |
|---|---|---|
| 2354657 | 78 | RHJ…..//32456hjfg//vkcmEGHJJJYMM |
| 4354657 | 35 | TFHGDVASFHC4636587//5748UJKNM |
| 345676 | 889 | WUSERHIFKDJVN//23475//IUOSJDFGKV |
| 34747586 | 57 | YWEIHFDSK//2435467//WEKSFDHLV |
| 465768 | 3777 | 324TYVHBJN//435465//HUJNKHJKN |
| 2354657 | 78 | RHJ…..//32456hjfg//vkcmEGHJJJYMM |
| 4354657 | 35 | TFHGDVASFHC4636587//5748UJKNM |
| 345676 | 889 | WUSERHIFKDJVN//23475//IUOSJDFGKV |
| 34747586 | 57 | YWEIHFDSK//2435467//WEKSFDHLV |
| 465768 | 3777 | 324TYVHBJN//435465//HUJNKHJKN |

Sample Code:

Old Code using get_matches():

# Passing reference historical text notes to "to_list"
to_list = Ref.Text_Notes.to_list()

dict2 = {}
for i in range(0,Input.shape[0]):
    # Passing the new text notes one by one to get the similarity score for all reference items, then take the top 5
    from_list=[]
    from_list.append(Input.Text_Notes[i])
    #print(to_list)
    model = PolyFuzz("TF-IDF").match(from_list, to_list)
    matches=model.get_matches().sort_values(by='Similarity',ascending=False)
    matches1=pd.merge(matches,Ref,left_index=True, right_index=True)
    dict1=matches1[['ID','Similarity','Alloc_No','From','To']].to_dict('index')
    list1=list(dict1.items())[:5]
    dict2.update({Input['ID'][i]: list1})

dict2

New Changed Code for Production using Fit/Transform:

# Fit the reference historical text notes
# Frequency - monthly
from_list = Ref.Text_Notes.to_list()
model = PolyFuzz("TF-IDF")
model.fit(from_list)
model.save("TF-IDF")

# Match the new text notes
# Frequency - daily
dict2 ={}
for i in range(0,Input.shape[0]):
    to_list=[]
    to_list.append(Input.Text_Notes[i])
    model = PolyFuzz.load("TF-IDF")
    matches=model.transform(to_list)
    print(matches)
dict2

The issue is that transform does not return similarity scores for all reference items, only the top 1 match, whereas I need the top 5.

@MaartenGr
Owner

Thank you for sharing your description! Could you perhaps create a minimum reproducible example for me to try? I see some unformatted code here and there in your issue but I'm not quite sure what to run and in what order. It can be as simple as fitting on just a couple of rows just to showcase the issue.

@shaluchiipi
Author

shaluchiipi commented Dec 2, 2024

Github_Example_Input.xlsx
Github_Example_Ref.xlsx
I ran this code and copied it here for your reference. I have also attached the input and reference Excel files.

import pandas as pd
from polyfuzz import PolyFuzz

Input=pd.read_excel('Github_Example_Input.xlsx')
Ref=pd.read_excel('Github_Example_Ref.xlsx')

Old Code:

to_list = Ref.Text_Notes.to_list()
dict2 ={}
for i in range(0,Input.shape[0]):
    # Passing the new text notes one by one to get the similarity score for all reference items, then take the top 5
    from_list=[]
    from_list.append(Input.Text_Notes[i])
    #print(to_list)
    model = PolyFuzz("TF-IDF").match(from_list, to_list)
    matches=model.get_matches().sort_values(by='Similarity',ascending=False)
    matches1=pd.merge(matches,Ref,left_index=True, right_index=True)
    dict1=matches1[['ID','Similarity','Alloc_No','From','To']].to_dict('index')
    list1=list(dict1.items())[:5]
    dict2.update({Input['ID'][i]: list1})
dict2

New Code:

from_list = Ref.Text_Notes.to_list()
model = PolyFuzz("TF-IDF")
model.fit(from_list)
model.save("TF-IDF")

matches1={}
for i in range(0,Input.shape[0]):
    to_list=[]
    to_list.append(Input.Text_Notes[i])
    model = PolyFuzz.load("TF-IDF")
    matches=model.transform(to_list)
    print(matches)
    matches1.update({Input['ID'][i]:matches.values()})
matches1

@MaartenGr
Owner

@shaluchiipi Thank you for sharing the example, it is greatly appreciated!

I typically don't download and open files from sources I'm not familiar with. Could you perhaps showcase the example with just a couple of docs like so:

docs = ["my doc", "another doc", "etc."]

I'm assuming we don't need to have a large dataset for this, right?

@MaartenGr
Owner

Also, note that I updated your message to include these ``` brackets. Without them, it's not clear to me what the indentation is, where the code starts, etc. Please use them in the future when sharing code.

@shaluchiipi
Author

Input and Ref Creation

import pandas as pd
from polyfuzz import PolyFuzz

Input={'ID': {0: 2354657, 1: 4354657, 2: 345676, 3: 34747586, 4: 465768},
 'Alloc_No': {0: 78, 1: 35, 2: 889, 3: 57, 4: 3777},
 'Text_Notes': {0: 'RHJ…..//32456hjfg//vkcmEGHJJJYMM',
  1: 'TFHGDVASFHC4636587//5748UJKNM',
  2: 'WUSERHIFKDJVN//23475//IUOSJDFGKV',
  3: 'YWEIHFDSK//2435467//WEKSFDHLV',
  4: '324TYVHBJN//435465//HUJNKHJKN'}}

Input=pd.DataFrame(Input)

Ref={'ID': {0: 2354657,
  1: 4354657,
  2: 345676,
  3: 34747586,
  4: 465768,
  5: 2354657,
  6: 4354657,
  7: 345676,
  8: 34747586,
  9: 465768},
 'Alloc_No': {0: 78,
  1: 35,
  2: 889,
  3: 57,
  4: 3777,
  5: 78,
  6: 35,
  7: 889,
  8: 57,
  9: 3777},
 'Text_Notes': {0: 'RHJ…..//32456hjfg//vkcmEGHJJJYMM',
  1: 'TFHGDVASFHC4636587//5748UJKNM',
  2: 'WUSERHIFKDJVN//23475//IUOSJDFGKV',
  3: 'YWEIHFDSK//2435467//WEKSFDHLV',
  4: '324TYVHBJN//435465//HUJNKHJKN',
  5: 'RHJ…..//32456hjfg//vkcmEGHJJJYMM',
  6: 'TFHGDVASFHC4636587//5748UJKNM',
  7: 'WUSERHIFKDJVN//23475//IUOSJDFGKV',
  8: 'YWEIHFDSK//2435467//WEKSFDHLV',
  9: '324TYVHBJN//435465//HUJNKHJKN'}}

Ref=pd.DataFrame(Ref)

Old Code:

from_list = Ref.Text_Notes.to_list()
dict2 ={}
for i in range(0,Input.shape[0]):
    # Passing the new text notes one by one to get the similarity score for all reference items, then take the top 5
    to_list=[]
    to_list.append(Input.Text_Notes[i])
    #print(to_list)
    model = PolyFuzz("TF-IDF").match(from_list, to_list)
    matches=model.get_matches().sort_values(by='Similarity',ascending=False)
    matches1=pd.merge(matches,Ref,left_index=True, right_index=True)
    dict1=matches1[['ID','Similarity','Alloc_No','From','To']].to_dict('index')
    list1=list(dict1.items())[:5]
    dict2.update({Input['ID'][i]: list1})
dict2

#Converting dictionary to Result dataframe with top 5 matches

Result=pd.DataFrame()
Result['Input_ID']=dict2.keys()
for i in range(0,Result.shape[0]):
    Int_dict1=list(dict2.values())[i]
    for j in range(0,len(Int_dict1)):
        Int_dict2=list(dict2.values())[i][j][1]
        Result.loc[i,'Match'+str(j)+'_Ref_ID']=list(Int_dict2.values())[0]
        Result.loc[i,'Match'+str(j)+'_Similarity']=list(Int_dict2.values())[1]
        Result.loc[i,'Match'+str(j)+'_Alloc']=list(Int_dict2.values())[2]
        Result.loc[i,'Match'+str(j)+'_From']=list(Int_dict2.values())[3]
        Result.loc[i,'Match'+str(j)+'_To']=list(Int_dict2.values())[4]
Result
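As a side note, the reshaping step above can also be sketched by looking fields up by key rather than by position, which avoids depending on the order of the dictionary values. This is only an illustrative sketch: `flatten_matches` is a hypothetical helper name, the similarity scores in the example are made up to show the shape of the data, and the input is assumed to have the same structure as `dict2`.

```python
def flatten_matches(matches_by_input):
    """Turn {input_id: [(ref_index, match_dict), ...]} into one wide row per input."""
    rows = []
    for input_id, ranked in matches_by_input.items():
        row = {"Input_ID": input_id}
        # rank 0 is the best match, rank 1 the second best, and so on
        for rank, (_ref_index, match) in enumerate(ranked):
            row[f"Match{rank}_Ref_ID"] = match["ID"]
            row[f"Match{rank}_Similarity"] = match["Similarity"]
            row[f"Match{rank}_Alloc"] = match["Alloc_No"]
            row[f"Match{rank}_From"] = match["From"]
            row[f"Match{rank}_To"] = match["To"]
        rows.append(row)
    return rows


# Hypothetical scores; the IDs and notes are taken from the sample data above.
note = "TFHGDVASFHC4636587//5748UJKNM"
example = {
    4354657: [
        (1, {"ID": 4354657, "Similarity": 1.0, "Alloc_No": 35, "From": note, "To": note}),
        (6, {"ID": 4354657, "Similarity": 1.0, "Alloc_No": 35, "From": note, "To": note}),
    ],
}
rows = flatten_matches(example)
print(rows[0])
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)` to get the same wide layout as `Result`.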

New Code:

from_list = Ref.Text_Notes.to_list()
model = PolyFuzz("TF-IDF")
model.fit(from_list)
model.save("TF-IDF")

matches1={}
for i in range(0,Input.shape[0]):
    to_list=[]
    to_list.append(Input.Text_Notes[i])
    model = PolyFuzz.load("TF-IDF")
    matches=model.transform(to_list)
    print(matches)
    matches1.update({Input['ID'][i]:matches.values()})
matches1

@shaluchiipi
Author

Is there any way to get the top 5 matches in transform?

@MaartenGr
Owner

Sorry for the late reply, I have been sick for the last week. I just checked the code and I believe it should work if you change this:

model = PolyFuzz("TF-IDF")

to this:

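from polyfuzz.models import TFIDF

# top_n sets how many matches are returned per item (use 5 for the top 5)
tfidf = TFIDF(min_similarity=0, top_n=3)
model = PolyFuzz(tfidf)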
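For intuition about what these two parameters control, here is a minimal standard-library sketch of the same idea: fit once on the reference notes, then rank them for each new note. It is not PolyFuzz's implementation. `TinyNgramMatcher` and `char_ngrams` are hypothetical names, and the trigram size and IDF smoothing are assumptions made for the illustration. Only the parameter names `top_n` and `min_similarity` mirror the snippet above.

```python
import heapq
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a string into overlapping, lowercased character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]


class TinyNgramMatcher:
    """Illustrative only: fit once on reference strings, then rank them per query."""

    def fit(self, reference):
        self.reference = list(reference)
        doc_freq = Counter(g for text in self.reference for g in set(char_ngrams(text)))
        total = len(self.reference)
        # Smoothed inverse document frequency for every n-gram seen during fit.
        self.idf = {g: math.log((1 + total) / (1 + df)) + 1 for g, df in doc_freq.items()}
        self.vectors = [self._vectorize(text) for text in self.reference]
        return self

    def _vectorize(self, text):
        # n-grams that were not seen during fit are ignored.
        counts = Counter(g for g in char_ngrams(text) if g in self.idf)
        weights = {g: c * self.idf[g] for g, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        return {g: w / norm for g, w in weights.items()} if norm else {}

    def transform(self, queries, top_n=1, min_similarity=0.0):
        results = {}
        for query in queries:
            q = self._vectorize(query)
            scored = []
            for ref_text, ref_vec in zip(self.reference, self.vectors):
                # Cosine similarity of two unit-length sparse vectors.
                score = sum(w * ref_vec.get(g, 0.0) for g, w in q.items())
                if score >= min_similarity:
                    scored.append((score, ref_text))
            # Keep only the top_n highest-scoring candidates, best first.
            results[query] = heapq.nlargest(top_n, scored)
        return results


reference = [
    "TFHGDVASFHC4636587//5748UJKNM",
    "WUSERHIFKDJVN//23475//IUOSJDFGKV",
    "YWEIHFDSK//2435467//WEKSFDHLV",
    "324TYVHBJN//435465//HUJNKHJKN",
]
query = "YWEIHFDSK//2435467//WEKSFDHLV"

matcher = TinyNgramMatcher().fit(reference)
for score, text in matcher.transform([query], top_n=3)[query]:
    print(round(score, 3), text)
```

With a high threshold, weak candidates are dropped before ranking, so fewer than `top_n` results can come back. Setting the threshold to 0 keeps every candidate in the running, so `top_n` alone decides how many are returned.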
