I understand that to calculate the normalized weights for the embeddings, we divide by the Laplace-smoothed cluster sizes, as seen in the code here.
However, for embeddings whose cluster sizes are zero, Laplace smoothing replaces the count with a very small value (on the order of epsilon). When these smoothed cluster sizes are then used to normalize and update the embeddings (by dividing the running ema_dw by the smoothed cluster size), the embeddings with zero cluster sizes are updated to very large values. These updated embeddings then have an even lower probability of being chosen in the future.
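For concreteness, here is a minimal numerical sketch of the update I mean, assuming the usual EMA formulation with Laplace smoothing; the variable names (ema_cluster_size, ema_dw, epsilon) are my own shorthand and may not match this repo exactly:

```python
# Sketch of the EMA codebook update with Laplace-smoothed cluster sizes.
# Assumed formulation, not necessarily the exact code in this repo.
import torch

num_embeddings, dim = 4, 2
epsilon = 1e-5

ema_cluster_size = torch.tensor([10.0, 5.0, 1.0, 0.0])   # last code is "dead" (zero count)
ema_dw = torch.randn(num_embeddings, dim)                 # running sum of assigned encoder outputs

# Laplace smoothing: a zero count becomes roughly n * epsilon / (n + K * epsilon),
# i.e. a value on the order of epsilon
n = ema_cluster_size.sum()
smoothed = (ema_cluster_size + epsilon) / (n + num_embeddings * epsilon) * n

# Normalization step referred to above: divide the running ema_dw by the smoothed counts
embeddings = ema_dw / smoothed.unsqueeze(1)

print(smoothed)       # smoothed count for the dead code is tiny
print(embeddings[3])  # its embedding is ema_dw[3] scaled by the reciprocal of that tiny count
```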
Is my understanding of this issue correct, or am I missing something? If it is correct, is there a way to mitigate it and obtain a higher perplexity score?
Thanks!