The Text Similarity API computes surface similarity between two pieces of text (long or short) using three well-known measures: Jaccard, Dice, and Cosine. Determining similarity between texts is crucial to many applications, such as clustering, duplicate removal, merging similar topics or themes, and text retrieval. Let's say we have the following two product listings on eBay:
"text1": "iphone 4s black new", "text2": "iphone 4s black old"
How can you tell that these two listings are almost the same? You can use text similarity measures for this. The results from the Text Similarity API show how close these two texts are under the different measures:
{ "cosine": "0.750", "jaccard": "0.600", "dice": "0.750", "average":"0.700" }
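The API's internal implementation is not described here, but the scores above are consistent with the standard set-based definitions of the three measures applied to whitespace-separated tokens. The following is a minimal sketch under that assumption; the class and method names are illustrative and not part of the API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch: set-based Jaccard, Dice and Cosine over whitespace tokens.
// This is an assumption about how such scores can be computed, not the API's code.
class SurfaceSimilarity {

    // Split on whitespace and keep the distinct tokens.
    static Set<String> tokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new HashSet<>();
        }
        return new HashSet<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Number of tokens the two sets have in common.
    static int overlap(Set<String> a, Set<String> b) {
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return shared.size();
    }

    // |A and B| / |A or B|
    static double jaccard(Set<String> a, Set<String> b) {
        int shared = overlap(a, b);
        int union = a.size() + b.size() - shared;
        return union == 0 ? 0.0 : (double) shared / union;
    }

    // 2 * |A and B| / (|A| + |B|)
    static double dice(Set<String> a, Set<String> b) {
        int total = a.size() + b.size();
        return total == 0 ? 0.0 : 2.0 * overlap(a, b) / total;
    }

    // |A and B| / sqrt(|A| * |B|)
    static double cosine(Set<String> a, Set<String> b) {
        double norm = Math.sqrt((double) a.size() * b.size());
        return norm == 0.0 ? 0.0 : overlap(a, b) / norm;
    }

    public static void main(String[] args) {
        Set<String> a = tokens("iphone 4s black new");
        Set<String> b = tokens("iphone 4s black old");
        double c = cosine(a, b), j = jaccard(a, b), d = dice(a, b);
        // prints: cosine=0.750 jaccard=0.600 dice=0.750 average=0.700
        System.out.println(String.format(Locale.ROOT,
                "cosine=%.3f jaccard=%.3f dice=%.3f average=%.3f", c, j, d, (c + j + d) / 3));
    }
}
```

The two listings share 3 of their 4 tokens each, giving a union of 5 tokens, which is where 3/5 = 0.600 for Jaccard and 6/8 = 0.750 for Dice come from.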
In text mining applications, you can heuristically set a similarity threshold: if the similarity score between two pieces of text is greater than a chosen value, say 0.5, you treat the two units as similar. Threshold levels depend on the needs of the application. Here are some recommendations:
- For strict similarity, use a threshold of 0.5 and above
- For a more liberal similarity, use a threshold lower than 0.5
- In some cases, you can avoid thresholds by ranking texts by similarity scores and using only the top N most similar texts.
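Both strategies in the list above can be sketched in a few lines once you have a score per candidate text. The sketch below assumes the scores have already been obtained (for example, the "average" field of each response) and stored in a map; the class name, method names, and all scores except the 0.700 from the example above are made-up placeholders.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch: selecting similar texts by threshold or by top-N rank.
class SimilaritySelection {

    // Keep the candidates whose score is at least the threshold.
    static List<String> aboveThreshold(Map<String, Double> scores, double threshold) {
        return scores.entrySet().stream()
                .filter(e -> e.getValue() >= threshold)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Rank the candidates by descending score and keep the first n.
    static List<String> topN(Map<String, Double> scores, int n) {
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Scores against the query "iphone 4s black new".
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("iphone 4s black old", 0.700);   // from the example response
        scores.put("samsung galaxy s3 white", 0.0); // placeholder value
        scores.put("iphone 4s white new", 0.650);   // placeholder value

        System.out.println(aboveThreshold(scores, 0.5));
        System.out.println(topN(scores, 1));
    }
}
```

Ranking avoids having to pick a cut-off at all, at the cost of always returning N texts even when none of them is a close match.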
To use this API, you set 3 parameters:
- text1: your first unit of text or text tokens
- text2: your second unit of text or text tokens
- clean: whether to clean your text before the similarity computation
You can have fairly lengthy units of text (e.g. two plain text documents), but the maximum payload size is 1MB per request. The text that you provide can be plain words, words with Part of Speech (POS) annotations (e.g. the/dt cow/nn jumps/vb), or combined tokens such as n-grams (e.g. this_cat cat_is is_cute).
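If you want to send n-grams, you need to build the combined tokens yourself before calling the API. The following sketch shows one way to turn plain text into underscore-joined word n-grams like the ones in the example above; the class and method names are illustrative and not part of the API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: build underscore-joined word n-grams from plain text.
class NGramTokens {

    // Slide a window of n words over the text and join each window with "_".
    static String toNGrams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        String[] words = trimmed.split("\\s+");
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            grams.add(String.join("_", Arrays.copyOfRange(words, i, i + n)));
        }
        return String.join(" ", grams);
    }

    public static void main(String[] args) {
        // prints: this_cat cat_is is_cute
        System.out.println(toNGrams("this cat is cute", 2));
    }
}
```

Because n-grams keep word order, two texts only overlap where they share the same word sequences, which makes the resulting scores stricter than scores over single words.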
Before you start, please ensure that you have a valid API key.
The TextSimilarity endpoint accepts a JSON request via POST. It takes in 3 parameters:

| Parameter name | Type | Required? | Description |
|---|---|---|---|
| text1 | text | Yes | first text |
| text2 | text | Yes | second text |
| clean | text | No (Default=true) | lowercase, remove punctuation and numbers? |

Points to note:
- There is no maximum length for the text, but there is a 1MB maximum payload per request.
- The text can be in any language.
- The text that you provide can be:
  - plain text (e.g. the cow jumps over the moon)
  - text with POS annotations (e.g. the/dt cow/nn jumps/vb)
  - manipulated text such as n-grams (e.g. this_cat cat_is is_cute)
- Since this is a JSON request, your text has to be properly escaped and encoded in UTF-8.
Requests can be sent from any programming language, as long as they follow the expected JSON format. The Unirest library handles HTTP requests and responses in several languages, including Java, Python, Ruby, Node.js, PHP and more. Here is an example using the Java Unirest library:
// These code snippets use an open-source library. http://unirest.io/java
import com.mashape.unirest.http.HttpResponse;
import com.mashape.unirest.http.JsonNode;
import com.mashape.unirest.http.Unirest;

HttpResponse<JsonNode> response = Unirest.post("https://rxnlp-core.p.mashape.com/computeSimilarity")
.header("X-Mashape-Key", "<your_api_key>")
.header("Content-Type", "application/json")
.header("Accept", "application/json")
// JSON requires double-quoted keys and string values
.body("{\"text1\":\"this is test 1\",\"text2\":\"this is test 2!\",\"clean\":\"false\"}")
.asJson();
- 'text1' and 'text2' are the two texts that you want to compare; both are mandatory.
- 'clean' indicates whether your text should be cleaned before the similarity is computed; it is optional.
- The Content-Type header with application/json is mandatory; it indicates the type of request being sent.
- X-Mashape-Key is mandatory; it is the key that grants you access to the API.
Here is a simple wrapper for the text similarity API in Java using HttpURLConnection
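The original wrapper is not reproduced in this text. As a rough idea of what such a wrapper could look like, here is a minimal sketch that uses only the JDK. The endpoint URL and header names are taken from the Unirest example above; the class name, method names, and the hand-rolled JSON escaping are assumptions for illustration. It needs Java 9 or later for readAllBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Illustrative sketch of a wrapper around the endpoint, not the original wrapper.
class TextSimilarityClient {

    static final String ENDPOINT = "https://rxnlp-core.p.mashape.com/computeSimilarity";

    // Escape a value so that it is safe inside a JSON string literal.
    static String escapeJson(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become 4-digit hex escapes.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    // Build the request body with the three documented parameters.
    static String buildBody(String text1, String text2, boolean clean) {
        return "{\"text1\":\"" + escapeJson(text1)
                + "\",\"text2\":\"" + escapeJson(text2)
                + "\",\"clean\":\"" + clean + "\"}";
    }

    // POST the body as UTF-8 JSON and return the raw JSON response.
    // getInputStream() throws an IOException for non-success status codes.
    static String computeSimilarity(String apiKey, String text1, String text2, boolean clean)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(ENDPOINT).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("X-Mashape-Key", apiKey);
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setRequestProperty("Accept", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(buildBody(text1, text2, clean).getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream is = conn.getInputStream()) {
            return new String(is.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```

A call would then look like `TextSimilarityClient.computeSimilarity("<your_api_key>", "this is test 1", "this is test 2!", false)`. In a real project a JSON library would normally replace the manual escaping.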
Text Similarity returns a JSON response containing the Cosine, Jaccard, and Dice similarity scores along with the average of these 3 scores. Here are two example requests with their responses:
Request:
{ "text1":"this is test 2", "text2":"this is test 2!", "clean":"true" }
Response:
{ "cosine": "1.000", "jaccard": "1.000", "dice": "1.000", "average": "1.000" }
Request:
{ "text1":"this is test 2", "text2":"this is test 2!", "clean":"false" }
Response:
{ "cosine": "0.750", "jaccard": "0.600", "dice": "0.750", "average": "0.700" }
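The difference between the two responses comes down to the cleaning step. According to the parameter description, cleaning lowercases the text and removes punctuation and numbers, so "this is test 2" and "this is test 2!" both reduce to the same text and score 1.000, whereas without cleaning the tokens "2" and "2!" do not match. The sketch below is an approximation of such a cleaning step for ASCII punctuation and digits; the API's exact rules are not documented here, and the class and method names are illustrative.

```java
import java.util.Locale;

// Illustrative sketch of a cleaning step: lowercase, drop punctuation and digits.
// The exact rules applied by the API are an assumption.
class CleanSketch {

    static String clean(String text) {
        return text.toLowerCase(Locale.ROOT)
                .replaceAll("[\\p{Punct}\\p{Digit}]", "") // ASCII punctuation and digits only
                .trim()
                .replaceAll("\\s+", " ");                 // collapse leftover whitespace
    }

    public static void main(String[] args) {
        // Both lines print: this is test
        System.out.println(clean("this is test 2"));
        System.out.println(clean("this is test 2!"));
    }
}
```

Note that this sketch would also strip the "/" and "_" characters from annotated tokens such as the/dt or this_cat, so for that kind of input you would probably want cleaning turned off; how the API itself treats those characters is not stated here.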
Since you have access to different similarity measures, you can rely on a single measure throughout or use all of them at once. You can also use the average score.
If you have very short texts and want a strict measure that ensures only phrases that are very similar get high scores, then Jaccard would be ideal. However, if your text is more than 5 words long, Cosine or Dice may be more appropriate, since these measures tend not to over-penalize non-overlapping terms. You can also average all three scores. In any case, please do some experimentation before you decide which measure(s) to use.
There are several ways to improve similarity, meaning finding more overlaps between the texts. Here are some ideas to make the similarity measures more reliable:
- Stem the text units before computing similarity
- Remove determiners (e.g. the, an, a) [see list]
- Remove stop words [full stop word list] [minimal stop list] [stop words in other languages]
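As an example of this kind of preprocessing, the sketch below removes determiners from a text before it is sent to the API. The three-word determiner set comes from the list above; a fuller stop word list could be passed in the same way. The class and method names are illustrative, and stemming is left out because it usually relies on a separate library. It needs Java 9 or later for Set.of.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch: drop determiners or other stop words before computing similarity.
class StopWordFilter {

    static final Set<String> DETERMINERS = Set.of("the", "an", "a");

    // Remove every whitespace-separated word that appears in stopWords (case-insensitive).
    static String removeWords(String text, Set<String> stopWords) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        return Arrays.stream(trimmed.split("\\s+"))
                .filter(w -> !stopWords.contains(w.toLowerCase(Locale.ROOT)))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // prints: cow jumps over moon
        System.out.println(removeWords("the cow jumps over the moon", DETERMINERS));
    }
}
```

Removing such words keeps two texts from looking similar merely because they share very common words.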
Text Similarity is language-neutral and would thus work for all languages.