This is a command-line program that performs plagiarism detection using an N-tuple comparison algorithm that allows for synonyms in the text. The program takes three required arguments and one optional argument, listed below. If it is invoked any other way, for example with no arguments, it prints usage instructions.
- file name for a list of synonyms
- input file 1
- input file 2
- (optional) the number N, the tuple size. If not supplied, N defaults to 3.
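
The argument list above could be handled roughly as in the following sketch. It uses Python's standard `argparse` module; the function name `parse_args` and the argument names `synonyms`, `file1`, `file2`, and `n` are assumptions made for illustration, not names taken from the program itself.

```python
import argparse


def parse_args(argv=None):
    """Parse the three required arguments and the optional tuple size."""
    parser = argparse.ArgumentParser(
        description="Detect plagiarism by comparing N-tuples, allowing for synonyms."
    )
    parser.add_argument("synonyms", help="file name for a list of synonyms")
    parser.add_argument("file1", help="input file 1")
    parser.add_argument("file2", help="input file 2")
    # nargs="?" makes the positional optional; default=3 matches the spec.
    parser.add_argument("n", nargs="?", type=int, default=3,
                        help="tuple size N (default: 3)")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(args.synonyms, args.file1, args.file2, args.n)
```

When required arguments are missing, `argparse` prints a usage message and exits, which covers the no-arguments case in this sketch.
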
Each line of the synonym file contains one group of synonyms. For example, the line "run sprint jog" means these three words should be treated as equal.
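
One possible way to represent the synonym groups is a dictionary that maps every word to a single representative of its group, so that synonyms compare equal after lookup. The sketch below is only an illustration: the name `load_synonyms`, the choice of the first word on each line as the representative, and the lowercasing of words are all assumptions that the description above does not specify.

```python
def load_synonyms(lines):
    """Map every word in a synonym group to the group's first word."""
    canonical = {}
    for line in lines:
        words = line.lower().split()  # assumption: matching is case-insensitive
        if not words:
            continue  # skip blank lines
        for word in words:
            # Assumption: if a word appears in several groups, the first group wins.
            canonical.setdefault(word, words[0])
    return canonical


synonyms = load_synonyms(["run sprint jog"])
print(synonyms)  # {'run': 'run', 'sprint': 'run', 'jog': 'run'}
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly.
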
The input files are declared plagiarized based on the number of N-tuples in file1 that appear in file2, where the tuples are compared by accounting for synonyms as described above. For example, the text "go for a run" has two 3-tuples, ["go for a", "for a run"], both of which appear in the text "go for a jog" once "jog" is treated as equal to "run".
The output of the program is the percentage of tuples in file1 that appear in file2. For the example above, the output would be one line saying "100%". In another example, for the texts "go for a run" and "went for a jog" with N=3, the output would be "50%" because only one of the two 3-tuples in the first text ("for a run") appears in the second.
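
The comparison and the percentage output can be sketched as follows. This is a minimal illustration under several assumptions that the description leaves open: words are split on whitespace and lowercased, every tuple occurrence in the first text is counted, the result is rounded to the nearest whole percent, and a text with fewer than N words yields "0%". The names `ngrams` and `plagiarism_percent`, and the `canonical` dictionary mapping each synonym to one representative word, are hypothetical.

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def plagiarism_percent(text1, text2, canonical, n=3):
    """Return the percentage of n-tuples in text1 that also occur in text2."""
    def normalize(text):
        # Replace each word by its synonym-group representative, if it has one.
        return [canonical.get(word, word) for word in text.lower().split()]

    tuples1 = ngrams(normalize(text1), n)
    tuples2 = set(ngrams(normalize(text2), n))
    if not tuples1:
        return "0%"  # assumption: nothing to compare counts as no match
    matched = sum(1 for t in tuples1 if t in tuples2)
    return f"{round(100 * matched / len(tuples1))}%"


canonical = {"run": "run", "sprint": "run", "jog": "run"}
print(plagiarism_percent("go for a run", "go for a jog", canonical))    # 100%
print(plagiarism_percent("go for a run", "went for a jog", canonical))  # 50%
```

Storing the second text's tuples in a set keeps each membership check constant time on average, so the comparison stays linear in the length of the two texts.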