CJuicer beats humans (and Unix's diff)

Maybe you remember my previous post about detecting copied assignments? Well, now I can say it succeeded. CJuicer is a flex script that generates a lexical analyzer with a rudimentary parser for C code; it outputs a PostScript file with the "logical tree" of loops, function calls and conditionals. Identical trees mean a copied assignment (unless the code is very simple, in which case almost everyone writes the same thing), and renaming variables does not change the tree. That is how it can beat diff.
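To make the idea concrete, here is a minimal sketch in plain C of why renamed variables do not matter. It is not CJuicer's code (which is a flex script) and it does not use CJuicer's numbering: the function names and the one-letter codes below are made up for illustration. It keeps only control keywords and braces, and it assumes the input has no comments or string literals containing keywords.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical one-letter codes for control keywords.
   These are NOT the numeric codes CJuicer prints. */
static char keyword_code(const char *word, size_t len)
{
    static const struct { const char *name; char code; } table[] = {
        { "for", 'F' }, { "while", 'W' }, { "do", 'D' },
        { "if", 'I' }, { "else", 'E' }, { "switch", 'S' },
    };
    for (size_t k = 0; k < sizeof table / sizeof table[0]; k++) {
        if (strlen(table[k].name) == len &&
            strncmp(table[k].name, word, len) == 0)
            return table[k].code;
    }
    return 0; /* any other identifier: ignored */
}

/* Writes the control-flow skeleton of src into out.
   The result is NUL-terminated and truncated to fit out_size. */
void structure_signature(const char *src, char *out, size_t out_size)
{
    size_t n = 0;
    if (out_size == 0)
        return;
    while (*src != '\0' && n + 1 < out_size) {
        unsigned char ch = (unsigned char)*src;
        if (isalpha(ch) || ch == '_') {
            /* Read a whole identifier or keyword. */
            const char *start = src;
            while (isalnum((unsigned char)*src) || *src == '_')
                src++;
            char code = keyword_code(start, (size_t)(src - start));
            if (code != 0)
                out[n++] = code;
        } else if (ch == '{' || ch == '}') {
            out[n++] = (char)ch; /* braces carry the nesting */
            src++;
        } else {
            src++; /* operators, digits, whitespace: skipped */
        }
    }
    out[n] = '\0';
}
```

Two loops that differ only in their variable names produce the same skeleton with this sketch, while diff would flag every line that mentions a renamed variable.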

After reviewing a lot of images (well, 34) I detected 5 copies. One was literal (i.e. diff-detectable), but 2 were harder to spot, as the variable names had been changed in otherwise identical code.


The graphs looked identical, and after examining the code, the copy was clear.

Student A
int i,j;
double *xt,*yt, *hpaso, *s, *c;

Student B
int i,j;
double *xt,*yt, *hmi, *s, *c;

One was using a tailored function to allocate memory (something I always frown upon), and the other was using the usual malloc/calloc. Besides this, different comments (in a quite broad sense... they looked very similar) and slightly different variable names kill diff, but they don't fool the graphical analyzer.

For easier inspection I modified the Juicer code to output a string of numbers together with the PostScript file. This is closer to what I was actually doing when comparing the trees: instead of looking for 3 bumps at the start, I can look for 343434. Easier.

Student A: 7 7 7 7 7 7 7 7 7 7 57 67 7 11227 7 7 7 7 7 7 7 7 7 7 7 157 7 6257 657 61212567 7 7 7 7
Student B: 7 7 7 7 7 7 7 7 7 7 57 67 7 11227 7 7 7 7 7 7 7 7 7 7 7 157 7 6257 657 61212567 7 7 7 7
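Comparing two of these number strings by eye still gets tiring, so here is a small sketch of how the check could be automated. It is only an illustration: `first_difference` and `next_token` are hypothetical helpers, not part of CJuicer. It treats each string as whitespace-separated tokens and reports where two strings first disagree.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Skips leading whitespace, stores the token length in *len
   and returns a pointer to the start of the token. */
static const char *next_token(const char *s, size_t *len)
{
    while (isspace((unsigned char)*s))
        s++;
    const char *start = s;
    while (*s != '\0' && !isspace((unsigned char)*s))
        s++;
    *len = (size_t)(s - start);
    return start;
}

/* Returns the 0-based index of the first token where a and b
   differ, or -1 if both strings hold the same token sequence. */
int first_difference(const char *a, const char *b)
{
    for (int index = 0; ; index++) {
        size_t len_a, len_b;
        a = next_token(a, &len_a);
        b = next_token(b, &len_b);
        if (len_a == 0 && len_b == 0)
            return -1; /* both strings exhausted together */
        if (len_a != len_b || strncmp(a, b, len_a) != 0)
            return index;
        a += len_a;
        b += len_b;
    }
}
```

A result of -1 for two different students is the cue to go and read both sources side by side; any other value tells you which token to look at first.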

If you would like to use it, or just have a look, check my Google Code page: you can find the source under the CJuicer link. Keep in mind that if you use it to check for copies, you should use exactly the same revision throughout, as different revisions may give different results. I'm still solving some problems and bugs. Keep in touch!

Written by Ruben Berenguel