There is existing research on this topic. The best short-and-sweet summary I have found is on Y! Answers. From the tools mentioned there, I chose the SIM tool by Dick Grune. A DOS binary is available there, but the trick was to select the command-line parameters that work best for comparing two HTML files. I have found that the following combination gives the most to-the-point answer: sim_text.exe -nT -r 100 job1.html job2.html. It reports only relatively long common sequences (a minimum run of 100 units, set by -r; I took these to be characters, although as far as I can tell the SIM documentation counts runs for sim_text in words), and if it does report any, you had better check whether the two files correspond to the same job opening.
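To illustrate the idea of reporting only long common runs, here is a minimal Python sketch. It is not SIM's algorithm: it swaps in the standard library's difflib.SequenceMatcher, and it assumes whitespace-separated words as the comparison unit. The function name long_common_runs and the min_run parameter are made up for this example.

```python
from difflib import SequenceMatcher


def long_common_runs(text_a, text_b, min_run=100):
    """Return (pos_a, pos_b, length) for each common run of at least min_run words.

    Positions and lengths are counted in whitespace-separated words.
    This is an illustration only, not a reimplementation of SIM.
    """
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False turns off difflib's heuristic that ignores very
    # frequent tokens in long sequences.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size >= min_run
    ]


# Hypothetical usage with the two files from the command above:
#   with open("job1.html") as f1, open("job2.html") as f2:
#       runs = long_common_runs(f1.read(), f2.read())
#   A non-empty result means the two postings deserve a closer look.
```

An empty list means no shared run reached the threshold. difflib can be slow on large inputs, so this is better suited to a quick check than to comparing many files.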
I also tried stripping the HTML tags from these files using the Lynx browser with the -dump option. After some digging, I was able to find DOS binaries for Lynx here. I also had to create the following lynx.bat:
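@ECHO OFF
REM Point Lynx at its working directories and configuration file.
set home=c:\bin\lynx\temp
set temp=c:\bin\lynx\temp
set lynx_cfg=c:\bin\lynx\lynx.cfg
set lynx_save_space=c:\bin\lynx\temp
REM Pass up to five command-line arguments through to Lynx.
c:\bin\lynx\lynx.exe %1 %2 %3 %4 %5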
The results were not convincing. For some reason, the similarity between files was less obvious with the stripped files than with the original HTML files. I am not sure why; I have not analyzed this further.
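For anyone who wants to repeat the tag-stripping experiment without Lynx, here is a rough Python sketch using the standard library's html.parser. It is a stand-in, not an equivalent of lynx -dump: it keeps only the visible text, skips script and style contents, and does none of Lynx's page layout or link numbering. The names TextExtractor and strip_tags are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, ignoring tags and script/style bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.parts.append(text)


def strip_tags(html):
    """Return the text of an HTML document as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

The stripped text of each file could then be saved and compared in the same way as the original HTML files.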