Tuesday, December 30, 2008

How Similar Are Two Text Files?

Suppose you are job hunting (as I am right now), and after applying for quite a few jobs you find you are losing track of the ones you have already applied for. The company name is often not shown, since most recruiting/staffing agencies don't reveal it in their job postings, and the job titles are often very similar. After asking a recruiting company 2 or 3 times about a job you have already been submitted for by another staffing agency (and after wasting your time on a duplicate application process), you just wish you were able to quickly detect that the job description you are looking at is very similar to one you applied to 3 weeks earlier. What do you do? Using diff isn't very useful, since the job descriptions are often slightly modified by the staffing agency before posting. You want to search for similar text rather than identical text.

It turns out there is research on this topic. The best short-and-sweet summary I have found is on Y! Answers. From the tools mentioned there, I chose the SIM tool by Dick Grune. A DOS binary is available there, but the trick was to select the command-line parameters that work best for comparing two HTML files. I have found that the following combination gives the most to-the-point answer: sim_text.exe -nT -r 100 job1.html job2.html. It shows only relatively large common sequences (over 100 chars), and if it does show any of those, you had better check whether the two files correspond to the same job opening.
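For anyone who would rather script the same idea than run a binary, here is a minimal Python sketch. It is not how SIM works internally; it swaps in the standard library's difflib.SequenceMatcher and simply reports shared character runs above a threshold. The function name common_runs and the default of 100 characters are my own choices for illustration.

```python
from difflib import SequenceMatcher


def common_runs(text_a, text_b, min_len=100):
    """Return substrings of at least min_len characters shared by both texts.

    Illustrative helper (not part of SIM): it relies on difflib's
    matching blocks rather than SIM's own comparison algorithm.
    """
    # autojunk=False stops difflib from treating frequent characters
    # as junk, which it would otherwise do for inputs over 200 items.
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    return [
        text_a[m.a:m.a + m.size]
        for m in matcher.get_matching_blocks()
        if m.size >= min_len
    ]
```

Reading job1.html and job2.html into strings and passing them to common_runs gives a list that is empty when nothing long is shared; any entry in it is a reason to look at the two postings side by side. Character-level matching like this is slow on large files, but job descriptions are small enough that it should not matter.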

I have also tried stripping the HTML tags from these files using the Lynx browser with the -dump option. After some digging, I was able to find DOS binaries for Lynx here. I also had to create the following lynx.bat:

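@ECHO OFF
REM Set the environment variables used by Lynx, then forward up to five arguments.
set home=c:\bin\lynx\temp
set temp=c:\bin\lynx\temp
REM Location of the Lynx configuration file
set lynx_cfg=c:\bin\lynx\lynx.cfg
set lynx_save_space=c:\bin\lynx\temp
c:\bin\lynx\lynx.exe %1 %2 %3 %4 %5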

The results were not convincing. In fact, for some reason the similarity between the files was less obvious with the stripped files than with the original HTML files. I'm not sure why; I haven't analyzed this issue.
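If you want to experiment with tag stripping without Lynx, a rough Python alternative is sketched below. It uses the standard library's html.parser and does not reproduce the layout of lynx -dump; the names TextExtractor and html_to_text are made up for this example. It also collapses all whitespace into single spaces, on the assumption (which I have not tested against the problem above) that line wrapping in the stripped text could interrupt otherwise identical runs.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # greater than zero while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string as one line of words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join fragments with a space so adjacent blocks do not run together,
    # then collapse every run of whitespace into a single space.
    return " ".join(" ".join(parser.parts).split())
```

Joining fragments with a space keeps the words of neighbouring paragraphs apart, at the cost of splitting a word whose letters are wrapped in inline tags, so treat the output as input for a similarity check rather than as readable text.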
