
  • September 18, 2017

Finding similarity between text documents

I have tried using the NLTK package in Python to find the similarity between two or more text documents. One common use case is checking all the bug reports on a product to see whether two bug reports are duplicates.

A document is characterised by a vector in which the value of each dimension corresponds to the number of times that term appears in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter. For more details on cosine similarity, refer to this link.

So I downloaded a few bug reports from https://bugzilla.mozilla.org/show_bug.cgi?id=bugid

The first step is to import all the relevant packages, open a file, read all the lines, and tokenise the words. Convert the words to lower case.
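A minimal sketch of this step. The hard-coded `text` string stands in for the contents of a downloaded bug report, and a simple regex tokeniser stands in for `nltk.word_tokenize` (which needs the `punkt` data files to be downloaded first):

```python
import re

# Stand-in for the text read from a downloaded bug-report file.
text = "Firefox crashes with a core dump when loading the page."

# Lower-case the text and split it into word tokens.
tokens = re.findall(r"[a-z]+", text.lower())
print(tokens)
```

In the actual script the text would come from `open(filename).read()` for each bug report.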

Use the Porter stemmer to stem the words. Stemming is the process of reducing inflected words to their stem or root form; for example, "runs" and "running" are both reduced to the root "run".
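NLTK's `PorterStemmer` works without any extra data downloads, so this step can be sketched directly:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
words = ["runs", "running", "crashed"]
# Reduce each inflected form to its stem.
stems = [stemmer.stem(w) for w in words]
print(stems)
```

Both "runs" and "running" come out as "run", so different inflections of the same word are counted together in the next steps.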

Remove stop words like "a" and "the". In natural language processing, such low-information words are referred to as stop words. For more information on stop-word removal, refer to this link.
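A sketch of the filtering step. NLTK ships a full English stop-word list (`nltk.corpus.stopwords`, after a one-time `nltk.download('stopwords')`); a small hand-picked set stands in for it here:

```python
# Stand-in for nltk.corpus.stopwords.words('english').
stop_words = {"a", "an", "the", "is", "with", "when", "of", "to"}

tokens = ["firefox", "crashes", "with", "a", "core", "dump"]
# Keep only the tokens that are not stop words.
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
```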

Then count the occurrences of each word in the document.
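This is exactly what `collections.Counter` from the standard library does; the token list below is a hypothetical example:

```python
from collections import Counter

tokens = ["crash", "core", "dump", "crash", "firefox"]
# Map each word to the number of times it appears in the document.
counts = Counter(tokens)
print(counts["crash"])
```

The resulting count dictionary is the document's term-frequency vector: each distinct word is a dimension, its count is the value along that dimension.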


Then calculate the cosine similarity between two different bug reports.
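A self-contained sketch of the cosine-similarity computation over two word-count vectors; the two `bug1`/`bug2` snippets are made-up stand-ins for real bug-report text:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity of two word-count dictionaries."""
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

bug1 = Counter("firefox crash core dump crash".split())
bug2 = Counter("core dump seen after firefox crash".split())
print(round(cosine_similarity(bug1, bug2), 3))
```

The result ranges from 0 (no words in common) to 1 (identical term distributions), so it can be computed for every pair of bug reports and the highest-scoring pairs flagged as likely duplicates.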


Here is the output, which shows that Bug#599831 and Bug#1055525 are more similar than the rest of the pairs.


Things to improve

  • This is just a 1-gram analysis that does not take groups of words into account. For example, "core" and "dump" are read as individual words, not as the single phrase "core dump". For more information on N-grams, refer to this link.
  • Similar words with the same meaning (like "core dump" and "crash") have not been taken into account.
  • Each document (bug) is downloaded as a single text file. Ideally, different weights should be given to the bug subject and the description.
  • There are other methods for finding document similarity; K-means clustering, or a linear regression trained on known duplicate bugs, could be tried.
  • Only pairwise comparison is done. Word lists could instead be built from all the documents (e.g. all bug reports) at once.
  • For now, only English documents have been considered.
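The first improvement above, moving from single words to bigrams, can be sketched by pairing each token with its successor (`nltk.util.ngrams(tokens, 2)` does the same thing); the token list is a made-up example:

```python
tokens = ["core", "dump", "seen", "on", "startup"]
# Pair each token with the token that follows it, so that
# "core dump" survives as a single unit of comparison.
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
print(bigrams)
```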

I have blogged this in my personal blogs as well.


Join the discussion

Comments ( 2 )
  • Sriram Bhamidipati Friday, September 21, 2018
    Nice Article. do you have a git repo to clone
  • ahmed mohammed Friday, December 14, 2018
    nice job