I have tried using the NLTK package in Python to find the similarity between two or more text documents. One common use case is checking all the bug reports on a product to see whether two of them are duplicates.
A document is characterised by a vector in which the value of each dimension is the number of times the corresponding term appears in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter. For more details on cosine similarity, refer to this link.
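The term-count vectors and the cosine measure described above can be sketched in plain Python. This is a minimal, dependency-free illustration (the function name and the use of `Counter` as the vector type are my choices, not from the original post):

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    # Dot product over the terms the two documents share.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Identical documents score 1.0, and documents that share no terms score 0.0; everything else falls in between.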
So I downloaded a few bug reports from https://bugzilla.mozilla.org/show_bug.cgi?id=bugid
The first step is to import all the relevant packages. Open a file, read all its lines, tokenise the text into words, and convert the words to lower case.
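A rough sketch of this step is below. To keep it self-contained I use a simple regular expression in place of `nltk.word_tokenize` (which needs a one-time tokenizer-data download); the function name and file handling are mine, not the original author's code:

```python
import re

def tokenize_file(path: str) -> list[str]:
    """Read a file and return its lower-cased word tokens."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # A simple regex stands in for nltk.word_tokenize here; it keeps
    # runs of letters, digits, and apostrophes and drops punctuation.
    return re.findall(r"[a-z0-9']+", text.lower())
```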
Use the Porter stemmer to stem the words. Stemming is the process of reducing inflected words to their word stem or root form; for example, "runs" and "running" are both reduced to the root form "run".
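NLTK ships a Porter stemmer that needs no extra data downloads. A minimal sketch, assuming `nltk` is installed (the helper name is mine):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens: list[str]) -> list[str]:
    """Reduce each token to its Porter stem."""
    return [stemmer.stem(t) for t in tokens]
```

For example, `stem_tokens(["runs", "running"])` yields `["run", "run"]`.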
Remove stop words such as "a" and "the". In natural language processing, words that carry little useful information are referred to as stop words. For more information on stop word removal, refer to this link.
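A sketch of this filtering step follows. NLTK's full English list comes from `nltk.corpus.stopwords.words("english")`, which requires a one-time `nltk.download("stopwords")`; a small hand-picked set stands in here so the example runs without that download:

```python
# Stand-in for nltk.corpus.stopwords.words("english") -- a tiny,
# illustrative subset chosen by hand, not NLTK's actual list.
STOP_WORDS = {"a", "an", "the", "is", "on", "of", "and", "to", "in", "it"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]
```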
Then count the occurrences of each word in the document.
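One straightforward way to do this counting is with `collections.Counter` from the standard library (the token list here is a made-up example):

```python
from collections import Counter

tokens = ["crash", "startup", "crash", "window"]
# Counter maps each term to its frequency in the document,
# which is exactly the term-count vector described earlier.
counts = Counter(tokens)
```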
Finally, calculate the cosine similarity between two different bug reports.
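Putting the steps together, the comparison might look like the sketch below. The report snippets are invented for illustration, and the regex tokenizer stands in for the NLTK pipeline described above; this is not the original author's code:

```python
import math
import re
from collections import Counter

def vectorize(text: str) -> Counter:
    """Lower-case, tokenize (simple regex), and count terms."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Made-up snippets; real input would be the downloaded bug-report text.
report_a = "Browser crashes on startup after update"
report_b = "Crashes on startup when update is applied"
report_c = "Bookmark toolbar icons are misaligned"

sim_ab = cosine(vectorize(report_a), vectorize(report_b))
sim_ac = cosine(vectorize(report_a), vectorize(report_c))
# The two crash reports share several terms, so sim_ab > sim_ac.
```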
Here is the output, which shows that Bug#599831 and Bug#1055525 are more similar than the rest of the pairs.