I’m working on a little task that compares the similarity of text documents. One of the most common methods of doing this is called the Vector Space Model. In short, you map words from the documents you want to compare onto a vector that is based on the words found in all documents. Then, you find the cosine of the angle between the vectors of the documents that you want to compare. This is called the cosine measure. When the cosine measure is 0, the documents have no similarity. A value of 1 is yielded when the documents are equal.
I found an example implementation of a basic document search engine by Maciej Ceglowski, written in Perl, here. I thought I’d find the equivalent libraries in Python and code me up an implementation.
- Parse and stem the documents.
It is important, when comparing words, to compare the word stems. e.g., cat and cats should always be compared as simply cat. There are a few word stemming algorithms already available. I found an implementation of the Porter Stemming algorithm in Python here. If you want to run the attached file, you’ll need to download porter.py.
I filter the documents through a regular expression to pick out everything composed of a-z, a “-” or a single quote. I also convert the words to lower case. All the words in all the documents are added to a dictionary that keeps track of the word and the number of times it has been used. Before adding a word to the dictionary, I check a list of stop words. Words like “I”, “am”, “you”, “and” make documents appear to be more related than they really are. I found that a good list of stop words comes with the Postgresql tsearch2 full text indexing module. Maciej pointed out in the Perl implementation that it was important to check the stop words before stemming the word.
splitter=re.compile ( "[a-z\-']+", re.I )
stop_words=['i','am','the','you'] # replace with real stop words
w=word.lower() # or you could pass in lower case words to begin with
if w not in stop_words:
all_words[ws] += 1
- Reorganize the master word list
There is probably a better way to do this. Perhaps an object that keeps the keys sorted to begin with or something. As far as I know though, Python doesn’t have a native dictionary with sorted keys, so I simply create a new dictionary that contains a tuple with the index of the key and the count obtained previously.
key_idx=dict() # key-> ( position, count )
for i in range(len(keys)):
key_idx[keys[i]] = (i,all_words[keys[i]])
del keys # not necessary, but I didn't need these any longer
- Map each document onto it’s own vector
This is why you need the ordered dictionary. Each word from each document maps onto a vector that represents all the words. If you have a list of all words “apple”, “cat”, “dog”, and you have a document with the word “cat”, the resulting vector for the document would be: [0, 1, 0].
The arrays for all your documents might be really big. Fortunately, NumPy offers a way to represent sparce array data. You can create a zeroed out vector and then set the values for words individually. For this example, I just use 1 if the word is included in the document. You could instead, use the frequency of words to set values less than 1 and greater than zero for more complex query requirements (like comparing documents against a search query)
from numpy import zeros
v=zeros(len(key_idx)) # returns array([0,0,0....len(key_idx)])
for word in splitter.findall(doc):
# returns (key index, key count) or None
if keydata: v[keydata] = 1
- Use NumPy to complete the cosine measure calculation
The Cosine measure is calculated by taking the dot product of the two vectors, and then dividing by the product of the norms of the vectors
cos(A,B) = dot(A,B) / ( || A || * || B || )
To do vector math, you could implement your own routine. There is already a good linear algebra implementation for Python. Just download NumPy from www.scipy.org.
from numpy import dot
from numpy.linalg import norm
print "Similarity: %s" % float(dot(v1,v2) / (norm(v1) * norm(v2)))
I found a handly little online implementation of the cosine measure here, that helped to verify this was working correctly.
That’s it. The attached Python Cosine Measure Implementation has a compare function that takes two documents and returns the similarity value.
s=ds2.compare("I like dogs and cats", "My cat runs from dogs.")
Great post, seems very simular to diff.
The Vector Space Model and diff actually don’t have very much in common. diff outputs the differences between texts based on each line of text. Anything not output by diff is the same in the texts. This program does not output the similarities in the texts. Instead, it outputs a number that represents how similar the texts are. If the output is 0, there is no similarity. An output of 1 means they are pretty much the same documents. With the Vector Space Model, you could have an output of 1 even if the words were in a completely different order.
Thanks for linking to my calculator. I am moving the cosine similarity calculator over to http://www.appliedsoftwaredesign.com very soon. Sorry for the inconvenience.
Thanks Tyler. I’ve updated the link.
This is a neat tutorial, but out of curiosity, why do you need to set the vectors to the same length? The cosine measurement normalizes vectors, IIRC (else why would we divide by the product of the magnitudes of v1 and v2?). In fact, cosine similarity is popular precisely because it can make sense of vectors of various magnitudes.
The vector represents a map of words included in the text as compared to words included in the domain of all words included in all texts. I suppose you’d get the same result if the vectors were only long enough to include the words in each text, but you’d have to do extra programming to figure out which word was at the maximum index length and then create the vector that size. It seems much easier to just create a vector the maximum size. IIRC, there isn’t any overhead with creating a larger vector with the NumPy library.
Thanks again for linking to my online cosine similarity calculator. I have moved it (hopefully for the last time) over to:
(Feel free to delete this comment. Just wanted to give ya a heads up).
Updated, thanks for letting me know.