I'm working on a little task that compares the similarity of text documents. One of the most common methods of doing this is called the Vector Space Model. In short, you map words from the documents you want to compare onto a vector that is based on the words found in all documents. Then, you find the cosine of the angle between the vectors of the documents that you want to compare. This is called the cosine measure. When the cosine measure is 0, the documents have no similarity. A value of 1 is yielded when the documents are equal.
I found an example implementation of a basic document search engine by Maciej Ceglowski, written in Perl, here. I thought I'd find the equivalent libraries in Python and code me up an implementation.
- Parse and stem the documents.
When comparing words, it is important to compare the word stems. For example, cat and cats should both be compared as simply cat. There are a few word stemming algorithms already available. I found an implementation of the Porter Stemming algorithm in Python here. If you want to run the attached file, you'll need to download porter.py.
I filter the documents through a regular expression to pick out everything composed of a-z, a "-" or a single quote. I also convert the words to lower case. All the words in all the documents are added to a dictionary that keeps track of each word and the number of times it has been used. Before adding a word to the dictionary, I check it against a list of stop words. Words like "I", "am", "you", "and" make documents appear to be more related than they really are. I found that a good list of stop words comes with the PostgreSQL tsearch2 full text indexing module. Maciej pointed out in the Perl implementation that it is important to check the stop words before stemming the word.
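import re
from collections import defaultdict
splitter = re.compile(r"[a-z\-']+", re.I)
stop_words = ['i', 'am', 'the', 'you']  # replace with real stop words
all_words = defaultdict(int)  # stemmed word -> number of times seen
for word in splitter.findall(doc):
    w = word.lower()  # or you could pass in lower case words to begin with
    if w not in stop_words:  # check stop words before stemming
        ws = stem(w)  # stem() is a placeholder for the Porter stemmer in porter.py
        all_words[ws] += 1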
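To make the parsing step concrete, here is a minimal self-contained sketch of it. The naive_stem function is a made-up stand-in that only strips a trailing "s" (it is not the Porter algorithm), count_words is a name I picked for illustration, and the stop word list is a toy one.

```python
import re
from collections import Counter

splitter = re.compile(r"[a-z\-']+", re.I)
stop_words = {"i", "am", "the", "you", "and"}  # toy list, replace with a real one


def naive_stem(word):
    # Hypothetical stand-in for a real stemmer: strips one trailing "s".
    # This is NOT the Porter algorithm.
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def count_words(docs):
    # Count stemmed, non-stop words across all documents.
    all_words = Counter()
    for doc in docs:
        for word in splitter.findall(doc):
            w = word.lower()
            if w not in stop_words:  # check stop words before stemming
                all_words[naive_stem(w)] += 1
    return all_words


counts = count_words(["I like dogs and cats", "My cat runs from dogs."])
print(counts)
```

With a real stemmer swapped in for naive_stem, the rest of the loop should stay the same.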
- Reorganize the master word list
There is probably a better way to do this. Perhaps an object that keeps the keys sorted to begin with or something. As far as I know though, Python doesn't have a native dictionary with sorted keys, so I simply create a new dictionary that contains a tuple with the index of the key and the count obtained previously.
keys = sorted(all_words)  # every stemmed word, in alphabetical order
key_idx = dict()  # key -> (position, count)
for i in range(len(keys)):
    key_idx[keys[i]] = (i, all_words[keys[i]])
del keys  # not necessary, but I didn't need these any longer
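For what it's worth, the same index can probably be built more compactly with sorted() and enumerate(). A small sketch, with a made-up all_words dictionary standing in for the real counts:

```python
# Toy counts standing in for the all_words dictionary built earlier.
all_words = {"dog": 2, "apple": 1, "cat": 3}

# Sort the words once, then map each word to (position, count).
key_idx = {
    word: (i, all_words[word])
    for i, word in enumerate(sorted(all_words))
}

print(key_idx["cat"])
```

The position is what matters later on; the count just comes along for the ride.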
- Map each document onto its own vector
This is why you need the ordered dictionary. Each word from each document maps onto a vector that represents all the words. If you have a list of all words "apple", "cat", "dog", and you have a document with the word "cat", the resulting vector for the document would be: [0, 1, 0].
The arrays for all your documents might be really big. Fortunately, NumPy makes it easy to create a zeroed out vector and then set the values for words individually. (Strictly speaking, zeros() returns a dense array; if memory becomes a problem, a true sparse representation such as scipy.sparse would be the thing to look at.) For this example, I just use 1 if the word is included in the document. You could instead use the frequency of words to set values between 0 and 1 for more complex query requirements (like comparing documents against a search query).
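from numpy import zeros
v = zeros(len(key_idx))  # returns array([0., 0., ...]) of length len(key_idx)
for word in splitter.findall(doc):
    # returns (key index, key count) or None
    keydata = key_idx.get(stem(word.lower()), None)  # stem() is a placeholder for the Porter stemmer
    if keydata: v[keydata[0]] = 1  # index by the word's position, not the whole tuple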
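The frequency-weighted variant could look something like the sketch below. The scaling (each count divided by the largest count in the document) is just one possible choice that I am assuming here, frequency_vector is a name made up for illustration, and plain lists are used so that it runs without NumPy.

```python
from collections import Counter


def frequency_vector(words, key_idx):
    # words: already lower-cased, stemmed words of one document
    # key_idx: word -> (position, count)
    v = [0.0] * len(key_idx)
    counts = Counter(w for w in words if w in key_idx)
    if not counts:
        return v  # no known words: leave the vector zeroed
    top = max(counts.values())
    for word, n in counts.items():
        v[key_idx[word][0]] = float(n) / top  # values in (0, 1]
    return v


key_idx = {"apple": (0, 1), "cat": (1, 3), "dog": (2, 2)}
vec = frequency_vector(["cat", "cat", "dog", "bird"], key_idx)
print(vec)
```

Words that are not in key_idx (like "bird" here) are simply skipped.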
- Use NumPy to complete the cosine measure calculation
The cosine measure is calculated by taking the dot product of the two vectors and then dividing by the product of the norms of the vectors:
cos(A,B) = dot(A,B) / ( || A || * || B || )
You could implement your own routines for the vector math, but there is already a good linear algebra implementation for Python. Just download NumPy from www.scipy.org.
from numpy import dot
from numpy.linalg import norm
# v1 and v2 are the vectors built for the two documents being compared
print("Similarity: %s" % float(dot(v1, v2) / (norm(v1) * norm(v2))))
I found a handy little online implementation of the cosine measure here, which helped to verify that this was working correctly.
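Another way to double-check the NumPy result is a few lines of plain Python. This is only a sketch; returning 0.0 when either vector is all zeros is my own assumption, since the formula is undefined there.

```python
import math


def cosine(a, b):
    # dot(A, B) / (||A|| * ||B||), written out with plain lists
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an empty document as dissimilar
    return dot / (norm_a * norm_b)


print(cosine([0, 1, 0], [0, 1, 0]))  # identical vectors
print(cosine([1, 0, 0], [0, 1, 0]))  # no words in common
print(cosine([1, 1, 0], [0, 1, 1]))  # one shared word out of two each
```

Expect small floating point differences between this and the NumPy version, so compare with a tolerance and not with ==.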
That's it. The attached Python Cosine Measure Implementation has a compare function that takes two documents and returns the similarity value.
s=ds2.compare("I like dogs and cats", "My cat runs from dogs.")
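For readers without the attachment, a compare function along these lines can be sketched in a few lines. This is not the attached implementation: the stop word list is a toy one, naive_stem is a hypothetical stand-in for the Porter stemmer, and sets are used in place of NumPy vectors, which gives the same result when every vector entry is 0 or 1.

```python
import math
import re

splitter = re.compile(r"[a-z\-']+", re.I)
stop_words = {"i", "am", "the", "you", "and", "my", "from"}  # toy list


def naive_stem(word):
    # Hypothetical stand-in for the Porter stemmer: strips one trailing "s".
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def terms(doc):
    # Set of stemmed, non-stop words in a document.
    words = (w.lower() for w in splitter.findall(doc))
    return {naive_stem(w) for w in words if w not in stop_words}


def compare(doc1, doc2):
    # Cosine measure for 0/1 vectors: the dot product is the number of
    # shared terms, and each norm is the square root of the term count.
    a, b = terms(doc1), terms(doc2)
    if not a or not b:
        return 0.0  # assumption: an empty document matches nothing
    return len(a & b) / math.sqrt(len(a) * len(b))


s = compare("I like dogs and cats", "My cat runs from dogs.")
print("Similarity: %s" % s)
```

The number you get depends heavily on the stop word list and the stemmer, so it will not necessarily match what the attached file reports.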