Similarity of texts: The Vector Space Model with Python

I’m working on a little task that compares the similarity of text documents. One of the most common methods of doing this is called the Vector Space Model. In short, you map words from the documents you want to compare onto a vector that is based on the words found in all documents. Then, you find the cosine of the angle between the vectors of the documents that you want to compare. This is called the cosine measure. When the cosine measure is 0, the documents have no similarity. A value of 1 is yielded when the documents are equal.

I found an example implementation of a basic document search engine by Maciej Ceglowski, written in Perl, here. I thought I’d find the equivalent libraries in Python and code me up an implementation.

Continue reading “Similarity of texts: The Vector Space Model with Python”