I’ve been working on my thesis for a little too long. I’m hopefully about finished, but that is beside the point. Part of what I’ve been working on revolves around the Apriori Data Mining algorithm. If you know what Apriori is, and you are looking for how to implement it, then this post is for you.
I’m working on a little task that compares the similarity of text documents. One of the most common methods of doing this is called the Vector Space Model. In short, you map words from the documents you want to compare onto a vector that is based on the words found in all documents. Then, you find the cosine of the angle between the vectors of the documents that you want to compare. This is called the cosine measure. When the cosine measure is 0, the documents have no similarity. A value of 1 is yielded when the documents are equal.
I found an example implementation of a basic document search engine by Maciej Ceglowski, written in Perl, here. I thought I’d find the equivalent libraries in Python and code me up an implementation.
Continue reading “Similarity of texts: The Vector Space Model with Python”