Comments on: Similarity of texts: The Vector Space Model with Python

By: Dennis

Dennis — Sun, 15 Apr 2012 19:51:07 +0000

Updated, thanks for letting me know.

By: Tyler

Tyler — Fri, 13 Apr 2012 06:13:09 +0000

Hey Dennis,

Thanks again for linking to my online cosine similarity calculator. I have moved it (hopefully for the last time) over to:

http://www.appliedsoftwaredesign.com/archives/cosine-similarity-calculator

(Feel free to delete this comment. Just wanted to give ya a heads up).

Thanks again!

By: Dennis

Dennis — Mon, 12 May 2008 19:56:13 +0000

Each vector maps the words in one text onto the full set of words found across all of the texts. I suppose you’d get the same result if each vector were only long enough to cover the words in its own text, but you’d need extra programming to find the highest word index used by that text and then create a vector of that size. It seems much easier to just create every vector at the maximum size. IIRC, there isn’t any overhead with creating a larger vector with the NumPy library.
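As a rough illustration of the shared-vocabulary idea described above, here is a minimal sketch that is not the post's actual code. The function name `build_vectors`, the whitespace tokenization, and the use of raw word counts are all assumptions made for the example, and plain Python lists stand in for NumPy arrays.

```python
def build_vectors(texts):
    """Map each text to a word-count vector over the combined vocabulary.

    Hypothetical helper: tokenization is a plain lowercase/whitespace split.
    """
    tokenized = [text.lower().split() for text in texts]

    # Give every distinct word one index, in order of first appearance.
    vocabulary = {}
    for words in tokenized:
        for word in words:
            vocabulary.setdefault(word, len(vocabulary))

    # Every vector has one slot per vocabulary word, so all lengths match.
    vectors = []
    for words in tokenized:
        vector = [0] * len(vocabulary)
        for word in words:
            vector[vocabulary[word]] += 1
        vectors.append(vector)
    return vocabulary, vectors


vocabulary, (v1, v2) = build_vectors(["the cat sat", "the dog sat down"])
# v1 == [1, 1, 1, 0, 0] and v2 == [1, 0, 1, 1, 1]
```

Because both vectors are indexed by the same vocabulary, position `i` refers to the same word in each of them, which is what makes an element-wise dot product meaningful.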

By: Jeremy

Jeremy — Mon, 12 May 2008 01:26:01 +0000

This is a neat tutorial, but out of curiosity, why do you need to set the vectors to the same length? The cosine measurement normalizes vectors, IIRC (else why would we divide by the product of the magnitudes of v1 and v2?). In fact, cosine similarity is popular precisely because it can make sense of vectors of various magnitudes.
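A small sketch may help separate the two senses of "length" in play here: cosine similarity cancels out a vector's magnitude, but the dot product still needs both vectors to have the same number of dimensions. The function `cosine` and the sample vectors below are illustrative assumptions, written with the standard library rather than NumPy.

```python
import math


def cosine(v1, v2):
    """Cosine of the angle between two equal-dimension vectors."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(v1, v2))
    mag1 = math.sqrt(sum(a * a for a in v1))
    mag2 = math.sqrt(sum(b * b for b in v2))
    if mag1 == 0 or mag2 == 0:
        return 0.0  # assumption: treat an empty text as having no similarity
    return dot / (mag1 * mag2)


short_doc = [1, 2, 0, 1]
long_doc = [3, 6, 0, 3]   # same word proportions, three times the counts
other_doc = [0, 1, 2, 1]

scaled = cosine(short_doc, long_doc)          # approximately 1.0
vs_short = cosine(short_doc, other_doc)
vs_long = cosine(long_doc, other_doc)         # approximately equal to vs_short
```

Dividing by the product of the magnitudes is what makes the short and the long document score the same against `other_doc`. It does nothing for vectors with different numbers of dimensions, which is why the sketch rejects them.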

By: Dennis

Dennis — Wed, 12 Dec 2007 13:05:39 +0000

Thanks Tyler. I’ve updated the link.

By: Tyler

Tyler — Wed, 12 Dec 2007 12:02:24 +0000

Thanks for linking to my calculator. I am moving the cosine similarity calculator over to http://www.appliedsoftwaredesign.com very soon. Sorry for the inconvenience.

By: Dennis

Dennis — Mon, 22 Oct 2007 03:46:18 +0000

The Vector Space Model and diff actually don’t have very much in common. diff outputs the differences between texts based on each line of text. Anything not output by diff is the same in the texts. This program does not output the similarities in the texts. Instead, it outputs a number that represents how similar the texts are. If the output is 0, there is no similarity. An output of 1 means they are pretty much the same documents. With the Vector Space Model, you could have an output of 1 even if the words were in a completely different order.
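To make the contrast with diff concrete, here is a minimal sketch, assuming raw word counts and whitespace tokenization (the post's own code may differ). The name `cosine_similarity` and the sample sentences are made up for illustration.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts based on word counts (bag of words)."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Only words present in both texts contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)

    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is similar to nothing
    return dot / (norm_a * norm_b)


# Same words in a different order: the score is 1 (up to rounding).
same_words = cosine_similarity("the cat sat on the mat", "mat the on sat cat the")

# No words in common: the score is 0.
no_overlap = cosine_similarity("the cat sat", "dogs bark loudly")
```

A line-based diff would report the two reordered sentences as different, while the word-count vectors are identical, so the similarity comes out as 1.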

By: Tmpvar

Tmpvar — Mon, 22 Oct 2007 02:17:27 +0000

Great post, seems very similar to diff.

Cheers