<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Similarity of texts: The Vector Space Model with Python</title>
	<atom:link href="http://allmybrain.com/2007/10/19/similarity-of-texts-the-vector-space-model-with-python/feed/" rel="self" type="application/rss+xml" />
	<link>http://allmybrain.com/2007/10/19/similarity-of-texts-the-vector-space-model-with-python/</link>
	<description>Where stuff from my brain lands</description>
	<pubDate>Wed, 20 Aug 2008 03:36:15 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6</generator>
		<item>
		<title>By: Dennis</title>
		<link>http://allmybrain.com/2007/10/19/similarity-of-texts-the-vector-space-model-with-python/#comment-509</link>
		<dc:creator>Dennis</dc:creator>
		<pubDate>Mon, 12 May 2008 19:56:13 +0000</pubDate>
		<guid isPermaLink="false">http://allmybrain.com/2007/10/19/similarity-of-texts-the-vector-space-model-with-python/#comment-509</guid>
		<description>The vector represents a map of words included in the text as compared to words included in the domain of all words included in all texts.  I suppose you'd get the same result if the vectors were only long enough to include the words in each text, but you'd have to do extra programming to figure out which word was at the maximum index length and then create the vector that size.  It seems much easier to just create a vector the maximum size.  IIRC, there isn't any overhead with creating a larger vector with the NumPy library.</description>
		<content:encoded><![CDATA[<p>The vector represents a map of words included in the text as compared to words included in the domain of all words included in all texts.  I suppose you&#8217;d get the same result if the vectors were only long enough to include the words in each text, but you&#8217;d have to do extra programming to figure out which word was at the maximum index length and then create the vector that size.  It seems much easier to just create a vector the maximum size.  IIRC, there isn&#8217;t any overhead with creating a larger vector with the NumPy library.</p>
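<p>A minimal sketch of that idea, using plain Python lists rather than NumPy so it runs anywhere. The helper names <code>build_vocabulary</code> and <code>to_vector</code> and the sample texts are made up for illustration and are not taken from the original post.</p>

```python
def build_vocabulary(texts):
    """Map every distinct word across all texts to a fixed index."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            # Assign the next free index the first time a word is seen.
            vocab.setdefault(word, len(vocab))
    return vocab


def to_vector(text, vocab):
    """Count word occurrences into a vector as long as the whole vocabulary."""
    vec = [0] * len(vocab)
    for word in text.lower().split():
        vec[vocab[word]] += 1
    return vec


# Hypothetical sample texts.
texts = ["the cat sat", "the dog sat down"]
vocab = build_vocabulary(texts)
vectors = [to_vector(t, vocab) for t in texts]
```

<p>Because every vector is indexed by the same vocabulary, position i means the same word in each of them, which is what lets the vectors be compared element by element.</p>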
]]></content:encoded>
	</item>
	<item>
		<title>By: Jeremy</title>
		<link>http://allmybrain.com/2007/10/19/similarity-of-texts-the-vector-space-model-with-python/#comment-508</link>
		<dc:creator>Jeremy</dc:creator>
		<pubDate>Mon, 12 May 2008 01:26:01 +0000</pubDate>
		<guid isPermaLink="false">http://allmybrain.com/2007/10/19/similarity-of-texts-the-vector-space-model-with-python/#comment-508</guid>
		<description>This is a neat tutorial, but out of curiosity, why do you need to set the vectors to the same length? The cosine measurement normalizes vectors, IIRC (else why would we divide by the product of the magnitudes of v1 and v2?). In fact, cosine similarity is popular precisely because it can make sense of vectors of various magnitudes.</description>
		<content:encoded><![CDATA[<p>This is a neat tutorial, but out of curiosity, why do you need to set the vectors to the same length? The cosine measurement normalizes vectors, IIRC (else why would we divide by the product of the magnitudes of v1 and v2?). In fact, cosine similarity is popular precisely because it can make sense of vectors of various magnitudes.</p>
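<p>A small sketch that separates the two meanings of "length" here: cosine similarity ignores magnitude, but the dot product still needs both vectors to have the same number of dimensions. The function name <code>cosine</code> and the sample vectors are illustrative assumptions, and the sketch assumes neither vector is all zeros.</p>

```python
import math


def cosine(v1, v2):
    """Cosine of the angle between two equal-dimension, non-zero vectors."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    # Dividing by the magnitudes is what makes the result scale-independent.
    return dot / (norm1 * norm2)


v = [1, 2, 0, 3]
scaled = [10 * x for x in v]  # same direction, ten times the magnitude
same_direction = cosine(v, scaled)
```

<p>Up to floating-point rounding, <code>same_direction</code> comes out as 1 even though the magnitudes differ by a factor of ten, while passing vectors with different dimension counts raises an error instead of silently comparing mismatched words.</p>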
]]></content:encoded>
	</item>
	<item>
		<title>By: Dennis</title>
		<link>http://allmybrain.com/2007/10/19/similarity-of-texts-the-vector-space-model-with-python/#comment-172</link>
		<dc:creator>Dennis</dc:creator>
		<pubDate>Wed, 12 Dec 2007 13:05:39 +0000</pubDate>
		<guid isPermaLink="false">http://allmybrain.com/2007/10/19/similarity-of-texts-the-vector-space-model-with-python/#comment-172</guid>
		<description>Thanks Tyler.  I've updated the link.</description>
		<content:encoded><![CDATA[<p>Thanks Tyler.  I&#8217;ve updated the link.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tyler</title>
		<link>http://allmybrain.com/2007/10/19/similarity-of-texts-the-vector-space-model-with-python/#comment-170</link>
		<dc:creator>Tyler</dc:creator>
		<pubDate>Wed, 12 Dec 2007 12:02:24 +0000</pubDate>
		<guid isPermaLink="false">http://allmybrain.com/2007/10/19/similarity-of-texts-the-vector-space-model-with-python/#comment-170</guid>
		<description>Thanks for linking to my calculator.  I am moving the cosine similarity calculator over to www.appliedsoftwaredesign.com very soon.  Sorry for the inconvenience.</description>
		<content:encoded><![CDATA[<p>Thanks for linking to my calculator.  I am moving the cosine similarity calculator over to <a href="http://www.appliedsoftwaredesign.com" rel="nofollow">http://www.appliedsoftwaredesign.com</a> very soon.  Sorry for the inconvenience.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dennis</title>
		<link>http://allmybrain.com/2007/10/19/similarity-of-texts-the-vector-space-model-with-python/#comment-27</link>
		<dc:creator>Dennis</dc:creator>
		<pubDate>Mon, 22 Oct 2007 03:46:18 +0000</pubDate>
		<guid isPermaLink="false">http://allmybrain.com/2007/10/19/similarity-of-texts-the-vector-space-model-with-python/#comment-27</guid>
		<description>The Vector Space Model and diff actually don't have very much in common.  diff outputs the differences between texts based on each line of text.  Anything not output by diff is the same in the texts.  This program does not output the similarities in the texts.  Instead, it outputs a number that represents how similar the texts are.  If the output is 0, there is no similarity. An output of 1 means they are pretty much the same documents.  With the Vector Space Model, you could have an output of 1 even if the words were in a completely different order.</description>
		<content:encoded><![CDATA[<p>The Vector Space Model and diff actually don&#8217;t have very much in common.  diff outputs the differences between texts based on each line of text.  Anything not output by diff is the same in the texts.  This program does not output the similarities in the texts.  Instead, it outputs a number that represents how similar the texts are.  If the output is 0, there is no similarity. An output of 1 means they are pretty much the same documents.  With the Vector Space Model, you could have an output of 1 even if the words were in a completely different order.</p>
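<p>To illustrate the point about word order, here is a rough sketch that compares word counts only. The function name <code>text_similarity</code> and the sample sentences are hypothetical, splitting on whitespace is a simplifying assumption, and both texts are assumed to be non-empty.</p>

```python
import math
from collections import Counter


def text_similarity(text_a, text_b):
    """Cosine similarity of two texts based purely on word counts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Union of both vocabularies; a Counter returns 0 for missing words.
    words = set(counts_a) | set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in words)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)


# Same words in a different order, then two texts sharing no words.
shuffled = text_similarity("the quick brown fox", "fox brown quick the")
disjoint = text_similarity("the quick brown fox", "lazy dogs sleep")
```

<p>Since only the counts enter the calculation, the shuffled pair scores as identical and the pair with no shared words scores zero, whereas diff would report the shuffled text as changed.</p>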
]]></content:encoded>
	</item>
	<item>
		<title>By: Tmpvar</title>
		<link>http://allmybrain.com/2007/10/19/similarity-of-texts-the-vector-space-model-with-python/#comment-26</link>
		<dc:creator>Tmpvar</dc:creator>
		<pubDate>Mon, 22 Oct 2007 02:17:27 +0000</pubDate>
		<guid isPermaLink="false">http://allmybrain.com/2007/10/19/similarity-of-texts-the-vector-space-model-with-python/#comment-26</guid>
		<description>Great post, seems very similar to diff.

Cheers</description>
		<content:encoded><![CDATA[<p>Great post, seems very similar to diff.</p>
<p>Cheers</p>
]]></content:encoded>
	</item>
</channel>
</rss>
