<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Never Try to Parse Excel Spreadsheets</title>
	<atom:link href="http://toshuo.com/2008/never-try-to-parse-excel-spreadsheets/feed/" rel="self" type="application/rss+xml" />
	<link>http://toshuo.com/2008/never-try-to-parse-excel-spreadsheets/</link>
	<description>learning Chinese, teaching English, trying to understand more</description>
	<lastBuildDate>Fri, 03 Sep 2010 16:22:23 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
	<item>
		<title>By: Daniel Lynes</title>
		<link>http://toshuo.com/2008/never-try-to-parse-excel-spreadsheets/comment-page-1/#comment-4918</link>
		<dc:creator>Daniel Lynes</dc:creator>
		<pubDate>Thu, 11 Dec 2008 14:54:01 +0000</pubDate>
		<guid isPermaLink="false">http://toshuo.com/?p=572#comment-4918</guid>
		<description>If you want to parse Excel spreadsheets with ease, take a look at the &lt;a href=&quot;http://search.cpan.org/~jmcnamara/Spreadsheet-ParseExcel/lib/Spreadsheet/ParseExcel.pm&quot; rel=&quot;nofollow&quot;&gt;Spreadsheet::ParseExcel&lt;/a&gt; class from CPAN for Perl.

It makes slicing and dicing Excel spreadsheets a breeze (without COM).  It works on Windows, Solaris, Linux and Mac OS X.</description>
		<content:encoded><![CDATA[<p>If you want to parse Excel spreadsheets with ease, take a look at the <a href="http://search.cpan.org/~jmcnamara/Spreadsheet-ParseExcel/lib/Spreadsheet/ParseExcel.pm" rel="nofollow">Spreadsheet::ParseExcel</a> class from CPAN for Perl.</p>
<p>It makes slicing and dicing Excel spreadsheets a breeze (without COM).  It works on Windows, Solaris, Linux and Mac OS X.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark</title>
		<link>http://toshuo.com/2008/never-try-to-parse-excel-spreadsheets/comment-page-1/#comment-4917</link>
		<dc:creator>Mark</dc:creator>
		<pubDate>Sun, 10 Aug 2008 20:21:44 +0000</pubDate>
		<guid isPermaLink="false">http://toshuo.com/?p=572#comment-4917</guid>
		<description>Sweet!  I&#039;ll put &quot;cruft&quot; into lesson 1!</description>
		<content:encoded><![CDATA[<p>Sweet!  I&#8217;ll put &#8220;cruft&#8221; into lesson 1!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Prince Roy</title>
		<link>http://toshuo.com/2008/never-try-to-parse-excel-spreadsheets/comment-page-1/#comment-4916</link>
		<dc:creator>Prince Roy</dc:creator>
		<pubDate>Sun, 10 Aug 2008 12:13:18 +0000</pubDate>
		<guid isPermaLink="false">http://toshuo.com/?p=572#comment-4916</guid>
		<description>Is &#039;cruft&#039; one of the vocabulary words your kids learn?  If so, that kid will beat 98% of his US counterparts on the verbal section of the SAT and GRE.</description>
		<content:encoded><![CDATA[<p>Is &#8216;cruft&#8217; one of the vocabulary words your kids learn?  If so, that kid will beat 98% of his US counterparts on the verbal section of the SAT and GRE.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tony Pace</title>
		<link>http://toshuo.com/2008/never-try-to-parse-excel-spreadsheets/comment-page-1/#comment-4915</link>
		<dc:creator>Tony Pace</dc:creator>
		<pubDate>Fri, 08 Aug 2008 05:45:47 +0000</pubDate>
		<guid isPermaLink="false">http://toshuo.com/?p=572#comment-4915</guid>
		<description>To swim into slightly deeper waters, you should check out NLTK, a Python package for language analysis. So far I&#039;ve only used it for part-of-speech classification, but it&#039;s capable of a lot more. It can be made to seek out words or grammar patterns from any corpus (good ones are included, but for kids you might need a graded one). You could use that to create worksheets from vocabulary lists. Heck, even individualized ones, assuming you had a good system to create individualized wordlists.
ReportLab is another good package for PDF generation.</description>
		<content:encoded><![CDATA[<p>To swim into slightly deeper waters, you should check out NLTK, a Python package for language analysis. So far I&#8217;ve only used it for part-of-speech classification, but it&#8217;s capable of a lot more. It can be made to seek out words or grammar patterns from any corpus (good ones are included, but for kids you might need a graded one). You could use that to create worksheets from vocabulary lists. Heck, even individualized ones, assuming you had a good system to create individualized wordlists.<br />
ReportLab is another good package for PDF generation.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark</title>
		<link>http://toshuo.com/2008/never-try-to-parse-excel-spreadsheets/comment-page-1/#comment-4914</link>
		<dc:creator>Mark</dc:creator>
		<pubDate>Thu, 07 Aug 2008 16:38:51 +0000</pubDate>
		<guid isPermaLink="false">http://toshuo.com/?p=572#comment-4914</guid>
		<description>Tony, those ActiveState libraries are amazing.</description>
		<content:encoded><![CDATA[<p>Tony, those ActiveState libraries are amazing.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark</title>
		<link>http://toshuo.com/2008/never-try-to-parse-excel-spreadsheets/comment-page-1/#comment-4913</link>
		<dc:creator>Mark</dc:creator>
		<pubDate>Thu, 07 Aug 2008 16:38:05 +0000</pubDate>
		<guid isPermaLink="false">http://toshuo.com/?p=572#comment-4913</guid>
		<description>Matt, I&#039;m positive you don&#039;t really want to see the rest of the code.  I did the whole thing from an interactive shell and half the &quot;code&quot; consisted of little experiments regarding Python syntax.  This was a &lt;i&gt;very&lt;/i&gt; under-engineered approach... possibly due to my utter failure at making a &quot;real&quot; program to parse the spreadsheets.

Speaking of spreadsheets, that Joel guy&#039;s post was interesting.  Maybe I should take back my comment about &quot;identical statues&quot; insanity.

As far as linking the words list and the parts of speech list, yeah.  It&#039;s a good idea.  Originally, I was thinking of making an object to represent vocabulary items and giving it &quot;word&quot;, &quot;part of speech&quot;, &quot;bookworm level&quot;, &quot;lesson taught&quot; and &quot;chinese&quot; components.  I may still do that at some point.  This time, though, I really just needed to get a few hundred more words added to my list.

FYI, the list will almost certainly never surpass a few thousand entries, so run-time concerns are irrelevant compared to Mark&#039;s-time concerns.</description>
		<content:encoded><![CDATA[<p>Matt, I&#8217;m positive you don&#8217;t really want to see the rest of the code.  I did the whole thing from an interactive shell and half the &#8220;code&#8221; consisted of little experiments regarding Python syntax.  This was a <i>very</i> under-engineered approach&#8230; possibly due to my utter failure at making a &#8220;real&#8221; program to parse the spreadsheets.</p>
<p>Speaking of spreadsheets, that Joel guy&#8217;s post was interesting.  Maybe I should take back my comment about &#8220;identical statues&#8221; insanity.</p>
<p>As far as linking the words list and the parts of speech list, yeah.  It&#8217;s a good idea.  Originally, I was thinking of making an object to represent vocabulary items and giving it &#8220;word&#8221;, &#8220;part of speech&#8221;, &#8220;bookworm level&#8221;, &#8220;lesson taught&#8221; and &#8220;chinese&#8221; components.  I may still do that at some point.  This time, though, I really just needed to get a few hundred more words added to my list.</p>
<p>FYI, the list will almost certainly never surpass a few thousand entries, so run-time concerns are irrelevant compared to Mark&#8217;s-time concerns.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Robin</title>
		<link>http://toshuo.com/2008/never-try-to-parse-excel-spreadsheets/comment-page-1/#comment-4912</link>
		<dc:creator>Robin</dc:creator>
		<pubDate>Thu, 07 Aug 2008 14:52:13 +0000</pubDate>
		<guid isPermaLink="false">http://toshuo.com/?p=572#comment-4912</guid>
		<description>Matt: In Python 2.4 or newer, you would use the set type for checking whether the word is contained. Also, you can use a generator expression instead of a list comprehension on the second line, because the list is only needed temporarily anyway (and only wastes memory). And with the izip function, we can even get rid of the index:

&lt;code&gt;
from itertools import izip
l2set = set(l2words)
b2dict = dict((word, pos) for (word, pos) in izip(b2words, b2pos) if word not in l2set)
&lt;/code&gt;</description>
		<content:encoded><![CDATA[<p>Matt: In Python 2.4 or newer, you would use the set type for checking whether the word is contained. Also, you can use a generator expression instead of a list comprehension on the second line, because the list is only needed temporarily anyway (and only wastes memory). And with the izip function, we can even get rid of the index:</p>
<p><pre><code>
from itertools import izip
l2set = set(l2words)
b2dict = dict((word, pos) for (word, pos) in izip(b2words, b2pos) if word not in l2set)
</code></pre></p>
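<p>A hedged note for readers on Python 3: <code>itertools.izip</code> no longer exists there, because the built-in <code>zip</code> is already lazy. Below is a minimal sketch of the same filter, assuming Python 3. The three sample lists are invented stand-ins for the real word lists, not data from the original post.</p>

```python
# Python 3 sketch; the sample lists are invented for illustration only.
b2words = ["apple", "run", "blue", "cat"]
b2pos = ["noun", "verb", "adjective", "noun"]
l2words = ["run", "cat"]

# Build the set once so each membership test is roughly constant time.
l2set = set(l2words)

# zip() pairs each word with its part of speech lazily, so no index is needed.
b2dict = {word: pos for word, pos in zip(b2words, b2pos) if word not in l2set}

print(b2dict)
```

<p>The dict comprehension plays the same role as <code>dict(...)</code> around a generator expression; which one reads better is mostly a matter of taste.</p>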
]]></content:encoded>
	</item>
	<item>
		<title>By: Matt Ball</title>
		<link>http://toshuo.com/2008/never-try-to-parse-excel-spreadsheets/comment-page-1/#comment-4911</link>
		<dc:creator>Matt Ball</dc:creator>
		<pubDate>Thu, 07 Aug 2008 14:03:41 +0000</pubDate>
		<guid isPermaLink="false">http://toshuo.com/?p=572#comment-4911</guid>
		<description>Joel Spolsky (ex-Excel lead) wrote &lt;a href=&quot;http://www.joelonsoftware.com/items/2008/02/19.html&quot; rel=&quot;nofollow&quot;&gt;this article&lt;/a&gt; about the Excel format, and why you should never parse it, even in a serious application -- the format is too complicated, and you&#039;ll never get all the little features right.  Instead, Joel suggests that you use COM objects to do the work for you.  Alternatively, he suggests that you use simpler file formats, like CSV (comma-separated values).

If you&#039;re trying to remove duplicates in Excel, I think there&#039;s a way to put the whole thing into a pivot table, and remove duplicates that way.  There&#039;s a bit of a learning curve with pivot tables, but they can be very powerful if you want to keep it all in Excel.

Concerning the code example, this algorithm looks simple, but I think there is a problem if the number of words gets large.  The algorithm operates in at best O(n^2) run time -- that is, the run time varies with the square of the number of words in the list (or more precisely, the product of the number of elements in the two lists).  If the del operator takes O(n) time, each deletion adds a further O(n) cost, though the total stays quadratic (see more below).

What I would recommend instead is to put all the elements of l2words into a hash table (an O(n) operation) so that the &#039;in&#039; operator executes in (roughly) constant time instead of O(n) time.  I think this is what you&#039;d want, roughly:

&lt;code&gt;&lt;pre&gt;
l2dict = dict([(word, 1) for word in l2words])
# walk backwards so deletions do not shift the indices still to be visited
for i in reversed(range(len(b2words))):
    if b2words[i] in l2dict:
        del b2words[i]
        del b2pos[i]
&lt;/pre&gt;&lt;/code&gt;

(I&#039;m a Python hack, so there is undoubtedly a more elegant way to create a dictionary (i.e. hash table) from a list.)

&#039;del&#039; is probably an O(n) operation on b2words (making it O(n^2) overall).  It depends on whether b2words is implemented internally as a linked list or a fixed-position array.  In either case, either &#039;del&#039; is slow, or &#039;[i]&#039; is slow.  (CPython lists are arrays, so &#039;[i]&#039; is fast and &#039;del&#039; is the slow one.)  To be on the safe side, it&#039;s probably better to create a new list instead of deleting from the existing list:

&lt;code&gt;&lt;pre&gt;
l2dict = dict([(word, 42) for word in l2words])
b2pos = [b2pos[i] for i, word in enumerate(b2words) if word not in l2dict]
b2words = [word for word in b2words if word not in l2dict]
&lt;/pre&gt;&lt;/code&gt;

I suspect that you may have meant for b2pos and b2words to be more closely linked.  Without seeing the rest of the code, you may have wanted this to be a dictionary of pairs.  Maybe this code is the right approach:

&lt;code&gt;&lt;pre&gt;
l2dict = dict([(word, 91) for word in l2words])
b2dict = dict([(word, b2pos[i]) for i, word in enumerate(b2words) if word not in l2dict])
&lt;/pre&gt;&lt;/code&gt;

I haven&#039;t run any of this code, so I&#039;m expecting at least a couple of errors.

A C programmer will typically tackle these types of problems using lots of &#039;for (i=0; i&lt;n; i++)&#039; loops.  In using more modern languages, I&#039;ve decided that it&#039;s almost always some kind of mistake when you see this.  Usually, there&#039;s some other construct based on lists, arrays, or hash tables that executes faster and more gracefully.  Python has a cool construct called the &#039;list comprehension&#039; that makes it easy to do list manipulations.</description>
		<content:encoded><![CDATA[<p>Joel Spolsky (ex-Excel lead) wrote <a href="http://www.joelonsoftware.com/items/2008/02/19.html" rel="nofollow">this article</a> about the Excel format, and why you should never parse it, even in a serious application -- the format is too complicated, and you&#8217;ll never get all the little features right.  Instead, Joel suggests that you use COM objects to do the work for you.  Alternatively, he suggests that you use simpler file formats, like CSV (comma-separated values).</p>
<p>If you&#8217;re trying to remove duplicates in Excel, I think there&#8217;s a way to put the whole thing into a pivot table, and remove duplicates that way.  There&#8217;s a bit of a learning curve with pivot tables, but they can be very powerful if you want to keep it all in Excel.</p>
<p>Concerning the code example, this algorithm looks simple, but I think there is a problem if the number of words gets large.  The algorithm operates in at best O(n^2) run time -- that is, the run time varies with the square of the number of words in the list (or more precisely, the product of the number of elements in the two lists).  If the del operator takes O(n) time, each deletion adds a further O(n) cost, though the total stays quadratic (see more below).</p>
<p>What I would recommend instead is to put all the elements of l2words into a hash table (an O(n) operation) so that the &#8216;in&#8217; operator executes in (roughly) constant time instead of O(n) time.  I think this is what you&#8217;d want, roughly:</p>
<p><pre><code>
l2dict = dict([(word, 1) for word in l2words])
# walk backwards so deletions do not shift the indices still to be visited
for i in reversed(range(len(b2words))):
    if b2words[i] in l2dict:
        del b2words[i]
        del b2pos[i]
</code></pre></p>
<p>(I&#8217;m a Python hack, so there is undoubtedly a more elegant way to create a dictionary (i.e. hash table) from a list.)</p>
<p>&#8216;del&#8217; is probably an O(n) operation on b2words (making it O(n^2) overall).  It depends on whether b2words is implemented internally as a linked list or a fixed-position array.  In either case, either &#8216;del&#8217; is slow, or &#8216;[i]&#8217; is slow.  (CPython lists are arrays, so &#8216;[i]&#8217; is fast and &#8216;del&#8217; is the slow one.)  To be on the safe side, it&#8217;s probably better to create a new list instead of deleting from the existing list:</p>
<p><pre><code>
l2dict = dict([(word, 42) for word in l2words])
b2pos = [b2pos[i] for i, word in enumerate(b2words) if word not in l2dict]
b2words = [word for word in b2words if word not in l2dict]
</code></pre></p>
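<p>The new-list approach above can also be sketched as a single pass that keeps the two lists in step. This is only an illustration, assuming Python 3; <code>drop_known</code> and the sample data are hypothetical names made up here, not part of the original script.</p>

```python
def drop_known(words, tags, known):
    """Return new (words, tags) lists without the entries whose word is in known.

    Hypothetical helper for illustration; the input lists are left untouched.
    """
    known_set = set(known)  # roughly constant-time membership tests
    # Filter the pairs together so a word and its tag can never drift apart.
    kept = [(word, tag) for word, tag in zip(words, tags) if word not in known_set]
    new_words = [word for word, _ in kept]
    new_tags = [tag for _, tag in kept]
    return new_words, new_tags


# Invented sample data standing in for b2words, b2pos and l2words.
sample_words = ["apple", "run", "blue"]
sample_tags = ["noun", "verb", "adjective"]
filtered_words, filtered_tags = drop_known(sample_words, sample_tags, ["run"])
print(filtered_words, filtered_tags)
```

<p>Building new lists costs some extra memory, but for a vocabulary list of a few thousand entries that is unlikely to matter.</p>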
<p>I suspect that you may have meant for b2pos and b2words to be more closely linked.  Without seeing the rest of the code, you may have wanted this to be a dictionary of pairs.  Maybe this code is the right approach:</p>
<p><pre><code>
l2dict = dict([(word, 91) for word in l2words])
b2dict = dict([(word, b2pos[i]) for i, word in enumerate(b2words) if word not in l2dict])
</code></pre></p>
<p>I haven&#8217;t run any of this code, so I&#8217;m expecting at least a couple of errors.</p>
<p>A C programmer will typically tackle these types of problems using lots of &#8216;for (i=0; i&lt;n; i++)&#8217; loops.  In using more modern languages, I&#8217;ve decided that it&#8217;s almost always some kind of mistake when you see this.  Usually, there&#8217;s some other construct based on lists, arrays, or hash tables that executes faster and more gracefully.  Python has a cool construct called the &#8216;list comprehension&#8217; that makes it easy to do list manipulations.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Robin</title>
		<link>http://toshuo.com/2008/never-try-to-parse-excel-spreadsheets/comment-page-1/#comment-4910</link>
		<dc:creator>Robin</dc:creator>
		<pubDate>Thu, 07 Aug 2008 08:11:23 +0000</pubDate>
		<guid isPermaLink="false">http://toshuo.com/?p=572#comment-4910</guid>
		<description>Mark: Yes, it does :)</description>
		<content:encoded><![CDATA[<p>Mark: Yes, it does <img src='http://toshuo.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark</title>
		<link>http://toshuo.com/2008/never-try-to-parse-excel-spreadsheets/comment-page-1/#comment-4909</link>
		<dc:creator>Mark</dc:creator>
		<pubDate>Thu, 07 Aug 2008 05:48:07 +0000</pubDate>
		<guid isPermaLink="false">http://toshuo.com/?p=572#comment-4909</guid>
		<description>Hmm... neat.  So

&lt;code&gt;
if b2words[i] in l2words:
&lt;/code&gt;

will just traverse l2words to see if b2words[i] matches anything in l2words?  Thanks for the tip!</description>
		<content:encoded><![CDATA[<p>Hmm&#8230; neat.  So</p>
<p><pre><code>
if b2words[i] in l2words:
</code></pre></p>
<p>will just traverse l2words to see if b2words[i] matches anything in l2words?  Thanks for the tip!</p>
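<p>One rough way to check this is to count the equality comparisons that <code>in</code> performs. The sketch below assumes Python 3 on CPython; <code>CountingWord</code> is a hypothetical helper written just for this experiment, and the word list is made up.</p>

```python
class CountingWord:
    """Wraps a string and counts how many equality checks are made (hypothetical helper)."""

    comparisons = 0

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        CountingWord.comparisons += 1
        return isinstance(other, CountingWord) and self.text == other.text

    def __hash__(self):
        return hash(self.text)


l2words = [CountingWord(w) for w in ["ant", "bee", "cat", "dog"]]
target = CountingWord("dog")  # equal to the last entry, but a different object

# Membership in a list: compared against each element from the front.
CountingWord.comparisons = 0
found_in_list = target in l2words
list_comparisons = CountingWord.comparisons

# Membership in a set: hashed first, then compared only against candidates
# that land in the same slot.
l2set = set(l2words)
CountingWord.comparisons = 0
found_in_set = target in l2set
set_comparisons = CountingWord.comparisons

print(found_in_list, list_comparisons)
print(found_in_set, set_comparisons)
```

<p>With the match at the end of the list, the list lookup has to compare against every element, while the set lookup needs far fewer comparisons.</p>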
]]></content:encoded>
	</item>
</channel>
</rss>
