Never Try to Parse Excel Spreadsheets

August 7th, 2008 by Mark

Recently, I’ve been getting more and more bogged down in my EFL curriculum work. It’s a very rough process, but even so it’s become unwieldy. The problem is the way I’m keeping track of vocabulary words.

Frequency is key

Whenever I create a new lesson, I have to choose vocabulary items to introduce. Rather than do so on a strictly topical basis, I’ve opted to focus on usage frequency. In particular, I do everything within reason to teach the headwords used for the Oxford Bookworms series of graded readers. Extensive reading is possibly the single most important part of our program, and it’s worth it to make a few sacrifices to get the kids reading actual books as soon as possible.

Spreadsheets can save time

In order to keep track of everything, I’ve made a big spreadsheet for all the vocabulary items in the curriculum. One field is the word itself, another field represents which lesson I teach it in, another is the Bookworms level in which it first appears. The Chinese translation is in still another field, and there’s one final one for any notes I might have about the vocab item. I can sort this spreadsheet by column headers to see the words taught in a given lesson, or all the adjectives taught between any given lessons, or any number of other useful combination. It’s great for making review sheets for my students.

Spreadsheets still aren’t perfect, though

So what’s the problem? The problem is that I started the list with only level one Bookworms words. Before, when I made new lessons, I used to look through the remaining level one words and choose the most appropriate group of them I could. After exhausting the level one list, things got a bit tougher. Now, when make a new lesson, that means adding new words to my master spreadsheet. Obviously, I want to add level 2 words, but sometimes there’s a word that just has to be added regardless of where it appears in the Oxford frequency lists. When that happens, I have to hunt through the list of words that appear in level two readers. If the word I’m looking for isn’t there, I have to look through the level three list, and then the level four list. It sucks.

Worse still, if I were to just copy the entire level 2 list into my master spreadsheet, there would be hundreds of duplicates, which would take hours to remove by hand. No good. Especially since the task would just become all the more unmanageable when it was time to add the level 3 list into my master spreadsheet.

Aha! I’ll just whip up a quick program to parse my excel file and remove duplicates!

Ha! What a fool I was! After spending hours pounding my head against my computer desk, I came to the realization that the excel file format is pure evil. It has decades of cruft that hark back to the days in which desktops had less computing power than my sports watch! And it won’t yield its information gracefully! Had the .xls file format been the result of one particular person, I’d have said his decision making skills rivaled those of Easter Island native who said, “I know! Let’s cut down the last tree and commit suicide because we need more identical statues!“.

Finally, I came to my senses, saved the list as text file and wrote a quick python script to read the vocabulary list into a few big arrays of strings, and then it was easy to remove duplicates:

for i in range(0, len(b2words)-1):
	for w in l2words:
		if w == b2words[i]:
			del b2words[i]
			del b2pos[i]

Then I output my list of level two bookworms (that aren’t duplicates of any words already in my curriculum) into a simple text file, and the part of speech information into another text file. Finally, I used my good friends cut and paste to add them into my master spreadsheet and I set the Bookworm level for the whole group to two.

Why, oh why couldn’t I have thought of that solution before 4 a.m.?

Tags: , ,

12 Responses to “Never Try to Parse Excel Spreadsheets”

  1. 1 Robin Says:

    Python for the win! :)

    You had a one-off error there with the second argument of range (list(range(0, 3)) == [0, 1, 2]). The code can also be made simpler like this:

    
    for i in range(len(b2words)):
        if b2words[i] in l2words:
            del b2words[i]
            del b2pos[i]
    

  2. 2 Tony Pace Says:

    next time try this from activestate:
    http://code.activestate.com/recipes/440661/

    I can attest it works very well. I use another card generation program called nandeck in conjunction with this wordlist to make cards for all the vocabulary.

  3. 3 ZL Says:

    If you’re just looking to remove duplicates I wouuld suggest a great text editor called Ultraedit. Just past your collums into it (will be tab delimmited) Choose Sort>Advanced Options>Remove Duplicates and then sort. Then simply paste back into Excel. Even if not for this task, you should check it out, really great app.
    http://www.ultraedit.com

  4. 4 Mark Says:

    Hmm… neat. So

    
    if b2words[i] in l2words:
    

    will just traverse l2words to see if b2words[i] matches anything in l2words? Thanks for the tip!

  5. 5 Robin Says:

    Mark: Yes, it does :)

  6. 6 Matt Ball Says:

    Joel Spolsky (ex-Excel lead) wrote this article about the Excel format, and why you should never parse it, even in a serious application — the format is too complicated, and you’ll never get all the little features right. Instead, Joel suggests that you use Com Objects to do the work for you. Alternatively, he suggests that you use simpler file formats, like CSV (comma separated values).

    If you’re trying to remove duplicates in Excel, I think there’s a way to put the whole thing into a Pivot table, and remove duplicates that way. There’s a bit of a learning curve with pivot tables, but they can be very powerful if you want to keep it all in excel only.

    Concerning the code example, this algorithm looks simple, but I think there is a problem if the number of words get large. The algorithm operates in at best O(n^2) run time — that is, the run time varies with the square of the number of words in the list (or more precisely, the product of the number of elements in the two lists). It may even run in O(n^3) time if the del operator takes O(n) time (see more below).

    What I would recommend instead is to put all the elements of l2words into a hash table ( O(n) operation) so that the ‘in’ operator executes in (roughly) constant time instead of O(n) time. I think this is what you’d want, roughly:

    <pre>
    l2dict = dict([(word, 1) for word in l2words])
    for i in range(len(b2words)):
    <t />    if b2words[i] in l2dict:
    <t /><t />        del b2words[i]
    <t /><t />        del b2pos[i]
    </pre>

    (I’m a Python hack, so there is undoubtedly a more elegant way to create a dictionary (i.e. hash table) from a list.)

    ‘del’ is probably an O(n) operation on b2words (making it O(n^2) overall). It depends on whether b2words is implemented internally as a linked list or a fixed-position array. In either case, either ‘del’ is slow, or ‘[i]‘ is slow. To be on the safe side, it’s probably better to create a new list instead of deleting from the existing list:

    <pre>
    l2dict = dict([(word, 42) for word in l2words])
    b2pos = [b2pos[i] for i, word in enumerate(b2words) if word not in l2dict];
    b2words = [word for word in b2words if word not in l2dict];
    </pre>

    I suspect that you may have meant for b2pos and b2words to be more closely linked. Without seeing the rest of the code, you may have wanted this to be a dictionary of pairs. Maybe this code is the right approach:

    <pre>
    l2dict = dict([(word, 91) for word in l2words])
    b2dict = dict([(word, b2pos[i]) for i, word in enumerate(b2words) if word not in l2dict])
    </pre>

    I haven’t run any of this code, so I’m expecting at least a couple errors.

    A C programmer will typically tackle these types of problems using lots of ‘for (i=0; i<n; i++)’ loops. In using more modern languages, I’ve decided that it’s almost always some kind of mistake when you see this. Usually, there’s some other construct based on lists, arrays, or hash tables that executes faster and more gracefully. Python has a cool construct called the ‘list comprehension’ that makes it easy to do list manipulations.

  7. 7 Robin Says:

    Matt: In Python 2.4 or newer, you would use the set type for checking whether the word is contained. Also, you can use a generator expression instead of a list comprehension on the second line, because the list is only needed temporarily anyway (and only wastes memory). And with the izip function, we can even get rid of the index:

    
    from itertools import izip
    l2set = set(l2words)
    b2dict = [(word, pos) for (word, pos) in izip(b2words, b2pos) if word not in l2set]
    

  8. 8 Mark Says:

    Matt, I’m positive you don’t really want to see the rest of the code. I did the whole thing from an interactive shell and half the “code” consisted of little experiments regarding python syntax. This was a very under engineered approach… possibly due to my utter failure at making a “real” program to parse the spreadsheets.

    Speaking of spreadsheets, that Joel guy’s post was interesting. Maybe I should take back my comment about “identical statues” insanity.

    As far as linking the words list and the parts of speech list, yeah. It’s a good idea. Originally, I was thinking of making an object to represent vocabulary items and giving it “word”, “part of speech”, “bookworm level”, “lesson taught” and “chinese” components. I may still do that at some point. This time, though I really just needed to get a few hundred more words added to my list.

    FYI, the list will almost certainly never surpass a few thousand entries, so run-time concerns are irrelevant compared to Mark’s-time concerns.

  9. 9 Mark Says:

    Tony, those active state libraries are amazing.

  10. 10 Tony Pace Says:

    To swim into slightly deeper waters, you should check out nltk, a python package for language analysis. So far I’ve only used it for parts of speech classification, but it’s capable of a lot more. It can be made to seek out words or grammar patterns from any corpus (good ones are included, but for kids you might need a graded one). You could use that to create worksheets from vocabulary lists. Heck, even individualized ones, assuming you had a good system to create individualized wordlists.
    Reportlab is another good package for pdf generation.

  11. 11 Prince Roy Says:

    is ‘cruft’ one of the vocabulary words your kids learn? If so, that kid will beat 98% of his US counterparts on the verbal section of the SAT and GRE.

  12. 12 Mark Says:

    Sweet! I’ll put “cruft” into lesson 1!

Leave a Reply

Quicktags: