Wednesday, February 23, 2011

Bits from the past

A few weeks ago my father found on a backup disk some old texts he had written years ago: composed in WordStar for DOS, most likely on our first computer, a PC XT (8-12 MHz!). Digging into these files we concluded they were from about 1991 or 1992.... Too long ago! All I can say in my defense is that... I was younger? A child? The fact is that I used that WordStar version a lot.

Unfortunately these files were unreadable by any word processor I use or tried. Import filters promised a lot, and none of them worked for me. So it deserved some deeper digging.... First of all, for nostalgia's sake, this is what WS looked like (here within a "modern" Win XP, running virtualized on my Linux):

Now back to importing these files: I found the site wordstar.org with plenty of information, but most downloads were for Windows, and most were not free. Here is a list of downloads. I tried some of them (under Wine), like WS-Con, WSRTF and a few more. None of them worked fully; most had problems with accents, or with whatever encoding these files used.

Fortunately I found a text on that site which describes the file format. The format is quite simple, and has a nice design that allows extracting "most" of the text by just looking at the lowest 7 bits of each byte and discarding everything with the 8th bit set. If you want formatting, you have to interpret those high bytes too, but they are not too complex, and we used very few of them in our old texts.
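To make that concrete, here is a minimal sketch of the 7-bit pass (in the Python 2 of the era, like the rest of this post; the function name is mine, and this is not the final script). As you will see below, a naive pass like this is also exactly what mangles the accents:

def rough_text(path):
    raw = open(path, 'rb').read()       # a plain byte string in Python 2
    out = []
    for ch in raw:
        code = ord(ch) & 0x7f           # look only at the lowest 7 bits
        if code in (9, 10, 13) or 32 <= code < 127:
            out.append(chr(code))       # printable ASCII or whitespace: keep it
        # anything else is a WordStar control code: discard it
    return ''.join(out)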

So I read this and wrote a python script to process them... The first few attempts were mangling, again, all my accents and the 'ñ' character (these are in Spanish) so I had to start digging at ascii codes. I have almost tatooed in my memory that 'ñ' = 164, after typing "Alt-1-6-4" so many times in DOS. (There were only US keyboards by that time... and I still use them). But a character 164 means something else in python or in nowadays encodings... sometimes as bad as:

>>> chr(164)
'\xa4'
>>> chr(164).encode('utf-8')
Traceback (most recent call last):
  File "", line 1, in
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa4 in position 0: ordinal not in range(128)
while the 'ñ' has different codes today:

>>> 'ñ'
'\xc3\xb1'   ### These are UTF-8 bytes, straight from the terminal
>>> 'ñ'.decode('utf-8')
u'\xf1'      ### The decoded Unicode string, code point U+00F1
so which was the correct encoding? There is a small note on this page, under "how to type it in Microsoft Windows", and later another note saying DOS used "codepage 437". That's cool: I had already found the list of all the encodings Python ships with in /usr/lib/python2.6/encodings (and there is indeed a cp437.py file there).
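You can even check for it without leaving the interpreter; something along these lines (the exact file list may differ between installations):

>>> import encodings, os
>>> 'cp437.py' in os.listdir(os.path.dirname(encodings.__file__))
True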

So the real key was to do something like chr(164).decode('cp437'). That returns the unicode string u'\xf1', which is the "real" 'ñ'. That did the trick, and the script was done.
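For the record, the whole round trip looks like this in an interactive session (Python 2 again; the print line assumes a UTF-8 terminal):

>>> chr(164).decode('cp437')
u'\xf1'
>>> print chr(164).decode('cp437')
ñ
>>> chr(164).decode('cp437').encode('utf-8')
'\xc3\xb1'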

As a side note: I found some more characters I could not filter out initially, double-byte codes like ESC-'4' or ESC-'5' around words... what was that? Some sleepy neuron remembered that we used to have a Star NX-1001 (multifont!), and I suspected those were printer codes. In fact, the manual (which still exists! go Star!) says they mark the start of italicized text and the return to the normal font face. So that's not part of the WordStar format; it's another problem entirely: the way we handled formatting on our printer. (It was good to remember it, our best printer ever!)
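Once identified, handling them is just one more substitution in the script. A sketch of the idea for the HTML output (the function name is mine; the codes themselves are straight from the NX-1001 manual):

ESC = '\x1b'

def printer_italics_to_html(text):
    # Star/Epson printer codes: ESC '4' turns italics on, ESC '5' turns them off
    return text.replace(ESC + '4', '<i>').replace(ESC + '5', '</i>')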

Now, if you read this far, you must be either a nostalgic looking for old memories, or someone really looking to translate a WordStar file. I uploaded the script to http://pastebin.com/pfY8Dbgv - it converts to plain text or to basic HTML. I hope it helps you!
