Unfortunately these files were unreadable by any word processor I use or have tried. Import filters promised a lot, and none of them worked for me. So it deserved some deeper digging... First of all, for nostalgia's sake, this is what WS looked like (now inside a "modern" Win XP, running virtualized on my Linux):
Now back to importing these files: I found the site wordstar.org with plenty of information, but most downloads were for Windows, and most were not free. Here is a list of downloads. I tried some of them (under Wine), like WS-Con, WSRTF and a few more. None of them fully worked; most had problems with accents, or with whatever encoding these files used.
Fortunately I found a text on that site which describes the file format. The format is quite simple, and has a nice design that lets you extract "most" of the text by just looking at the lowest 7 bits of each byte and discarding everything with the 8th bit set. If you want formatting you would have to interpret those high bytes too, but they are not too complex, and we used very few of them in our old texts.
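As a rough sketch of that idea (this is not the script linked at the end, and the file name is made up), a Python 3 version could look like this:
def wordstar_to_text(path):
    with open(path, 'rb') as f:
        data = f.read()
    chars = []
    for b in data:
        ch = b & 0x7f              # keep only the lowest 7 bits
        if 32 <= ch < 127 or ch in (9, 10, 13):
            chars.append(chr(ch))  # printable ASCII, tabs and newlines
    return ''.join(chars)          # everything else (formatting) is dropped
print(wordstar_to_text('CARTA.WS'))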
So I read this and wrote a Python script to process them... The first few attempts were mangling, again, all my accents and the 'ñ' character (these texts are in Spanish), so I had to start digging into ASCII codes. I have 'ñ' = 164 almost tattooed in my memory after typing "Alt-1-6-4" so many times in DOS (there were only US keyboards back then... and I still use them). But character 164 means something else in Python, or in today's encodings... sometimes as bad as:
>>> chr(164)
'\xa4'
>>> chr(164).encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa4 in position 0: ordinal not in range(128)
while the 'ñ' has different codes today:
>>> 'ñ'
'\xc3\xb1' ### These are its UTF-8 bytes
>>> 'ñ'.decode('utf-8')
u'\xf1' ### And this is the Unicode code point
So the real key was to do something like chr(164).decode('cp437'). That returns the unicode string u'\xf1', which is the "real" 'ñ'. That did the trick, and the script was done.
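In today's Python 3 (the session above is Python 2) the same cp437 trick would look something like this; the sample byte string is just an illustration:
>>> bytes([164]).decode('cp437')
'ñ'
>>> b'Espa\xa4a'.decode('cp437')
'España'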
As a side note: I found some more characters I could not filter out initially: double-byte codes like ESC-'4' or ESC-'5' around words... what was that? Some sleepy neuron remembered that we used to have a Star NX-1001 (multifont!), and I suspected those were printer codes. In fact, the manual (which still exists! go Star!) says they mark the start of italicized text and the return to the normal font face. So that's not part of the WordStar format; it's a different problem, the way we handled formatting on our printer. (It was good to remember that, our best printer, again!)
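If you want to keep those italics in an HTML conversion, a tiny (hypothetical) post-processing step could map the escapes to tags, assuming ESC-'4' opens italics and ESC-'5' closes them as the NX-1001 manual says:
def printer_italics_to_html(text):
    # '\x1b' is the ESC character; the codes assumed here are the
    # Star NX-1001 ones mentioned above.
    text = text.replace('\x1b' + '4', '<i>')
    text = text.replace('\x1b' + '5', '</i>')
    return text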
Now, if you read this far, you must be either a nostalgic soul looking for old memories, or someone really trying to translate a WordStar file. I uploaded the script to http://pastebin.com/pfY8Dbgv - it converts to plain text or to basic HTML. I hope it helps you!