Python and UTF-8 rant
On Wednesday, Thursday and Friday, I felt compelled to rant and scream and make a fuss.
March 15
I need to emit a rant. You can safely ignore this post.
I'm working on some software for work: an application that will take an uncompressed DAISY book (with WAVE audio) and encode it to a compressed book (with MP3 audio) suitable for distributing on a CD, and create MP3s for online distribution.
I'm doing this in Python, because that's my
So ... don't use Notepad? Do all my configuration through the interface, which at least works? No, of course not, that's why I chose the ConfigParser module in the first place! I wanted to be able to edit .ini files by hand; that's why they exist.
I'm not quite ready to give up on Python yet, since nearly all the programme is written in it (except for the parts I didn't write myself, like the MP3 encoder). But seriously, Guido et al. need to get working on making Python's Unicode handling mature.
Addendum, March 16
So I have it partially figured out. I need to use the codecs module to explicitly specify the encoding of the file. I was already passing around a file object instead of a filename (so I could close the file immediately after having read/written it), so this was pretty easy to do.
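For the record, here is roughly what that looks like. This is only a sketch, written against the current configparser module rather than the old ConfigParser one; load_config and demo_load are names made up for illustration, and the [book] section is invented sample data.

```python
import codecs
import configparser
import os
import tempfile


def load_config(path, encoding="utf-8"):
    """Open the file with an explicit encoding and hand the file object to the parser."""
    parser = configparser.ConfigParser()
    # codecs.open decodes bytes to text using the encoding we name,
    # instead of relying on whatever the platform default happens to be.
    with codecs.open(path, "r", encoding) as f:
        parser.read_file(f)
    # The file is already closed here; the parser keeps the parsed values.
    return parser


def demo_load():
    """Write a small sample .ini file (invented data) and read one value back."""
    fd, path = tempfile.mkstemp(suffix=".ini")
    os.close(fd)
    try:
        with codecs.open(path, "w", "utf-8") as f:
            f.write(u"[book]\ntitle = Caf\u00e9\n")
        return load_config(path).get("book", "title")
    finally:
        os.remove(path)
```

Passing the open file object rather than the filename is what makes it possible to choose the encoding outside the parser.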
Except that it still chokes on files from Windows machines, with the byte order mark at the beginning. Apparently this problem isn't present for utf-16; it's just that Python handles utf-8 wrong. There's an easy fix, if you're dealing with the unicode string--take off the BOM, if present. Problem is, I'm not dealing with the unicode strings directly--those are being handled by the module. So it looks like I'm going to have to write a wrapper to the codecs.open(f, m, 'utf-8') function just to read and write utf-friggen-8.
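A wrapper along those lines might look something like this. It is a sketch of the reading half only, and it assumes the file is small enough to read in one go; strip_bom, open_utf8_without_bom and demo_bom are hypothetical names, not anything from the codecs module.

```python
import codecs
import io
import os
import tempfile

# U+FEFF is what the UTF-8 BOM bytes (EF BB BF) turn into after decoding.
BOM = u"\ufeff"


def strip_bom(text):
    """Drop a single leading BOM character, if there is one."""
    return text[1:] if text.startswith(BOM) else text


def open_utf8_without_bom(path):
    """Hypothetical wrapper: decode the whole file, strip the BOM,
    and return a file-like object that a parser module can read from."""
    with open(path, "rb") as f:
        text = f.read().decode("utf-8")
    return io.StringIO(strip_bom(text))


def demo_bom():
    """Write a BOM-prefixed file the way a Windows editor would, then read line one."""
    fd, path = tempfile.mkstemp(suffix=".ini")
    os.close(fd)
    try:
        with open(path, "wb") as f:
            f.write(codecs.BOM_UTF8 + b"[book]\ntitle = x\n")
        return open_utf8_without_bom(path).readline()
    finally:
        os.remove(path)
```

Because the wrapper returns a file-like object, the module that does the actual parsing never sees the BOM.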
Further reading indicates that a utf-8-sig decoder will strip the BOM if present. Although the patch was written last April, it wasn't committed until January, i.e. after the release of the most recent version.
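On a Python that ships the utf-8-sig codec, the difference from plain utf-8 can be seen with a couple of byte strings. The sample bytes below are invented for illustration.

```python
import codecs

# The same sample line, once with a leading UTF-8 BOM and once without.
with_bom = codecs.BOM_UTF8 + u"[book]\n".encode("utf-8")
without_bom = u"[book]\n".encode("utf-8")

# utf-8-sig drops the BOM when it is there and is harmless when it is not.
decoded_with = with_bom.decode("utf-8-sig")
decoded_without = without_bom.decode("utf-8-sig")

# Plain utf-8 keeps the BOM as a U+FEFF character at the start of the text.
plain = with_bom.decode("utf-8")
```

Here decoded_with and decoded_without come out identical, while plain still starts with U+FEFF.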
Also, the utf-8 encoder doesn't give me \r\n line terminators; it only gives \n, which means that the files are not editable in Notepad. Guess what the default application is for opening .ini files in Windows?
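The encoder only turns characters into bytes, and files opened through codecs are opened in binary mode, so nothing translates the newlines. One possible workaround, sketched here with the made-up helpers to_crlf and encode_for_notepad, is to convert the line endings before encoding.

```python
def to_crlf(text):
    """Convert every line ending to \\r\\n without doubling existing ones."""
    # Normalise first so an existing \r\n does not become \r\r\n.
    normalized = text.replace("\r\n", "\n")
    return normalized.replace("\n", "\r\n")


def encode_for_notepad(text, encoding="utf-8"):
    """Hypothetical helper: Windows line endings first, then encode to bytes."""
    return to_crlf(text).encode(encoding)
```

The resulting bytes would then need to be written to a file opened in binary mode, so that nothing touches the line endings a second time.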
Addendum, March 17
Screw all this. I'm tired of wasting my time on this. I've switched over to utf-16, which is the native Windows Unicode encoding, anyway. If Python ever gets the fix in, or if Windows decides to drop the utf-8 BOM, then I'll switch back.
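For what it's worth, the utf-16 codec takes care of the BOM in both directions, which is why the switch gets rid of the problem. A quick check, using invented sample text:

```python
import codecs

text = u"[book]\r\ntitle = Caf\u00e9\r\n"

# Encoding with "utf-16" writes a BOM first, in the machine's native byte order.
data = text.encode("utf-16")
has_bom = data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# Decoding with "utf-16" reads the BOM to pick the byte order, then discards it.
round_tripped = data.decode("utf-16")
```

The decoded text comes back without a stray U+FEFF at the front, so the parser never has to know the BOM was there.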
