
My Python Gotchas: Non-UTF Encoding Support Does Not Always Work

    As a lot of tutorials claim, Python is a very good platform for manipulating XML. That's true, but there is a trap waiting for you.

Compared with other XML APIs, such as MSXML or Apache Xerces-C, the default Python XML APIs (xml.dom, etree, or even PyXML) do not provide full support for multiple encodings. For example, if you come from China, you might want to encode your documents in GB2312/GBK or BIG5. However, parsing a document that starts with a declaration like the one below can fail with the following error (tried on Python 2.6.2):

<?xml version="1.0" encoding="GB2312"?>

import xml.dom.minidom
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/xml/dom/minidom.py", line 1918, in parse
    return expatbuilder.parse(file)
  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 924, in parse
    result = builder.parseFile(fp)
  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: unknown encoding: line 1, column 30

This confuses most Python newcomers, because the document above is actually a valid XML document. It is also surprising because Python itself has handled Unicode since Python 1.6 and ships a GB2312 codec in its standard library.
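To see that the limitation is not in Python's own codecs, here is a minimal sketch (written for Python 3, not the 2.6 interpreter used above) that round-trips a short Chinese string through GB2312. The sample string is just an illustration.

```python
# Two Chinese characters, written as escapes so the source file stays ASCII.
text = "\u4e2d\u6587"

# Python's built-in codec encodes and decodes GB2312 without any XML library involved.
raw = text.encode("gb2312")
round_trip = raw.decode("gb2312")
```

So the bytes themselves are no problem for Python; the failure only shows up once they are handed to the XML parser.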

But my friend, don't just start complaining: Python does nothing wrong.

The strange behavior comes from the library. Almost all of Python's XML libraries use Expat as the underlying parser. Expat currently has built-in support for only a few encodings: US-ASCII, UTF-8, UTF-16, and ISO-8859-1 (if you are working on Linux/UNIX, check the man page of the xmlwf tool). This is nevertheless compliant with the XML standard, section 2.2:

The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode [Unicode]; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in 4.3.3 Character Encoding in Entities.

As we can see, the XML standard only requires support for UTF-8 and UTF-16, and XML parsers are free to decide whether to support other encodings. Expat supports four encodings, and that's all.

So, if you want to use XML with Python, think carefully when choosing your default encoding. If you are required to support "any" encoding, don't just use Python's XML libraries blindly.
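If you are stuck with documents in GB2312 or similar encodings, one possible workaround is to let Python's codecs do the decoding and hand Expat UTF-8 instead. The sketch below (Python 3) is only an illustration under some assumptions: the helper name parse_any_encoding is made up for this post, the declared encoding must be ASCII-compatible so the XML declaration can be read as plain bytes, and Python must know a codec with the declared name (otherwise a LookupError is raised).

```python
import re
import xml.dom.minidom

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the document, e.g. <?xml version="1.0" encoding="GB2312"?>
_DECLARED_ENCODING = re.compile(
    rb'^(<\?xml[^>]*?encoding\s*=\s*["\'])([A-Za-z][A-Za-z0-9._-]*)(["\'])'
)


def parse_any_encoding(raw):
    """Parse XML bytes by transcoding them to UTF-8 before Expat sees them.

    Hypothetical helper, not part of the standard library.
    """
    match = _DECLARED_ENCODING.match(raw)
    if match is None:
        # No readable declaration: leave the input to Expat's own detection.
        return xml.dom.minidom.parseString(raw)

    declared = match.group(2).decode("ascii")
    # Rewrite the declaration so that it says UTF-8 ...
    patched = _DECLARED_ENCODING.sub(rb"\g<1>UTF-8\g<3>", raw, count=1)
    # ... and make the bytes agree with it, using Python's own codec.
    utf8 = patched.decode(declared).encode("utf-8")
    return xml.dom.minidom.parseString(utf8)


# A small GB2312-encoded sample document, built in memory for the demo.
sample = (
    '<?xml version="1.0" encoding="GB2312"?>'
    "<greeting>\u4f60\u597d</greeting>"
).encode("gb2312")

document = parse_any_encoding(sample)
greeting = document.documentElement.firstChild.data
```

This keeps the rest of the program on the standard DOM API; the price is one extra pass over the document and the ASCII-compatibility assumption mentioned above.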


GoTask: Working Model (2) Queries and Task Management



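In my experience, short-term tasks are usually measured in hours, which should be familiar to every Outlook user: when you create a meeting invitation, Outlook always sets the time in half-hour units by default, and a task generally lasts a full day at most, the so-called All Day Event. But Outlook is not good at displaying or handling tasks that last longer than one day. Recurring tasks might count as one kind, but they are really a collection of short-term tasks that can be broken apart. At work I have noticed that my colleagues rarely define many multi-day tasks in Outlook.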



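  1. Find every task that has already started and whose start date is more than one week ago.
  2. Find every task that has already started and whose start date is more than three weeks ago.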
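The two queries above differ only in the cutoff, so they could be sketched as one function. This is a rough Python illustration, not GoTask's actual data model: the Task class, its fields, and started_at_least_weeks_ago are all hypothetical names, and a start date of None is assumed to mean "not started yet".

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Task:
    """Hypothetical task record; None as start date means not started."""
    name: str
    start: Optional[date]


def started_at_least_weeks_ago(tasks: List[Task], weeks: int, today: date) -> List[Task]:
    """Return the started tasks whose start date lies at least `weeks` weeks back."""
    cutoff = today - timedelta(weeks=weeks)
    return [t for t in tasks if t.start is not None and t.start <= cutoff]


# Made-up sample data for the demo.
today = date(2010, 1, 29)
tasks = [
    Task("write report", date(2010, 1, 1)),
    Task("review design", date(2010, 1, 20)),
    Task("plan release", None),
    Task("fix build", date(2010, 1, 25)),
]

one_week = [t.name for t in started_at_least_weeks_ago(tasks, 1, today)]
three_weeks = [t.name for t in started_at_least_weeks_ago(tasks, 3, today)]
```

Passing today as a parameter instead of calling date.today() inside the function keeps the query easy to test.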


