Compared with other XML APIs, such as MSXML or Apache Xerces-C (originally from IBM), Python's default XML APIs (xml.dom, etree, or even PyXML) do not fully support multiple encodings. For example, if you come from China, you might want to encode your document in GB2312/GBK or Big5. However, you may get an error like the one below when you parse it (tried on Python 2.6.2):
<?xml version="1.0" encoding="GB2312"?>
<hello>world</hello>
import xml.dom.minidom
xml.dom.minidom.parse('test.gbk.xml')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/xml/dom/minidom.py", line 1918, in parse
    return expatbuilder.parse(file)
  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 924, in parse
    result = builder.parseFile(fp)
  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: unknown encoding: line 1, column 30
This confuses many Python newcomers, because the document above is actually a valid XML document. It is also surprising because Python itself can handle GB2312 correctly: the GB2312 and GBK codecs have shipped with the standard library since Python 2.4.
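As a quick illustration that the codec side is fine, here is a minimal sketch that round-trips a short Chinese string through Python's gb2312 codec. The sample string and the variable names are made up for illustration, and the snippet is written for a current Python 3 interpreter rather than the 2.6.2 used above.

```python
# Round-trip a short Chinese string through Python's built-in gb2312 codec.
text = u"\u4e16\u754c"              # "world" in Chinese, two characters
encoded = text.encode("gb2312")     # GB2312 uses two bytes per Chinese character
decoded = encoded.decode("gb2312")  # back to a Unicode string

print(len(encoded), decoded == text)
```

So the codec itself is available; the failure in the traceback has to come from somewhere else.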
But my friend, don't just start complaining: Python does nothing wrong.
The strange behavior comes from the underlying library. Python uses Expat as the parser behind almost all of its XML libraries. Expat currently has built-in support for only a few encodings: US-ASCII, UTF-8, UTF-16, and ISO-8859-1 (if you are working on Linux/UNIX, check the man page of the xmlwf tool). Even so, this is compliant with the XML standard, section 2.2:
The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode [Unicode]; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in 4.3.3 Character Encoding in Entities.
As we can see, the XML standard requires support only for UTF-8 and UTF-16, and XML parsers are free to decide whether to support any other encodings. Expat supports four encodings, and that's all.
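To see the other side of this, the small sketch below encodes the same tiny document in each of the four encodings listed above and parses it with xml.dom.minidom. The helper name `parse_hello` and the list `EXPAT_ENCODINGS` are made up for this illustration, and the code targets a current Python 3 interpreter, so treat it as a check you can run yourself rather than a statement about every Expat build.

```python
import xml.dom.minidom

# The four encodings the text says Expat handles out of the box.
EXPAT_ENCODINGS = ["US-ASCII", "UTF-8", "UTF-16", "ISO-8859-1"]


def parse_hello(encoding):
    """Encode a tiny document in `encoding` and return the root element's text."""
    source = u'<?xml version="1.0" encoding="%s"?><hello>world</hello>' % encoding
    # Python's codec writes the bytes; Expat then has to read them back.
    dom = xml.dom.minidom.parseString(source.encode(encoding))
    return dom.documentElement.firstChild.data


results = dict((name, parse_hello(name)) for name in EXPAT_ENCODINGS)
for name in EXPAT_ENCODINGS:
    print(name, results[name])
```

Swapping "GB2312" into the list is an easy way to reproduce the failure on your own installation; the exact error message may differ from the one shown for Python 2.6.2.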
So, if you want to use XML with Python, think carefully when choosing your default encoding. If you are required to support "any" encoding, don't just use Python's XML libraries blindly.
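If you do have to read a GB2312/GBK/Big5 document, one possible workaround (not something tested in this post) is to let Python's own codecs decode the bytes and hand Expat a UTF-8 copy. The sketch below is only that, a sketch: `parse_transcoded` and `_DECL_ENCODING` are hypothetical names, and it assumes the whole document fits in memory, that the XML declaration is readable as ASCII (true for GB2312, GBK, and Big5, not for UTF-16), and that the declared encoding name is one Python's codec registry recognizes.

```python
import re
import xml.dom.minidom

# Hypothetical helper pattern: the encoding pseudo-attribute of an XML declaration.
_DECL_ENCODING = re.compile(
    r'(<\?xml[^>]*?encoding=)(["\'])([A-Za-z][A-Za-z0-9._-]*)\2'
)


def parse_transcoded(raw_bytes, fallback_encoding="utf-8"):
    """Decode raw_bytes with Python's codecs, then give Expat UTF-8 instead."""
    # Peek at the start of the file; non-ASCII bytes are dropped for this peek only.
    head = raw_bytes[:200].decode("ascii", "ignore")
    match = _DECL_ENCODING.search(head)
    encoding = match.group(3) if match else fallback_encoding

    # Python, not Expat, does the real decoding work here.
    text = raw_bytes.decode(encoding)

    # Rewrite the declaration so it matches the bytes Expat will actually see.
    text = _DECL_ENCODING.sub(
        lambda m: m.group(1) + m.group(2) + "UTF-8" + m.group(2), text, count=1
    )
    return xml.dom.minidom.parseString(text.encode("utf-8"))


# Usage: build a GB2312 document in memory and parse it through the helper.
sample = u'<?xml version="1.0" encoding="GB2312"?>\n<hello>\u4e16\u754c</hello>'
doc = parse_transcoded(sample.encode("gb2312"))
print(doc.documentElement.tagName)
```

The trade-off is that the original bytes are no longer what the parser sees, so anything that depends on byte offsets or on the declared encoding being preserved has to be handled separately.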