Handling character-set problems when parsing XML in Python

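Author: 2hei, posted May 7, 2010, 18:12
Copyright notice: Reposting is permitted, provided the repost links to the original source and includes the author information and this copyright notice.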
http://www.2hei.net/mt/2010/05/python-xml-encoding-utf8.html

Python version: 2.6

Case 1: test.xml
<?xml version="1.0" encoding="utf8"?>
Call:
xmldoc = minidom.parse("test.xml")
Error:
Traceback (most recent call last):
  File "D:\project\src\myapp\src\xml\testdomxml.py", line 14, in <module>
    xmldoc = minidom.parse(response)
  File "D:\Python\lib\xml\dom\minidom.py", line 1918, in parse
    return expatbuilder.parse(file)
  File "D:\Python\lib\xml\dom\expatbuilder.py", line 928, in parse
    result = builder.parseFile(file)
  File "D:\Python\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: unknown encoding: line 1, column 30


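After the fix: test2.xml
<?xml version="1.0" encoding="utf-8"?>
Calling again:
xmldoc = minidom.parse("test2.xml")
This time it parses without error. The only change is the encoding name, "utf8" to "utf-8". How embarrassing!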
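
When the XML file itself cannot be edited, the same fix can be applied in memory before parsing. The following is a minimal sketch, not the author's method: it is written for Python 3 rather than the 2.6 used in this post, and the helper names normalize_utf8_label and parse_xml_bytes are hypothetical. It only rewrites the label "utf8" in a leading XML declaration and leaves everything else untouched.

```python
import re
from xml.dom import minidom

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the document, but only when its value is the label "utf8".
_UTF8_LABEL_RE = re.compile(
    br'^(<\?xml[^>]*?encoding\s*=\s*["\'])utf8(["\'])', re.IGNORECASE
)


def normalize_utf8_label(raw):
    """Return raw XML bytes with encoding="utf8" rewritten to "utf-8"."""
    return _UTF8_LABEL_RE.sub(br'\1utf-8\2', raw, count=1)


def parse_xml_bytes(raw):
    """Hypothetical helper: normalize the declaration, then parse with minidom."""
    return minidom.parseString(normalize_utf8_label(raw))


if __name__ == "__main__":
    sample = b'<?xml version="1.0" encoding="utf8"?><root><item>ok</item></root>'
    doc = parse_xml_bytes(sample)
    print(doc.documentElement.tagName)
```

Documents that already declare "utf-8", or that have no declaration at all, pass through unchanged.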

For details, see the Python bug tracker: http://bugs.python.org/msg63471


Case 2:
xmldoc = minidom.parse(urllib.urlopen('http://rss.sina.com.cn/news/marquee/ddt.xml'))
This call works.

xmldoc = minidom.parse(urllib.urlopen('http://news.163.com/special/00011K6L/rss_newstop.xml'))
Error:
  File "D:\Python\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: unknown encoding: line 1, column 30

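Comparing the sina and 163 RSS source files, I found nothing obviously wrong. However, after saving the 163 feed to a file, rssnew163.xml, and adding this line at the top:
<?xml version="1.0" encoding="utf-8"?>
and then calling
xmldoc = minidom.parse("rssnew163.xml")
the problem went away, so this is a character-encoding issue again.
When urllib is used to fetch RSS feeds live, the feed needs preprocessing first: save the RSS file and add the line above, or convert the XML file to UTF-8.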
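
The "convert to UTF-8" option can also be done in memory instead of through a saved file. The sketch below is an illustration under stated assumptions, not the author's code: it targets Python 3, the helper name parse_as_utf8 is hypothetical, and the caller must already know the feed's real encoding. The 'gbk' used in the example is an assumption, since the post does not say what the 163 feed declares.

```python
import re
from xml.dom import minidom

# Matches an XML declaration at the very start of the decoded text.
_DECLARATION_RE = re.compile(r'^<\?xml[^>]*\?>')

_UTF8_DECLARATION = b'<?xml version="1.0" encoding="utf-8"?>'


def parse_as_utf8(raw, source_encoding):
    """Re-encode raw XML bytes as UTF-8 and parse them with minidom.

    source_encoding is supplied by the caller (an assumption of this
    sketch); it is not detected from the data.
    """
    text = raw.decode(source_encoding)
    # Drop the original declaration so its encoding label cannot conflict
    # with the UTF-8 bytes produced below.
    text = _DECLARATION_RE.sub('', text, count=1)
    return minidom.parseString(_UTF8_DECLARATION + text.encode('utf-8'))


if __name__ == "__main__":
    # Stand-in for bytes fetched from a feed; 'gbk' is assumed for illustration.
    sample = '<?xml version="1.0" encoding="gbk"?><rss><title>新闻</title></rss>'.encode('gbk')
    doc = parse_as_utf8(sample, 'gbk')
    print(doc.getElementsByTagName('title')[0].firstChild.data)
```

In Python 3 the raw bytes would typically come from urllib.request.urlopen(url).read(); the sketch takes bytes directly so that it runs without network access.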


3 Comments

kingslee says:

The encoding string included in XML output should conform to the appropriate standards. For example, “UTF-8” is valid, but “UTF8” is not.

http://docs.python.org/library/xml.dom.minidom.html

