Python character encoding detection libraries: the difference between charade and chardet

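【Background】

I previously used Python's chardet: https://pypi.python.org/pypi/chardet (source code at https://github.com/dcramer/chardet).

Recently, while looking at how Requests handles encodings, I came across a newer character encoding detection library, charade: https://pypi.python.org/pypi/charade

So I wanted to work out how charade differs from the earlier chardet.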

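【Python character encoding detection libraries: the difference between charade and chardet】

After reading the explanation on the charade project page (https://github.com/sigmavirus24/charade), I learned the following:

The original chardet, written by Mark Pilgrim, was maintained as two separate versions, one for Python 2.x and one for Python 3.x, which made it relatively inconvenient to maintain and to use.

Most of the code in the two versions was identical, though, so unifying them was feasible. sigmavirus24 later forked chardet and, for the benefit of Requests, optimized it and merged the two versions into a single codebase: charade. In short:

charade is: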

Forked version of chardet, being ported to support python 2 and python 3 for kennethreitz/requests

https://github.com/kennethreitz/requests/issues/951

So from now on, if you need a character encoding detection library in Python, you can pick the more convenient charade.

Note: charade resources:

PyPI page: https://pypi.python.org/pypi/charade
GitHub page: https://github.com/sigmavirus24/charade
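Since charade keeps chardet's `detect()` entry point, code can be written against whichever of the two happens to be installed. The following is only a minimal sketch: `load_detector` is a hypothetical helper name introduced here, and the preference order (charade first) is an assumption, not something either project prescribes.

```python
import importlib


def load_detector(names=("charade", "chardet")):
    """Return the first importable module from names, or None (hypothetical helper)."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # not installed: try the next candidate
    return None


detector = load_detector()
if detector is not None:
    # Both libraries expose detect(), which takes a byte string.
    print(detector.__name__, detector.detect(b"hello world"))
else:
    print("neither charade nor chardet is installed")
```

The rest of a program can then call `detector.detect(...)` without caring which library was found.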

(1) chardet guesses the encoding of text files.

Detects…

ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
EUC-JP, SHIFT_JIS, ISO-2022-JP (Japanese)
EUC-KR, ISO-2022-KR (Korean)
KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
ISO-8859-2, windows-1250 (Hungarian)
ISO-8859-5, windows-1251 (Bulgarian)
windows-1252 (English)
ISO-8859-7, windows-1253 (Greek)
ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
TIS-620 (Thai)

Requires Python 2.1 or later.

(2) Charade: The Universal character encoding detector

Detects

ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
EUC-JP, SHIFT_JIS, ISO-2022-JP (Japanese)
EUC-KR, ISO-2022-KR (Korean)
KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
ISO-8859-2, windows-1250 (Hungarian)
ISO-8859-5, windows-1251 (Bulgarian)
windows-1252 (English)
ISO-8859-7, windows-1253 (Greek)
ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
TIS-620 (Thai)

Requires Python 2.6 or later

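A general fix for garbled text when scraping Chinese web pages with Python

When scraping web pages with Python, we often run into garbled text. Today I am sharing a general method for fixing it, especially for Chinese pages.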
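To make the problem concrete: garbled text appears when bytes are decoded with a codec other than the one that produced them. The snippet below is a small illustration using only the standard library; the sample string and the choice of GB2312 and Latin-1 are assumptions picked for the demonstration.

```python
text = "中文网页"
raw = text.encode("gb2312")  # what a GB2312 page sends over the wire

# Strict UTF-8 decoding rejects these bytes outright.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 accepts any byte, so it "succeeds" but yields garbage characters.
garbled = raw.decode("latin-1")

# Only the matching codec restores the original text.
restored = raw.decode("gb2312")

print(utf8_ok, repr(garbled), restored)
```

This is why the page encoding has to be known, or guessed, before decoding.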

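First, install the chardet module, which can be done with easy_install or pip.

After installing, import the module in a console; if the import succeeds without errors, the installation is fine.

For example, pages that turn up as ISO-8859-2 can also be handled with the method below.

Here is the code: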

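# Python 2 code (urllib2 and the print statement).
import urllib2
import sys
import chardet

# The URL can be swapped for http://www.baidu.com or http://www.sohu.com
req = urllib2.Request("http://www.163.com/")
content = urllib2.urlopen(req).read()
# Encoding used for output (the filesystem encoding)
typeEncode = sys.getfilesystemencoding()
# Let the third-party module guess the page encoding; the value is None when detection fails
infoencode = chardet.detect(content).get('encoding') or 'utf-8'
# Decode to unicode first, then encode to the system encoding for output
html = content.decode(infoencode, 'ignore').encode(typeEncode)
print html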

The code above should solve the garbled-text problem in your scraping.
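The listing above targets Python 2. For readers on Python 3, the same fetch, detect, decode steps might look roughly like the sketch below. It is an adaptation, not the original author's code: `to_text` and `fetch_text` are hypothetical names, the detector is passed in as a callable so the decoding step can be exercised without chardet, and no explicit re-encoding is done because Python 3's `print` handles the output encoding.

```python
import urllib.request


def to_text(raw, detect, default="utf-8"):
    """Decode raw bytes using the encoding guessed by detect(raw)."""
    guess = detect(raw)
    # The "encoding" value is None when the detector cannot decide.
    encoding = guess.get("encoding") or default
    return raw.decode(encoding, errors="ignore")


def fetch_text(url):
    """Download url and decode it with chardet (imported lazily)."""
    import chardet  # third-party: pip install chardet

    with urllib.request.urlopen(url) as response:
        raw = response.read()
    return to_text(raw, chardet.detect)


# Usage (needs network access and chardet installed):
# print(fetch_text("http://www.163.com/"))
```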

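Note: chardet.detect(content) returns a dictionary such as {'confidence': 0.99, 'encoding': 'utf-8'}, so the dictionary's get() method is then used to read the value of the encoding key. Be aware that this value can be None when detection fails; because the key is still present, the default passed to get() is not used in that case.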
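The shape of that result dictionary suggests a small guard around it. The sketch below works on plain dictionaries, so it runs without chardet; `pick_encoding` is a hypothetical helper, and the 0.5 confidence cut-off is an arbitrary assumption to tune for your own data.

```python
def pick_encoding(result, default="utf-8", min_confidence=0.5):
    """Choose an encoding from a detect()-style result dict (hypothetical helper)."""
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    # Fall back when detection failed or the guess is too uncertain.
    if not encoding or confidence < min_confidence:
        return default
    return encoding


failed = {"encoding": None, "confidence": 0.0}
print(failed.get("encoding", "utf-8"))  # None: the key exists, so the default is ignored
print(pick_encoding(failed))            # utf-8
print(pick_encoding({"encoding": "GB2312", "confidence": 0.99}))  # GB2312
```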

P.S. In practice, the chardet module cannot identify a file's encoding 100% of the time; it sometimes guesses wrong.
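Because a guess can be wrong, one possible hedge is to try several candidate encodings with strict decoding and keep the first that succeeds. This is a sketch of that idea rather than anything chardet provides: `decode_with_fallback` is a hypothetical helper, and the candidate order shown (detected encoding, then UTF-8, then GB18030 for Chinese pages) is an assumption.

```python
def decode_with_fallback(raw, candidates):
    """Return (text, encoding) for the first candidate that decodes raw strictly."""
    for encoding in candidates:
        if not encoding:
            continue  # skip None, e.g. a failed detection
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
    # Nothing matched: decode lossily so the caller still gets some text.
    return raw.decode("utf-8", errors="replace"), None


raw = "中文网页".encode("gb2312")
text, used = decode_with_fallback(raw, [None, "utf-8", "gb18030"])
print(used, text)
```

Strict decoding only proves that the bytes are valid in a codec, not that the codec is the right one, so the most specific candidates should come first.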

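When reposting, please credit: jinglingshu's blog » Python character encoding detection libraries: the difference between charade and chardet || A general fix for garbled text when scraping Chinese web pages with Python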
