Using chardet to detect string encodings in Python

Background

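String encoding in Python can be a real headache. In web development, strings entered by users are passed straight through from the front end, and unusual characters can force you to deal with Python's encoding and decoding. Python itself converts between str and bytes with the encode() and decode() methods, but the awkward part is that you must first know which encoding the data uses before you can encode or decode it correctly. A web search turns up a Python library called chardet that is built for exactly this kind of encoding mystery.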
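To make the problem concrete, here is a minimal sketch using only the standard library. The sample text and the helper name try_decode are illustrative assumptions, not part of chardet. It shows that the same text produces different bytes under different encodings, and that decoding with the wrong one either fails or silently yields garbage.

```python
# Illustrative sample text; any non-ASCII string would do.
text = "编码"

utf8_bytes = text.encode("utf-8")  # 3 bytes per character here
gbk_bytes = text.encode("gbk")     # 2 bytes per character here


def try_decode(raw, encoding):
    """Hypothetical helper: return the decoded text, or None if decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Decoding with the matching encoding round-trips the text.
right = try_decode(gbk_bytes, "gbk")

# Decoding GBK bytes as UTF-8 fails outright.
wrong = try_decode(gbk_bytes, "utf-8")

# Decoding UTF-8 bytes as latin-1 never fails, but the result is mojibake.
garbled = try_decode(utf8_bytes, "latin-1")

print(len(utf8_bytes), len(gbk_bytes))
print(right == text, wrong is None, garbled == text)
```

The bytes alone do not say which encoding produced them, which is why a detection library is useful.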

Introduction

Project page: https://github.com/chardet/chardet

Supported character sets

ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP (Japanese)
EUC-KR, ISO-2022-KR, Johab (Korean)
KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
ISO-8859-5, windows-1251 (Bulgarian)
ISO-8859-1, windows-1252 (Western European languages)
ISO-8859-7, windows-1253 (Greek)
ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
TIS-620 (Thai)

Required version: Python 3.6+. (In practice the author found that Python 2.7 also works; chardet 5.0 and later no longer support it.)

Installation

sudo pip3 install chardet

Usage

1. Command line

chardetect somefile someotherfile

Example:

chardetect get-pip.py tune.sh

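The output (shown as a screenshot in the original post) gives the detected encoding of both files along with the confidence of each guess: 99% and 100%.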

2. Python module

1) Detecting the encoding with chardet.detect

import urllib.request

import chardet

# Fetch the raw, undecoded bytes of the page
rawdata = urllib.request.urlopen('http://yahoo.co.jp/').read()

# Detect the encoding of rawdata
print(chardet.detect(rawdata))
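The dictionary returned by chardet.detect is normally used to decode the bytes. The following is a minimal sketch of that step, not an official chardet recipe. The function name decode_with_guess, the injectable detect parameter, the 0.5 confidence threshold, and the UTF-8 fallback are all assumptions made for illustration.

```python
def decode_with_guess(raw, detect=None, min_confidence=0.5, fallback="utf-8"):
    """Decode raw bytes with a detected encoding, falling back when unsure.

    Hypothetical helper: `detect` is any callable that returns a dict with
    'encoding' and 'confidence' keys, as chardet.detect does.
    Returns (text, encoding_actually_used).
    """
    if detect is None:
        import chardet  # third-party: pip install chardet
        detect = chardet.detect

    guess = detect(raw)
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0

    # The detector may report no encoding at all, or only a weak guess.
    if encoding is None or confidence < min_confidence:
        encoding = fallback

    try:
        return raw.decode(encoding, errors="replace"), encoding
    except LookupError:
        # The detected name may not be a codec that Python itself knows.
        return raw.decode(fallback, errors="replace"), fallback
```

Called as decode_with_guess(rawdata), it uses chardet.detect and returns the decoded text together with the encoding name that was actually applied. Because errors="replace" is used, undecodable bytes become replacement characters instead of raising an exception, which may or may not be what a given application wants.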

2) Detecting the encoding of a large file with UniversalDetector (non-greedy mode: it stops as soon as it is confident)

# coding: utf8
import urllib.request
from chardet.universaldetector import UniversalDetector

usock = urllib.request.urlopen('http://yahoo.co.jp/')
# Create the UniversalDetector object
detector = UniversalDetector()
# Iterate over the response line by line
for line in usock.readlines():
    # Feed the current line to the detector, which updates its guess
    detector.feed(line)
    # detector.done becomes True once enough lines have been fed to guess the encoding
    if detector.done:
        break
# close() runs a final round of calculations in case the detector never became confident
detector.close()
usock.close()
print(detector.result)

Final printed result: {'confidence': 0.99, 'language': '', 'encoding': 'utf-8'}

3) Detecting the encoding of several large files with UniversalDetector (non-greedy mode)

# coding: utf8
import glob
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
# Iterate over every large file ending in .xml
for filename in glob.glob('*.xml'):
    print(filename.ljust(60), end=' ')
    # Call reset() before each round of detection
    detector.reset()
    with open(filename, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    # Call close() after each round of detection
    detector.close()
    print(detector.result)
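The reset/feed/close cycle can also be wrapped in a small reusable function. This is a sketch under stated assumptions rather than part of chardet: the name detect_file_encoding, the optional detector parameter, and the 64 KiB chunk size are illustrative. It reads fixed-size binary chunks instead of lines, so a file with few or no newlines is still fed to the detector in bounded pieces.

```python
def detect_file_encoding(path, detector=None, chunk_size=64 * 1024):
    """Feed a file to a detector chunk by chunk and return its result dict.

    Hypothetical helper: `detector` is any object with the reset(), feed(),
    close(), done, and result members that UniversalDetector provides.
    Pass one in to reuse a single detector across many files.
    """
    if detector is None:
        # third-party: pip install chardet
        from chardet.universaldetector import UniversalDetector
        detector = UniversalDetector()

    # Start every file from a clean state.
    detector.reset()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            detector.feed(chunk)
            # Stop early once the detector is confident.
            if detector.done:
                break
    # Finalize the guess even if the detector never became confident.
    detector.close()
    return dict(detector.result)
```

With a shared detector, the loop over glob.glob('*.xml') reduces to calling detect_file_encoding(filename, detector) for each file.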

That covers how to use chardet to detect character sets. Do you feel more confident about Python encoding problems now?

Reference: https://chardet.readthedocs.io/en/latest/usage.html#example-using-the-detect-function

Author: 测试生财 (a test-development programmer)
csdn: https://blog.csdn.net/ccgshigao
cnblogs: https://www.cnblogs.com/qa-freeroad/
51cto: https://blog.51cto.com/14900374

Original URL: http://outofmemory.cn/langs/1188436.html