Converting Text Encoding in Python: Implementation and Analysis

Posted in Python on August 27, 2019

While preparing a weekly report recently, I needed to extract data from CSV files, build tables from it, and then generate charts.

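To read the CSV content I almost always open the file with with open(filename, encoding='UTF-8') as f:. In practice, though, some of the CSV files turned out not to be UTF-8 encoded, so the program raised an error at run time. Each time I had to change the file's encoding to UTF-8 by hand and run the program again, so I decided to detect and convert the encoding inside the program itself.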

The basic idea: first check whether the file is UTF-8 encoded; if it is not, convert it to UTF-8, and then process it.
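As a minimal sketch of the first half of this idea, the question "is this file already UTF-8?" can also be answered with the standard library alone, by attempting a strict decode. This is a different technique from the chardet-based detection used in the rest of the post, and the helper name is_utf8 is made up for illustration.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict mode: raises on any invalid byte sequence
    except UnicodeDecodeError:
        return False
    return True


# Sample data is made up: the same text encoded two different ways.
print(is_utf8("中文".encode("utf-8")))
print(is_utf8("中文".encode("gbk")))
```

A strict decode can only say whether the bytes are valid UTF-8; it cannot say which encoding a non-UTF-8 file actually uses, which is where a detector such as chardet comes in.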

Python's chardet library can report the encoding of a piece of text:

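The detect function takes a single non-Unicode string argument (a bytes object in Python 3) and returns a dictionary, for example {'encoding': 'utf-8', 'confidence': 0.99}. The dictionary contains the detected encoding and the confidence of the detection.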

import chardet

def get_encode_info(file):
  # Open in binary mode: chardet.detect expects raw bytes, not str
  with open(file, 'rb') as f:
    return chardet.detect(f.read())['encoding']

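This performs acceptably on small files, but becomes slow once the file is somewhat larger. My local CSV file is close to 200 KB, and the slowdown was already noticeable. For this case chardet provides the UniversalDetector class: create a UniversalDetector object, then call its feed method repeatedly with each block of text. Once the detector reaches its minimum confidence threshold, it sets detector.done to True.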

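Once you have exhausted the source text, call detector.close(), which performs some final calculations in case the detector has not yet reached its minimum confidence threshold. detector.result is then a dictionary containing the detected character encoding and the confidence (the same as what chardet.detect returns).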

from chardet.universaldetector import UniversalDetector

def get_encode_info(file):
  with open(file, 'rb') as f:
    detector = UniversalDetector()
    # Feed the file line by line and stop as soon as the detector is confident
    for line in f.readlines():
      detector.feed(line)
      if detector.done:
        break
    detector.close()
    return detector.result['encoding']
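One possible variation on the function above: f.readlines() loads every line into memory before the loop starts, so a sketch that reads fixed-size chunks may suit larger files better. The function name detect_encoding_chunked, the chunk_size default, and the optional detector parameter are all assumptions made for illustration; only feed, done, close, and result come from the chardet usage described above.

```python
def detect_encoding_chunked(path, chunk_size=4096, detector=None):
    """Feed a file to a chardet-style detector in fixed-size chunks."""
    if detector is None:
        # chardet is a third-party package; imported lazily so that
        # another object with the same interface can be passed in instead.
        from chardet.universaldetector import UniversalDetector
        detector = UniversalDetector()
    with open(path, "rb") as f:
        # iter() with a sentinel calls f.read(chunk_size) until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            detector.feed(chunk)
            if detector.done:  # confident enough, stop reading early
                break
    detector.close()
    return detector.result["encoding"]


# Usage (the file name is hypothetical):
# print(detect_encoding_chunked("test.csv"))
```

Whether this is faster than feeding line by line depends on the file; the main difference is that only one chunk is held in memory at a time.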

During the encoding conversion I ran into this error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 178365: character maps to <undefined>

def read_file(file):
  with open(file, 'rb') as f:
    return f.read()

def write_file(content, file):
  with open(file, 'wb') as f:
    f.write(content)

def convert_encode2utf8(file, original_encode, des_encode):
  file_content = read_file(file)
  file_decode = file_content.decode(original_encode)  # --> the error is raised here
  file_encode = file_decode.encode(des_encode)
  write_file(file_encode, file)
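The failure can be reproduced in isolation. The sketch below assumes the detected encoding was a Windows code page such as cp1252 (the post does not say which encoding was reported); in Python's cp1252 table the byte 0x90 has no character assigned, so a strict decode fails. The helper name first_undecodable_byte and the sample bytes are made up for illustration.

```python
def first_undecodable_byte(data, encoding):
    """Return (offset, byte value) of the first byte that fails a strict decode, or None."""
    try:
        data.decode(encoding)  # errors="strict" is the default
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte the codec could not decode
        return exc.start, data[exc.start]
    return None


sample = b"price\x90,10\n"  # made-up data containing the byte from the traceback
print(first_undecodable_byte(sample, "cp1252"))
print(first_undecodable_byte(b"price,10\n", "cp1252"))
```

Locating the offending byte this way can help decide whether the detected encoding was simply wrong or the file really contains a few stray bytes.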

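This happens because some of the bytes cannot be decoded with the given encoding; the fix is to pass an additional argument, errors. The official documentation says: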

bytearray.decode(encoding="utf-8", errors="strict")

Return a string decoded from the given bytes. Default encoding is 'utf-8'. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error(), see section Error Handlers. For a list of possible encodings, see section Standard Encodings.

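In other words, the bytes are decoded into a string using the given encoding (UTF-8 by default), and errors selects how decoding errors are handled. The default, 'strict', raises a UnicodeError; changing it to 'ignore' or 'replace' avoids the exception. Note that 'ignore' silently drops the bytes that cannot be decoded, while 'replace' substitutes U+FFFD for them.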

So changing the line file_decode = file_content.decode(original_encode) to file_decode = file_content.decode(original_encode, 'ignore') is enough.
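To make the trade-off concrete, here is a small standard-library sketch comparing the error handlers on a made-up byte string. It is only an illustration of bytes.decode, not part of the original program.

```python
raw = b"caf\xe9!"  # 0xE9 is "é" in Latin-1 but not a valid UTF-8 sequence here

ignored = raw.decode("utf-8", errors="ignore")    # the undecodable byte is dropped
replaced = raw.decode("utf-8", errors="replace")  # the undecodable byte becomes U+FFFD
correct = raw.decode("latin-1")                   # the right codec loses nothing

print(repr(ignored), repr(replaced), repr(correct))
```

With 'ignore' the output gives no hint that a character went missing, so it seems safest to use it only when losing a few characters is acceptable.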

Complete code:

from chardet.universaldetector import UniversalDetector

def get_encode_info(file):
  with open(file, 'rb') as f:
    detector = UniversalDetector()
    for line in f.readlines():
      detector.feed(line)
      if detector.done:
        break
    detector.close()
    return detector.result['encoding']

def read_file(file):
  with open(file, 'rb') as f:
    return f.read()

def write_file(content, file):
  with open(file, 'wb') as f:
    f.write(content)

def convert_encode2utf8(file, original_encode, des_encode):
  file_content = read_file(file)
  file_decode = file_content.decode(original_encode,'ignore')
  file_encode = file_decode.encode(des_encode)
  write_file(file_encode, file)

if __name__ == "__main__":
  filename = r'C:\Users\danvy\Desktop\Automation\testdata\test.csv'
  encode_info = get_encode_info(filename)
  # Convert in place only when the detected encoding is not already UTF-8
  if encode_info != 'utf-8':
    convert_encode2utf8(filename, encode_info, 'utf-8')
  # Detect again to confirm the result of the conversion
  encode_info = get_encode_info(filename)
  print(encode_info)
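The program above overwrites the original file and reads it fully into memory. As a hedged alternative, the sketch below re-encodes into a separate output file, line by line, using only the standard library. The function name convert_to_utf8, its parameters, and the sample data are assumptions for illustration; the source encoding is passed in explicitly and could come from get_encode_info.

```python
import os
import tempfile


def convert_to_utf8(src, dst, src_encoding, errors="strict"):
    """Re-encode src as UTF-8 into dst, line by line (hypothetical helper)."""
    # newline="" disables newline translation, so line endings are preserved
    with open(src, "r", encoding=src_encoding, errors=errors, newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)


# Round-trip check on a temporary GBK file (sample data is made up).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "gbk.csv")
    dst = os.path.join(tmp, "utf8.csv")
    text = "名称,数量\r\n苹果,3\r\n"
    with open(src, "w", encoding="gbk", newline="") as f:
        f.write(text)
    convert_to_utf8(src, dst, "gbk")
    with open(dst, encoding="utf-8", newline="") as f:
        print(f.read() == text)
```

Keeping the original file untouched makes it possible to retry with a different encoding if the detected one turns out to be wrong.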

Reference: https://chardet.readthedocs.io/en/latest/usage.html

- Author -

danvy617
