This article is translated from: UnicodeDecodeError when reading CSV file in Pandas with Python
I'm running a program which is processing 30,000 similar files. A random number of them stop and produce this error...
File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
data = pd.read_csv(filepath, names=fields)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
return parser.read()
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
ret = self._engine.read(nrows)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
data = self._reader.read(nrows)
File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid continuation byte
These files were all created in, and come from, the same place. What's the best way to correct this so I can proceed with the import?
Answer #2
read_csv takes an encoding option to deal with files in different formats. I mostly use read_csv('file', encoding="ISO-8859-1"), or alternatively encoding="utf-8" for reading, and generally utf-8 for to_csv.
You can also use one of several alias options like 'latin' instead of 'ISO-8859-1' (see the Python docs, also for numerous other encodings you may encounter).
See the relevant Pandas documentation, the Python docs examples on csv files, and plenty of related questions here on SO. A good background resource is What every developer should know about unicode and character sets.
To detect the encoding (assuming the file contains non-ascii characters), you can use enca (see man page) or file -i (Linux) or file -I (OS X) (see man page).
Answer #3
Simplest of all solutions:
- Open the csv file in the Sublime text editor.
- Save the file in utf-8 format.
In Sublime, click File -> Save with encoding -> UTF-8
Then, you can read your file as usual:
import pandas as pd
data = pd.read_csv('file_name.csv', encoding='utf-8')
EDIT 1:
If there are many files, then you can skip the Sublime step.
Just read the file using
data = pd.read_csv('file_name.csv', encoding='utf-8')
and the other different encoding types are:
encoding = "cp1252"
encoding = "ISO-8859-1"
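With many files, the Sublime step can also be replaced by a small conversion script. A sketch, assuming the sources are cp1252 (swap in whatever encoding your files actually use); the paths are placeholders:

```python
import os
import tempfile

# Placeholder paths; in practice these would be your 30,000 source files.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "input.csv")
dst = os.path.join(tmpdir, "output.csv")

# Create a sample cp1252 file to convert.
with open(src, "w", encoding="cp1252") as f:
    f.write("name,city\nJosé,México\n")

# Decode with the source encoding, re-encode as UTF-8.
with open(src, encoding="cp1252") as fin, \
     open(dst, "w", encoding="utf-8") as fout:
    fout.write(fin.read())

with open(dst, encoding="utf-8") as f:
    print(f.read())
```

After this, pd.read_csv(dst, encoding='utf-8') reads the converted copy without errors.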
Answer #4
Struggled with this a while and thought I'd post on this question as it's the first search result. Adding the encoding="iso-8859-1" tag to pandas read_csv didn't work, nor did any other encoding; it kept giving a UnicodeDecodeError.
If you're passing a file handle to pd.read_csv(), you need to put the encoding attribute on the file open, not in read_csv. Obvious in hindsight, but a subtle error to track down.
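The point above can be shown with a minimal sketch, using an in-memory handle in place of a real file (io.BytesIO plus io.TextIOWrapper stand in for open() with an encoding):

```python
import io

import pandas as pd

# latin1-encoded bytes standing in for a file on disk.
csv_bytes = "a,b\n1,café\n".encode("latin1")

# The encoding goes on the handle (as it would on open()), not on read_csv:
# by the time pandas sees the handle, the bytes are already decoded to text.
fh = io.TextIOWrapper(io.BytesIO(csv_bytes), encoding="latin1")
df = pd.read_csv(fh)
print(df)
```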
Answer #5
Pandas allows you to specify the encoding, but does not allow you to ignore errors or automatically replace the offending bytes. So there is no one-size-fits-all method; the right approach depends on the actual use case.

You know the encoding, and there is no encoding error in the file. Great: you only have to specify the encoding:
file_encoding = 'cp1252'  # set file_encoding to the file encoding (utf8, latin1, etc.)
pd.read_csv(input_file_and_path, ..., encoding=file_encoding)

You do not want to be bothered with encoding questions, and only want that damn file to load, no matter whether some text fields contain garbage. OK, you only have to use Latin1 encoding, because it accepts any possible byte as input (and converts it to the unicode character of the same code):
pd.read_csv(input_file_and_path, ..., encoding='latin1')
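The "accepts any possible byte" property is easy to verify: latin1 maps every byte value 0x00-0xFF to exactly one code point, so decoding can never fail (it may just yield the wrong characters for non-latin1 data):

```python
# Every possible byte decodes under latin1, and the round-trip is lossless.
data = bytes(range(256))
text = data.decode("latin1")
assert text.encode("latin1") == data
print(len(text))  # 256 characters, one per byte
```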
You know that most of the file is written with a specific encoding, but it also contains encoding errors. A real-world example is a UTF-8 file that has been edited with a non-UTF-8 editor and contains some lines in a different encoding. Pandas has no provision for special error processing, but the Python open function has (assuming Python 3), and read_csv accepts a file-like object. Typical errors parameters to use here are 'ignore', which just suppresses the offending bytes, or (IMHO better) 'backslashreplace', which replaces the offending bytes with their Python backslashed escape sequence:
file_encoding = 'utf8'  # set file_encoding to the file encoding (utf8, latin1, etc.)
input_fd = open(input_file_and_path, encoding=file_encoding, errors='backslashreplace')
pd.read_csv(input_fd, ...)
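A minimal sketch of the 'backslashreplace' approach, again using an in-memory stream in place of a real file; the stray 0xE9 byte simulates a cp1252 edit inside an otherwise UTF-8 file:

```python
import io

import pandas as pd

# Mostly valid UTF-8, except 0xE9 (cp1252 'é') on the last row.
raw = b"name,note\nalice,ok\nbob,caf\xe9\n"

# errors='backslashreplace' keeps the bad byte visible instead of raising.
fh = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8",
                      errors="backslashreplace")
df = pd.read_csv(fh)
print(df)  # the offending byte survives as the literal text \xe9
```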
Answer #6
with open('filename.csv') as f:
print(f)
After executing this code you will see the encoding Python used to open 'filename.csv' (note: this is the platform default, not necessarily the file's actual encoding), then execute code as follows:
data = pd.read_csv('filename.csv', encoding="the encoding you found earlier")
There you go.