Description
Hello,
we validate users' CSV files. One of the validation steps checks the encoding of each line.
Our validation method looks roughly like this:
def valid_and_parse_lines
  line_number = 2 # count start after headers
  loop do
    row = csv.gets
    break unless row
    line = line_initialization(row)
    add_error(line_number, line.errors.full_messages) if line.invalid?
  rescue ArgumentError => exception
    raise if exception.message != 'invalid byte sequence in UTF-8'
    add_error(line_number, 'This line contains invalid UTF-8 characters')
  ensure
    line_number += 1
  end
  csv.rewind
end
The CSV is loaded this way:
@csv ||= ::CSV.new(file || content, encoding: Encoding::UTF_8, headers: true)
What is the issue? If some of the lines in the first 1 kB block (which the stdlib CSV parser uses for some detection) contain invalid byte sequences, some lines are skipped.
Here is an example:
labels_invalid_iso8859-1_encoding.csv.zip
We always open the files as UTF-8. This CSV contains two lines with an ISO-8859-1 character that is illegal in UTF-8: the second and fourth lines (after the header). In this case only the second line is detected; the fourth one is unfortunately skipped. The problem does not occur when the other invalid lines come after the first 1 kB block.
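To make the expectation concrete without the attachment, here is a minimal sketch of similar data. The rows and labels are made up for illustration and are not the contents of the attached file. It does not use the CSV parser at all; it only shows, via String#valid_encoding?, which data lines we would expect to be reported.

```ruby
# Hypothetical sample rows (not the attached file). 0xE9 and 0xEF are
# "é" and "ï" in ISO-8859-1; as lone bytes they are invalid in UTF-8.
data_rows = [
  "1,plain",
  "2,caf\xE9",
  "3,plain",
  "4,na\xEFve",
  "5,plain"
]

# 1-based positions (after the header) of rows that are not valid UTF-8.
invalid_positions = data_rows.each_with_index
                             .reject { |row, _index| row.valid_encoding? }
                             .map { |_row, index| index + 1 }

puts invalid_positions.inspect
```

Both the second and the fourth data row are flagged here, which is the result we would expect from the validation method above as well.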
It seems the bug relates to these lines in csv.rb