CSV parser with first 1kB block and wrong encoding causes unexpected behavior #25

Closed
@deepj

Hello,
we validate users' CSV files. One of the validation steps checks the encoding of each line.

Our validation method looks roughly like this:

    def valid_and_parse_lines
      line_number = 2 # counting starts after the header row

      loop do
        row = csv.gets
        break unless row
        line = line_initialization(row)
        add_error(line_number, line.errors.full_messages) if line.invalid?
      rescue ArgumentError => exception
        raise if exception.message != 'invalid byte sequence in UTF-8'
        add_error(line_number, 'This line contains invalid UTF-8 characters')
      ensure
        line_number += 1
      end

      csv.rewind
    end

The CSV is loaded this way:

    @csv ||= ::CSV.new(file || content, encoding: Encoding::UTF_8, headers: true)

What is the issue? If more than one line within the first 1 kB block (which the stdlib CSV parser reads for some detection) contains invalid bytes, some of those lines are skipped.

Here is an example:
labels_invalid_iso8859-1_encoding.csv.zip

We always open the files as UTF-8. This CSV contains two lines with a character encoded in ISO-8859-1, which is an invalid byte sequence in UTF-8. The second and fourth lines (after the header) contain the illegal character. In this case, only the second line is detected; the fourth one is unfortunately skipped. The problem does not occur for invalid lines that come after the first 1 kB block.
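To cross-check which lines should have been reported, independently of the CSV parser, the raw lines can be tested with `String#valid_encoding?`. This is only a minimal sketch: the helper name `invalid_utf8_line_numbers` is hypothetical, and the sample data is made up to mirror the layout described above (it is not the content of the attached file).

```ruby
require "stringio"

# Hypothetical helper (not part of csv.rb): returns the 1-based numbers of
# the lines whose bytes are not valid UTF-8. The header counts as line 1.
def invalid_utf8_line_numbers(io)
  invalid = []
  io.each_line.with_index(1) do |raw, number|
    # Reinterpret the raw bytes as UTF-8 without transcoding them.
    line = raw.dup.force_encoding(Encoding::UTF_8)
    invalid << number unless line.valid_encoding?
  end
  invalid
end

# Made-up sample: \xFC and \xF6 are ISO-8859-1 bytes that are invalid in UTF-8.
content = "name,city\n" \
          "a,Prague\n" \
          "b,Z\xFCrich\n" \
          "c,Berlin\n" \
          "d,K\xF6ln\n"

p invalid_utf8_line_numbers(StringIO.new(content.b)) # => [3, 5]
```

Here both invalid lines (the second and fourth after the header) are found, which is the result we expected from the validation loop as well.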

It seems the bug relates to these lines in csv.rb
