CSV parser with first 1kB block and wrong encoding causes unexpected behavior

Hello,
we validate users' CSV files. One of the validation steps checks the encoding of each line.

Our validation method looks roughly like this:

```ruby
    def valid_and_parse_lines
      line_number = 2 # count start after headers

      loop do
        row = csv.gets
        break unless row
        line = line_initialization(row)
        add_error(line_number, line.errors.full_messages) if line.invalid?
      rescue ArgumentError => exception
        raise if exception.message != 'invalid byte sequence in UTF-8'
        add_error(line_number, 'This line contains invalid UTF-8 characters')
      ensure
        line_number += 1
      end

      csv.rewind
    end
```

The CSV is loaded this way:
```ruby
@csv ||= ::CSV.new(file || content, encoding: Encoding::UTF_8, headers: true)
```

What is the issue? If any lines within the first 1 kB block (which the stdlib CSV parser reads for some detection) contain invalid byte sequences, some of those lines are skipped.

Here is an example:
[labels_invalid_iso8859-1_encoding.csv.zip](https://github.com/ruby/csv/files/1753393/labels_invalid_iso8859-1_encoding.csv.zip)

We always open the files as UTF-8. This CSV contains two lines with an illegal ISO-8859-1 character: the second and fourth lines after the header. In this case only the second line is detected; the fourth one is unfortunately skipped. Invalid lines located after the first 1 kB block do not show this problem.
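To make the expected result concrete, here is a minimal sketch that checks the raw bytes line by line without involving the CSV parser at all. The helper `invalid_utf8_line_numbers` and the sample data are hypothetical (they are not part of our application, the attached file, or the csv library), and the sketch assumes that no quoted field contains an embedded newline, so that a physical line equals a CSV row.

```ruby
# Hypothetical helper, not part of our application or of the csv library.
# Returns the 1-based numbers of physical lines that are not valid UTF-8.
# Assumption: no quoted field spans several physical lines.
def invalid_utf8_line_numbers(content)
  # Split on a binary copy so that bad bytes cannot affect line splitting.
  lines = content.dup.force_encoding(Encoding::BINARY).each_line
  lines.with_index(1).select do |line, _number|
    # Reinterpret each line as UTF-8 and check its byte sequences.
    !line.dup.force_encoding(Encoding::UTF_8).valid_encoding?
  end.map { |_line, number| number }
end

# Made-up sample: a header plus four rows; rows 2 and 4 after the header
# contain single ISO-8859-1 bytes that are invalid in UTF-8.
sample = "name,city\n" \
         "ok,Praha\n" \
         "bad,Plze\xF2\n" \
         "ok,Brno\n" \
         "bad,T\xE1bor\n"

p invalid_utf8_line_numbers(sample) # => [3, 5]
```

With the header counted as line 1, this reports both the second and the fourth data rows, which is what we would expect the `csv.gets` loop above to report as well.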

It seems the bug is related to these lines in `csv.rb`:

- https://github.com/ruby/ruby/blob/ruby_2_5/lib/csv.rb#L2046
- https://github.com/ruby/ruby/blob/ruby_2_5/lib/csv.rb#L2039
