author | Heikki Linnakangas | 2021-04-01 09:23:40 +0000
---|---|---
committer | Heikki Linnakangas | 2021-04-01 09:23:40 +0000
commit | f82de5c46bdf8cd65812a7b04c9509c218e1545d | (patch)
tree | f9d687f0e1f50666a4a4cf8fbe366a2cd7e43d1c | /src/include/mb/pg_wchar.h
parent | ea1b99a6619cd9dcfd46b82ac0d926b0b80e0ae9 | (diff)
Do COPY FROM encoding conversion/verification in larger chunks.
This gives a small performance gain by reducing the number of calls
to the conversion/verification function and letting it work with
larger inputs. Also, reorganizing the input pipeline makes it easier
to parallelize the input parsing: after the input has been converted
to the database encoding, the next stage of finding the newlines can
be done in parallel, because there cannot be any newline chars
"embedded" in multi-byte characters in the encodings that we support
as server encodings.
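
To make the "no embedded newlines" argument concrete, here is a small
standalone C sketch (not PostgreSQL's actual copyfrom.c code;
count_newlines() and the sample data are invented for illustration). In
every encoding PostgreSQL accepts as a server encoding, every byte of a
multi-byte character has its high bit set, so the byte 0x0A can only
ever be a genuine newline, and a plain byte scan is safe once the input
is in the database encoding:

```c
#include <stdio.h>
#include <string.h>

/*
 * Count line terminators with a plain byte-wise scan.  This is safe on
 * server-encoded text because no multi-byte character in a supported
 * server encoding (UTF-8, the EUC family, etc.) contains a byte below
 * 0x80, so an 0x0A byte in the buffer is always a real newline.
 */
static int
count_newlines(const char *buf, size_t len)
{
	int			n = 0;

	for (size_t i = 0; i < len; i++)
	{
		if (buf[i] == '\n')
			n++;
	}
	return n;
}

int
main(void)
{
	/* UTF-8 sample: "héllo\nwörld\n"; the multi-byte sequences
	 * 0xC3 0xA9 and 0xC3 0xB6 never contain an 0x0A byte */
	const char	buf[] = "h\xc3\xa9llo\nw\xc3\xb6rld\n";

	printf("%d newlines\n", count_newlines(buf, strlen(buf)));
	return 0;
}
```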
This changes behavior in one corner case: if client and server
encodings are the same single-byte encoding (e.g. latin1), previously
the input would not be checked for zero bytes ('\0'). Any fields
containing zero bytes would be truncated at the zero. But if encoding
conversion was needed, the conversion routine would throw an error on
the zero. After this commit, the input is always checked for zeros.
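
A minimal sketch of what the now-unconditional check buys, under the
assumption of an invented helper name (verify_no_zero_bytes() is not the
actual function; the real check lives in the encoding verification
pass): an embedded '\0' is reported as an error instead of silently
truncating the field.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Reject embedded zero bytes up front, instead of letting C-string
 * handling truncate the field at the first '\0'.
 */
static bool
verify_no_zero_bytes(const char *buf, size_t len)
{
	return memchr(buf, '\0', len) == NULL;
}

int
main(void)
{
	const char	field[] = {'a', 'b', '\0', 'c'};

	if (!verify_no_zero_bytes(field, sizeof(field)))
		fprintf(stderr, "invalid byte sequence: embedded zero byte\n");
	return 0;
}
```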
Reviewed-by: John Naylor
Discussion: https://www.postgresql.org/message-id/e7861509-3960-538a-9025-b75a61188e01%40iki.fi
Diffstat (limited to 'src/include/mb/pg_wchar.h')
-rw-r--r-- | src/include/mb/pg_wchar.h | 22
1 file changed, 20 insertions, 2 deletions
```diff
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index a9aaff9e6dc..0f31e683189 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -306,9 +306,9 @@ typedef enum pg_enc
 
 /*
  * When converting strings between different encodings, we assume that space
- * for converted result is 4-to-1 growth in the worst case. The rate for
+ * for converted result is 4-to-1 growth in the worst case.  The rate for
  * currently supported encoding pairs are within 3 (SJIS JIS X0201 half width
- * kanna -> UTF8 is the worst case). So "4" should be enough for the moment.
+ * kana -> UTF8 is the worst case).  So "4" should be enough for the moment.
  *
  * Note that this is not the same as the maximum character width in any
  * particular encoding.
@@ -316,6 +316,24 @@ typedef enum pg_enc
 #define MAX_CONVERSION_GROWTH 4
 
 /*
+ * Maximum byte length of a string that's required in any encoding to convert
+ * at least one character to any other encoding.  In other words, if you feed
+ * MAX_CONVERSION_INPUT_LENGTH bytes to any encoding conversion function, it
+ * is guaranteed to be able to convert something without needing more input
+ * (assuming the input is valid).
+ *
+ * Currently, the maximum case is the conversion UTF8 -> SJIS JIS X0201 half
+ * width kana, where a pair of UTF-8 characters is converted into a single
+ * SHIFT_JIS_2004 character (the reverse of the worst case for
+ * MAX_CONVERSION_GROWTH).  It needs 6 bytes of input.  In theory, a
+ * user-defined conversion function might have more complicated cases, although
+ * for the reverse mapping you would probably also need to bump up
+ * MAX_CONVERSION_GROWTH.  But there is no need to be stingy here, so make it
+ * generous.
+ */
+#define MAX_CONVERSION_INPUT_LENGTH 16
+
+/*
  * Maximum byte length of the string equivalent to any one Unicode code point,
  * in any backend encoding.  The current value assumes that a 4-byte UTF-8
  * character might expand by MAX_CONVERSION_GROWTH, which is a huge
```
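
For a sense of how a caller would use the two constants together, here
is a hedged sketch (dst_capacity() and the chunking commentary are
illustrative, not the actual COPY FROM buffer management):
MAX_CONVERSION_GROWTH bounds the output buffer that must be reserved
for a given input, while MAX_CONVERSION_INPUT_LENGTH bounds how many
trailing bytes a chunked reader may need to hold back as a possibly
incomplete character.

```c
#include <stdio.h>

#define MAX_CONVERSION_GROWTH		4
#define MAX_CONVERSION_INPUT_LENGTH	16

/*
 * Worst-case output buffer size for converting src_len input bytes,
 * plus room for a terminating zero byte.
 */
static size_t
dst_capacity(size_t src_len)
{
	return src_len * MAX_CONVERSION_GROWTH + 1;
}

int
main(void)
{
	/*
	 * A chunked reader can carry over at most
	 * MAX_CONVERSION_INPUT_LENGTH - 1 unconverted trailing bytes: any
	 * valid prefix at least that long is guaranteed to yield at least
	 * one converted character, so the converter always makes progress.
	 */
	size_t		chunk = 64 * 1024;

	printf("chunk %zu bytes -> reserve %zu output bytes\n",
		   chunk, dst_capacity(chunk));
	return 0;
}
```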