Fix receiving large legal tsvector from binary format

From: Ерохин Денис Владимирович <erohin-d(at)datagile(dot)ru>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Fix receiving large legal tsvector from binary format
Date: 2023-09-29 13:36:27
Message-ID: [email protected]
Lists: pgsql-hackers

Hello,

There is a problem when receiving a large tsvector from binary format: it
fails with the error "invalid tsvector: maximum total lexeme length
exceeded". Simple steps to reproduce the problem:

- Create a table with one column of type 'tsvector'

- Add a row with a large tsvector (900 Kb will be enough)

- Dump the table to a file with "copy to with binary"

- Restore the table from the file with "copy from with binary"

At the last step the legal tsvector cannot be restored, because the
required memory size for the tsvector is calculated incorrectly while its
binary representation is read.

It seems the function "tsvectorrecv" can be improved:

v1-0001-Fix-receiving-of-big-tsvector-from-binary - fixes the required-size
calculation for the tsvector, where the wrong type is used (WordEntry
instead of WordEntryPos). The wrong type is larger than the needed one, so
the computed total size comes out larger than necessary and the subsequent
maximum-size check fails.
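
For illustration only (this is a sketch, not the patch itself; the helper
names and the "npos" parameter are invented for the example, while WordEntry
and WordEntryPos are the real PostgreSQL types):

#include "postgres.h"
#include "tsearch/ts_type.h"    /* WordEntry, WordEntryPos */

/*
 * Space that one lexeme's position list (a uint16 count followed by npos
 * position values) adds to the tsvector data area.
 */
static Size
pos_list_size_buggy(uint16 npos)
{
    /* bug: WordEntry is a 4-byte struct, so the total is overstated */
    return (npos + 1) * sizeof(WordEntry);
}

static Size
pos_list_size_fixed(uint16 npos)
{
    /* positions are stored as 2-byte WordEntryPos values */
    return (npos + 1) * sizeof(WordEntryPos);
}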

v1-0002-Replace-magic-one-in-tsvector-binary-receiving - removes magic ones
from the code. Those ones stand in for "sizeof(uint16)" in the required-size
calculation, accounting for the slot where the number of WordEntryPos items
is stored. That only works because sizeof(WordEntryPos) happens to equal
sizeof(uint16), which is not obvious when reading the code and raises the
question "what is this magic one for?"

v1-0003-Optimize-memory-allocation-during-tsvector-binary - makes the memory
allocation more accurate and less frequent. It seems the memory for the
tsvector's data is allocated redundantly, and reallocations happen more
often than necessary.

Best regards,

Denis Erokhin

Jatoba

Attachment Content-Type Size
v1-0001-Fix-receiving-of-big-tsvector-from-binary.patch application/octet-stream 753 bytes
v1-0002-Replace-magic-one-in-tsvector-binary-receiving.patch application/octet-stream 1.1 KB
v1-0003-Optimize-memory-allocation-during-tsvector-binary.patch application/octet-stream 1.5 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ерохин Денис Владимирович <erohin-d(at)datagile(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fix receiving large legal tsvector from binary format
Date: 2023-10-01 17:24:01
Message-ID: [email protected]
Lists: pgsql-hackers

Ерохин Денис Владимирович <erohin-d(at)datagile(dot)ru> writes:
> There is a problem on receiving large tsvector from binary format with
> getting error "invalid tsvector: maximum total lexeme length exceeded".

Good catch! Even without an actual failure, we'd be wasting space
on-disk anytime we stored a tsvector received through binary input.

I pushed your 0001 and 0002, but I don't really agree that 0003
is an improvement. It looks to me like it'll result in one
repalloc per lexeme, instead of the O(log N) behavior we had before.
It's not that important to keep the palloc chunk size small here,
given that we don't allow tsvectors to get anywhere near 1Gb anyway.
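
For illustration, the doubling idiom looks roughly like this (a sketch, not
the actual tsvectorrecv() code; the function and variable names here are
invented):

#include "postgres.h"
#include <string.h>

/*
 * Append one lexeme's bytes to a palloc'd buffer, doubling the buffer
 * whenever it runs out of room.  Geometric growth needs only O(log N)
 * repallocs over N lexemes; growing to the exact size needed for each
 * lexeme would mean one repalloc per lexeme instead.
 */
static char *
append_bytes(char *buf, Size *buflen, Size *used,
             const char *data, Size datalen)
{
    while (*used + datalen > *buflen)
    {
        *buflen *= 2;
        buf = repalloc(buf, *buflen);
    }
    memcpy(buf + *used, data, datalen);
    *used += datalen;
    return buf;
}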

regards, tom lane