summaryrefslogtreecommitdiff
path: root/src/backend/tsearch/ts_locale.c
AgeCommit message (Collapse)Author
2025-01-01Update copyright for 2025Bruce Momjian
Backpatch-through: 13
2024-12-17Remove ts_locale.c's lowerstr()Peter Eisentraut
lowerstr() and lowerstr_with_len() in ts_locale.c do the same thing as str_tolower() that the rest of the system uses, except that the former don't use the common locale provider framework but instead use the global libc locale settings. This patch replaces uses of lowerstr*() with str_tolower(..., DEFAULT_COLLATION_OID). For instances that use a libc locale globally, this will result in exactly the same behavior. For instances that use other locale providers, you now get consistent behavior and are no longer dependent on the libc locale settings (for this case; there are others). Most uses of these functions are for processing dictionary and configuration files. In those cases, using the default collation seems appropriate. At least we don't have a more specific collation available. But the code in contrib/pg_trgm should really depend on the collation of the columns being processed. This is not done here, this can be done in a separate patch. (You can probably construct some edge cases where this change would create some locale-related upgrade incompatibility, for example if before you used a combination of ICU and a differently-behaving libc locale. We can document this in the release notes, but I don't think there is anything more we can do about this.) Reviewed-by: Jeff Davis <[email protected]> Discussion: https://www.postgresql.org/message-id/flat/653f3b84-fc87-45a7-9a0c-bfb4fcab3e7d%40eisentraut.org
2024-12-17Remove ts_locale.c's t_isdigit(), t_isspace(), t_isprint()Peter Eisentraut
These do the same thing as the standard isdigit(), isspace(), and isprint() but with multibyte and encoding support. But all the callers are only interested in analyzing single-byte ASCII characters. So this extra layer is overkill and we can replace the uses with the standard functions. All the t_is*() functions in ts_locale.c are under scrutiny because they don't use the common locale provider framework but instead use the global libc locale settings. For the functions being touched by this patch, we don't need all that anyway, as mentioned above, so the simplest solution is to just remove them. The few remaining t_is*() functions will need a different treatment in a separate patch. pg_trgm has some compile-time options with macros such as KEEPONLYALNUM. These are not documented, and the non-default variant is not supported by any test cases. As part of this undertaking, I'm removing the non-default variant, as it is in the way of cleanup. So in this case, the not-KEEPONLYALNUM code path is gone. Reviewed-by: Jeff Davis <[email protected]> Discussion: https://www.postgresql.org/message-id/flat/653f3b84-fc87-45a7-9a0c-bfb4fcab3e7d%40eisentraut.org
2024-11-28Remove useless casts to (void *)Peter Eisentraut
Many of them just seem to have been copied around for no real reason. Their presence causes (small) risks of hiding actual type mismatches or silently discarding qualifiers Discussion: https://www.postgresql.org/message-id/flat/[email protected]
2024-03-04Remove unused #include's from backend .c filesPeter Eisentraut
as determined by include-what-you-use (IWYU) While IWYU also suggests to *add* a bunch of #include's (which is its main purpose), this patch does not do that. In some cases, a more specific #include replaces another less specific one. Some manual adjustments of the automatic result: - IWYU currently doesn't know about includes that provide global variable declarations (like -Wmissing-variable-declarations), so those includes are being kept manually. - All includes for port(ability) headers are being kept for now, to play it safe. - No changes of catalog/pg_foo.h to catalog/pg_foo_d.h, to keep the patch from exploding in size. Note that this patch touches just *.c files, so nothing declared in header files changes in hidden ways. As a small example, in src/backend/access/transam/rmgr.c, some IWYU pragma annotations are added to handle a special case there. Discussion: https://www.postgresql.org/message-id/flat/af837490-6b2f-46df-ba05-37ea6a6653fc%40eisentraut.org
2024-01-04Update copyright for 2024Bruce Momjian
Reported-by: Michael Paquier Discussion: https://postgr.es/m/[email protected] Backpatch-through: 12
2023-03-17Fix t_isspace(), etc., when datlocprovider=i and datctype=C.Jeff Davis
Check whether the datctype is C to determine whether t_isspace() and related functions use isspace() or iswspace(). Previously, t_isspace() checked whether the database default collation was C; which is incorrect when the default collation uses the ICU provider. Discussion: https://postgr.es/m/[email protected] Reviewed-by: Peter Eisentraut Backpatch-through: 15
2023-01-02Update copyright for 2023Bruce Momjian
Backpatch-through: 11
2022-10-06Introduce t_isalnum() to replace t_isalpha() || t_isdigit() tests.Tom Lane
ts_locale.c omitted support for "isalnum" tests, perhaps on the grounds that there were initially no use-cases for that. However, both ltree and pg_trgm need such tests, and we do also have one use-case now in the core backend. The workaround of testing isalpha and isdigit separately seems quite inefficient, especially when dealing with multibyte characters; so let's fill in the missing support. Discussion: https://postgr.es/m/[email protected]
2022-01-08Update copyright for 2022Bruce Momjian
Backpatch-through: 10
2021-01-02Update copyright for 2021Bruce Momjian
Backpatch-through: 9.5
2020-09-24Improve behavior of tsearch_readline(), and remove t_readline().Tom Lane
Commit fbeb9da22, which added the tsearch_readline APIs, left t_readline() in place as a compatibility measure. But that function has been unused and deprecated for twelve years now, so that seems like enough time to remove it. Doing so, and merging t_readline's code into tsearch_readline, aids in making several useful improvements: * The hard-wired 4K limit on line length in tsearch data files is removed, by using a StringInfo buffer instead of a fixed-size buffer. * We can buy back the per-line palloc/pfree added by 3ea7e9550 in the common case where encoding conversion is not required. * We no longer need a separate pg_verify_mbstr call, as that functionality was folded into encoding conversion some time ago. (We could have done some of this stuff while keeping t_readline as a separate API, but there seems little point, since there's no reason for anyone to still be using t_readline directly.) Discussion: https://postgr.es/m/[email protected]
2020-09-23Avoid possible dangling-pointer access in tsearch_readline_callback.Tom Lane
tsearch_readline() saves the string pointer it returns to the caller for possible use in the associated error context callback. However, the caller will usually pfree that string sometime before it next calls tsearch_readline(), so that there is a window where an ereport will try to print an already-freed string. The built-in users of tsearch_readline() happen to all do that pfree at the bottoms of their loops, so that the window is effectively empty for them. However, this is not documented as a requirement, and contrib/dict_xsyn doesn't do it like that, so it seems likely that third-party dictionaries might have live bugs here. The practical consequences of this seem pretty limited in any case, since production builds wouldn't clobber the freed string immediately, besides which you'd not expect syntax errors in dictionary files being used in production. Still, it's clearly a bug waiting to bite somebody. Fix by pstrdup'ing the string to be saved for the error callback, and then pfree'ing it next time through. It's been like this for a long time, so back-patch to all supported branches. Discussion: https://postgr.es/m/[email protected]
2020-01-01Update copyrights for 2020Bruce Momjian
Backpatch-through: update all files in master, backpatch legal files through 9.4
2019-01-02Update copyright for 2019Bruce Momjian
Backpatch-through: certain files through 9.4
2018-11-03Make ts_locale.c's character-type functions cope with UTF-16.Tom Lane
On Windows, in UTF8 database encoding, what char2wchar() produces is UTF16 not UTF32, ie, characters above U+FFFF will be represented by surrogate pairs. t_isdigit() and siblings did not account for this and failed to provide a large enough result buffer. That in turn led to bogus "invalid multibyte character for locale" errors, because contrary to what you might think from char2wchar()'s documentation, its Windows code path doesn't cope sanely with buffer overflow. The solution for t_isdigit() and siblings is pretty clear: provide a 3-wchar_t result buffer not 2. char2wchar() also needs some work to provide more consistent, and more accurately documented, buffer overrun behavior. But that's a bigger job and it doesn't actually have any immediate payoff, so leave it for later. Per bug #15476 from Kenji Uno, who deserves credit for identifying the cause of the problem. Back-patch to all active branches. Discussion: https://postgr.es/m/[email protected]
2018-01-03Update copyright for 2018Bruce Momjian
Backpatch-through: certain files through 9.3
2017-09-22Assume wcstombs(), towlower(), and sibling functions are always present.Tom Lane
These functions are required by SUS v2, which is our minimum baseline for Unix platforms, and are present on all interesting Windows versions as well. Even our oldest buildfarm members have them. Thus, we were not testing the "!USE_WIDE_UPPER_LOWER" code paths, which explains why the bug fixed in commit e6023ee7f escaped detection. Per discussion, there seems to be no more real-world value in maintaining this option. Hence, remove the configure-time tests for wcstombs() and towlower(), remove the USE_WIDE_UPPER_LOWER symbol, and remove all the !USE_WIDE_UPPER_LOWER code. There's not actually all that much of the latter, but simplifying the #if nests is a win in itself. Discussion: https://postgr.es/m/[email protected]
2017-06-21Phase 3 of pgindent updates.Tom Lane
Don't move parenthesized lines to the left, even if that means they flow past the right margin. By default, BSD indent lines up statement continuation lines that are within parentheses so that they start just to the right of the preceding left parenthesis. However, traditionally, if that resulted in the continuation line extending to the right of the desired right margin, then indent would push it left just far enough to not overrun the margin, if it could do so without making the continuation line start to the left of the current statement indent. That makes for a weird mix of indentations unless one has been completely rigid about never violating the 80-column limit. This behavior has been pretty universally panned by Postgres developers. Hence, disable it with indent's new -lpl switch, so that parenthesized lines are always lined up with the preceding left paren. This patch is much less interesting than the first round of indent changes, but also bulkier, so I thought it best to separate the effects. Discussion: https://postgr.es/m/[email protected] Discussion: https://postgr.es/m/[email protected]
2017-06-21Phase 2 of pgindent updates.Tom Lane
Change pg_bsd_indent to follow upstream rules for placement of comments to the right of code, and remove pgindent hack that caused comments following #endif to not obey the general rule. Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using the published version of pg_bsd_indent, but a hacked-up version that tried to minimize the amount of movement of comments to the right of code. The situation of interest is where such a comment has to be moved to the right of its default placement at column 33 because there's code there. BSD indent has always moved right in units of tab stops in such cases --- but in the previous incarnation, indent was working in 8-space tab stops, while now it knows we use 4-space tabs. So the net result is that in about half the cases, such comments are placed one tab stop left of before. This is better all around: it leaves more room on the line for comment text, and it means that in such cases the comment uniformly starts at the next 4-space tab stop after the code, rather than sometimes one and sometimes two tabs after. Also, ensure that comments following #endif are indented the same as comments following other preprocessor commands such as #else. That inconsistency turns out to have been self-inflicted damage from a poorly-thought-through post-indent "fixup" in pgindent. This patch is much less interesting than the first round of indent changes, but also bulkier, so I thought it best to separate the effects. Discussion: https://postgr.es/m/[email protected] Discussion: https://postgr.es/m/[email protected]
2017-01-03Update copyright via script for 2017Bruce Momjian
2016-01-02Update copyright for 2016Bruce Momjian
Backpatch certain files through 9.1
2015-01-06Update copyright for 2015Bruce Momjian
Backpatch certain files through 9.0
2014-05-06pgindent run for 9.4Bruce Momjian
This includes removing tabs after periods in C comments, which was applied to back branches, so this change should not effect backpatching.
2014-02-23Prefer pg_any_to_server/pg_server_to_any over pg_do_encoding_conversion.Tom Lane
A large majority of the callers of pg_do_encoding_conversion were specifying the database encoding as either source or target of the conversion, meaning that we can use the less general functions pg_any_to_server/pg_server_to_any instead. The main advantage of using the latter functions is that they can make use of a cached conversion-function lookup in the common case that the other encoding is the current client_encoding. It's notationally cleaner too in most cases, not least because of the historical artifact that the latter functions use "char *" rather than "unsigned char *" in their APIs. Note that pg_any_to_server will apply an encoding verification step in some cases where pg_do_encoding_conversion would have just done nothing. This seems to me to be a good idea at most of these call sites, though it partially negates the performance benefit. Per discussion of bug #9210.
2014-01-07Update copyright for 2014Bruce Momjian
Update all files in head, and files COPYRIGHT and legal.sgml in all back branches.
2013-01-01Update copyrights for 2013Bruce Momjian
Fully update git head, and update back branches in ./COPYRIGHT and legal.sgml files.
2012-01-01Update copyright notices for year 2012.Bruce Momjian
2011-06-09Pgindent run before 9.1 beta2.Bruce Momjian
2011-04-23Fix char2wchar/wchar2char to support collations properly.Tom Lane
These functions should take a pg_locale_t, not a collation OID, and should call mbstowcs_l/wcstombs_l where available. Where those functions are not available, temporarily select the correct locale with uselocale(). This change removes the bogus assumption that all locales selectable in a given database have the same wide-character conversion method; in particular, the collate.linux.utf8 regression test now passes with LC_CTYPE=C, so long as the database encoding is UTF8. I decided to move the char2wchar/wchar2char functions out of mbutils.c and into pg_locale.c, because they work on wchar_t not pg_wchar_t and thus don't really belong with the mbutils.c functions. Keeping them where they were would have required importing pg_locale_t into pg_wchar.h somehow, which did not seem like a good plan.
2011-04-10pgindent run before PG 9.1 beta 1.Bruce Momjian
2011-02-10Fix "variable not used" warnings when USE_WIDE_UPPER_LOWER is notBruce Momjian
defined.
2011-02-08Per-column collation supportPeter Eisentraut
This adds collation support for columns and domains, a COLLATE clause to override it per expression, and B-tree index support. Peter Eisentraut reviewed by Pavel Stehule, Itagaki Takahiro, Robert Haas, Noah Misch
2011-01-01Stamp copyrights for year 2011.Bruce Momjian
2010-09-20Remove cvs keywords from all files.Magnus Hagander
2010-01-02Update copyright for the year 2010.Bruce Momjian
2009-06-118.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef listBruce Momjian
provided by Andrew.
2009-01-01Update copyright for 2009.Bruce Momjian
2008-11-10pg_do_encoding_conversion cannot return NULL (at least not unless the inputTom Lane
is NULL), so remove some useless tests for the case.
2008-06-18Improve error reporting for problems in text search configuration filesTom Lane
by installing an error context subroutine that will provide the file name and line number for all errors detected while reading a config file. Some of the reader routines were already doing that in an ad-hoc way for errors detected directly in the reader, but it didn't help for problems detected in subroutines, such as encoding violations. Back-patch to 8.3 because 8.3 is where people will be trying to debug configuration files.
2008-06-18Move wchar2char() and char2wchar() from tsearch into /mb to be easier toBruce Momjian
use for other modules; also move pnstrdup(). Clean up code slightly.
2008-06-17Move USE_WIDE_UPPER_LOWER define to c.h, and remove TS_USE_WIDE and useBruce Momjian
USE_WIDE_UPPER_LOWER instead.
2008-01-01Update copyrights in source tree to 2008.Bruce Momjian
2007-11-28Improve terminologyPeter Eisentraut
2007-11-24Fix (I think) broken usage of MultiByteToWideChar. I had missed theTom Lane
subtlety that this function only returns a null terminator if it's fed input that includes one; which, in the usage here, it's not. This probably fixes bugs reported by Thomas Haegi.
2007-11-15pgindent run for 8.3.Bruce Momjian
2007-11-09Clean up ts_locale.h/.c. Fix broken and not-consistent-across-platformsTom Lane
behavior of wchar2char/char2wchar; this should resolve bug #3730. Avoid excess computations of pg_mblen in t_isalpha and friends. Const-ify APIs where possible.
2007-08-25Cleanup for some problems in tsearch patch:Tom Lane
- ispell initialization crashed on empty dictionary file - ispell initialization crashed on affix file with prefixes but no suffixes - stop words file was run through pg_verify_mbstr, with database encoding, but it's supposed to be UTF-8; similar bug for synonym files - bunch of comments added, typos fixed, and other cleanup Introduced consistent encoding checking/conversion of data read from tsearch configuration files, by doing this in a single t_readline() subroutine (replacing direct usages of fgets). Cleaned up API for readstopwords too. Heikki Linnakangas
2007-08-21Tsearch2 functionality migrates to core. The bulk of this work is byTom Lane
Oleg Bartunov and Teodor Sigaev, but I did a lot of editorializing, so anything that's broken is probably my fault. Documentation is nonexistent as yet, but let's land the patch so we can get some portability testing done.