Thread: integrated tsearch has different results than tsearch2
Hello
I am testing fulltext.
1. I am not able to use fulltext with latin2 encoding :( (the documentation
is missing a note that dictionary files are UTF-8 only).

2. With ispell dictionaries (a fresh copy from OpenOffice) I get different
and wrong results.
Original (old) result
ts=# select * from ts_debug('Příliš žluťoučký kůň se napil žluté vody');
    ts_name    | tok_type | description |   token   |     dict_name      |  tsvector
---------------+----------+-------------+-----------+--------------------+-------------
 default_czech | word     | Word        | Příliš    | {cz_ispell,simple} | 'příliš'
 default_czech | word     | Word        | žluťoučký | {cz_ispell,simple} | 'žluťoučký'
 default_czech | word     | Word        | kůň       | {cz_ispell,simple} | 'kůň'
 default_czech | lword    | Latin word  | se        | {cz_ispell,simple} |
 default_czech | lword    | Latin word  | napil     | {cz_ispell,simple} | 'napít'
 default_czech | word     | Word        | žluté     | {cz_ispell,simple} | 'žlutý'
 default_czech | lword    | Latin word  | vody      | {cz_ispell,simple} | 'voda'
(7 rows)
New results:
postgres=# create Text search dictionary cspell(template=ispell,
dictfile=czech, afffile=czech, stopwords=czech);
CREATE TEXT SEARCH DICTIONARY
postgres=# CREATE text search configuration cs (copy=english);
CREATE TEXT SEARCH CONFIGURATION
postgres=# alter text search configuration cs alter mapping for word,
lword with cspell, simple;
ALTER TEXT SEARCH CONFIGURATION
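(Aside, not part of the original session: before running ts_debug it can help
to confirm what the configuration actually maps and whether the dictionary
loads at all; a minimal sketch using the objects created above.)

-- show the token-type -> dictionary mappings of the new configuration
\dF+ cs

-- sanity-check the dictionary directly, bypassing the configuration
SELECT ts_lexize('cspell', 'kůň');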
postgres=# select * from ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
 Alias |  Description  |   Token   |  Dictionaries   |    Lexized token
-------+---------------+-----------+-----------------+---------------------
 word  | Word          | Příliš    | {cspell,simple} | cspell: {příliš}
 blank | Space symbols |           | {}              |
 word  | Word          | žluťoučký | {cspell,simple} | cspell: {žluťoučký}
 blank | Space symbols |           | {}              |
 word  | Word          | kůň       | {cspell,simple} | cspell: {kůň}
 blank | Space symbols |           | {}              |
 lword | Latin word    | se        | {cspell,simple} | cspell: {}
 blank | Space symbols |           | {}              |
 lword | Latin word    | napil     | {cspell,simple} | simple: {napil}
 blank | Space symbols |           | {}              |
 word  | Word          | žluté     | {cspell,simple} | simple: {žluté}
 blank | Space symbols |           | {}              |
 lword | Latin word    | vody      | {cspell,simple} | simple: {vody}
(13 rows)
This query returned true in 8.2, but now:

postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody')
           @@ to_tsquery('cs','napít');
 ?column?
----------
 f
(1 row)
Regards
Pavel Stehule
Pavel,
I can't read your posting. Can you use plain text format?
Oleg
On Mon, 3 Sep 2007, Pavel Stehule wrote:
> Hello
> I am testing fulltext.
Regards, Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [email protected], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
> 1. I am not able to use fulltext with latin2 encoding :( (the documentation
> is missing a note that dictionary files are UTF-8 only).
You can use any server encoding, but the dictionary files should be in UTF-8;
the dictionary will convert the UTF-8 files into the server encoding.
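A minimal sketch of what this means in practice (not from the original thread;
the file names follow Pavel's setup): the ispell files installed under
$SHAREDIR/tsearch_data/ stay UTF-8 encoded, and the same definition works in a
database of any server encoding, e.g. latin2, because the files are recoded at
load time.

-- czech.dict, czech.affix and czech.stop are assumed to be UTF-8 encoded
-- files installed in $SHAREDIR/tsearch_data/
CREATE TEXT SEARCH DICTIONARY cspell (
    TEMPLATE  = ispell,
    DictFile  = czech,
    AffFile   = czech,
    StopWords = czech
);

-- the server recodes the UTF-8 files into the database encoding on load,
-- so this should return the lemma (e.g. {voda}) in a latin2 database too
SELECT ts_lexize('cspell', 'vody');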
>
>
> 2. With ispell dictionaries (a fresh copy from OpenOffice) I get different
> and wrong results.
> postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté
> vody') @@ to_tsquery('cs','napít');
> ?column?
> ----------
> f
> (1 row)
Please post the output of:
select ts_lexize('cspell','napil');
select to_tsvector('cs','Příliš žlutý kůň se napil žluté
vody');
--
Teodor Sigaev E-mail: [email protected]
WWW: http://www.sigaev.ru/
2007/9/3, Teodor Sigaev <[email protected]>:
> Please post the output of:
> select ts_lexize('cspell','napil');
> select to_tsvector('cs','Příliš žlutý kůň se napil žluté
> vody');

postgres=# select ts_lexize('cspell','napil');
 ts_lexize
-----------

(1 row)

postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');
                         to_tsvector
------------------------------------------------------------
 'vody':7 'kůň':3 'napil':5 'žluté':6 'žlutý':2 'příliš':1
(1 row)

There is a difference:

8.2.x
postgres=# select lexize('cz_ispell','jablka');
  lexize
----------
 {jablko}
(1 row)

8.3
postgres=# select ts_lexize('cspell','jablka');
 ts_lexize
-----------

(1 row)

postgres=# select ts_lexize('cspell','jablko');
 ts_lexize
-----------
 {jablko}
(1 row)

Pavel Stehule
Pavel Stehule wrote:
> There is a difference:
>
> 8.2.x
> postgres=# select lexize('cz_ispell','jablka');
>   lexize
> ----------
>  {jablko}
> (1 row)
>
> 8.3
> postgres=# select ts_lexize('cspell','jablka');
>  ts_lexize
> -----------
>
> (1 row)
>
> postgres=# select ts_lexize('cspell','jablko');
>  ts_lexize
> -----------
>  {jablko}
> (1 row)

Can you post a link to the ispell dictionary file you're using so I and
others can reproduce that?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
I used the dictionaries from the Fedora Core package

  hunspell-cs-20060303-5.fc7.i386.rpm

and then converted them to UTF-8 with iconv.

Pavel

2007/9/4, Heikki Linnakangas <[email protected]>:
> Can you post a link to the ispell dictionary file you're using so I and
> others can reproduce that?
Pavel Stehule wrote:
> I used the dictionaries from the Fedora Core package
>
>   hunspell-cs-20060303-5.fc7.i386.rpm
>
> and then converted them to UTF-8 with iconv.

Ok, thanks.

Apparently it's a bug I introduced when I refactored spell.c to use the
readline function for reading and recoding the input file. I didn't notice
that some calls to STRNCMP used the non-lowercased version of the input
line. Patch attached.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

Index: src/backend/tsearch/spell.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/tsearch/spell.c,v
retrieving revision 1.2
diff -c -r1.2 spell.c
*** src/backend/tsearch/spell.c	25 Aug 2007 00:03:59 -0000	1.2
--- src/backend/tsearch/spell.c	4 Sep 2007 12:31:55 -0000
***************
*** 733,739 ****
  	while ((recoded = t_readline(affix)) != NULL)
  	{
  		pstr = lowerstr(recoded);
- 		pfree(recoded);

  		lineno++;

--- 733,738 ----
***************
*** 813,820 ****
  			flag = (unsigned char) *s;
  			goto nextline;
  		}
! 		if (STRNCMP(str, "COMPOUNDFLAG") == 0 || STRNCMP(str, "COMPOUNDMIN") == 0 ||
! 			STRNCMP(str, "PFX") == 0 || STRNCMP(str, "SFX") == 0)
  		{
  			if (oldformat)
  				ereport(ERROR,
--- 812,819 ----
  			flag = (unsigned char) *s;
  			goto nextline;
  		}
! 		if (STRNCMP(recoded, "COMPOUNDFLAG") == 0 || STRNCMP(recoded, "COMPOUNDMIN") == 0 ||
! 			STRNCMP(recoded, "PFX") == 0 || STRNCMP(recoded, "SFX") == 0)
  		{
  			if (oldformat)
  				ereport(ERROR,
***************
*** 834,839 ****
--- 833,839 ----
  		NIAddAffix(Conf, flag, flagflags, mask, find, repl, suffixes ? FF_SUFFIX : FF_PREFIX);

  nextline:
+ 		pfree(recoded);
  		pfree(pstr);
  	}
  	FreeFile(affix);
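A quick way to verify the fix after rebuilding with the patch would be to
re-run the cases that differed from 8.2 (a sketch; the expected lexemes are
taken from Pavel's 8.2 output earlier in the thread):

SELECT ts_lexize('cspell', 'jablka');   -- expected: {jablko}, as in 8.2
SELECT ts_lexize('cspell', 'napil');    -- expected: {napít}

SELECT to_tsvector('cs', 'Příliš žlutý kůň se napil žluté vody')
       @@ to_tsquery('cs', 'napít');    -- expected: t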
2007/9/4, Heikki Linnakangas <[email protected]>:
> Apparently it's a bug I introduced when I refactored spell.c to use the
> readline function for reading and recoding the input file. I didn't notice
> that some calls to STRNCMP used the non-lowercased version of the input
> line. Patch attached.

It works

Thank you

Pavel