PostgreSQL full text search randomly dropping lexemes
(All this was done with PostgreSQL 9.0.1 on Ubuntu 10.04, FWIW.)
I've tried writing my own lexize function for PostgreSQL specifically for names, based on the double metaphone algorithm. To produce the C lexize function, I started with the dict_xsyn example, and basically swapped double metaphone in for the synonym lookup.
But maybe 20% of the time to_tsvector is invoked, it seems to drop lexemes. As an example, I created a table containing the US Census Bureau's list of names.
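The table itself is nothing special; a minimal sketch of the setup, assuming the census list was exported to CSV first (the file name is a placeholder):
-- columns match the census surname list queried below
CREATE TABLE names (
    name    text,
    freq    numeric,
    cumfreq numeric,
    rank    integer
);
-- load from a local CSV export
\copy names FROM 'census_surnames.csv' WITH (FORMAT csv)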
db=# select * from names order by rank limit 8;
name | freq | cumfreq | rank
----------+-------+---------+------
SMITH | 1.006 | 1.006 | 1
JOHNSON | 0.81 | 1.816 | 2
WILLIAMS | 0.699 | 2.515 | 3
JONES | 0.621 | 3.136 | 4
BROWN | 0.621 | 3.757 | 5
DAVIS | 0.48 | 4.237 | 6
MILLER | 0.424 | 4.66 | 7
WILSON | 0.339 | 5 | 8
(8 rows)
We can then add a tsvector column, and populate it with to_tsvector using the configuration built on my metaphone dictionary:
db=# alter table names add column vec tsvector;
ALTER TABLE
db=# update names set vec=to_tsvector('public.names', name);
UPDATE 88799
db=# select * from names order by rank limit 8;
name | freq | cumfreq | rank | vec
----------+-------+---------+------+-------------------------------
SMITH | 1.006 | 1.006 | 1 |
JOHNSON | 0.81 | 1.816 | 2 | 'ANSN':1 'JNSN':1 'johnson':1
WILLIAMS | 0.699 | 2.515 | 3 |
JONES | 0.621 | 3.136 | 4 |
BROWN | 0.621 | 3.757 | 5 |
DAVIS | 0.48 | 4.237 | 6 |
MILLER | 0.424 | 4.66 | 7 | 'MLR':1 'miller':1
WILSON | 0.339 | 5 | 8 | 'ALSN':1 'FLSN':1 'wilson':1
(8 rows)
A bunch of the vec fields are just empty! In fact:
db=# select count(1) from names where vec = to_tsvector('');
count
-------
41101
(1 row)
I can do that bulk update, and then count the mismatched fields repeatedly, and get different counts each time, because different rows are failing to have real vectors computed.
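To make that concrete, here is one way to show that different rows fail on different runs (the temp table is just scratch space):
-- snapshot the rows that came back empty on this run
CREATE TEMP TABLE bad_run AS
    SELECT name FROM names WHERE vec = to_tsvector('');
-- redo the bulk update, then count failures that were NOT in the last batch
UPDATE names SET vec = to_tsvector('public.names', name);
SELECT count(1) FROM names
WHERE vec = to_tsvector('')
  AND name NOT IN (SELECT name FROM bad_run);
-- a nonzero count means the set of failing rows moved between runs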
But my lexize function itself seems to be OK:
db=# alter table names drop column vec; alter table names add column lexemes varchar[];
ALTER TABLE
ALTER TABLE
db=# update names set lexemes=ts_lexize('dmetaphonedict', name);
UPDATE 88799
db=# select * from names order by rank limit 10;
name | freq | cumfreq | rank | lexemes
----------+-------+---------+------+----------------------
SMITH | 1.006 | 1.006 | 1 | {smith,SM0,XMT}
JOHNSON | 0.81 | 1.816 | 2 | {johnson,JNSN,ANSN}
WILLIAMS | 0.699 | 2.515 | 3 | {williams,ALMS,FLMS}
JONES | 0.621 | 3.136 | 4 | {jones,JNS,ANS}
BROWN | 0.621 | 3.757 | 5 | {brown,PRN}
DAVIS | 0.48 | 4.237 | 6 | {davis,TFS}
MILLER | 0.424 | 4.66 | 7 | {miller,MLR}
WILSON | 0.339 | 5 | 8 | {wilson,ALSN,FLSN}
MOORE | 0.312 | 5.312 | 9 | {moore,MR}
TAYLOR | 0.311 | 5.623 | 10 | {taylor,TLR}
(10 rows)
db=# select count(1) from names where coalesce(array_length(lexemes,1), 0) = 0;
count
-------
0
(1 row)
I can do that over and over and get 0 mismatches each time.
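If the "vec" column is kept alongside "lexemes" instead of being dropped first, the contradiction is visible row by row:
-- rows where ts_lexize produced lexemes but to_tsvector came back empty
SELECT name, lexemes
FROM names
WHERE vec = to_tsvector('')
  AND array_length(lexemes, 1) > 0
LIMIT 8;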
I'm creating the relevant text search dictionaries and configurations with:
CREATE OR REPLACE FUNCTION ddmetaphonedict_init(internal)
RETURNS internal
AS '$libdir/dict_dmetaphone'
LANGUAGE C STRICT;
CREATE OR REPLACE FUNCTION ddmetaphonedict_lexize(internal, internal, internal, internal)
RETURNS internal
AS '$libdir/dict_dmetaphone'
LANGUAGE C STRICT;
CREATE TEXT SEARCH TEMPLATE dmetaphonedict_template (
LEXIZE = ddmetaphonedict_lexize,
INIT = ddmetaphonedict_init
);
CREATE TEXT SEARCH DICTIONARY dmetaphonedict (
TEMPLATE = dmetaphonedict_template
);
COMMENT ON TEXT SEARCH DICTIONARY dmetaphonedict IS 'dictionary for names, using dmetaphone';
CREATE TEXT SEARCH CONFIGURATION names (COPY = english);
ALTER TEXT SEARCH CONFIGURATION names
    ALTER MAPPING FOR asciiword, asciihword, word, hword WITH dmetaphonedict;
Pretty much straight out of the dict_xsyn example.
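For context, the intended use (my assumption) is phonetic matching on the codes, e.g.:
-- match any name whose double metaphone code is JNSN; the raw tsquery
-- cast skips normalization, so the code's upper case is preserved
SELECT name, rank
FROM names
WHERE vec @@ 'JNSN'::tsquery
ORDER BY rank
LIMIT 5;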
What's going on? What can I do to figure out where these lexemes are being dropped?
ADDED: I just tried this with 9.1. The rate at which to_tsvector produces a dud tsvector is greatly reduced, ranging from 5 (just 5, not 5k) to 7,000 records out of 80,000+ per run. But the duds are still there.
I'd like to first eliminate to_tsvector() as a cause.
I might first try changing to_tsvector() so that it returns a constant. For example, change it to return "'ANSN':1 'JNSN':1 'johnson':1" no matter what its input arguments are (even if the input arguments are NULL). If your bulk update still leaves holes in the column "vec", I think that would clear to_tsvector().
Am I right about that?
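A sketch of that test without touching the C code: stand in a throwaway SQL wrapper for to_tsvector() (the wrapper name is mine):
-- non-STRICT SQL wrapper that ignores its arguments entirely
CREATE FUNCTION to_tsvector_const(regconfig, text) RETURNS tsvector AS
$$ SELECT '''ANSN'':1 ''JNSN'':1 ''johnson'':1'::tsvector $$
LANGUAGE sql;

UPDATE names SET vec = to_tsvector_const('public.names', name);
-- if holes remain now, the inputs to to_tsvector were never the problem
SELECT count(1) FROM names WHERE vec = to_tsvector('');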
Also consider a two-step process: use ts_lexize() to populate the column "lexemes", then another function to read the lexemes and produce the values in the column "vec".
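For step two, one sketch that bypasses to_tsvector entirely (assuming "vec" is re-added; the plain cast loses positions, which hardly matters for single-word names):
-- build each tsvector directly from the saved lexeme array
UPDATE names
SET vec = array_to_string(lexemes, ' ')::tsvector;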