PostgreSQL full text search randomly dropping lexemes
(All this was done with PostgreSQL 9.0.1 on Ubuntu 10.04, FWIW.)
I've tried writing my own lexize function for PostgreSQL specifically for names, based on the double metaphone algorithm. To produce the C lexize function, I started with the dict_xsyn example, and basically swapped double metaphone in for the synonym lookup.
But maybe 20% of the time to_tsvector is invoked, it seems to drop lexemes. As an example, I created a table containing the US Census Bureau's list of names.
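The table itself is nothing special; a minimal sketch of the setup, assuming the census list was exported to CSV first (the file name is a placeholder):
-- columns match the census surname list queried below
CREATE TABLE names (
    name    text,
    freq    numeric,
    cumfreq numeric,
    rank    integer
);
-- load from a local CSV export
\copy names FROM 'census_surnames.csv' WITH (FORMAT csv)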
db=# select * from names order by rank limit 8;
name | freq | cumfreq | rank
----------+-------+---------+------
SMITH | 1.006 | 1.006 | 1
JOHNSON | 0.81 | 1.816 | 2
WILLIAMS | 0.699 | 2.515 | 3
JONES | 0.621 | 3.136 | 4
BROWN | 0.621 | 3.757 | 5
DAVIS | 0.48 | 4.237 | 6
MILLER | 0.424 | 4.66 | 7
WILSON | 0.339 | 5 | 8
(8 rows)
We can then add a tsvector column, and populate it with to_tsvector using the configuration built on my metaphone dictionary:
db=# alter table names add column vec tsvector;
ALTER TABLE
db=# update names set vec=to_tsvector('public.names', name);
UPDATE 88799
db=# select * from names order by rank limit 8;
name | freq | cumfreq | rank | vec
----------+-------+---------+------+-------------------------------
SMITH | 1.006 | 1.006 | 1 |
JOHNSON | 0.81 | 1.816 | 2 | 'ANSN':1 'JNSN':1 'johnson':1
WILLIAMS | 0.699 | 2.515 | 3 |
JONES | 0.621 | 3.136 | 4 |
BROWN | 0.621 | 3.757 | 5 |
DAVIS | 0.48 | 4.237 | 6 |
MILLER | 0.424 | 4.66 | 7 | 'MLR':1 'miller':1
WILSON | 0.339 | 5 | 8 | 'ALSN':1 'FLSN':1 'wilson':1
(8 rows)
A bunch of the vec fields are just empty! In fact:
db=# select count(1) from names where vec = to_tsvector('');
count
-------
41101
(1 row)
I can do that bulk update, and then count the mismatched fields repeatedly, and get different counts each time, because different rows are failing to have real vectors computed.
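To make that concrete, here is one way to show that different rows fail on different runs (the temp table is just scratch space):
-- snapshot the rows that came back empty on this run
CREATE TEMP TABLE bad_run AS
    SELECT name FROM names WHERE vec = to_tsvector('');
-- redo the bulk update, then count failures that were NOT in the last batch
UPDATE names SET vec = to_tsvector('public.names', name);
SELECT count(1) FROM names
WHERE vec = to_tsvector('')
  AND name NOT IN (SELECT name FROM bad_run);
-- a nonzero count means the set of failing rows moved between runs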
But my lexize function itself seems to be OK:
db=# alter table names drop column vec; alter table names add column lexemes varchar[];
ALTER TABLE
ALTER TABLE
db=# update names set lexemes=ts_lexize('dmetaphonedict', name);
UPDATE 88799
db=# select * from names order by rank limit 10;
name | freq | cumfreq | rank | lexemes
----------+-------+---------+------+----------------------
SMITH | 1.006 | 1.006 | 1 | {smith,SM0,XMT}
JOHNSON | 0.81 | 1.816 | 2 | {johnson,JNSN,ANSN}
WILLIAMS | 0.699 | 2.515 | 3 | {williams,ALMS,FLMS}
JONES | 0.621 | 3.136 | 4 | {jones,JNS,ANS}
BROWN | 0.621 | 3.757 | 5 | {brown,PRN}
DAVIS | 0.48 | 4.237 | 6 | {davis,TFS}
MILLER | 0.424 | 4.66 | 7 | {miller,MLR}
WILSON | 0.339 | 5 | 8 | {wilson,ALSN,FLSN}
MOORE | 0.312 | 5.312 | 9 | {moore,MR}
TAYLOR | 0.311 | 5.623 | 10 | {taylor,TLR}
(10 rows)
db=# select count(1) from names where coalesce(array_length(lexemes,1), 0) = 0;
count
-------
0
(1 row)
I can do that over and over and get 0 mismatches each time.
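If the "vec" column is kept alongside "lexemes" instead of being dropped first, the contradiction is visible row by row:
-- rows where ts_lexize produced lexemes but to_tsvector came back empty
SELECT name, lexemes
FROM names
WHERE vec = to_tsvector('')
  AND array_length(lexemes, 1) > 0
LIMIT 8;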
I'm creating the relevant text search dictionaries and configurations with:
CREATE OR REPLACE FUNCTION ddmetaphonedict_init(internal)
RETURNS internal
AS '$libdir/dict_dmetaphone'
LANGUAGE C STRICT;
CREATE OR REPLACE FUNCTION ddmetaphonedict_lexize(internal, internal, internal, internal)
RETURNS internal
AS '$libdir/dict_dmetaphone'
LANGUAGE C STRICT;
CREATE TEXT SEARCH TEMPLATE dmetaphonedict_template (
LEXIZE = ddmetaphonedict_lexize,
INIT = ddmetaphonedict_init
);
CREATE TEXT SEARCH DICTIONARY dmetaphonedict (
TEMPLATE = dmetaphonedict_template
);
COMMENT ON TEXT SEARCH DICTIONARY dmetaphonedict IS 'dictionary for names, using dmetaphone';
CREATE TEXT SEARCH CONFIGURATION names (COPY = english);
ALTER TEXT SEARCH CONFIGURATION names
    ALTER MAPPING FOR asciiword, asciihword, word, hword WITH dmetaphonedict;
Pretty much straight out of the dict_xsyn example.
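For context, the intended use (my assumption) is phonetic matching on the codes, e.g.:
-- match any name whose double metaphone code is JNSN; the raw tsquery
-- cast skips normalization, so the code's upper case is preserved
SELECT name, rank
FROM names
WHERE vec @@ 'JNSN'::tsquery
ORDER BY rank
LIMIT 5;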
What's going on? What can I do to figure out where these lexemes are being dropped?
ADDED: I just tried this with 9.1. The rate at which to_tsvector produces a dud tsvector is greatly reduced, ranging from 5 (just 5, not 5k) to 7,000 records out of 80,000+ per run. But the duds are still there.
I'd like to first eliminate to_tsvector() as a cause.
I might first try changing to_tsvector() so that it returns a constant. For example, change it to return "'ANSN':1 'JNSN':1 'johnson':1" no matter what its input arguments are (even if the input arguments are NULL). If your bulk update still leaves holes in the column "vec", I think that would clear to_tsvector().
Am I right about that?
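A sketch of that test without touching the C code: stand in a throwaway SQL wrapper for to_tsvector() (the wrapper name is mine):
-- non-STRICT SQL wrapper that ignores its arguments entirely
CREATE FUNCTION to_tsvector_const(regconfig, text) RETURNS tsvector AS
$$ SELECT '''ANSN'':1 ''JNSN'':1 ''johnson'':1'::tsvector $$
LANGUAGE sql;

UPDATE names SET vec = to_tsvector_const('public.names', name);
-- if holes remain now, the inputs to to_tsvector were never the problem
SELECT count(1) FROM names WHERE vec = to_tsvector('');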
Also consider a two-step process: use ts_lexize() to populate the column "lexemes", then another function to read the lexemes and produce the values in the column "vec".
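For step two, one sketch that bypasses to_tsvector entirely (assuming "vec" is re-added; the plain cast loses positions, which hardly matters for single-word names):
-- build each tsvector directly from the saved lexeme array
UPDATE names
SET vec = array_to_string(lexemes, ' ')::tsvector;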