开发者

Complex query help for Partial/Match on Address/Zip and Phone

Example Data:

$address_1  = '123 Main St.';
$address_2  = 'Suite 200';
$phone      = '1235551212';
$zip        = '12345';

Example Database:

record_id, address_1, address_2, zip, phone
123, '123 main street', '', '12345', '1234567890'
124, '500 E. Ninja Road', 'Suite #200', '12345-1111', '(321)654-0987'
125, '222 where 4 east circle', 'P.O. Box 3', '11111', '1-123-555-1212'

This can be separate queries but I would need to search for partial matches in the address and zip and full matches in the phone.

So $address_1 would partial match record_id 123 as '123 Main St.' is a variation of '123 main street'

$address_2 would match record_id 124 as 'Suite 200' is a variation of 'Suite #200'

$phone would match record_id 125 as '1235551212' is a variation of '1-123-555-1212'

$zip would match record_id 123 & 124 as '12345' is a variation of '12345-1111' and a match to '12345'

NOTE: Also the values could be switched, meaning that $address_1 could be formated like this: '123 main street' and record_id 123 could be like this: '123 Main St.' (This applies to all fields)

I have been suggested to try ILIKE, LIKE, SIMILAR, CITEXT and FTS (Free Text Search) all of which are great but I'm not sure开发者_运维技巧 how to implement them to get the results I desire. I don't mind running multiple queries for each like a query for $address_1 match and another for $address_2 match and so on. I also know that there will be false positives as well as false negatives but I'm hoping to be around 75% (or better) in accurancy.

One IMPORTANT note is that the Postgres Server is running version 7.4 and there are no plans to upgrade.

Also to add more complexity to the query there are multiple address_1, address_2, zip and phone (Think separate address/phone for like home and office)

Here are my first attempts at solving this problem:

  • Stack Question #1
  • Stack Question #2

I had an idea to produce the most common formats and then pass those as the parameters in the query.

Something like:

$address_1  = array(
   '123 Main St.', // original
   '123 main st.', // lower case
   '123 Main St.', // First Letter Upper Case
   '123 MAIN ST.', // ALL Upper Case
   '123 Main St',  // remove punctuation original
   '123 main st',  // remove punctuation lower case
   '123 Main St',  // remove punctuation First Letter Upper Case
   '123 MAIN ST',  // remove punctuation ALL Upper Case  
   '123 Main',     // remove last word original
   '123 main',     // remove last word lower case
   '123 Main',     // remove last word First Letter Upper Case
   '123 MAIN',     // remove last word ALL Upper Case 
   '123 Main%',    // remove last word original with wildcard
   '123 main%',    // remove last word lower case with wildcard
   '123 Main%',    // remove last word First Letter Upper Case with wildcard
   '123 MAIN%'     // remove last word ALL Upper Case with wildcard
);

Then the query would be something like this:

SELECT * 
FROM tbl_name
WHERE address_1 IN (
   '123 Main St.', '123 main st.', '123 Main St.',
   '123 MAIN ST.', '123 Main St', '123 main st',
   '123 Main St', '123 MAIN ST', '123 Main',
   '123 main', '123 Main' '123 MAIN',
   '123 Main%', '123 main%', '123 Main%', 
   '123 MAIN%'
)

Just seems like a ton of variations I would have to make and I'm still not sure if this would be the most optimal way.

UPDATE:

Well this kinda works (from Stack Question #2)

SELECT * 
FROM tbl_name
WHERE LOWER(address_1) ILIKE LOWER('123 Main%')

With using the UNION method (from Stack Question #1) for each addition address field search like Office and Home


If I understand you correctly, you need fragmentary matches from field to field (e.g. '123" matches '123-111' and '12345' but not '122234' or '122-345') and you need to avoid all symbols when matching. Is this correct?

The psuedocode below is what I would try first:

  • Process all input fields to remove symbols
  • Tokenize the fields to determine fragments to match
  • Perform matches on fragments to data values

It sounds like the best way to achieve this would be through ILIKE queries where the percentage signs are are the beginning and the end of the fragment, such as:

SELECT * FROM [TABLE_NAME] WHERE address_1 ILIKE '%fragment%'

You would have to perform this query multiple times, which could be a performance headache, but I believe that may get you what you want.


I have a few suggestions for approaches.

I would strongly consider stripping all non-numerics from phone number. You can do this in postgres with something like this

SELECT id FROM [table] WHERE regexp_replace(phone, '[^0-9]', '', 'g') = [$phone];

I am not sure if this will work for you, but many address verification systems ignore everything except the numbers for matching purposes.

For example,

12-34 E. 5th street
APT 6B
City, ST 78910

would match 12345678910. which you could implement with something like this...

select * FROM [table] 
WHERE regexp_replace(address_1 || address_2 || zip, '[^0-9]', '', 'g') = $addressNumbers;

additionally you could investigate the postgres SoundEx extentions . http://www.postgresql.org/docs/8.3/static/fuzzystrmatch.html which may help simplify correcting spelling variations, casing, spacing, punctuation etc.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜