开发者

using REPLACE in WHERE clause to check spelling permutations - MS SQL

I have a table like:

| id | lastname | firstname |
|  1 | doe      | john      |
|  2 | oman     | donald    |
|  3 | o'neill  | james     |
|  4 | onackers | sharon    |

Essentially, users are going to be searching by the first letters of the last name.

I want to be able to return results that contain and don't contain punctuation from the database. For instance, when a user searches for: on

I want to return both: o'neill, onackers

I want someone to be able to search "o, on, oneill, o neill, etc" to get o'neill.

So the best way to do this seems to take the lastname column value and have two permutations of it searched in the WHERE clause with an OR. One where any special characters are replaced with the _ in SQL, and one where all non-alpha chars (including spaces) are gone.

I figure I can use the underscore in the SQL replace to keep the one space available.

I'm having a little trouble with the WHERE clause. I'd prefer to do this with a simple REPLACE rather than creating a regex function if possible. If that's a no-go though, I understand:

@la开发者_高级运维st_name (this is the nvarchar input)

SELECT id, lastname, firstname
FROM people
WHERE ((REPLACE(people.lastname, '[^A-Za-z]', '_') like @last_name + '%')
OR (REPLACE(people.lastnname,'[^A-Za-z ]', '') like @last_name + '%'))
ORDER BY lastname

I'm pretty sure the replace part has to be on the other side of the LIKE. I'm messing up the structure but need some help.

I am Using MSSQL Server 2005.

Thank you so much in advance.

UPDATE

It seems like I have two options:

  1. Create a regular expression function using CLR (excuse me if I'm saying this wrong, I'm new to it)
  2. Create extra columns on the table or create a new "fuzzyTable" with the cleaned up last names.

The database gets updated once a night. I have actually already begun the new table approach, as it was what I was originally going to do. However, I'm beginning to think it's smarter to add the "fuzzy" columns to the main table and then on the nightly update to add the adjusted lastnames to the new / updated rows.

Stack Overflow: Which approach is better? User-defined REGEX function I can use in the SQL, and thus avoid extra columns? Or adding the extra column or two to the table? Or a new table?


Depending on how complex your scenario can get, this will be lots of work, and slow too. But there's a more flexible approach. Consider something like this, referred to as initialTable:

| id | lastname | firstname |
|  1 | o'malley | josé      |
|  2 | omállèy  | dònáld    |
|  3 | o'neill  | jámès     |
|  4 | onackers | sharon    |

Maybe a bit much, but it illustrates the general problem. I had to implement a "fuzzy" search for our intranet website based on character data that looked very similar - there's many accents in french or spanish names or street addresses for example.

What I did was define a function that performed all replacements for a given string, for example (pseudocode):

function string replacestuff(string input)
{
  input = replace(input, "è", "e");
  input = replace(input, "é", "e");
  input = replace(input, "ò", "o");
  input = replace(input, "ó", "o");
  input = replace(input, "'", "");
  ...
  return input;
}

Using this conversion function, create a second table fuzzyTable that has the following content:

| id | lastname | firstname |
|  1 | omalley  | jose      |
|  2 | omalley  | donald    |
|  3 | oneill   | james     |
|  4 | onackers | sharon    |

Now, assume you'll get an input string for your search of josè. This can't be found in either table. What you'll have to do is this:

declare @input varchar(50)
declare @input_mod varchar(50)
set @input = 'josè'
set @input_mod = replacestuff(@input)

SELECT id FROM initialTable WHERE firstname like @input OR firstname like @input_mod
UNION
SELECT id FROM fuzzyTable WHERE firstname like @input OR firstname like @input_mod
GROUP BY id

(Of course, you'd have to add % to make LIKE work.) The key here is to modify your input search string using the replacement function; this way you'll get a match if searching for against a content of because both come down to se when being processed by the replacement function.

You could even do a two-level search; first check only the unmodified string against the proper table and then with the statement shown above do a fuzzy search if the user says so.

This is a very flexible approach and can handle all sorts of stuff, like finding german letters ä, ö, ü, ß by using two-letter expressions ae, oe, ue, ss. The disadvantage is that you'll have to keep duplicates of some data, and change those duplicates within fuzzyTable as the initialTable (or the replacement function) changes. In our current use case, the intranet database gets updated once a night, so it's not a problem.

EDIT

You need to be aware that, using this, in some cases you'll get false positives. For example, we're using this for an employee search, and if you've got a dutch name spelled Hoek, you'd also find this name searching for Hök, because in german the replacement for ö would be oe. This could be solved using country-aware replacement functions, but we never took the concept this far. Depending on your input data this is more or less academic, for our use case I can't remember anyone complaining.

The main reason why we came up with this approach in the first place was that some of the data we had to work with was riddled with spelling errors, ie. in french many vowels were accented the wrong way around, but still we needed to deliver a result.


I believe the trouble you're having is that SQL-Server's repalce function doesn't accept [^A-Za-z] to mean "non-alpa characters". Instead it's actually looking for that exact string to replace it.

http://msdn.microsoft.com/en-us/library/ms186862%28v=sql.90%29.aspx


In terms of using Regular Expressions, I've only done that by using the CLR, which seems to be getting much too involved for this particular problem.


My advice would be to hold the searchable fields, in the two different formats, in the table itself. And then use a simple LIKE search.

WHERE last_name LIKE @last_name OR last_name_stripped LIKE @last_name

last_name_stripped could then be a computed column (maybe using a function to strip all non_alpha characters), or handled by your client at Insert time.


Using:

WHERE ( REPLACE(people.lastname, '[^A-Za-z]', '') LIKE @last_name + '%' )

or

WHERE ( ComplexFunction( field ) LIKE whatever )

will most likely have the result that your query will not use the index (if there is one) of field people.lastname and thus scan the whole table every time you run the query.

I see two ways to avoid this:

One, add another field lastnameStripped to the table, where the ComplexFunction(lastname) is stored and an index to this field. Then you can search with either:

WHERE ( lastnameStripped LIKE REPLACE(@last_name, '[^A-Za-z]', '') + '%' )

or

WHERE ( lastnameStripped LIKE @last_name + '%' )

and both will use the index of lastnameStripped.

Two, create an indexed view with the ComplexFunction( lastname ) as a field.


If you need to do relatively complex lookups on a column on a large table, it could be more efficient to create a second column that contains the data formatted for efficient searches (with the immediate caveat that "like" searches are rarely efficient). So where you have column LastName, add a new column like LastNameLookup, and populate that column with the data formatted appropriately for your search criteria. If the formatting rules are relatively simple, you could implement this as a computed column column; if performance is important, make it a persisted computed column.

Also to mention, SQL does not support regular expressions (though there is a limited form tied in to the LIKE clause in SQL 2008).


Hmm...using classic asp example. I'm guessing this is from a form. For this example I'm calling your textbox field 'namesearch'. So the page where you request.form("namesearch"), just assign strSearch = request.form("namesearch"). Then before you run it into the SQL query do something like this:

strSearch = request.form("namesearch") 'to get textbox info from form

strSearch = replace(strSearch," ", "") 'to remove spaces
strSearch = replace(strSearch,"'", "") 'to remove apostrophes

For the SQL

SELECT id, lastname, firstname FROM people WHERE people.lastname like '%"& strSearch &"%' ORDER BY lastname

Tested and works using VBScript and SQL 2005 Server


In my case i had a table in which i had phone numbers with dashes and i wanted to search a record with a phone number user enter (but user enter phone number without dashes)

using REPLACE in WHERE clause to check spelling permutations - MS SQL

so i did something like this

select * from rpcusttest
WHERE ( REPLACE(RPCustTest.CustomerID, '-', '') LIKE '7183877333' + '%' )

Now though user is enter a number without dahses it will search all records with dashes as well

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜