How do I find records when data entry has been inconsistent?
A group of people have been inconsistently entering data for a while. Some people will enter this:
101mxeGte - TS 200-10
And other people will enter this
101mxeGte-TS-200-10
The sad thing is, those are supposed to be identical records.
They will also search inconsistently. If a record was entered one way, some people will search the other way.
Now, I know all about how you can fix data entry for the future, but that's NOT what I am asking about. I want to know how it is possible to:
- Leave the data alone, but...
- Search for the right thing.
Am I asking for the impossible here?
The best thing I found so far was a suggestion to simply muck about with the existing data, using the REPLACE function in mySQL.
I am uncomfortable with this option, as it means it will certainly actively piss off half of the users. The unfocused angst of all i开发者_如何学JAVAs less than the active ire of half.
The problem is that it has to go both ways: Entering spaces in the query has to find both space and not-space entries, and NOT entering spaces ALSO has to find both space and not-space entries.
Thanks for any help you can offer!
The "ideal" solution is pretty straightforward:
- Decide what is the canonical way of representing a record
- When someone saves a record, canonicalize it before saving
- When someone searches for a record, canonicalize the input before searching for it
You could also write a small program to convert all existing data to the canonical form (you will have the code for it anyway, as "canonicalize" in steps 2 and 3 require that you write code that does so).
Edit: some specific information on how to canonicalize
With the sample data you give, the algorithm might be:
- Replace all spaces with hyphens
- Replace all runs of one or more hyphens with a single hyphen (a regex would be easiest for this -- actually, a regex can do both steps in one go)
Is there any practical problem with this approach?
Trim whitespaces from BOTH the existing data and the input of the search. That way the intended record(s) will always be returned. Hope your data size is small, though, because it's going to perform pretty poorly.
Edit: by "existing data" I meant "the query of existing data". My answer was based on assumption that the actual data could not be touched (which might not be correct).
If it where up to me, I'd have the data in the database updated with REPLACE, and on future searches when dealing with the given row remove all spaces in the input.
Presumably your users enter the search terms (or record details, when creating a record) in an HTML form, which then goes to a PHP script. It looks like your data can always be written in a way that contains no spaces, so why don't you do this:
- Run a query that strips spaces from the existing data
- Add code in the PHP script(s) that receives the form(s), so that it strips spaces from submitted data - whether that data is to be used for search or for writing new data.
Edit: I guess you would also need to change some spaces to hyphens. Shouldn't be too hard to write logic to accomplish that.
Something like this.
pseudo code:
$myinput = mysql_real_escape_string('101mxeGte-TS-200-10')
$query = " SELECT * FROM table1
WHERE REPLACE(REPLACE(f1, ' ', ''),'-','')
= REPLACE(REPLACE($myinput, ' ', ''),'-','') "
Alternatively you might write your own function to trim the data so it can be compared.
DELIMITER $$
CREATE FUNCTION myTrim(AStr varchar) RETURNS varchar
BEGIN
declare Result varchar;
SET Result = REPLACE(AStr, ' ','');
SET Result = ......
.....
RETURN Result;
END$$
DELIMITER ;
And then use this in your select
$query = " SELECT * FROM table1
WHERE MyTrim(f1) = MyTrim($myinput) "
have you ever heard of SQL's LIKE? http://dev.mysql.com/doc/refman/4.1/en/string-comparison-functions.html there's also regex http://dev.mysql.com/doc/refman/4.1/en/regexp.html#operator_regexp
101mxeGte - TS 200-10
101mxeGte-TS-200-10
how about this?
SELECT 'justalnums' REGEXP '101mxeGte[[:blank:]]*(\-[[:blank:]]*)?TS[[:blank:]-]*200[[:blank:]-]*10'
digits can be represented by [0-9] and alphas as [a-z] or [A-Z] or [a-zA-Z] append a + to make then multiple of that. perens allow you to group and even capture what is in the perens and reuse it later in a replace or something else.
RLIKE is the same as REGEXP.
精彩评论