How do I find records when data entry has been inconsistent?

2023-02-26 01:45 问答作者：

A group of people have been inconsistently entering data for a while. Some people will enter this:

101mxeGte - TS 200-10

And other people will enter this

101mxeGte-TS-200-10

The sad thing is, those are supposed to be identical records.

They will also search inconsistently. If a record was entered one way, some people will search the other way.

Now, I know all about how you can fix data entry for the future, but that's NOT what I am asking about. I want to know how it is possible to:

Leave the data alone, but...
Search for the right thing.

Am I asking for the impossible here?

The best thing I found so far was a suggestion to simply muck about with the existing data, using the REPLACE function in mySQL.

I am uncomfortable with this option, as it means it will certainly actively piss off half of the users. The unfocused angst of all i开发者_如何学JAVAs less than the active ire of half.

The problem is that it has to go both ways: Entering spaces in the query has to find both space and not-space entries, and NOT entering spaces ALSO has to find both space and not-space entries.

Thanks for any help you can offer!

The "ideal" solution is pretty straightforward:

Decide what is the canonical way of representing a record
When someone saves a record, canonicalize it before saving
When someone searches for a record, canonicalize the input before searching for it

You could also write a small program to convert all existing data to the canonical form (you will have the code for it anyway, as "canonicalize" in steps 2 and 3 require that you write code that does so).

Edit: some specific information on how to canonicalize

With the sample data you give, the algorithm might be:

Replace all spaces with hyphens
Replace all runs of one or more hyphens with a single hyphen (a regex would be easiest for this -- actually, a regex can do both steps in one go)

Is there any practical problem with this approach?

Trim whitespaces from BOTH the existing data and the input of the search. That way the intended record(s) will always be returned. Hope your data size is small, though, because it's going to perform pretty poorly.

Edit: by "existing data" I meant "the query of existing data". My answer was based on assumption that the actual data could not be touched (which might not be correct).

If it where up to me, I'd have the data in the database updated with REPLACE, and on future searches when dealing with the given row remove all spaces in the input.

Presumably your users enter the search terms (or record details, when creating a record) in an HTML form, which then goes to a PHP script. It looks like your data can always be written in a way that contains no spaces, so why don't you do this:

Run a query that strips spaces from the existing data
Add code in the PHP script(s) that receives the form(s), so that it strips spaces from submitted data - whether that data is to be used for search or for writing new data.

Edit: I guess you would also need to change some spaces to hyphens. Shouldn't be too hard to write logic to accomplish that.

Something like this.

pseudo code:

$myinput = mysql_real_escape_string('101mxeGte-TS-200-10')
$query = " SELECT * FROM table1 
           WHERE REPLACE(REPLACE(f1, ' ', ''),'-','') 
                 = REPLACE(REPLACE($myinput, ' ', ''),'-','') "

Alternatively you might write your own function to trim the data so it can be compared.

DELIMITER $$

CREATE FUNCTION myTrim(AStr varchar) RETURNS varchar
BEGIN
  declare Result varchar;
  SET Result = REPLACE(AStr, ' ','');
  SET Result = ......
  .....
  RETURN Result;
END$$

DELIMITER ;

And then use this in your select

$query = " SELECT * FROM table1 
           WHERE MyTrim(f1) = MyTrim($myinput) "

have you ever heard of SQL's LIKE? http://dev.mysql.com/doc/refman/4.1/en/string-comparison-functions.html there's also regex http://dev.mysql.com/doc/refman/4.1/en/regexp.html#operator_regexp

101mxeGte - TS 200-10
101mxeGte-TS-200-10
how about this?
SELECT 'justalnums' REGEXP '101mxeGte[[:blank:]]*(\-[[:blank:]]*)?TS[[:blank:]-]*200[[:blank:]-]*10'

digits can be represented by [0-9] and alphas as [a-z] or [A-Z] or [a-zA-Z] append a + to make then multiple of that. perens allow you to group and even capture what is in the perens and reuse it later in a replace or something else.

RLIKE is the same as REGEXP.

继续阅读：php string whitespace

How do I find records when data entry has been inconsistent?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？