Matching arrays by values

I have multiple entries in a temporary database table, and I need to merge them into permanent entries. The information comes from multiple XML feeds. I have all sorts of fields, but the closest thing to a common key is the "title", or in my case the name of the product. Unfortunately, there are no shared IDs or anything like that, so the only way to match entries is by name. For example, I have:

$primary = array('feedid' => 2, 'entry_name' => 'ACME Product Black Model #23');
$secondary = array('feedid' => 3, 'entry_name' => 'ACME Product Model #23');

The ACME product name may vary from "ACME Product Model #23" to "Model 23" to "Black Model #23", etc. Also, the same feed may contain "ACME Product Model Black #22" and "CHOAM Product Black - Model 11".

The problem is that I can't just use similar_text() or levenshtein(), because they sometimes match the wrong items and sometimes don't match at all. Each feed has 100+ entries, and I can have up to about 10 feeds.

Edit: To put it in real terms: "iPhone 4", "iPhone 4 White" and "iPhone 4 Black" should all be merged (I can handle the merging; I need to match first). So the rule is to match the phones in this case. It could also be "Barby Doll White hair" and "Barby Doll Black Hair", but not "Some other Doll with White Hair". ...

Any ideas appreciated :)


In a comment you wrote:

I can't know what the feeds are going to be matching exactly

Well, if you cannot tell, how should anybody else be able to tell you?

You first need to solve your base problem (getting the model number from the string) before you can continue.

If you can't parse a string, throw an exception, output the model string you were unable to match, then analyse it and tweak your parser.

You can more or less easily parse strings by using regular expressions:

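// Capture the trailing digits after "Model", allowing one optional word (e.g. a colour) and an optional "#".
$r = preg_match('/Model(?: \w+)? #?(\d+)$/', $string, $matches);
// preg_match() returns 0 on no match and false on error; treat both as a parse failure.
if (!$r) throw new Exception(sprintf('Unable to parse "%s"', $string));
$modelNumber = $matches[1];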

That, for example, works with the sample data you've given. But analysing the input is up to you; it cannot be answered specifically.
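As a rough sketch of how the extracted number could then be used for matching, the entries can be grouped under it, with unparsable names collected for later inspection. The helper name extractModelNumber() and the sample rows are hypothetical; only the pattern comes from the answer above.

```php
<?php
// Hypothetical helper: returns the model number, or null when the name does not parse.
function extractModelNumber(string $name): ?string
{
    if (preg_match('/Model(?: \w+)? #?(\d+)$/', $name, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

// Sample rows shaped like the arrays in the question (made up for illustration).
$entries = [
    ['feedid' => 2, 'entry_name' => 'ACME Product Black Model #23'],
    ['feedid' => 3, 'entry_name' => 'ACME Product Model #23'],
    ['feedid' => 3, 'entry_name' => 'ACME Product Model Black #22'],
    ['feedid' => 4, 'entry_name' => 'iPhone 4 White'],
];

$groups = [];    // model number => list of entries sharing it
$unparsed = [];  // names the pattern could not handle

foreach ($entries as $entry) {
    $model = extractModelNumber($entry['entry_name']);
    if ($model === null) {
        $unparsed[] = $entry['entry_name'];
        continue;
    }
    $groups[$model][] = $entry;
}
```

Note that grouping by the number alone would also merge two different brands that happen to share a model number, so a real key would probably need the brand as well. Names such as "iPhone 4 White" end up in $unparsed, which is exactly the list to analyse when tweaking the pattern.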


I think it is worth going with the preg_match() approach that hakre suggests.

I would go like this:

  1. (Optionally) Add one more tinyint field, called flag, to the old temporary table.

  2. Run preg_match() first and, on success, set a positive flag on the old table to indicate that the record was handled successfully by preg_match().

  3. If preg_match() fails, fall back to text similarity, as hakre also suggests, and set a flag indicating that the record was handled by text similarity.

In the end, I hope a large percentage of the records will have been handled by preg_match() and only a few will carry a flag indicating "text similarity" handling. This would make the problem smaller, I think, wouldn't it?

If you later find a better solution, you can use the flag to find the records that were not handled by preg_match().
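The flagging steps above could be sketched roughly as follows. The flag values, the function name classifyEntry() and the 80% threshold are all assumptions made for illustration; the threshold in particular would need tuning against real feed data.

```php
<?php
// Hypothetical flag values for the tinyint column.
define('FLAG_UNMATCHED', 0);
define('FLAG_REGEX', 1);
define('FLAG_SIMILARITY', 2);

// Try preg_match() first, then fall back to similar_text() against known names.
function classifyEntry(string $name, array $knownNames, float $minPercent = 80.0): array
{
    if (preg_match('/Model(?: \w+)? #?(\d+)$/', $name, $m) === 1) {
        return ['flag' => FLAG_REGEX, 'key' => $m[1]];
    }

    $best = null;
    $bestPercent = 0.0;
    foreach ($knownNames as $candidate) {
        // similar_text() writes the similarity percentage into its third argument.
        similar_text(strtolower($name), strtolower($candidate), $percent);
        if ($percent > $bestPercent) {
            $bestPercent = $percent;
            $best = $candidate;
        }
    }

    if ($best !== null && $bestPercent >= $minPercent) {
        return ['flag' => FLAG_SIMILARITY, 'key' => $best];
    }
    return ['flag' => FLAG_UNMATCHED, 'key' => null];
}
```

With the default threshold, "iPhone 4 White" does not match "iPhone 4" (the similarity is about 73%), while a threshold of 70 accepts it. That gap is one reason to keep the flag: similarity matches can be reviewed separately from the regex ones.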

Then, for retrieving the new data, I would go with text similarity, for example something like MySQL's LIKE '%string%'.

As for preg_match() being slow: you will only run this process once, so it shouldn't be a problem. In addition, I would add a conditioned loop so as not to exceed the max execution time.
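One possible shape for such a conditioned loop is a time budget checked before each record. The function name processWithBudget() and its parameters are hypothetical; the budget would be chosen somewhat below the configured max execution time.

```php
<?php
// Process rows until either all are done or the time budget is used up.
// Returns the number of rows actually handled.
function processWithBudget(array $rows, callable $handle, float $budgetSeconds): int
{
    $start = microtime(true);
    $done = 0;
    foreach ($rows as $row) {
        // Stop before starting another row once the budget is spent.
        if (microtime(true) - $start >= $budgetSeconds) {
            break;
        }
        $handle($row);
        $done++;
    }
    return $done;
}
```

Combined with the flag column, a later run could simply pick up the rows that are still unflagged.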
