开发者

Matching the best regex

In following table, how can I match 'FG2-4G4T5' so that the first returned row is the best match (which would be 'FG2-4G')?

part_match
^FG2|^FF2|^FF3-S|^FF4-GR
^FG2-4G
^FG2
 FG2-RGST

The following query doesn't work in this case because '^FG2|^FF2|^FF3-S|^FF4-GR' has greater l开发者_开发知识库ength:

SELECT * FROM parts WHERE 'FG2-4G4T5' REGEXP part_match ORDER BY length(part_match) DEC

Any help will be greatly appreciated. Thanks in advance.


Sometimes the K.I.S.S. solution is best. Keep another column that lists the actual length of text matched by the pattern?

part_match           part_match_length
======================================
^FG2|^FF2|^FF3|^FF4  3
^FG2-4G              6
^FG2                 3

and thus...

SELECT * FROM parts
WHERE 'FG2-4G4T5' REGEXP part_match
ORDER BY part_match_length DEC

Technically, the second column doesn't even have to necessarily be length - just some kind of indication of how good a match a given pattern provides.


Assuming your part_match is ALWAYS of the form ^xxxx and optionally EQUAL-SIZED ^yyyy|^zzzz|...

SELECT * FROM parts
WHERE 'FG2-4G4T5' REGEXP part_match
ORDER BY instr(concat(part_match,'|'),'|') DEC

This will give you the longest matching prefix.

To solve the inclusion of FULL matches, take one off the length of part matches, so,

SELECT * FROM parts
WHERE 'FG2-4G4T5' REGEXP part_match
ORDER BY CASE WHEN part_match LIKE '^%' THEN -1 else 0 end +
         instr(concat(part_match,'|'),'|') DEC

To solve REGEX involving multiple sections of unequal length, e.g. ^YYYY|^Z|ABC, you will first need to find a function that breaks that into multiple rows

part_match    | single_part
^YYYY|^Z|ABC    ^YYYY
^YYYY|^Z|ABC    ^Z
^YYYY|^Z|ABC    ABC

Then REGEXP by single_part referencing back the part_match in the answer. This is non-trivial but can be achieved with a Numbers table and judicious use of SUBSTR and INSTR.

Create a numbers table (run this once):

DROP PROCEDURE IF EXISTS CreateNumbersTable;

delimiter //
CREATE PROCEDURE CreateNumbersTable()
    LANGUAGE SQL
    NOT DETERMINISTIC
    CONTAINS SQL
    SQL SECURITY DEFINER
    COMMENT ''
BEGIN
  drop table if exists Numbers;
  create table Numbers (N int primary key);

  SET @x := 0;
  REPEAT 
    insert into Numbers values (@x);
    SET @x := @x + 1;
    UNTIL @x > 999 END REPEAT;
END//
delimiter ;

CALL CreateNumbersTable;

DROP PROCEDURE CreateNumbersTable;

Then you can use this query

select p.*
#  ,substr(p.part_match, N.N+1, locate('|', concat(p.part_match,'|'), N.N+2) -N.N -1)
#  ,length(substr(p.part_match, N.N+1, locate('|', concat(p.part_match,'|'), N.N+2) -N.N -1))
from parts p
inner join numbers N on N.N between 0 and length(p.part_match)
  and ((N.N = 0) or (substr(p.part_match, N.N, 1) = '|'))
WHERE 'FG2-RGST' REGEXP p.part_match
  and 'FG2-RGST' REGEXP substr(p.part_match, N.N+1, locate('|', concat(p.part_match,'|'), N.N+2) -N.N -1)
order by length(substr(p.part_match, N.N+1, locate('|', concat(p.part_match,'|'), N.N+2) -N.N -1)) DESC

Uncomment lines 2 and 3 to see the parts of part_match it matches on.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜