Regex, PHP: how to negate a capturing group

I want to analyse a MySQL query with a PHP regex, that is, extract the select_expr and the table_references from a MySQL statement. For example, here are two MySQL queries that I would like my regex to match:

select id, name from table

select id, name

From such a query I would like to extract two parts: the "id, name" information and, when it is present, the "table" information.

The first part can actually contain a string like CONCAT('id','.','nom') AS alias,

and the second part can look like: table t INNER JOIN table2 t2 ON t.id=t2.user_id.

So I tried this "I know it's not working, but it will get me on the road" regex:

'!select (.*)( from (.*))?!i'

And of course, the first capturing group swallows everything up to the end of the string, which is not what I want.

In the

select id, name from table

string, it matches "id, name from table" as the first part, which is not what I want. (I want "id, name" as the first part and "table" as the second in this case.)

What I would like to do from this point is to tell the regex that the first capturing group should not match the " from " sequence if it is found. I know there is the negated character class feature, [^a-z], but that only negates a single character and not a whole string (a sequence of letters in the right order).

Can you shed any light on this? Can a regex negate the content of a parenthesised group, for example?


The last bit of your question makes it sound like the 'from' part of your query is optional, is that right?

If so, then try this:

!^select (.*?)(?: from (.*))?$!i

This will match everything between "select" and "from", if "from" is found, otherwise it will just match everything after "select".

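Adding the ? in ".*?" makes the '*' non-greedy: it takes as few characters as possible and only grows when the rest of the expression cannot match yet. I also added the '?:', which makes the second group a non-capturing group, since there is no useful info to read from it; the table references are still captured by the group inside it. Finally, the expression is wrapped in ^ and $ to anchor it to the start and end of the string. The $ matters here: without it, the non-greedy group would be allowed to stop immediately and capture nothing.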
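As a rough sketch of how that pattern behaves on the two queries from the question, it can be wrapped in a small helper. The function name split_select and the array keys are made up for this illustration; only the pattern comes from the answer above.

```php
<?php
// Hypothetical helper (not part of the original answer): splits a simple
// "select ... [from ...]" statement into its two parts.
function split_select(string $sql): ?array
{
    $pattern = '!^select (.*?)(?: from (.*))?$!i';
    if (preg_match($pattern, $sql, $m) !== 1) {
        return null; // not a statement this simple pattern understands
    }
    return [
        'select_expr'      => $m[1],
        // A trailing group that took no part in the match is absent from $m.
        'table_references' => $m[2] ?? null,
    ];
}

$with_from    = split_select('select id, name from table');
$without_from = split_select('select id, name');

var_dump($with_from, $without_from);
```

Because the outer group is non-capturing, the table references end up in $m[2] rather than $m[3]; the ?? fallback covers the query that has no FROM part at all.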

If 'from' is NOT optional though, then it is a lot easier and you can just use this:

!^select (.*) from (.*)$!i


Try this out:

$string = "select id, name, CONCAT('id','.','nom') AS alias as a from table t INNER JOIN table2 t2 ON t.id=t2.user_id";
preg_match_all("!select (.*) from (.*)!i", $string, $result);
var_dump($result);

I just tested it and it works just fine.


The problem is that you use greedy matching. That is, your first .* group grabs as many characters as it can and only gives some back if the rest of your regex fails to match. Since the FROM clause is optional, the rest never fails, and your first group just matches everything. The solution is to use non-greedy matching, by adding a ? after the * (it also works for +), and to anchor the end of the pattern with $ so that the non-greedy group cannot simply stop at zero characters.

'!select (.*?)( from (.*))?$!i'
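To see the difference between greedy and non-greedy matching on the question's query, the first capture group can be compared across three variants of the pattern. This is only an illustrative sketch; the labels and variable names are invented here. Note that the non-greedy group only captures the column list once the pattern is anchored at the end with $.

```php
<?php
$query = 'select id, name from table';

// Three variants of the same pattern, differing only in greediness
// and in the presence of the trailing $ anchor.
$patterns = [
    'greedy'         => '!select (.*)( from (.*))?!i',
    'lazy'           => '!select (.*?)( from (.*))?!i',
    'lazy, anchored' => '!select (.*?)( from (.*))?$!i',
];

$first_group = [];
foreach ($patterns as $label => $pattern) {
    preg_match($pattern, $query, $m);
    $first_group[$label] = $m[1];
    printf("%-15s => '%s'\n", $label, $m[1]);
}
```

The greedy group takes the whole rest of the string, the unanchored non-greedy group is satisfied with an empty capture, and only the anchored non-greedy version stops right before " from ".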

That should be enough for your simple case. If you want to parse a whole query, though, it's actually much, much easier to parse SQL statements backwards. For instance, take a fully featured SQL query:

SELECT foo FROM bar WHERE cond GROUP BY col HAVING stuff ORDER BY this

If you strrev it, you get:

siht YB REDRO ffuts GNIVAH loc YB PUORG dnoc EREHW rab MORF oof TCELES

With that in mind, you can easily split it with a regex without ending up with a LISPesque pile of nested parentheses. Here's a commented regex I made to match such a string (you would need to put it back on a single line with no whitespace or comments).

^ // match the beginning
    (.+\s+YB\s*REDRO)?\s* // is there an ORDER BY?
    (.+\s+GNIVAH)?\s* // is there a HAVING?
    (.+\s+YB\s*PUORG)?\s* // is there a GROUP BY?
    (.+\s+EREHW)?\s* // is there a WHERE?
    (.+\s+MORF)?\s* // is there a FROM?
    .+\s+TCELES // there is a SELECT
$ // match the end

Now all you have to do is strrev your results back, and voilà: you get a well-split query.

EDIT: We can use non-capturing groups and named groups to improve the regex. Right now, each capture group contains a whole clause, keyword included, and we can only tell the clauses apart by the keyword they start with. If we captured the clauses without their keywords, it would get pretty confusing to tell what's what from numbered groups alone. Named groups solve this problem.

Non-capturing groups are groups that don't appear in the regex results. They start with ?: and are useful for making a block optional (like (?:stuff here)?) without having to deal with it in the results.

Here's the new regex. I've also just learned about the x modifier, which makes PCRE ignore whitespace and accept # comments inside regexes, so let's use that to make a valid snippet.

$regex = "/^
    (?:(?<orderby>.+)\s+YB\s*REDRO)?\s* # is there an ORDER BY?
    (?:(?<having>.+)\s+GNIVAH)?\s*      # is there a HAVING?
    (?:(?<groupby>.+)\s+YB\s*PUORG)?\s* # is there a GROUP BY?
    (?:(?<where>.+)\s+EREHW)?\s*        # is there a WHERE?
    (?:(?<from>.+)\s+MORF)?\s*          # is there a FROM?
    (?<select>.+)\s+TCELES              # there is a SELECT
$/msix";

$query = "SELECT foo FROM bar WHERE cond GROUP BY col HAVING stuff ORDER BY this";

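preg_match($regex, strrev($query), $matches);
// reverse every captured clause back into its original reading order
foreach ($matches as &$match) {
    $match = strrev($match);
}
unset($match); // break the reference left behind by the foreach

// now we can use $matches['from'] to get the FROM clause
echo $matches['from'];

print_r($matches);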
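
The same idea can be packaged as a function and tried on a query that lacks some of the clauses. This is a sketch only: the name split_query is hypothetical, and the choice to return just the named groups is an assumption made for this example. The pattern itself is the one shown above.

```php
<?php
// Hypothetical wrapper (the name split_query is not from the answer).
function split_query(string $query): ?array
{
    // Same pattern as above; single quotes keep the backslashes and $ literal.
    $regex = '/^
        (?:(?<orderby>.+)\s+YB\s*REDRO)?\s* # is there an ORDER BY?
        (?:(?<having>.+)\s+GNIVAH)?\s*      # is there a HAVING?
        (?:(?<groupby>.+)\s+YB\s*PUORG)?\s* # is there a GROUP BY?
        (?:(?<where>.+)\s+EREHW)?\s*        # is there a WHERE?
        (?:(?<from>.+)\s+MORF)?\s*          # is there a FROM?
        (?<select>.+)\s+TCELES              # there is a SELECT
    $/msix';

    if (preg_match($regex, strrev($query), $matches) !== 1) {
        return null; // not a SELECT statement
    }

    // Keep only the named groups and reverse each clause back.
    $clauses = [];
    foreach ($matches as $name => $value) {
        if (is_string($name)) {
            $clauses[$name] = strrev($value);
        }
    }
    return $clauses;
}

$clauses = split_query('SELECT foo FROM bar WHERE cond');
print_r($clauses);
```

Clauses that are missing from the query come back as empty strings, so a caller can test them with a simple comparison against ''.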