PHP: Regex to ignore escaped quotes within quotes

I looked through related questions before posting this, but I couldn't adapt any of the answers to work with my code (I'm not good at regex).

Basically, here are my existing lines:

$code = preg_replace_callback( '/"(.*?)"/', array( &$this, '_getPHPString' ), $code );

$code = preg_replace_callback( "#'(.*?)'#", array( &$this, '_getPHPString' ), $code );

They match strings enclosed in '' and "" respectively. I need each regex to skip over escaped quotes inside the string it is matching, so data between '' should ignore \' and data between "" should ignore \".
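To make the problem concrete, here is a minimal sketch using a made-up input string (not taken from my real code) that shows where the lazy pattern stops:

```php
<?php
// Hypothetical input. The string is single-quoted, so it holds: "a \" b"
$code = '"a \\" b"';

// The existing lazy pattern ends the match at the first quote it sees,
// even though that quote is preceded by a backslash.
preg_match('/"(.*?)"/', $code, $m);

echo $m[0], "\n"; // prints: "a \"
```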

Any help would be greatly appreciated.


For most strings, you need to allow escaped anything (not just escaped quotes). For example, you most likely need to allow escaped characters like "\n" and "\t" and, of course, the escaped escape: "\\".

This is a frequently asked question, and one which was solved (and optimized) long ago. Jeffrey Friedl covers this question in depth (as an example) in his classic work: Mastering Regular Expressions (3rd Edition). Here is the regex you are looking for:

Good:

"([^"\\]|\\.)*"
Version 1: Works correctly but is not terribly efficient.

Better:

"([^"\\]++|\\.)*" or "((?>[^"\\]+)|\\.)*"
Version 2: More efficient if you have possessive quantifiers or atomic groups (See: sin's correct answer which uses the atomic group method).

Best:

"[^"\\]*(?:\\.[^"\\]*)*"
Version 3: More efficient still. Implements Friedl's "unrolling-the-loop" technique. Does not require possessive quantifiers or atomic groups (i.e. this can be used in JavaScript and other less-featured regex engines).

Here are the recommended regexes in PHP syntax for both double and single quoted sub-strings:

$re_dq = '/"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"/s';
$re_sq = "/'[^'\\\\]*(?:\\\\.[^'\\\\]*)*'/s";
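As a rough usage sketch, the two patterns can be checked with preg_match_all. The sample input below is made up for illustration; it is held in a nowdoc so that every backslash in it is a literal character.

```php
<?php
// Patterns copied from the answer above.
$re_dq = '/"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"/s';
$re_sq = "/'[^'\\\\]*(?:\\\\.[^'\\\\]*)*'/s";

// Hypothetical sample: a fragment of PHP source containing escaped quotes
// and an escaped backslash right before a closing quote.
$code = <<<'EOT'
echo "say \"hi\"" . 'it\'s' . "a\\";
EOT;

preg_match_all($re_dq, $code, $dq);
preg_match_all($re_sq, $code, $sq);

// $dq[0] holds the two double-quoted strings, $sq[0] the single-quoted one.
print_r($dq[0]);
print_r($sq[0]);
```

The last string, "a\\", is the case that simpler lookbehind-style patterns tend to get wrong: the closing quote follows a backslash, but that backslash is itself escaped.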


Try a regex like this:

'/"(\\\\[\\\\"]|[^\\\\"])*"/'

A (short) explanation:

"                 # match a `"`
(                 # open group 1
  \\\\[\\\\"]     #   match either `\\` or `\"`
  |               #   OR
  [^\\\\"]        #   match any char other than `\` and `"`
)*                # close group 1, and repeat it zero or more times
"                 # match a `"`

The following snippet:

<?php
$text = 'abc "string \\\\ \\" literal" def';
preg_match_all('/"(\\\\[\\\\"]|[^\\\\"])*"/', $text, $matches);
echo $text . "\n";
print_r($matches);
?>

produces:

abc "string \\ \" literal" def
Array
(
    [0] => Array
        (
            [0] => "string \\ \" literal"
        )

    [1] => Array
        (
            [0] => l
        )

)

as you can see on Ideone.


This has possibilities:

/"(?>(?:(?>[^"\\]+)|\\.)*)"/

/'(?>(?:(?>[^'\\]+)|\\.)*)'/
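A minimal sketch of how these two patterns might be plugged into preg_replace_callback. The regex backslashes are doubled again for PHP string escaping, and the closure is only a hypothetical stand-in for the asker's _getPHPString method, whose body is not shown in the question.

```php
<?php
// The two atomic-group patterns above, written as PHP string literals.
$re_dq = '/"(?>(?:(?>[^"\\\\]+)|\\\\.)*)"/';
$re_sq = "/'(?>(?:(?>[^'\\\\]+)|\\\\.)*)'/";

// Hypothetical callback: replaces every matched string with a placeholder.
$mask = function (array $m) {
    return 'STR';
};

// Made-up input; after PHP unescaping it reads: x = "a \" b"; y = 'c \' d';
$code = 'x = "a \\" b"; y = \'c \\\' d\';';

$out = preg_replace_callback($re_dq, $mask, $code);
$out = preg_replace_callback($re_sq, $mask, $out);

echo $out, "\n"; // prints: x = STR; y = STR;
```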


This leaves the quotes out of the match:

(?<=['"])(.*?)(?=["'])

With the global flag (/g) it matches every occurrence. PHP has no /g flag; preg_match_all does the same job.


This seems to be as fast as the unrolled loop, based on some cursory benchmarks, but is much easier to read and understand. It doesn't require any backtracking in the first place.

"[^"\\]*(\\.[^"\\]*)*"
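For anyone who wants to repeat the comparison, here is a rough timing sketch. The test data and the timeMatches helper are made up for illustration, and timings will vary by machine and PHP version, so no numbers are claimed here. It compares the capturing pattern above with the non-capturing unrolled-loop form and checks that both find the same matches.

```php
<?php
$capturing    = '/"[^"\\\\]*(\\\\.[^"\\\\]*)*"/';
$nonCapturing = '/"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"/';

// Hypothetical test data: 1000 copies of: foo "bar \" baz \\" qux
$subject = str_repeat('foo "bar \\" baz \\\\" qux ', 1000);

// Hypothetical helper: returns [match count, elapsed seconds, full matches].
function timeMatches(string $re, string $subject): array
{
    $start = microtime(true);
    $count = preg_match_all($re, $subject, $m);
    return [$count, microtime(true) - $start, $m[0]];
}

list($countA, $timeA, $matchesA) = timeMatches($capturing, $subject);
list($countB, $timeB, $matchesB) = timeMatches($nonCapturing, $subject);

printf("capturing:     %d matches in %.6f s\n", $countA, $timeA);
printf("non-capturing: %d matches in %.6f s\n", $countB, $timeB);
```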


According to the W3C XPath 2.0 specification: https://www.w3.org/TR/2010/REC-xpath20-20101214/#doc-xpath-StringLiteral

The general regex is:

"(\\.|[^"])*"

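(There is no need to exclude backslashes from the character class, because the escaped-character alternative is tried first. This holds for well-formed input; on an unterminated string the engine can backtrack and let [^"] consume a backslash.)

Explanation:

  • "..." matches anything between the quotes
  • (...)* the inside can have any length, from zero upwards
  • \\.|[^"] first accept any character preceded by a backslash; otherwise accept any character that is not a quote

A PHP version of the regex, with extra groups so that either kind of quote can be handled, could look like this: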

<?php
    $str='"First \\" \n Second" then \'This \\\' That\'';
    echo $str."\n";
    // "First \" \n Second" then 'This \' That'

    $RX_inQuotes='/"((\\\\.|[^"])*)"/';
    preg_match_all($RX_inQuotes,$str,$r,PREG_SET_ORDER);
    echo $r[0][1]."\n";
    // First \" \n Second

    $RX_inAnyQuotes='/("((\\\\.|[^"])*)")|(\'((\\\\.|[^\'])*)\')/';
    preg_match_all($RX_inAnyQuotes,$str,$r,PREG_SET_ORDER);
    echo $r[0][2]." --- ".$r[1][5];
    // First \" \n Second --- This \' That
?>

Try it: http://sandbox.onlinephpfunctions.com/code/4328cc4dfc09183f7f1209c08ca5349bef9eb5b4

Important note: when you are not sure what the input contains, add the u flag at the end of the regex (/.../u) so that multi-byte strings such as UTF-8 are not split in the middle of a character, or use multi-byte functions such as mb_ereg_match.
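A small sketch of what the u flag changes, using a made-up input. Without it, the dot in \\. consumes a single byte, so an escaped multi-byte character is split across two loop iterations. The overall match is the same either way; the difference shows up in the inner group, which keeps only the last iteration.

```php
<?php
// Hypothetical input: a quote, a backslash, the UTF-8 bytes of "é"
// (C3 A9), and a closing quote.
$str = "\"\\\xC3\xA9\"";

$bytes   = '/"((\\\\.|[^"])*)"/';
$unicode = '/"((\\\\.|[^"])*)"/u';

preg_match($bytes, $str, $a);
preg_match($unicode, $str, $b);

// Group 2 holds the last repetition of the inner group.
echo bin2hex($a[2]), "\n"; // a9     (a lone continuation byte)
echo bin2hex($b[2]), "\n"; // 5cc3a9 (backslash plus the whole character)
```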
