Regular expression literal-text span

Is there any way to tell a regular expression that a block of text is to be searched for literally? I ask because I have to match a very long piece of text that contains all sorts of metacharacters (and has to match exactly), followed by some flexible stuff (enough to merit the use of a regex), followed by more text that has to be matched exactly.

Rinse, repeat.

Needless to say, I don't really want to run through the entire thing and escape every metacharacter. That just makes it a bear to read. Is there a way to wrap those portions so that I don't have to do this?

Edit:

Specifically, I am using Tcl, and by "metacharacters", I mean that there's all sorts of long strings like "**$^{*$%\)". I would really not like to escape these. I mean, it would add thousands of characters to the string. Does Tcl regexp have a literal-text span metacharacter?


The normal way of doing this in Tcl is to use a helper procedure to do the escaping, like this:

proc re_escape str {
    # Every non-word char gets a backslash put in front
    regsub -all {\W} $str {\\&}
}

set awkwardString "**$^{*$%\\)"
regexp "simpleWord *[re_escape $awkwardString] *simpleWord" $largeString

Where you have a whole literal string, you have two other alternatives:

regexp "***=$literal" $someString
regexp "(?q)$literal" $someString

However, both of these only permit patterns that are pure literals; you can't mix patterns and literals that way.
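
As a small illustration of the two literal forms above, here is a sketch with made-up values (neither string comes from the question). It shows that both forms stop `.` and `*` from acting as metacharacters:

```tcl
# Illustrative values only; they are assumptions for this example.
set literal {a.b*c}
set someString {x a.b*c y}

# As an ordinary pattern, . and * keep their regex meaning,
# so unrelated text such as "aXbbbc" matches as well.
puts [regexp $literal {aXbbbc}]            ;# 1

# Both literal forms match only the exact characters a.b*c
puts [regexp "***=$literal" $someString]   ;# 1
puts [regexp "(?q)$literal" $someString]   ;# 1
puts [regexp "***=$literal" {aXbbbc}]      ;# 0
```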


No, Tcl does not have such a feature.

If you're concerned about readability you can use variables and commands to build up your expression. For example, you could do something like:

set fixed1 {.*?[]}  ;# match the literal five-byte sequence .*?[]
set fixed2 {???}    ;# match the literal three byte sequence ???
set pattern "this.*and.*that"

regexp "[re_escape $fixed1]$pattern[re_escape $fixed2]" $someString  ;# $someString stands for the text being searched

You would need to supply the definition for re_escape but the solution should be pretty obvious.
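
For completeness, here is one way the whole thing might look end to end. This is only a sketch: the `re_escape` body (backslash every non-word character) is one possible definition, and the `input` string is invented for the example.

```tcl
# Assumed helper: put a backslash in front of every non-word character.
proc re_escape {str} {
    return [regsub -all {\W} $str {\\&}]
}

set fixed1 {.*?[]}   ;# literal text, must match exactly
set fixed2 {???}     ;# literal text, must match exactly
set pattern "this.*and.*that"

# Hypothetical input, made up for this example.
set input {prefix .*?[]this one and also that??? suffix}

# Escaped literals on both sides, the real pattern in the middle.
set re "[re_escape $fixed1]$pattern[re_escape $fixed2]"
puts $re

if {[regexp -- $re $input match]} {
    puts $match
}
```

Keeping the literals in braces means they are written exactly as they appear in the data, and the escaping only happens at the point where the expression is assembled.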


A Tcl regular expression can begin with the (?q) metasyntax option to indicate that the rest of the expression is literal text:

% set string {this string contains *emphasis* and 2+2 math?}
% puts [regexp -inline -all -indices {*} $string]
couldn't compile regular expression pattern: quantifier operand invalid
% puts [regexp -inline -all -indices {(?q)*} $string]
{21 21} {30 30}

This does however apply to the entire expression.

What I would do is to iterate over the returned indices, using them as arguments to [string range] to extract the other stuff you're looking for.
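
A rough sketch of that loop, reusing the sample string from the session above. It assumes that "the other stuff" means the text lying between the literal hits, and it needs Tcl 8.5 or later for `lassign`:

```tcl
set string {this string contains *emphasis* and 2+2 math?}

# Each hit comes back as a {first last} pair of character indices.
set hits [regexp -inline -all -indices {(?q)*} $string]

# Walk the hits and collect the text between consecutive literal matches.
set pieces {}
set prevEnd -1
foreach hit $hits {
    lassign $hit first last
    lappend pieces [string range $string [expr {$prevEnd + 1}] [expr {$first - 1}]]
    set prevEnd $last
}
# Whatever follows the last hit.
lappend pieces [string range $string [expr {$prevEnd + 1}] end]

foreach piece $pieces {
    puts "<$piece>"
}
```

With the sample string this collects three pieces: the text before the first `*`, the word `emphasis`, and the text after the second `*`.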


I believe Perl and Java support the \Q...\E escape, so

\Q.*.*()\E

will actually match the literal ".*.*()".

OR

Bit of a hack, but you could replace each literal section with some text that does not need escaping and that will not appear elsewhere in the searched string, for example a 100-digit random sequence. Then build the regex using this metacharacter-free text. When the regex matches at a certain position and length in the doctored string, you can calculate where the match sits in the original string and what length it has.
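
In Tcl, the placeholder idea might be sketched like this. Everything here is an assumption made for illustration: the token is a short fixed word rather than a random sequence, the haystack is invented, and the position arithmetic only holds if the match never starts or ends in the middle of a token.

```tcl
set literal  {.*?[]}
set token    "ZZLITERALZZ"                ;# assumed never to occur in the real data
set haystack {head .*?[] 42 .*?[] tail}   ;# hypothetical input

# Doctor the searched string: swap every literal section for the token.
set doctored [string map [list $literal $token] $haystack]

# The regex now contains no awkward characters at all.
set re "$token +(\\d+) +$token"

if {[regexp -indices -- $re $doctored span]} {
    lassign $span first last
    set restore [list $token $literal]
    # Map the text before the match, and the match itself, back to the
    # original wording to recover the original offsets.
    set before  [string map $restore [string range $doctored 0 [expr {$first - 1}]]]
    set matched [string map $restore [string range $doctored $first $last]]
    set origFirst [string length $before]
    set origLast  [expr {$origFirst + [string length $matched] - 1}]
    puts [string range $haystack $origFirst $origLast]
}
```

Mapping the prefix back and measuring it avoids having to count how many tokens precede the match by hand.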
