
Can PCRE regex match a null character?

I have a text source with nulls in it and I need to pull them out along with my regex pattern. Can regex even match a null character?

I only realized they were there when my pattern refused to match; when I pasted the text into Notepad++, it showed all the null characters.


\x00

That is a null char.


One issue with matching the null character is that you first need to make sure it actually reaches the regex engine. Many languages use null-terminated strings, so the match may stop at the first null instead of running against the entire input.
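To illustrate the truncation problem, here is a minimal sketch in Python using ctypes to imitate how C code reads a buffer. The sample bytes and variable names are made up for the example; this is not PCRE itself, only a demonstration of what "null-terminated" does to data that contains a null.

```python
import ctypes

# Hypothetical sample input: six visible bytes with a NUL in the middle.
data = b"abc\x00def"

# The buffer stores every byte of data, plus one trailing NUL terminator.
buf = ctypes.create_string_buffer(data)

# Read the C-string way: stops at the first NUL.
as_c_string = buf.value

# Read with an explicit length: nothing is lost.
full_bytes = buf.raw[:len(data)]

print(as_c_string)
print(full_bytes)
```

If the subject is handed to the matcher as a plain C string, only the part before the first null is ever searched, so the subject has to be passed together with its length.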

As for how to express it in PCRE, \000 works: with all three octal digits written out, it cannot absorb any digits that follow it. The brace form \x{0} is self-delimiting in the same way (but the octal version is, in my opinion, easier to identify when skimming the regex).
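As a rough illustration of why the fully written-out escape is safer, here is a sketch using Python's re module. That is an assumption of convenience: it is not PCRE, and it has no \x{...} syntax, but it treats \000 and \x00 the same way for this purpose. The subject string is invented for the example.

```python
import re

# Hypothetical subject: NULs at index 2 and 5, the second followed by "1".
subject = "ab\x00cd\x001"

# Raw strings, so the regex engine itself interprets the escapes.
octal_nul = re.compile(r"\000")
hex_nul = re.compile(r"\x00")

octal_hits = [m.start() for m in octal_nul.finditer(subject)]
hex_hits = [m.start() for m in hex_nul.finditer(subject)]

# \000 followed by a digit still means "NUL, then the character 1".
nul_then_one = re.search(r"\0001", subject)

# The short form \0 swallows the next octal digit: \01 is character 1, not NUL.
short_form = re.search(r"\01", subject)

# Removing the NULs from the text.
cleaned = octal_nul.sub("", subject)

print(octal_hits, hex_hits, nul_then_one.span(), short_form, repr(cleaned))
```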

See the PCRE man pages and search for "Non-printing characters" for the full details of the different ways to specify a null.


To add a detail to the previous answer: the PCRE library accepts the pattern as a NUL-terminated C string. (Quoting the PCRE docs: "The pattern is a C string terminated by a binary zero".) That means the pattern cannot contain a literal NUL character; it must always be escaped using one of the forms described in the other answers. The docs continue: "Unlike the pattern string, the subject may contain binary zeroes." and "Though binary zero characters are supported in the subject string, they are not allowed in a pattern string because it is passed as a normal C string, terminated by zero. The escape sequence \0 can be used in the pattern to represent a binary zero."

NUL is the only character that cannot appear as a raw byte in a PCRE pattern; all other non-printing characters may be written literally: "There is no restriction on the appearance of non-printing characters, apart from the binary zero that terminates a pattern".

As a final comparative note, some other Perl-compatible regex engines do allow literal NULs in a pattern, for example Python's SRE. For instance, urllib.parse in Python 3 has the following line: _asciire = re.compile('([\x00-\x7f]+)'). Note the lack of an "r" prefix marking a raw literal: unescaping happens at the Python level, so the re module receives actual characters with values 0x00 and 0x7f in the pattern.
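The difference between the two spellings can be checked directly. The sketch below compiles the same character class with and without the raw prefix; the variable names and the sample string are only for illustration.

```python
import re

# Non-raw literal: Python unescapes first, so re receives a real NUL character.
literal = re.compile('([\x00-\x7f]+)')

# Raw literal: re receives the backslash sequences and unescapes them itself.
escaped = re.compile(r'([\x00-\x7f]+)')

literal_has_nul = '\x00' in literal.pattern
escaped_has_nul = '\x00' in escaped.pattern

# Both patterns match the same ASCII run, NUL included, and stop at the "é".
sample = "a\x00b\u00e9"
literal_match = literal.match(sample).group(1)
escaped_match = escaped.match(sample).group(1)

print(literal_has_nul, escaped_has_nul, repr(literal_match), repr(escaped_match))
```

With PCRE's C API the first form would not be possible, since the pattern would be cut off at the embedded NUL.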
