Using a Regular Expression to Find XML Character References for Control Characters
I need some help figuring out the regex for XML character references to control characters, in decimal or hex.
These sequences look like the following:
&#x1F;
&#31;
In other words, each one is an ampersand, followed by a pound sign (#), followed by an optional 'x' to denote hexadecimal mode, followed by 1 to 4 decimal (or hexadecimal) digits, followed by a semicolon.
I'm specifically trying to identify the sequences whose numeric value falls in the range decimal 0 to 31 (hexadecimal 0 to 1F), inclusive.
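To make the target set concrete, here is a rough Python sketch that is deliberately not a regex-only solution: it matches any numeric character reference of the shape described above and then checks the parsed value. The names CHAR_REF and control_refs are made up for illustration.

```python
import re

# Generic numeric character reference: "&#", optional "x", 1-4 digits, ";".
# Hex and decimal digits land in separate groups so each is parsed in its own base.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]{1,4})|([0-9]{1,4}));")

def control_refs(text):
    """Return the references in text whose value is in the range 0..31."""
    found = []
    for m in CHAR_REF.finditer(text):
        hex_digits, dec_digits = m.groups()
        value = int(hex_digits, 16) if hex_digits is not None else int(dec_digits)
        if value <= 31:
            found.append(m.group(0))
    return found

sample = "a&#31;b&#x1F;c&#32;d&#x20;e&#0009;f&#x000A;g"
print(control_refs(sample))  # ['&#31;', '&#x1F;', '&#0009;', '&#x000A;']
```

Anything this function returns is something a single-regex answer should match, and anything it skips (such as &#32; or &#x20;) is something the regex should reject.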
Can anyone figure out the regex for this?
If you use a zero-width lookahead assertion to restrict the number of digits, you can write the rest of the pattern without worrying about the length restriction. Try this:
&#(?=x?[0-9A-Fa-f]{1,4};)0*([12]?\d|3[01]|x0*1?[0-9A-Fa-f]);
Explanation:
(?=x?[0-9A-Fa-f]{1,4};) #Restricts the numeric portion to at most four digits, including leading zeroes. The semicolon inside the lookahead is what enforces the limit.
0* #Consumes leading zeroes if there is no x.
[12]?\d #Allows decimal numbers 0 - 29, inclusive.
3[01] #Allows decimal 30 or 31.
x0*1?[0-9A-Fa-f] #Allows hexadecimal 0 - 1F, inclusive, regardless of case or leading zeroes.
This pattern allows leading zeroes after the x, but the (?=x?[0-9A-Fa-f]{1,4};) part prevents them from occurring before an x.
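A quick way to sanity-check a lookahead pattern like this is to run it against a handful of references. The sketch below uses Python's re module with a variant of the pattern whose lookahead also requires the closing semicolon; without it, the {1,4} quantifier is satisfied by the first digit alone and the length limit is not enforced. The helper name is_control_ref and the sample lists are made up for illustration.

```python
import re

# Lookahead variant: the ";" inside (?=...) makes the 1-4 digit limit binding.
# re.ASCII keeps \d restricted to the ASCII digits 0-9.
PATTERN = re.compile(
    r"&#(?=x?[0-9A-Fa-f]{1,4};)0*([12]?\d|3[01]|x0*1?[0-9A-Fa-f]);",
    re.ASCII,
)

def is_control_ref(s):
    """True if s is, in its entirety, a reference to a value in 0..31."""
    return PATTERN.fullmatch(s) is not None

should_match = ["&#0;", "&#9;", "&#31;", "&#0031;",
                "&#x0;", "&#x1F;", "&#x1f;", "&#x001F;"]
should_not_match = ["&#32;", "&#x20;", "&#00031;", "&#x0001F;",
                    "&#00x1F;", "&#1A;", "&#31"]

for s in should_match:
    assert is_control_ref(s), s
for s in should_not_match:
    assert not is_control_ref(s), s
print("all checks passed")
```

The rejected cases cover the three ways a reference can fall outside the target: a value above 31, more than four digits, and zeroes placed before the x.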
&#(0{0,2}[1-2]\d|0{0,3}\d|0{0,2}3[01]|x0{0,2}[01]?[0-9A-Fa-f]);
It's not the most elegant, but it should work for references of up to four digits, with or without leading zeroes.
Verified in RegexBuddy.
I think the following should work:
&#(?:x0{0,2}[01]?[0-9a-fA-F]|0{0,2}(?:[012]?[0-9]|3[01]));
Here it is on Rubular:
http://www.rubular.com/r/VEYx25Fdpj
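For readers not using Ruby, the same pattern appears to carry over to Python's re module unchanged. The following is a minimal sketch under that assumption; the sample text is invented for illustration.

```python
import re

# Same pattern as above. It has no capturing groups, so findall returns whole matches.
PATTERN = re.compile(r"&#(?:x0{0,2}[01]?[0-9a-fA-F]|0{0,2}(?:[012]?[0-9]|3[01]));")

text = "ok&#x1F;&#31;&#9; kept: &#32; &#x20; &#65;"
print(PATTERN.findall(text))  # ['&#x1F;', '&#31;', '&#9;']
```

The references to 32 and above (&#32;, &#x20;, &#65;) are left alone, which is the behaviour the question asks for.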