Problem with regex for text parsing (similar to textile)
I'm banging my head against the wall trying to figure out a (regexp?) based parser rule for the following problem. I'm developing a text markup parser similar to Textile (using PHP), but I don't know how to get the inline formatting rules right, and I noticed that the Textile parsers I found cannot format the following text the way I would like:
-*deleted* -- text- and -more deleted text-
The result I want to have is:
<del><strong>deleted</strong> -- text</del> and <del>more deleted text</del>
What I do not want is:
<del><strong>deleted</strong> </del>- text- and <del>more deleted text</del>
Any ideas are much appreciated! Thanks very much!
UPDATE
I think I should have mentioned that '-' should still be a valid character (hyphen) :) -- for example, the following should be possible:
-american-football player-
Expected result:
<del>american-football player</del>
Based on the RedCloth library's parser description, with some modification for the double dash.
@
(?<!\S) # Start of string, or after space or newline
- # Opening dash
( # Capture group 1
(?: # : (see note 1)
[^-\s]+ # :
[-\s]+ # :
)*? # :
[^-\s]+? # :
) # End
- # Closing dash
(?![^\s!"\#$%&',\-./:;=?\\^`|~[\]()<]) # (see note 2)
@x
- Note 1: This should match up to the next dash lazily, while consuming any non-single dashes, and single dashes surrounded by whitespace.
- Note 2: Followed by space, punctuation, line break or end of string.
Or compacted:
@(?<!\S)-((?:[^-\s]+[-\s]+)*?[^-\s]+?)-(?![^\s!"#$%&',\-./:;=?\\^`|~[\]()<])@
A few examples:
$regex = '@(?<!\S)-((?:[^-\s]+[-\s]+)*?[^-\s]+?)-(?![^\s!"#$%&\',\-./:;=?\\\^`|~[\]()<])@';
$replacement = '<del>\1</del>';
echo preg_replace($regex, $replacement, '-*deleted* -- text- and -more deleted text-'), "\n";
echo preg_replace($regex, $replacement, '-*deleted*--text- and -more deleted text-'), "\n";
echo preg_replace($regex, $replacement, '-american-football player-'), "\n";
Will output:
<del>*deleted* -- text</del> and <del>more deleted text</del>
<del>*deleted*</del>-text- and <del>more deleted text</del>
<del>american-football player</del>
In the second example, it will match just -*deleted*-, since there are no spaces before the --. -text- will not be matched, because the initial - is not preceded by a space.
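To get from that output to the HTML the question asks for, the <del> rule can be followed by a rule for *strong*. The sketch below is only an illustration: the function name formatInline is hypothetical, and the lazy asterisk pattern used for strong is an assumption rather than part of the RedCloth-derived rule.

```php
<?php
// Hypothetical helper (not part of the answer): applies the <del> rule first,
// then a simple lazy <strong> rule, so nested markup is converted in order.
function formatInline(string $text): string
{
    // The compacted <del> pattern from above, written as a PHP single-quoted string.
    $del = '@(?<!\S)-((?:[^-\s]+[-\s]+)*?[^-\s]+?)-(?![^\s!"#$%&\',\-./:;=?\\\^`|~[\]()<])@';
    // Assumed rule for *strong*: the shortest run between two asterisks.
    $strong = '~[*](.+?)[*]~';

    $text = preg_replace($del, '<del>$1</del>', $text);
    return preg_replace($strong, '<strong>$1</strong>', $text);
}

echo formatInline('-*deleted* -- text- and -more deleted text-'), "\n";
// <del><strong>deleted</strong> -- text</del> and <del>more deleted text</del>
```

Running the <del> rule before the <strong> rule keeps the dash handling away from the generated tags; a real parser would probably need more care about ordering once further inline rules are added.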
The strong tag is easy:
$string = preg_replace('~[*](.+?)[*]~', '<strong>$1</strong>', $string);
Working on the others.
Shameless hack for the del tag:
$string = preg_replace('~-(.+?)-~', '<del>$1</del>', $string);
$string = str_replace('<del></del>', '--', $string);
For a single token, you can simply match:
-((?:[^-]|--)*)-
and replace with:
<del>$1</del>
and similarly for \*((?:[^*]|\*{2,})*)\* and <strong>$1</strong>.
The regex is quite simple: a literal - at both ends. In the middle, in a capturing group, we allow anything that isn't a hyphen, or two hyphens in a row.
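As a rough check, the two patterns can be run through preg_replace on the sample text from the question. This is a minimal sketch: the ~ delimiters and the variable names are my own choices, not something the patterns require.

```php
<?php
// Patterns from the explanation above, wrapped in ~ delimiters for preg_replace.
$delPattern    = '~-((?:[^-]|--)*)-~';
$strongPattern = '~\*((?:[^*]|\*{2,})*)\*~';

$input = '-*deleted* -- text- and -more deleted text-';

// Convert the dash-delimited spans first, then the asterisk-delimited ones.
$html = preg_replace($delPattern, '<del>$1</del>', $input);
$html = preg_replace($strongPattern, '<strong>$1</strong>', $html);

echo $html, "\n";
// <del><strong>deleted</strong> -- text</del> and <del>more deleted text</del>
```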
To also allow single dashes inside words, as in objective-c, this can work by accepting dashes that sit between two alphanumeric characters:
-((?:[^-]|--|\b-\b)*)-
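A quick way to try the word-boundary variant is to run it over both samples from the question. This is a sketch under the assumption that ~ delimiters are acceptable; the variable names are illustrative only.

```php
<?php
// Pattern from above; \b-\b accepts a hyphen that sits between two word characters.
$delPattern = '~-((?:[^-]|--|\b-\b)*)-~';

$samples = [
    '-american-football player-',
    '-*deleted* -- text- and -more deleted text-',
];

$results = [];
foreach ($samples as $sample) {
    $results[] = preg_replace($delPattern, '<del>$1</del>', $sample);
}

echo implode("\n", $results), "\n";
// <del>american-football player</del>
// <del>*deleted* -- text</del> and <del>more deleted text</del>
```

The closing dash is still recognised because it is followed by a space or the end of the string, so \b-\b cannot consume it.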
You could try something like:
'/-.*?[^-]-\b/'
Where the ending hyphen must be at a word boundary and preceded by something that is not a hyphen.
I think you should read this warning first: You can't parse [X]HTML with regex.
Perhaps you should try googling for a PHP HTML library.