Problem with regex for text parsing (similar to textile)

I'm banging my head against the wall trying to figure out a (regexp-based?) parser rule for the following problem. I'm developing a text markup parser similar to Textile (in PHP), but I don't know how to get the inline formatting rules right, and I noticed that the Textile parsers I found cannot format the following text the way I want:

-*deleted* -- text- and -more deleted text-

The result I want to have is:

<del><strong>deleted</strong> -- text</del> and <del>more deleted text</del>

What I do not want is:

<del><strong>deleted</strong> </del>- text- and <del>more deleted text</del>

Any ideas are much appreciated. Thanks very much!

UPDATE

I should have mentioned that '-' must still be a valid character (hyphen) :) For example, the following should be possible:

-american-football player-

expected result:

<del>american-football player</del>


Based on the RedCloth library's parser description, with some modification for the double dash.

@
  (?<!\S)               # Start of string, or after space or newline
  -                     # Opening dash
  (                     # Capture group 1
    (?:                 #   : (see note 1)
      [^-\s]+           #   :
      [-\s]+            #   :
    )*?                 #   :
    [^-\s]+?            #   :
  )                     # End
  -                     # Closing dash
  (?![^\s!"\#$%&',\-./:;=?\\^`|~[\]()<])  # (see note 2)
@x
  • Note 1: This should match up to the next dash lazily, while consuming any non-single dashes, and single dashes surrounded by whitespace.
  • Note 2: Followed by space, punctuation, line break or end of string.

Or compacted:

@(?<!\S)-((?:[^-\s]+[-\s]+)*?[^-\s]+?)-(?![^\s!"#$%&',\-./:;=?\\^`|~[\]()<])@

A few examples:

$regex = '@(?<!\S)-((?:[^-\s]+[-\s]+)*?[^-\s]+?)-(?![^\s!"#$%&\',\-./:;=?\\\^`|~[\]()<])@';
$replacement = '<del>\1</del>';

echo preg_replace($regex, $replacement, '-*deleted* -- text- and -more deleted text-'), "\n";
echo preg_replace($regex, $replacement, '-*deleted*--text- and -more deleted text-'), "\n";
echo preg_replace($regex, $replacement, '-american-football player-'), "\n";

Will output:

<del>*deleted* -- text</del> and <del>more deleted text</del>
<del>*deleted*</del>-text- and <del>more deleted text</del>
<del>american-football player</del>

In the second example, it will match just -*deleted*-, since there are no spaces before the --. -text- will not be matched, because the initial - is not preceded by a space.
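
The pattern above only produces the <del> tags; the question also wants <strong> inside them. A minimal sketch of chaining the two passes follows. The helper name render_inline() and the simple asterisk rule are assumptions for illustration, not part of RedCloth or of this answer's pattern.

```php
<?php
// Sketch only: render_inline() is a hypothetical helper, not part of any library.
function render_inline(string $text): string
{
    // <del> rule, copied from the compact pattern above.
    $del = '@(?<!\S)-((?:[^-\s]+[-\s]+)*?[^-\s]+?)-(?![^\s!"#$%&\',\-./:;=?\\\^`|~[\]()<])@';
    // Assumed minimal <strong> rule: shortest run between two asterisks.
    $strong = '@\*(.+?)\*@';

    // Run the <del> pass first, then the <strong> pass on its result.
    $text = preg_replace($del, '<del>$1</del>', $text);
    return preg_replace($strong, '<strong>$1</strong>', $text);
}

echo render_inline('-*deleted* -- text- and -more deleted text-'), "\n";
echo render_inline('-american-football player-'), "\n";
```

For the first input this should give the markup asked for in the question, with <strong> nested inside the first <del>. A real parser would probably also need to escape HTML in the input before applying either rule.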


The strong tag is easy:

$string = preg_replace('~[*](.+?)[*]~', '<strong>$1</strong>',  $string);

Working on the others.


Shameless hack for the del tag:

$string = preg_replace('~-(.+?)-~', '<del>$1</del>', $string);
$string = str_replace('<del></del>', '--', $string);


For a single token, you can simply match:

-((?:[^-]|--)*)-

and replace with:

<del>$1</del>

and similarly for \*((?:[^*]|\*{2,})*)\* and <strong>$1</strong>.

The regex is quite simple: a literal - at both ends. In the middle, in a capturing group, we allow anything that isn't a hyphen, or two hyphens in a row.

To also allow single dashes inside words, as in objective-c, this can work, by accepting dashes that sit between two word characters (\b-\b):

-((?:[^-]|--|\b-\b)*)-
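
A short sketch of that last pattern applied with preg_replace, using the inputs from the question. The helper name strike() is an assumption for illustration.

```php
<?php
// Sketch only: strike() is a hypothetical helper name.
function strike(string $text): string
{
    // Between the outer hyphens allow: any non-hyphen character, a double
    // hyphen, or a single hyphen with word characters on both sides.
    return preg_replace('~-((?:[^-]|--|\b-\b)*)-~', '<del>$1</del>', $text);
}

echo strike('-*deleted* -- text- and -more deleted text-'), "\n";
echo strike('-american-football player-'), "\n";
echo strike('well-known and so-called'), "\n";
```

One caveat: the pattern does not require whitespace before the opening hyphen, so the third input, which has no intended markup at all, should come out as well<del>known and so</del>called. Adding a (?<!\S) lookbehind in front of the opening hyphen is one possible way to avoid that.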


You could try something like:

'/-.*?[^-]-\b/'

Where the ending hyphen must be at a word boundary and preceded by something that is not a hyphen.


I think you should read this warning first: You can't parse [X]HTML with regex.

Perhaps you should try searching for a PHP HTML library.
