
Regex for matching markup in PHPish markup?

I am creating a project, and I need a regex for it (or something else, if that is preferable).

Basically, I need to convert a PHPish markup code page so that the "non-code" is converted into "code." For instance:

Original:

<?code
  echo '<html>';
?>
<head>
</head>
<body>
</body>
<?code
  echo '</html>';
?>

Converted:

<?code
  echo '<html>';
  echo '
<head>
</head>
<body>
</body>';
  echo '</html>';
?>

How could this work while also taking quotes into account? (For example: <?code $var='<?code stuff ?>';?>)

Also, it would help if someone could provide something to detect included files, so that each include can be replaced with something that first preprocesses the file and then includes it. (The includes are similar to PHP's.)

Is this even possible with regex? I know you're not supposed to try to parse HTML with regex, but this isn't trying to parse it; it's really quite dumb about what the markup actually contains.

Also, this project (the preprocessor, that is) will actually be implemented in Ruby, so if Ruby has something that would aid in this, have at it.

I know the code looks very similar to PHP, and that's because it is modeled on it. However, it will not be implemented in PHP, and the "code" used won't actually be PHP; it will just use a <? type mechanism for containing code in markup.

Edit: also note that the language inside the markup can for all practical purposes be Ruby. So it can contain quotes and comments that have the closing code tag.


You can use PHP's token_get_all to get a stream of parser tokens. Loop through them and echo them out; when you come upon a T_INLINE_HTML token, rewrite it as an echo statement instead.

Edit - Just saw you say you're using Ruby. Obviously, you can't use PHP's tokeniser from within Ruby. Maybe you can call php over the command line?

Edit 2:

"Is this even possible with regex? I know you're not supposed to try to parse HTML with regex, but this isn't trying to parse it; it's really quite dumb about what the markup actually contains."

It's parsing all right. You can use regular expressions to split your input into tokens (tokenization). Since most languages are context-dependent, you then have to feed the tokens to a state machine, which can parse the code into an internal representation (an AST). This can then be transformed into your target output. It sounds elaborate and scary, but it's really quite simple once you have tried it a couple of times. I suggest that you work through it, with the help of Wikipedia and Google.
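As a rough illustration of the tokenize-then-transform idea in Ruby, here is a minimal sketch built on the standard library's StringScanner. The names tokenize and render and the regex constants are made up for this example. It assumes the embedded language uses Ruby-like '...' and "..." strings with backslash escapes and # line comments, and that echo is the output statement from the question. It skips the AST step and goes straight from tokens to output, which is enough for this particular rewrite.

```ruby
require 'strscan'

# Hypothetical token patterns; the quoting and comment rules are assumptions.
OPEN_TAG      = /<\?code/
CLOSE_TAG     = /\?>/
SINGLE_QUOTED = /'(?:\\.|[^'\\])*'/m
DOUBLE_QUOTED = /"(?:\\.|[^"\\])*"/m
LINE_COMMENT  = /#[^\n]*/

# Splits the source into [:markup, text] and [:code, text] tokens.
def tokenize(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    # Markup state: everything up to the next open tag is plain markup.
    chunk = scanner.scan_until(OPEN_TAG)
    if chunk.nil?
      tokens << [:markup, scanner.rest]
      break
    end
    markup = chunk[0...-scanner.matched.length]
    tokens << [:markup, markup] unless markup.empty?

    # Code state: a close tag only counts outside strings and comments.
    pieces = []
    until scanner.eos? || scanner.scan(CLOSE_TAG)
      pieces << (scanner.scan(SINGLE_QUOTED) || scanner.scan(DOUBLE_QUOTED) ||
                 scanner.scan(LINE_COMMENT) || scanner.getch)
    end
    tokens << [:code, pieces.join.strip]
  end
  tokens
end

# Turns the token stream into a single code block, echoing the markup.
def render(tokens)
  lines = tokens.map do |kind, text|
    if kind == :markup
      # Escape backslashes and single quotes for the '...' literal.
      escaped = text.gsub(/[\\']/) { |ch| "\\#{ch}" }
      "  echo '#{escaped}';"
    else
      "  #{text}"
    end
  end
  (["<?code"] + lines + ["?>"]).join("\n") + "\n"
end
```

Calling render(tokenize(source)) on a page gives one merged code block. Because a # comment is taken to run to the end of its line, a one-line block that ends in a comment would never see its closing tag; whether that is the right rule depends on how the real language is defined.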


A couple of ideas rather than an answer:

I would suggest you try to find a regex that matches the blocks of code, and then wrap everything else in echo statements, instead of the other way round.
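A minimal Ruby sketch of that idea, assuming code blocks are delimited by <?code and ?> and that echo is the output statement from the question. The names CODE_BLOCK, escape_single_quoted and naive_convert are invented for this example. It is deliberately naive: the regex knows nothing about quotes or comments, so a ?> inside a string literal ends the block early.

```ruby
# Hypothetical pattern: lazily match from an open tag to the nearest close tag.
CODE_BLOCK = /<\?code(.*?)\?>/m

# Escape backslashes and single quotes so markup survives inside '...'.
def escape_single_quoted(text)
  text.gsub(/[\\']/) { |ch| "\\#{ch}" }
end

# Keeps matched code as-is and wraps everything between matches in echo.
def naive_convert(source)
  lines = ["<?code"]
  pos = 0
  source.scan(CODE_BLOCK) do
    match = Regexp.last_match
    markup = source[pos...match.begin(0)]
    lines << "  echo '#{escape_single_quoted(markup)}';" unless markup.empty?
    lines << "  #{match[1].strip}"
    pos = match.end(0)
  end
  tail = source[pos..-1]
  lines << "  echo '#{escape_single_quoted(tail)}';" unless tail.empty?
  lines << "?>"
  lines.join("\n") + "\n"
end
```

This works for pages like the example in the question, but an input such as <?code $var='?>'; ?> is split in the middle of the string literal, which is why quote-aware scanning is needed for the general case.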

Another option may be to look at the PHP tokenizer, but I'm afraid I'm not sure how it deals with sections of HTML outside of the tags.
