Preg_replace regex, newlines, connection resets

2023-01-01 21:38 Q&A author:

I have a mix of HTML, custom markup, and regular text that I need to examine and change frequently across several long wiki pages. I'm working with a proprietary wiki-like application and have no control over how the application functions or validates user input. The pages that users add must follow a very specific standard layout and always include very specific text in only certain places, and that standard changes frequently. If users add pages that stray too far from the standard, the pages will be deleted.

I do not have the resources to manually proofread and correct all these pages, so automation is the only solution. The fact that all this is obviously a complete waste of time when alternative platforms that do exactly what's needed here exist is already understood.

I've built a PHP-based API to automate this post-validation and frequent restandardization process for me. I've been able to set up regex patterns to handle all this mixed text, and they all work fine on single lines. The problem I have is this: a poorly formed regex run against long text with line breaks can lead to unexpected results, such as connection resets. I have no access to server-side logs to troubleshoot. How do I overcome this?

This is just one example of what I currently have. The {column} and {section} tags I'm searching for below can have any number of attributes and can wrap any text. {section} may or may not exist, and may or may not be one or more lines under {column}, but it has to be wrapped inside {column}. {column} itself may or may not exist, and if it doesn't, I don't care, because default text gets inserted later in the script. I want to grab the inner section contents and wrap them in an HTML div tag instead. I can't recall the exact pattern I'm using offhand, but this is close enough...

$pattern = '/\{column:id=summary([|]?([a-zA-Z0-9_ -]+[:][a-zA-Z0-9_ -]+[ ]?))\}(.*)(\{section([|]([a-zA-Z0-9_ -]+[:][a-zA-Z0-9_ -]+[ ]?))\}(.*)\{section\}(.*))?\{column\}/s';
$replacement = "{html}<div id='summary'>$7</div>{html}";
$text = preg_replace($pattern, $replacement, $subject);

Handling the {column} and {section} attributes and passing only valid HTML parameters to the new html div or a subtext of it is itself a challenge, but my main focus above right now is getting that (.*) value within {section} above without causing a connection reset. Any pointers?

This probably isn't what you're looking for, but: don't use a regex! You're trying to parse some very structured, very complex text, and to do so, you should really use a parser. I don't know what's available for PHP (you can Google just as well as I can, and I'm in no position to make any particular recommendation) but I'm sure something exists.

As for what's causing the connection reset, my only guess is that, since you mention problems with "long text", you're hitting a memory allocation issue. I don't think your regex will perform unexpectedly badly when it matches, though it might in the non-matching case. But your best option, if you can, is probably to scrap the regex technique and switch to a real parser.
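As a rough illustration of the non-regex route, the sketch below finds the tags with plain string searches instead of a pattern. It is not a full parser: the function name `extractSummarySection` is hypothetical, it assumes the tag syntax shown in the question, and it ignores nesting and attribute validation.

```php
<?php
// Hypothetical helper: returns the text between the first {section...} tag and
// its closing {section} inside a {column:id=summary...} ... {column} block,
// or null when any of those tags is missing.
function extractSummarySection(string $subject): ?string
{
    $open = strpos($subject, '{column:id=summary');
    if ($open === false) {
        return null; // no summary column on this page
    }
    $openEnd = strpos($subject, '}', $open); // end of the opening column tag
    if ($openEnd === false) {
        return null;
    }
    $close = strpos($subject, '{column}', $openEnd);
    if ($close === false) {
        return null;
    }
    // Everything between the opening and closing column tags.
    $body = substr($subject, $openEnd + 1, $close - $openEnd - 1);

    $secOpen = strpos($body, '{section');
    if ($secOpen === false) {
        return null; // column present, but no section inside it
    }
    $secOpenEnd = strpos($body, '}', $secOpen); // end of the opening section tag
    if ($secOpenEnd === false) {
        return null;
    }
    $secClose = strpos($body, '{section}', $secOpenEnd);
    if ($secClose === false) {
        return null;
    }
    return substr($body, $secOpenEnd + 1, $secClose - $secOpenEnd - 1);
}

$page = "intro\n{column:id=summary|width:50}\nlead\n"
      . "{section|border:true}\nHello\nWorld\n{section}\n{column}\nrest";
$inner = extractSummarySection($page);
```

Each `strpos` call scans the text once from a given offset, so the running time grows linearly with page length and there is no backtracking to blow up on long pages.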

I found the likely source of the crashing issue: catastrophic backtracking (http://www.regular-expressions.info/catastrophic.html). So if refining patterns to handle that doesn't work (and if anyone has any patterns to suggest, please do share), switching to some other text parser solution would be best.
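Without server logs, one way to at least detect this kind of failure is to check what `preg_replace` returns: it yields `null` on error, and `preg_last_error()` reports the cause. The sketch below is an illustration under stated assumptions. The wrapper name `checkedReplace` is hypothetical, the pattern is a well-known backtracking example rather than the one from the question, and the `pcre.jit` and `pcre.backtrack_limit` settings are changed only to make the failure reproducible on a short subject.

```php
<?php
// Assumptions for the demo only: disable the JIT and lower the backtrack
// limit so the failure shows up on a 20-character subject.
ini_set('pcre.jit', '0');
ini_set('pcre.backtrack_limit', '100000');

// Hypothetical wrapper: surfaces PCRE failures instead of passing null along.
function checkedReplace(string $pattern, string $replacement, string $subject): string
{
    $result = preg_replace($pattern, $replacement, $subject);
    if ($result === null) {
        // Read the error code right away, before any other preg_* call.
        $code = preg_last_error();
        $reason = $code === PREG_BACKTRACK_LIMIT_ERROR
            ? 'backtrack limit exhausted'
            : 'PCRE error code ' . $code;
        throw new RuntimeException("preg_replace failed: $reason");
    }
    return $result;
}

// Nested quantifiers plus a subject that can never match force the engine
// to try a very large number of paths before giving up.
try {
    checkedReplace('/(?:\D+|<\d+>)*[!?]/', '', 'foobar foobar foobar');
    $outcome = 'no error';
} catch (RuntimeException $e) {
    $outcome = $e->getMessage();
}
echo $outcome, PHP_EOL;
```

This only catches failures that PCRE reports back to PHP. If the process dies outright, which a connection reset may indicate, no return value is available to inspect.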

The only real problem I can see is all those (.*)s. In /s mode the dot also matches newlines, so each (.*) initially slurps up the rest of the page, only to have to backtrack most of the way. Change them all to (.*?) (i.e., switch to reluctant quantifiers) and it should work much faster.
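Here is a small demonstration of that change, using a simplified pattern rather than the full one from the question. The sample text and the `[^}]*` shortcut for tag attributes are assumptions made for the example. With greedy `(.*)` the capture ends up in the last section of the page; with `(.*?)` it stays in the first.

```php
<?php
// Two column blocks, so the difference between greedy and lazy is visible.
$subject = "{column:id=summary}\n{section}\nfirst\n{section}\n{column}\n"
         . "{column:id=other}\n{section}\nsecond\n{section}\n{column}\n";

// Same structure in both patterns; only the quantifiers differ.
$greedy = '/\{column:id=summary[^}]*\}.*\{section[^}]*\}(.*)\{section\}.*\{column\}/s';
$lazy   = '/\{column:id=summary[^}]*\}.*?\{section[^}]*\}(.*?)\{section\}.*?\{column\}/s';

preg_match($greedy, $subject, $g); // $g[1] is taken from the second block
preg_match($lazy, $subject, $l);   // $l[1] is taken from the first block

// Replacement in the style of the question; $1 is the only capture group here.
$text = preg_replace($lazy, "{html}<div id='summary'>$1</div>{html}", $subject);
```

Reluctant quantifiers make the common, matching case cheaper and more precise, but a pattern with several of them can still backtrack heavily on text that never matches, so checking the result of `preg_replace` for `null` remains worthwhile.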
