Compress whitespace between attributes in an HTML tag
We just released some code to make our software a little bit more user friendly, and it backfired. Basically, we're attempting to replace newlines with <br /> tags. The trouble is, sometimes our users will enter code like the following:
<a
href='http://nowhere.com'>Nowhere</a>
When we run our code, this translates to
<a <br />href='http://nowhere.com' />Nowhere</a>
which obviously doesn't render properly.
Is there a regular expression or a PHP function to strip, or perhaps compress, the whitespace between the attributes of an HTML tag?
Clarification: This isn't full HTML. It's more similar to Markdown or some other language (we will eventually be moving to Markdown, but I need a quick fix). So I can't just parse this as regular HTML. The newlines need to be converted to <br /> tags properly.
Hmm, why are you using tools to format HTML when they're not designed for that purpose? Get yourself a DOM library.
http://simplehtmldom.sourceforge.net/
You need a library that will correctly parse any HTML you throw at it; you never know what users may invent.
Look at HTML Purifier
After some searching and much trial and error, I have come up with the following solution/hack:
/*
 * Compress all whitespace within HTML tags (including PRE at the moment)
 */
$regexp = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";
preg_match_all($regexp, $text, $matches);
foreach ($matches[0] as $match) {
    // Collapse every run of whitespace inside the matched tag to one space
    $new_html = preg_replace('/\s+/', ' ', $match);
    $text = str_replace($match, $new_html, $text);
}
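After executing this code, the whitespace inside every HTML tag matched in $text (newlines included) is collapsed to single spaces, so no tag contains a newline character and the remaining newlines can safely be converted to <br /> tags.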
I know that this isn't the best solution, but it works, and pretty soon we'll be migrating to a true markup language (such as Markdown).
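For what it's worth, the same idea can be written in one pass with preg_replace_callback() instead of preg_match_all() plus str_replace(). This is only a sketch: the function name compress_tag_whitespace() and the sample input are made up for illustration, and the pattern is the one from the code above, with the same limitations.

```php
<?php
// Hypothetical helper (not from the original answer): collapses whitespace
// inside anything the tag pattern matches and leaves all other text alone.
function compress_tag_whitespace(string $text): string
{
    // Same tag pattern as in the answer above.
    $tag = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

    $result = preg_replace_callback(
        $tag,
        function (array $m): string {
            // $m[0] is the whole matched tag; squeeze each whitespace run to one space.
            return preg_replace('/\s+/', ' ', $m[0]);
        },
        $text
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $text;
}

// Assumed sample input, modelled on the example in the question.
$input = "Line one\n<a\nhref='http://nowhere.com'>Nowhere</a>\nLine two";

// Compress the tags first, then let nl2br() handle the remaining newlines.
echo nl2br(compress_tag_whitespace($input));
```

Running the tag compression before nl2br() matters: once the newline inside the <a> tag has become a space, nl2br() only sees the newlines between lines of text.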
Ideally, you would use an XML parser, through DOM or SAX APIs. However, if your content is not proper XML, but plain text with a few tags, the parser may fail (it depends on the tool used, I guess).
A rough solution for your particular problem may be as follows: construct a state machine with two states, inside a tag and outside a tag. You read the input character by character. Upon reading '<', switch to the "inside" state. Upon reading '>', switch to the "outside" state. Upon reading '\n' in the "outside" state, emit "<br />"; in the "inside" state, emit a single space instead, so that the tag name and the following attribute are not glued together. Every other character is copied to the output unchanged.
This is just a sketch, and it may need to be refined.
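In PHP, the two-state machine described above might look roughly like the following. The function name nl2br_outside_tags() is hypothetical. A newline inside a tag is emitted as a single space here, so that `<a` and `href` stay separated, and carriage returns are dropped on the assumption that input uses "\n" or "\r\n" line endings.

```php
<?php
// Hypothetical helper: turns newlines into <br /> only when outside a tag.
function nl2br_outside_tags(string $text): string
{
    $out = '';
    $insideTag = false;          // the machine's only state
    $length = strlen($text);

    // Byte-wise iteration is fine for UTF-8: '<', '>' and "\n" are ASCII
    // and never occur inside a multi-byte sequence.
    for ($i = 0; $i < $length; $i++) {
        $ch = $text[$i];

        if ($ch === '<') {
            $insideTag = true;   // switch to the "inside" state
            $out .= $ch;
        } elseif ($ch === '>') {
            $insideTag = false;  // switch to the "outside" state
            $out .= $ch;
        } elseif ($ch === "\n") {
            // Inside a tag: one space. Outside: a line break plus the newline.
            $out .= $insideTag ? ' ' : "<br />\n";
        } elseif ($ch === "\r") {
            // Assumption: "\r" only appears as part of "\r\n", so drop it.
            continue;
        } else {
            $out .= $ch;
        }
    }

    return $out;
}

// Assumed sample input, modelled on the example in the question.
echo nl2br_outside_tags("Hi\n<a\nhref='http://nowhere.com'>Nowhere</a>\nBye");
```

One known gap in this sketch: a literal '<' in ordinary text (for example "1 < 2") or a '>' inside an attribute value will flip the state at the wrong moment, which is part of what would need refining.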