Puzzle: Splitting An HTML String Correctly
I'm trying to split an HTML string by a token in order to create a blog preview without displaying the full post. It's a little harder than I first thought. Here are the problems:
- A user will be creating the HTML through a WYSIWYG editor (CKEditor). The markup isn't guaranteed to be pretty or consistent.
- The token, read_more(), can be placed anywhere in the string, including nested within a paragraph tag.
- The resulting first split string needs to be valid HTML for all reasonable uses of the token.
Examples of possible uses:
<p>Some text here. read_more()</p>
<p>Some text read_more() here.</p>
<p>read_more()</p>
<p> read_more()</p>
read_more()
So far, I've tried just splitting the string on the token, but it leaves invalid HTML. Regex is perhaps another option. What strategy would you use to solve this and make it as bulletproof as possible? Any code snippets or hints would also be appreciated (I'm using PHP).
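function stripmore($in)
{
    // Split at the first token; if the token is missing, return the input unchanged.
    $parts = explode("read_more()", $in, 2);
    if (count($parts) < 2) return $in;
    list($p1, $p2) = $parts;
    // Drop text between tags, then any text before the first tag.
    $pass1 = preg_replace("~>[^<>]+<~", "><", $p2);
    $pass2 = preg_replace("~^[^<>]+~", "", $pass1);
    // Repeatedly remove empty element pairs such as <b></b> until nothing changes.
    do {
        $pass3 = $pass2;
        $pass2 = preg_replace("~<([^<>]+)></\\1>~", "", $pass3);
    } while ($pass2 !== $pass3);
    return $p1 . "read_more()" . $pass2;
}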
This strips any non-HTML text after the read_more() mark and reduces what remains to the minimum by removing matching empty tag pairs, while keeping any tag that starts before the mark and ends after it:
<p>Some text here. read_more()</p>
==> <p>Some text here. read_more()</p>
<p>Some <b>text</b> read_more() <b>here</b>.</p>
==> <p>Some <b>text</b> read_more()</p>
<p>Some <b>text read_more() here</b>.</p>
==> <p>Some <b>text read_more()</b></p>
The only correct option I currently see is writing your own context-free grammar HTML parser in PHP, which would let you close the tags appropriately (simply by popping the stack when you reach read_more() and adding a closing tag for each pop).
This is, however, a lot of work, and the following might work well enough for you:
$stripped = strip_tags($input);
list($preview) = explode("read_more()", $stripped);
You lose the HTML markup but it's dead easy to implement. And no possible XSS on your front page :)
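If you do want to keep the markup, the stack idea above can be approximated without a full parser. This is only a rough sketch: the function name preview_with_closed_tags() is made up for illustration, tags are found with a regex rather than a real grammar, and it assumes the editor's output has no ">" inside attribute values.

```php
<?php
// Hypothetical helper: cut $html at $token and close any tags still open.
// Assumes reasonably well-formed input; this is not a full HTML parser.
function preview_with_closed_tags(string $html, string $token = 'read_more()'): string
{
    $pos = strpos($html, $token);
    if ($pos === false) {
        return $html; // no token: nothing to cut
    }
    $head = substr($html, 0, $pos);

    // Void elements never get a closing tag, so they are never pushed.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'];

    $stack = [];
    preg_match_all('~<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^<>]*?(/?)>~', $head, $tags, PREG_SET_ORDER);
    foreach ($tags as $tag) {
        $name = strtolower($tag[2]);
        if ($tag[3] === '/' || in_array($name, $void, true)) {
            continue; // self-closing or void element
        }
        if ($tag[1] === '/') {
            // Closing tag: pop back to the most recent matching opener, if any.
            $i = array_search($name, array_reverse($stack, true), true);
            if ($i !== false) {
                $stack = array_slice($stack, 0, $i);
            }
        } else {
            $stack[] = $name; // opening tag
        }
    }

    // Whatever is still on the stack was open at the token: close it, innermost first.
    $closing = '';
    while ($stack) {
        $closing .= '</' . array_pop($stack) . '>';
    }
    return $head . $closing;
}
```

With this, `<p>Some <b>text read_more() here</b>.</p>` becomes `<p>Some <b>text </b></p>`. A token alone in a paragraph leaves an empty `<p></p>` behind, which you may want to strip afterwards.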
Instead of using full HTML, why not use one of the many markup languages that can generate HTML, but which don't require you to close tags, etc. It would be easier to train your users, and would avoid all of the possibilities for XSS attacks that accepting raw HTML allows.
PHP Markdown would seem an obvious fit, particularly in light of your desire to avoid the GNU GPL.
In order to answer a comment to my comment I decided to have it be an answer, so I can take advantage of the markup options.
Why can't you just use trim() on the resulting string, find the missing open or close element and append that appropriately, to make it valid HTML?
Just traverse forward and back to find the next open/close element, and fix your HTML.
So, you can just walk forward and back in the string to get the next < and >, and if that is an HTML element then stop there, otherwise keep going.
Ideally you should only need to process this once per submission, so you don't keep paying the price for this operation.
UPDATE:
I forgot to include a link to help with strpos: http://tuxradar.com/practicalphp/4/7/5
PHP tidy is a very lightweight and efficient utility for repairing invalid tags. Have a look; I have used it and benchmarked it in my application, and it works great. Moreover, it has many config options to suit your needs, and it takes care of other possible problems like encoding, invalid nested tags, etc.
See the reference: http://www.php.net/manual/en/tidy.cleanrepair.php
Example usage:
<?php
function tidyString($str)
{
    $config = array('show-body-only' => true); /* else it adds HTML tags too */
    // The encoding is passed as the third argument to tidy_repair_string().
    $outStr = tidy_repair_string($str, $config, 'utf8');
    return $outStr;
}
$inStr = "<span> this is my incorrect html</spa";
echo tidyString($inStr); // Output : <span>this is my incorrect html</span>
?>
Why not use two textareas? One above and one below the cut? That should make it obvious to the user what's going on, and eliminate the headache for you.
If you do want to use a token, you should choose something a bit more distinctive. Maybe: <!--full body cut--> which you can be somewhat more sure isn't actually content being mistaken for a token.
Anyhow, if you want to split the string on the token, you just need to figure out where your token is using strpos() and then use substr() to chop off the first part. Something like:
$intro = substr($text, 0, strpos($text, $token));
Following that, run your $intro through tidy (PHP extension) to clean up the syntax and then strip off the extra markup it adds in there. (I think you can str_replace() the extras with an empty string.)
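One thing to watch with the one-liner: strpos() returns false when the token is missing, and substr() then treats that as length 0, so you get an empty intro. Here is a slightly more defensive sketch. The function name intro_before_token() is made up, and the tidy step only runs if the tidy extension happens to be loaded (that part is an assumption about your setup).

```php
<?php
// Hypothetical helper: returns the part of $text before $token,
// or null when the token is not present at all.
function intro_before_token(string $text, string $token = '<!--full body cut-->'): ?string
{
    $pos = strpos($text, $token);
    if ($pos === false) {
        return null; // strpos() returns false (not -1) when nothing is found
    }
    $intro = substr($text, 0, $pos);

    // Repair unclosed tags only if the tidy extension is available.
    // 'show-body-only' keeps tidy from wrapping the fragment in <html>/<body>.
    if (function_exists('tidy_repair_string')) {
        $fixed = tidy_repair_string($intro, ['show-body-only' => true], 'utf8');
        if ($fixed !== false) {
            $intro = $fixed;
        }
    }
    return $intro;
}
```

Returning null lets the caller decide what to do with posts that have no cut, for example showing the whole post or truncating by length. With 'show-body-only' set, there should be little or nothing left to str_replace() away afterwards.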