Close tags from a truncated HTML string
I have inherited a site with a news section that displays a summary of the news article. For whatever reason the creators decided that displaying the first X characters of the article would be fine. Of course this very quickly led to the summary being something like:
<p>What a mighty fine <a href="blah">da
<p>What a mighty fine and warm <a href="htt
<p>His name was "Emil&qu
Which quite obviously screws with the page, especially when the opening tags aren't even closed.
What I'm after is a way to close all open tags within the string being taken. I really really don't want to use regex to do it. I'm sure there's a nice parser that can do it easily, I just can't seem to find it right now.
The best thing is probably to find a better algorithm for generating the excerpt, for example by running strip_tags before the truncation.
How will you otherwise handle hard-to-find-programmatically errors such as <p>What a mighty fine and warm <a href="htt
or <p>His name was "Emil&qu
?
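As a rough sketch of the strip_tags-before-truncation idea: the helper below is hypothetical (the name `make_excerpt` and the `$maxChars` parameter are not from the original site) and assumes the mbstring extension is available and the content is UTF-8. It strips the markup, decodes entities so that something like `&quot;` counts as one character and cannot be cut in half, truncates, and then re-encodes for output.

```php
<?php
// Hypothetical helper (an assumption, not the site's actual code): build a
// plain-text excerpt so truncation can never cut through a tag or an entity.
function make_excerpt(string $html, int $maxChars = 100): string
{
    // Remove all markup first, so no tag can be left half-open.
    $text = trim(strip_tags($html));
    // Decode entities such as &quot; so each counts as a single character.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Truncate on characters rather than bytes (needs the mbstring extension).
    if (mb_strlen($text, 'UTF-8') > $maxChars) {
        $text = rtrim(mb_substr($text, 0, $maxChars, 'UTF-8')) . '...';
    }
    // Re-encode so the excerpt is safe to print into the page.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Prints: His name was &quot;Emil&quot;...
echo make_excerpt('<p>His name was &quot;Emil&quot; and he liked <a href="blah">dancing</a>.</p>', 20);
```

The trade-off is that the excerpt loses its links and formatting, which is usually acceptable for a summary line.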
Have you taken a look at Tidy?
Example:
$options = array("show-body-only" => true);
$tidy = tidy_parse_string("<B>Hello</I> How are <U> you?</B>", $options);
tidy_clean_repair($tidy);
echo $tidy;
Outputs:
<b>Hello</b> How are <u>you?</u>
I would install the PHP bindings for Tidy. You can then use this to clean up an HTML fragment using the following code:
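<?php
// Truncated excerpt with unclosed <p> and <a> tags.
$fragment = '<p>What a mighty fine <a href="blah">da';
$tidy = new tidy();
// 'show-body-only' returns just the repaired fragment, without <html>/<head> wrappers.
$tidy->parseString($fragment, array('show-body-only' => true), 'utf8');
$tidy->cleanRepair();
echo $tidy;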