Puzzle: Splitting An HTML String Correctly

2023-01-09 13:52

I'm trying to split an HTML string by a token in order to create a blog preview without displaying the full post. It's a little harder than I first thought. Here are the problems:

A user will be creating the HTML through a WYSIWYG editor (CKEditor). The markup isn't guaranteed to be pretty or consistent.
The token, read_more(), can be placed anywhere in the string, including being nested within a paragraph tag.
The resulting first split string needs to be valid HTML for all reasonable uses of the token.

Examples of possible uses:

<p>Some text here. read_more()</p>

<p>Some text read_more() here.</p>

<p>read_more()</p>

<p>  read_more()</p>

read_more()

So far, I've tried just splitting the string on the token, but that leaves invalid HTML. Regex is perhaps another option. What strategy would you use to solve this and make it as bulletproof as possible? Any code snippets or hints would also be appreciated (I'm using PHP).

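function stripmore($in)
{
    // No token: nothing to cut, so return the input unchanged.
    if (strpos($in, "read_more()") === false) return $in;

    list($p1,$p2) = explode("read_more()",$in,2);

    // Drop the text between tags, then any leading text before the first tag.
    $pass1 = preg_replace("~>[^<>]+<~","><",$p2);
    $pass2 = preg_replace("~^[^<>]+~","",$pass1);

    // Repeatedly remove empty open/close pairs until nothing changes.
    do
    {
        $pass3 = $pass2;
        $pass2 = preg_replace("~<([^<>]+)></\\1>~","",$pass3);
    } while ( $pass2 !== $pass3 );

    return $p1."read_more()".$pass2;
}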

This strips any non-HTML text after the read_more() mark and reduces what is left to the minimum by removing matching open/close tag pairs, while keeping the closing tag of any element that starts before the mark and ends after it:

<p>Some text here. read_more()</p>
      ==> <p>Some text here. read_more()</p>

<p>Some <b>text</b> read_more() <b>here</b>.</p>
      ==> <p>Some <b>text</b> read_more()</p>

<p>Some <b>text read_more() here</b>.</p>
      ==> <p>Some <b>text read_more()</b></p>

The only correct option I currently see is writing your own context-free grammar HTML parser in PHP, which will allow you to close the tags appropriately: keep a stack of open tags and, on reaching read_more(), pop the stack and add a closing tag for each pop.
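The stack idea can be sketched without a full parser. The following is only a rough illustration, not a real HTML parser: the function name previewWithClosedTags and the short list of void elements are assumptions made for this example, and it assumes reasonably well-formed editor output (no ">" inside attribute values, no stray closing tags).

```php
<?php
// Hypothetical helper: cut $html at $token and close any tags still open.
function previewWithClosedTags($html, $token = 'read_more()')
{
    $pos = strpos($html, $token);
    if ($pos === false) {
        return $html; // no token, nothing to cut
    }
    $head = substr($html, 0, $pos);

    // Elements that never take a closing tag (deliberately incomplete list).
    $void = array('br', 'hr', 'img', 'input', 'meta', 'link');
    $stack = array();

    // Visit every opening or closing tag before the token, in order.
    preg_match_all('~<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(/?)>~', $head, $matches, PREG_SET_ORDER);
    foreach ($matches as $tag) {
        $name = strtolower($tag[2]);
        if ($tag[3] === '/' || in_array($name, $void, true)) {
            continue; // self-closing or void element: never on the stack
        }
        if ($tag[1] === '/') {
            // Closing tag: pop only if it matches the innermost open tag.
            if (end($stack) === $name) {
                array_pop($stack);
            }
        } else {
            $stack[] = $name; // opening tag
        }
    }

    // Pop the stack, adding a closing tag for each pop.
    $tail = '';
    while ($stack) {
        $tail .= '</' . array_pop($stack) . '>';
    }
    return $head . $tail;
}

echo previewWithClosedTags('<p>Some <b>text read_more() here</b>.</p>');
```

Anything the regex does not recognise as a tag (comments, text) is simply left in place, which is why this is a sketch rather than a replacement for a proper parser.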

Writing a full parser is a lot of work, however, and this might work well enough for you:

$stripped = strip_tags($input);
list($preview) = explode("read_more()", $stripped);

You lose the HTML markup but it's dead easy to implement. And no possible XSS on your front page :)

Instead of using full HTML, why not use one of the many markup languages that can generate HTML but don't require you to close tags, etc.? It would be easier to train your users, and would avoid all of the possibilities for XSS attacks that accepting raw HTML allows.

PHP Markdown would seem an obvious fit, particularly in light of your desire to avoid the GNU GPL.

To reply to a comment on my comment, I decided to post this as an answer so I can take advantage of the markup options.

Why can't you just use trim() on the resulting string, find the missing open or close element and append that appropriately, to make it valid HTML?

Just traverse forward and back to find the next open/close element, and fix your HTML.

So, you can just walk forward and back in the string to get the next < and >, and if that is an HTML element then stop there, otherwise keep going.

Ideally you should only need to process this once per submission, so you don't keep paying the price of this operation.
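As a rough sketch of the walk-forward idea, the function below keeps everything before the token and then steps through the rest of the string with strpos(), appending only the closing tags that belong to elements opened before the token. The name cutAndClose and the small void-element pattern are assumptions for illustration, and the code assumes there are no HTML comments and no ">" characters inside attribute values.

```php
<?php
// Hypothetical helper: keep the text before $token, then walk forward
// from the token and append the closing tags that are still needed.
function cutAndClose($html, $token = 'read_more()')
{
    $pos = strpos($html, $token);
    if ($pos === false) {
        return $html; // no token, nothing to cut
    }
    $head = substr($html, 0, $pos);
    $offset = $pos + strlen($token);
    $depth = 0; // elements opened after the token and not yet closed

    // Find the next "<" and its ">" until the string runs out of tags.
    while (($lt = strpos($html, '<', $offset)) !== false
        && ($gt = strpos($html, '>', $lt)) !== false) {
        $tag = substr($html, $lt, $gt - $lt + 1);
        $offset = $gt + 1;

        if (strncmp($tag, '</', 2) === 0) {
            if ($depth > 0) {
                $depth--;       // closes an element opened after the token
            } else {
                $head .= $tag;  // closes an element opened before the token
            }
        } elseif (substr($tag, -2) !== '/>'
            && !preg_match('~^<(br|hr|img|input)\b~i', $tag)) {
            $depth++;           // ordinary opening tag after the token
        }
    }
    return $head;
}

echo cutAndClose('<p>Some <b>text</b> read_more() <b>here</b>.</p>');
```

Running this once when the post is saved and storing the result keeps the cost off every page view.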

UPDATE:

I forgot to include a link to help with strpos:

http://tuxradar.com/practicalphp/4/7/5

PHP tidy is a very lightweight and efficient utility for repairing invalid tags. Have a look: I have used it and benchmarked it in my application, and it works great. Moreover, it has many config options to suit your needs, and it takes care of other possible problems such as encoding and invalid nested tags.

See the reference: http://www.php.net/manual/en/tidy.cleanrepair.php

Example usage:

<?php

    function tidyString($str)
    {
      $config = array('show-body-only' => true); /* else it adds HTML tags too */
      /* the encoding is passed as the third argument */
      $outStr = tidy_repair_string($str, $config, 'utf8');
      return $outStr;
    }


    $inStr = "<span> this is my incorrect html</spa";
    echo tidyString($inStr);  // Output : <span>this is my incorrect html</span>

?>

Why not use two textareas, one for the content above the cut and one for the content below it? That should make it obvious to the user what's going on, and eliminate the headache for you.

If you do want to use a token, you should choose something a bit more distinctive. Maybe:  which you can be somewhat more sure isn't actually content being mistaken for a token.

Anyhow, if you want to split the string on the token, you just need to figure out where your token is using strpos() and then use substr() to chop off the first part. Something like:

$intro = substr($text, 0, strpos($text, $token));

Following that, run your $intro through tidy (PHP extension) to clean up the syntax and then strip off the extra crap it adds in there. (I think you can str_replace() the extras with an empty string.)
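Put together, those steps might look something like the sketch below. The function name introFromPost is made up for this example, it needs the tidy extension to be installed, and the show-body-only option is borrowed from the tidy answer above as one way to avoid most of the extras rather than stripping them afterwards.

```php
<?php
// Hypothetical helper: cut the post at the token, then let tidy repair it.
function introFromPost($text, $token = 'read_more()')
{
    $pos = strpos($text, $token);
    if ($pos === false) {
        return $text; // no token: show the whole post
    }
    $intro = substr($text, 0, $pos);

    // show-body-only asks tidy for a fragment instead of a full document.
    $config = array('show-body-only' => true);
    return trim(tidy_repair_string($intro, $config, 'utf8'));
}
```

Checking strpos() against false first matters: without the check, a post with no token would produce an empty intro.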
