Extracting the body text of an HTML document using PHP

2023-02-08 14:50 问答作者：

I know it's better to use DOM for this purpose but let's try to extract the text in this way:

<?php


$html=<<<EOD
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
EOD;


        preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);

        if (empty($matches))
            exit;

        $matched_body_start_tag = $matches[0][0];
        $index_of_body_start_tag = $matches[0][1];

        $index_of_body_end_tag = strpos($html, '</body>');


        $body = substr(
                        $html,
                        $index_of_body_start_tag + strlen($matched_body_start_tag),
                        $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
        );

echo $body;

The result can be seen here: http://ideone.com/vH2FZ

As you can see, I am getting more text than expected.

There is something I don't understand, to get the correc开发者_开发知识库t length for the substr($string, $start, $length) function, I am using:

$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)

I don't see anything wrong with this formula.

Could somebody kindly suggest where the problem is?

Many thanks to you all.

EDIT:

Thank you very very much to all of you. There is just a bug in my brain. After reading your answers, I now understand what the problem is, it should either be:

  $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag));

Or:

  $index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag);

The problem is that your string have new lines where . in the pattern only matches single lines, you need to add /s modifier to make . to match multi-lines

Here is my solution, I prefer it this way.

<?php

$html=<<<EOD
<html>
<head>
</head>
<body buu="grger"     ga="Gag">
<p>Some text</p>
</body>
</html>
EOD;

    // get anything between <body> and </body> where <body can="have_as many" attributes="as required">
    if (preg_match('/(?:<body[^>]*>)(.*)<\/body>/isU', $html, $matches)) {
        $body = $matches[1];
    }
    // outputing all matches for debugging purposes
    var_dump($matches);
?>

Edit: I am updating my answer to provide you with better explanation why your code fails.

You have this string:

<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>

Everything seems to be fine with it but actually you have non-print characters (new line characters) on each line. You have 53 printable characters and 7 non printable (new lines, \n == 2 characters actually for each new line).

When you reach this part of the code:

$index_of_body_end_tag = strpos($html, '</body>');

You get the correct position of </body> (starting at position 51) but this counts the new lines.

So when you reach this line of code:

$index_of_body_start_tag + strlen($matched_body_start_tag)

It it evaluated to 31 (new lines included), and:

$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)

It is evaluated to 51 - 25 + 6 = 32 (characters you have to read) but you only have 16 printable characters of text between <body> and </body> and 4 non printable characters (new line after <body> and new line before </body>). And here is the problem, you have to group the calculation (prioritize) like so:

$index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag))

evaluated to 51 - (25 + 6) = 51 - 31 = 20 (16 + 4).

:) Hope this helps you to understand why prioritizing is important. (Sorry for misleading you about newlines it is only valid in regex example I gave above).

Personally, I wouldn't use regex.

<?php

$html = <<<EOD

<html>
    <head>
        <title>Example</title>
    </head>
    <body>
        <h1>foobar</h1>
    </body>
</html>

EOD;

$s = strpos($html, '<body>') + strlen('<body>');
$f = '</body>';

echo trim(substr($html, $s, strpos($html, $f) - $s));

?>

returns <h1>foobar</h1>

The problem is in your substr computation of the ending index. You should substract all the way:

$index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag)

But you are doing:

+ strlen($matched_body_start_tag)

That said, it seems a little overkill considering you can do it using preg_match only. You just need to make sure you match across new lines, using the s modifier:

preg_match('/<body[^>]*>(.*?)<\/body>/s', $html, $matches);
echo $matches[1];

Outputs:

<p>Some text</p>

Somebodys probably already found your error, i didn't read all the replys.
The algebra is wrong.

code is here

Btw, first time seeing ideone.com, thats pretty cool.

$body = substr( 
          $html, 
          $index_of_body_start_tag + strlen($matched_body_start_tag),
          $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag))
        );

or ..

$body = substr(
          $html,
          $index_of_body_start_tag + strlen($matched_body_start_tag),
          $index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag)
       );

继续阅读：html-content-extraction php regex text text-processing

Extracting the body text of an HTML document using PHP

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？