Extracting portions of a loaded page in PHP (RegEx)

2022-12-19 22:17

I have a newsletter system I am trying to incorporate within a PHP site. The PHP site loads a content area and also loads scripts into the head of the page. This works fine for the code that is generated for the site but now I have the newsletter I am trying to incorporate.

Originally I was going to use an iFrame but the amount of AJAX and jQuery calls makes this quite complex.

So I thought I could use cURL to load the newsletter page as a variable. Then I was going to use RegEx to grab the content between the body tags and place this in the content area. Finally I was going to use RegEx again to search through the head and grab any scripts.

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $config_live_site."lib/alerts/user/update.php?email=test@test.com.au"); # URL to post to
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1 ); # return into a variable
curl_setopt($ch, CURLOPT_HEADER, 0);
$loaded_result = curl_exec( $ch ); # run!
curl_close($ch);

// Capture the body content and place in $_content
if (preg_match('%<body>([\s\S]*)</body>%', $loaded_result, $regs)) {
 $_content .= $regs[1];
} else {
 $_content .= "<p>No content to display.</p>";
}

// Capture the scripts and place in the head
if (preg_match('%(<script type="text/javascript">[\s\S]*</script>)%', $loaded_result, $regs)) {
 $headDetails .= $regs[0];
}

This works most of the time, but if there is a script in the body of the document, the match runs all the way down to the last </script>.

My question is two-fold I guess...

A. Is there a better overall approach (My deadline is very short so it needs to be a quick solution without too much editing of the newsletter code)?

B. What RegEx would I need to use to just capture the first script?

I think you'll need to add a ? after the * in the script regex so that it's not greedy. A greedy quantifier matches as much as possible (everything between the first opening tag and the last closing tag); a non-greedy one matches as little as possible (only what's between an opening tag and the first closing tag after it). Try:

%(<script type="text/javascript">[\s\S]*?</script>)%

Combined with preg_match_all instead of preg_match, this matches each script section individually rather than everything between the first and last script tags.
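
To illustrate the difference, here is a minimal sketch run against a made-up page. The `$html` sample and the variable names `$greedy`, `$lazy`, and `$count` are assumptions for illustration, not the newsletter's actual markup. The greedy pattern swallows everything from the first opening tag to the last closing tag, while the non-greedy pattern used with preg_match_all returns each script block separately:

```php
<?php
// Hypothetical sample page: one script in the head, one in the body.
$html = '<html><head><script type="text/javascript">var a = 1;</script></head>'
      . '<body><p>Hello</p><script type="text/javascript">var b = 2;</script></body></html>';

// Greedy: [\s\S]* runs to the LAST </script>, so both scripts and the
// markup between them end up in a single match.
preg_match('%<script type="text/javascript">[\s\S]*</script>%', $html, $greedy);

// Non-greedy: [\s\S]*? stops at the FIRST </script> after each opening tag.
// preg_match_all collects every such block into $lazy[0].
$count = preg_match_all('%<script type="text/javascript">[\s\S]*?</script>%', $html, $lazy);

echo strlen($greedy[0]), "\n"; // length of the single oversized match
echo $count, "\n";             // number of separate script blocks
print_r($lazy[0]);
```

Note that this still collects scripts from the body as well as the head; it only stops them from being merged into one match.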

A: I see no issues with using regular expressions to extract the bits you need from HTML pages which are not necessarily valid. In fact some of the spidering solutions I worked with did exactly that.

B: Use preg_match_all() instead of preg_match(). preg_match() only captures the first match while preg_match_all() will continue until the end of the string and return all matches.

A quick and dirty approach: once the body has been captured, isolate the head so that the script search cannot reach into the body.

if (preg_match('%<head>([\s\S]*)</head>%', $loaded_result, $regs)) {
   $_header .= $regs[1];
} else {
   $_header .= "<p>No content to display.</p>";
}

Then apply the script regex to the head only:

if (preg_match('%(<script type="text/javascript">[\s\S]*</script>)%', $_header, $regs)) {
   $headDetails .= $regs[0];
}
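
A minimal sketch of this two-step idea, using a non-greedy quantifier so that each head script is matched separately. The helper name `extract_head_scripts` and the sample `$page` string are hypothetical. Unlike the patterns above, these also tolerate attributes on the tags and ignore case, which is an assumption about what the newsletter markup might contain:

```php
<?php
// Hypothetical helper: return every <script> block found inside <head>.
function extract_head_scripts(string $page): array
{
    // Step 1: isolate the head. The lazy quantifier stops at the first </head>.
    if (!preg_match('%<head\b[^>]*>([\s\S]*?)</head>%i', $page, $head)) {
        return []; // no head section, so nothing to extract
    }

    // Step 2: search only the head, so scripts in the body are never seen.
    preg_match_all('%<script\b[^>]*>[\s\S]*?</script>%i', $head[1], $scripts);

    return $scripts[0];
}

// Made-up page with two head scripts and one body script.
$page = '<html><head><title>t</title>'
      . '<script type="text/javascript" src="a.js"></script>'
      . '<script type="text/javascript">var x = 1;</script></head>'
      . '<body><script type="text/javascript">var y = 2;</script></body></html>';

$headDetails = implode("\n", extract_head_scripts($page));
echo $headDetails, "\n";
```

Because the second pattern only ever sees the captured head, a script in the body cannot leak into `$headDetails`.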

If the HTML you get from cURL is well formed (that is, it also parses as XML), you should use SimpleXML to perform your extraction. As its name suggests, it is very simple to use.

$xml = simplexml_load_string($loaded_result); // the cURL response from the question

$body = $xml->body->asXML();

$scripts = $xml->xpath('//head/script');
foreach ($scripts as $script) {
  $_scripts .= $script->asXML();
}

If your HTML is not well formed, then you have to resort to Tidy to normalize it (or better, correct the scripts that output invalid HTML).

$doc = new DOMDocument();
$doc->loadHTML($loaded_result);
$xpath = new DOMXpath($doc);

$kod = $xpath->query("//head/script");
$i = 0;
foreach($kod as $node){
    echo 'im the script nº'.(++$i).' in the head and this is my content: ';
    echo $doc->saveXML($node)."\n";
}
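
Building on the DOMDocument snippet above, here is a hedged sketch of how both pieces the question asks for (the head scripts and the body content) might be collected into the question's `$headDetails` and `$_content` variables. The sample `$loaded_result` string is made up to stand in for the cURL response, and `libxml_use_internal_errors` is added on the assumption that the real newsletter HTML may trigger parser warnings:

```php
<?php
// Made-up newsletter page standing in for the cURL result ($loaded_result).
$loaded_result = '<html><head><title>News</title>'
    . '<script type="text/javascript">var a = 1;</script></head>'
    . '<body class="nl"><p>Hello</p><script type="text/javascript">var b = 2;</script></body></html>';

// Real-world HTML often triggers libxml warnings; collect them silently.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($loaded_result);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Scripts from the head only.
$headDetails = '';
foreach ($xpath->query('//head/script') as $node) {
    $headDetails .= $doc->saveHTML($node) . "\n";
}

// Inner HTML of the body: serialise each child node of <body>.
$_content = '';
$body = $doc->getElementsByTagName('body')->item(0);
if ($body !== null) {
    foreach ($body->childNodes as $child) {
        $_content .= $doc->saveHTML($child);
    }
} else {
    $_content = '<p>No content to display.</p>';
}

echo $headDetails;
echo $_content, "\n";
```

Scripts that live inside the body stay in `$_content` here, and the parser does not care whether the body tag carries attributes, which the `<body>` regex in the question would not match.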
