Extracting portions of a loaded page in PHP (RegEx)
I have a newsletter system I am trying to incorporate within a PHP site. The PHP site loads a content area and also loads scripts into the head of the page. This works fine for the code that is generated for the site, but now I have the newsletter I am trying to incorporate.
Originally I was going to use an iFrame but the amount of AJAX and jQuery calls makes this quite complex.
So I thought I could use cURL to load the newsletter page as a variable. Then I was going to use RegEx to grab the content between the body tags and place this in the content area. Finally I was going to use RegEx again to search through the head and grab any scripts.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $config_live_site."lib/alerts/user/update.php?email=test@test.com.au"); # URL to post to
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1 ); # return into a variable
curl_setopt($ch, CURLOPT_HEADER, 0);
$loaded_result = curl_exec( $ch ); # run!
curl_close($ch);
// Capture the body content and place in $_content
if (preg_match('%<body>([\s\S]*)</body>%', $loaded_result, $regs)) {
    $_content .= $regs[1];
} else {
    $_content .= "<p>No content to display.</p>";
}
// Capture the scripts and place in the head
if (preg_match('%(<script type="text/javascript">[\s\S]*</script>)%', $loaded_result, $regs)) {
    $headDetails .= $regs[0];
}
This works most of the time, but if there is also a script in the body of the document, the match runs all the way down to the last </script>.
My question is two-fold I guess...
A. Is there a better overall approach (My deadline is very short so it needs to be a quick solution without too much editing of the newsletter code)?
B. What RegEx would I need to use to just capture the first script?
I think you'll need to add a ? to the script regex after the * so it's not greedy. Greedy regexes match as much as possible (everything between the first opening tag and the last closing tag); non-greedy ones match as little as possible (only what's between the opening tag and the first closing tag). Try:
%(<script type="text/javascript">[\s\S]*?</script>)%
As mentioned, change it to preg_match_all, and you should match the individual script sections instead of everything between the first and last script tags.
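To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The sample markup in $html is made up for illustration and is not the real newsletter page.

```php
<?php
// Hypothetical sample page: one script in the head, one in the body.
$html = '<html><head><script type="text/javascript">var a = 1;</script></head>'
      . '<body><p>Hello</p><script type="text/javascript">var b = 2;</script></body></html>';

// Greedy: [\s\S]* runs to the LAST </script>, swallowing everything in between.
preg_match('%<script type="text/javascript">[\s\S]*</script>%', $html, $greedy);

// Lazy: [\s\S]*? stops at the FIRST </script> after each opening tag.
preg_match_all('%<script type="text/javascript">[\s\S]*?</script>%', $html, $lazy);

echo strlen($greedy[0]), "\n"; // length of the single long match spanning both scripts
print_r($lazy[0]);             // the two <script> elements as separate matches
```

With the greedy pattern, the paragraph from the body ends up inside the match. With the lazy pattern, each script element is its own entry in $lazy[0].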
A: I see no issues with using regular expressions to extract the bits you need from HTML pages which are not necessarily valid. In fact some of the spidering solutions I worked with did exactly that.
B: Use preg_match_all() instead of preg_match(). preg_match() only captures the first match while preg_match_all() will continue until the end of the string and return all matches.
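A rough sketch of the difference in return values, and of how the matches could be appended to $headDetails. The sample string is hypothetical. The looser pattern (any attributes on the script tag, case-insensitive) is my own assumption about what the newsletter markup might need, not something taken from the question.

```php
<?php
// Hypothetical response body standing in for $loaded_result.
$loaded_result = '<head><script type="text/javascript">one();</script>'
               . '<SCRIPT src="lib.js"></SCRIPT></head>';

$headDetails = '';

// Assumption: <script\b[^>]*> accepts any attributes; the i flag ignores tag case.
$pattern = '%<script\b[^>]*>[\s\S]*?</script>%i';

// preg_match stops after the first match and returns 1 (or 0 if none).
$first = preg_match($pattern, $loaded_result, $one);

// preg_match_all scans the whole string and returns the number of matches.
$count = preg_match_all($pattern, $loaded_result, $all);

if ($count > 0) {
    // $all[0] is the list of full matches, in document order.
    $headDetails .= implode("\n", $all[0]);
}

echo $first, ' ', $count, "\n";
echo $headDetails, "\n";
```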
A quick and dirty approach: restrict the script search to the head. First capture the head content:
if (preg_match('%<head>([\s\S]*?)</head>%', $loaded_result, $regs)) {
    $_header = $regs[1];
} else {
    $_header = ''; // no head found, so there are no scripts to collect
}
Then apply the script regex to the header only, non-greedy so that each script is matched separately:
if (preg_match_all('%<script type="text/javascript">[\s\S]*?</script>%', $_header, $regs)) {
    $headDetails .= implode("\n", $regs[0]);
}
If the HTML you get from cURL is well-formed XML (for example, valid XHTML), you can use SimpleXML to perform the extraction. As its name suggests, it is very simple to use.
$xml = simplexml_load_string($loaded_result);
$body = $xml->body->asXML();
$scripts = $xml->xpath('//head/script');
$_scripts = '';
foreach ($scripts as $script) {
    $_scripts .= $script->asXML();
}
If your HTML is not well formed, then you have to resort to Tidy to normalize it (or better, correct the scripts that output invalid HTML).
libxml_use_internal_errors(true); // collect warnings from malformed HTML instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($loaded_result);
$xpath = new DOMXPath($doc);
$scripts = $xpath->query("//head/script");
$i = 0;
foreach ($scripts as $node) {
    echo 'Script #' . (++$i) . ' in the head: ';
    echo $doc->saveHTML($node) . "\n"; // serialize the node as HTML rather than XML
}
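Building on the DOMDocument approach, here is a minimal sketch of how both variables from the question ($_content and $headDetails) might be filled without any regex. It assumes the DOM extension is enabled, and the sample markup is hypothetical.

```php
<?php
// Hypothetical response body standing in for $loaded_result.
$loaded_result = '<html><head><title>News</title>'
               . '<script type="text/javascript">var a = 1;</script></head>'
               . '<body class="x"><p>Hello</p><script>var b = 2;</script></body></html>';

// Collect parser warnings instead of printing them for imperfect markup.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($loaded_result);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Scripts that are direct children of <head> only.
$headDetails = '';
foreach ($xpath->query('//head/script') as $script) {
    $headDetails .= $doc->saveHTML($script) . "\n";
}

// Inner HTML of <body>: serialize each child node in turn.
$_content = '';
$body = $doc->getElementsByTagName('body')->item(0);
if ($body !== null) {
    foreach ($body->childNodes as $child) {
        $_content .= $doc->saveHTML($child);
    }
}
if ($_content === '') {
    $_content = '<p>No content to display.</p>';
}

echo $headDetails;
echo $_content, "\n";
```

Because the query is limited to //head/script, scripts inside the body stay with the body content and are not duplicated in the head.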