How to extract h1 headings from an HTML page using regular expressions?
I'm still trying to get to grips with regexps and I'm considering a simple query. I'm trying to parse the homepage of my website and extract the H1 tags.
<?php
$string_get = file_get_contents("http://davidelks.com/");
$replace = "$1";
$matches = preg_replace ("/<h1 class=\"title\"><a href=\"([A-Z]|[0-9]|[a-z]|[\s]|[\x21]|[\/]|[\-]|[\.]|[\£]|[\:])*\">([A-Z]|[0-9]|[a-z]|[\s]|[\x21]|[\/]|[\-]|[\.]|[\£]|[\:])*<\/a><\/h1>/", $replace, $string_get, 1);
$string_construct = "Mum " . $matches . " Dad";
echo ($string_construct);
?>
However, instead of just displaying the first HTML link using the $1 token, it just pulls in the whole page. What can I try next?
This looks like something that could be done easily with a DOM parser:
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile('http://davidelks.com/'); // load() would parse the page as XML, not HTML
$h1 = $dom->getElementsByTagName('h1')->item(0);
echo $h1->textContent;
You should get:
Let's make things happen in and around Stoke-on-Trent
Note: I'm not sure if this is your site or a site you manage, but there shouldn't be more than a single <h1> tag in an HTML page (there are a couple on the homepage).
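Since the homepage has more than one <h1>, it may help to loop over all of them rather than taking only item(0). The following is a minimal sketch, not a drop-in solution: the $html string is made-up sample markup standing in for the fetched page, and loadHTML is used so the example runs without a network request.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<html><body>'
      . '<h1 class="title"><a href="/first">First heading</a></h1>'
      . '<p>Some text</p>'
      . '<h1 class="title"><a href="/second">Second heading</a></h1>'
      . '</body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML($html);

// Collect the text content of every <h1>, in document order.
$headings = [];
foreach ($dom->getElementsByTagName('h1') as $h1) {
    $headings[] = trim($h1->textContent);
}

print_r($headings);
```

For the live page, swapping loadHTML($html) for loadHTMLFile() with the URL should behave the same way, assuming the server allows the request.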
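The mistake is in your usage of preg_replace. It returns the entire subject string with only the matched part replaced (or the string unchanged if nothing matches), which is why you see the whole page. You want to extract something, and preg_match is the function for that: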
<?php
$text = file_get_contents("http://davidelks.com/");
preg_match('#<h1 class="title"><a href="([\w\s\x21\/\-\.\£\:]*)">([^<>]*)</a></h1>#', $text, $match);
echo "Mum " . $match[1] . " Dad";
?>
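To see the difference between the two functions side by side, here is a small sketch that runs both against the same input. The $page string is invented sample markup, and the href group is simplified to [^"]* for brevity; both are assumptions for illustration, not the real page or the exact pattern above.

```php
<?php
// Hypothetical one-heading page, used instead of a live request.
$page = '<div><h1 class="title"><a href="/about">About us</a></h1></div>';
$pattern = '#<h1 class="title"><a href="([^"]*)">([^<>]*)</a></h1>#';

// preg_replace returns the whole subject with only the matched part swapped,
// so everything outside the match is still present in the result.
$replaced = preg_replace($pattern, '$1', $page, 1);

// preg_match fills $match: [0] is the full match, [1] and [2] are the groups.
$found = preg_match($pattern, $page, $match);

echo $replaced, "\n";
echo $found ? $match[2] : '(no match)', "\n";
```

Checking the return value of preg_match before reading $match avoids undefined-index notices when the markup changes and the pattern stops matching.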
Note particularly that you can combine character classes. You don't need [A-Z]|[a-z]|[..] because you can just combine it into one [A-Za-z...] square bracket list.
Also, use single quotes for the PHP string if the pattern contains double quotes. This saves a lot of extraneous escaping. So does using an alternative delimiter such as # instead of / around the regex.
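As a rough illustration of both tips, the sketch below writes the same href pattern twice: once in the double-quoted, one-class-per-alternative style, and once as a single character class in a single-quoted string with # delimiters. The patterns are shortened versions written for this example, and the subject string is made up.

```php
<?php
// Verbose style: double quotes, / delimiters, one class per alternative.
$verbose = "/href=\"(([A-Z]|[0-9]|[a-z]|[\\/]|[\\-]|[\\.]|[\\:])*)\"/";

// Compact style: single quotes, # delimiters, one combined class.
$compact = '#href="([A-Za-z0-9/.:-]*)"#';

// Hypothetical link used as the subject for both patterns.
$subject = '<a href="http://example.com/a-b.html">x</a>';

preg_match($verbose, $subject, $m1);
preg_match($compact, $subject, $m2);

// Group 1 holds the captured URL in both cases.
echo $m1[1], "\n";
echo $m2[1], "\n";
```

Both patterns accept the same set of characters here; the compact one is simply easier to read and to extend.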
It would be easier to use a DOM parser. But if you want to do it with a regex, use PHP's preg_match_all function:
preg_match_all("/<h1 class=\"title\"><a href=\"([A-Z]|[0-9]|[a-z]|[\s]|[\x21]|[\/]|[\-]|[\.]|[\£]|[\:])*\">([A-Z]|[0-9]|[a-z]|[\s]|[\x21]|[\/]|[\-]|[\.]|[\£]|[\:])*<\/a><\/h1>/",$string_get,$matches);
var_dump($matches);
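The shape of $matches can be surprising at first, so here is a short sketch of how the results might be read back. It assumes a simplified pattern and invented sample markup, and passes the PREG_SET_ORDER flag so that each heading gets its own array.

```php
<?php
// Hypothetical markup with two headings, standing in for the real page.
$html = '<h1 class="title"><a href="/one">One</a></h1>'
      . '<p>text</p>'
      . '<h1 class="title"><a href="/two">Two</a></h1>';

$count = preg_match_all(
    '#<h1 class="title"><a href="([^"]*)">([^<>]*)</a></h1>#',
    $html,
    $matches,
    PREG_SET_ORDER // one array per heading instead of one array per group
);

// In each set, [1] is the href and [2] is the link text.
foreach ($matches as $m) {
    echo $m[2], ' -> ', $m[1], "\n";
}
```

Without PREG_SET_ORDER the default is PREG_PATTERN_ORDER, where $matches[1] lists all hrefs and $matches[2] lists all link texts.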