regex to pull all attributes out of all meta tags
I'm trying to pull meta tags out of an HTML page in order to compare two pages (live and dev) and check that their SEO is the same after a site redesign/refactor. I need to verify that the title, meta tags (description, Open Graph, etc.), h1s, our analytics (Omniture), and our ad tags (DoubleClick) all match.
My problem is getting the meta tags: http://php.net/manual/en/function.get-meta-tags.php only works if they have a name= attribute, and the same goes for the solution by "mariano at cricava dot com".
I don't want to restrict it to certain attributes. I could assume that all our meta tags have either a name=, property= or http-equiv= attribute and change the regex accordingly, but I can't be entirely sure: it's a massive website and any random crap could be in the tags (hence this tool to check for it), so I'd like to keep it as dynamic as possible.
I have
$page = @file_get_contents('http://.../');
preg_match_all('#<meta(?:\s+?([^\=]+)\=\"(.+?)\")+?\s*?/?>#sui', $page, $matches, PREG_SET_ORDER);
but the subpatterns overwrite each other, so this only pulls out the last attribute-name=attribute-value pair:
Array
(
[0] => Array
(
[0] => <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
[1] => content
[2] => text/html; charset=UTF-8
)
[1] => Array
(
[0] => <meta name="description" content="some description" />
[1] => content
[2] => some description
)
[2] => Array
(
[0] => <meta property="og:type" content="website" />
[1] => content
[2] => website
)
...
I need all the attributes for all the meta tags. I could do this in two steps, pulling out the contents with <meta ([^>]*)> and then running a second regular expression on the results, but that seems like it should be unnecessary given the power of regex.
But back to the original question: forgetting that it's HTML for now, is there no way to have recurring subpatterns all returned by preg_match_all, rather than just the last match?
Not possible with preg_*/PCRE (nor any other regex flavor that I know of, but in Perl you could use a (?{ push @list, $^N }) hack).
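A minimal sketch of that limitation, using a made-up key=value string rather than HTML (the input and variable names are assumptions for illustration only). When a capturing group sits inside a repeated group, each iteration overwrites the previous capture, so only the final one is reported:

```php
<?php
// Hypothetical input: three key=value pairs separated by semicolons.
$subject = 'a=1;b=2;c=3';

// The outer group repeats three times, but capture groups 1 and 2
// are overwritten on every pass through the loop.
preg_match('/^(?:(\w+)=(\d+);?)+$/', $subject, $m);

// Only the last pair survives in the capture groups.
var_dump($m[1], $m[2]); // string(1) "c" and string(1) "3"
```

The whole string matches, yet the earlier pairs (a=1 and b=2) are not available anywhere in $m.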
Try this:
preg_match_all('#<meta\b(?:\s+[\w-]+(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^"\'<>\s]+))?)*\s*/?>#i', $content, $meta);
I am doing it this way. First, pull out the meta tags with the following regex:
string regex = "<meta\\s(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>";
I found the regex here: RegEx match open tags except XHTML self-contained tags
Then pull out attributes using another regex, which would be quite simple to write.
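The two steps could look roughly like this in PHP. This is only a sketch: extractMetaAttributes() is a hypothetical helper name, the first pattern is the tag regex quoted above, and the second pattern is one assumed way of writing the "simple" attribute regex (it handles double-quoted, single-quoted and unquoted values, and has not been tested against malformed markup):

```php
<?php
/**
 * Return one array of attribute => value pairs per <meta> tag.
 * extractMetaAttributes() is a hypothetical helper, not a PHP built-in.
 */
function extractMetaAttributes(string $html): array
{
    $result = [];

    // Step 1: grab each complete <meta ...> tag; quoted values may contain '>'.
    preg_match_all('#<meta\s(?:"[^"]*"[\'"]*|\'[^\']*\'[\'"]*|[^\'">])+>#i', $html, $tags);

    foreach ($tags[0] as $tag) {
        // Step 2: collect name="value", name='value' or name=value pairs.
        preg_match_all(
            '#([\w:-]+)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))#',
            $tag,
            $pairs,
            PREG_SET_ORDER
        );

        $attributes = [];
        foreach ($pairs as $pair) {
            // At most one of groups 2-4 is non-empty, so concatenating picks the value.
            $value = ($pair[2] ?? '') . ($pair[3] ?? '') . ($pair[4] ?? '');
            $attributes[strtolower($pair[1])] = $value;
        }
        $result[] = $attributes;
    }

    return $result;
}

// Usage with the tags from the question.
$html = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
      . '<meta name="description" content="some description" />'
      . '<meta property="og:type" content="website" />';

print_r(extractMetaAttributes($html));
```

Because each tag becomes a plain associative array, the live and dev results can be compared directly with == without knowing in advance which attributes are present.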