How do I extract all HTML tags from a webpage into an array?
I need to extract all the HTML tags from a webpage into an array, without the data inside the tags. I'm using PHP.
The result would look something like this:
Array
{
html =>
Array
{
head =>
Array
{
title,
meta name='description' content='bla bla'
meta name='keyword' content='bla bla'
....
},
body =>
Array
{
div id='header' =>
Array
{
div class='logo',
div class='nav'
},
div id='content' =>
Array
{
h1,
p class='first-para',
p,
p,
div id='ad'
},
div id='footer' =>
Array
{
ul =>
Array
{
li =>
Array
{
a href='link.htm'
},
li =>
Array
{
a href='link.htm'
},
li =>
Array
{
a href='link.htm'
}
}
}
}
}
}
What you need is an HTML parser (an XML parser probably won't do, because HTML is often invalid). Maybe: http://simplehtmldom.sourceforge.net/
You can also use the PHP DOM extension.
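A minimal sketch of that approach, assuming the page is already available as a string in `$html` (here a small made-up sample). The helper names `tagLabel()` and `tagTree()` are hypothetical, not part of the DOM extension; only `DOMDocument`, `DOMElement` and the `libxml_*` functions come from PHP itself.

```php
<?php
// Hypothetical helper: build a label such as "div id='header'" from an element.
function tagLabel(DOMElement $el): string
{
    $label = $el->nodeName;
    foreach ($el->attributes as $attr) {
        $label .= " {$attr->nodeName}='{$attr->nodeValue}'";
    }
    return $label;
}

// Hypothetical helper: recursively collect child elements, ignoring text nodes.
function tagTree(DOMElement $el): array
{
    $children = [];
    foreach ($el->childNodes as $child) {
        if ($child instanceof DOMElement) {
            // One single-entry array per child, so repeated siblings are all kept.
            $children[] = [tagLabel($child) => tagTree($child)];
        }
    }
    return $children;
}

// Made-up sample markup; in practice this would be the fetched page.
$html = '<html><head><title>Demo</title></head>'
      . '<body><div id="header"><p class="first-para">Hello</p></div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect warnings about invalid markup instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$tree = [tagLabel($doc->documentElement) => tagTree($doc->documentElement)];
print_r($tree);
```

Each child is stored as its own one-entry array rather than directly under its label as a key, because sibling tags such as several `p` or `li` elements would otherwise overwrite each other.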
I think the simplest way is to use XPath.
//*
This should select every element node on every level; you can then read each node's name from the result. I'm not sure whether the hierarchy will be flattened, though.
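As a rough illustration of the XPath route in PHP, here is a sketch using `DOMXPath` from the DOM extension, which supports XPath 1.0. The sample markup in `$html` is made up, and the query used is `//*` (all elements), with the name read from each node's `nodeName` property.

```php
<?php
// Made-up sample markup; in practice this would be the fetched page.
$html = '<html><head><title>Demo</title></head>'
      . '<body><div id="content"><h1>Hi</h1><p>One</p><p>Two</p></div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect warnings about invalid markup instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Collect the tag name of every element in the document.
$names = [];
foreach ($xpath->query('//*') as $node) {
    $names[] = $node->nodeName;
}
print_r($names);
```

The query returns a flat node list in document order, so the nesting is not part of the result. To rebuild the hierarchy you would have to follow each node's `parentNode`, or walk the tree recursively instead.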