How do I extract all HTML tags from a webpage into an array?

I'm using PHP and need to extract all HTML tags from a webpage into an array, without the data inside the tags. The result would look something like this:

Array 
{
   html =>
             Array 
             {
                 head =>
                          Array
                          {
                              title,
                              meta name='description' content='bla bla'
                              meta name='keyword' content='bla bla'
                              ....
                          },
                 body =>
                          Array
                          {
                              div id='header' =>
                                              Array
                                              {
                                                  div class='logo',
                                                  div class='nav'
                                              },
                              div id='content' =>
                                              Array
                                              {
                                                  h1,
                                                  p class='first-para',
                                                  p,
                                                  p,
                                                  div id='ad'
                                              },
                              div id='footer' =>
                                              Array
                                              {
                                                  ul =>
                                                        Array
                                                        {
                                                            li =>
                                                                  Array
                                                                  {
                                                                     a href='link.htm'
                                                                  },
                                                            li =>
                                                                  Array
                                                                  {
                                                                     a href='link.htm'
                                                                  },
                                                            li =>
                                                                  Array
                                                                  {
                                                                     a href='link.htm'
                                                                  }
                                                        }
                                              }
                          }

             }
}


What you need is an HTML parser. A strict XML parser will probably not do, because real-world HTML is often invalid. Maybe try: http://simplehtmldom.sourceforge.net/


You can also use the PHP DOM extension.
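A minimal sketch of the DOM extension approach, assuming the page source is already in a string (the `$html` sample and the helper names `tagLabel` and `tagTree` are made up for illustration). PHP arrays cannot hold the same key twice, so repeated siblings such as several `li` elements cannot be keys as in the question; this sketch stores each element as a `tag` / `children` pair instead.

```php
<?php
// Build a label such as "div id='header'" from an element and its attributes.
function tagLabel(DOMElement $el): string
{
    $label = $el->nodeName;
    foreach ($el->attributes as $attr) {
        $label .= " {$attr->nodeName}='{$attr->nodeValue}'";
    }
    return $label;
}

// Recursively collect element nodes only; text nodes and comments are skipped.
function tagTree(DOMElement $el): array
{
    $children = [];
    foreach ($el->childNodes as $child) {
        if ($child instanceof DOMElement) {
            $children[] = tagTree($child);
        }
    }
    return ['tag' => tagLabel($el), 'children' => $children];
}

// Hypothetical sample markup standing in for the real page.
$html = '<html><head><title>T</title></head>'
      . '<body><div id="header"><div class="logo">x</div></div><p>text</p></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect warnings about invalid markup instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$tree = tagTree($doc->documentElement);
print_r($tree);
```

For a live page, `$doc->loadHTMLFile($url)` could probably replace the `loadHTML($html)` call, provided URL access is enabled in your PHP configuration.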


I think the simplest way is to use XPath.

//*

This selects every element node on every level; you then read the name of each node. The result is a flat list in document order, so the hierarchy is not preserved.
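A rough sketch of the XPath route using PHP's DOMXPath (which implements XPath 1.0), assuming the expression `//*` to select every element and a made-up `$html` sample string. Each node's name is read from `nodeName`. `getNodePath()` is included only as one possible way to see where each element sits, since the list itself is flat.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$html = '<html><head><title>T</title></head><body><div id="a"><p>x</p></div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect warnings about invalid markup instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$names = [];
$paths = [];
// "//*" matches every element node; results come back in document order.
foreach ($xpath->query('//*') as $node) {
    $names[] = $node->nodeName;
    $paths[] = $node->getNodePath();   // location of the element within the tree
}

print_r($names);
print_r($paths);
```

If you need the nested structure from the question, walking the tree recursively is likely a better fit than a single XPath query.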
