PHP HTML DOM: How to select all visible/readable text?

I'm trying to scrape websites, modify all visible text (links, paragraphs, headlines, etc.) while keeping the HTML structure, and then render the 'new' page afterwards.

Basically I want to scramble all readable text without destroying the design/functionality.

I tried it with Zend_Dom_Query, but how do I select just the text?

    $dom = new Zend_Dom_Query($html);
    $results = $dom->query( ??? );

Or is there another/better way of doing this?

Thanks a lot in advance.


Example

Input:

<html>
  <head>....</head>
  <body>

    <div>
      <h1>Headline</h1>
      <h2>Subheadline</h2>
      <p>Some text</p>
      <a href="...">
        A Link 
        <img src="..." />
        <span style="display:none">additional text</span>
      </a>  
    </div>

  </body>
</html>

Output:

<html>
  <head>....</head>
  <body>

    <div>
      <h1>Hinladee</h1>
      <h2>Suialebdhne</h2>
      <p>Smoe txet</p>
      <a href="...">
        A Lnik 
        <img src="..." />
        <span style="display:none">anodiaditl txet</span>
      </a>  
    </div>

  </body>
</html>


You can try this service: http://www.alchemyapi.com/api/text/. Its API provides easy-to-use mechanisms for extracting page text and title information from any web page, so it is the simple route. Another option is http://www.alchemyapi.com/api/scrape/


Solution:

Thanks to @Yoshi and @Gordon. This is exactly what I was looking for:

    $dom = new Zend_Dom_Query($html);
    // query() expects a CSS selector; a raw XPath expression goes through queryXpath()
    $results = $dom->queryXpath("//text()");
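
For anyone without Zend Framework at hand, the same `//text()` expression should also work with PHP's built-in DOM extension. What follows is only a minimal sketch, not the code from the answers above: `scrambleWord()` and `scrambleHtml()` are hypothetical helper names, skipping `<script>` and `<style>` content is my own assumption about what "readable" means, and only ASCII letters are shuffled.

```php
<?php
// Sketch only: uses DOMDocument/DOMXPath instead of Zend_Dom_Query.
// scrambleWord() and scrambleHtml() are hypothetical helper names.

function scrambleWord(string $word): string
{
    // Words shorter than 4 letters have nothing to shuffle in the middle.
    if (strlen($word) < 4) {
        return $word;
    }
    // Keep the first and last letter, shuffle the ones in between.
    $middle = str_shuffle(substr($word, 1, -1));
    return $word[0] . $middle . $word[strlen($word) - 1];
}

function scrambleHtml(string $html): string
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally; real-world markup is rarely clean.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // All non-blank text nodes, except those inside <script> and <style>
    // (assumption: that content is code, not readable text).
    $nodes = $xpath->query(
        '//text()[normalize-space() and not(ancestor::script) and not(ancestor::style)]'
    );

    foreach ($nodes as $node) {
        // Scramble each run of ASCII letters; whitespace and punctuation stay put.
        $node->nodeValue = preg_replace_callback(
            '/[A-Za-z]+/',
            function (array $m): string {
                return scrambleWord($m[0]);
            },
            $node->nodeValue
        );
    }

    // Elements and attributes are untouched; only text node contents changed.
    return $doc->saveHTML();
}
```

Calling `echo scrambleHtml($html);` then prints the page with the same tags and attributes but shuffled words. Note that text inside `display:none` elements is scrambled too, as in the example output, and that `loadHTML()` may add a doctype or wrapper elements and re-encode non-ASCII characters, so the result is not guaranteed to be byte-identical outside the text nodes.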