PHP HTML DOM: How to select all visible/readable text?
I'm trying to scrape websites, modify all visible text (meaning: links, paragraphs, headlines, etc.) while keeping the HTML structure, and then render the 'new' page afterwards.
Basically, I want to scramble all readable text without destroying the design or functionality.
I tried it with Zend_Dom_Query, but how do I select just the text?
$dom = new Zend_Dom_Query($html);
$results = $dom->query( ??? );
Or is there another/better way of doing this?
Thanks a lot in advance.
Example
Input:
<html>
<head>....</head>
<body>
<div>
<h1>Headline</h1>
<h2>Subheadline</h2>
<p>Some text</p>
<a href="...">
A Link
<img src="..." />
<span style="display:none">additional text</span>
</a>
</div>
</body>
</html>
Output:
<html>
<head>....</head>
<body>
<div>
<h1>Hinladee</h1>
<h2>Suialebdhne</h2>
<p>Smoe txet</p>
<a href="...">
A Lnik
<img src="..." />
<span style="display:none">anodiaditl txet</span>
</a>
</div>
</body>
</html>
You can try this service: http://www.alchemyapi.com/api/text/ (its API provides easy-to-use mechanisms for extracting page text and title information from any web page). That is the simple way. Another way is to use http://www.alchemyapi.com/api/scrape/
Solution:
Thanks to @Yoshi and @Gordon. This is exactly what I was looking for:
$dom = new Zend_Dom_Query($html);
$results = $dom->query("//text()");
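For anyone who wants to see the whole round trip (select text nodes, scramble them, render the page again), here is a minimal sketch. It assumes PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than Zend_Dom_Query, and the function names scrambleWord and scrambleHtml are hypothetical helpers made up for this example. It also assumes ASCII letters only and skips text inside script and style elements, which the plain //text() query would otherwise include.

```php
<?php
// Sketch only: uses PHP's bundled DOM extension, not Zend_Dom_Query.

// Hypothetical helper: shuffle the inner letters of a word and keep the
// first and last letter in place. Assumes single-byte (ASCII) letters.
function scrambleWord(string $word): string
{
    if (strlen($word) < 4) {
        return $word; // nothing to shuffle in words this short
    }
    $inner = str_shuffle(substr($word, 1, -1));
    return $word[0] . $inner . $word[strlen($word) - 1];
}

// Hypothetical helper: scramble every text node and return the new HTML.
function scrambleHtml(string $html): string
{
    // Collect parser warnings internally instead of printing them;
    // real-world markup is rarely perfectly valid.
    libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $xpath = new DOMXPath($dom);
    // All text nodes except those inside <script> or <style>.
    $nodes = $xpath->query(
        '//text()[not(ancestor::script) and not(ancestor::style)]'
    );

    foreach ($nodes as $node) {
        // Only runs of ASCII letters are touched; whitespace, digits and
        // punctuation stay where they are.
        $node->nodeValue = preg_replace_callback(
            '/[A-Za-z]+/',
            function (array $m): string {
                return scrambleWord($m[0]);
            },
            $node->nodeValue
        );
    }

    return $dom->saveHTML();
}

$html = '<html><body><h1>Headline</h1><p>Some text</p></body></html>';
echo scrambleHtml($html);
```

Because only the contents of text nodes change, tags and attributes such as href and style are left alone. Like the example output above, this also scrambles text inside elements hidden with display:none, since XPath knows nothing about CSS. The shuffle is random, so the output differs from run to run.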