
Get Text Content of current URL in php

I am working on fetching the content of a URL.

I want to fetch ONLY the text content (no markup) from this site: http://en.wikipedia.org/wiki/Asia

How is this possible? I can already fetch the URL and its title using PHP.

I got the URL title using the code below:

$url = getenv('HTTP_REFERER');

$file = file($url);
$file = implode("", $file);

//$get_description = file_get_contents($url);

if (preg_match("/<title>(.+?)<\/title>/i", $file, $m)) {
    $get_title = $m[1];
    echo $get_title;
}

Could you please help me get the content?

Using file_get_contents I could only get the HTML code. Are there any other possibilities?

Thanks - Haan


If you just want a textual version of an HTML page, you will have to process it yourself. Fetch the HTML (which you already know how to do) and then convert it to plain text with PHP.

There are several approaches to doing this. The first is htmlspecialchars(), which escapes all the HTML special characters. I don't imagine this is what you actually want, but I mention it for completeness.

The second approach is strip_tags(). This removes all HTML markup from an HTML document. However, it doesn't validate the input it's working with; it just does a fairly simple text replace. This means you may end up with content you don't want in the textual representation, such as the contents of the head section, or the innards of embedded JavaScript and stylesheets.
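A quick sketch of that caveat (the markup string here is my own example):

```php
<?php
// strip_tags() removes only the tags themselves; the text *between*
// them -- including the body of a <script> element -- survives.
$html = '<p>Asia is the largest continent.</p><script>track();</script>';
echo strip_tags($html);
// Prints: Asia is the largest continent.track();
```

Note how the JavaScript call `track();` leaks into the "plain text" output.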

The other approach is to parse the downloaded HTML with DOMDocument. I've not written code for you (don't have time), but the general procedure would be as follows:

  1. Load the HTML into a DOMDocument object
  2. Get the document's body element and iterate over its children.
  3. For each child, if the child in question is a text node, append it to an output string. If it isn't a text node, then iterate over its children as well to check if any of its children are text nodes (and if not then iterate over those child elements as well and so on). You might also want to check the type of the node further. For example, if you don't want javascript or css embedded in the output then you can check that the tag type is not STYLE or SCRIPT and just ignore it if it is.

The above description is most easily implemented as a recursive function (one that calls itself).

The end result should be a string that contains only the textual content of the downloaded page, with no markup.

EDIT: Forgot about strip_tags! I updated my answer to mention it as well. I left my DOMDocument approach in my answer, though, because as the documentation for strip_tags states, it does no validation of the markup it's processing, whereas DOMDocument attempts to parse it (and can potentially be more robust if a DOMDocument-based text extraction is implemented well).


Use file_get_contents to get the HTML content and then strip_tags to remove the HTML tags, thus leaving only the text.
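A sketch of that two-step approach, using the Wikipedia URL from the question (note that allow_url_fopen must be enabled for file_get_contents to fetch remote URLs; the whitespace cleanup step is my own addition):

```php
<?php
// Step 1: fetch the raw HTML of the page.
$html = file_get_contents('http://en.wikipedia.org/wiki/Asia');

// Step 2: remove all tags, leaving only the text between them.
$text = strip_tags($html);

// Removed tags leave runs of blank lines behind; collapse them.
$text = preg_replace('/\s+/', ' ', $text);

echo trim($text);
```

Bear in mind the strip_tags caveat from the other answer: head, script, and style contents will still be present in $text.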

