开发者

In R, how to parse specific frame within a webpage?

Greetings all,

Is there a way to only read the HTML code from a specific frame within a webpage?

For example, if I submit a url to google translate, is there a way to parse only the translated page frame? Whenever I try, I can only access the top frame on the page but not the translated frame. Here is my self-contained sample code:

library(XML)
url <- "http://www.baidu.com/s?wd=r+project"
url.google.translate <- URLencode(paste("http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=", url, sep=""))
htmlTreeParse(url.google.translate, useInternalNodes = FALSE)

The above code refers to this url:

$file
[1] "http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=http://www.baidu.com/s?wd=r+project"

The output however only access the top frame of the page and not the main frame, which is what I am interested in.

Hope that made sense and thanks in advance for any help.

Tony

UPDATE - Thanks to the answer from @kwantam below (accepted), I was able to use it to get my solution as follows (self-contained):

> # Load R packages
> library(RCurl)
> library(XML)
> 
> # STAGE 1 - find forward url in relevent frame
> ( url <- "http://www.baidu.com/s?wd=r+project" )
[1] "http://www.baidu.com/s?wd=r+project"
> gt.url <- URLencode(paste("http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=", url, sep=""))
> gt.doc <- getURL(gt.url)
> gt.html <- htmlTreeParse(gt.doc, useInternalNodes = TRUE, error=function(...){})
> nodes <- getNodeSet(gt.html, '//frameset//frame[@name="c"]')
> gt.parameters <- sapply(nodes, function(x) x <- xmlAttrs(x)[[1]])
> gt.url <- paste("http://translate.google.com", gt.parameters, sep = "")
> 
> # STAGE 2 - find forward url to translated page
> doc开发者_开发知识库 <- getURL(gt.url, followlocation = TRUE)
> html <- htmlTreeParse(doc, useInternalNodes = TRUE, error=function(...){})
> url.trans <- capture.output(getNodeSet(html, '//meta[@http-equiv="refresh"]')[[1]])
> url.trans <- strsplit(url.trans, "URL=", fixed = TRUE)[[1]][2]
> url.trans <- gsub("\"/>", "", url.trans, fixed = TRUE)
> url.trans <- xmlValue(getNodeSet(htmlParse(url.trans, asText = TRUE), "//p")[[1]])
> 
> # STAGE 3 - load translated page
> url.trans
[1] "http://translate.googleusercontent.com/translate_c?hl=en&ie=UTF-8&sl=zh-CN&tl=en&u=http://www.baidu.com/s%3Fwd%3Dr%2520project&prev=_t&rurl=translate.google.com&usg=ALkJrhiCMu1mKv-czCmEaB7PO925TJCa-A "
> #getURL(url.trans)

If anyone knows of a simpler solution to what I've given above then please feel free to let me know! :)


Most of the following answer is for the particular case of google translate. In most cases, you'll just need to parse the <frameset> and pull out whichever frame you're looking for, though it might not be immediately obvious which is the main one from the HTML (perhaps look at the relative sizing of the frames).

It looks like you're going to have to follow a few refreshes to get the actual content. In particular, when you grab the URL you just mentioned, you'll see something like

  *snip*
<noframes>
<script>
<!--document.location="/translate_p?hl=en&amp;ie=UTF-8&amp;sl=zh-CN&amp;tl=en&amp;u=http://www.baidu.com/s%3Fwd%3Dr%2520project&amp;prev=_t&amp;usg=asdf";-->
</script>
<a href="/translate_p?hl=en&amp;ie=UTF-8&amp;sl=zh-CN&amp;tl=en&amp;u=http://www.baidu.com/s%3Fwd%3Dr%2520project&amp;prev=_t&amp;usg=asdf">Translate
</a>
</noframes>
  *snip*

If you follow the link here (remember to unescape '&' first), it'll give you another small HTML fragment which includes

<meta http-equiv="refresh" content="0;URL=http://translate.googleusercontent.com/translate_c?hl=en&amp;ie=UTF-8&amp;sl=zh-CN&amp;tl=en&amp;u=http://www.baidu.com/s%3Fwd%3Dr%2520project&amp;prev=_t&amp;rurl=translate.google.com&amp;usg=asdf">

Again, unescaping the '&' and then following the refresh, you'll have the translated page that you're looking for.

Play with this in wget or curl and it should become more clear what you're going to need to do.


For your specific translation needs, maybe you'd be better off accessing the google translate API via the REST interface, rather than screen-scraping:

http://code.google.com/apis/language/translate/overview.html

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜