In R, how to parse specific frame within a webpage?

2023-01-26 07:26 Q&A author:

Greetings all,

Is there a way to only read the HTML code from a specific frame within a webpage?

For example, if I submit a URL to Google Translate, is there a way to parse only the translated page frame? Whenever I try, I can only access the top frame on the page but not the translated frame. Here is my self-contained sample code:

library(XML)
url <- "http://www.baidu.com/s?wd=r+project"
url.google.translate <- URLencode(paste("http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=", url, sep=""))
htmlTreeParse(url.google.translate, useInternalNodes = FALSE)

The above code refers to this url:

$file
[1] "http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=http://www.baidu.com/s?wd=r+project"

The output, however, only contains the top frame of the page and not the main frame, which is the one I am interested in.

Hope that made sense and thanks in advance for any help.

Tony

UPDATE - Thanks to the answer from @kwantam below (accepted), I was able to use it to get my solution as follows (self-contained):

> # Load R packages
> library(RCurl)
> library(XML)
> 
> # STAGE 1 - find forward url in relevant frame
> ( url <- "http://www.baidu.com/s?wd=r+project" )
[1] "http://www.baidu.com/s?wd=r+project"
> gt.url <- URLencode(paste("http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=", url, sep=""))
> gt.doc <- getURL(gt.url)
> gt.html <- htmlTreeParse(gt.doc, useInternalNodes = TRUE, error=function(...){})
> nodes <- getNodeSet(gt.html, '//frameset//frame[@name="c"]')
> gt.parameters <- sapply(nodes, function(x) x <- xmlAttrs(x)[[1]])
> gt.url <- paste("http://translate.google.com", gt.parameters, sep = "")
> 
> # STAGE 2 - find forward url to translated page
> doc <- getURL(gt.url, followlocation = TRUE)
> html <- htmlTreeParse(doc, useInternalNodes = TRUE, error=function(...){})
> url.trans <- capture.output(getNodeSet(html, '//meta[@http-equiv="refresh"]')[[1]])
> url.trans <- strsplit(url.trans, "URL=", fixed = TRUE)[[1]][2]
> url.trans <- gsub("\"/>", "", url.trans, fixed = TRUE)
> url.trans <- xmlValue(getNodeSet(htmlParse(url.trans, asText = TRUE), "//p")[[1]])
> 
> # STAGE 3 - load translated page
> url.trans
[1] "http://translate.googleusercontent.com/translate_c?hl=en&ie=UTF-8&sl=zh-CN&tl=en&u=http://www.baidu.com/s%3Fwd%3Dr%2520project&prev=_t&rurl=translate.google.com&usg=ALkJrhiCMu1mKv-czCmEaB7PO925TJCa-A "
> #getURL(url.trans)

If anyone knows of a simpler solution to what I've given above then please feel free to let me know! :)

Most of the following answer is for the particular case of Google Translate. In most cases, you'll just need to parse the <frameset> and pull out whichever frame you're looking for, though it might not be immediately obvious from the HTML which one is the main frame (perhaps look at the relative sizing of the frames).
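
For the general case, a minimal sketch with the XML package might look like the following. The frameset markup, the frame names, and the src paths are made up for illustration; a real page would first be fetched, e.g. with getURL() from RCurl.

```r
library(XML)

# Hypothetical frameset page, inlined so the example runs without network access
page <- '<html><head><title>demo</title></head>
<frameset rows="20%,80%">
  <frame name="top" src="/header.html">
  <frame name="c" src="/content.html?id=1&amp;lang=en">
</frameset></html>'

doc <- htmlParse(page, asText = TRUE)

# Collect every frame and read its attributes by name rather than by position
frames <- getNodeSet(doc, "//frame")
frame.src <- sapply(frames, xmlGetAttr, "src")
names(frame.src) <- sapply(frames, xmlGetAttr, "name")

# Pick the frame of interest by its name attribute ("c" is an assumption here)
main.src <- frame.src[["c"]]
```

Reading src by name with xmlGetAttr() avoids depending on attribute order, and the parser has already turned '&amp;' into '&' in the returned value.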

It looks like you're going to have to follow a few refreshes to get the actual content. In particular, when you grab the URL you just mentioned, you'll see something like

  *snip*
<noframes>
<script>
<!--document.location="/translate_p?hl=en&amp;ie=UTF-8&amp;sl=zh-CN&amp;tl=en&amp;u=http://www.baidu.com/s%3Fwd%3Dr%2520project&amp;prev=_t&amp;usg=asdf";-->
</script>
<a href="/translate_p?hl=en&amp;ie=UTF-8&amp;sl=zh-CN&amp;tl=en&amp;u=http://www.baidu.com/s%3Fwd%3Dr%2520project&amp;prev=_t&amp;usg=asdf">Translate
</a>
</noframes>
  *snip*

If you follow the link here (remember to unescape '&amp;' to '&' first), it'll give you another small HTML fragment which includes

<meta http-equiv="refresh" content="0;URL=http://translate.googleusercontent.com/translate_c?hl=en&amp;ie=UTF-8&amp;sl=zh-CN&amp;tl=en&amp;u=http://www.baidu.com/s%3Fwd%3Dr%2520project&amp;prev=_t&amp;rurl=translate.google.com&amp;usg=asdf">

Again, unescaping '&amp;' to '&' and then following the refresh, you'll have the translated page that you're looking for.
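
As a sketch of that last step, the refresh target can be read from the parsed content attribute rather than from the printed node. The fragment is the one quoted above with its placeholder usg=asdf token; the regular expression assumes the content value has the form '<delay>;URL=<target>'.

```r
library(XML)

# Fragment modelled on the one quoted above; "usg=asdf" is a placeholder token
fragment <- paste0(
  '<html><head><meta http-equiv="refresh" content="0;URL=',
  'http://translate.googleusercontent.com/translate_c?hl=en&amp;ie=UTF-8',
  '&amp;sl=zh-CN&amp;tl=en&amp;u=http://www.baidu.com/s%3Fwd%3Dr%2520project',
  '&amp;prev=_t&amp;rurl=translate.google.com&amp;usg=asdf"></head></html>'
)

doc <- htmlParse(fragment, asText = TRUE)

# Locate the refresh tag and read its content attribute;
# the parser decodes "&amp;" to "&" in attribute values
meta <- getNodeSet(doc, '//meta[@http-equiv="refresh"]')[[1]]
content <- xmlGetAttr(meta, "content")

# Drop the leading "<delay>;URL=" part to leave only the target URL
url.trans <- sub("^[^;]*;[[:space:]]*URL=", "", content, ignore.case = TRUE)
```

The resulting url.trans could then be passed to getURL() from RCurl to fetch the translated page.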

Play with this in wget or curl and it should become more clear what you're going to need to do.

For your specific translation needs, maybe you'd be better off accessing the Google Translate API via its REST interface, rather than screen-scraping:

http://code.google.com/apis/language/translate/overview.html
