
xpath node determination

I'm all new to scraping and I'm trying to understand XPath using R. My objective is to create a vector of people's names from this website. I'm able to do it using:

library(XML)
library(plyr)  # for ldply()

r <- htmlTreeParse(e)  ## e is the page source from getURL()
g.k <- r[[3]][[1]][[2]][[3]][[2]][[2]][[2]][[1]][[4]]
l <- g.k[names(g.k) == "text"]
u <- ldply(l, function(x) {
    w <- xmlValue(x)
    return(w)
})

However, this is cumbersome and I'd prefer to use XPath. How do I go about referencing the path detailed above? Is there a function for this, or can I submit my path somehow, referenced as above?

I've come to

kk <- xpathApply(htmlTreeParse(e, useInternalNodes = TRUE),
                 "//body//text//div//div//p//text()",
                 function(k) xmlValue(k))

But this leaves me a lot of cleaning up to do and I assume it can be done better.

Regards, //M

EDIT: Sorry for the lack of clarity, but I'm new to this and rather confused. The XML document is unfortunately too large to paste. I guess my question is whether there is some easy way to find the names of these nodes and the structure of the document, besides using view source? (A rough way to explore the structure is sketched at the end of this question.) I've come a little closer to what I'd like:

e2 <- getNodeSet(htmlTreeParse(e, useInternalNodes = TRUE), "//p")[[5]]

gives me the list of what I want, but still as XML with <br> tags. I thought running

kk <- xpathApply(e2, "//text()", function(k) xmlValue(k))

would provide a list that could later be unlisted. However, it provides a list with more garbage than e2 displays.
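The extra noise is likely because a leading // in XPath always searches from the document root, even when the query is run against a subtree node such as e2. Prefixing the expression with a dot makes it relative to the node; a minimal sketch:

## ".//text()" is evaluated relative to e2, so only text nodes inside
## this one <p> element are returned, not every text node in the document.
kk <- xpathApply(e2, ".//text()", function(k) xmlValue(k))
unlist(kk)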

Is there a way to do this directly:

kk <- xpathApply(htmlTreeParse(e, useInternalNodes = TRUE),
                 "//p[5]//text()", function(k) xmlValue(k))

Link to the web page below. I'm trying to get the names, and only the names, from the page.

getURL("http://legeforeningen.no/id/1712")


I ended up with

library(XML)

xml <- htmlTreeParse("http://legeforeningen.no/id/1712", useInternalNodes = TRUE)

(no need for RCurl) and then

sub(",.*$", "", unlist(xpathApply(xml, "//p[4]/text()", xmlValue)))

(the subsetting is done in XPath), which leaves a final line that is not a name. One could do the text processing in XPath, too, but then one would iterate at the R level:

# Count the text nodes in the fourth <p>; the last one is not a name.
n <- xpathApply(xml, "count(//p[4]/text())") - 1L
# Let XPath's substring-before() do the trimming, node by node.
sapply(seq_len(n), function(i) {
    xpathApply(xml, sprintf('substring-before(//p[4]/text()[%d], ",")', i))
})

Unfortunately, this does not pick up names that do not contain a comma.
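One way around that, as a sketch: pull the text nodes into R and trim with sub() as in the one-liner above. Unlike XPath's substring-before(), which returns an empty string when its separator is absent, sub() leaves comma-free strings unchanged.

# Extract the text nodes, drop the trailing non-name line,
# then strip everything from the first comma onwards.
texts <- xpathSApply(xml, "//p[4]/text()", xmlValue)
names_only <- sub(",.*$", "", head(texts, -1))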


Use a mixture of XPath and string manipulation.

# Retrieve and parse the page.
library(XML)
library(RCurl)
page <- getURL("http://legeforeningen.no/id/1712")
parsed <- htmlTreeParse(page, useInternalNodes = TRUE)

Inspecting the parsed variable, which contains the page's source, tells us that instead of sensibly using a list tag (like <ul>), the author just put the names in a paragraph (<p>) of text split with line breaks (<br />). We use XPath to retrieve the <p> elements.

# Inspection tells us we want the fifth paragraph.
name_nodes <- xpathApply(parsed, "//p")[[5]]

Now we convert the node to character, split on the <br /> tags, and remove empty lines.

all_names <- as(name_nodes, "character")
all_names <- gsub("</?p>", "", all_names)
all_names <- strsplit(all_names, "<br />")[[1]]
all_names <- all_names[nzchar(all_names)]
all_names

Optionally, separate the names of people and their locations.

strsplit(all_names, ", ")

Or more prettily with stringr.

library(stringr)
str_split_fixed(all_names, ", ", 2)
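As a small follow-up: str_split_fixed() returns a two-column character matrix with one row per entry. Naming the columns (the labels below are illustrative, not taken from the page) keeps later code readable.

# Continues from the stringr snippet above.
name_table <- str_split_fixed(all_names, ", ", 2)
colnames(name_table) <- c("name", "location")  # illustrative labels
head(name_table)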
