
How to use XPath or xgrep to find information in Wikipedia?

I'd like to scrape some (not much) info from Wikipedia. Say I have a list of universities and their Wikipedia pages. Can I use an XPath expression to find the website (domain) of each university?

So for instance, if I get the page

curl http://en.wikipedia.org/wiki/Vienna_University_of_Technology

this XPath expression should find the domain:

http://www.tuwien.ac.at

Ideally, this should work with the Linux xgrep command line tool, or equivalent.


With the h prefix bound to the http://www.w3.org/1999/xhtml namespace URI:

/h:html/h:body/h:div[@id='content']
               /h:div[@id='bodyContent']
                /h:table[@class='infobox vcard']
                 /h:tr[h:th='Website']
                  /h:td/h:a/@href
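
This expression assumes the h prefix is registered for the XHTML namespace, which xmllint's --xpath option cannot do; xmlstarlet can, via its -N option. A rough sketch of running it from the shell, assuming xmlstarlet is installed and the saved page actually parses as XHTML (the file name is made up):

# Save the page locally, then evaluate the namespace-qualified XPath with xmlstarlet
curl -s http://en.wikipedia.org/wiki/Vienna_University_of_Technology -o tuwien.xhtml
xmlstarlet sel -N h="http://www.w3.org/1999/xhtml" -t \
  -v "/h:html/h:body/h:div[@id='content']/h:div[@id='bodyContent']/h:table[@class='infobox vcard']/h:tr[h:th='Website']/h:td/h:a/@href" \
  tuwien.xhtml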

Also, it looks like Wikipedia pages are well-formed XML (despite the fact that they are served as text/html). So, if you have an XML document with the page URLs, like:

<root>
   <url>http://en.wikipedia.org/wiki/Vienna_University_of_Technology</url>
</root>

You could use:

document(/root/url)/h:html/h:body/h:div[@id='content']
                                  /h:div[@id='bodyContent']
                                   /h:table[@class='infobox vcard']
                                    /h:tr[h:th='Website']
                                     /h:td/h:a/@href
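
Note that document() is an XSLT function rather than plain XPath that xmllint can evaluate on its own, so in practice this expression would live inside a stylesheet. A minimal sketch run with xsltproc (the file names urls.xml and extract-website.xsl are made up; xsltproc only fetches the remote pages when network access is allowed, which it is unless you pass --nonet, and the pages must parse as XML):

cat > extract-website.xsl <<'EOF'
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>
  <xsl:template match="/root">
    <!-- For each listed URL, load the page with document() and print the infobox website link -->
    <xsl:for-each select="url">
      <xsl:value-of select="document(.)/h:html/h:body/h:div[@id='content']
                            /h:div[@id='bodyContent']
                            /h:table[@class='infobox vcard']
                            /h:tr[h:th='Website']
                            /h:td/h:a/@href"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
EOF

xsltproc extract-website.xsl urls.xml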


it looks like Wikipedia pages are well-formed XML (despite the fact that they are served as text/html)

Apparently, that's no longer true. I had to use xmllint's --html option to convert the document to well-formed XML.

curl -L https://en.wikipedia.org/wiki/Vienna_University_of_Technology \
| xmllint --html --xpath "//table[@class='infobox vcard']//tr[th='Website']//a/@href" -

Result:

href="https://www.tuwien.at/en/"

Note: I added the -L option to the curl invocation to follow the redirect to https://en.wikipedia.org/wiki/TU_Wien.

As you see, it returned a node-set consisting of one attribute node. The string function can be used to get the string value of the first node in a node set. For an attribute node, it returns the attribute's value as a string:

curl -L https://en.wikipedia.org/wiki/Vienna_University_of_Technology \
| xmllint --html --xpath "string(//table[@class='infobox vcard']//tr[th='Website']//a/@href)" -

Result:

https://www.tuwien.at/en/


Update: I found that there is a way to retrieve the article as well-formed XML after all. The action=parse API can do it. Here's the end result:

curl -L 'https://en.wikipedia.org/w/api.php?action=parse&page=Vienna_University_of_Technology&format=xml&redirects' \
| xmllint --xpath "string(/api/parse/text)" - \
| xmllint --xpath "string(//table[@class='infobox vcard']//tr[th='Website']//a/@href)" -

Notice that the --html option is no longer needed.
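
Since the original question is about a whole list of universities, the same pipeline can be wrapped in a small shell loop; a sketch (the page titles are only examples):

# Run the two-step extraction once per page title
for page in Vienna_University_of_Technology University_of_Vienna; do
  url="https://en.wikipedia.org/w/api.php?action=parse&page=${page}&format=xml&redirects"
  website=$(curl -sL "$url" \
    | xmllint --xpath "string(/api/parse/text)" - \
    | xmllint --xpath "string(//table[@class='infobox vcard']//tr[th='Website']//a/@href)" -)
  echo "$page: $website"
done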
