How to use XPath or xgrep to find information in Wikipedia?
I'd like to scrape some (not much) info from Wikipedia. Say I have a list of universities and their Wikipedia pages. Can I use an XPath expression to find the website (domain) of each university?
So for instance, if I get the page
curl http://en.wikipedia.org/wiki/Vienna_University_of_Technology
then the XPath expression should find the domain:
http://www.tuwien.ac.at
Ideally, this should work with the Linux xgrep command-line tool, or an equivalent.
With the h prefix bound to the http://www.w3.org/1999/xhtml namespace URI:
/h:html/h:body/h:div[@id='content']
/h:div[@id='bodyContent']
/h:table[@class='infobox vcard']
/h:tr[h:th='Website']
/h:td/h:a/@href
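For example, with xmlstarlet (a sketch, assuming xmlstarlet is installed and the page really parses as namespace-qualified XHTML; university.xhtml is just a hypothetical locally saved copy of the page), the prefix can be bound on the command line:

xmlstarlet sel -N h="http://www.w3.org/1999/xhtml" -t \
  -v "/h:html/h:body/h:div[@id='content']/h:div[@id='bodyContent']/h:table[@class='infobox vcard']/h:tr[h:th='Website']/h:td/h:a/@href" \
  university.xhtml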
Also, it looks like Wikipedia pages are well-formed XML (despite the fact that they are served as text/html). So, if you have an XML document with the page URLs, like:
<root>
<url>http://en.wikipedia.org/wiki/Vienna_University_of_Technology</url>
</root>
You could use:
document(/root/url)/h:html/h:body/h:div[@id='content']
/h:div[@id='bodyContent']
/h:table[@class='infobox vcard']
/h:tr[h:th='Website']
/h:td/h:a/@href
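Note that document() is an XSLT function, so this variant needs an XSLT processor rather than a bare XPath tool. Here is a minimal sketch with xsltproc (the file names urls.xml and websites.xsl are only illustrative, and xsltproc must be allowed to fetch the remote pages):

cat > websites.xsl <<'EOF'
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>
  <!-- For each listed URL, load the page and print the infobox website link -->
  <xsl:template match="/">
    <xsl:for-each select="/root/url">
      <xsl:value-of select="document(.)/h:html/h:body
          /h:div[@id='content']/h:div[@id='bodyContent']
          /h:table[@class='infobox vcard']
          /h:tr[h:th='Website']/h:td/h:a/@href"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
EOF
xsltproc websites.xsl urls.xml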
"it looks like Wikipedia pages are well-formed XML (despite the fact that they are served as text/html)"
Apparently, that's no longer true. I had to use xmllint's --html option to convert the document to well-formed XML.
curl -L https://en.wikipedia.org/wiki/Vienna_University_of_Technology \
| xmllint --html --xpath "//table[@class='infobox vcard']//tr[th='Website']//a/@href" -
Result:
href="https://www.tuwien.at/en/"
Note: I added the -L option to the curl invocation to follow the redirect to https://en.wikipedia.org/wiki/TU_Wien.
As you can see, it returned a node-set consisting of one attribute node. The string() function can be used to get the string value of the first node in a node-set. For an attribute node, it returns the attribute's value as a string:
curl -L https://en.wikipedia.org/wiki/Vienna_University_of_Technology \
| xmllint --html --xpath "string(//table[@class='infobox vcard']//tr[th='Website']//a/@href)" -
Result:
https://www.tuwien.at/en/
✔️
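Since the question asked for the domain rather than the full URL, the host part can be stripped out with standard text tools. This is just one possible post-processing step, appended to the same pipeline:

curl -L https://en.wikipedia.org/wiki/Vienna_University_of_Technology \
| xmllint --html --xpath "string(//table[@class='infobox vcard']//tr[th='Website']//a/@href)" - \
| sed -E 's|^https?://([^/]+).*|\1|'

Result:
www.tuwien.at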
Update: I found that it is possible to retrieve the article as well-formed XML. The action=parse API can do it. Here's the end result:
curl -L 'https://en.wikipedia.org/w/api.php?action=parse&page=Vienna_University_of_Technology&format=xml&redirects' \
| xmllint --xpath "string(/api/parse/text)" - \
| xmllint --xpath "string(//table[@class='infobox vcard']//tr[th='Website']//a/@href)" -
Notice that the --html option is no longer needed.
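To come back to the original list-of-universities use case, the same API call can be wrapped in a small shell loop. This is only a sketch: pages.txt is a hypothetical file containing one article title per line (e.g. Vienna_University_of_Technology).

while read -r page; do
  # print the article title, then the extracted website on the same line
  printf '%s\t' "$page"
  curl -sL "https://en.wikipedia.org/w/api.php?action=parse&page=${page}&format=xml&redirects" \
    | xmllint --xpath "string(/api/parse/text)" - \
    | xmllint --xpath "string(//table[@class='infobox vcard']//tr[th='Website']//a/@href)" -
  echo
done < pages.txt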