How to use XPath or xgrep to find information in Wikipedia?
I'd like to scrape some (not much) info from Wikipedia. Say I have a list of universities and their Wikipedia pages. Can I use an XPath expression to find the website (domain) of each university?
So for instance, if I get the page
curl http://en.wikipedia.org/wiki/Vienna_University_of_Technology
then the XPath expression should find the domain:
http://www.tuwien.ac.at
Ideally, this should work with the Linux xgrep command-line tool, or an equivalent.
With the h prefix bound to the http://www.w3.org/1999/xhtml namespace URI:
/h:html/h:body/h:div[@id='content']
/h:div[@id='bodyContent']
/h:table[@class='infobox vcard']
/h:tr[h:th='Website']
/h:td/h:a/@href
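For example, with xmlstarlet (a sketch, assuming xmlstarlet is installed and the page really parses as namespace-qualified XHTML; university.xhtml is just a hypothetical locally saved copy of the page), the prefix can be bound on the command line:

xmlstarlet sel -N h="http://www.w3.org/1999/xhtml" -t \
  -v "/h:html/h:body/h:div[@id='content']/h:div[@id='bodyContent']/h:table[@class='infobox vcard']/h:tr[h:th='Website']/h:td/h:a/@href" \
  university.xhtml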
Also, it looks like Wikipedia pages are well-formed XML (despite the fact that they are served as text/html). So, if you have an XML document with the page URLs, like:
<root>
<url>http://en.wikipedia.org/wiki/Vienna_University_of_Technology</url>
</root>
You could use:
document(/root/url)/h:html/h:body/h:div[@id='content']
/h:div[@id='bodyContent']
/h:table[@class='infobox vcard']
/h:tr[h:th='Website']
/h:td/h:a/@href
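Note that document() is an XSLT function, so this variant needs an XSLT processor rather than a bare XPath tool. Here is a minimal sketch with xsltproc (the file names urls.xml and websites.xsl are only illustrative, and xsltproc must be allowed to fetch the remote pages):

cat > websites.xsl <<'EOF'
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>
  <!-- For each listed URL, load the page and print the infobox website link -->
  <xsl:template match="/">
    <xsl:for-each select="/root/url">
      <xsl:value-of select="document(.)/h:html/h:body
          /h:div[@id='content']/h:div[@id='bodyContent']
          /h:table[@class='infobox vcard']
          /h:tr[h:th='Website']/h:td/h:a/@href"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
EOF
xsltproc websites.xsl urls.xml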
"it looks like Wikipedia pages are well-formed XML (despite the fact that they are served as text/html)"
Apparently, that's no longer true. I had to use xmllint's --html option to convert the document to well-formed XML.
curl -L https://en.wikipedia.org/wiki/Vienna_University_of_Technology \
| xmllint --html --xpath "//table[@class='infobox vcard']//tr[th='Website']//a/@href" -
Result:
href="https://www.tuwien.at/en/"
Note: I added the -L option to the curl invocation to follow the redirect to https://en.wikipedia.org/wiki/TU_Wien.
As you can see, it returned a node-set consisting of one attribute node. The string() function can be used to get the string value of the first node in a node-set. For an attribute node, it returns the attribute's value as a string:
curl -L https://en.wikipedia.org/wiki/Vienna_University_of_Technology \
| xmllint --html --xpath "string(//table[@class='infobox vcard']//tr[th='Website']//a/@href)" -
Result:
https://www.tuwien.at/en/
✔️
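Since the question asked for the domain rather than the full URL, the host part can be stripped out with standard text tools. This is just one possible post-processing step, appended to the same pipeline:

curl -L https://en.wikipedia.org/wiki/Vienna_University_of_Technology \
| xmllint --html --xpath "string(//table[@class='infobox vcard']//tr[th='Website']//a/@href)" - \
| sed -E 's|^https?://([^/]+).*|\1|'

Result:
www.tuwien.at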
Update: I found that it is possible to retrieve the article as well-formed XML. The action=parse API can do it. Here's the end result:
curl -L 'https://en.wikipedia.org/w/api.php?action=parse&page=Vienna_University_of_Technology&format=xml&redirects' \
| xmllint --xpath "string(/api/parse/text)" - \
| xmllint --xpath "string(//table[@class='infobox vcard']//tr[th='Website']//a/@href)" -
Notice that the --html option is no longer needed.
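To come back to the original list-of-universities use case, the same API call can be wrapped in a small shell loop. This is only a sketch: pages.txt is a hypothetical file containing one article title per line (e.g. Vienna_University_of_Technology).

while read -r page; do
  # print the article title, then the extracted website on the same line
  printf '%s\t' "$page"
  curl -sL "https://en.wikipedia.org/w/api.php?action=parse&page=${page}&format=xml&redirects" \
    | xmllint --xpath "string(/api/parse/text)" - \
    | xmllint --xpath "string(//table[@class='infobox vcard']//tr[th='Website']//a/@href)" -
  echo
done < pages.txt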