How can I use xpath querying using R's XML library?
The xml file has this snippet:
<?xml version="1.0"?>
<PC-AssayContainer
xmlns="http://www.ncbi.nlm.nih.gov"
xmlns:xs="http://www.w3.org/2001/XMLSchema-instance"
xs:schemaLocation="http://www.ncbi.nlm.nih.gov ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem.xsd"
>
....
<PC-AnnotatedXRef>
<PC-AnnotatedXRef_xref>
<PC-XRefData>
<PC-XRefData_pmid>17959251</PC-XRefData_pmid>
</PC-XRefData>
</PC-AnnotatedXRef_xref>
</PC-AnnotatedXRef>
I tried to parse it using xpath's global search and also tried with some namespacing:
library('XML')
doc = xmlInternalTreeParse('http://s3.amazonaws.com/tommy_chheng/pubmed/485270.descr.xml')
> xpathApply(doc, "//PC-XRefData_pmid")
list()
attr(,"class")
[1] "XMLNodeSet"
> getNodeSet(doc, "//PC-XRefData_pmid")
list()
attr(,"class")
[1] "XMLNodeSet"
> xpathApply(doc, "//xs:PC-XRefData_pmid", ns="xs")
list()
> xpathApply(doc, "//xs:PC-XRefData_pmid", ns= c(xs = "http://www.w3.org/2001/XMLSchema-instance"))
list()
Shouldn't the XPath match this element?
<PC-XRefData_pmid>17959251</PC-XRefData_pmid>
Since the default namespace is the NIH one (whose URI is "http://www.ncbi.nlm.nih.gov"), <PC-XRefData_pmid>
(and every other element in your XML document that has no namespace prefix) is in that NIH namespace.
So to match them with an XPath, you need to tell your XPath processor what prefix you're going to use for the NIH namespace, and you need to use that prefix in your XPath.
So, without knowing R, I would try
xpathApply(doc, "//nih:PC-XRefData_pmid",
ns= c(nih = "http://www.ncbi.nlm.nih.gov"))
or else
getNodeSet(doc, "//*[local-name() = 'PC-XRefData_pmid']")
as the latter bypasses namespaces.
Just because the XML document declares the NIH namespace as the default one doesn't mean that the XPath processor will know that. In the XML information model, namespace prefixes are not significant. So when I parse in an XML document, it's not supposed to matter whether the NIH namespace is bound to the "nih:" prefix or the "snizzlefritz:" prefix or the "" (default) prefix. The XML parser or XPath processor is not supposed to have to know what prefix got bound to what namespace in the XML document. Especially since there could be several different prefixes bound to the same namespace at different places in the same document... and vice versa. So if you want to have your XPath expression match an element that's in a namespace, you have to declare that namespace to the XPath processor.
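To illustrate the point about prefixes, here is a minimal sketch. It assumes the XML package is installed; the two tiny documents and the helper name find_pmid are made up for illustration. Both documents put their elements in the NIH namespace, one through the default prefix and one through an arbitrary prefix, and the same XPath with a locally declared "nih" prefix should match in both.

```r
library(XML)

nih_uri <- "http://www.ncbi.nlm.nih.gov"

# Hypothetical documents: the same namespace URI, bound to the default
# prefix in the first and to the arbitrary prefix "s" in the second.
default_doc <- xmlParse(
  '<root xmlns="http://www.ncbi.nlm.nih.gov"><pmid>17959251</pmid></root>',
  asText = TRUE)
prefixed_doc <- xmlParse(
  '<s:root xmlns:s="http://www.ncbi.nlm.nih.gov"><s:pmid>17959251</s:pmid></s:root>',
  asText = TRUE)

# Illustrative helper: the "nih" prefix exists only in this XPath call;
# what matters is the URI it is bound to, not the prefix in the document.
find_pmid <- function(doc) {
  xpathSApply(xmlRoot(doc), "//nih:pmid", xmlValue,
              namespaces = c(nih = nih_uri))
}

find_pmid(default_doc)
find_pmid(prefixed_doc)
```

The prefix chosen in the XPath call is arbitrary; any name would do as long as it maps to the same URI.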
Edit: There are a few caveats, contributed by @Jim Pivarski:
- The "doc" must be an xml node, not a document (class "XMLNode" or "XMLInternalElementNode", not "XMLDocument" or "XMLInternalDocument").
- At least in Jim's version (XML_3.93-0), the named argument is "namespaces", not "ns".
So if "doc" is an instance of a document class, the correct solution is:
xpathApply(xmlRoot(doc), "//nih:PC-XRefData_pmid",
namespaces = c(nih = "http://www.ncbi.nlm.nih.gov"))
This is a FAQ.
This: //PC-XRefData_pmid
Means: any PC-XRefData_pmid element in the document that is in no namespace (the empty namespace).
It doesn't mean: any PC-XRefData_pmid element in the document that is in the default namespace.
Also, your document sample is incomplete, but it looks like your PC-XRefData_pmid element is in the http://www.ncbi.nlm.nih.gov namespace.
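The difference can be seen in a small sketch, assuming the XML package is installed. The document here is invented for illustration: it holds one item element in no namespace and one in the NIH namespace. An unprefixed name test should pick up only the first, a prefixed one only the second, and local-name() both.

```r
library(XML)

# Hypothetical document: one <item> in no namespace, and one <item>
# inside <box>, which puts its content in the NIH default namespace.
doc <- xmlParse(
  '<root><item>plain</item><box xmlns="http://www.ncbi.nlm.nih.gov"><item>namespaced</item></box></root>',
  asText = TRUE)
root <- xmlRoot(doc)

# Unprefixed name: only elements in no namespace.
no_ns <- xpathSApply(root, "//item", xmlValue)

# Prefixed name, with the prefix declared to the XPath processor:
# only elements in that namespace.
in_nih <- xpathSApply(root, "//nih:item", xmlValue,
                      namespaces = c(nih = "http://www.ncbi.nlm.nih.gov"))

# local-name() ignores namespaces entirely, so both elements match.
any_ns <- xpathSApply(root, "//*[local-name() = 'item']", xmlValue)
```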