Retrieve XML Namespaces using Regex
Given an XML fragment that I want to parse with XPath, I first need to extract the namespaces to add to the namespace manager. I've been trying to figure out the regex pattern needed to extract the XML attributes that define a namespace. For example, I want to get all the namespaces, on which I can then do some more basic string manipulation to separate out the namespace name and the URL:
xmlns:my="http://schemas.microsoft.com/office/infopath/2003/myXSD/2010-02-12T12:41:45"
The attribute name will always begin with xmlns: and I need the regex to read to the end of the value, so it should include the closing ".
Alternatively, a more generic pattern that extracts ALL attributes in the form name="value" would do the job, and I can then use string comparisons to check whether each attribute is a namespace.
<my:StationLookupValues xmlns:my="http://schemas.microsoft.com/office/infopath/2003/myXSD/2010-02-12T12:41:45"><my:StationLookupValue>Hull Inspectors</my:StationLookupValue></my:StationLookupValues><my:StationLookupValues xmlns:my="http://schemas.microsoft.com/office/infopath/2003/myXSD/2010-02-12T12:41:45"><my:StationLookupValue>Barnsley Inspectors</my:StationLookupValue></my:StationLookupValues><my:StationValue xmlns:my="http://schemas.microsoft.com/office/infopath/2003/myXSD/2010-02-12T12:41:45">Hull Inspectors</my:StationValue>
I've not been able to find an example of something like this, nor work it out for myself. Any assistance on this would be very much appreciated.
[EDIT] I understand that XML parsers should be used, and this is what I am going to do. But all I have is an XML fragment to pass, so I must first build a namespace manager, and in order to do that I need to extract the namespaces used.
Try this pattern: 'xmlns:(.*?)=(".*?")'
It means
- the literal string xmlns:
- shortest string up to an =
- a quote, followed by shortest string up to the following quote
The parentheses mean the first group contains the namespace name and the second group contains the value. Adjust according to whether you want it all in one group, and whether or not you want the quotes in the group.
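To make the two groups concrete, here is a minimal sketch in Python (chosen only for illustration; the question does not name a language). The variable names and the shortened fragment are assumptions; the pattern is the one given above.

```python
import re

# Pattern from the answer: group 1 is the prefix, group 2 is the quoted URI.
NAMESPACE_DECL = re.compile(r'xmlns:(.*?)=(".*?")')

# Shortened version of the fragment from the question.
fragment = (
    '<my:StationLookupValues xmlns:my="http://schemas.microsoft.com/office/infopath/2003/myXSD/2010-02-12T12:41:45">'
    '<my:StationLookupValue>Hull Inspectors</my:StationLookupValue>'
    '</my:StationLookupValues>'
    '<my:StationValue xmlns:my="http://schemas.microsoft.com/office/infopath/2003/myXSD/2010-02-12T12:41:45">'
    'Hull Inspectors</my:StationValue>'
)

# Collect prefix -> URI; repeated declarations of the same prefix collapse
# into one entry (the last one seen wins).
namespaces = {}
for prefix, quoted_uri in NAMESPACE_DECL.findall(fragment):
    namespaces[prefix] = quoted_uri.strip('"')

print(namespaces)
```

Because the result is keyed by prefix, a prefix that is bound to different URIs in different places would silently lose all but one binding.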
As Tomalak pointed out in his answer, this is fraught with peril. It could match patterns that are part of comments or embedded in strings as data, etc. This is why regular expressions aren't good for parsing XML data: you aren't actually parsing, you're just looking for patterns without regard to context.
Be aware that such things are possible:
<elem>
<x:elem xmlns:x="http://some/namespace" />
<x:elem xmlns:x="http://some/other/namespace" />
<elem xmlns="http://some/third/namespace" />
<elem>
XML Namespaces look like xmlns:foo="http://some/foo/namespace"!
</elem>
<!-- remember to put xmlns:x="http://some/namespace" back in! -->
<elem />
</elem>
Just extracting namespaces and prefixes with a regex will get it wrong at some point.
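As a rough illustration, the following Python sketch runs the regex from the other answer over the sample above and compares it with what a parser reports. It assumes the standard library's xml.etree.ElementTree, whose iterparse function emits a "start-ns" event for each real namespace declaration; the variable names are made up for the example.

```python
import io
import re
import xml.etree.ElementTree as ET

SAMPLE = """<elem>
  <x:elem xmlns:x="http://some/namespace" />
  <x:elem xmlns:x="http://some/other/namespace" />
  <elem xmlns="http://some/third/namespace" />
  <elem>
    XML Namespaces look like xmlns:foo="http://some/foo/namespace"!
  </elem>
  <!-- remember to put xmlns:x="http://some/namespace" back in! -->
  <elem />
</elem>"""

# Regex approach: picks up the text node and the comment as well,
# and misses the default namespace (xmlns="...").
regex_prefixes = [prefix for prefix, _ in re.findall(r'xmlns:(.*?)=(".*?")', SAMPLE)]

# Parser approach: each "start-ns" event carries a (prefix, uri) pair;
# the default namespace is reported with an empty prefix.
declared = [
    ns
    for _event, ns in ET.iterparse(io.BytesIO(SAMPLE.encode("utf-8")), events=("start-ns",))
]

print(regex_prefixes)
print(declared)
```

The regex reports foo, which is only text, and counts x three times, while the parser reports the two real bindings of x plus the default namespace. Note that x maps to two different URIs, so a single prefix-to-URI table cannot represent this document either way.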
I think that processing XML that contains namespaces without knowing what those namespaces are is a sign that someone, somewhere, is doing something wrong.
I'm trying to figure out how, if you don't know what namespace you're looking for, you could get any benefit out of creating a namespace manager. The weirdest requirements often actually turn out to be requirements, so I dunno, but it really seems to me like there's something else going on here.
The regex mentioned by Bryan Oakley will work (with the caveats he mentions).
Others who have railed against the idea of not knowing the namespaces involved in an XML document to parse are forgetting about the XSD specification for wildcards (see section 3.10 of the XML Schema Part 1 specification).
You may be in a scenario, like I currently am, where you only have a base XSD, one that intentionally defines <any namespace="##other" .../> elements to allow for arbitrary XML extensions from other namespaces. In this scenario, you'll have to use XPaths to parse any XML from other namespaces that makes use of the XSD wildcard elements. For my parser, I first need to figure out what namespaces are being used. Then, based on that, I grab the appropriate pre-defined XPaths for those namespaces before I can parse the document.
Using XSD wildcards is nice when you just want a base structure but also want the flexibility that allows others to append their own information independently of one another, so you don't have to constantly revise XSDs for any random request from another group and risk breaking those currently using the schema.
I haven't settled on a final solution for this myself, but I am leaning towards using a regex to grab XML prefixes (which will likely produce false positives) and then validating those matches against the JAXP org.w3c.dom.Document.lookupNamespaceURI(String prefix) to remove the false positives.
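The "discover the namespaces, then pick the matching XPaths" flow could look roughly like the sketch below. It is written in Python rather than with JAXP, purely as an illustration. The registry XPATHS_BY_NAMESPACE, the "ext" prefix, the wrapper element and the function name are all hypothetical, and the namespace URI is borrowed from the question's fragment.

```python
import io
import xml.etree.ElementTree as ET

INFOPATH_NS = "http://schemas.microsoft.com/office/infopath/2003/myXSD/2010-02-12T12:41:45"

# Hypothetical registry: namespace URI -> XPath written with our own "ext" prefix,
# so the lookup does not depend on whatever prefix the document happens to use.
XPATHS_BY_NAMESPACE = {
    INFOPATH_NS: ".//ext:StationLookupValue",
}

def extract_known_values(fragment: str) -> dict:
    # Wrap the fragment so that several top-level elements still parse.
    # (This assumes the fragment carries no XML declaration of its own.)
    data = ("<wrapper>" + fragment + "</wrapper>").encode("utf-8")

    # Step 1: find out which namespace URIs are actually declared.
    uris = {
        uri
        for _event, (_prefix, uri) in ET.iterparse(io.BytesIO(data), events=("start-ns",))
    }

    # Step 2: run only the XPaths registered for those namespaces.
    root = ET.fromstring(data)
    results = {}
    for uri in uris:
        xpath = XPATHS_BY_NAMESPACE.get(uri)
        if xpath is not None:
            results[uri] = [node.text for node in root.findall(xpath, {"ext": uri})]
    return results

fragment = (
    '<my:StationLookupValues xmlns:my="' + INFOPATH_NS + '">'
    '<my:StationLookupValue>Hull Inspectors</my:StationLookupValue>'
    '</my:StationLookupValues>'
    '<my:StationLookupValues xmlns:my="' + INFOPATH_NS + '">'
    '<my:StationLookupValue>Barnsley Inspectors</my:StationLookupValue>'
    '</my:StationLookupValues>'
)
values = extract_known_values(fragment)
print(values)
```

Keying the registry by URI rather than by prefix matters here, since extension authors are free to choose any prefix for their namespace.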