Validating XML with XSDs ... but still allow extensibility
Maybe it's me, but it appears that if you have an XSD
<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="User">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="GivenName" />
        <xs:element name="SurName" />
      </xs:sequence>
      <xs:attribute name="ID" type="xs:unsignedByte" use="required" />
    </xs:complexType>
  </xs:element>
</xs:schema>
that defines the schema for this document
<?xml version="1.0" encoding="utf-8" ?>
<User ID="1">
  <GivenName></GivenName>
  <SurName></SurName>
</User>
It would fail to validate if you added another element, say EmailAddress, and mixed up the order:
<?xml version="1.0" encoding="utf-8" ?>
<User ID="1">
  <SurName></SurName>
  <EmailAddress></EmailAddress>
  <GivenName></GivenName>
</User>
I don't want to add EmailAddress to the schema and mark it as optional.
I just want an XSD that validates the bare minimum requirements that the document must meet.
Is there a way to do this?
EDIT:
marc_s pointed out below that you can use xs:any inside of xs:sequence to allow more elements. Unfortunately, you then have to maintain the order of elements.
Alternatively, I can use xs:all, which doesn't enforce the order of elements, but alas, doesn't allow me to place xs:any inside of it.
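For reference, here is a minimal sketch of the xs:all variant described above, based on the User schema from the top of this question. It accepts GivenName and SurName in either order, but a document containing EmailAddress would still be rejected because no wildcard is allowed inside xs:all.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="User">
    <xs:complexType>
      <!-- xs:all: each child may appear at most once, in any order -->
      <xs:all>
        <xs:element name="GivenName" />
        <xs:element name="SurName" />
      </xs:all>
      <xs:attribute name="ID" type="xs:unsignedByte" use="required" />
    </xs:complexType>
  </xs:element>
</xs:schema>
```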
Your issue has a resolution, but it will not be pretty. Here's why:
Violation of the deterministic content model rule
You've touched on the very soul of W3C XML Schema. What you are asking for (variable order and variable unknown elements) violates the hardest, yet most basic principle of XSD: the rule of non-ambiguity, or, more formally, the Unique Particle Attribution constraint:
A content model must be formed such that during validation [..] each item in the sequence can be uniquely determined without examining the content or attributes of that item, and without any information about the items in the remainder of the sequence.
In normal English: when an XML document is validated and the XSD processor encounters <SurName>, it must be able to validate it without first checking whether it is followed by <GivenName>, i.e., no looking ahead. In your scenario, this is not possible. This rule exists to allow implementations through Finite State Machines, which should make implementations rather trivial and fast.
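To make the rule concrete, here is a minimal sketch of a content model that is deliberately invalid under XSD 1.0. It is an illustration of the constraint, not something taken from your schema: when the processor sees <GivenName>, it cannot tell whether the element belongs to the optional wildcard or to the element declaration that follows it.

```xml
<xs:element name="User">
  <xs:complexType>
    <xs:sequence>
      <!-- The wildcard accepts any element, including GivenName... -->
      <xs:any minOccurs="0" processContents="lax" namespace="##any" />
      <!-- ...so this declaration competes with it for the same element: ambiguous -->
      <xs:element name="GivenName" />
    </xs:sequence>
  </xs:complexType>
</xs:element>
```

A schema processor should refuse to load this schema at all, before any document is validated against it.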
This is one of the most-debated issues. It is a heritage of SGML and DTD (where content models must be deterministic) and of XML itself, which defines by default that the order of elements is significant (thus, trying the opposite, making the order unimportant, is hard).
As marc_s already suggested, RelaxNG is an alternative that allows for non-deterministic content models. But what can you do if you're stuck with W3C XML Schema?
Non-working semi-valid solutions
You've already noticed that xs:all is very restrictive. The reason is simple: the same determinism rule applies, and that's why xs:any, min/maxOccurs larger than one, and sequences are not allowed inside it.
Also, you may have tried all sorts of combinations of choice, sequence and any. The error that the Microsoft XSD processor throws when encountering such an invalid situation is:
Error: Multiple definition of element 'http://example.com/Chad:SurName' causes the content model to become ambiguous. A content model must be formed such that during validation of an element information item sequence, the particle contained directly, indirectly or implicitly therein with which to attempt to validate each item in the sequence in turn can be uniquely determined without examining the content or attributes of that item, and without any information about the items in the remainder of the sequence.
In O'Reilly's XML Schema (yes, the book has its flaws) this is excellently explained. Fortunately, parts of the book are available online. I highly recommend you read through section 7.4.1.3 about the Unique Particle Attribution Rule; its explanations and examples are much clearer than I can ever get them.
One working solution
In most cases it is possible to go from a non-deterministic design to a deterministic one. This usually doesn't look pretty, but it's a solution if you have to stick with W3C XML Schema and/or if you absolutely must allow non-strict rules in your XML. The nightmare with your situation is that you want to enforce one thing (two predefined elements) and at the same time want to keep it very loose (order doesn't matter and anything can go between, before and after). If I don't try to give you good advice but just take you directly to a solution, it will look as follows:
<xs:element name="User">
  <xs:complexType>
    <xs:sequence>
      <xs:any minOccurs="0" processContents="lax" namespace="##other" />
      <xs:choice>
        <xs:sequence>
          <xs:element name="GivenName" />
          <xs:any minOccurs="0" processContents="lax" namespace="##other" />
          <xs:element name="SurName" />
        </xs:sequence>
        <xs:sequence>
          <xs:element name="SurName" />
          <xs:any minOccurs="0" processContents="lax" namespace="##other" />
          <xs:element name="GivenName" />
        </xs:sequence>
      </xs:choice>
      <xs:any minOccurs="0" processContents="lax" namespace="##any" />
    </xs:sequence>
    <xs:attribute name="ID" type="xs:unsignedByte" use="required" />
  </xs:complexType>
</xs:element>
The code above actually just works, but there are a few caveats. The first is xs:any with ##other as its namespace. You cannot use ##any, except for the last one, because that would allow elements like GivenName to be used in its place, and that means that the definition of User becomes ambiguous.
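As an illustration, here is a sketch of an instance document that this content model should accept, assuming the schema has no targetNamespace (as in your original schema). The ext prefix, its namespace URI and the extra element names are hypothetical. Because ##other excludes unqualified elements when there is no target namespace, the extension elements in the first two slots must be namespace-qualified; only the final ##any slot takes an unqualified one.

```xml
<?xml version="1.0" encoding="utf-8" ?>
<User ID="1" xmlns:ext="http://example.com/extensions">
  <!-- matched by the leading ##other wildcard -->
  <ext:EmailAddress></ext:EmailAddress>
  <SurName></SurName>
  <!-- matched by the ##other wildcard between the two required elements -->
  <ext:Nickname></ext:Nickname>
  <GivenName></GivenName>
  <!-- matched by the trailing ##any wildcard, so it may be unqualified -->
  <PhoneNumber></PhoneNumber>
</User>
```

Note that each xs:any above omits maxOccurs, which defaults to 1, so each slot takes at most one extension element as written.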
The second caveat is that if you want to use this trick with more than two or three elements, you'll have to write down all combinations: a maintenance nightmare. That's why I suggest the following:
A suggested solution, a variant of a Variable Content Container
Change your definition. This has the advantage of being clearer to your readers or users, and it is also easier to maintain. A whole range of solutions is explained on XFront (Oleg's post links to the same material, which you may have already seen). It's an excellent read, but most of it does not take into account that you have a minimum requirement of two elements inside the variable content container.
The current best-practice approach for your situation (which happens more often than you may imagine) is to split your data between the required and non-required fields. You can add an element <Required>, or do the opposite and add an element <ExtendedInfo> (or call it Properties, or OptionalData). This looks as follows:
<xs:element name="User2">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="GivenName" />
      <xs:element name="SurName" />
      <xs:element name="ExtendedInfo" minOccurs="0">
        <xs:complexType>
          <xs:sequence>
            <xs:any minOccurs="0" maxOccurs="unbounded" processContents="lax" namespace="##any" />
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>
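A document for this definition could look like the sketch below. The values and the elements inside ExtendedInfo are made up for illustration, and the schema is again assumed to have no targetNamespace. Note that User2 as written declares no ID attribute, so none is used here.

```xml
<?xml version="1.0" encoding="utf-8" ?>
<User2>
  <!-- required elements, in fixed order -->
  <GivenName>Ada</GivenName>
  <SurName>Lovelace</SurName>
  <!-- everything optional or unknown lives in the container, in any order -->
  <ExtendedInfo>
    <EmailAddress>ada@example.com</EmailAddress>
    <Nickname>AAL</Nickname>
  </ExtendedInfo>
</User2>
```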
This may seem less than ideal at the moment, but let it grow a bit. Having an ordered set of fixed elements isn't that big a deal. You're not the only one who'll be complaining about this apparent deficiency of W3C XML Schema, but as I said earlier, if you have to use it, you'll have to live with its limitations, or accept the burden of developing around these limitations at a higher cost of ownership.
Alternative solution
I'm sure you know this already, but the order of attributes is by default undetermined. If all your content is of simple types, you can alternatively choose to make a more abundant use of attributes.
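A minimal sketch of that attribute-based variant follows, assuming all values are plain text. The xs:string types and the use of xs:anyAttribute for the open-ended part are my assumptions for illustration, not something from your schema.

```xml
<xs:element name="User">
  <xs:complexType>
    <!-- required data as attributes: their order in the document never matters -->
    <xs:attribute name="ID" type="xs:unsignedByte" use="required" />
    <xs:attribute name="GivenName" type="xs:string" use="required" />
    <xs:attribute name="SurName" type="xs:string" use="required" />
    <!-- accept additional, undeclared attributes -->
    <xs:anyAttribute processContents="lax" />
  </xs:complexType>
</xs:element>
```

With this definition, something like <User SurName="Lovelace" EmailAddress="ada@example.com" ID="1" GivenName="Ada" /> should validate, whatever order the attributes appear in.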
A final word
Whatever approach you take, you will lose a lot of verifiability of your data. It's often better to allow content providers to add content types, but only when the content can be verified. You can do this by switching from lax to strict processing and by making the types themselves stricter. But being too strict isn't good either; the right balance will depend on your ability to judge the use cases that you're up against and to weigh them against the trade-offs of certain implementation strategies.
After reading marc_s's answer and your discussion in the comments, I decided to add a little.
It seems to me there is no perfect solution to your problem, Chad. There are some approaches for implementing an extensible content model in XSD, but all the implementations I know of have some restrictions. Because you didn't describe the environment where you plan to use the extensible XSD, I can only recommend some links which will probably help you choose an approach that can be implemented in your environment:
- http://www.xfront.com/ExtensibleContentModels.html (or http://www.xfront.com/ExtensibleContentModels.pdf) and http://www.xfront.com/VariableContentContainers.html
- http://www.xml.com/lpt/a/993 (or http://www.xml.com/pub/a/2002/07/03/schema_design.html)
- http://msdn.microsoft.com/en-us/library/ms950793.aspx
You should be able to extend your schema with the <xs:any> element for extensibility - see W3Schools for details.
<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="User">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="GivenName" />
        <xs:element name="SurName" />
        <xs:any minOccurs="0" maxOccurs="unbounded" processContents="lax" />
      </xs:sequence>
      <xs:attribute name="ID" type="xs:unsignedByte" use="required" />
    </xs:complexType>
  </xs:element>
</xs:schema>
When you add processContents="lax", the .NET XML validation should succeed on it.
See MSDN docs on xs:any for more details.
Update: if you require more flexibility and less stringent validation, you might want to look at other methods of defining schemas for your XML - something like RelaxNG. XML Schema is - on purpose - rather strict about its rules, so maybe that's just the wrong tool for this job at hand.
Well, you can always use DTD :-) except that DTD also prescribes ordering. Validation with "unordered" grammar is terribly expensive. You could play with xsd:choice and min and max occurs but it's probably going to balk as well. You could also write XSD extensions / derived schemas.
The way you posed the problem, it looks like you don't really want XSD at all. You can just load the document and then validate whatever minimum you want with XPaths, but just protesting against XSD, so many years after it became a ubiquitous standard, is really not going to get you anywhere.
RelaxNG will solve this problem succinctly, if you can use it. Determinism isn't a requirement for its schemas. You can translate an RNG or RNC schema into XSD, but the result will only be an approximation in this case. Whether that's good enough for your use is up to you.
The RNC schema for this case is:
start = User
User = element User {
  attribute ID { xsd:unsignedByte },
  ( element GivenName { text } &
    element SurName { text } &
    # zero or more other elements, each with arbitrary content
    element * - (SurName | GivenName) { (attribute * { text } | text | any)* }* )
}
any = element * { (attribute * { text } | text | any)* }
The any rule matches any well-formed XML fragment. So this will require the User element to contain GivenName and SurName elements containing text in any order, and allow any other elements containing pretty much anything.