Can I enforce the order of XML attributes using a schema?
Our C++ application reads configuration data from XML files that look something like this:
<data>
    <value id="FOO1" name="foo1" size="10" description="the foo" ... />
    <value id="FOO2" name="foo2" size="10" description="the other foo" ... />
    ...
    <value id="FOO300" name="foo300" size="10" description="the last foo" ... />
</data>
The complete application configuration consists of ~2500 of these XML files (which translates into more than 1.5 million key/value attribute pairs). The XML files come from many different sources/teams and are validated against a schema. However, sometimes the <value/> nodes look like this:
<value name="bar1" id="BAR1" description="the bar" size="20" ... />
or this:
<value id="BAT1" description="the bat" name="bat1" size="25" ... />
To make this process fast, we are using Expat to parse the XML documents. Expat exposes the attributes as an array - like this:
void ExpatParser::StartElement(const XML_Char* name, const XML_Char** atts)
{
    // The attributes are stored in an array of XML_Char* where:
    //   the nth element is the 'key'
    //   the n+1 element is the value
    //   the final element is NULL
    for (int i = 0; atts[i]; i += 2)
    {
        std::string key = atts[i];
        std::string value = atts[i + 1];
        ProcessAttribute (key, value);
    }
}
This puts all the responsibility onto our ProcessAttribute() function to read the 'key' and decide what to do with the value. Profiling the app has shown that ~40% of the total XML parsing time is spent dealing with these attributes by name/string.
The overall process could be sped up dramatically if I could guarantee/enforce the order of the attributes (for starters, no string comparisons in ProcessAttribute()). For example, if the 'id' attribute was always the 1st attribute we could deal with it directly:
void ExpatParser::StartElement(const XML_Char* name, const XML_Char** atts)
{
    // The attributes are stored in an array of XML_Char* where:
    //   the nth element is the 'key'
    //   the n+1 element is the value
    //   the final element is NULL
    ProcessID (atts[1]);
    ProcessName (atts[3]);
    //etc.
}
According to the W3C schema specs, I can use <xs:sequence> in an XML schema to enforce the order of elements - but it doesn't seem to work for attributes - or perhaps I'm using it incorrectly:
<xs:element name="data">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="value" type="value_type" minOccurs="1" maxOccurs="unbounded" />
        </xs:sequence>
    </xs:complexType>
</xs:element>
<xs:complexType name="value_type">
    <!-- This doesn't work -->
    <xs:sequence>
        <xs:attribute name="id" type="xs:string" />
        <xs:attribute name="name" type="xs:string" />
        <xs:attribute name="description" type="xs:string" />
    </xs:sequence>
</xs:complexType>
Is there a way to enforce attribute order in an XML document? If the answer is "no" - could anyone perhaps suggest an alternative that wouldn't carry a huge runtime performance penalty?
According to the XML specification:
"the order of attribute specifications in a start-tag or empty-element tag is not significant"
You can check this in section 3.1 of the specification.
XML attributes don't have an order, therefore there is no order to enforce.
If you want something ordered, you need XML elements, or something other than XML. JSON, YAML and bEncode, for example, have both maps (which are unordered) and sequences (which are ordered).
As others have pointed out, no, you can't rely on attribute ordering.
If I had any process at all involving 2,500 XML files and 1.5 million key/value pairs, I would get that data out of XML and into a more usable form as soon as I possibly could. A database, a binary serialization format, whatever. You're not getting any advantage out of using XML (other than schema validation). I'd update my store every time I got a new XML file, and take parsing 1.5 million XML elements out of the main flow of my process.
The answer is no, alas. I'm shocked by your 40% figure. I find it hard to believe that turning "foo" into ProcessFoo takes that long. Are you sure the 40% doesn't include the time taken to execute ProcessFoo?
Is it possible to access the attributes by name using this Expat thing? That's the more traditional way to access attributes. I'm not saying it's going to be faster, but it might be worth a try.
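For what it's worth, as far as I know Expat only hands the start-element handler the flat key/value array and has no lookup-by-name call of its own, so access by name would have to be a small helper on top of it. A minimal sketch, assuming a default build where XML_Char is plain char; FindAttribute is a hypothetical name, not part of Expat:

```cpp
#include <cstring>

// Hypothetical helper (not an Expat API): look up one attribute by name.
// `atts` follows the layout described in the question:
// key, value, key, value, ..., terminated by a null pointer.
// Assumes XML_Char is plain char, so const char* is used directly.
inline const char* FindAttribute(const char** atts, const char* key)
{
    for (int i = 0; atts[i]; i += 2)
    {
        if (std::strcmp(atts[i], key) == 0)
            return atts[i + 1];   // value sits right after its key
    }
    return nullptr;               // attribute not present
}
```

This still compares strings, so it is not obviously faster than the loop in the question; it only avoids building a std::string for every key and value.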
I don't think XML Schema supports that - attributes are just defined and restricted by name, e.g. they have to match a particular name - but I don't see how you could define an order for those attributes in XSD.
I don't know of any other way to make sure attributes on an XML node come in a particular order - not sure if any of the other XML schema mechanisms like Schematron or Relax NG would support that.
I'm pretty sure there's no way to enforce attribute order in an XML document. I'm going to assume that you can insist on it via a business process or other human factors, such as a contract or other document.
What if you just assumed that the first attribute was "id", and tested the name to be sure? If it is, use the value; if not, you can try to get the attribute by name or throw out the document.
While not as efficient as calling out the attribute by its ordinal, some non-zero number of times you'll be able to guess that your data providers have delivered XML to spec. The rest of the time, you can take other action.
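The guess-then-verify idea could look roughly like this. It is only a sketch: ValueRecord and ParseValue are hypothetical names, only 'id' and 'name' are handled, and plain const char* stands in for XML_Char (which assumes a default Expat build).

```cpp
#include <cstring>
#include <string>

// Hypothetical record for one <value/> element (not from the question).
struct ValueRecord
{
    std::string id;
    std::string name;
};

// Fills `out` from an Expat-style attribute array (key, value, ..., null).
// Returns false when either 'id' or 'name' is missing.
inline bool ParseValue(const char** atts, ValueRecord& out)
{
    // Fast path: the attributes arrived in the agreed order, so two
    // comparisons confirm the guess and the values are read by position.
    // atts[2] is safe to read once atts[0] is non-null: it is either
    // the next key or the terminating null pointer.
    if (atts[0] && std::strcmp(atts[0], "id") == 0 &&
        atts[2] && std::strcmp(atts[2], "name") == 0)
    {
        out.id = atts[1];
        out.name = atts[3];
        return true;
    }

    // Slow path: the order differs, so fall back to matching every key.
    bool haveId = false;
    bool haveName = false;
    for (int i = 0; atts[i]; i += 2)
    {
        if (std::strcmp(atts[i], "id") == 0)
        {
            out.id = atts[i + 1];
            haveId = true;
        }
        else if (std::strcmp(atts[i], "name") == 0)
        {
            out.name = atts[i + 1];
            haveName = true;
        }
    }
    return haveId && haveName;
}
```

Files that follow the agreed order pay one comparison per attribute; everything else still parses correctly, just more slowly, and a false return is the point where you could reject the document instead.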
Just a guess, but can you try adding use="required" to each of your attribute specifications?
<xs:complexType name="value_type">
    <!-- This doesn't work -->
    <xs:sequence>
        <xs:attribute name="id" type="xs:string" use="required" />
        <xs:attribute name="name" type="xs:string" use="required" />
        <xs:attribute name="description" type="xs:string" use="required" />
    </xs:sequence>
</xs:complexType>
I'm wondering if the parser is being slowed down by allowing optional attributes, when it appears your attributes will always be there.
Again, just a guess.
EDIT: XML 1.0 spec says that attribute order is not significant. http://www.w3.org/TR/REC-xml/#sec-starttags
Therefore, XSD won't enforce any order. But that doesn't mean that parsers can't be fooled into working quickly, so I'm keeping the above answer published in case it actually works.
From what I recall, Expat is a non-validating parser and better for it, so you can probably scrap that XSD idea. Order dependence isn't a good idea in many XML approaches either (XSD got criticised on element order a heck of a lot back in the day, for example, by pro- or anti- sellers of XML Web Services at MSFT).
Do your custom encoding and simply extend either your logic for more efficient lookup or dig into the parser source. It is trivial to write the tooling around encoding efficient replacement whilst shielding the software agents and users from it. You want to do this so it is easily migrated while preserving backward compatibility and reversibility. Also, go for fixed-size constraints/attribute-name-translation.
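One possible reading of attribute-name translation, sketched under assumptions: map each known name to a small integer code once, then switch on the code instead of passing strings around. The Attr enum and TranslateAttr are hypothetical, the name set is just the four attributes visible in the question's sample, and const char* stands in for XML_Char.

```cpp
#include <cstring>

// Hypothetical codes for the attribute names shown in the question.
enum class Attr { Id, Name, Size, Description, Unknown };

// Translate an attribute name into its code. Branching on the first
// character means at most one strcmp runs per key; this works only
// because the four known names start with different letters.
inline Attr TranslateAttr(const char* key)
{
    switch (key[0])
    {
        case 'i': return std::strcmp(key, "id") == 0 ? Attr::Id : Attr::Unknown;
        case 'n': return std::strcmp(key, "name") == 0 ? Attr::Name : Attr::Unknown;
        case 's': return std::strcmp(key, "size") == 0 ? Attr::Size : Attr::Unknown;
        case 'd': return std::strcmp(key, "description") == 0 ? Attr::Description : Attr::Unknown;
        default:  return Attr::Unknown;   // includes the empty string
    }
}
```

The handler can then switch on the returned code for each attribute, so the per-attribute cost no longer depends on the order the attributes arrive in.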
[ Consider yourself lucky with Expat :) and its raw speed. Imagine how CLR devs love XML scaling facilities, they routinely send 200MB on the wire in process of 'just querying the database' .. ]