Performance problem with XSLT index having many identical keys
I'm using XSLT keys in many contexts. Usually, the key values are more or less unique, with only infrequent duplicates. Now I have defined a key which has A LOT of instances for some key values. To be precise: I'm processing a 1.7 gigabyte file with 420,000 entries having a @STEREOTYPE attribute. Some of the stereotypes occur up to 90,000 times. Those are not the ones I'm interested in, though. The ones that I would like to select usually have maybe 10 to 20 instances.
The key definition is
<xsl:key
name="entityByStereotype"
match="/REPOSITORY_DUMP/ENTITY_LIST/ENTITY"
use="@STEREOTYPE"/>
Building the index takes forever; I usually kill the process after 5 or 6 hours.
An alternate key definition is
<xsl:key
name="entityByStereotype"
match="/REPOSITORY_DUMP/ENTITY_LIST/ENTITY"
use="concat(@STEREOTYPE, @OBJECT_ID)"/>
which forces the key values to be unique; with it, the index build finishes after 14 seconds. My assumption is that the sort algorithm does not cope well with many instances of the same key, resulting in O(n**2) complexity within each subset of identical keys. This is pretty bad for subsets of 90,000 entries. :-(
However, I cannot use the alternate index definition, since I do not know the OBJECT_ID part of the instance beforehand.
Any ideas? Thanks a lot!
Saxon used: Version 9.1.0.5
Have you tried using just <xsl:for-each-group>?
If you provide a suitable source XML document, I may be able to help find a better solution.
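For illustration, here is a minimal sketch of the grouping approach, assuming XSLT 2.0 (which Saxon 9.1 supports) and the element names from the question. The output elements `groups` and `group` are made up for this example; whether grouping actually avoids the slow key build on your data is something you would have to measure.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/">
    <!-- "groups" and "group" are hypothetical output element names -->
    <groups>
      <!-- One pass over all ENTITY elements, grouped by their stereotype -->
      <xsl:for-each-group
          select="REPOSITORY_DUMP/ENTITY_LIST/ENTITY"
          group-by="@STEREOTYPE">
        <!-- current-group() holds every ENTITY sharing this stereotype -->
        <group stereotype="{current-grouping-key()}"
               count="{count(current-group())}"/>
      </xsl:for-each-group>
    </groups>
  </xsl:template>

</xsl:stylesheet>
```

Inside the `xsl:for-each-group` body you can test `current-grouping-key()` and process only the stereotypes you care about.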
Update: A few other tricks I'd recommend:
1) If you know in advance the values of @STEREOTYPE in which you are interested, use:
<xsl:key
name="entityByStereotype"
match="/REPOSITORY_DUMP/ENTITY_LIST/ENTITY[@STEREOTYPE = ($val1, $val2,...,$val-n)]"
use="@STEREOTYPE"/>
If they occur, as you say, just 10-20 times each, chances are the hash table (yes, sorting isn't meaningful for implementing keys) will be built much more quickly.
2) Split the XML document into several smaller (say 10) documents and process them separately.
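Trick 1 could look roughly like the sketch below. It assumes XSLT 2.0, where a key's match pattern may reference a global parameter (XSLT 1.0 forbids this). The parameter name `wantedStereotypes`, its two placeholder values, and the output elements are hypothetical; substitute your real stereotype names.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical parameter: the stereotype values of interest (placeholders) -->
  <xsl:param name="wantedStereotypes" select="('stereotypeA', 'stereotypeB')"/>

  <!-- Only entities with a wanted stereotype are indexed,
       so the 90,000-instance values never enter the key -->
  <xsl:key
      name="entityByStereotype"
      match="/REPOSITORY_DUMP/ENTITY_LIST/ENTITY[@STEREOTYPE = $wantedStereotypes]"
      use="@STEREOTYPE"/>

  <xsl:template match="/">
    <!-- Keep the document node: inside the loop below the context item
         is a string, so key() needs an explicit third argument -->
    <xsl:variable name="doc" select="."/>
    <result>
      <xsl:for-each select="$wantedStereotypes">
        <stereotype name="{.}"
                    count="{count(key('entityByStereotype', ., $doc))}"/>
      </xsl:for-each>
    </result>
  </xsl:template>

</xsl:stylesheet>
```

The three-argument form of `key()` is an XSLT 2.0 feature; in the usual case, where `key()` is called with a node of the source document as context, the two-argument form is enough.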