Flatten and convert a large amount of XML files to a Table
I have a huge number (2k+) of xml files that I need to extract data from and dump that data into a table of some sort (Excel or simply one or multiple XML files would be fine). The fun part is that the xml files have wildly differing amount of nodes (with occasionally the same names in the subnodes) and also the depth of the hierarchy is not known.
Anyway, here is an example:
<?xml version="1.0" encoding="UTF-8"?>
<SomeName>
<identifier>
<UID> 1234 </UID>
</identifier>
<MainNode1>
<SubNode1>
<Subnode1a>DATA1a0</Subnode1a>
</SubNode1>
<SubNode1>
<Subnode1a>DATA1a1</Subnode1a>
</SubNode1>
<SubNode1>
<Subnode1a>DATA1a2</Subnode1a>
</SubNode1>
</MainNode1>
<MainNode2>
<SubNode2>
<Subnode2a>DATA2a0</Subnode2a>
</SubNode2>
<SubNode3>
<Subnode3a>DATA3a0</Subnode3a>
</SubNode3>
<SubNode4>
<Subnode4a>DATA4a0</Subnode4a>
</SubNode4>
</MainNode2>
<MainNodeIDONTCARE>
<SubnodeWhatever>
</SubnodeWhatever>
<MainNodeIDONTCARE>
</SomeName>
And here is my table I want it to flatten to. Basically it should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<SomeName>
<UID>1234</UID>
<MainNode1_SubNode1_SubNode1aA>DATA1a0</MainNode1_SubNode1_SubNode1aA>
<MainNode1_SubNode1_SubNode1aB>DATA1a1</MainNode1_SubNode1_SubNode1aB>
<MainNode1_SubNode1_SubNode1aC>DATA1a2</MainNode1_SubNode1_SubNode1aC>
<MainNode2_SubNode2_SubNode2a>Data2a0</MainNode2_SubNode2_SubNode2a>
<MainNode2_SubNode3_SubNode3a>Data3a0</MainNode2_SubNode3_SubNode3a>
<MainNode2_SubNode4_SubNode4a>Data4a0</MainNode2_SubNode4_SubNode4a>
</SomeName>
As you can see the <MainNodeIDONTCARE>
is not in the table, bec开发者_开发知识库ause I simply want to extract specific data sets from the XML files (I want to include MainNodes that I need and just ignore all others). In this case anything that is in MainNode1 through MainNode4.
What would be the best way to get this done? I don't care about performance, because the files aren't too big and the conversion can take as long as it wants to.
Is there some clever XSLT or anything else that can help me? Thank you.
Here you are:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="SomeName">
<xsl:copy>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
<xsl:template match="UID" priority="1">
<xsl:copy-of select="."/>
</xsl:template>
<xsl:template match="MainNodeIDONTCARE" priority="1"/>
<xsl:template match="SomeName//*[not(*)][text()]">
<xsl:variable name="elementName">
<xsl:call-template name="getElementName">
<xsl:with-param name="element" select="."/>
</xsl:call-template>
</xsl:variable>
<xsl:element name="{$elementName}">
<xsl:value-of select="."/>
</xsl:element>
</xsl:template>
<xsl:template name="getElementName">
<xsl:param name="element"/>
<xsl:if test="$element/parent::*[not(self::SomeName)]">
<xsl:call-template name="getElementName">
<xsl:with-param name="element" select="$element/parent::*"/>
</xsl:call-template>
<xsl:text>_</xsl:text>
</xsl:if>
<xsl:value-of select="name($element)"/>
</xsl:template>
</xsl:stylesheet>
Whether or not an XSLT is used (and yes that would work), it seems the trick is to make an xpath statement that makes one single collection of all those nodes. The proper xpath statement will not care about the depth.
Using XSLT 2.0 [Saxon] (I don't think this'll work in xslt 1.0):
First, the non empty text nodes contain the data, so select for them:
<xsl:template match="/SomeName" >
<SomeName>
<xsl:copy-of select="identifier/UID" />
<xsl:apply-templates select="(MainNode1|MainNode2)//text()[normalizespace()]" />
</SomeName>
</xsl:template>
Then you need to make an element name by joining the ancestor names:
<xsl:template match="text()" >
<xsl:element name="{string-join( ./ancestor::*[name()!='SomeName']/name(),'_')}">
<xsl:value-of select="." />
</xsl:element>
</xsl:template>
精彩评论