extract cdata using xslt

2023-01-15 21:31

Below is the XML, which contains a CDATA section:

<?xml version="1.0" encoding="ISO-8859-1"?>
<character>
<name>
<role>Indiana Jones</role>
<actor>Harrison Ford</actor>
<part>protagonist</part>
<![CDATA[  <film>Indiana Jones and the Kingdom of the Crystal Skull</film>]]>
</name>
</character>

For the above XML, I need to strip off the CDATA wrapper and add a new element after the existing "film" element, so the final output will be:

<?xml version="1.0" encoding="ISO-8859-1"?>
<character>
<name>
<role>Indiana Jones</role>
<actor>Harrison Ford</actor>
<part>protagonist</part>
<film>Indiana Jones and the Kingdom of the Crystal Skull</film>
<Language>English</Language>
</name>
</character>

Can this be done using XSLT?

A slightly modified identity transform should work.

Given this XML:

<?xml version="1.0" encoding="ISO-8859-1"?>
<character>
    <name>
        <role>Indiana Jones</role>
        <actor>Harrison Ford</actor>
        <part>protagonist</part>
        <![CDATA[  <film>Indiana Jones and the Kingdom of the Crystal Skull</film>]]>
    </name>
</character>

Using this XSLT:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">

    <xsl:output method="xml" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*" />
            <xsl:value-of select="text()" disable-output-escaping="yes"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

Will produce this output:

<?xml version="1.0" encoding="UTF-8"?>
<character>
   <name>
      <role>Indiana Jones</role>
      <actor>Harrison Ford</actor>
      <part>protagonist</part>
          <film>Indiana Jones and the Kingdom of the Crystal Skull</film>
    </name>
</character>

(Tested using Saxon-HE 9.3.0.5 in oXygen 12.2.)

Since the film element in the CDATA block appears to be well-formed, you could use disable-output-escaping (DOE). Match name/text(), output it using xsl:value-of with DOE, and then insert the Language element immediately after it.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"  />

<!--Identity template simply copies content forward -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>


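<!--Match only the non-whitespace text node of name, so that Language is emitted once -->
<xsl:template match="name/text()[normalize-space()]">
    <!--disable-output-escaping will prevent the "film" element from being escaped.
    Since it appears to be well-formed you should be safe, but no guarantees -->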
    <xsl:value-of select="." disable-output-escaping="yes" />
    <Language>English</Language>
</xsl:template>

</xsl:stylesheet>

Another way to solve this, which gives you more control over the transformation, is to use Andrew Welch's LexEv XMLReader. Among other things, it lets you process CDATA sections as markup.

First, the fact that your input XML has "CDATA" is in one sense irrelevant... the XSLT can't tell whether it's CDATA or not. What's key about your input XML is that you have escaped markup <film>...</film>, and you want to turn it into a real element.

If you know that the escaped element will always have a certain name ('film'), and you know where it occurs, you can strip it and replace it easily:

   <xsl:template match="text()[contains(., '&lt;film>')]">
      <film>
         <xsl:value-of select="substring-before(substring-after(., '&lt;film>'),
              '&lt;/film>')"/>
      </film>
   </xsl:template>
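To show how that template fits into a complete transformation, here is a minimal XSLT 1.0 sketch that pairs it with an identity template and appends the Language element the question asks for. It assumes the escaped element is always named film, carries no attributes, and sits in a text node directly under name. Restricting the match to name/text(), the strip-space declaration, and the output encoding are choices made for this illustration, not part of the answer above.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:output method="xml" indent="yes" encoding="ISO-8859-1"/>
   <!-- Drop whitespace-only text nodes so indentation is regenerated on output -->
   <xsl:strip-space elements="*"/>

   <!-- Identity template: copies everything not matched by a more specific template -->
   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
   </xsl:template>

   <!-- The text node holding the escaped film markup (assumed to be a child of name) -->
   <xsl:template match="name/text()[contains(., '&lt;film>')]">
      <film>
         <xsl:value-of select="substring-before(substring-after(., '&lt;film>'), '&lt;/film>')"/>
      </film>
      <!-- New sibling element requested in the question -->
      <Language>English</Language>
   </xsl:template>
</xsl:stylesheet>
```

Any text in that node outside the escaped film tags, including the surrounding whitespace, is discarded by this sketch.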

If you don't know in advance where the escaped tags will occur and what the element names are, you could use XSLT 2.0's <xsl:analyze-string> to find and replace them. But as Alejandro pointed out, general parsing of XML using regular expressions can get very messy. It would only be feasible if you know the markup will be simple.
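As a rough illustration of the xsl:analyze-string idea, the XSLT 2.0 sketch below rebuilds any escaped element of the simple form <name>text</name>. The regular expression is an assumption made for this example: it only handles elements with plain text content, no attributes, no namespace prefixes, and no nesting, and it copies entity references inside the escaped markup through as literal text. Treat it as a starting point for simple markup rather than a general parser.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:output method="xml" indent="yes"/>
   <xsl:strip-space elements="*"/>

   <!-- Identity template: copies everything not matched by a more specific template -->
   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
   </xsl:template>

   <!-- Any text node that contains a '<' is scanned for simple escaped elements -->
   <xsl:template match="text()[contains(., '&lt;')]">
      <!-- Group 1 is the element name, group 2 its text content;
           the back-reference \1 requires a matching end tag -->
      <xsl:analyze-string select="."
            regex="&lt;([A-Za-z_][A-Za-z0-9_.\-]*)>([^&lt;]*)&lt;/\1>">
         <xsl:matching-substring>
            <xsl:element name="{regex-group(1)}">
               <xsl:value-of select="regex-group(2)"/>
            </xsl:element>
         </xsl:matching-substring>
         <xsl:non-matching-substring>
            <!-- Keep surrounding text, but drop whitespace-only fragments -->
            <xsl:if test="normalize-space(.)">
               <xsl:value-of select="."/>
            </xsl:if>
         </xsl:non-matching-substring>
      </xsl:analyze-string>
   </xsl:template>
</xsl:stylesheet>
```

Applied to the sample input, this turns the escaped film markup into a real film element. Adding the Language element would still need a separate step, for example a template that matches name and emits it after processing the children.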

I was dealing with something similar and found a good solution, so I thought I would share it. Note that this one is for NSXMLParser rather than XSLT.

If you're using NSXMLParser there's a delegate method called foundCDATA which can look like this:

- (void)parser:(NSXMLParser *)parser foundCDATA:(NSData *)CDATABlock{
    if (!parseElement) {
        return;
    }
    if (parsedElementData==nil) {
        parsedElementData = [[NSMutableData alloc] init];
    }
    [parsedElementData appendData:CDATABlock];

    //Grabs the whole content in CDATABlock.
    NSMutableString *content = [[NSMutableString alloc] initWithData:CDATABlock encoding:NSUTF8StringEncoding];

 }

Now add this prewritten class to your project, then import it into the parser class where you want to use it:

#import "NSString_stripHTML.h"

Now you can simply add the following lines to the foundCDATA method:

NSString *strippedContent;
strippedContent = [content strippedHtml];

Now you will have the stripped text without any extra characters. You can substring whatever you want from this stripped text.
