Selecting an abstract of n words including HTML tags with XSLT

2022-12-18 17:22 问答作者：

I want to select an abstract, along with HTML format elements using XSLT. Here is an example the XML:

<PUBLDES>The <IT>European Journal of Cancer (including EJC Supplements),</IT> 
is an international comprehensive oncology journal that publishes original 
research, editorial comments, review articles and news on experimental oncology, 
clinical oncology (medical, paediatric, radiation, surgical), translational 
oncology, and on cancer epidemiology and prevention. The Journal now has online
submission for authors. Please submit manuscripts at 
<SURL>http://ees.elsevier.com/ejc</SURL> and follow the instructions on the 
site.<P/>

The <IT>European Journal of Cancer (including EJC Supplements)</IT> is the 
official Journal of the European Organisation for Research and Treatment 
of Cancer (EORTC), the European CanCer Organisation (ECCO), the European 
Association for Cancer Research (EACR), the the European Society of Breast 
Cancer Specialists (EUSOMA) and the European School of Oncology (ESO). <P/>
Supplements to the <IT>European Journal of Cancer</IT> are published under 
the title <IT>EJC Supplements</IT> (ISSN 1359-6349).  All subscribers to 
<IT>European Journal of Cancer</IT> automatically receive this publication.<开发者_JAVA技巧;P/>
To access the latest tables of contents, abstracts and full-text articles 
from <IT>EJC</IT>, including Articles-in-Press, please visit <URL>
<HREF>http://www.sciencedirect.com/science/journal/09598049</HREF>
<HTXT>ScienceDirect</HTXT>
</URL>.</PUBLDES>

How do I get say 45 words from it, along with HTML tags in it. When I use substring() or concat() it removes the tags (like <IT> etc.).

You would probably be better off doing this programmatically, rather than with pure XSLT, but if you have to use XSLT, here is one way to do it. It does involve multiple stylesheets, although if you had were able to use extension functions, you can make use of node-sets, and combine them into one big (and nasty) style sheet.

The first stylesheet would copy the intial XML, but 'tokenise' any text it finds, so that each word in the text becomes a separate 'WORD' element.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <!-- Copy existing nodes and attributes -->
   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
   </xsl:template>
   <!-- Match text nodes -->
   <xsl:template match="text()">
      <xsl:call-template name="tokenize">
         <xsl:with-param name="string" select="."/>
      </xsl:call-template>
   </xsl:template>
   <!-- Splits a string into separate elements for each word -->
   <xsl:template name="tokenize">
      <xsl:param name="string"/>
      <xsl:param name="delimiter" select="' '"/>
      <xsl:choose>
         <xsl:when test="$delimiter and contains($string, $delimiter)">
            <xsl:variable name="word" select="normalize-space(substring-before($string, $delimiter))"/>
            <xsl:if test="string-length($word) &gt; 0">
               <WORD>
                  <xsl:value-of select="$word"/>
               </WORD>
            </xsl:if>
            <xsl:call-template name="tokenize">
               <xsl:with-param name="string" select="substring-after($string, $delimiter)"/>
               <xsl:with-param name="delimiter" select="$delimiter"/>
            </xsl:call-template>
         </xsl:when>
         <xsl:otherwise>
            <xsl:variable name="word" select="normalize-space($string)"/>
            <xsl:if test="string-length($word) &gt; 0">
               <WORD>
                  <xsl:value-of select="$word"/>
               </WORD>
            </xsl:if>
         </xsl:otherwise>
      </xsl:choose>
   </xsl:template>
</xsl:stylesheet>

The XSLT template used to 'tokenize' a string of text, I took from this question here:

tokenizing-and-sorting-with-xslt-1-0

(Note that in XSLT2.0, I believe there is a tokenize function, which would simplify the above)

This would give you XML like so...

<PUBLDES>
   <WORD>The</WORD>
   <IT>
      <WORD>European</WORD>
      <WORD>Journal</WORD>
      <WORD>of</WORD>
      ....

And so on...

Next, it is a case of traversing this XML document, using another XSLT document, outputting only upto the first 45 word elements. To do this, I repeatedly apply a template, keeping a running total of the number of WORDS currently found. When matching a node, there are three possibilities

Match a WORD element: Output it. Carry on processing from next sibling if total is not reached.
Match a element where the number of words below it is less than the total: Copy the whole element, and then carry on processing from next sibling if total is not reached
Match elements where number of words below would exceed total: Copy the current node (but not its children) and continue processing at first child.

Here is the style sheet, in all its hideousness

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <xsl:variable name="WORDCOUNT">6</xsl:variable>

   <!-- Match root element -->
   <xsl:template match="/">
      <xsl:apply-templates select="descendant::*[1]" mode="word">
         <xsl:with-param name="previousWords">0</xsl:with-param>
      </xsl:apply-templates>
   </xsl:template>

   <!-- Match any node -->
   <xsl:template match="node()" mode="word">
      <xsl:param name="previousWords"/>

      <!-- Number of words below the element (at any depth) -->
      <xsl:variable name="childWords" select="count(descendant::WORD)"/>
      <xsl:choose>
         <!-- Matching a WORD element -->
         <xsl:when test="local-name(.) = 'WORD'">
            <!-- Copy the word -->
            <WORD>
               <xsl:value-of select="."/>
            </WORD>
            <!-- If there are still words to output, continue processing at next sibling -->
            <xsl:if test="$previousWords + 1 &lt; $WORDCOUNT">
               <xsl:apply-templates select="following-sibling::*[1]" mode="word">
                  <xsl:with-param name="previousWords">
                     <xsl:value-of select="$previousWords + 1"/>
                  </xsl:with-param>
               </xsl:apply-templates>
            </xsl:if>
         </xsl:when>

         <!-- Match a node where the number of words below it is within allowed limit -->
         <xsl:when test="$childWords &lt;= $WORDCOUNT - $previousWords">
            <!-- Copy the element -->
            <xsl:copy>
               <!-- Copy all its desecendants -->
               <xsl:copy-of select="*|@*"/>
            </xsl:copy>
            <!-- If there are still words to output, continue processing at next sibling -->
            <xsl:if test="$previousWords + $childWords &lt; $WORDCOUNT">
               <xsl:apply-templates select="following-sibling::*[1]" mode="word">
                  <xsl:with-param name="previousWords">
                     <xsl:value-of select="$previousWords + $childWords"/>
                  </xsl:with-param>
            </xsl:apply-templates>
         </xsl:if>
         </xsl:when>

         <!-- Match nodes where the number of words below it would exceed current limit -->
         <xsl:otherwise>
            <!-- Copy the node -->
            <xsl:copy>
               <!-- Continue processing at very first child node -->
               <xsl:apply-templates select="descendant::*[1]" mode="word">
                  <xsl:with-param name="previousWords">
                     <xsl:value-of select="$previousWords"/>
                  </xsl:with-param>
               </xsl:apply-templates>
            </xsl:copy>
         </xsl:otherwise>
      </xsl:choose>
   </xsl:template>
</xsl:stylesheet>

If you were outputting just the first 4 words, say, this would give you the following output

<PUBLDES>
   <WORD>The</WORD>
   <IT>
      <WORD>European</WORD>
      <WORD>Journal</WORD>
      <WORD>of</WORD>
   </IT>
</PUBLDES>

Of course, you would then need yet another transformation to remove the WORD elements, and just leave the text. This should be fairly straight-forward....

This is all very nasty though, but it is the best I could come up with for now!

继续阅读：parsing xslt

Selecting an abstract of n words including HTML tags with XSLT

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？