Can I parse HTML using XSLT?

2022-12-09 21:05

I have to parse a big HTML file, and I'm only interested in a small section (a table). So I thought about using XSLT to transform the HTML into something simpler that I could then easily process.

The problem I'm having is that the stylesheet is not finding my table, so I don't know if it's even possible to parse HTML using an XSL stylesheet.

By the way, the HTML file has this look (schematic, missing tags):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html id="ctl00_htmlDocumento" xmlns="http://www.w3.org/1999/xhtml" lang="es-ES" xml:lang="es-ES">
<div> some content </div>
<div class="NON_IMPORTANT"></div>
<div class="IMPORTANT_FATHER">
    <div class="IMPORTANT">
        <table>
            HERE IS THE DATA IM LOOKING FOR
        </table>
    </div>
</div>

As requested, here is my XSL:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="tbody">
        tbody found, lets process it
        <xsl:for-each select="tr">
            new tr found, lets process it
        </xsl:for-each>
    </xsl:template>

</xsl:stylesheet>

The full HTML is quite big, so I don't know how to present it here... I've validated the document in Oxygen, and it says it's valid.

Thanks in advance. Gonso

Your match patterns are not namespace-aware. Your document puts every element in the XHTML namespace, so you need to declare xmlns:xhtml="http://www.w3.org/1999/xhtml" on your xsl:stylesheet element and then use the xhtml: prefix in your XPath expressions. The prefix is required: in XPath 1.0 an unprefixed name always refers to an element in no namespace, regardless of any default namespace declaration.

After this, you'll still have the problem that everything else gets processed too, because the built-in template rules copy all text to the output. I don't know if there's a better solution to this, but I think you will need to explicitly process things on the path to the tbody element, something like
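
As a minimal sketch of that fix, here is the stylesheet from the question with the namespace declared and the prefix applied. The prefix name xhtml is arbitrary; only the URI has to match the one on the html element. This assumes the real document actually contains tbody elements (the schematic in the question does not show one, and an XML parser will not add it implicitly).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The xhtml prefix is bound to the URI that the HTML document
     declares as its default namespace. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

    <!-- Matches tbody elements in the XHTML namespace -->
    <xsl:template match="xhtml:tbody">
        tbody found, lets process it
        <!-- Child steps need the prefix too: a bare tr would select nothing -->
        <xsl:for-each select="xhtml:tr">
            new tr found, lets process it
        </xsl:for-each>
    </xsl:template>

</xsl:stylesheet>
```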

<xsl:template match="xhtml:html">
  <xsl:apply-templates select="xhtml:body"/>
</xsl:template>

and the same thing for body and so on until you get to your tbody match.
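
A different approach from the template chain above, sketched here as an alternative: the extra output comes from XSLT's built-in rule that copies every text node, so overriding that rule with an empty template silences it while the built-in element rule keeps walking the tree. The text output method and the one-line-per-row format are illustrative choices, not something from the question.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

    <xsl:output method="text"/>

    <!-- Overrides the built-in rule that copies text nodes to the output.
         Elements are still traversed by the built-in element rule. -->
    <xsl:template match="text()"/>

    <!-- Reached through the default traversal, with no explicit chain
         of templates for html, body, div and table -->
    <xsl:template match="xhtml:tbody">
        <xsl:for-each select="xhtml:tr">
            <!-- value-of reads the string value of the row directly,
                 so the empty text() template does not suppress it -->
            <xsl:value-of select="normalize-space(.)"/>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>

</xsl:stylesheet>
```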

XPath also supports more complex matching than just a specific child as above. For instance, matching a div that is the third div child of its parent can be done with

<xsl:template match="xhtml:div[3]">

and matching an element with a specific attribute with

<xsl:template match="xhtml:div[@class='IMPORTANT']">

Here the [] surrounds an additional condition (a predicate) that must hold for the element to be considered a match. A plain number selects the element at that position among its matching siblings (the indexing is 1-based), and an @ sign precedes an attribute name. You can put arbitrarily complex XPath in there, so you can match pretty much any substructure you'd like.
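
To show both kinds of predicate together, here is a hedged sketch that uses the class name from the question's schematic. It assumes the table's cells are td elements and that only the first cell of each row is wanted; both are illustrative assumptions, since the question does not show the table's contents.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

    <xsl:output method="text"/>

    <!-- Attribute predicate: only a div whose class is exactly IMPORTANT.
         Starting from the root skips the rest of the document. -->
    <xsl:template match="/">
        <xsl:apply-templates
            select="//xhtml:div[@class='IMPORTANT']/xhtml:table"/>
    </xsl:template>

    <xsl:template match="xhtml:table">
        <!-- .//xhtml:tr finds rows whether or not a tbody wraps them -->
        <xsl:for-each select=".//xhtml:tr">
            <!-- Positional predicate: the first td child of the row (1-based) -->
            <xsl:value-of select="normalize-space(xhtml:td[1])"/>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>

</xsl:stylesheet>
```

Note that @class='IMPORTANT' is an exact string comparison, so it would not match a div with several class names such as class="IMPORTANT wide".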

As long as your XHTML document is well-formed, an XML parser will be able to read it, and therefore an XSLT engine will be able to transform it.

Assuming that, the most common causes of not being able to find elements in a document are:

Your XPath expression is being evaluated relative to a different node than you expected. What this means for your XSLT: check that the select expressions and match patterns are correct relative to the node each template is processing.
You have not defined the namespace URI-to-prefix mappings in your XPath engine. What this means for your XSLT: make sure the http://www.w3.org/1999/xhtml namespace is declared in your XSLT file with a prefix (for example xmlns:xhtml="http://www.w3.org/1999/xhtml") and that the prefix is used in your expressions. In XSLT 1.0 a default namespace declaration alone does not affect XPath expressions or match patterns.
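
One way to check both causes is to ask the processor what it actually sees. The following diagnostic stylesheet is a sketch, not part of the original question: it prints each element's local name and namespace URI in document order, one per line, which shows where an element sits in the tree and which namespace a pattern has to name.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="text"/>

    <!-- The * node test matches elements in any namespace,
         so no prefix declaration is needed here -->
    <xsl:template match="*">
        <xsl:value-of select="local-name()"/>
        <xsl:text> {</xsl:text>
        <xsl:value-of select="namespace-uri()"/>
        <xsl:text>}&#10;</xsl:text>
        <!-- Recurse into child elements only, skipping text nodes -->
        <xsl:apply-templates select="*"/>
    </xsl:template>

</xsl:stylesheet>
```

If the URI printed between the braces is not empty, every element name in your match patterns and select expressions needs a prefix bound to that URI.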

If you post your XSLT I will be able to comment further.

You can use XSLT to manipulate HTML, provided the HTML is well-formed (that is, the HTML document is a well-formed XML document in the strictest sense).

If you can confirm this, and your XSLT isn't working, maybe you should provide a more thorough snippet of both the HTML and XSLT documents so that we can figure it out.
