Get first few elements of a html fragment with xpath on ruby

2023-01-21 10:09 问答作者：

For a blog like project, I want to get the first few paragraphs, headers, lists or whatever within a range of characters from a markdown generated html fragment to display as a summary.

So if I have

<h1>hello world</h1>
<p>Lets say these are 100 chars</p>
<ul>
    <li>some bla bla, 40 chars</li&开发者_如何学Gogt;
</ul>
<p>some other text</p>

And assume, I want to summarize with text within the first 150 chars (does not have to be overly exact, I could just get the first 150 chars, including tags and go on with that, but probably would create some artifacts at the tail which could be more difficult to handle...), it should give me the h1, the p and the ul, but not the final p (which would be truncated). If the first element should have more than 150 chars, I would take the full first element.

How could I get this? Using XPath or a regex? I am a bit without ideas on that...

Edit

First I want to give a big THANK YOU to all of you who replied!

While I got really great answers in this thread, I actually found it much easier to plug in before the markdown interpreter hits in, take the first n textblocks separated by \r\n\r\n and just pass this on for md generation.

  class String
    def summarize_md length
        arr = self.split(/\r\n\r\n/)
        sum =""
        arr.each do |ea|
          break if sum.length + ea.length > length
          sum = sum+"#{ea}\r\n\r\n"
        end
        sum
      end
  end

while one probably could reduce this code to a one liner, it is still much simpler and cpu friendlier than any of the proposed solutions. Anyway, since my question could be interpreted such as if the html was the starting point (and not the md text), I'll just give the answer to the first guy... I hope that's just...

A pure XPath 1.0 solution:

substring(/*,1,150)

where the parent of the provided XHTML fragment is the top element (/* or /html).

A very exact XPath 2.0 solution exists:

   for $t in (//text())[not(sum((.| preceding::text())/string-length(.)) gt 150)]
     return
       ($t, '&#xA;')

Do note: The XML document must be parsed in a mode that discards the white-space-only text nodes. Otherwise string-length(.) must be replaced by string-length(normalize-space(.))

How could I get this?

XSLT, of course!

This stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:strip-space elements="*"/>
    <xsl:param name="pMaxLength" select="73"/>
    <xsl:template match="node()">
        <xsl:param name="pPrecedingLength" select="0"/>
        <xsl:variable name="vContent">
            <xsl:copy>
                <xsl:copy-of select="@*"/>
                <xsl:apply-templates select="node()[1]">
                    <xsl:with-param name="pPrecedingLength"
                                    select="$pPrecedingLength"/>
                </xsl:apply-templates>
            </xsl:copy>
        </xsl:variable>
        <xsl:variable name="vLength"
                      select="$pPrecedingLength + string-length($vContent)"/>
        <xsl:if test="$pMaxLength > $vLength and
                      (string-length($vContent) or not(node()))
                      or not($pPrecedingLength)">
            <xsl:copy-of select="$vContent"/>
            <xsl:apply-templates select="following-sibling::node()[1]">
                <xsl:with-param name="pPrecedingLength" select="$vLength"/>
            </xsl:apply-templates>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>

Output:

<html>
    <h1>hello world</h1>
    <p>Lets say these are 100 chars</p>
    <ul>
        <li>some bla bla, 40 chars</li>
    </ul>
</html>

For my uses I always wanted to strip tags because they could include all sorts of nastiness that would totally hose the display of the summary. They could also seriously skew the letter count, depending on the tags and whether they contain parameters.

I've used something like this many times.

require 'nokogiri'

html = %q{
<h1>hello world</h1>
<p>Lets say these are 100 chars</p>
<ul>
    <li>some bla bla, 40 chars</li>
</ul>
<p>some other text</p>
}

doc = Nokogiri::HTML(html)
puts doc.content.gsub(/\n/, ' ').squeeze(' ').strip[0 .. 150]

Which outputs

hello world Lets say these are 100 chars some bla bla, 40 chars some other text

I'll leave it to you to figure out how to ignore or subtract the text from the final <p> tag, but looking up that tag and grabbing its content and then stripping it from the end of the string shouldn't be too hard.

Using XPath is the most robust and flexible. Here's a sample app:

require 'rubygems'
require 'nokogiri'

html = <<End
<h1>hello world</h1>
<p>Lets say these are 100 chars.......................................................................</p>
<ul>
    <li>some bla bla, 40 chars</li>
</ul>
<p>some other text</p>
End

LIMIT = 150
summary = ""

doc = Nokogiri::HTML.parse(html)
doc.xpath('//text()').each do |node|
  text = node.text
  break if summary.length + text.length >= LIMIT
  summary << text
end

puts summary
puts summary.length

The XPath //text() simply selects all the text nodes in the document. If you wanted to be more specific about which elements you were interested in, you can.

继续阅读：markdown regex ruby

Get first few elements of a html fragment with xpath on ruby

Edit

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？

Edit

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生 新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？