开发者

Ruby Nokogiri Parsing HTML table

I am using mechanize/nokogiri and need to parse out the following HTML string. can anyone help me with the xpath syntax to do this or any other methods that would work?

<table>
  <tr class="darkRow">
    <td>
      <span>
        <a href="?x=mSOWNEBYee31H0eV-V6JA0ZejXANJXLsttVxillWOFoykMg5U65P4x7FtTbsosKRbbBPuYvV8nPhET7b5sFeON4a开发者_如何学编程WpbD10Dq">
            <span>4242YP</span>
        </a>
      </span>
    </td>
    <td>
      <span>Subject of Meeting</span>
    </td>
    <td>
      <span>
        <span>01:00 PM</span> 
        <span>Nov 11 2009</span> 
        <span>America/New_York</span>
      </span>
    </td>
    <td>
      <span>30</span>
    </td>
    <td>
      <span>
        <span>example@email.com</span>
      </span>
    </td>
    <td>
        <span>39243368</span>
    </td>
  </tr>
  .
  .
  .
  <more table rows with the same format>
</table>

I want this as the output

"4242YP","Subject of Meeting","01:00 PM Nov 11 2009 America/New_York","30","example@email.com", "39243368"
.
.
.
<however many rows exist in the html table>


something like this?

items=doc.xpath('//tr').map {|row| row.xpath('.//span/text()').select{|item| item.text.match(/\w+/)}.map {|item| item.text} }

returns: => [["4242YP", "Subject of Meeting", "01:00 PM", "Nov 11 2009", "America/New_York", "30", "example@email.com", "39243368"], ["abcdefg"]]

Select includes only spans that start with word characters (e.g. excluding the whitespace that some of your spans have). You may need to refine the "select" filter for your specific case.

I added a minimalist row that contained a span containing abcdefg, so that you can see the nested array.


Here's part of the XSL to transform your input, if you have an XSL transformer:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>

<xsl:template match="/">
   <xsl:apply-templates select="//tr"/>
</xsl:template>

<xsl:template match="tr">
   "<xsl:value-of select="td/span/a/span"/>","<xsl:value-of select="td[position()=2]/span"/>","<xsl:value-of select="td[position()=3]/span/span[position()=1]"/>"
</xsl:template>

</xsl:stylesheet>

Output produced looks like this:

"4242YP","Subject of Meeting","01:00 PM"
"4242YP","Subject of Meeting","01:00 PM"

(I duplicated your first table row).

The XSL select bits give you a good idea of what XPATH input you'd need to get the rest.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜