Nokogiri and xpath parsing an HTML table

2023-02-17 00:36

I can set up parsing and connect to a site, but when I run the script it returns an empty NodeSet:

require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'ap'

time = Time.new

url = <<-EOS
'http://www.events.psu.edu/cgi-bin/cal/webevent.cgi?cmd=listday&y=%d&m=%d&d=%d&cat=&sib=1&sort=m,e,t&ws=0&cf=list&set=1&swe=1&sa=1&de=1&tf=0&sb=1&stz=Default&cal=cal299' % [time.year, time.month, time.day]
EOS

page = Nokogiri::HTML(url)

rows =  page.xpath('/html/body/p/table/tbody/tr/td[3]/p/table/tbody/tr[2]')
details = rows.collect do |row|
detail = {}
[
 [:time, 'td[3]/p/text()'],
 [:name, 'td[4]/div/a/b/font/text()'],
 [:location, 'td[4]/div[2]/text()'],
 [:details, 'td[4]/div[4]/text()'],
].collect do |name, xpath|

detail[name] = row.at_xpath(xpath).to_s.strip
end
detail
end
ap details

The returned value is "[]".

This is the HTML of the table found under /html/body/p/table/tbody/tr/td[3]/p:

<TABLE BORDER=0 CELLPADDING=3 WIDTH="100%">

<!--Begin Event-->
<TR>
  <TD WIDTH="2%">
    <P></P>
  </TD>
  <TD WIDTH="10%">
    <P></P>
  </TD>
  <TD WIDTH="19%">

    <P></P>
  </TD>
  <TD WIDTH="60%">
    <P></P>
  </TD>
</TR>
<TR>
<!--Icon Section-->
  <TD CLASS="listeventbg" VALIGN=top WIDTH="2%">
    <P CLASS="listeventicon">&nbsp;</P>

  </TD>
<!--Date Section-->
  <TD CLASS="listeventbg" VALIGN=top WIDTH="10%">
    <P CLASS="listeventdate">Mar 14</P>
  </TD>
<!--Time Section-->
  <TD CLASS="listeventbg" VALIGN=top WIDTH="19%">
    <P CLASS="listeventtime">8:30 a.m. - 4:30 p.m.<BR>
</P>

  </TD>
<!--Main Event Section-->
  <TD CLASS="listeventbg" VALIGN=top WIDTH="60%">
<div class=listeventtitlelarge><A HREF="http://www.pennstatehershey.org/web/diabetesresearch/home">
<B><font color="#0000CC">2011 Diabetes and Obesity Research Spring Summit</FONT></B></A>
</div>
<div class=listeventtitle><B>Calendar:</B> HHD Seminars<BR>
<B>Posted by:</B> <A HREF="mailto:luk10%40psu.edu">Lauren Kipp</A><BR><B>Location:</B> The Nittany Lion Inn<BR>

</div>
<DIV CLASS="listeventspacer"> </DIV>
<DIV CLASS="listeventdetails">
<B>Details:</B><BR>Registration and Abstract Deadline: February 15, 2011<BR>        <BR>Registration: Please follow the link for more details and access to on line registration. Space is limited, so please register early to ensure your seat at the conference.<BR><BR>The Keynote Speaker for this year’s event is <b>Dr. Robert Sherwin, from the Yale School of Medicine.</b>  Dr. Sherwin is known for his research in the effect of insulin on brain function and immune mechanisms leading to type 1 diabetes.  The topic of his presentation is <i>Pathophysiological Mechanisms in Diabetes, from Laboratory to Bedside.</i><BR><BR>A welcome to the University Park campus will be offered by <b>Eugene Marsh, MD</b>, Senior Associate Dean for the Penn State College of Medicine Regional Medical Campus and Associate Director of the Penn State Hershey Medical Group in State College<BR><BR>Abstract Submission<BR>Please follow the link for formatting details and to register your intent to submit an abstract using the on-line form.<BR>    <BR>You will receive a confirmation immediately upon submission of your on-line form. Subsequently, the final formatted abstract must be sent directly to Continuing Ed by email attachment (see website instructions). Within 48 hours of sending your abstract in final format, you will receive an email confirmation from ContinuingEd@hmc.psu.edu indicating that both your form & the abstract attachment have been received.<BR><BR><i>All abstracts will be considered for poster presentations. A subset of these abstracts will be selected and invited for brief oral presentations during the “Poster Headlines” plenary sessions. To be considered for an oral presentation, please be sure to meet the submission deadline for submission of your final abstract. Prizes will be awarded for the top three posters from by post-doc/fellow/student presenters.</i>

</div>
</TD>
<!--EndEvent-->
......Followed by more of the same format

I am trying to get the name of the event, the time, location and the description of the event.

This is a simplified version of how I'd go about it.

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'ap'

time = Time.new

url = 'http://www.events.psu.edu/cgi-bin/cal/webevent.cgi?cmd=listday&y=%d&m=%d&d=%d&cat=&sib=1&sort=m,e,t&ws=0&cf=list&set=1&swe=1&sa=1&de=1&tf=0&sb=1&stz=Default&cal=cal299' % [time.year, time.month, time.day]

# open-uri provides URI.open; a bare open(url) no longer fetches URLs on Ruby 3.0+
page = Nokogiri::HTML(URI.open(url))

details = page.search('//tr/td[@class="listeventbg"]/..').map do |row|
  time     = row.at( 'p.listeventtime'         ).text.strip rescue ''
  name     = row.at( 'div.listeventtitlelarge' ).text.strip rescue ''
  location = row.at( 'div.listeventtitle'      ).text.strip rescue ''
  details  = row.at( 'div.listeventdetails'    ).text.strip rescue ''

  {
    :time     => time,
    :name     => name,
    :location => location,
    :details  => details
  }
end

ap details

Rather than rely on long XPath accessors, often it's easier to break down the search. This loops over the rows, then, for each row, does a simple lookup for the cells.

Normally I wouldn't use rescue '' but for quick and dirty it's OK. For production I'd set up real exception handling.

Your sample code required Mechanize, but didn't use it, so I removed it for this example. It didn't include a way to have Nokogiri retrieve the HTML, so I added Open-URI.

Nokogiri supports both CSS and XPath accessors. A lot of times CSS will result in a simpler search. XPath has more power, but that can come at the price of complexity. //tr/td[@class="listeventbg"]/.. finds the cells with that class, then steps back up to their parent rows.

You can use XPath instead of the CSS accessor like so:

//div[@class='listeventtitlelarge']

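but, remember, this compares the whole class attribute, so an element carrying a second class (class="listeventtitlelarge foo") will not match. If you switch to contains(@class, 'listeventtitlelarge') you get a substring match instead, so a class such as listeventtitlelargefoo will also be caught. In either case, you can tighten the test with XPath string functions or just avoid class names that are too similar. Or, you could also go with "XPath CSS Class Matching" from the Pivotal guys.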

It looks like you are parsing through the structure instead of using the classes that are given in the document. I would use the CSS classes the document creator put in, like this:

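require 'open-uri'

# Fetch the page first; Nokogiri::HTML(url) would parse the URL string itself
page = Nokogiri::HTML(URI.open(url))
eventdate = page.at_css("p.listeventdate").content
eventtime = page.at_css("p.listeventtime").content
details =   page.at_css("div.listeventdetails").content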

If you are doing this on a larger document, where multiple results will be returned, then use css and iterate through the results instead of at_css, which returns only the first element matching the tag and class.

It looks like everything you want has a selector that makes more sense than the direct path. It also makes it more resilient to change because, if they change the structure and keep the same classes, then your parsing still works.
