HPricot css search: How do I select the parent/ancestor of a particular element using a string selector?
I'm using HPricot's css search to identify a table withi开发者_开发百科n a web page. Here's a sample html snippet I'm parsing:
<table height=61 width=700>
<tbody>
<tr>
<td><font size=3pt color = 'Blue'><b><A NAME=a1>Some header text</A></b></font></td></tr>
...
</tbody></table>
There are lots of tables in the page. I want to find the table which contains the A Name=a1
reference.
Right now, the way I'm doing it is
(page/"a[@name=a1]")[0].parent.parent.parent.parent.parent
I don't like this because
- It is ugly
- It is error prone (what if the folks who maintain the web page remove the tbody?)
Is there a way to tell hpricot to get me the table ancestor of the specified element?
Edit: Here's the full blown page I'm parsing: http://www.blonnet.com/businessline/scoboard/a.htm
The bits I'm interested in are the two tables, one with quarterly results and another with the annual results. Right now, the way I'm extracting those tables is by finding and and moving up from there.
Rohith is right. It is ugly and it is error prone (more than it needs to be). Again as he says it is much more clear with the intent to say "find the closest parent that is a table", and this could go for any child/parent relationship.
If it's "not possible" to do that with hpricot then just say so. But don't just say "it's hopeless to try to do that anyway what's the point". That's a bogus answer. It also doesn't help the next person who comes along (myself) looking for the answer to the same question but for different reasons, which is parsing many pages where differences are ASSUMED and not just feared.
To actually answer the question... I don't know, yet. And I don't have much hope of finding out with hpricot. The documentation is absolutely horridly nonexistent.
But here's a workaround that does about the same thing.
table = (page%"a[@name=a1]").parent
table = table.parent while table.name != "table"
Without seeing the whole page it's hard to give a definitive answer, but often the way you're going about it is the right answer. You have to find a decent landmark, then navigate from there, and if it involves backing up the chain then that's what you do.
You might be able to use XPATH to find the table then look inside it for the link, but that doesn't really improve things, it only changes them. Firebug, the Firefox plugin, makes it easy to get the XPATH to an element in the page, so you could find the table in question and have Firebug show you the path, or just copy it by right-clicking on the node in the xpath display, and past that into your lookup.
"It is ugly", well, maybe, but not all code is beautiful or elegant because not all problems lend themselves to beautiful and/or elegant solutions. Sometimes we have to be happy with "it works". As long as it works reliably and you know why then you're ahead of many other coders.
"... what if the folks who maintain the web page remove the tbody?", almost all parsing of HTML or XML suffers from the same concern because we're not in control of the source. You write your code as best as you can, comment the spots that are likely to fail if content changes, then cross your fingers and move on. Even if you were parsing tabular data from a TPS report you could run into the same problem.
The only thing I'd suggest doing differently, is to use the %
(AKA "at") instead of /
(AKA search). %
returns only the first occurrence so you can drop the [0]
index.
(page%"a[@name=a1]").parent.parent.parent.parent.parent
or
page%'//a[@name="a1"]/../../../../../..'
which uses the XPath engine to step back up the chain. That should be a little faster if speed is a consideration.
If you know that the target table is the only one with that width and height, you can use a more specific xpath:
page%'//table[@height=61 and @width=700]'
I recommend Nokogiri over Hpricot.
You can also use XPath from the top of the document down:
irb(main):039:0> print (doc/'//body/table[2]/tr/td[2]/table[2]').to_html[0..100]
<table height="61" width="700"><tbody>
<tr><td width="700" colspan="7" align="center"> <font size="3p=> nil
Basically the XPath pattern means:
Find the body tag, then the third table, then its row's third cell. In the cell locate the third table.
Note: Firefox automatically adds the <tbody>
tag to the source, even if it wasn't there in the HTML file received. That can really mess you up trying to use Firefox to view the source to develop your own XPaths.
The other table you are after is /html/body/table[2]/tbody/tr/td[2]/table[3]
according to Firefox so you have to strip the tbody
. Also you don't need to anchor at /html
.
精彩评论