
How do I select sets of nodes with a single XPath query?

I'm trying to extract journey and price information from my favorite airline.

I have a search results page that looks like this:

MASwings search results http://img28.imagevenue.com/aAfkjfp01fo1i-2846/loc29/42467_dayview_oneway_122_29lo.jpg

EDIT: Image host might have blocked the hotlink. See the image on this page: http://img28.imagevenue.com/img.php?image=42467_dayview_oneway_122_29lo.jpg

Repro URL for booking query

I can select each row that represents a flight using this XPath selector:

//*[@class="servicecode "]/ancestor::tr[1]

But each flight row is not an independent journey; the flights are really grouped into legs, and these are what I want to select.

The row class alternates for each new leg: the rows of the first leg have class "datarow", and the rows of the next leg have "datarow alt". In Python I can group the nodes selected by the above expression using itertools.groupby, but if there is a way to achieve this purely in XPath, I would prefer it.
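For reference, the itertools.groupby approach mentioned above might look roughly like the sketch below. It uses the standard library's xml.etree.ElementTree rather than lxml so that it stays self-contained. The SAMPLE markup and its flight codes are made up for illustration, and treating a leg as bookable when any of its rows contains an input element is an assumption, not something the real page guarantees.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Hypothetical, heavily simplified stand-in for the results table.
# The real page's markup and flight codes will differ.
SAMPLE = """
<table>
  <tr class="datarow"><td class="servicecode">MH3001</td><td><input type="radio"/></td></tr>
  <tr class="datarow"><td class="servicecode">MH3002</td><td/></tr>
  <tr class="datarow alt"><td class="servicecode">MH3003</td><td/></tr>
  <tr class="datarow"><td class="servicecode">MH3004</td><td><input type="radio"/></td></tr>
</table>
"""


def group_legs(table):
    """Split the table's rows into legs: runs of consecutive rows sharing a class."""
    rows = table.findall(".//tr")
    # groupby starts a new group every time the key changes, so an
    # alternating class attribute yields one group per leg.
    return [list(leg) for _, leg in groupby(rows, key=lambda tr: tr.get("class"))]


def is_bookable(leg):
    """Assumption: a leg is bookable if any of its rows holds an <input>."""
    return any(tr.find(".//input") is not None for tr in leg)


legs = group_legs(ET.fromstring(SAMPLE))
bookable_legs = [leg for leg in legs if is_bookable(leg)]
```

With lxml the same grouping should work on the list returned by xpath(), since groupby only needs the rows in document order and their class attribute.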

An extension to this question: my selector selects all rows, whether the flight is sold out or not. I can select the first flight of every bookable journey using this selector:

//*[contains(@class, "datarow")][.//input]

But if the leg has more than one flight, then I will have to look for following sibling with the same class using another XPath query.

Is there a single XPath query that will return me each bookable leg as a nodeset?

Note: I'm using the Python lxml library, in case that matters.


I can select each row that represents a flight using this XPath selector:

     //*[@class="servicecode "]/ancestor::tr[1] 

But each flight row is not an independent journey; the flights are really grouped into legs, and these are what I want to select.

The row class alternates for each new leg: the rows of the first leg have class "datarow",

Use:

//tr[@class='datarow'][.//*[@class='servicecode']]

An extension to this question: my selector selects all rows, whether the flight is sold out or not. I can select the first flight of every bookable journey using this selector:

//*[contains(@class, "datarow")][.//input]

But if the leg has more than one flight, then I will have to look for following sibling with the same class using another XPath query.

Is there a single XPath query that will return me each bookable leg as a nodeset?

Yes:

  (//tr[@class='datarow'])[1]//input 
| 
  (//tr[@class='datarow'])[1]
         //following-sibling::tr[@class='datarow altrow']
                   [count(preceding-sibling::tr[@class='datarow'])=1]
                         //input

This XPath expression selects the input elements inside the tr rows that represent each bookable leg (in this case 3 legs) of the first journey.

To get all legs of the second journey, substitute 1 in the above expression with 2.

To get all legs of the k-th journey, substitute 1 in the above expression with the actual value of k.
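The substitution can be done by building the expression string for a given k. This is only a sketch: the helper name legs_of_journey_xpath is hypothetical, and it assumes that all three occurrences of 1 (both position predicates and the count comparison) are meant to be replaced.

```python
def legs_of_journey_xpath(k: int) -> str:
    """Return the expression above with every 1 replaced by k (hypothetical helper)."""
    if not isinstance(k, int) or k < 1:
        raise ValueError("k must be a positive integer")
    return (
        f"(//tr[@class='datarow'])[{k}]//input"
        " | "
        f"(//tr[@class='datarow'])[{k}]"
        "//following-sibling::tr[@class='datarow altrow']"
        f"[count(preceding-sibling::tr[@class='datarow'])={k}]"
        "//input"
    )


second_journey = legs_of_journey_xpath(2)
```

The resulting string can then be passed to the xpath() method of an lxml tree or element.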


This does what I want. But is there a more elegant solution?

//*[contains(@class, "columns")]//tr[contains(@class, "datarow")][1]
|
//*[contains(@class, "columns")]//tr[not(contains(@class, "altrow"))]
       [preceding-sibling::tr[1]
           [contains(@class, "altrow")]]
|
//*[contains(@class, "columns")]//tr[contains(@class,"altrow")]
       [preceding-sibling::tr[1]
           [not(contains(@class, "altrow"))]]

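The second part selects the first row of each run of consecutive rows whose class does not contain "altrow": a row matches when the tr immediately before it does contain "altrow".

The third part does the same for runs whose class contains "altrow": a row matches when the tr immediately before it does not contain "altrow".

The first part selects the first row whose class contains "datarow". That row starts the first run, and the second part cannot select it because no "altrow" row precedes it.

The union is a single flat node-set holding the first row of each leg; XPath 1.0 cannot return the legs as separate node-sets.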
