How do I select sets of nodes with a single XPath query?
I'm trying to extract journey and price information from my favorite airline.
I have a search results page that looks like this:
MASwings search results http://img28.imagevenue.com/aAfkjfp01fo1i-2846/loc29/42467_dayview_oneway_122_29lo.jpg
EDIT: Image host might have blocked the hotlink. See the image on this page: http://img28.imagevenue.com/img.php?image=42467_dayview_oneway_122_29lo.jpg
Repro URL for booking query
I can select each row that represents a flight using this XPath selector:
//*[@class="servicecode "]/ancestor::tr[1]
But each flight row is not an independent journey; the flights are really grouped into legs, and these are what I want to select.
The row class alternates for each new leg: the rows of the first leg have class "datarow", and the rows of the next leg have "datarow alt". In Python I can group the nodes selected by the above expression using itertools.groupby, but if there is a way to achieve this purely in XPath, I would prefer it.
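For reference, the itertools.groupby approach mentioned above might look roughly like the sketch below. The sample markup and the flight codes are made up for illustration (the real page is not reproduced here), and group_legs is a hypothetical helper name.

```python
from itertools import groupby

from lxml import html

# Hypothetical stand-in for the results table: three legs, with the row
# class alternating between "datarow" and "datarow alt" for each new leg.
SAMPLE = """
<table>
  <tr class="datarow"><td class="servicecode ">MH3001</td></tr>
  <tr class="datarow"><td class="servicecode ">MH3002</td></tr>
  <tr class="datarow alt"><td class="servicecode ">MH3003</td></tr>
  <tr class="datarow"><td class="servicecode ">MH3004</td></tr>
</table>
"""


def group_legs(doc):
    """Group consecutive flight rows that share the same class attribute."""
    rows = doc.xpath('//*[@class="servicecode "]/ancestor::tr[1]')
    # groupby only merges adjacent items with equal keys, which is exactly
    # the "class alternates per leg" rule described in the question.
    return [list(group) for _, group in groupby(rows, key=lambda tr: tr.get("class"))]


doc = html.fromstring(SAMPLE)
legs = group_legs(doc)
leg_codes = [[tr.text_content().strip() for tr in leg] for leg in legs]
print(leg_codes)
```

This relies on lxml returning the selected rows in document order, so adjacent rows in the list are adjacent rows in the table.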
An extension to this question: my selector selects all rows, whether the flight is sold out or not. I can select the first flight of every bookable journey using this selector:
//*[contains(@class, "datarow")][.//input]
But if the leg has more than one flight, then I will have to look for following sibling with the same class using another XPath query.
Is there a single XPath query that will return me each bookable leg as a nodeset?
Note: I'm using the Python lxml library, in case that matters.
Regarding the first part of the question, selecting each row that represents a flight with //*[@class="servicecode "]/ancestor::tr[1] when the rows of the first leg have class "datarow":
Use:
//tr[@class='datarow'][.//*[@class='servicecode']]
Is there a single XPath query that will return me each bookable leg as a nodeset?
Yes:
(//tr[@class='datarow'])[1]//input
|
(//tr[@class='datarow'])[1]
//following-sibling::tr[@class='datarow altrow']
[count(preceding-sibling::tr[@class='datarow'])=1]
//input
This XPath expression selects all tr elements that represent each bookable leg (in this case 3 legs) of the first journey.
To get all legs of the second journey, substitute 1 in the above expression with 2.
To get all legs of the k-th journey, substitute 1 in the above expression with the actual value of k.
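Since the question mentions lxml, the substitution of k can be done with an XPath variable instead of editing the string by hand. The sketch below is only an illustration under assumptions: the sample markup is a guess at the structure this expression expects (a "datarow" row followed by "datarow altrow" rows), the input names are invented, every 1 in the expression is assumed to be replaced by k, and [1] is written as [position() = $k] so the variable is used explicitly as a position test.

```python
from lxml import html

# Hypothetical markup, not taken from the real page: each group starts with
# a "datarow" row and continues with "datarow altrow" rows.
SAMPLE = """
<table>
  <tr class="datarow"><td><input name="j1-leg1"/></td></tr>
  <tr class="datarow altrow"><td><input name="j1-leg2"/></td></tr>
  <tr class="datarow"><td><input name="j2-leg1"/></td></tr>
  <tr class="datarow altrow"><td><input name="j2-leg2"/></td></tr>
  <tr class="datarow altrow"><td>sold out</td></tr>
</table>
"""

# The expression from the answer, with each literal 1 replaced by $k.
KTH = (
    "(//tr[@class='datarow'])[position() = $k]//input"
    " | (//tr[@class='datarow'])[position() = $k]"
    "//following-sibling::tr[@class='datarow altrow']"
    "[count(preceding-sibling::tr[@class='datarow']) = $k]//input"
)


def inputs_for(doc, k):
    """Return the name attributes of the input elements selected for group k."""
    # lxml binds keyword arguments of xpath() to XPath variables ($k here).
    return [el.get("name") for el in doc.xpath(KTH, k=k)]


doc = html.fromstring(SAMPLE)
first = inputs_for(doc, 1)
second = inputs_for(doc, 2)
print(first, second)
```

Note that, as written, the expression ends in //input, so what comes back are the input elements inside the matching rows; the row without an input contributes nothing.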
This does what I want. But is there a more elegant solution?
//*[contains(@class, "columns")]//tr[contains(@class, "datarow")][1]
|
//*[contains(@class, "columns")]//tr[not(contains(@class, "altrow"))]
[preceding-sibling::tr[1]
[contains(@class, "altrow")]]
|
//*[contains(@class, "columns")]//tr[contains(@class,"altrow")]
[preceding-sibling::tr[1]
[not(contains(@class, "altrow"))]]
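The second part selects the first row of each set of consecutive rows whose class does not contain "altrow": a row matches only if the row immediately before it does contain "altrow".
The third part does the same for each set of consecutive rows whose class contains "altrow".
The first part selects the first row of the first set of rows whose class does not contain "altrow", because that row has no preceding "altrow" row and so is not selected by the second part.
Note that an XPath 1.0 expression returns one flat node set, so this yields the starting row of each leg, not one node set per leg.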
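To see what the three-part expression returns, and one possible way to expand each result into a full leg from Python, here is a minimal sketch. The sample markup and the F1..F5 labels are invented, leg_rows is a hypothetical helper, and it assumes all rows of a leg carry exactly the same class attribute.

```python
from lxml import html

# Hypothetical markup: three legs, the middle one marked with "altrow".
SAMPLE = """
<div class="columns">
  <table>
    <tr class="datarow"><td>F1</td></tr>
    <tr class="datarow"><td>F2</td></tr>
    <tr class="datarow altrow"><td>F3</td></tr>
    <tr class="datarow"><td>F4</td></tr>
    <tr class="datarow"><td>F5</td></tr>
  </table>
</div>
"""

# The three-part union from the post, joined into one string.
LEG_STARTS = (
    '//*[contains(@class, "columns")]//tr[contains(@class, "datarow")][1]'
    ' | //*[contains(@class, "columns")]//tr[not(contains(@class, "altrow"))]'
    '[preceding-sibling::tr[1][contains(@class, "altrow")]]'
    ' | //*[contains(@class, "columns")]//tr[contains(@class, "altrow")]'
    '[preceding-sibling::tr[1][not(contains(@class, "altrow"))]]'
)


def leg_rows(first_row):
    """Collect first_row plus the following sibling rows with the same class."""
    leg = [first_row]
    for sibling in first_row.itersiblings("tr"):
        if sibling.get("class") != first_row.get("class"):
            break  # the class changed, so the next leg starts here
        leg.append(sibling)
    return leg


doc = html.fromstring(SAMPLE)
starts = doc.xpath(LEG_STARTS)
legs = [leg_rows(tr) for tr in starts]
leg_labels = [[tr.text_content().strip() for tr in leg] for leg in legs]
print(leg_labels)
```

The XPath part finds where each leg begins; the grouping into per-leg lists still happens in Python, since a single XPath 1.0 query cannot return nested node sets.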