How can this xpath query (PHP) be more flexible?
I'm parsing an XHTML document using PHP's SimpleXML. I need to query a series of ul's in the document for a node containing a specific value, then find that node's parent's direct previous sibling... code will help explain!
Given the following dummy xhtml:
<html>
<head></head>
<body>
...
<ul class="attr-list">
<li>Active Life (active)</li>
<ul>
<li>Amateur Sports Teams (amateursportsteams)</li>
<li>Amusement Parks (amusementparks)</li>
<li>Fitness & Instruction (fitness)</li>
<ul>
<li>Dance Studios (dancestudio)</li>
<li>Gyms (gyms)</li>
<li>Martial Arts (martialarts)</li>
<li>Pilates (pilates)</li>
<li>Swimming Lessons/Schools (swimminglessons)</li>
</ul>
<li>Go Karts (gokarts)</li>
<li>Mini Golf (mini_golf)</li>
<li>Parks (parks)</li>
<ul>
<li>Dog Parks (dog_parks)</li>
<li>Skate Parks (skate_parks)</li>
</ul>
<li>Playgrounds (playgrounds)</li>
<li>Rafting/Kayaking (rafting)</li>
<li>Tennis (tennis)</li>
<li>Zoos (zoos)</li>
</ul>
<li>Arts & Entertainment (arts)</li>
<ul>
<li>Arcades (arcades)</li>
<li>Art Galleries (galleries)</li>
<li>Wineries (wineries)</li>
</ul>
<li>Automotive (auto)</li>
<ul>
<li>Auto Detailing (auto_detailing)</li>
<li>Auto Glass Services (autoglass)</li>
<li>Auto Parts & Supplies (autopartssupplies)</li>
</ul>
<li>Nightlife (nightlife)</li>
<ul>
<li>Bars (bars)</li>
<ul>
<li>Dive Bars (divebars)</li>
</ul>
</ul>
</ul>
...
</body>
</html>
I need to be able to query the ul.attr-list for a child element, and discover its "root" category. I cannot change the xhtml to be formed differently.
So, if I have "galleries" as a category, I need to know that it is in the "arts" "root" category. Or, if I have "dog_parks", I need to know that it is in the "active" category. The following code gets the job done, but only on the assumption that there are at most two nested levels:
function get_root_category($shortCategoryName){
    $url = "http://www.yelp.com/developers/documentation/category_list";
    $result = file_get_contents($url);
    $dom = new DOMDocument();
    // The page is not valid HTML, so parser warnings are suppressed.
    @$dom->loadHTML($result);
    $dom->preserveWhiteSpace = false;
    $sxml = simplexml_import_dom($dom);
    // All <li> siblings preceding the <ul> that holds the match; the last one is its parent category.
    $lvl1 = $sxml->xpath("//li[contains(., '".$shortCategoryName."')]/parent::ul/preceding-sibling::li");
    // The same step repeated one level up, for matches nested two levels deep.
    $lvl2 = $sxml->xpath("//li[contains(., '".$shortCategoryName."')]/parent::ul/preceding-sibling::li/parent::ul/preceding-sibling::li");
    if($lvl2){
        return array_pop($lvl2);
    } else {
        return array_pop($lvl1);
    }
}
There has to be a better way to write that XPath, so that only one query needs to be made and the result is reasonably robust against multiple nested levels.
EDIT: Thanks to those who pointed out that this HTML is not valid. However, the structure of the page is set, and I cannot edit it; I can only use it as a resource, and have to make do with what it is.
I need to query a series of ul's in the document for a node containing a specific value, then find that node's parent's direct previous sibling...
That would be (here $v is the value you are looking for):
$p = "/html/body//ul[li[contains(text(), '$v')]]/preceding-sibling::li[1]";
- Make sure that $v does not contain single quotes, since these would break the XPath expression.
- When you want to match whole words only, use: [contains(concat(' ', text(), ' '), concat(' ', '$v', ' '))]
- When you want to match case-insensitively, use (I abbreviated the full alphabet with …): [contains(translate(text(), 'ABC…XYZ', 'abc…xyz'), '{strtolower($v)}')]
- Note that predicates can be nested.
- Note that the use of text() ensures only direct child text nodes are taken into account. When you use . instead, the whole "subtree" of the <li> is converted to a string and you might get more results than you actually want.
- Note that I restricted the // operator (shorthand for a step along the descendant-or-self axis) to a certain part of the tree; if you can restrict it further, by all means do so. Letting your XPath start with // makes it much slower than it needs to be, since all nodes of the entire document are checked, even those that cannot under any circumstances produce a match.
- As others have already noted, the HTML is invalid.
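The quoting caveat in the first bullet can also be handled in code instead of by rejecting input. The following is a minimal sketch, assuming XPath 1.0 (which has no escape character inside string literals). The helper names xpath_literal and build_parent_xpath are hypothetical and not part of PHP:

```php
<?php
// Hypothetical helper: turn an arbitrary PHP string into an XPath 1.0 string literal.
function xpath_literal(string $value): string
{
    // No single quote inside: wrap in single quotes.
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";
    }
    // Single quotes but no double quotes: wrap in double quotes.
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';
    }
    // Both kinds present: split on single quotes and glue the pieces with concat().
    $parts = explode("'", $value);
    return "concat('" . implode("', \"'\", '", $parts) . "')";
}

// Hypothetical helper: the expression from the answer above, with $v quoted safely.
function build_parent_xpath(string $v): string
{
    return "/html/body//ul[li[contains(text(), " . xpath_literal($v) . ")]]"
         . "/preceding-sibling::li[1]";
}

echo build_parent_xpath("dog_parks"), "\n";
```

The resulting string can be passed to SimpleXMLElement::xpath() or DOMXPath::query() in place of the hand-interpolated version.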
How about:
/html/body/ul/ul[count(descendant::li[contains(.,'dog_parks')]) > 0]/preceding-sibling::li[1]
This should work with deeply nested lists. It always gets the upper-most category.
By the way: I don't think nesting ul's like this is valid.
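To see how a single query copes with deeper nesting, here is a minimal sketch that runs against a trimmed copy of the sample markup from the question. It assumes the real page keeps the same shape and that the list carries class="attr-list". It also matches the short name together with its parentheses, so that "parks" does not match "dog_parks". The function signature and the inline $html are illustrative assumptions, not the asker's code:

```php
<?php
// Trimmed copy of the sample markup (assumption: the live page has the same structure).
$html = <<<HTML
<html><body>
<ul class="attr-list">
<li>Active Life (active)</li>
<ul>
<li>Fitness &amp; Instruction (fitness)</li>
<ul>
<li>Gyms (gyms)</li>
</ul>
<li>Parks (parks)</li>
<ul>
<li>Dog Parks (dog_parks)</li>
</ul>
</ul>
<li>Arts &amp; Entertainment (arts)</li>
<ul>
<li>Art Galleries (galleries)</li>
</ul>
</ul>
</body></html>
HTML;

// Returns the text of the top-level <li> above $shortName, or null if nothing matches.
function get_root_category(DOMXPath $xpath, string $shortName): ?string
{
    // Assumption for this sketch: short names never contain quotes.
    if (strpbrk($shortName, "'\"") !== false) {
        throw new InvalidArgumentException('Quotes are not supported in this sketch.');
    }
    $needle = "($shortName)";
    // Pick the second-level <ul> that contains the match at any depth,
    // then step back to the nearest preceding top-level <li>.
    $query = "//ul[@class='attr-list']/ul[descendant::li[contains(text(), '$needle')]]"
           . "/preceding-sibling::li[1]";
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$dom = new DOMDocument();
libxml_use_internal_errors(true); // the markup is invalid, so collect warnings quietly
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

echo get_root_category($xpath, 'dog_parks'), "\n";
echo get_root_category($xpath, 'galleries'), "\n";
```

A name that is itself a top-level category (for example "active") yields null here, because it is not inside any nested ul; that case would need its own check.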