How can this xpath query (PHP) be more flexible?

2022-12-21 16:26

I'm parsing an XHTML document using PHP's SimpleXML. I need to query a series of ul's in the document for a node containing a specific value, then find that node's parent's direct previous sibling... code will help explain!

Given the following dummy xhtml:

<html>
<head></head>
<body>
...

<ul class="attr-list"> 
    <li>Active Life (active)</li> 
    <ul> 
        <li>Amateur Sports Teams (amateursportsteams)</li> 
        <li>Amusement Parks (amusementparks)</li> 
        <li>Fitness & Instruction (fitness)</li> 
        <ul> 
            <li>Dance Studios (dancestudio)</li> 
            <li>Gyms (gyms)</li> 
            <li>Martial Arts (martialarts)</li> 
            <li>Pilates (pilates)</li> 
            <li>Swimming Lessons/Schools (swimminglessons)</li>  
        </ul> 
        <li>Go Karts (gokarts)</li> 
        <li>Mini Golf (mini_golf)</li> 
        <li>Parks (parks)</li> 
        <ul> 
            <li>Dog Parks (dog_parks)</li> 
            <li>Skate Parks (skate_parks)</li> 
        </ul> 
        <li>Playgrounds (playgrounds)</li> 
        <li>Rafting/Kayaking (rafting)</li> 
        <li>Tennis (tennis)</li> 
        <li>Zoos (zoos)</li> 
    </ul> 
    <li>Arts & Entertainment (arts)</li> 
    <ul> 
        <li>Arcades (arcades)</li> 
        <li>Art Galleries (galleries)</li> 
        <li>Wineries (wineries)</li> 
    </ul> 
    <li>Automotive (auto)</li> 
    <ul> 
        <li>Auto Detailing (auto_detailing)</li> 
        <li>Auto Glass Services (autoglass)</li> 
        <li>Auto Parts & Supplies (autopartssupplies)</li> 
    </ul>
    <li>Nightlife (nightlife)</li>
    <ul>
        <li>Bars (bars)</li>
        <ul>
            <li>Dive Bars (divebars)</li>
        </ul>
    </ul>
</ul>

...
</body>
</html>

I need to be able to query the ul.attr-list for a child element, and discover its "root" category. I cannot change the xhtml to be formed differently.

So, if I have "galleries" as a category, I need to know that it is in the "arts" "root" category. Or, if I have "dog_parks", I need to know that it is in the "active" category. The following code gets the job done, but only on the assumption that there are at most two nested levels:

function get_root_category($shortCategoryName){

    $url = "http://www.yelp.com/developers/documentation/category_list";
    $result = file_get_contents($url);

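    $dom = new DOMDocument();
    // Parser options must be set before loading to have any effect.
    $dom->preserveWhiteSpace = false;
    @$dom->loadHTML($result); // @ hides the warnings caused by the page's invalid markup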

    $sxml = simplexml_import_dom($dom);

    $lvl1 = $sxml->xpath("//li[contains(., '".$shortCategoryName."')]/parent::ul/preceding-sibling::li");
    $lvl2 = $sxml->xpath("//li[contains(., '".$shortCategoryName."')]/parent::ul/preceding-sibling::li/parent::ul/preceding-sibling::li");

    if($lvl2){
        return array_pop($lvl2);
    } else {
        return array_pop($lvl1);
    }
}

There has to be a better way to write that XPath, so that only one query is needed and it is relatively bulletproof against multiple nested levels.

EDIT: Thanks to those who pointed out that this HTML is not valid. However, the structure of the page is set, and I cannot edit it; I can only use it as a resource, and have to make do with what it is.

"I need to query a series of ul's in the document for a node containing a specific value, then find that node's parent's direct previous sibling..."

That would be (here $v is the value you look for):

$p = "/html/body//ul[li[contains(text(), '$v')]]/preceding-sibling::li[1]";
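As a rough illustration, the expression can be run through SimpleXML like this. The markup is a trimmed, well-formed copy of the question's list, inlined here as an assumption so the sketch runs without fetching the real page, and $v is assumed to contain no single quotes:

```php
<?php
// Trimmed copy of the list from the question (assumption: inlined for an offline run).
$xml = <<<XML
<html><body>
<ul class="attr-list">
    <li>Active Life (active)</li>
    <ul>
        <li>Parks (parks)</li>
        <ul>
            <li>Dog Parks (dog_parks)</li>
            <li>Skate Parks (skate_parks)</li>
        </ul>
        <li>Zoos (zoos)</li>
    </ul>
</ul>
</body></html>
XML;

$doc = simplexml_load_string($xml);

$v = 'dog_parks'; // assumed to contain no single quotes
$p = "/html/body//ul[li[contains(text(), '$v')]]/preceding-sibling::li[1]";

// xpath() returns an array of matching elements (empty when nothing matches).
$hits = $doc->xpath($p);
$parent = $hits ? trim((string) $hits[0]) : null;

echo $parent, "\n"; // Parks (parks)
```

Note that this yields the direct parent category ("Parks"), which is what "that node's parent's direct previous sibling" asks for, not the top-level one ("Active Life").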

Make sure that you check that $v does not contain single quotes, since this would break the XPath expression.
When you want to look for whole words only, use:
[contains(concat(' ', text(), ' '), concat(' ', '$v', ' '))].
When you want to look case-insensitively, use (I abbreviated the full alphabet with …; $lv holds strtolower($v), computed beforehand because PHP does not interpolate function calls inside strings):
[contains(translate(text(), 'ABC…XYZ', 'abc…xyz'), '$lv')].
Note that predicates can be nested.
Note that the use of text() ensures only direct child text nodes are taken into account. When you use . instead, the whole "subtree" of the <li> is converted to string and you might get more results than you actually want.
Note that I restricted the // operator (a shortcut for the descendant axis) to a certain part of the tree - if you can restrict it further, by all means do so.
Letting your XPath start with // makes it much slower than it needs to be, since all nodes of the entire document are checked, even those that cannot under any circumstances produce a match.
As others have already noted, the HTML is invalid.
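The single-quote caveat above does not have to mean rejecting input. XPath 1.0 has no escape character inside string literals, so a common workaround is to switch to the other quote type, or to fall back to concat() when the value contains both. A minimal sketch, where xpath_literal() is a hypothetical helper name and not part of PHP or SimpleXML:

```php
<?php
// Hypothetical helper: turn an arbitrary PHP string into an XPath 1.0 string literal.
function xpath_literal(string $value): string
{
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";   // no single quote: wrap in single quotes
    }
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';   // no double quote: wrap in double quotes
    }
    // Both quote types present: split on single quotes and rejoin with concat(),
    // inserting each single quote as a double-quoted literal.
    $parts = explode("'", $value);
    return "concat('" . implode("', \"'\", '", $parts) . "')";
}

$v = "kids' activities"; // made-up value, only to exercise the quoting
$p = "/html/body//ul[li[contains(text(), " . xpath_literal($v) . ")]]/preceding-sibling::li[1]";

echo $p, "\n";
```

The helper returns the literal including its quotes, so the surrounding expression must not add its own.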

How about:

/html/body/ul/ul[count(descendant::li[contains(.,'dog_parks')]) > 0]/preceding-sibling::li[1]

This should work with deeply nested lists. It always gets the upper-most category.
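A sketch of how this could replace the two-query function from the question. It makes a few assumptions: find_root_category() is a hypothetical name, the markup is a trimmed copy of the question's list inlined for an offline run, the short name contains no quotes, and it is matched together with its parentheses (based on the "Name (short_name)" format of the list) so that "parks" does not also hit "dog_parks". The last step carries [1] so that only the nearest preceding li is returned rather than every preceding sibling:

```php
<?php
// Trimmed copy of the list from the question (assumption: inlined for an offline run).
$xml = <<<XML
<html><body>
<ul class="attr-list">
    <li>Active Life (active)</li>
    <ul>
        <li>Parks (parks)</li>
        <ul>
            <li>Dog Parks (dog_parks)</li>
        </ul>
        <li>Zoos (zoos)</li>
    </ul>
    <li>Arts and Entertainment (arts)</li>
    <ul>
        <li>Art Galleries (galleries)</li>
    </ul>
</ul>
</body></html>
XML;

// Hypothetical helper: returns the text of the top-level category, or null if none is found.
function find_root_category(SimpleXMLElement $doc, string $short): ?string
{
    // Match "(short_name)" including the parentheses; $short is assumed quote-free.
    $needle = '(' . $short . ')';
    $query  = "/html/body/ul/ul[descendant::li[contains(., '$needle')]]"
            . "/preceding-sibling::li[1]";

    $hits = $doc->xpath($query);
    return $hits ? trim((string) $hits[0]) : null;
}

$doc = simplexml_load_string($xml);

echo find_root_category($doc, 'dog_parks'), "\n"; // Active Life (active)
echo find_root_category($doc, 'galleries'), "\n"; // Arts and Entertainment (arts)
```

A top-level short name such as "active" does not sit inside any second-level ul, so this sketch returns null for it; that case would need separate handling.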

By the way: I don't think nesting ul's like this is valid.
