Can't find the correct XPath expression (to combine results)

2023-02-16 08:38

I'm trying to get a list of proverbs from Wikipedia.

I'm able to select:

the categories (e.g. "aanval", "aap")
the proverbs (e.g. "De aanhouder wint.")
the explanations (e.g. "Wie blijft proberen zijn doel te bereiken, heeft uiteindelijk succes. je moet volhouden.")

but I'm having a lot of difficulty joining them correctly. I would like to end up with an array like:

array(
  0 => array(
    'category' => 'aanval',
    'proverb' => 'De aanval is de beste verdediging.',
    'explanation' => array(
      0 => 'Je kunt in een strijd of ruzie beter zelf actie ondernemen dan afwachten.',
    )
  ),
  1 => array(
    'category' => 'aap',
    'proverb' => 'Al draagt een aap een gouden ring, het is en blijft een lelijk ding.',
    'explanation' => array(
      0 => 'Wie zich mooi aankleedt wordt daarmee zelf nog niet mooi.',
      1 => 'Of: Wie zich kleedt als iemand van aanzien wordt daarmee nog niet aanzienlijk.',
      2 => 'Of: Fraaie kleding en sieraden maken een lelijk mens niet mooi.'
    )
  ),
  2 => array(
    'category' => 'aap',
    'proverb' => 'Als apen hoger klimmen willen, ziet men gauw hun blote billen.',
    'explanation' => array(
      0 => 'Iemand die meer wil dan hij kan, maakt zich snel belachelijk.',
    )
  ),
);

Here's the code I'm using now:

if ($x = urlToXpath($url, true))
{
  $keywords = array();
  foreach ($x->query('/html/body/div[3]/div[3]/h2/span[@class="mw-headline"]') as $node)
  {
    $keywords[] = trim($node->nodeValue);
  }

  $data = array();
  foreach ($x->query('/html/body/div[3]/div[3]/dl/dd/dl') as $node)
  {
    $proverbs = array();
    foreach ($x->query('dd[@style="font-weight: bold"] | dd/b', $node) as $childNode)
    {
      $proverbs[] = trim($childNode->nodeValue);
    }
    $descriptions = array();
    foreach ($x->query('dd[position()>1]/small', $node) as $childNode)
    {
      $descriptions[] = trim(preg_replace('/^((Ook|Of):)/i', '', $childNode->nodeValue));
    }
    $data[] = array('proverbs' => $proverbs, 'descriptions' => $descriptions);
  }
}

To do this with XPath, you would probably need to select each H2, then use this solution to select all the proverb-containing nodes between it and the next H2. Then do the same thing on those nodes to find the descriptions.
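As a rough illustration of the "nodes between two H2s" idea, here is a minimal sketch that walks the siblings after each H2 and stops at the next one. The sample markup is a hypothetical, well-formed stand-in for the real page (loaded with loadXML so the example is self-contained), and the nesting dl/dd/dl/dd is assumed from the expressions in the question, so treat it as a starting point rather than a drop-in solution.

```php
<?php
// Hypothetical sample that mimics the assumed page structure.
$xml = <<<'XML'
<div>
  <h2><span class="mw-headline">aanval</span></h2>
  <dl><dd><dl>
    <dd><b>De aanval is de beste verdediging.</b></dd>
    <dd><small>Je kunt in een strijd of ruzie beter zelf actie ondernemen dan afwachten.</small></dd>
  </dl></dd></dl>
  <h2><span class="mw-headline">aap</span></h2>
  <dl><dd><dl>
    <dd><b>Als apen hoger klimmen willen, ziet men gauw hun blote billen.</b></dd>
    <dd><small>Iemand die meer wil dan hij kan, maakt zich snel belachelijk.</small></dd>
  </dl></dd></dl>
</div>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$x = new DOMXPath($doc);

$result = array();
foreach ($x->query('//h2[span/@class="mw-headline"]') as $h2) {
    $category = trim($x->evaluate('string(span[@class="mw-headline"])', $h2));

    // Walk forward through the siblings; the next H2 ends this category.
    for ($node = $h2->nextSibling; $node !== null; $node = $node->nextSibling) {
        if ($node->nodeType !== XML_ELEMENT_NODE) {
            continue; // skip whitespace text nodes
        }
        if ($node->nodeName === 'h2') {
            break;
        }
        if ($node->nodeName !== 'dl') {
            continue;
        }
        // Each inner <dl> is assumed to hold one proverb plus its explanations.
        foreach ($x->query('dd/dl', $node) as $group) {
            $explanations = array();
            foreach ($x->query('dd[position() > 1]/small', $group) as $small) {
                $explanations[] = trim($small->nodeValue);
            }
            $result[] = array(
                'category'    => $category,
                'proverb'     => trim($x->evaluate('string(dd[1]/b)', $group)),
                'explanation' => $explanations,
            );
        }
    }
}

print_r($result);
```

On a real page you would load the markup with loadHTML instead and adjust the paths to whatever the rendered HTML actually contains.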

You might find it easier to download the wikitext for the page (e.g. like this) and process that with a simple text parser over the lines in the text. Or if not that, you should at least use action=render to get a version without all the skin-related HTML.

This XPath expression selects the three wanted nodes (category, proverb, explanation) for the first proverb:

 /html/body/div[3]/div[3]/h2[1]/span[@class="mw-headline"]
|
 /html/body/div[3]/div[3]/h2[1]/following-sibling::dl[1]/dd/dl/dd[1]/b 
|  
 /html/body/div[3]/div[3]/h2[1]/following-sibling::dl[1]/dd/dl/dd[2]/small

The three wanted nodes for the second proverb are selected by this XPath expression (note that only the h2 index changes, from 1 to 2):

 /html/body/div[3]/div[3]/h2[2]/span[@class="mw-headline"]
|
 /html/body/div[3]/div[3]/h2[2]/following-sibling::dl[1]/dd/dl/dd[1]/b 
|  
 /html/body/div[3]/div[3]/h2[2]/following-sibling::dl[1]/dd/dl/dd[2]/small

...etc.

This gives you a good algorithm for filling your arrays: iterate the index K = 1, 2, 3, ... and stop at the first K for which the constructed XPath expression selects no nodes.
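The index loop could look roughly like the sketch below. It is only an illustration under a few assumptions: the sample markup is hypothetical and loaded with loadXML, `$base` stands in for `/html/body/div[3]/div[3]`, and, going slightly beyond the expressions above, it visits every inner dl under an H2 so that a category with several proverbs yields several entries.

```php
<?php
// Hypothetical sample that mimics the assumed page structure.
$xml = <<<'XML'
<div>
  <h2><span class="mw-headline">aanval</span></h2>
  <dl><dd>
    <dl>
      <dd><b>De aanval is de beste verdediging.</b></dd>
      <dd><small>Je kunt in een strijd of ruzie beter zelf actie ondernemen dan afwachten.</small></dd>
    </dl>
  </dd></dl>
  <h2><span class="mw-headline">aap</span></h2>
  <dl><dd>
    <dl>
      <dd><b>Al draagt een aap een gouden ring, het is en blijft een lelijk ding.</b></dd>
      <dd><small>Wie zich mooi aankleedt wordt daarmee zelf nog niet mooi.</small></dd>
      <dd><small>Of: Fraaie kleding en sieraden maken een lelijk mens niet mooi.</small></dd>
    </dl>
    <dl>
      <dd><b>Als apen hoger klimmen willen, ziet men gauw hun blote billen.</b></dd>
      <dd><small>Iemand die meer wil dan hij kan, maakt zich snel belachelijk.</small></dd>
    </dl>
  </dd></dl>
</div>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$x = new DOMXPath($doc);

$base = '/div'; // assumption: replace with /html/body/div[3]/div[3] on the real page
$data = array();

for ($k = 1; ; $k++) {
    // Build the expression for the K-th heading; no match means we are done.
    $heading = $x->query("$base/h2[$k]/span[@class='mw-headline']");
    if ($heading->length === 0) {
        break;
    }
    $category = trim($heading->item(0)->nodeValue);

    // Every inner <dl> below the K-th H2 is assumed to be one proverb group.
    foreach ($x->query("$base/h2[$k]/following-sibling::dl[1]/dd/dl") as $group) {
        $proverb = $x->query('dd[1]/b', $group);
        if ($proverb->length === 0) {
            continue; // group without a bold proverb line
        }
        $explanations = array();
        foreach ($x->query('dd[position() > 1]/small', $group) as $small) {
            $explanations[] = trim($small->nodeValue);
        }
        $data[] = array(
            'category'    => $category,
            'proverb'     => trim($proverb->item(0)->nodeValue),
            'explanation' => $explanations,
        );
    }
}

print_r($data);
```

The resulting `$data` has the same shape as the array in the question: one entry per proverb, each carrying its category and a list of explanations.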
