Can't find the correct XPath expression (to combine results)
I'm trying to get a list of proverbs from wikipedia.
I'm able to select:
- the categories (e.g. "aanval", "aap")
- the proverbs (e.g. "De aanhouder wint.")
- the explanations (e.g. "Wie blijft proberen zijn doel te bereiken, heeft uiteindelijk succes. je moet volhouden.")
but I'm having trouble combining them correctly. I would like to end up with an array like:
array(
    0 => array(
        'category' => 'aanval',
        'proverb' => 'De aanval is de beste verdediging.',
        'explanation' => array(
            0 => 'Je kunt in een strijd of ruzie beter zelf actie ondernemen dan afwachten.',
        )
    ),
    1 => array(
        'category' => 'aap',
        'proverb' => 'Al draagt een aap een gouden ring, het is en blijft een lelijk ding.',
        'explanation' => array(
            0 => 'Wie zich mooi aankleedt wordt daarmee zelf nog niet mooi.',
            1 => 'Of: Wie zich kleedt als iemand van aanzien wordt daarmee nog niet aanzienlijk.',
            2 => 'Of: Fraaie kleding en sieraden maken een lelijk mens niet mooi.'
        )
    ),
    2 => array(
        'category' => 'aap',
        'proverb' => 'Als apen hoger klimmen willen, ziet men gauw hun blote billen.',
        'explanation' => array(
            0 => 'Iemand die meer wil dan hij kan, maakt zich snel belachelijk.',
        )
    ),
);
Here's the code I'm using now:
if ($x = urlToXpath($url, true))
{
    // Category names: one headline span per H2.
    $keywords = array();
    foreach ($x->query('/html/body/div[3]/div[3]/h2/span[@class="mw-headline"]') as $node)
    {
        $keywords[] = trim($node->nodeValue);
    }

    // One nested DL per proverb block.
    $data = array();
    foreach ($x->query('/html/body/div[3]/div[3]/dl/dd/dl') as $node)
    {
        $proverbs = array();
        foreach ($x->query('dd[@style="font-weight: bold"] | dd/b', $node) as $childNode)
        {
            $proverbs[] = trim($childNode->nodeValue);
        }

        $descriptions = array();
        foreach ($x->query('dd[position()>1]/small', $node) as $childNode)
        {
            $descriptions[] = trim(preg_replace('/^((Ook|Of):)/i', '', $childNode->nodeValue));
        }

        $data[] = array('proverbs' => $proverbs, 'descriptions' => $descriptions);
    }
}
To do this with XPath, you would probably need to select each H2, then use this solution to select all the proverb-containing nodes between it and the next H2. Then do the same thing on those nodes to find the descriptions.
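A minimal sketch of the per-heading approach, assuming each h2 and the dl blocks that follow it are siblings under the same parent, as the paths in the question suggest. The sample markup is invented for illustration, and the sibling-count predicate is just one common way to select the nodes between two headings; it is not necessarily the linked solution.

```php
<?php
// Hypothetical sample markup, simplified from the structure the question's paths imply.
$html = <<<HTML
<div>
<h2><span class="mw-headline">aanval</span></h2>
<dl><dd><dl>
<dd><b>De aanval is de beste verdediging.</b></dd>
<dd><small>Je kunt in een strijd of ruzie beter zelf actie ondernemen dan afwachten.</small></dd>
</dl></dd></dl>
<h2><span class="mw-headline">aap</span></h2>
<dl><dd><dl>
<dd><b>Al draagt een aap een gouden ring, het is en blijft een lelijk ding.</b></dd>
<dd><small>Wie zich mooi aankleedt wordt daarmee zelf nog niet mooi.</small></dd>
<dd><small>Of: Wie zich kleedt als iemand van aanzien wordt daarmee nog niet aanzienlijk.</small></dd>
</dl></dd></dl>
<dl><dd><dl>
<dd><b>Als apen hoger klimmen willen, ziet men gauw hun blote billen.</b></dd>
<dd><small>Iemand die meer wil dan hij kan, maakt zich snel belachelijk.</small></dd>
</dl></dd></dl>
</div>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$data = array();
foreach ($xpath->query('//h2[span[@class="mw-headline"]]') as $h2) {
    $category = trim($xpath->evaluate('string(span[@class="mw-headline"])', $h2));

    // Position of this h2 among its h2 siblings (1-based).
    $n = (int) $xpath->evaluate('count(preceding-sibling::h2)', $h2) + 1;

    // A following dl belongs to this heading when exactly $n h2 siblings precede it.
    $blocks = $xpath->query(
        sprintf('following-sibling::dl[count(preceding-sibling::h2) = %d]/dd/dl', $n),
        $h2
    );

    foreach ($blocks as $block) {
        $explanation = array();
        foreach ($xpath->query('dd[position() > 1]/small', $block) as $small) {
            $explanation[] = trim($small->nodeValue);
        }
        $data[] = array(
            'category'    => $category,
            'proverb'     => trim($xpath->evaluate('string(dd[1]/b)', $block)),
            'explanation' => $explanation,
        );
    }
}

print_r($data);
```

Each entry carries its category because the proverb blocks are queried relative to the heading node rather than collected in a separate pass.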
You might find it easier to download the wikitext for the page (e.g. like this) and process it with a simple line-by-line text parser. Failing that, you should at least use action=render to get a version without all the skin-related HTML.
This XPath expression selects the three wanted nodes for the first proverb:
/html/body/div[3]/div[3]/h2[1]/span[@class="mw-headline"]
|
/html/body/div[3]/div[3]/h2[1]/following-sibling::dl[1]/dd/dl/dd[1]/b
|
/html/body/div[3]/div[3]/h2[1]/following-sibling::dl[1]/dd/dl/dd[2]/small
The three wanted nodes for the second proverb are selected by this XPath expression (note that only the index is incremented, from 1 to 2):
/html/body/div[3]/div[3]/h2[2]/span[@class="mw-headline"]
|
/html/body/div[3]/div[3]/h2[2]/following-sibling::dl[1]/dd/dl/dd[1]/b
|
/html/body/div[3]/div[3]/h2[2]/following-sibling::dl[1]/dd/dl/dd[2]/small
...etc.
This gives you a straightforward algorithm for filling your arrays: iterate the index over 1, 2, 3, ... until, for some index K, the constructed XPath expression selects no nodes. At that point you are finished.
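The index loop described above could be sketched roughly as follows. The sample markup and the `//div[@id="content"]` base path are stand-ins invented for this example (the real page uses `/html/body/div[3]/div[3]`). Note that, taken literally, the expression only looks at the first dl after each heading and only at `dd[2]`, so headings with several proverbs or proverbs with several explanations would need a wider expression.

```php
<?php
// Hypothetical, simplified markup for illustration only.
$html = '<div id="content">'
    . '<h2><span class="mw-headline">aanval</span></h2>'
    . '<dl><dd><dl><dd><b>De aanval is de beste verdediging.</b></dd>'
    . '<dd><small>Je kunt in een strijd of ruzie beter zelf actie ondernemen dan afwachten.</small></dd></dl></dd></dl>'
    . '<h2><span class="mw-headline">aap</span></h2>'
    . '<dl><dd><dl><dd><b>Als apen hoger klimmen willen, ziet men gauw hun blote billen.</b></dd>'
    . '<dd><small>Iemand die meer wil dan hij kan, maakt zich snel belachelijk.</small></dd></dl></dd></dl>'
    . '</div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$base = '//div[@id="content"]'; // assumption: stands in for /html/body/div[3]/div[3]
$rows = array();

for ($k = 1; ; $k++) {
    // Build the union expression for the k-th heading.
    $h2 = sprintf('%s/h2[%d]', $base, $k);
    $expr = $h2 . '/span[@class="mw-headline"]'
          . ' | ' . $h2 . '/following-sibling::dl[1]/dd/dl/dd[1]/b'
          . ' | ' . $h2 . '/following-sibling::dl[1]/dd/dl/dd[2]/small';

    $nodes = $xpath->query($expr);
    if ($nodes->length === 0) {
        break; // no k-th heading: finished
    }

    // Tell the selected nodes apart by element name (span, b, small).
    $found = array();
    foreach ($nodes as $node) {
        $found[$node->nodeName] = trim($node->nodeValue);
    }

    $rows[] = array(
        'category'    => isset($found['span']) ? $found['span'] : '',
        'proverb'     => isset($found['b']) ? $found['b'] : '',
        'explanation' => isset($found['small']) ? array($found['small']) : array(),
    );
}

print_r($rows);
```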