Matching text within HTML with PHP's Regexp functions [duplicate]

This question already has answers here: Closed 11 years ago.

Possible Duplicates:

Preg match text in php between html tags

RegEx match open tags except XHTML self-contained tags

I have a large amount of text formatted in the following way:

    <P><B>1- TITLE</B>
    <P>
    <DL><DD>&nbsp;&nbsp;&nbsp; Text text text text text
    text text
    </DL><P>
    <P><B>2 - Title 2</B>
    <P>
    <DL><DD>&nbsp;&nbsp;&nbsp; Text text text text text
    text text Text text text text text
    text text Text text text text text
    text text
    <br><I>Additional irrelevant information</I>
    </DL><P>

I'm trying to use PHP's Regexp functions to retrieve the Title-Text value pairs while stripping out the extra &nbsp; characters as well as the irrelevant info that follows some of the text blocks. Preferably I'd like to:

Grab everything between <P><B> and </B> as the title

Grab all the text between <DL><DD>&nbsp;&nbsp;&nbsp; and the next HTML tag (<) as the text, and somehow keep the two associated together for further processing.

Any idea how to do this with PHP's Regexp functions?


As the comments on your question suggest, questions along these lines are asked frequently on Stack Overflow, and the right answer is generally "Don't try to parse HTML with regular expressions". Beyond making that point, I think it's useful for an answer to show what the suggested approach, parsing the HTML into a DOM and querying it, looks like in practice. For the case in your question, one could do:

<?php

$html = <<<EOF
    <P><B>1- TITLE</B>
    <P>
    <DL><DD>&nbsp;&nbsp;&nbsp; Text text text text text
text text
    </DL><P>
    <P><B>2 - Title 2</B>
    <P>
    <DL><DD>&nbsp;&nbsp;&nbsp; Text text text text text
text text Text text text text text
text text Text text text text text
text text
    <br><I>Additional irrelevant information</I>
    </DL><P>
EOF;

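// Parse the fragment; the HTML parser tolerates the unclosed tags.
$d = new DOMDocument;
$d->loadHTML($html);

$xp = new DOMXPath($d);

// Every title is a <b> that is a direct child of a <p>.
$matches = $xp->query("//p/b", $d);
foreach ($matches as $dn) {
    echo "Title is: " . $dn->nodeValue . "\n";
    // Walk from the title's <p> past the intervening sibling to the
    // content that follows it, then descend two levels to reach the text.
    $dl = $dn->parentNode->nextSibling->nextSibling->firstChild;
    $dd = $dl->firstChild;
    echo "Content is: " . $dd->nodeValue . "\n";
}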
?>

Depending on how robust you need this to be, you would probably want to check that the nextSiblings and children are tags with the name you expect, but this shows the idea anyway.
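As a rough sketch of that more defensive version, the following selects the content by element name instead of by sibling position, keeps each title associated with its text in an array, and strips the &nbsp; characters. The function names extractPairs and cleanText are made up for this illustration, and the code assumes every title is followed by its own <DD> block; a title with no content block would pick up the next title's text.

```php
<?php
// Hypothetical helper (not part of the original answer): returns [title => text].
function extractPairs(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally rather than printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $pairs = [];
    foreach ($xpath->query('//p/b') as $titleNode) {
        // First <dd> after the title in document order, wherever the parser put it.
        $dd = $xpath->query('following::dd[1]', $titleNode)->item(0);
        if ($dd === null) {
            continue; // no content block after this title
        }
        // Keep only the text that precedes the first nested tag (<br>, <i>, ...).
        $text = '';
        foreach ($dd->childNodes as $child) {
            if ($child->nodeType !== XML_TEXT_NODE) {
                break;
            }
            $text .= $child->nodeValue;
        }
        $pairs[cleanText($titleNode->textContent)] = cleanText($text);
    }
    return $pairs;
}

// Hypothetical helper: turn &nbsp; (U+00A0) into spaces and collapse whitespace.
function cleanText(string $s): string
{
    $s = str_replace("\u{00A0}", ' ', $s);
    return trim(preg_replace('/\s+/u', ' ', $s));
}

$html = <<<'EOF'
    <P><B>1- TITLE</B>
    <P>
    <DL><DD>&nbsp;&nbsp;&nbsp; Text text text text text
text text
    </DL><P>
    <P><B>2 - Title 2</B>
    <P>
    <DL><DD>&nbsp;&nbsp;&nbsp; Text text text text text
text text Text text text text text
text text Text text text text text
text text
    <br><I>Additional irrelevant information</I>
    </DL><P>
EOF;

$pairs = extractPairs($html);
foreach ($pairs as $title => $text) {
    echo $title . " => " . $text . "\n";
}
```

Using the title as the array key is only safe if titles are unique; if they might repeat, appending [title, text] pairs to a list would be the safer choice.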
