regular expression breaking on new line

I'm trying to use a regular expression as below:

preg_match_all('|<table.*</table>|',$html,$matches, PREG_SET_ORDER);

But this is not working, and I think the problem is the new line inside the string $html.

Could someone tell me a workaround?


EDIT: I've realized that it's not right to use regex to parse HTML. Thanks to those who told me. :)


The dot does not match newlines unless the s pattern modifier is used.

preg_match_all('|<table.*?</table>|s',$html,$matches, PREG_SET_ORDER);
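To see what the s modifier and the lazy quantifier each contribute, here is a minimal sketch. The sample $html string and the variable names are made up for illustration and are not from the question.

```php
<?php
// Hypothetical sample input: two tables, each spanning several lines.
$html = "<p>intro</p>\n<table>\n<tr><td>A</td></tr>\n</table>\n<p>mid</p>\n<table>\n<tr><td>B</td></tr>\n</table>";

// Without s, the dot stops at the first newline, so nothing matches.
$plain = preg_match_all('|<table.*?</table>|', $html, $m1, PREG_SET_ORDER);

// With s and a lazy quantifier, each table is matched separately.
$lazy = preg_match_all('|<table.*?</table>|s', $html, $m2, PREG_SET_ORDER);

// With s and a greedy quantifier, a single match runs from the first
// <table to the last </table>, swallowing everything in between.
$greedy = preg_match_all('|<table.*</table>|s', $html, $m3, PREG_SET_ORDER);

echo "$plain $lazy $greedy\n"; // prints "0 2 1"
```

With PREG_SET_ORDER, $m2[0][0] holds the first table and $m2[1][0] the second.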

(Be aware that using regex to parse HTML ranks among the worst capital sins here on SO.)


Before making a decision on what to do next, I'd read this first: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

In general, it's not a good idea to parse HTML with regex.

I recommend using DOM.
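As a rough sketch of what the DOM route could look like, assuming PHP's bundled DOM extension is enabled (the sample markup and variable names below are invented for illustration):

```php
<?php
// Hypothetical sample input; real pages are usually messier.
$html = "<html><body><table id=\"mytbl\">\n<tr><td>ABC</td></tr>\n<tr><td>DEF</td></tr>\n</table></body></html>";

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$tables = [];
foreach ($doc->getElementsByTagName('table') as $table) {
    // saveHTML($node) serializes just that element and its children.
    $tables[] = $doc->saveHTML($table);
}

$cells = [];
foreach ($doc->getElementsByTagName('td') as $td) {
    $cells[] = $td->textContent;
}

echo count($tables), " table(s); cells: ", implode(", ", $cells), "\n";
```

Newlines inside the markup are irrelevant here, because the parser works on the element tree rather than on lines of text.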

You can check out the PHP Simple HTML DOM Parser as an alternative.

Main Features:

  • An HTML DOM parser written in PHP5+ that lets you manipulate HTML in a very easy way!
  • Requires PHP 5+.
  • Supports invalid HTML.
  • Find tags on an HTML page with selectors just like jQuery.
  • Extract contents from HTML in a single line.


preg_match_all('|<table.*?</table>|ms',$html,$matches, PREG_SET_ORDER);


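Did you try the multiline modifier m? (Note that m only changes where ^ and $ match; on its own it does not let the dot match newlines, which is what the s modifier is for.)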

preg_match_all('|<table.*</table>|m',$html,$matches, PREG_SET_ORDER);
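Since m and s are easy to mix up, here is a small sketch of what each one changes. The $text string is a made-up example.

```php
<?php
// Hypothetical two-line input.
$text = "first line\nsecond line";

// m changes the anchors: ^ also matches right after each newline.
$anchored_plain = preg_match_all('/^\w+/', $text, $a);  // finds "first" only
$anchored_m     = preg_match_all('/^\w+/m', $text, $b); // finds "first" and "second"

// m does not change the dot: it still refuses to cross the newline.
$dot_m = preg_match('/first.*second/m', $text);

// s is the modifier that lets the dot match the newline.
$dot_s = preg_match('/first.*second/s', $text);

echo "$anchored_plain $anchored_m $dot_m $dot_s\n"; // prints "1 2 0 1"
```

So for a table that spans several lines, the m pattern above would still find nothing unless s is added as well.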


Use the /s flag to have the '.' also apply to new line characters, or just check for new line characters explicitly - usually '[\n\r]'. I haven't yet read it myself, but do check out more info on the PCRE library at http://www.pcre.org/pcre.txt
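The explicit variant could look roughly like this. It is only a sketch: the sample string is invented, and the delimiter is switched to ~ because the pattern itself now contains a | for the alternation.

```php
<?php
// Hypothetical sample input with Windows-style line endings.
$html = "<table>\r\n<tr><td>A</td></tr>\r\n</table>";

// No s flag: at each step the group consumes either an ordinary
// character (.) or an explicit line-break character ([\n\r]).
$found = preg_match('~<table(?:.|[\n\r])*?</table>~', $html, $m);

echo $found, " ", ($m[0] === $html ? "whole table" : "partial"), "\n";
```

The s flag does the same job with less backtracking overhead, so the explicit class is mainly useful when only part of a pattern should be allowed to cross line breaks.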

Careful how you form your pattern though - long input strings with newlines mixed with misunderstood patterns can cause unexplained script failures and connection resets.

In your case, PCRE functions don't seem to be needed here, and could cause unexpected results anyway. If you're just looking to extract contents of a single table on a page, why not just do the most basic...

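// Find the first "<table" (with or without attributes) and the next closing tag.
$start = stripos($input, "<table");
$end = ($start === false) ? false : stripos($input, "</table>", $start);
// substr() takes a length, not an end offset; include the closing tag itself.
$my_table = ($end === false) ? null : substr($input, $start, $end - $start + strlen("</table>"));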


EDIT: I've realized that it's not right to use regex to parse HTML.

Better: if $html is well-formed XML, you can read it into a SimpleXML object and parse it with SimpleXML's XPath support. (Powerful and much easier to use than the DOM extension IMHO.)

Like this:

$html = "<html><body><table id=\"mytbl\"><tr><td>ABC</td></tr><tr><td>DEF</td></tr></table></body></html>";

$xml = simplexml_load_string($html);

if ($xml) {
    foreach ($xml->xpath("/html/body/*") as $item) {
        echo $item["id"] . "<br>"; // mytbl
        foreach ($item->tr as $tr) {
            echo $tr->td . "<br>"; // 1:ABC, 2:DEF
        }
    }
}