Trying to use regex matches between words using PHP

I am trying to use regular expressions to match HTML tags that might occur between words on a web page.

For example, if the sentence that I want to match is "This is a word", I need to develop a pattern that will match something like "This is a <b>word</b>".

I've tried using the code below to prepare the regex pattern:

$pattern = "/".str_replace(" ", ".{0,100}", $sentence)."/si";

This replaces every space with .{0,100} and uses the s modifier so that the dot also matches newlines. However, I am getting undesired results with this.

Thanks in advance for any help with this!


Use preg_replace() when you need to perform a regular expression search and replace. The older ereg_replace() was deprecated in PHP 5.3 and removed in PHP 7, so avoid it in new code.
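
As a rough sketch of how the pattern from the question could be tightened, the gap between words can be restricted to whitespace and tags instead of "any 0 to 100 characters". The helper name buildSentencePattern() and the sample input are made up for illustration, and the tag sub-pattern <[^>]*> is a simplification that will not cope with every kind of markup.

```php
<?php
// Hypothetical helper (not from the question): builds a pattern in which
// the gap between two words may hold whitespace and/or HTML tags.
function buildSentencePattern(string $sentence): string
{
    // Split the sentence into words and escape each one for regex use.
    $words  = preg_split('/\s+/', trim($sentence));
    $quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $words);

    // Between words: one or more whitespace characters or tags.
    $gap = '(?:\s|<[^>]*>)+';

    return '/' . implode($gap, $quoted) . '/si';
}

$pattern = buildSentencePattern('This is a word');
$html    = 'This is a <b>word</b>.';

if (preg_match($pattern, $html, $m)) {
    echo $m[0], "\n";                               // This is a <b>word
}

// Search and replace: wrap the whole match in square brackets.
$marked = preg_replace($pattern, '[$0]', $html);
echo $marked, "\n";                                 // [This is a <b>word]</b>.
```

Note that the match stops at the last word, so the closing tag is left outside it. Whether that is acceptable depends on what you do with the match afterwards.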


I put this together very quickly, so it probably doesn't cover all edge cases, but I think it at least partially matches your requirements. Also, I haven't tried it in PHP.

/[^\s>]+[\s]*(<([^>]+)>)(.*)(</\2>)[\s]*[^\s<]+/g

In the following example:

<p>This is a <b><i>nice</i> sentence</b>.</p> <p>Here's another sentence.</p>

It only matches the first sentence, in the following groups:

  1. <b>
  2. b
  3. <i>nice</i> sentence
  4. </b>
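
For PHP, the expression needs two small adjustments: there is no g modifier (preg_match_all() does the global search), and the / in the closing tag clashes with a / delimiter. The following is a sketch of one possible adaptation, using # as the delimiter. The variable names are my own.

```php
<?php
$html = "<p>This is a <b><i>nice</i> sentence</b>.</p> <p>Here's another sentence.</p>";

// "#" as delimiter, so the "/" in the closing tag needs no escaping.
// \2 refers back to the tag name captured by the second group.
$regex = '#[^\s>]+\s*(<([^>]+)>)(.*)(</\2>)\s*[^\s<]+#';

// PREG_SET_ORDER gives one array per match, with the groups inside it.
preg_match_all($regex, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[1], "\n"; // <b>
    echo $m[2], "\n"; // b
    echo $m[3], "\n"; // <i>nice</i> sentence
    echo $m[4], "\n"; // </b>
}
```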


What are you actually trying to achieve? Parsing an HTML document with regular expressions might not be the best solution. You can use XPath for what you've described (so far).
For example, to find all rows in a table that contain the text "this is a word":

<?php
$doc = new DOMDocument;
$doc->loadHTML('<html><head><title>...</title></head><body>
  <table>
    <tr><td>1</td><td>lalala</td></tr>
    <tr><td>2</td><td>this is a <b>word</b></td></tr>
    <tr><td>3</td><td>lalala</td></tr>
    <tr><td>4</td><td><b>And this is a</b> word, too</td></tr>
  </table>
</body></html>');

$xpath = new DOMXPath($doc);
foreach($xpath->query('/html/body/table/tr[./td[contains(., "this is a word")]]') as $tr) {
  foreach($tr->childNodes as $td) {
    echo $td->nodeValue, ' ';
  }
  echo "\n";
}

prints

2 this is a word 
4 And this is a word, too 
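
contains() is case-sensitive, so a sentence such as "This is a word" with a capital T would not be found by that query. A possible variation, sketched here on a made-up two-paragraph document, lower-cases the text with translate() and collapses runs of whitespace with normalize-space() before comparing:

```php
<?php
$doc = new DOMDocument;
$doc->loadHTML('<html><body><p>This is a <b>word</b>.</p><p>Nothing here.</p></body></html>');

$xpath = new DOMXPath($doc);

// XPath 1.0 has no lower-case() function; translate() maps A-Z to a-z instead.
$query = '//p[contains(translate(normalize-space(.), '
       . '"ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"), '
       . '"this is a word")]';

$found = [];
foreach ($xpath->query($query) as $p) {
    $found[] = $p->textContent;
    echo $p->textContent, "\n"; // This is a word.
}
```

The search string itself has to be written in lower case for this comparison to work.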


The regular expression

%(<[^>]+?>)\s*?((?:\w+\s*)*)\s*?(</[^>]+?>)%im 

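matches a word, or a simple multi-word phrase, enclosed between an opening and a closing tag. It captures the opening tag, the word or phrase, and the closing tag in separate groups, so you can access each of them easily alongside the full match. Note that it does not check that the closing tag has the same name as the opening tag.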

Let's say your input is HTML source code. Run preg_match_all() with the PREG_SET_ORDER flag. It fills $matches with one array per match, which is convenient for looping through with foreach.

In the script below, $html is the source page that you want to search, and $matches is a variable passed by reference that preg_match_all() fills with the results.

<?php
$html='
This is a <b>word</b>.
This is not a word.
This is a <span>three word phrase</span>.
';

$regex ='%(<[^>]+?>)\s*?((?:\w+\s*)*)\s*?(</[^>]+?>)%im';

preg_match_all($regex, $html, $matches, PREG_SET_ORDER);

foreach($matches as $val) {
    //$val[0] will always be the entire match with the tags
    echo "full match: " . $val[0] . "\n";

    //$val[1] will always be the opening tag
    echo "opening tag: " . $val[1] . "\n";

    //$val[2] will always be the word or words, if separated by spaces
    echo "word: " . $val[2] . "\n";

    //$val[3] will always be the closing tag
    echo "closing tag: " . $val[3] . "\n\n";
}
?>

The above script will output:

full match: <b>word</b>
opening tag: <b>
word: word
closing tag: </b>

full match: <span>three word phrase</span>
opening tag: <span>
word: three word phrase
closing tag: </span>