Regex to automate some html tagging

I have 800 entries that are very similar, but they need some stuff done to them. The format is like this:

<td class="description">

Describing text.

Might very well be 2 paragraphs

</td>

I need to do some stuff to the text inside the cell. I've tried to use preg_replace('/(.+)</td>/'). It ends up with two problems.

  1. I can't capture just what's inside the parentheses; the match also includes the cell tags.
  2. It will fetch everything until the last </td> in the document. I just want it to go to the first occurrence of </td>

Thanks in advance


First of all, .+ will grab everything... it won't just start at <td>. You will want to add a pattern that matches the opening tag of the table cell:

<td[^>]*?>

(Note: [^>]* matches any run of characters other than >, so it stops at the bracket that closes the tag.)

Also, .+ and .* are greedy, meaning they grab as much as possible. To change this behavior, add a ? after the quantifier, like so: .+?. This makes it match only as much as it needs to.
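
To make the greedy/lazy difference concrete, here is a small sketch. The sample string in $html is made up for illustration and is not taken from the question.

```php
<?php
// Hypothetical sample input: two cells in one string.
$html = '<td class="description">First</td><td class="description">Second</td>';

// Greedy: .+ runs on to the LAST </td> in the string.
preg_match('/<td[^>]*>(.+)<\/td>/s', $html, $greedy);

// Lazy: .+? stops at the FIRST </td>.
preg_match('/<td[^>]*>(.+?)<\/td>/s', $html, $lazy);

echo $greedy[1], "\n"; // First</td><td class="description">Second
echo $lazy[1], "\n";   // First
```

The greedy version swallows the boundary between the two cells, which is the "everything until the last </td>" problem from the question.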

So, you will have

<td([^>]*)>(.*?)<\/td>

This was a lesson on regex, but I really think you shouldn't be using regex for this. Regex can break pretty easily once you start having nested tables or anything more complicated than simple html.


Don't use regular expressions for HTML processing!

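If you still want to try it ... put the content in a capturing group so you can reference it without the tags, and use a lazy quantifier *? to match only up to the first closing tag. (Non-capturing groups (?:) are fine around the tags, but they are still part of the overall match.)

(?:<td[^>]*>)(.*?)(?:</td>)

This requires dot-all mode (the s modifier in PHP) and may still fail if, for example, an attribute value contains a closing angle bracket.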
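
The dot-all requirement can be checked with a short sketch. The input string below is an assumption modelled on the format in the question, and the pattern uses a plain capturing group around the content.

```php
<?php
// Hypothetical input: the cell content spans several lines.
$html = "<td class=\"description\">\nDescribing text.\n\nMight very well be 2 paragraphs\n</td>";

// Without the s modifier, . does not match newlines, so the pattern finds nothing.
$withoutS = preg_match('/<td[^>]*>(.*?)<\/td>/', $html, $m1);

// With the s modifier, group 1 holds the cell content without the tags.
$withS = preg_match('/<td[^>]*>(.*?)<\/td>/s', $html, $m2);

var_dump($withoutS);     // int(0)
var_dump($withS);        // int(1)
var_dump(trim($m2[1]));  // the two paragraphs, tags excluded
```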


If you're certain that there is no HTML in the table cells, the following non-regex code may help:

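// $entries contains all of the table cell entries.
$newentries = "";
// explode() replaces split(), which was removed in PHP 7.
$cells = explode("</td>", $entries);
// foreach replaces each(), which was removed in PHP 8.
foreach ($cells as $data) {
    // Skip the fragment after the last </td>; it holds no cell.
    if (strpos($data, "<td") === false) {
        continue;
    }
    $newentries .= "<td class=\"description\">";
    $text = substr($data, strpos($data, ">") + 1);
    // perform modifications on $text
    // e.g. $text = "<b>" . $text . "</b>";
    $newentries .= $text;
    $newentries .= "</td>";
}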

// $newentries now contains the modified cell entries.

This probably isn't 100% what you're looking for, but maybe it will help.


You may use:

preg_replace(
  '/<td (.*?)>(.*?)<\/td>/sm',
  '<td class="description"><strong>$2</strong></td>',
  $data
)

If what you are trying to do with the text inside is complicated, use a callback function (preg_replace_callback).
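
A rough sketch of the callback approach, reusing the pattern above with only the s modifier. The sample $data and the strtoupper(trim(...)) step are placeholders for whatever the real modification is.

```php
<?php
// Hypothetical input; in practice $data would hold the real markup.
$data = "<td class=\"description\">\nDescribing text.\n</td>";

$result = preg_replace_callback(
    '/<td (.*?)>(.*?)<\/td>/s',
    function (array $m): string {
        // $m[1] holds the attributes, $m[2] the cell content.
        // Placeholder modification: trim and upper-case the text.
        $text = strtoupper(trim($m[2]));
        return '<td ' . $m[1] . '>' . $text . '</td>';
    },
    $data
);

echo $result, "\n"; // <td class="description">DESCRIBING TEXT.</td>
```

The callback receives the same groups that $1 and $2 refer to in a plain replacement string, so arbitrary PHP logic can run on the cell text.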


As everyone else has said: regex is a bad fit, at least here!

So, the basic regex is

#<td[^>]*>(.*?)</td>#s

(Note that I used the s modifier; without it, . doesn't match newlines, so the regex wouldn't work on multi-line cells.)

Now, this regex is wrong, even though it may be okay for your purposes. To be stricter, you have to know that > is allowed inside attribute values, so this regex may break things. A version that matches each attribute explicitly:

#<td(\s+\w+="[^"]*")*\s*>(.*?)</td>#s

I think this should be fairly robust if you're dealing with XML. But sure, it may break on rare occasions that I can't think of right now.
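
To illustrate the point about > inside attributes, here is a small comparison. The input is invented for the purpose, and the strict pattern shown is a variant that allows any number of double-quoted attributes, which is an assumption about the intent.

```php
<?php
// Hypothetical cell whose title attribute contains a literal >.
$html = '<td title="a > b" class="description">text</td>';

// Simple pattern: [^>]* stops at the > inside the attribute value.
preg_match('#<td[^>]*>(.*?)</td>#s', $html, $simple);

// Stricter pattern: each attribute is matched as name="value".
preg_match('#<td(\s+\w+="[^"]*")*\s*>(.*?)</td>#s', $html, $strict);

echo $simple[1], "\n"; //  b" class="description">text
echo $strict[2], "\n"; // text
```

Note that the content sits in group 2 of the stricter pattern, because group 1 is taken by the attribute match.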


$d = new DOMDocument();
$d->loadHTML($htmlstring);
$x = new DOMXPath($d);
// Select every text node inside <td class="description"> cells.
$tds = $x->query("//td[@class='description']//text()");
// Iterate directly; the original loop ran from 1 to length and skipped item 0.
foreach ($tds as $node) {
    // strtoupper() stands in for the real modification.
    $node->data = strtoupper($node->data);
}
var_dump($d->saveHTML());