Regex to automate some html tagging
I have 800 entries that are very similar, but they need some stuff done to them. The format is like this:
<td class="description">
Describing text.
Might very well be 2 paragraphs
</td>
I need to do some stuff to the text inside the cell. I've tried to use preg_replace('/(.+)</td>/'). It ends up with two problems.
- I can't fetch just what's inside the parentheses; the match also includes the cell tags.
- It will fetch everything up to the last </td> in the document. I just want it to go to the first occurrence of </td>.
Thanks in advance
First of all, .+ will grab everything; it won't just start at <td>. You will want to add a pattern that matches the opening tag of the table cell:
<td[^>]*?>
(Note: [^>]* means match non-> characters until we find a >.)
Also, .+ and .* are greedy, meaning that they will grab as much as possible. To change this behavior, add a ? after the quantifier, like so: .+? This makes it match only as much as it needs to.
So, you will have
<td[^>]*>(.*?)<\/td>
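To make the greedy/lazy difference concrete, here is a minimal sketch. The sample string in $html is made up for illustration, and the s modifier is added so that . can also match newlines.

```php
<?php
// Hypothetical sample input: two cells on one line.
$html = '<td class="description">First</td><td class="description">Second</td>';

// Greedy: .* runs to the LAST </td>, so both cells collapse into one match.
preg_match_all('/<td[^>]*>(.*)<\/td>/s', $html, $greedy);

// Lazy: .*? stops at the FIRST </td>, giving one match per cell.
preg_match_all('/<td[^>]*>(.*?)<\/td>/s', $html, $lazy);

print_r($greedy[1]); // one element: First</td><td class="description">Second
print_r($lazy[1]);   // two elements: First, Second
```

In both patterns the capture group holds only the text between the tags, so the cell tags themselves are not part of $greedy[1] or $lazy[1].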
This was a lesson on regex, but I really think you shouldn't be using regex for this. Regex can break pretty easily once you start having nested tables or anything more complicated than simple html.
Don't use regular expressions for HTML processing!
If you still want to try it: put the tags in non-capturing groups (?:), capture the content between them, and use the lazy quantifier *? to match only up to the first closing tag.
(?:<td[^>]*>)(.*?)(?:</td>)
This requires dot-all mode and may still fail if, for example, an attribute value in the opening tag contains a closing angle bracket.
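As a rough illustration of why dot-all mode matters, the sketch below runs the same pattern with and without PHP's s modifier. The input string is a made-up cell modeled on the format in the question.

```php
<?php
// Hypothetical cell whose text spans several lines, as in the question.
$html = "<td class=\"description\">\nDescribing text.\nSecond paragraph.\n</td>";

// Without the s modifier, . does not match a newline, so nothing matches.
$plain = preg_match('/<td[^>]*>(.*?)<\/td>/', $html, $m1);

// With the s modifier (dot-all), . also matches newlines.
$dotall = preg_match('/<td[^>]*>(.*?)<\/td>/s', $html, $m2);

var_dump($plain);  // int(0)
var_dump($dotall); // int(1)
echo trim($m2[1]), "\n";
```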
If you're certain that there is no HTML in the table cells, the following non-regex code may help:
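// $entries contains all of the table cell entries.
$newentries = "";
// explode() replaces split(), which was removed in PHP 7.
$cells = explode("</td>", $entries);
// foreach replaces each(), which was removed in PHP 8.
foreach ($cells as $data) {
    // Skip the fragment after the last </td>; it has no opening tag.
    if (strpos($data, "<td") === false) {
        continue;
    }
    $newentries .= "<td class=\"description\">";
    // Keep only the text after the ">" that ends the opening tag.
    $text = substr($data, strpos($data, ">") + 1);
    // perform modifications on $text
    // e.g. $text = "<b>" . $text . "</b>";
    $newentries .= $text;
    $newentries .= "</td>";
}
// $newentries now contains the modified cell entries.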
This probably isn't 100% what you're looking for, but maybe it will help.
You may use:
preg_replace(
'/<td (.*?)>(.*?)<\/td>/sm',
'<td class="description"><strong>$2</strong></td>',
$data
)
If what you are trying to do with the text inside is complicated, use a callback function (preg_replace_callback).
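A minimal sketch of the callback variant, assuming the same pattern as above. The sample $data and the uppercase transformation are placeholders for whatever processing is actually needed.

```php
<?php
// Hypothetical input with a multi-line cell.
$data = "<td class=\"description\">\nDescribing text.\n</td>";

$result = preg_replace_callback(
    '/<td (.*?)>(.*?)<\/td>/sm',
    function (array $m): string {
        // $m[1] holds the attributes, $m[2] the cell text.
        // Placeholder processing: trim whitespace and uppercase the text.
        $text = strtoupper(trim($m[2]));
        return '<td class="description"><strong>' . $text . '</strong></td>';
    },
    $data
);

echo $result, "\n";
// <td class="description"><strong>DESCRIBING TEXT.</strong></td>
```

The callback receives the same groups that $1 and $2 refer to in the replacement string, but can run arbitrary PHP on them before returning the new cell.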
As everyone else has said: regex is a bad fit, at least here!
So, basic Regex is
#<td[^>]*>(.*?)</td>#s
(Note that I used the s modifier; without it, . would not match the newlines inside the cell and the regex wouldn't work.)
Now, this regex is wrong, even though it may be okay for your purposes. To be stricter, you have to know that > is allowed in attribute values, so the regex above may break. A stricter version:
#<td(\s+\w+="[^"]*")*\s*>(.*?)</td>#s
I think this will be fairly robust if you're dealing with XML. It may still break on rare occasions that I can't think of right now.
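To show the kind of input that separates the two patterns, here is a small sketch. The title attribute is a made-up example, and the stricter pattern is written along the lines of the one above, allowing any number of name="value" attributes.

```php
<?php
// Hypothetical cell whose title attribute contains a ">" character.
$html = '<td title="a > b">Text</td>';

// Simple pattern: [^>]* stops at the ">" inside the attribute value.
preg_match('#<td[^>]*>(.*?)</td>#s', $html, $simple);

// Stricter pattern: attributes are matched as name="value" pairs.
preg_match('#<td(\s+\w+="[^"]*")*\s*>(.*?)</td>#s', $html, $strict);

var_dump($simple[1]); // string(8) " b">Text"
var_dump($strict[2]); // string(4) "Text"
```

The stricter pattern still assumes double-quoted attribute values, so single-quoted or unquoted attributes would not match.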
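$d = new DOMDocument();
$d->loadHTML($htmlstring);
$x = new DOMXPath($d);
// Select every text node inside the description cells.
$tds = $x->query("//td[@class='description']//text()");
// Use foreach: items are indexed from 0 to length - 1, so a loop
// from 1 to length skips the first node and reads past the last.
foreach ($tds as $node) {
    // Replace the node's entire text with the modified version.
    $node->replaceData(0, $node->length, strtoupper($node->data));
}
var_dump($d->saveHTML());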