Remove empty tags using RegEx
I want to delete empty tags such as <label></label>, <font> </font>, so that:
<label></label><form></form>
<p>This is <span style="color: red;">red</span>
<i>italic</i>
</p>
will be cleaned as:
<p>This is <span style="color: red;">red</span>
<i>italic</i>
</p>
I have this RegEx in JavaScript. It deletes the empty tags, but it also deletes this: "<i>italic</i></p>"
str=str.replace(/<[\S]+><\/[\S]+>/gim, "");
What am I missing?
Your character class is "not whitespace", which means "<i>italic</i></p>" will match. The first half of your regex will match "<(i>italic</i)>" and the second half "</(p)>". (I've used brackets to show what each [\S]+ matches.)
Change this:
/<[\S]+><\/[\S]+>/
To this:
/<[^/>][^>]*><\/[^>]+>/
Overall you should really be using a proper HTML processor, but if you're munging HTML soup this should suffice :)
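As a quick check, here is a minimal sketch that runs both patterns against the sample from the question. The variable names are mine, and I've escaped the "/" inside the character class, which should behave the same as the unescaped form above.

```javascript
// Sample input from the question.
const html = '<label></label><form></form>\n' +
  '<p>This is <span style="color: red;">red</span>\n' +
  '<i>italic</i>\n' +
  '</p>';

// Original pattern: \S also matches ">" and "/", so one match can span several tags.
const originalPattern = /<[\S]+><\/[\S]+>/gim;

// Corrected pattern: the opening tag may not start with "/" and may not contain ">".
const fixedPattern = /<[^\/>][^>]*><\/[^>]+>/gim;

// Only the two empty pairs on the first line are removed.
const cleaned = html.replace(fixedPattern, '');
console.log(cleaned);

// The failure case from the question.
const sample = '<i>italic</i></p>';
const originalResult = sample.replace(originalPattern, ''); // everything is deleted
const fixedResult = sample.replace(fixedPattern, '');       // nothing is deleted
console.log(JSON.stringify(originalResult), JSON.stringify(fixedResult));
```

Note that the newline after the removed first line is left in place; trim the result if that matters to you.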
Regex is not for HTML. If you're in JavaScript anyway, I'd encourage you to use jQuery DOM processing.
Something like:
$('*:empty').remove();
Alternatively:
$("*").filter(function()
{
    // Keep only elements whose HTML is empty or whitespace, then remove them.
    return $.trim($(this).html()).length === 0;
}).remove();
All the regex answers only match
<label></label>
but not cases such as
<label> </label>
<label> </label>
<label>
</label>
Try this pattern to match all of the above:
<[^/>]+>[ \n\r\t]*</[^>]+>
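For anyone who wants to try it, here is a small sketch of that pattern as a JavaScript regex literal. The sample strings and variable names are just illustrative; the "/" characters are escaped so the literal parses, and the g flag is my addition so that every match is replaced.

```javascript
// The pattern above as a JavaScript literal.
const emptyWithWhitespace = /<[^\/>]+>[ \n\r\t]*<\/[^>]+>/g;

const samples = [
  '<label></label>',     // nothing between the tags
  '<label> </label>',    // one space
  '<label>   </label>',  // several spaces
  '<label>\n</label>',   // a line break
  '<label>text</label>', // real content, must be kept
];

const results = samples.map((s) => s.replace(emptyWithWhitespace, ''));
console.log(results);
```

One thing to be aware of: because the opening tag may not contain "/" anywhere, an empty tag with a slash in an attribute value (a URL, for example) will not be matched by this pattern.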
You need /<[\S]+?><\/[\S]+?>/. The difference is the ?s after the +s, to match "as few as possible" (AKA "non-greedy match") nonspace characters (though 1 or more), instead of the bare +s which match "as many as possible" (AKA "greedy match").
Avoiding regular expressions altogether, as the other answer recommends, is also an excellent idea, but I wanted to point out the important greedy vs non-greedy distinction, which will serve you well in a huge variety of situations where regexes are warranted.
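To make the greedy vs non-greedy distinction concrete, here is a small sketch with two adjacent empty elements (the sample strings and variable names are mine). It also includes the string from the question, because as far as I can tell \S still matches ">", so the lazy version can still stretch across a non-empty element; a negated character class avoids that.

```javascript
const greedy = /<[\S]+><\/[\S]+>/;
const lazy = /<[\S]+?><\/[\S]+?>/;

// Two empty elements with no whitespace between them.
const adjacent = '<b></b><i></i>';
const greedyMatch = adjacent.match(greedy)[0]; // grabs both elements in one match
const lazyMatch = adjacent.match(lazy)[0];     // stops at the first closing tag
console.log(greedyMatch, lazyMatch);

// The string from the question: laziness alone does not stop the match here,
// because the lazy quantifier is still allowed to grow until the pattern fits.
const nonEmpty = '<i>italic</i></p>';
const lazySpans = lazy.test(nonEmpty);

// Excluding ">" (and a leading "/") from the tag keeps each half inside one tag.
const negated = /<[^\/>][^>]*><\/[^>]+>/;
const negatedSpans = negated.test(nonEmpty);
console.log(lazySpans, negatedSpans);
```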
I like MattMitchell's jQuery solution but here is another option using native JavaScript.
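function CleanChildren(elem)
{
    // Walk backwards: childNodes is a live list and shrinks as nodes are removed.
    for (var i = elem.childNodes.length - 1; i >= 0; i--)
    {
        var child = elem.childNodes[i];
        if (child.nodeType !== 1) // skip text and comment nodes
            continue;
        CleanChildren(child);
        // After cleaning, an element with no child nodes left is empty.
        if (!child.hasChildNodes())
            elem.removeChild(child);
    }
}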
Here's a modern native JavaScript solution; which is actually quite similar to the jQuery one from 2010. I adapted it from that answer for a project that I am working on, and thought I would share it here.
document.querySelectorAll("*:empty").forEach((x)=>{x.remove()});
document.querySelectorAll returns a NodeList, which is essentially an array of all DOM nodes which match the CSS selector given to it as an argument. *:empty is a selector which selects all elements (* means "any element") that are empty (which is what :empty means).
This will select any empty element within the entire document. If you only want to remove empty elements from within a certain part of the page (i.e. only those within some div element), you can add an id to that element and then use the selector #id *:empty, which means any empty element within the element with an id of id.
This is almost certainly what you want. Technically, some important tags (e.g. <meta> tags, <br> tags, <img> tags, etc.) are "empty", so without specifying a scope you will end up deleting some tags you probably care about.
forEach loops through every element in the resulting NodeList and runs the anonymous function (x)=>{x.remove()} on it. x is the current element in the list, and calling .remove() on it removes that element from the DOM.
Hopefully this helps someone. It's amazing to see how far JavaScript has come in just 8 years; from almost always needing a library to write something complex like this in a concise manner to being able to do so natively.
Edit
So, the method detailed above will work fine in most circumstances, but it has two issues:
- Elements like <div> </div> are not treated as :empty (note the space in between). CSS Level 4 selectors fix this with the introduction of the :blank selector (which is like :empty except it ignores whitespace), but currently only Firefox supports it (in vendor-prefixed form).
- Self-closing tags are caught by :empty, and this will remain the case with :blank, too.
I have written a slightly larger function which deals with these two use cases:
document.querySelectorAll("*").forEach((x)=>{
    // Build the closing tag this element would end with, e.g. "</DIV>".
    let tagName = "</" + x.tagName + ">";
    // Remove the element only if its HTML ends with that closing tag
    // and its contents have no non-whitespace character.
    if (x.outerHTML.slice(-tagName.length).toUpperCase() == tagName
        && !/[^\s]/.test(x.innerHTML)) {
        x.remove();
    }
});
We iterate through every element on the page. We grab that element's tag name (for example, if the element is a div this would be DIV) and use it to construct a closing tag, e.g. </DIV>.
That tag is 6 characters long. We check if the upper-cased last 6 characters of the element's HTML match that. If they do, we continue. If they don't, the element doesn't have a closing tag, and therefore must be self-closing. This is preferable to a list, because it means you don't have to update anything should a new self-closing tag get added to the spec.
Then, we check if the contents of the element contain any non-whitespace characters. /[^\s]/ is a RegEx. [] is a set in RegEx, and will match any character that appears inside it. If ^ is the first element, the set becomes negated: it will match any character that is NOT in the set. \s means whitespace (tabs, spaces, line breaks). So what [^\s] says is "any character that is not white space".
Putting that together, if the tag is not self-closing and its contents do not contain a non-whitespace character, then we remove it.
Of course, this is a bit bigger and less elegant than the previous one-liner. But it should work for essentially every case.
This is an issue of greedy regex. Try this:
str=str.replace(/<[^\/>]+><\/[^>]+>/gim, "");
or
str=str.replace(/<[\S]+?><\/[\S]+>/gim, "");
In your regex, <[\S]+> matches <i>italic</i> and the <\/[\S]+> matches the </p>.
You can use this one
text = text.replace(/<[^\/>][^>]*>\s*<\/[^>]+>/gim, "");
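A single pass like this only removes the innermost empty pairs, so an element that becomes empty after its children are removed is left behind. Below is a rough sketch of one way to handle that by repeating the replacement until the string stops changing. The function name removeEmptyTags and the sample string are made up for illustration.

```javascript
// Hypothetical helper: repeats the replacement until nothing changes,
// so that elements left empty by an earlier pass are removed too.
function removeEmptyTags(text) {
  const emptyPair = /<[^\/>][^>]*>\s*<\/[^>]+>/g;
  let previous;
  do {
    previous = text;
    text = text.replace(emptyPair, '');
  } while (text !== previous);
  return text;
}

const nested = '<div><span> </span></div><p>kept</p>';

// One pass removes the inner <span> but leaves the now-empty <div>.
const once = nested.replace(/<[^\/>][^>]*>\s*<\/[^>]+>/g, '');

// Repeating the pass removes the <div> as well.
const fully = removeEmptyTags(nested);
console.log(once, fully);
```

Keep in mind that the pattern does not check that the opening and closing tag names agree, so this is only suitable for reasonably well-formed markup.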
I found this on CodePen. It uses jQuery, but it does the job:
$('element').each(function() {
if ($(this).text() === '') {
$(this).remove();
}
});
You will need to change 'element' to point to where you want to remove empty tags. Do not point it at the whole document, because that leads to the problem described in Toastrackenigma's answer.
Remove empty tags with cheerio. Note that this also removes images:
$('*')
.filter(function(index, el) {
return (
$(el)
.text()
.trim().length === 0
)
})
.remove()
Remove empty tags with cheerio, but keep images:
$('*')
.filter(function(index, el) {
return (
el.tagName !== 'img' &&
$(el).find(`img`).length === 0 &&
$(el)
.text()
.trim().length === 0
)
})
.remove()
<([^>]+)\s*>\s*<\/\1\s*>
<div>asdf</div>
<div></div> -- will match only this
<div></notdiv>
-- and this
<div >
</div >
Try it yourself at https://regexr.com/
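If you would rather check it in JavaScript, here is a minimal sketch using the same test lines. The variable names are mine, and the g flag is added so that match() returns every match. The backreference \1 is what requires the closing tag to repeat whatever the opening tag captured.

```javascript
// \1 refers back to whatever the first group captured in the opening tag.
const sameTag = /<([^>]+)\s*>\s*<\/\1\s*>/g;

const input = [
  '<div>asdf</div>',  // has content, not matched
  '<div></div>',      // matched
  '<div></notdiv>',   // closing name differs, not matched
  '<div >\n</div >',  // matched, whitespace and line break allowed
].join('\n');

const matches = input.match(sameTag);
console.log(matches);
```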