Remove empty tags using RegEx

2023-01-04 18:18 问答作者：

I want to delete empty tags such as <label></label>, <font> </font> so that:

<label></label><form></form>
<p>This is <span style="color: red;">red</span> 
<i>italic</i&开发者_StackOverflowgt;
</p>

will be cleaned as:

<p>This is <span style="color: red;">red</span> 
<i>italic</i>
</p>

I have this RegEx in javascript, but it deletes the the empty tags but it also delete this: "<i>italic</i></p>"

str=str.replace(/<[\S]+><\/[\S]+>/gim, "");

What I am missing?

You have "not spaces" as your character class, which means "<i>italic</i></p>" will match. The first half of your regex will match "<(i>italic</i)>" and the second half "</(p)>". (I've used brackets to show what each [\S]+ matches.)

Change this:

/<[\S]+><\/[\S]+>/

To this:

/<[^/>][^>]*><\/[^>]+>/

Overall you should really be using a proper HTML processor, but if you're munging HTML soup this should suffice :)

Regex is not for HTML. If you're in JavaScript anyway I'd be encouraged to use jQuery DOM processing.

Something like:

$('*:empty').remove();

Alternatively:

$("*").filter(function() 
{ 
     return $.trim($(this).html()).length > 0; 
}).remove();

All the answers with regex are only validate

<label></label>

but in the case of

<label> </label>
<label>    </label>
<label>
</label>

try this pattern to get all the above

<[^/>]+>[ \n\r\t]*</[^>]+>

You need /<[\S]+?><\/[\S]+?>/ -- the difference is the ?s after the +s, to match "as few as possible" (AKA "non-greedy match") nonspace characters (though 1 or more), instead of the bare +s which match"as many as possible" (AKA "greedy match").

Avoiding regular expressions altogether, as the other answer recommends, is also an excellent idea, but I wanted to point out the important greedy vs non-greedy distinction, which will serve you well in a huge variety of situations where regexes are warranted.

I like MattMitchell's jQuery solution but here is another option using native JavaScript.

function CleanChildren(elem)
{
    var children = elem.childNodes;
    var len = elem.childNodes.length;

    for (var i = 0; i < len; i++)
    {
        var child = children[i];

        if(child.hasChildNodes())
            CleanChildren(child);
        else
            elem.removeChildNode(child);

    }
}

Here's a modern native JavaScript solution; which is actually quite similar to the jQuery one from 2010. I adapted it from that answer for a project that I am working on, and thought I would share it here.

document.querySelectorAll("*:empty").forEach((x)=>{x.remove()});

document.querySelectorAll returns a NodeList; which is essentially an array of all DOM nodes which match the CSS selector given to it as an argument.
- *:empty is a selector which selects all elements (* means "any element") that is empty (which is what :empty means).
  
  This will select any empty element within the entire document, if you only wanted to remove any empty elements from within a certain part of the page (i.e. only those within some div element); you can add an id to that element and then use the selector #id *:empty, which means any empty element within the element with an id of id.
  
  This is almost certainly what you want. Technically some important tags (e.g. <meta> tags, <br> tags, <img> tags, etc) are "empty"; so without specifying a scope, you will end up deleting some tags you probably care about.
forEach loops through every element in the resulting NodeList, and runs the anonymous function (x)=>{x.remove()} on it. x is the current element in the list, and calling .remove() on it removes that element from the DOM.

Hopefully this helps someone. It's amazing to see how far JavaScript has come in just 8 years; from almost always needing a library to write something complex like this in a concise manner to being able to do so natively.

Edit

So, the method detailed above will work fine in most circumstances, but it has two issues:

Elements like <div> </div> are not treated as :empty (not the space in-between). CSS Level 4 selectors fix this with the introduction of the :blank selector (which is like empty except it ignores whitespace), but currently only Firefox supports it (in vendor-prefixed form).
Self-closing tags are caught by :empty - and this will remain the case with :blank, too.

I have written a slightly larger function which deals with these two use cases:

document.querySelectorAll("*").forEach((x)=>{
    let tagName = "</" + x.tagName + ">";
    if (x.outerHTML.slice(tagName.length).toUpperCase() == tagName
        && /[^\s]/.test(x.innerHTML)) {
        x.remove();
    }
});

We iterate through every element on the page. We grab that element's tag name (for example, if the element is a div this would be DIV, and use it to construct a closing tag - e.g. </DIV>.

That tag is 6 characters long. We check if the upper-cased last 6 characters of the elements HTML matches that. If it does we continue. If it doesn't, the element does't have a closing tag, and therefore must be self-closing. This is preferable over a list, because it means you don't have to update anything should a new self-closing tag get added to the spec.

Then, we check if the contents of the element contain any whitespace. /[^\s]/ is a RegEx. [] is a set in RegEx, and will match any character that appears inside it. If ^ is the first element, the set becomes negated - it will match any element that is NOT in the set. \s means whitespace - tabs, spaces, line breaks. So what [^\s] says is "any character that is not white space".

Matching against that, if the tag is not self-closing, and its contents contain a non-whitespace character, then we remove it.

Of course, this is a bit bigger and less elegant than the previous one-liner. But it should work for essentially every case.

This is an issue of greedy regex. Try this:

str=str.replace(/<[\^>]+><\/[\S]+>/gim, "");

str=str.replace(/<[\S]+?><\/[\S]+>/gim, "");

In your regex, <[\S]+?> matches <i>italic</i> and the <\/[\S]+> matches the </p>

You can use this one text = text.replace(/<[^/>][^>]>\s</[^>]+>/gim, "");

found this on code pen: jQuery though but does the job

$('element').each(function() {
  if ($(this).text() === '') {
    $(this).remove();
  }
});

You will need to alter the element to point to where you want to remove empty tags. Do not point at document cause it will result in my answer at Toastrackenigma

remove empty tags with cheerio will and also removing images:

  $('*')
    .filter(function(index, el) {
      return (
        $(el)
          .text()
          .trim().length === 0
      )
    })
    .remove()

remove empty tags with cheerio, but also keep images:

  $('*')
    .filter(function(index, el) {
      return (
        el.tagName !== 'img' &&
        $(el).find(`img`).length === 0 &&
        $(el)
          .text()
          .trim().length === 0
      )
    })
    .remove()

<([^>]+)\s*>\s*<\/\1\s*>

<div>asdf</div>
<div></div> -- will match only this
<div></notdiv>
-- and this
<div  >  
    </div   >

try yourself https://regexr.com/

继续阅读：javascript regex

Remove empty tags using RegEx

Edit

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

Edit

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？