delete html comment tags using regexp

2023-01-22 20:12

This is what my text (HTML) file looks like:
    <!--
     |                                |
     |  This is a dummy comment       |
     |      please delete me          |
     |         asap                   |
     |                                |
      ________________________________
     | -->

    this is another line 
    in this long dummy html file...
    please do not delete me

I'm trying to delete the comment using sed:

cat file.html | sed 's/.*<!--\(.*\)-->.*//g'

It doesn't work :( What am I doing wrong?

Thank you very much for your help!

patrickmdnet has the correct answer. Here it is on one line using extended regex:

cat file.html | sed -e :a -re 's/<!--.*?-->//g;/<!--/N;//ba'

Here is a good resource for learning more about sed. This sed command is an adaptation of one-liner #92:

http://www.catonmat.net/blog/sed-one-liners-explained-part-three/

One problem with your original attempt is that your regex only handles comments that are entirely on one line. Also, the leading and trailing ".*" will remove non-comment text.
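To make those two problems concrete, here is a small sketch. The sample lines are made up for illustration (they are not from the question's file), and the variable names are arbitrary.

```shell
# Hypothetical sample line: a comment with real text on both sides
line='keep <!-- drop --> this'

# Original attempt: the leading and trailing .* swallow the whole line
original=$(printf '%s\n' "$line" | sed 's/.*<!--\(.*\)-->.*//g')

# Without the surrounding .*, only the comment is removed
trimmed=$(printf '%s\n' "$line" | sed 's/<!--.*-->//g')

# A comment split over two lines: sed sees one line at a time,
# so neither line matches and nothing is removed
multi=$(printf '%s\n' '<!-- a' 'b -->' | sed 's/.*<!--\(.*\)-->.*//g')

printf 'original: [%s]\n' "$original"   # original: []
printf 'trimmed:  [%s]\n' "$trimmed"    # trimmed:  [keep  this]
printf '%s\n' "$multi"                  # both lines come back unchanged
```

So the single-line case loses the surrounding text, and the multi-line case is never touched, which is why the loop with N is needed.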

You would be better off using existing code instead of rolling your own.

http://sed.sourceforge.net/grabbag/scripts/strip_html_comments.sed

#! /bin/sed -f
# Delete HTML comments
# i.e. everything between <!-- and -->
# by Stewart Ravenhall <stewart.ravenhall@ukonline.co.uk>

/<!--/!b
:a
/-->/!{
    N
    ba
}
s/<!--.*-->//

(from http://sed.sourceforge.net/grabbag/scripts/)
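In case it helps, this is roughly how the script could be tried out. It is only a sketch: the temp directory, the file names, and the sample input are assumptions for illustration, and the script body is saved without its shebang line and run through `sed -f`.

```shell
# Assumption: a throwaway temp directory and a made-up sample file stand in for real paths
tmpdir=$(mktemp -d)

# The script body from above, saved without the shebang line
cat > "$tmpdir/strip_html_comments.sed" <<'EOF'
/<!--/!b
:a
/-->/!{
    N
    ba
}
s/<!--.*-->//
EOF

# Made-up input: one comment spanning two lines, then an ordinary line
printf '%s\n' 'before <!-- one' 'two --> after' 'plain line' > "$tmpdir/sample.html"

stripped=$(sed -f "$tmpdir/strip_html_comments.sed" "$tmpdir/sample.html")
printf '%s\n' "$stripped"

rm -rf "$tmpdir"
```

With this input the two comment lines are joined and the text on either side of the comment is kept on one line. Note that `.*` is greedy, so two separate comments on the same line would be removed together with whatever sits between them.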

See this link for various ways to use Perl modules for removing HTML comments (using Regexp::Common, HTML::Parser, or File::Comments). I am sure there are methods using other utilities.

http://www.perlmonks.org/?node_id=500603

I think you can do this with awk if you want. Starting with this file:

[~] $ more test.txt
<!--

An HTML style comment 

-->

Some other text

<div>
<p>blah</p>
</div>

<!-- Whoops
     Another comment -->
<span>Something</span>

Result of the awk:

[~]$ cat test.txt | awk '/<!--/ {off=1} /-->/ {off=2} {if (off==0) print; if (off==2) off=0}'
Some other text

<div>
<p>blah</p>
</div>

<span>Something</span>
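One thing to watch with this approach, sketched on made-up input: any line that contains a comment marker is dropped whole, including markup that sits next to the comment. The awk program is written here with a pattern-less final rule, which simply runs on every line.

```shell
# Made-up input: the first line mixes markup and a comment
result=$(printf '%s\n' '<p>keep</p> <!-- note -->' 'next line' |
    awk '/<!--/ {off=1} /-->/ {off=2} {if (off==0) print; if (off==2) off=0}')

printf '%s\n' "$result"   # prints only: next line
```

The `<p>keep</p>` part is lost along with the comment, so this is only safe when comments sit on lines of their own.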

Improving (hopefully) on the awk-based answer provided by eldarerathis:

The code below addresses the concern raised by john-jones.

In this version, the prefix leading up to the start of the html comment is preserved, as is the suffix following the close of the html comment.

$ cat some-file | awk '/<!--/ { mode=1; start=index($0,"<!--"); prefix=substr($0,1,start-1); } /-->/ { mode=2; start=index($0, "-->")+3; suffix=substr($0,start); print prefix suffix; prefix=""; suffix=""; } /./ { if (mode==0) print $0; if (mode==2) mode=0; }'

For example:

$ cat test.txt
<!--

An HTML style comment

-->

<meta charset="utf-8"> <!-- charset encoding must be within the first 1024 bytes of the document -->
Some other text

<div>
<p>blah</p>
</div>

<!-- Whoops
     Another comment -->
<span>Something</span>

<div> <!-- start of foo -->
foo
</div> <!-- end of foo -->

<div> <!-- start of multiline comment
bar
end of multiline comment --> </div>

$ cat test.txt | awk '/<!--/ { mode=1; start=index($0,"<!--"); prefix=substr($0,1,start-1); } /-->/ { mode=2; start=index($0, "-->")+3; suffix=substr($0,start); print prefix suffix; prefix=""; suffix=""; } /./ { if (mode==0) print $0; if (mode==2) mode=0; }'

<meta charset="utf-8">
Some other text
<div>
<p>blah</p>
</div>

<span>Something</span>
<div>
foo
</div>
<div>  </div>
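If you also need several comments on one line handled and blank lines outside comments kept, a variation along these lines might work. This is a sketch, not a tested replacement for a real HTML parser: the function name `strip_html_comments` is hypothetical, it reads stdin, and unlike the command above it does not join the prefix and suffix of a multi-line comment onto one line.

```shell
# Hypothetical helper name; reads HTML on stdin, writes the stripped text to stdout
strip_html_comments() {
    awk '
    BEGIN { in_comment = 0 }
    {
        line = $0
        out = ""
        touched = in_comment          # 1 if this line starts inside a comment
        while (line != "") {
            if (in_comment) {
                close_at = index(line, "-->")
                if (close_at == 0) break              # comment runs past this line
                line = substr(line, close_at + 3)
                in_comment = 0
            } else {
                open_at = index(line, "<!--")
                if (open_at == 0) { out = out line; break }
                out = out substr(line, 1, open_at - 1)
                line = substr(line, open_at + 4)
                in_comment = 1
                touched = 1
            }
        }
        # Skip lines that held nothing but comment text
        if (!touched || out != "") print out
    }'
}

# Made-up input: two comments on one line, a blank line, then a multi-line comment
printf '%s\n' 'a <!-- x --> b <!-- y --> c' '' '<!-- m' '' 'n --> tail' | strip_html_comments
```

Usage on a file would be `strip_html_comments < file.html`. Lines that consist only of comment text are dropped, while blank lines outside comments are printed as they are.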
