Delete HTML comment tags using regexp
This is what my text (HTML) file looks like:
<!--
| |
| This is a dummy comment |
| please delete me |
| asap |
| |
________________________________
| -->
this is another line
in this long dummy html file...
please do not delete me
I'm trying to delete the comment using sed:
cat file.html | sed 's/.*<!--\(.*\)-->.*//g'
It doesn't work :( What am I doing wrong?
Thank you very much for your help!
patrickmdnet has the correct answer. Here it is on one line using extended regex:
cat file.html | sed -e :a -re 's/<!--.*?-->//g;/<!--/N;//ba'
Here is a good resource for learning more about sed. This sed command is an adaptation of one-liner #92:
http://www.catonmat.net/blog/sed-one-liners-explained-part-three/
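One caveat worth checking before relying on the one-liner: sed regular expressions have no lazy quantifiers, so the `?` after `.*` does not make the match non-greedy. The sketch below isolates just the substitution (the helper name and the sample line are made up for illustration) and assumes GNU sed, since `-r` and the treatment of `.*?` may differ in other implementations.

```shell
# Hypothetical helper wrapping only the substitution from the one-liner above.
strip_one_line() {
  sed -r 's/<!--.*?-->//g'
}

# Made-up input with two comments on the same line.
# With GNU sed the match runs from the first "<!--" to the last "-->",
# so the word "keep" between the two comments is removed as well.
printf '%s\n' 'a <!-- x --> keep <!-- y --> b' | strip_one_line
```

If several comments can share a line with text between them, that text is lost, so this approach fits best when each line holds at most one comment.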
One problem with your original attempt is that your regex only handles comments that are entirely on one line. Also, the leading and trailing ".*" will remove non-comment text.
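Both problems can be reproduced on a small input. This is only a sketch: the helper names and the sample text are invented for the demonstration, and the expression is the one from the question, unchanged.

```shell
# The expression from the question, wrapped in a helper (the name is hypothetical).
strip_original() {
  sed 's/.*<!--\(.*\)-->.*//g'
}

# Made-up sample: one single-line comment with text around it,
# followed by a comment spread over three lines.
sample() {
  printf '%s\n' 'keep <!-- one-line comment --> me' '<!--' 'multi-line' '-->' 'last line'
}

# The first line comes out empty ("keep" and "me" are deleted too),
# while the multi-line comment passes through untouched.
sample | strip_original
```

The multi-line comment survives because sed applies the expression to one line at a time unless the script explicitly joins lines with `N`.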
You would be better off using existing code instead of rolling your own.
http://sed.sourceforge.net/grabbag/scripts/strip_html_comments.sed
#! /bin/sed -f
# Delete HTML comments
# i.e. everything between <!-- and -->
# by Stewart Ravenhall <stewart.ravenhall@ukonline.co.uk>
/<!--/!b
:a
/-->/!{
N
ba
}
s/<!--.*-->//
(from http://sed.sourceforge.net/grabbag/scripts/)
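For anyone who wants to try the script without downloading it, here is one way to run it, as a rough sketch. The temporary file, the helper name, and the sample input are assumptions made for the example; the sed commands are copied from the script above.

```shell
# Save the script body to a temporary file; mktemp keeps the path unique.
script=$(mktemp)
trap 'rm -f "$script"' EXIT
cat > "$script" <<'EOF'
/<!--/!b
:a
/-->/!{
N
ba
}
s/<!--.*-->//
EOF

# Hypothetical helper: strip comments from standard input using that script.
strip_comments() {
  sed -f "$script"
}

# Made-up input: a three-line comment followed by ordinary text.
printf '%s\n' '<!--' 'An HTML style comment' '-->' 'Some other text' | strip_comments
```

A comment that occupied whole lines leaves a single empty line behind. Piping the result through `sed '/^$/d'` would drop those, but it would also drop any blank lines that were in the file on purpose.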
See this link for various ways to use Perl modules for removing HTML comments (using Regexp::Common, HTML::Parser, or File::Comments). There are likely methods using other utilities as well.
http://www.perlmonks.org/?node_id=500603
I think you can do this with awk if you want. Start:
[~] $ more test.txt
<!--
An HTML style comment
-->
Some other text
<div>
<p>blah</p>
</div>
<!-- Whoops
Another comment -->
<span>Something</span>
Result of the awk:
[~]$ cat test.txt | awk '/<!--/ {off=1} /-->/ {off=2} {if (off==0) print; if (off==2) off=0}'
Some other text
<div>
<p>blah</p>
</div>
<span>Something</span>
Improving (hopefully) on the awk-based answer provided by eldarerathis: the code below addresses the concern raised by john-jones.
In this version, the text on the line before the start of the HTML comment is preserved, as is the text on the line after the close of the HTML comment.
$ cat some-file | awk '/<!--/ { mode=1; start=index($0,"<!--"); prefix=substr($0,1,start-1); } /-->/ { mode=2; start=index($0, "-->")+3; suffix=substr($0,start); print prefix suffix; prefix=""; suffix=""; } /./ { if (mode==0) print $0; if (mode==2) mode=0; }'
For example:
$ cat test.txt
<!--
An HTML style comment
-->
<meta charset="utf-8"> <!-- charset encoding must be within the first 1024 bytes of the document -->
Some other text
<div>
<p>blah</p>
</div>
<!-- Whoops
Another comment -->
<span>Something</span>
<div> <!-- start of foo -->
foo
</div> <!-- end of foo -->
<div> <!-- start of multiline comment
bar
end of multiline comment --> </div>
$ cat test.txt | awk '/<!--/ { mode=1; start=index($0,"<!--"); prefix=substr($0,1,start-1); } /-->/ { mode=2; start=index($0, "-->")+3; suffix=substr($0,start); print prefix suffix; prefix=""; suffix=""; } /./ { if (mode==0) print $0; if (mode==2) mode=0; }'
<meta charset="utf-8">
Some other text
<div>
<p>blah</p>
</div>
<span>Something</span>
<div>
foo
</div>
<div> </div>
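The same prefix/suffix idea can be pushed a bit further so that several comments on one line are handled. What follows is a variation, not the command above: it walks each line with `index` and `substr`, and it prints the text before and after a multi-line comment on their own lines instead of joining them. The function name and variable names are made up for the sketch.

```shell
# Hypothetical helper: remove HTML comments, keeping text outside them.
strip_awk() {
  awk '
  {
    line = $0; out = ""
    started_in = in_comment   # were we already inside a comment at line start?
    saw = 0                   # did a comment open on this line?
    while (line != "") {
      if (in_comment) {
        end = index(line, "-->")
        if (end == 0) { line = "" }                       # comment continues on next line
        else { line = substr(line, end + 3); in_comment = 0 }
      } else {
        start = index(line, "<!--")
        if (start == 0) { out = out line; line = "" }     # no more comments on this line
        else {
          out = out substr(line, 1, start - 1)            # keep text before the comment
          line = substr(line, start + 4)
          in_comment = 1; saw = 1
        }
      }
    }
    # Print kept text; also print lines that never touched a comment,
    # so blank lines from the original file survive.
    if (out != "" || (!started_in && !saw)) print out
  }'
}

printf '%s\n' 'a <!-- x --> b <!-- y --> c' '<div> <!-- start' 'bar' 'end --> </div>' 'tail' | strip_awk
```

Lines that consist only of comment text are dropped rather than replaced by empty lines.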