How can I remove a variable number of lines from the top and bottom of multiple HTML documents?
I have a large number of HTML documents that need a variable number of lines removed from the top and bottom. The part I want always starts with <div class="someclass"> and the bottom section always starts with <div class="bottomouter">. Something like this:
<html>
[...]
<div class="someclass"><!-- stuff i want to keep --></div>
<div class="bottomouter">[...]</div>
[...]
</html>
How could this be accomplished?
I'm working on a Linux box so I have access to Perl, Sed, Awk, &c. However, I don't know how to approach this (or if this is the right place to ask).
Edit: To clarify, I'm moving a bunch of static documents into a template system and they need the headers and footers removed.
How about a Perl script like this:
#!/usr/bin/perl -n
$output_enabled = 1 if (/^<div class="someclass">/);
$output_enabled = 0 if (/^<div class="bottomouter">/);
print if ($output_enabled);
The -n option tells perl to apply the script to each line of input, putting the line in the $_ variable (which is used implicitly in a lot of places in Perl; think of it like the word "it"). I set the $output_enabled variable (which persists across lines since it's a global variable, not declared with my) to 1 (true) if the current line matches the regex /^<div class="someclass">/, that is, if it starts with <div class="someclass">. Similarly, I set $output_enabled to 0 (false) if the line starts with <div class="bottomouter">. Finally, I print out the line if $output_enabled is true (it's initially false because it's undefined).
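For comparison, the same flag-toggling idea can be written in awk, which the question also lists as available. This is only a sketch: the function name keep_body and the sample input are made up for illustration, and it assumes (as the Perl script does) that both marker tags sit at the very start of a line.

```shell
#!/bin/sh
# keep_body (hypothetical name): print only the lines from the start
# marker up to, but not including, the end marker.
# The awk variable 'on' is empty (false) until the start marker is seen.
keep_body() {
    awk '
        /^<div class="someclass">/   { on = 1 }   # start printing here
        /^<div class="bottomouter">/ { on = 0 }   # stop before this line
        on                                        # print while the flag is set
    ' "$@"
}

# Made-up sample input, shaped like the example in the question.
keep_body <<'EOF'
<html>
<head><title>old header</title></head>
<div class="someclass">stuff to keep</div>
<div class="bottomouter">old footer</div>
</html>
EOF
```

Because the start-marker rule runs before the print rule, the opening <div class="someclass"> line is kept; because the end-marker rule clears the flag first, the <div class="bottomouter"> line is dropped.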
To print everything from the first delimiting line through the second, including both:
sed -n '/begPattern/,/endPattern/p'
To output the part of the file between the delimiting lines without including them:
sed '1,/<div class="someclass">/d;/<div class="bottomouter">/,$d' inputfile
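Since the question is about many files, a range command like these can be wrapped in a loop. The sketch below is one possible way to do it, not a tested migration script: the helper name strip_all and the directory arguments are hypothetical, and it assumes the documents are *.html files in a single directory. It uses a variant of the range command that keeps the opening <div class="someclass"> line but drops the <div class="bottomouter"> line, since the question says the wanted part starts with the former.

```shell
#!/bin/sh
# strip_all SRC_DIR OUT_DIR (hypothetical helper)
# Writes a trimmed copy of every SRC_DIR/*.html into OUT_DIR,
# leaving the original files untouched.
strip_all() {
    src=$1
    out=$2
    mkdir -p "$out"
    for f in "$src"/*.html; do
        [ -e "$f" ] || continue    # glob matched nothing
        # Within the start..end range, print every line except the end marker.
        sed -n '/<div class="someclass">/,/<div class="bottomouter">/{/<div class="bottomouter">/!p;}' \
            "$f" > "$out/$(basename "$f")"
    done
}

# Example call (paths are placeholders):
# strip_all ./static ./stripped
```

Writing to a separate directory rather than editing in place makes it easy to diff a few results against the originals before discarding anything.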