Sed program - deleted strings reappearing?
I'm stumped. I have an HTML file that I'm trying to convert to plain text, and I'm using sed to clean it up. I understand that sed works on the 'stream' one line at a time, but there are ways to match multiline patterns. The input looks like this:
<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>
<span class="region">Region</span>
<span class="postal-code">1A1 A1A</span>
<span class="email">my@email.ca</span>
<span class="tel">000-000-0000</span>
I would like this to be made into the following plaintext format:
My Name
123 street
City Region 1A1 A1A
my@email.ca
000-000-0000
The key is that City, Region, and Post code are all on one line now.
I use
sed -f commands.sed file.html > output.txt
and I believe that the following sed program (commands.sed) should put it in that format:
#using the '@' symbol as delimiter instead of '/'
#remove tags
s@<.*>\(.*\)</.*>@\1@g
#remove the nbsp
s@\(&nbsp;\)*@@g
#add a newline before the address (actually typing a newline in the file)
s@\(123 street\)@\
\1@g
#and now the command that matches multiline patterns
#find 'City',read in the next two lines, and separate them with spaces
/City/ {
N
N
s@\(.*\)\n\(.*\)\n\(.*\)@\1 \2 \3@g
}
Seems to make sense. Tags are all stripped and then three lines are put into one.
Buuuuut it doesn't work that way. Here is the result I get:
My Name
123 street
City <span class="region">Region</span> <span class="postal-code">1A1 A1A</span>
my@email.ca
000-000-0000
To my (relatively inexperienced) eyes, it looks like sed is 'forgetting' the changes it made (stripping off the tags). How would I solve this? Is the solution to write the file after three commands and re-run sed for the fourth? Am I misusing sed? Am I misunderstanding the 'stream' part?
I'm running Mac OS X 10.4.11 with the bash shell and using the version of sed that comes with it.
I think you're confused. Sed operates line-by-line, and runs all commands on the line before moving to the next. You seem to be assuming it strips the tags on all lines, then goes back and runs the rest of the commands on the stripped lines. That's simply not the case.
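To illustrate: N appends raw input lines that the earlier s commands have never seen, so one way around it is to join the lines first and strip the tags afterwards. Here is a minimal sketch, assuming the address block is always the locality line plus the two lines after it. The variable names html and result and the class="locality" address are my own choices for illustration, and the tag-stripping expression is a simpler one than in the question.

```shell
# Sample input copied from the question (held in a variable for illustration).
html='<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>
<span class="region">Region</span>
<span class="postal-code">1A1 A1A</span>
<span class="email">my@email.ca</span>
<span class="tel">000-000-0000</span>'

# Order matters: join the three address lines FIRST, then strip tags,
# so the substitution sees everything that N pulled into the pattern space.
result=$(printf '%s\n' "$html" | sed '
/class="locality"/ {
    N
    N
    s/\n/ /g
}
s/<[^>]*>//g
')

printf '%s\n' "$result"
```

With the sample input this should print the five lines the question asks for. The &nbsp; removal and the extra newline before the street address are left out to keep the focus on command order.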
See RegEx match open tags except XHTML self-contained tags ... and stop using sed for this.
Sed is a wonderful tool, but not for processing HTML. I suggest using Python and BeautifulSoup, which is basically built just for this sort of task.
If you have only one data block per HTML file, try the following (using sed):
kent$ cat t
<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>
<span class="region">Region</span>
<span class="postal-code">1A1 A1A</span>
<span class="email">my@email.ca</span>
<span class="tel">000-000-0000</span>
kent$ sed 's/<[^>]*>//g; s/&nbsp;//g' t |sed '1G;3{N;N; s/\n/ /g}'
My Name
123 street
City Region 1A1 A1A
my@email.ca
000-000-0000
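If the one-liner is hard to parse, here is the same two-stage idea spelled out as a commented script. It is only a sketch: the function name strip_and_join is made up, it reads standard input rather than a named file, and it assumes (as the one-liner does) that the name ends up on line 1 and the city on line 3 of the stripped text.

```shell
# Hypothetical helper: stage 1 deletes tags and &nbsp; entities line by line,
# stage 2 works on the already-clean text using line numbers.
strip_and_join() {
    sed 's/<[^>]*>//g; s/&nbsp;//g' | sed '
# line 1: G appends the (empty) hold space, i.e. a blank line after the name
1G
# line 3: read the next two lines and replace embedded newlines with spaces
3 {
    N
    N
    s/\n/ /g
}
'
}
```

Usage would be something like strip_and_join < file.html > output.txt. Because the second sed only ever sees stripped text, the order-of-commands problem from the question does not arise, at the cost of hard-coding the line numbers.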