Sed program - deleted strings reappearing?

I'm stumped. I have an HTML file that I'm trying to convert to plain text and I'm using sed to clean it up. I understand that sed works on the 'stream' and works one line at a time, but there are ways to match multiline patterns.

Here is the relevant section of my source file:

<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>&nbsp;
<span class="region">Region</span>&nbsp;&nbsp;
<span class="postal-code">1A1 A1A</span>
<span class="email">my@email.ca</span>
<span class="tel">000-000-0000</span>

I would like this to be made into the following plaintext format:

My Name

123 street
City Region  1A1 A1A
my@email.ca
000-000-0000

The key is that City, Region, and Post code are all on one line now.

I use sed -f commands.sed file.html > output.txt and I believe that the following sed program (commands.sed) should put it in that format:

#using the '@' symbol as delimiter instead of '/'
#remove tags
s@<.*>\(.*\)</.*>@\1@g
#remove the nbsp
s@\(&nbsp;\)*@@g
#add a newline before the address (actually typing a newline in the file)
s@\(123 street\)@\
\1@g
#and now the command that matches multiline patterns
#find 'City',read in the next two lines, and separate them with spaces
/City/ {
N
N
s@\(.*\)\n\(.*\)\n\(.*\)@\1 \2  \3@g
}

Seems to make sense. Tags are all stripped and then three lines are put into one.

Buuuuut it doesn't work that way. Here is the result I get:

My Name

123 street
City <span class="region">Region</span>&nbsp;&nbsp;  <span class="postal-code">1A1 A1A</span>
my@email.ca
000-000-0000

To my (relatively inexperienced) eyes, it looks like sed is 'forgetting' the changes it made (stripping off the tags). How would I solve this? Is the solution to write the file after three commands and re-run sed for the fourth? Am I misusing sed? Am I misunderstanding the 'stream' part?

I'm running Mac OS X 10.4.11 with the bash shell and using the version of sed that comes with it.


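I think you're confused about sed's processing cycle. Sed reads one line, runs every command in the script on it, and only then moves on to the next line. You seem to be assuming it strips the tags from all lines first, then goes back and runs the rest of the commands on the stripped lines. That's simply not the case: by the time N pulls in the Region and postal-code lines, the tag-stripping commands have already run for that cycle, so the appended lines never get stripped.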
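A minimal sketch of that behaviour, using a made-up two-line input rather than the asker's file: the same two commands are run in both orders. Only the order in which the script lists them changes the result.

```shell
# Strip first, then join: the line appended by N has already missed the strip.
printf '%s\n' '<b>one</b>' '<b>two</b>' | sed '
s/<[^>]*>//g
/one/{
N
s/\n/ /
}'
# prints: one <b>two</b>

# Join first, then strip: the strip now sees both lines in the pattern space.
printf '%s\n' '<b>one</b>' '<b>two</b>' | sed '
/one/{
N
s/\n/ /
}
s/<[^>]*>//g'
# prints: one two
```

The first form is what the script in the question does, which is why the Region and postal-code tags survive.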


See RegEx match open tags except XHTML self-contained tags ... and stop using sed for this.

Sed is a wonderful tool, but not for processing HTML. I suggest using Python and BeautifulSoup, which is basically built just for this sort of task.


If you have only one data block per file, try the following (using sed):

kent$  cat t
<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>&nbsp;
<span class="region">Region</span>&nbsp;&nbsp;
<span class="postal-code">1A1 A1A</span>
<span class="email">my@email.ca</span>
<span class="tel">000-000-0000</span>

kent$  sed 's/<[^>]*>//g; s/&nbsp;//g' t |sed '1G;3{N;N; s/\n/ /g}'
My Name

123 street
City Region 1A1 A1A
my@email.ca
000-000-0000
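The same result can probably be had from a single sed run in the `sed -f commands.sed file.html` form the question uses. This is only a sketch: it assumes the input looks exactly like the sample, and matching on `class="locality"` instead of a fixed line number is my choice, not something from the answers above. The join is done before the tags are stripped, so the stripping sees all three lines at once.

```shell
# Recreate the sample input from the question.
cat > file.html <<'EOF'
<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>&nbsp;
<span class="region">Region</span>&nbsp;&nbsp;
<span class="postal-code">1A1 A1A</span>
<span class="email">my@email.ca</span>
<span class="tel">000-000-0000</span>
EOF

cat > commands.sed <<'EOF'
# Join the locality line with the next two lines BEFORE stripping anything.
/class="locality"/{
N
N
s/\n/ /g
}
# Strip tags; [^>]* keeps each match inside a single tag.
s/<[^>]*>//g
# Remove the non-breaking-space entities.
s/&nbsp;//g
# Append the (empty) hold space to line 1, giving a blank line after the name.
1G
EOF

sed -f commands.sed file.html
```

With the sample input this prints the same six lines as the two-stage pipeline above. Note that City, Region and the postal code end up separated by single spaces, not the two spaces before the postal code shown in the question.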