Sed program - deleted strings reappearing?

2023-04-12 06:39

I'm stumped. I have an HTML file that I'm trying to convert to plain text and I'm using sed to clean it up. I understand that sed works on the 'stream' and works one line at a time, but there are ways to match multiline patterns.

Here is the relevant section of my source file:

<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>&nbsp;
<span class="region">Region</span>&nbsp;&nbsp;
<span class="postal-code">1A1 A1A</span>
<span class="email">my@email.ca</span>
<span class="tel">000-000-0000</span>

I would like this to be made into the following plaintext format:

My Name

123 street
City Region  1A1 A1A
my@email.ca
000-000-0000

The key is that City, Region, and Post code are all on one line now.

I use sed -f commands.sed file.html > output.txt and I believe that the following sed program (commands.sed) should put it in that format:

#using the '@' symbol as delimiter instead of '/'
#remove tags
s@<.*>\(.*\)</.*>@\1@g
#remove the nbsp
s@\(&nbsp;\)*@@g
#add a newline before the address (actually typing a newline in the file)
s@\(123 street\)@\
\1@g
#and now the command that matches multiline patterns
#find 'City',read in the next two lines, and separate them with spaces
/City/ {
N
N
s@\(.*\)\n\(.*\)\n\(.*\)@\1 \2  \3@g
}

Seems to make sense. Tags are all stripped and then three lines are put into one.

Buuuuut it doesn't work that way. Here is the result I get:

My Name

123 street
City <span class="region">Region</span>&nbsp;&nbsp;  <span class="postal-code">1A1 A1A</span>
my@email.ca
000-000-0000

To my (relatively inexperienced) eyes, it looks like sed is 'forgetting' the changes it made (stripping off the tags). How would I solve this? Is the solution to write the file after three commands and re-run sed for the fourth? Am I misusing sed? Am I misunderstanding the 'stream' part?

I'm running Mac OS X 10.4.11 with the bash shell and using the version of sed that comes with it.

I think you're confused. Sed operates line-by-line, and runs all commands on the line before moving to the next. You seem to be assuming it strips the tags on all lines, then goes back and runs the rest of the commands on the stripped lines. That's simply not the case.
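To see that cycle in isolation, here is a minimal sketch with a made-up two-line input (the `<b>` tags and the `/one/` address are illustrative, not taken from the question's file). Lines pulled in by `N` only pass through the commands that come after the `N` in the script, so a tag-stripping command placed earlier never sees them. Moving the stripping after the join is one way around that.

```shell
# Strip first, join second: the line appended by N is never stripped.
out=$(printf '<b>one</b>\n<b>two</b>\n' | sed -e 's/<[^>]*>//g' -e '/one/{
N
s/\n/ /
}')
printf '%s\n' "$out"      # one <b>two</b>

# Join first, strip second: the stripping now sees both lines at once.
fixed=$(printf '<b>one</b>\n<b>two</b>\n' | sed -e '/one/{
N
s/\n/ /
}' -e 's/<[^>]*>//g')
printf '%s\n' "$fixed"    # one two
```

The first result has the same shape as the output in the question: the first line is clean and the appended line still carries its tags.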

See RegEx match open tags except XHTML self-contained tags ... and stop using sed for this.

Sed is a wonderful tool, but not for processing HTML. I suggest using Python and BeautifulSoup, which is basically built just for this sort of task.

If you have only one data block per HTML file, try the following (using sed):

kent$  cat t
<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>&nbsp;
<span class="region">Region</span>&nbsp;&nbsp;
<span class="postal-code">1A1 A1A</span>
<span class="email">my@email.ca</span>
<span class="tel">000-000-0000</span>

kent$  sed 's/<[^>]*>//g; s/&nbsp;//g' t |sed '1G;3{N;N; s/\n/ /g}'
My Name

123 street
City Region 1A1 A1A
my@email.ca
000-000-0000
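The same result can probably be had from a single sed invocation by joining the three address lines before stripping anything. The following is a sketch under assumptions, not a drop-in replacement: it keys on the `class="locality"` attribute instead of line number 3 (my choice, so the join does not depend on where the block sits), and it assumes the region and postal-code lines always follow directly.

```shell
# Sample input copied from the question.
html='<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>&nbsp;
<span class="region">Region</span>&nbsp;&nbsp;
<span class="postal-code">1A1 A1A</span>
<span class="email">my@email.ca</span>
<span class="tel">000-000-0000</span>'

# 1. On the locality line, pull in the next two lines and join them with spaces.
# 2. Strip tags and &nbsp; entities from whatever is in the pattern space.
# 3. Append an empty line after input line 1 (G appends the empty hold space).
result=$(printf '%s\n' "$html" | sed -e '/class="locality"/{
N
N
s/\n/ /g
}' -e 's/<[^>]*>//g' -e 's/&nbsp;//g' -e '1G')

printf '%s\n' "$result"
```

This prints the same six lines as the two-stage pipeline, with a single space between Region and the postal code. To get the two spaces shown in the question, the join step would need to replace the two newlines separately.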
