SED - unable to execute some commands on UTF-8 encoded chars

I have a file that looks like this:

<text top="123" left="45" width="50" height="17" font="8">Måndag</text>

As noted in the title, this file is encoded in UTF-8. When using this command:

cat file | sed 's_.*top="\([0-9][0-9]*\)" left="\([0-9][0-9]*\)".*>\(.*\)<.*_\1 \2 \3_'

it never finishes executing and prints nothing.

However executing a line like this one:

cat file | sed 's/å/FOO/'

gives me a correct output:

<text top="123" left="45" width="50" height="17" font="8">MFOOndag</text>

Is this a bug in sed or is there something wrong with my regex or the way that I'm using it? What I want is a neat way to extract the top, left and content data without involving too many commands.


The easiest way to do this reliably is just to use perl in place of sed:

bash$ perl -CSAD -pe 's/foo/bar/g'

That will allow Unicode in your arguments, your std streams, and all files you process.


Not all seds are built to handle UTF-8. I would look at the source to see if any relevant patches have been applied. FTR, Red Hat-derived seds do handle UTF-8 properly.
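If switching tools is not an option, one workaround worth trying is to run sed in the C locale, where every byte counts as one character and no multibyte decoding takes place. This is a sketch under the assumption that the failure is related to multibyte handling, which has not been confirmed for the asker's sed build. The temporary file and variable names are hypothetical.

```shell
# Hypothetical sample file holding the line from the question.
sample_file=$(mktemp)
printf '%s\n' '<text top="123" left="45" width="50" height="17" font="8">Måndag</text>' > "$sample_file"

# LC_ALL=C applies to this one sed invocation only: "." then matches
# any single byte, so the UTF-8 bytes of "å" pass through untouched.
extracted=$(LC_ALL=C sed 's_.*top="\([0-9][0-9]*\)" left="\([0-9][0-9]*\)".*>\(.*\)<.*_\1 \2 \3_' "$sample_file")

printf '%s\n' "$extracted"
rm -f "$sample_file"
```

The trade-off is that sed no longer knows about characters, only bytes, so expressions that count or class individual characters (for example `.` meant to match exactly one letter) can behave differently on non-ASCII text.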


Try this suggestion. Looks like it could work for you.
