Using sed to remove new line characters from the middle of a field

The data I have looks something like this:

a,b,c,d
a,b1
b2,c,d
A,B,C,D

What is happening is that field 2 occasionally contains a newline character, so the record gets split over two lines.

So far I have found a sed script that will do this. It reads:

cat file| sed ':a;N;$!ba;s/\(\(b1\)\)\n/\1/g'

but I am struggling to find the right combination of (.*,) to make this work in general, so I have substituted b1 to get this example working. In the real data, A, B, C and D are fields of mixed length and content.

The end result I'm looking for is this:

a,b,c,d
a,b1b2,c,d
A,B,C,D

Any help is much appreciated

Thanks Matt


I have a solution here that is not yet perfect, but I will think about it further. If your version of sed supports extended regular expressions, you could do:

cat file | sed -r ':a;N;$!ba;s/((^|\n)([^,]+,){0,2}[^,]+)\n/\1/g'

That will work if there is a line break in one of the first three columns. So far it does not handle multiple line breaks within a single record.

Explanation:
:a;N;$!ba reads the whole file into the pattern space, so the s command sees the line breaks as \n
(^|\n) matches the beginning of the pattern space (the first row) or a line break (the start of any later row)
[^,]+, means: at least one character (+ means one or more) that is not a ",", followed by a ","
([^,]+,){0,2} matches 0 to 2 complete columns, each terminated by a ","
The following [^,]+ means that at least one more character that is not a "," comes after those 0 to 2 columns.
The trailing \n matches a line break

In summary, the s command matches every row that holds only 1 to 3 columns followed by a line break, and replaces it with itself (\1) minus the trailing line break.
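
To see the substitution in action, here is a small sketch that feeds the sample data from the question through the same command. It assumes GNU sed (for -r and for the one-line label syntax); the variable name result is only there for illustration.

```shell
# Sample input from the question, with the second record split after "b1".
# Assumes GNU sed; "result" is a hypothetical variable name.
result=$(printf '%s\n' 'a,b,c,d' 'a,b1' 'b2,c,d' 'A,B,C,D' |
    sed -r ':a;N;$!ba;s/((^|\n)([^,]+,){0,2}[^,]+)\n/\1/g')

# Show the repaired rows.
printf '%s\n' "$result"
```

On this input the short row "a,b1" should be joined with the row that follows it, while the complete rows are left alone.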


In awk

awk -F, 'NF < 4 {getline nextline; $0 = $0 nextline} 1' filename
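
The one-liner above joins at most one following line per record. A possible variation, sketched here under the assumption that every complete record has exactly 4 fields and that the break never falls inside the last field, keeps appending lines until the record is complete. The sample input with two breaks in field 2 is made up for illustration.

```shell
# Hypothetical input: the second record is split twice inside field 2.
result=$(printf '%s\n' 'a,b,c,d' 'a,b1' 'b2' 'b3,c,d' 'A,B,C,D' |
    awk -F, '
    {
        # Keep appending the next physical line while the record is short
        # and there is still input left to read.
        while (NF < 4 && (getline nextline) > 0)
            $0 = $0 nextline
        print
    }')

# Show the repaired rows.
printf '%s\n' "$result"
```

Assigning to $0 makes awk split the fields again, so NF reflects the joined record on each pass through the loop.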