Using sed to remove new line characters from the middle of a field
The data I have looks something like this:
a,b,c,d
a,b1
b2,c,d
A,B,C,D
What is happening is that field 2 occasionally contains a newline character, so the record gets split over two lines.
So far I have found a sed script that will do this. It reads:
cat file| sed ':a;N;$!ba;s/\(\(b1\)\)\n/\1/g'
However, I am struggling to find the right combination of (.*,) to make this work in general, so I've substituted b1 to get the example working. In the real data, A, B, C and D are fields of mixed length and content.
The end result I'm looking for is this:
a,b,c,d
a,b1b2,c,d
A,B,C,D
Any help is much appreciated
Thanks Matt
Here is a solution that is not yet perfect, but I will think about it further. If your version of sed supports extended regular expressions (the -r option in GNU sed), you could do:
cat file | sed -r ':a;N;$!ba;s/((^|\n)([^,]+,){0,2}[^,]+)\n/\1/g'
This works if there is a line break in one of the first three columns. It does not yet handle multiple line breaks within a single record.
Explanation:
(^|\n) matches the beginning of the input or a line break, i.e. the start of a line.
[^,]+, means at least one character other than "," (+ means one or more), followed by a ",".
([^,]+,){0,2} matches 0 to 2 columns, each terminated by a ",".
The [^,]+ that follows matches one or more characters other than "," after those 0 to 2 columns.
The trailing \n matches a line break.
In summary, the s command matches every line that contains 1 to 3 columns and ends with a line break, and replaces it with itself (\1) minus the trailing line break.
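For records that are broken more than once, a different technique may be easier than extending the regex: count commas and keep appending the next line until three of them are present. This is only a sketch, assuming every record has exactly 4 fields and no field contains a comma. The function name join_by_comma_count is made up for illustration.

```shell
# Hypothetical helper: join physical lines until a record holds
# 3 commas (= 4 fields). Assumes no commas inside field values.
join_by_comma_count() {
    sed -E \
        -e ':a' \
        -e '/^([^,]*,){3}/!{' \
        -e '$!{' \
        -e 'N' \
        -e 's/\n//' \
        -e 'ba' \
        -e '}' \
        -e '}' \
        "$@"
}

# Example: the second record is split across three lines.
printf 'a,b1\nb2,c\n,d\nA,B,C,D\n' | join_by_comma_count
```

Unlike the slurping version, this processes one record at a time, so it does not need to hold the whole file in the pattern space. A line break inside the last field cannot be detected this way, because the record already has its three commas at that point.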
In awk:
awk -F, 'NF < 4 {getline nextline; $0 = $0 nextline} 1' filename
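The one-liner above joins at most one continuation line per record. A possible variation, sketched here under the assumption that every record has exactly 4 fields and no field contains a comma, keeps reading until the field count is reached and stops cleanly at end of input. The names join_broken_lines and want are made up for illustration.

```shell
# Hypothetical helper: append following lines while the record is short.
# Reads the named files, or standard input when none are given.
join_broken_lines() {
    awk -F, -v want=4 '
        {
            # Assigning to $0 re-splits the record, so NF is updated
            # after every join. getline returns 0 at end of input.
            while (NF < want && (getline nextline) > 0)
                $0 = $0 nextline
            print
        }
    ' "$@"
}

# Example using the data from the question.
printf 'a,b,c,d\na,b1\nb2,c,d\nA,B,C,D\n' | join_broken_lines
```

As with any approach based on the field count, a line break inside the last field goes unnoticed, since the record already looks complete when the break occurs.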