parsing issue with comma separated csv file
I am trying to extract 4th column from csv file (comma separated, and skipping first 2 header lines) using this command,
awk 'NR <2 {next}{FS =","}{print $4}' filename.csv | more
However, it doesn't work because the first column cantains comma, thus 4th column is not 开发者_运维知识库really 4th. Below is an example of a row:
"sdfsdfsd, sfsdf", 454,fgdfg, I_want_this_column,sdfgdg,34546, 456465, etc
Unless you have specific reasons for using awk
, I would recommend using a CSV parsing library. Many scripting languages have one built-in (or at least available) and they'll save you from these headaches.
if your first column has quotes always,
$ awk 'BEGIN{ FS="\042[ ]*," } { m=split($2,a,","); print a[3] } ' file
I_want_this_column
if the column you want is always the last 2nd,
$ awk -F"," '{print $(NF-1)}' file
I_want_this_column
You can try this demo script to break down the columns
awk 'BEGIN{ FS="," }
{
for(i=1;i<=NF;i++){
# save normal
if($i !~ /^[ ]*\042|[ ]*\042[ ]*$/){
a[++j]=$i
}
# if quotes at the end
if(f==1 && $i ~ /[ ]*\042[ ]*$/){
s=s","$i
a[++j]=s
#reset
s="";f=0
}
# if quotes in front
if($i ~ /^[ ]*\042/){
s=s $i
f=1
}
if(f==1 && ( $i !~/\042/ ) ){
s=s","$i
}
}
}
END{
# print columns
for(p=1;p<=j;p++){
print "Field "p,": "a[p]
}
} ' file
output
$ cat file
"sdfsdfsd, sfsdf", "454,fgdfg blah , words ", I_want_this_column,sdfgdg
$ ./shell.sh
Field 1 : "sdfsdfsd, sfsdf"
Field 2 : fgdfg blah
Field 3 : "454,fgdfg blah , words "
Field 4 : I_want_this_column
Field 5 : sdfgdg
You shouldn't use awk here. Use Python csv module or Perl Text::CSV or Text::CSV_XS modules or another real csv parser.
Related question - parse csv file using gawk
If you can't avoid awk, this piece of code does the job you need:
BEGIN {FS=",";}
{
f=0;
j=0;
for (i = 1; i <=NF ; ++i) {
if (f) {
a[j] = a[j] "," $(i);
if ($(i) ~ "\"$") {
f = 0;
}
}
else {
++j;
a[j] = $(i);
if ((a[j] ~ "^\"[^\"]*$")) {
f = 1;
}
}
}
for (i = 1; i <= j; ++i) {
gsub("^\"","",a[i]);
gsub("\"$","",a[i]);
gsub("\"\"","\"",a[i]);
print "i = \"" a[i] "\"";
}
}
Working with CSV files that have quoted fields with commas inside can be difficult with the standard UNIX text tools.
I wrote a program called csvquote to make the data easy for them to handle. In your case, you could use it like this:
csvquote filename.csv | awk 'NR <2 {next}{FS =","}{print $4}' | csvquote -u | more
or you could use cut and tail like this:
csvquote filename.csv | tail -n +3 | cut -d, -f4 | csvquote -u | more
The code and docs are here: https://github.com/dbro/csvquote
精彩评论