How to match a column name and find out the column position in awk?
I am trying to parse some csv files using awk. I am new to s开发者_运维技巧hell scripting and awk. The csv file i am working on looks something like this :
fnName,minAccessTime,maxAccessTime
getInfo,300,600
getStage,600,800
getStage,600,800
getInfo,250,620
getInfo,200,700
getStage,700,1000
getInfo,280,600
I need to find the average AccessTimes of the different functions.
I have been working with awk and have been able to get the average times provided the exact column numbers are specified like $2, $3 etc.
However I need to have a general script in which if i input "minAccessTime" in the command argument, I need the script to print the average AccessTime (instead of explicitly specifying $2 or $3 while using awk).
I have been googling about this and saw in various forums but none of them seems to work. Can someone tell me how to do this ? It would be of great help !
Thanks in advance!!
This awk
script should give you all that you want.
It first evaluates which column you're interested in by using the name passed in as the COLM
variable and checking against the first line. It converts this into an index (it's left as the default 0 if it couldn't find the column).
It then basically runs through all other lines in your input file. On all these other lines (assuming you've specified a valid column), it updates the count, sum, minimum and maximum for both the overall data plus each individual function name.
The former is stored in count
, sum
, min
and max
. The latter are stored in associative arrays with similar names (with _arr
appended).
Then, once all records are read, the END
section outputs the information.
NR == 1 {
for (i = 1; i <= NF; i++) {
if ($i == COLM) {
cidx = i;
}
}
}
NR > 1 {
if (cidx > 0) {
count++;
sum += $cidx;
if (count == 1) {
min = $cidx;
max = $cidx;
} else {
if ($cidx < min) { min = $cidx; }
if ($cidx > max) { max = $cidx; }
}
count_arr[$1]++;
sum_arr[$1] += $cidx;
if (count_arr[$1] == 1) {
min_arr[$1] = $cidx;
max_arr[$1] = $cidx;
} else {
if ($cidx < min_arr[$1]) { min_arr[$1] = $cidx; }
if ($cidx > max_arr[$1]) { max_arr[$1] = $cidx; }
}
}
}
END {
if (cidx == 0) {
print "Column '" COLM "' does not exist"
} else {
print "Overall:"
print " Total records = " count
print " Sum of column = " sum
if (count > 0) {
print " Min of column = " min
print " Max of column = " max
print " Avg of column = " sum / count
}
for (task in count_arr) {
print "Function " task ":"
print " Total records = " count_arr[task]
print " Sum of column = " sum_arr[task]
print " Min of column = " min_arr[task]
print " Max of column = " max_arr[task]
print " Avg of column = " sum_arr[task] / count_arr[task]
}
}
}
Storing that script into qq.awk
and placing your sample data into qq.in
, then running:
awk -F, -vCOLM=minAccessTime -f qq.awk qq.in
generates the following output, which I'm relatively certain will give you every possible piece of information you need:
Overall:
Total records = 7
Sum of column = 2930
Min of column = 200
Max of column = 700
Avg of column = 418.571
Function getStage:
Total records = 3
Sum of column = 1900
Min of column = 600
Max of column = 700
Avg of column = 633.333
Function getInfo:
Total records = 4
Sum of column = 1030
Min of column = 200
Max of column = 300
Avg of column = 257.5
For `maxAccessTime, you get:
Overall:
Total records = 7
Sum of column = 5120
Min of column = 600
Max of column = 1000
Avg of column = 731.429
Function getStage:
Total records = 3
Sum of column = 2600
Min of column = 800
Max of column = 1000
Avg of column = 866.667
Function getInfo:
Total records = 4
Sum of column = 2520
Min of column = 600
Max of column = 700
Avg of column = 630
And, for xyzzy
(a non-existent column), you'll see:
Column 'xyzzy' does not exist
If I understand the requirements correctly, you want the average of a column, and you'd like to specify the column by name.
Try the following script (avg.awk
):
BEGIN {
FS=",";
}
NR == 1 {
for (i=1; i <= NF; ++i) {
if ($i == SELECTED_FIELD) {
SELECTED_COL=i;
}
}
}
NR > 1 && $1 ~ SELECTED_FNAME {
sum[$1] = sum[$1] + $SELECTED_COL;
count[$1] = count[$1] + 1;
}
END {
for (f in sum) {
printf("Average %s for %s: %d\n", SELECTED_FIELD, f, sum[f] / count[f]);
}
}
and invoke your script like this
awk -v SELECTED_FIELD=minAccessTime -f avg.awk < data.csv
or
awk -v SELECTED_FIELD=maxAccessTime -f avg.awk < data.csv
or
awk -v SELECTED_FIELD=maxAccessTime -v SELECTED_FNAME=getInfo -f avg.awk < data.csv
EDIT:
Rewritten to group by function name (assumed to be first field)
EDIT2:
Rewritten to allow additional parameter to filter by function name (assumed to be first field)
精彩评论