How to match a column name and find out the column position in awk?

2023-03-29 22:46 问答作者：

I am trying to parse some csv files using awk. I am new to s开发者_运维技巧hell scripting and awk. The csv file i am working on looks something like this :

fnName,minAccessTime,maxAccessTime
getInfo,300,600
getStage,600,800
getStage,600,800
getInfo,250,620
getInfo,200,700
getStage,700,1000
getInfo,280,600

I need to find the average AccessTimes of the different functions.

I have been working with awk and have been able to get the average times provided the exact column numbers are specified like $2, $3 etc.

However I need to have a general script in which if i input "minAccessTime" in the command argument, I need the script to print the average AccessTime (instead of explicitly specifying $2 or $3 while using awk).

I have been googling about this and saw in various forums but none of them seems to work. Can someone tell me how to do this ? It would be of great help !

Thanks in advance!!

This awk script should give you all that you want.

It first evaluates which column you're interested in by using the name passed in as the COLM variable and checking against the first line. It converts this into an index (it's left as the default 0 if it couldn't find the column).

It then basically runs through all other lines in your input file. On all these other lines (assuming you've specified a valid column), it updates the count, sum, minimum and maximum for both the overall data plus each individual function name.

The former is stored in count, sum, min and max. The latter are stored in associative arrays with similar names (with _arr appended).

Then, once all records are read, the END section outputs the information.

NR == 1 {
    for (i = 1; i <= NF; i++) {
        if ($i == COLM) {
            cidx = i;
        }
    }
}

NR > 1 {
    if (cidx > 0) {
        count++;
        sum += $cidx;
        if (count == 1) {
            min = $cidx;
            max = $cidx;
        } else {
            if ($cidx < min) { min = $cidx; }
            if ($cidx > max) { max = $cidx; }
        }

        count_arr[$1]++;
        sum_arr[$1] += $cidx;
        if (count_arr[$1] == 1) {
            min_arr[$1] = $cidx;
            max_arr[$1] = $cidx;
        } else {
            if ($cidx < min_arr[$1]) { min_arr[$1] = $cidx; }
            if ($cidx > max_arr[$1]) { max_arr[$1] = $cidx; }
        }
    }
}

END {
    if (cidx == 0) {
        print "Column '" COLM "' does not exist"
    } else {
        print "Overall:"
        print "   Total records = " count
        print "   Sum of column = " sum
        if (count > 0) {
            print "   Min of column = " min
            print "   Max of column = " max
            print "   Avg of column = " sum / count
        }
        for (task in count_arr) {
            print "Function " task ":"
            print "   Total records = " count_arr[task]
            print "   Sum of column = " sum_arr[task]
            print "   Min of column = " min_arr[task]
            print "   Max of column = " max_arr[task]
            print "   Avg of column = " sum_arr[task] / count_arr[task]
        }
    }
}

Storing that script into qq.awk and placing your sample data into qq.in, then running:

awk -F, -vCOLM=minAccessTime -f qq.awk qq.in

generates the following output, which I'm relatively certain will give you every possible piece of information you need:

Overall:
   Total records = 7
   Sum of column = 2930
   Min of column = 200
   Max of column = 700
   Avg of column = 418.571
Function getStage:
   Total records = 3
   Sum of column = 1900
   Min of column = 600
   Max of column = 700
   Avg of column = 633.333
Function getInfo:
   Total records = 4
   Sum of column = 1030
   Min of column = 200
   Max of column = 300
   Avg of column = 257.5

For `maxAccessTime, you get:

Overall:
   Total records = 7
   Sum of column = 5120
   Min of column = 600
   Max of column = 1000
   Avg of column = 731.429
Function getStage:
   Total records = 3
   Sum of column = 2600
   Min of column = 800
   Max of column = 1000
   Avg of column = 866.667
Function getInfo:
   Total records = 4
   Sum of column = 2520
   Min of column = 600
   Max of column = 700
   Avg of column = 630

And, for xyzzy (a non-existent column), you'll see:

Column 'xyzzy' does not exist

If I understand the requirements correctly, you want the average of a column, and you'd like to specify the column by name.

Try the following script (avg.awk):

BEGIN {
  FS=",";
}

NR == 1 {
  for (i=1; i <= NF; ++i) {
    if ($i == SELECTED_FIELD) {
      SELECTED_COL=i;
    }
  }
}

NR > 1 && $1 ~ SELECTED_FNAME {
  sum[$1] = sum[$1] + $SELECTED_COL;
  count[$1] = count[$1] + 1;
}

END {
  for (f in sum) {
    printf("Average %s for %s: %d\n", SELECTED_FIELD, f, sum[f] / count[f]);
  }
}

and invoke your script like this

awk -v SELECTED_FIELD=minAccessTime -f avg.awk < data.csv

awk -v SELECTED_FIELD=maxAccessTime -f avg.awk < data.csv

awk -v SELECTED_FIELD=maxAccessTime -v SELECTED_FNAME=getInfo -f avg.awk < data.csv

EDIT:

Rewritten to group by function name (assumed to be first field)

EDIT2:

Rewritten to allow additional parameter to filter by function name (assumed to be first field)

继续阅读：shell

How to match a column name and find out the column position in awk?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？