How do I get all statistics-related information from a file, using awk?
I am trying to parse some CSV files using awk
.
The CSV file I am working with looks like this:
fnName,minAccessTime,maxAccessTime
getInfo,300,600
getStage,600,800
getStage,600,800
getInfo,250,620
getInfo,200,700
getStage,700,1000
getInfo,280,600
I need to find the minimum, maximum and average figures for columns 2 and 3, both across all data and individual functi开发者_JAVA技巧ons.
I realize you're not looking for non-awk solution, but I thought I'd share some R
code to demonstrate how seamless it is to summarize data.
# read in data
awk <- read.table(textConnection("fnName,minAccessTime,maxAccessTime
getInfo,300,600
getStage,600,800
getStage,600,800
getInfo,250,620
getInfo,200,700
getStage,700,1000
getInfo,280,600"), header = TRUE, sep = ",")
# split according to the function
awk.split <- split(awk, awk$fnName)
# for each function, calculate full summary for columns 2 and 3
lapply(X = awk.split, FUN = function(x) {
summary(x[2:3])
})
Result:
$getInfo
minAccessTime maxAccessTime
Min. :200.0 Min. :600
1st Qu.:237.5 1st Qu.:600
Median :265.0 Median :610
Mean :257.5 Mean :630
3rd Qu.:285.0 3rd Qu.:640
Max. :300.0 Max. :700
$getStage
minAccessTime maxAccessTime
Min. :600.0 Min. : 800.0
1st Qu.:600.0 1st Qu.: 800.0
Median :600.0 Median : 800.0
Mean :633.3 Mean : 866.7
3rd Qu.:650.0 3rd Qu.: 900.0
Max. :700.0 Max. :1000.0
This awk
script should give you all the skills necessary to get what you want.
It basically runs through all lines in your input file, ignoring those where the second field is minAccessTime
(the CSV header).
On all other records, it updates the count of, minimum-of-minima, maximum-of-minima, minimum-of-maxima, maximum-of-maxima, sum-of-minima, and sum-of-maxima for both the overall data plus each individual function name.
The former is stored in count
, min_min
, max_min
, min_max
, max_max
, sum_min
and sum_max
. The latter are stored in associative arrays with similar names (with _arr
appended).
Then, once all records are read, the END
section outputs the information.
NR > 1 {
count++;
sum_min += $2;
sum_max += $3;
if (count == 1) {
min_min = $2;
max_min = $2;
min_max = $3;
max_max = $3;
} else {
if ($2 < min_min) { min_min = $2; }
if ($2 > max_min) { max_min = $2; }
if ($3 < min_max) { min_max = $3; }
if ($3 > max_max) { max_max = $3; }
}
count_arr[$1]++;
sum_min_arr[$1] += $2;
sum_max_arr[$1] += $3;
if (count_arr[$1] == 1) {
min_min_arr[$1] = $2;
max_min_arr[$1] = $2;
min_max_arr[$1] = $3;
max_max_arr[$1] = $3;
} else {
if ($2 < min_min_arr[$1]) { min_min_arr[$1] = $2; }
if ($2 > max_min_arr[$1]) { max_min_arr[$1] = $2; }
if ($3 < min_max_arr[$1]) { min_max_arr[$1] = $3; }
if ($3 > max_max_arr[$1]) { max_max_arr[$1] = $3; }
}
}
END {
print "Overall:"
print " Total records = " count
print " Sum of minima = " sum_min
print " Sum of maxima = " sum_max
if (count > 0) {
print " Min of minima = " min_min
print " Max of minima = " max_min
print " Min of maxima = " min_max
print " Max of maxima = " max_max
print " Avg of minima = " sum_min / count
print " Avg of maxima = " sum_max / count
}
for (task in count_arr) {
print "Function " task ":"
print " Total records = " count_arr[task]
print " Sum of minima = " sum_min_arr[task]
print " Sum of maxima = " sum_max_arr[task]
print " Min of minima = " min_min_arr[task]
print " Max of minima = " max_min_arr[task]
print " Min of maxima = " min_max_arr[task]
print " Max of maxima = " max_max_arr[task]
print " Avg of minima = " sum_min_arr[task] / count_arr[task]
print " Avg of maxima = " sum_max_arr[task] / count_arr[task]
}
}
Storing that script into qq.awk
and placing your sample data into qq.in
, then running:
awk -F, -f qq.awk qq.in
generates the following output, which I'm relatively certain will give you every possible piece of information you need:
Overall:
Total records = 7
Sum of minima = 2930
Sum of maxima = 5120
Min of minima = 200
Max of minima = 700
Min of maxima = 600
Max of maxima = 1000
Avg of minima = 418.571
Avg of maxima = 731.429
Function getStage:
Total records = 3
Sum of minima = 1900
Sum of maxima = 2600
Min of minima = 600
Max of minima = 700
Min of maxima = 800
Max of maxima = 1000
Avg of minima = 633.333
Avg of maxima = 866.667
Function getInfo:
Total records = 4
Sum of minima = 1030
Sum of maxima = 2520
Min of minima = 200
Max of minima = 300
Min of maxima = 600
Max of maxima = 700
Avg of minima = 257.5
Avg of maxima = 630
If you insist on Awk...
$ awk -F, '
> func newmin(fname, array, value) { if (!(fname in array) || array[fname]>value) array[fname] = value }
> func newmax(fname, array, value) { if (!(fname in array) || array[fname]<value) array[fname] = value }
> NR>1 {
> newmin($1,min2,$2)
> newmin("global",min2,$2)
> newmax($1,max2,$2)
> newmax("global",max2,$2)
> newmin($1,min3,$3)
> newmin("global",min3,$3)
> newmax($1,max3,$3)
> newmax("global",max3,$3)
> }
> END { for (fname in min2) { print fname, min2[fname], max2[fname], min3[fname], max3[fname] } }'
精彩评论