Calculating statistics directly from a CSV file
I have a transaction log file in CSV format that I want to use to compute statistics. The log has the following fields:
date: Time/date stamp
salesperson: The username of the person who closed the sale
promo: Sum total of items in the sale that were promotions
amount: Grand total of the sale
I'd like to get the following statistics:
salesperson: The username of the salesperson being analyzed
minAmount: The smallest grand total of this salesperson's transactions
avgAmount: The mean grand total
maxAmount: The largest grand total
minPromo: The smallest promo amount by the salesperson
avgPromo: The mean promo amount...
I'm tempted to build a database structure, import this file, write SQL, and pull out the stats. I don't need anything more from this data than these stats. Is there an easier way? I'm hoping some bash script could make this easy.
TxtSushi does this:
    tssql -table trans transactions.csv \
        'select salesperson,
                min(as_real(amount)) as minAmount,
                avg(as_real(amount)) as avgAmount,
                max(as_real(amount)) as maxAmount,
                min(as_real(promo)) as minPromo,
                avg(as_real(promo)) as avgPromo
         from trans group by salesperson'
I have a bunch of example scripts showing how to use it.
Edit: fixed syntax
You could also bang out an awk script to do it. It's only CSV with a few variables.
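A minimal sketch of that approach, assuming plain comma-separated fields (no embedded commas or quotes), a header row, and the column order described in the question (date, salesperson, promo, amount):

    awk -F',' '
    NR > 1 {                                  # skip the header row
        sp = $2; pr = $3 + 0; amt = $4 + 0
        if (!(sp in count)) {                 # first record for this salesperson
            minA[sp] = amt; maxA[sp] = amt; minP[sp] = pr
        }
        if (amt < minA[sp]) minA[sp] = amt
        if (amt > maxA[sp]) maxA[sp] = amt
        if (pr  < minP[sp]) minP[sp] = pr
        sumA[sp] += amt; sumP[sp] += pr; count[sp]++
    }
    END {
        for (sp in count)
            printf "%s min=%.2f avg=%.2f max=%.2f minPromo=%.2f avgPromo=%.2f\n",
                sp, minA[sp], sumA[sp] / count[sp], maxA[sp],
                minP[sp], sumP[sp] / count[sp]
    }' transactions.csv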
You can loop through the lines of the CSV and use bash variables to hold your running min/max amounts. For the average, keep a running total and divide by the number of data lines (not counting the header, if there is one).
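A rough sketch of that idea for the amount column alone, assuming unquoted fields and using bc for the decimal arithmetic (this computes min/max/avg over the whole file rather than per salesperson, just to show the loop structure):

    #!/bin/bash
    # Track min, max, and mean of the amount column (field 4).
    min="" max="" total=0 count=0
    while IFS=',' read -r date salesperson promo amount; do
        [[ $date == "date" ]] && continue        # skip the header row
        if [[ -z $min ]] || (( $(echo "$amount < $min" | bc -l) )); then
            min=$amount
        fi
        if [[ -z $max ]] || (( $(echo "$amount > $max" | bc -l) )); then
            max=$amount
        fi
        total=$(echo "$total + $amount" | bc -l)
        (( count++ ))
    done < transactions.csv
    echo "min=$min max=$max avg=$(echo "$total / $count" | bc -l)"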
Here are some useful snippets for working with CSV files in bash.
If your data might be quoted (e.g. because a field contains a comma), processing it with bash, sed, etc. becomes much more complex.
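For example, a (hypothetical) record like the following would break a naive split on commas, because the comma inside the quotes is data, not a field delimiter:

    2009-10-01 09:15,"smith, jane",2.50,120.00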