Shell: script to group strings by substring
I have a program (sorry, changing this is not an option) that outputs log files with upwards of 500k lines.
I am trying to group together lines (and then sort these groups) in the log file based on a substring within the lines.
For example I have lines similar to below:
SELECT something WHERE TIM BETWEEN '*' AND '*' AND something;
What I'm looking to group on is the TIM BETWEEN '*' AND '*'
where the * values match between lines, for example:
SELECT something WHERE TIM BETWEEN '2010-03-04' AND '2010-03-10' AND something;
SELECT something WHERE TIM BETWEEN '2011-01-28' AND '2011-02-05' AND something;
SELECT something WHERE TIM BETWEEN '2010-03-04' AND '2010-03-10' AND something;
SELECT something WHERE TIM BETWEEN '2011-01-28' AND '2011-02-05' AND something;
would be grouped as such in the output:
SELECT something WHERE TIM BETWEEN '2010-03-04' AND '2010-03-10' AND something;
SELECT something WHERE TIM BETWEEN '2010-03-04' AND '2010-03-10' AND something;
SELECT something WHERE TIM BETWEEN '2011-01-28' AND '2011-02-05' AND something;
SELECT something WHERE TIM BETWEEN '2011-01-28' AND '2011-02-05' AND something;
with each group also sorted on the whole string, so that lines whose "somethings" are similar end up next to each other.
I have been trying to put a shell script together that reads the log file and outputs what I want, but haven't had any success!
Edit: I need to also mention that 'something' can be multiple words for example:
SELECT blah1, blah2 or SELECT blah1, blah2, blah3
You should probably be able to use sort
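sort -o outputfile -k 2,2 -k 6,6 -k 8,8 inputfile
Where -k 2,2 gives the "something" column, -k 6,6 gives the first date column and -k 8,8 gives the last date column. Fields are split on whitespace, so this assumes "something" is a single word. (The older +1 -2 style of key is obsolete and is rejected by current versions of sort.)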
(PS! Not tested)
You'll have to pre-filter your data and turn it into something you can use sort with:
awk '{sub(/BETWEEN/, "|"); sub(/AND/, "|"); print}' logFile \
| sort -t"|" -k2,2 -k3,3 \
| sed 's/|/BETWEEN/;s/|/AND/'
Output:
SELECT something WHERE TIM BETWEEN '2010-03-04' AND '2010-03-10' AND something;
SELECT something WHERE TIM BETWEEN '2010-03-04' AND '2010-03-10' AND something;
SELECT something WHERE TIM BETWEEN '2011-01-28' AND '2011-02-05' AND something;
SELECT something WHERE TIM BETWEEN '2011-01-28' AND '2011-02-05' AND something;
I hope this helps.
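Since 'something' can be several words, a variant of the same pre-filter idea is to pull out the whole BETWEEN '...' AND '...' substring as a sort key instead of counting fields. The following is only a rough sketch under a few assumptions: the function name group_by_between is made up for illustration, the dates are always wrapped in single quotes as in the sample lines, and lines without a BETWEEN clause are kept and sorted ahead of the groups under an empty key.

```shell
#!/bin/sh
# Hypothetical helper (the name is made up for this sketch).
# Decorate each line with its BETWEEN '...' AND '...' substring as a key,
# sort on the key and then on the whole line, and strip the key again.
group_by_between() {
    tab=$(printf '\t')
    # Step 1: print "key<TAB>line"; the key is empty when nothing matches.
    awk -v q="'" '
        BEGIN { re = "BETWEEN " q "[^" q "]*" q " AND " q "[^" q "]*" q }
        {
            if (match($0, re)) key = substr($0, RSTART, RLENGTH)
            else               key = ""
            print key "\t" $0
        }' "$1" |
    # Step 2: group on the key (field 1), then order by the original line.
    # LC_ALL=C gives a plain byte-order sort.
    LC_ALL=C sort -t "$tab" -k1,1 -k2 |
    # Step 3: drop the key so only the original line is left.
    cut -f2-
}
```

It would be called as group_by_between logFile > outputfile. Because the key is the quoted date range itself, the number of words before WHERE does not matter.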