Convert AWK regex to Python script
Good morning all, I wonder if you could please help me with the following query. I started learning Python last weekend, after a colleague of mine showed me how to dramatically cut the time a Bash script takes to execute by rewriting it in Python. I was amazed at how fast it ran. I would now like to do the same thing with another script I have.
This other script reads a log file, uses AWK to filter certain fields from it, and writes them to a new file. The command the script executes is shown below. I would like to rewrite it in Python, as the script currently takes about 1 hour to run on a log file of about 100,000 lines. I would like to cut this time down as much as possible.
cat logs/pdu_log_fe.log | awk -F\- '{print $1,$NF}' | awk -F\. '{print $1,$NF}' | awk '{print $1,$4,$5}' | sort | uniq | while read service command status; do echo "Service: $service, Command: $command, Status: $status, Occurrences: `grep $service logs/pdu_log_fe.log | grep $command | grep $status | wc -l | awk '{ print $1 }'`" >> logs/pdu_log_fe_clean.log; done
This AWK command gets lines which look like this:-
2011-05-16 09:46:22,361 [Thread-4847133] PDU D <G_CC_SMS_SERVICE_51408_656.O_ CC_SMS_SERVICE_51408_656-ServerThread-VASPSessionThread-7ee35fb0-7e87-11e0-a2da-00238bce423b-TRX - 2011-05-16 09:46:22 - OUT - (submit_resp: (pdu: L: 53 ID: 80000004 Status: 0 SN: 25866) 98053090-7f90-11e0-a2da-00238bce423b (opt: ) ) >
And outputs lines like this:-
CC_SMS_SERVICE_51408 submit_resp: 0
I have tried writing the Python script myself but I am getting stuck writing the regex. So far I have the following:-
#!/usr/bin/python
# Import RegEx module
import re as regex
# Log file to work on
filetoread = open('/tmp/ pdu_log.log', "r")
# File to write output to
filetowrite = file('/tmp/ pdu_log_clean.log', "w")
# Perform filtering in the log file
linetoread = filetoread.readlines()
for line in linetoread:
    filter0 = regex.sub(r"<G_","",line)
    filter1 = regex.sub(r"\."," ",filter0)
    # Write new log file
    filetowrite.write(filter1)
filetowrite.close()
# Read new log and get required fields from it
filtered_log = open('/tmp/ pdu_log_clean.log', "r")
filtered_line = filtered_log.readlines()
for line in filtered_line:
    token = line.split(" ")
    print token[0], token[1], token[5], token[13], token[20]
print "Done"
Ugly, I know, but please bear in mind that I only started learning Python two days ago.
I have been looking on this group and on the Internet for snippets of code that I could use, but so far what I have found either does not fit my needs or is too complicated (at least for me).
Any suggestions or advice you can give me on how to accomplish this task will be greatly appreciated.
On another note, can you also recommend a good no-nonsense book to learn Python? I have read the book “A Byte of Python” by Swaroop C H (great introductory book!) and I am now reading “Dive into Python” by Mark Pilgrim. I am looking for a book that explains things in simple terms and goes straight to the point (similar to how “A Byte of Python” was written).
Thanks in advance
Kind regards,
Junior
=====Answer to Eli who commented below=====
My apologies guys, I tried commenting on Eli's answer but my comment is too long and it won't save. I also tried answering my own post, but as I am a new user I cannot answer until 8 hours have passed! So my only option is to add an edit to my post :)
Anyways, in response to Eli's comment:-
OK, let's see. My aim is to filter out several fields from a log file and write them to a new log file. The current log file, as I mentioned previously, has thousands of lines like this:-
2011-05-16 09:46:22,361 [Thread-4847133] PDU D
All the lines in the log file are similar and they all have the same length (same number of fields). Most of the fields are separated by spaces, except for a couple of them which I am processing with AWK (removing "
I hope this is clearer now
Regards,
Junior
Since these lines are very structured, for simplicity (and speed), I would not go for a regex at all. Here's an example extracting your first piece of data:
>>> line = "2011-05-16 09:46:22,361 [Thread-4847133] PDU D <G_CC_SMS_SERVICE_51408_656.O_ CC_SMS_SERVICE_51408_656-ServerThread-VASPSessionThread-7ee35fb0-7e87-11e0-a2da-00238bce423b-TRX - 2011-05-16 09:46:22 - OUT - (submit_resp: (pdu: L: 53 ID: 80000004 Status: 0 SN: 25866) 98053090-7f90-11e0-a2da-00238bce423b (opt: ) ) >"
>>> istart = line.find('<G_')
>>> iend = line.find('.', istart)
>>> line[istart+3:iend]
'CC_SMS_SERVICE_51408_656'
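If you end up repeating this pattern for several fields, the two find calls can be wrapped in a small helper. The following is only a minimal sketch: extract_between is a hypothetical name I made up, and the sample string is a shortened piece of your example line. It returns None when a marker is missing, so malformed lines do not silently produce garbage slices.

```python
def extract_between(line, start_marker, end_marker):
    """Return the text between start_marker and the next end_marker, or None."""
    istart = line.find(start_marker)
    if istart == -1:
        return None  # start marker not present in this line
    istart += len(start_marker)  # skip past the marker itself
    iend = line.find(end_marker, istart)
    if iend == -1:
        return None  # no end marker after the start marker
    return line[istart:iend]


# Shortened version of the example line, for illustration only
sample = "PDU D <G_CC_SMS_SERVICE_51408_656.O_ CC_SMS_SERVICE_51408_656-ServerThread"
print(extract_between(sample, '<G_', '.'))  # CC_SMS_SERVICE_51408_656
```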
Other fields can be extracted similarly, depending on the exact structure of all possible lines. It's hard to understand what your AWK does exactly and how it applies to the example you provided. It would be easier if you could describe the structure of your data lines and what exactly you need to extract.
For example, splitting the line by whitespace (the default for split) you get:
>>> line.split()
['2011-05-16', '09:46:22,361', '[Thread-4847133]', 'PDU', 'D', '<G_CC_SMS_SERVICE_51408_656.O_', 'CC_SMS_SERVICE_51408_656-ServerThread-VASPSessionThread-7ee35fb0-7e87-11e0-a2da-00238bce423b-TRX', '-', '2011-05-16', '09:46:22', '-', 'OUT', '-', '(submit_resp:', '(pdu:', 'L:', '53', 'ID:', '80000004', 'Status:', '0', 'SN:', '25866)', '98053090-7f90-11e0-a2da-00238bce423b', '(opt:', ')', ')', '>']
Now you're pretty much free to extract whichever fields you need from here, as long as (as you say) the format is very fixed and it's always the same fields. So:
>>> line.split()[13]
'(submit_resp:'
Cleaning up a bit:
>>> line.split()[13].lstrip('(').rstrip(':')
'submit_resp'
As you can see, there are plenty of ways to pull the fields apart. I suggest you get familiar with Python's string processing capabilities before you immerse yourself in regexes. Regexes are useful, but they're not the only tool for the job. Often, solutions based on alternative string processing techniques are faster and easier to understand. You can always supplement them with regexes, of course.
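Putting the pieces together, here is a rough sketch (Python 3) of how the whole job might look: read the log once, pull out service, command and status by position, and count each combination. I am guessing at several things, so treat it as a starting point. The token positions 5, 13 and 20 come from the line.split() output above and only hold if every line really has the same layout; parse_line, count_occurrences and write_summary are names I made up; and the output format is copied from the echo in your shell command. Note that your expected output shows CC_SMS_SERVICE_51408 without the trailing _656, whereas this keeps the full name, so adjust that part to whatever you actually need.

```python
from collections import Counter

# Token positions taken from the line.split() output shown above.
# Assumption: every log line has the same layout as the example line.
SERVICE_IDX, COMMAND_IDX, STATUS_IDX = 5, 13, 20


def parse_line(line):
    """Return (service, command, status) for one log line, or None if too short."""
    tokens = line.split()
    if len(tokens) <= STATUS_IDX:
        return None  # skip lines that do not have enough fields
    # '<G_CC_SMS_SERVICE_51408_656.O_' -> 'CC_SMS_SERVICE_51408_656'
    service = tokens[SERVICE_IDX].replace('<G_', '', 1).split('.')[0]
    # '(submit_resp:' -> 'submit_resp'
    command = tokens[COMMAND_IDX].lstrip('(').rstrip(':')
    status = tokens[STATUS_IDX]
    return service, command, status


def count_occurrences(lines):
    """Count how often each (service, command, status) combination occurs."""
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        if fields is not None:
            counts[fields] += 1
    return counts


def write_summary(in_path, out_path):
    """Read the log once and write one summary line per combination."""
    with open(in_path) as src:
        counts = count_occurrences(src)  # iterates the file line by line
    with open(out_path, 'w') as dst:
        for (service, command, status), n in sorted(counts.items()):
            dst.write("Service: %s, Command: %s, Status: %s, Occurrences: %d\n"
                      % (service, command, status, n))
```

You would call it with the paths from your shell command, e.g. write_summary('logs/pdu_log_fe.log', 'logs/pdu_log_fe_clean.log'). The main difference from the shell version is that the log is read a single time, instead of being grepped again for every unique combination, which is most likely where your hour is going.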
P.S. For books/resources on learning Python - there are many SO questions on this. Start here and browse.