Creating a dynamic "variable layout" for split() in Python
I have a script that parses IIS logs. At the moment it fetches log lines one by one and uses split() to put the IIS field values into multiple variables, like this:

    date, time, sitename, ip, uri_stem, whatever = log_line.split(" ")

This works fine for the default setup. But if someone else uses a different log field layout (different order, different log fields, or both), they would have to find this line in the source and modify it. They would also have to know how to modify it so that nothing breaks, since these variables are obviously used later in the code.
How could I make this more generic by having some kind of a list that describes the IIS log field layout, which a user could modify (a config variable, or a dict/list at the beginning of the script), and that would later be used to hold the log line values? That is what I mean by "dynamic". I was thinking of using a for loop and a dictionary for this, as in the sketch below, but I imagine it would have a big impact on performance compared to a bare split(), or wouldn't it? Does anyone have a suggestion on how this could/should be done?
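For illustration, this is roughly what I have in mind; the field names are just my current defaults, and parse_line() is a made-up helper, not something already in the script:

    # configurable layout at the top of the script
    IIS_FIELDS = ['date', 'time', 's-sitename', 's-ip', 'cs-uri-stem', 'whatever']

    def parse_line(log_line):
        # pair each configured field name with the corresponding split value
        return dict(zip(IIS_FIELDS, log_line.split(' ')))

    values = parse_line(log_line)
    print values['date'], values['time']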
Is it even worth the trouble, or should I just add a note for anyone who uses the script, saying where to find the line that contains log_line.split(), how to modify it, and what to pay attention to?
Thank you.
If only the order of the fields may vary, it is possible to run a verification on each line and automatically adapt the extraction of the information to the detected order. I think it would be easy to do with the help of regexes.

If not only the order but also the number and nature of the fields may vary, I think it would still be possible to do the same, on the condition that the possible fields are known in advance. The common requirement is that the fields must have "personalities" strong enough to be easily distinguishable.

Without more precise information, nobody can go further, IMO.
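To illustrate what I mean by fields with distinguishable "personalities", here is a minimal sketch; the patterns and the detect_order() helper are assumptions for the example, not taken from the question:

    import re

    # each known field type gets a distinctive pattern
    FIELD_PATTERNS = [
        ('date', re.compile(r'\d{4}-\d{2}-\d{2}$')),
        ('time', re.compile(r'\d{2}:\d{2}:\d{2}$')),
        ('ip',   re.compile(r'\d{1,3}(\.\d{1,3}){3}$')),
    ]

    def detect_order(sample_line):
        # guess which known field each column of a sample line holds
        order = []
        for value in sample_line.split():
            for name, pattern in FIELD_PATTERNS:
                if pattern.match(value):
                    order.append(name)
                    break
            else:
                order.append(None)  # column not recognized
        return order

    print detect_order('2010-01-01 00:00:03 192.168.1.1')
    # -> ['date', 'time', 'ip']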
EDIT (Monday, 15 August 9:39 GMT+0:00)
It seems there is an error in spilp.py. It must be

    with codecs.open(file_path, 'r', encoding='utf-8', errors='ignore') as log_lines:

not

    with open(file_path, 'r', encoding='utf-8', errors='ignore') as log_lines:

The latter uses the builtin open(), which, in Python 2, does not accept the encoding and errors keyword arguments.
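A quick check of that claim, assuming a CPython 2 interpreter; the builtin rejects the keyword with a TypeError along these lines:

    >>> open('sample.log', 'r', encoding='utf-8', errors='ignore')
    TypeError: 'encoding' is an invalid keyword argument for this function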
EDIT (Monday, 15 August 16:10 GMT+0:00)
Presently, in the sample file, the fields are in this order:

date
time
s-sitename
s-ip
cs-method
cs-uri-stem
cs-uri-query
s-port
cs-username
c-ip
cs(User-Agent)
sc-status
sc-substatus
sc-win32-status
Suppose you want to extract the values of each line in the following order:

s-port
time
date
s-sitename
s-ip
cs(User-Agent)
sc-status
sc-substatus
sc-win32-status
c-ip
cs-username
cs-method
cs-uri-stem
cs-uri-query
to assign them to the following identifiers in the same order:
s_port
time
date
s_sitename
s_ip
cs_user_agent
sc_status
sc_substatus
sc_win32_status
c_ip
cs_username
cs_method
cs_uri_stem
cs_uri_query
doing

    (s_port,
     time, date,
     s_sitename, s_ip,
     cs_user_agent, sc_status, sc_substatus, sc_win32_status,
     c_ip,
     cs_username,
     cs_method, cs_uri_stem, cs_uri_query) = line_spliter(line)

with a function line_spliter().
I know, I know, what you want is the contrary: to restore the values read from a file to the order they presently have in the file, in case a file arrives with a different order than the present generic one. But I take this only as an example, so as to leave the sample file as it is; otherwise I would need to create another file with a different order of values for the example.

Anyway, the algorithm doesn't depend on the example. It depends on the desired order in which the values must be produced to make a correct assignment. In my code, this desired order is set with the object ref_fields.

I think my code and its execution speak for themselves and make the principle clear:
    import re

    ref_fields = ['s-port',
                  'time', 'date',
                  's-sitename', 's-ip',
                  'cs(User-Agent)', 'sc-status',
                  'sc-substatus', 'sc-win32-status',
                  'c-ip',
                  'cs-username',
                  'cs-method', 'cs-uri-stem', 'cs-uri-query']

    print 'REF_FIELDS :\n------------\n%s\n' % '\n'.join(ref_fields)

    ############################################
    file_path = 'I:\\sample[1].log'  # Path to put here
    ############################################

    with open(file_path, 'r') as log_lines:

        # skip the header lines until the one listing the fields keywords
        line = ''
        while line[0:8] != '#Fields:':
            line = next(log_lines)
        # At this point, line is the line containing the fields keywords
        print 'line of the fields keywords:\n----------------------------\n%r\n' % line

        found_fields = line.split()[1:]
        len_found_fields = len(found_fields)
        # one capturing group per field, separated by whitespace
        regex_extractor = re.compile('[ \t]+'.join(len_found_fields * ['([^ \t]+)']))
        print 'list found_fields of keywords in the file:\n------------------------------------------\n%s\n' % found_fields

        print '\nfound_fields == ref_fields is ', found_fields == ref_fields
        if found_fields == ref_fields:
            print '\nNORMAL ORDER\n------------'
            def line_spliter(line):
                return line.split()
        else:
            the_order = [found_fields.index(field) + 1 for field in ref_fields]
            # the_order is the list of indexes localizing the elements of ref_fields
            # in the order in which they succeed in the actual line of found fields keywords
            print '\nSPECIAL ORDER\n-------------\nthe_order == %s\n\n\n======================' % the_order
            def line_spliter(line):
                return regex_extractor.match(line).group(*the_order)

        for i in xrange(1):
            line = next(log_lines)
            (s_port,
             time, date,
             s_sitename, s_ip,
             cs_user_agent, sc_status, sc_substatus, sc_win32_status,
             c_ip,
             cs_username,
             cs_method, cs_uri_stem, cs_uri_query) = line_spliter(line)

            print ('LINE :\n------\n'
                   '%s\n'
                   'SPLIT LINE :\n--------------\n'
                   '%s\n\n'
                   'REORDERED SPLIT LINE :\n-------------------------\n'
                   '%s\n\n'
                   'EXAMPLE OF SOME CORRECT BINDINGS OBTAINED :\n-------------------------------------------\n'
                   'date == %s\n'
                   'time == %s\n'
                   's_port == %s\n'
                   'c_ip == %s\n\n'
                   '======================') % (line, '\n'.join(line.split()),
                                                line_spliter(line),
                                                date, time, s_port, c_ip)
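A side note on the trick used above: a match object's group() method accepts several group numbers at once and returns the corresponding captures as a tuple in the requested order, which is what makes the reordering a single call. A quick illustration:

    import re

    m = re.match('(a+) (b+) (c+)', 'aa bbb c')
    print m.group(3, 1, 2)   # -> ('c', 'aa', 'bbb')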
    # ---- split each logline into multiple variables, populate dictionaries and db ---- #
    def splitLogline(log_line):
        # needs to be dynamic (for different logging setups)
        (s_port,
         time, date,
         s_sitename, s_ip,
         cs_user_agent, sc_status, sc_substatus, sc_win32_status,
         c_ip,
         cs_username,
         cs_method, cs_uri_stem, cs_uri_query) = line_spliter(log_line)

Result:
REF_FIELDS :
------------
s-port
time
date
s-sitename
s-ip
cs(User-Agent)
sc-status
sc-substatus
sc-win32-status
c-ip
cs-username
cs-method
cs-uri-stem
cs-uri-query
line of the fields keywords:
----------------------------
'#Fields: date time s-sitename s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status \n'
list found_fields of keywords in the file:
------------------------------------------
['date', 'time', 's-sitename', 's-ip', 'cs-method', 'cs-uri-stem', 'cs-uri-query', 's-port', 'cs-username', 'c-ip', 'cs(User-Agent)', 'sc-status', 'sc-substatus', 'sc-win32-status']
found_fields == ref_fields is False
SPECIAL ORDER
-------------
the_order == [8, 2, 1, 3, 4, 11, 12, 13, 14, 10, 9, 5, 6, 7]
======================
LINE :
------
2010-01-01 00:00:03 SITENAME 192.168.1.1 GET /news-views.aspx - 80 - 66.249.72.135 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200 0 0
SPLIT LINE :
--------------
2010-01-01
00:00:03
SITENAME
192.168.1.1
GET
/news-views.aspx
-
80
-
66.249.72.135
Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)
200
0
0
REORDERED SPLIT LINE :
-------------------------
('80', '00:00:03', '2010-01-01', 'SITENAME', '192.168.1.1', 'Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)', '200', '0', '0\n', '66.249.72.135', '-', 'GET', '/news-views.aspx', '-')
EXAMPLE OF SOME CORRECT BINDINGS OBTAINED :
-------------------------------------------
date == 2010-01-01
time == 00:00:03
s_port == 80
c_ip == 66.249.72.135
======================
This code applies only to the case where the fields in a file are shuffled but equal in number to the normal, known list of fields. Other cases may occur, for example fewer values in a file than there are known and expected fields. If you need more help with these other cases, explain which cases may happen and I'll try to adapt the code.
I think I will have many remarks to make on the code I quickly read in spilp.py. I'll write them when I have time.
Changing the log line layout is a pretty big deal, but something that gets done from time to time because new items are added or existing items are deleted. Rarely does someone simply shuffle existing items around just for the heck of it.

These kinds of changes do not happen every day; they should be pretty rare. And when items within the log line are added or deleted, you are changing the code anyway: the new fields have to be processed in some way, and the code that processes any deleted fields has to be removed.

Yes, writing resilient code is a Good Thing. Defining a schema mapping field names to their positions in the log line may seem like a great idea, since it permits reshuffling and adding without digging into the one split line. But is it worth it for schema changes that happen twice a year? And is it worth it to avoid changing one line when so many other lines will have to be changed anyway? That is for you to decide.

That said, if you want to do this, consider using collections.namedtuple to turn each line into an object with named fields. The specification of the names can be done in a configuration area of your code. You will take a performance hit in doing so, so weigh that against the gain in flexibility.
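A minimal sketch of that idea, assuming the default layout from the question; LogRecord and parse_line() are made-up names for the example:

    from collections import namedtuple

    # configuration area: edit this tuple to match your IIS log layout
    LOG_FIELDS = ('date', 'time', 'sitename', 'ip', 'uri_stem', 'whatever')
    LogRecord = namedtuple('LogRecord', LOG_FIELDS)

    def parse_line(log_line):
        # build a record with named attribute access from one log line
        return LogRecord(*log_line.split(' '))

    record = parse_line('2010-01-01 00:00:03 SITENAME 192.168.1.1 /news-views.aspx -')
    print record.date, record.ip   # -> 2010-01-01 192.168.1.1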