Python: Parsing a colon delimited file with various counts of fields
I'm trying to parse a few files, each named 'clientname'.txt, with the following format:
hostname:comp1
time: Fri Jan 28 20:00:02 GMT 2011
ip:xxx.xxx.xx.xx
fs:good:45
memory:bad:78
swap:good:34
Mail:good
Fields are delimited by a : but lines 0, 2 and 6 (counting from 0) have 2 fields, while lines 1 and 3-5 have 3 or more. (A big issue I've had trouble with is the time: line, since 20:00:02 is really a time and not 3 separate fields.)
I have several files like this that I need to parse. There are many more lines in some of these files with multiple fields.
...
for i in clients:
    if os.path.isfile(rpt_path + i + rpt_ext): # if the rpt exists then do this
        rpt = rpt_path + i + rpt_ext
        l_count = 0
        for line in open(rpt, "r"):
            s_line = line.rstrip()
            part = s_line.split(':')
            print part
            l_count = l_count + 1
    else: # else break
        break
First I check whether the file exists; if it does, I open it and (eventually) parse it. For now I'm just printing the output (print part) to make sure it's being split correctly. Honestly, the only trouble I'm having at this point is the time: field. How can I treat that line differently from all the others? The time field is ALWAYS the 2nd line in all of my report files.
The split method has the syntax split([sep[, maxsplit]]). If maxsplit is given, at most maxsplit splits are done, so the result has at most maxsplit+1 parts. In your case, you just have to give maxsplit as 1: calling split(':', 1) on the time line would solve your problem.
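To make the difference concrete, here is a minimal sketch in Python 3 syntax, using the time line from the question. Note that if you applied maxsplit=1 to every line, a line like fs:good:45 would come back as two fields rather than three, so you probably only want it for the time line.

```python
line = "time: Fri Jan 28 20:00:02 GMT 2011"

# Without maxsplit, every colon is a separator, so the clock time is broken up.
all_parts = line.split(":")
print(all_parts)

# With maxsplit=1, only the first colon splits; the rest of the line stays intact.
key, value = line.split(":", 1)
print(key, value.strip())

# The same call on a three-field line leaves the last two fields joined.
fs_parts = "fs:good:45".split(":", 1)
print(fs_parts)
```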
If time is a special case, you could do:
[...]
    s_line = line.rstrip()
    if line.startswith('time:'):
        part = s_line.split(':', 1)
    else:
        part = s_line.split(':')
    print part
[...]
This would give you:
['hostname', 'comp1']
['time', ' Fri Jan 28 20:00:02 GMT 2011']
['ip', 'xxx.xxx.xx.xx']
['fs', 'good', '45']
['memory', 'bad', '78']
['swap', 'good', '34']
['Mail', 'good']
It also doesn't rely on the position of time in the file.
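If you eventually want the values rather than printed lists, one possible next step is to collect each line into a dict keyed by its first field. This is only a sketch in Python 3 syntax: parse_report is a hypothetical helper name, not something from the question, and it assumes each key appears once per report.

```python
def parse_report(lines):
    """Map each line's first field to the list of its remaining fields."""
    report = {}
    for line in lines:
        s_line = line.strip()
        if not s_line:
            continue  # skip blank lines
        if s_line.startswith("time:"):
            # Split on the first colon only, so 20:00:02 stays in one piece.
            parts = s_line.split(":", 1)
        else:
            parts = s_line.split(":")
        report[parts[0]] = [p.strip() for p in parts[1:]]
    return report


# Sample lines taken from the question's report format.
sample = [
    "hostname:comp1\n",
    "time: Fri Jan 28 20:00:02 GMT 2011\n",
    "fs:good:45\n",
]
report = parse_report(sample)
print(report["time"])
print(report["fs"])
```

Since open(rpt) yields lines in the same way, you could pass the file object straight to the function.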
Design considerations:
Robustly handle extraneous whitespace, including blank lines, and missing colons.
Extract a record_type, which is then used to decide how to parse the remainder of the line.
>>> def munched(s, n=None):
...     if n is None:
...         n = 99999999 # this kludge should not be necessary
...     return [x.strip() for x in s.split(':', n)]
...
>>> def parse_line(line):
...     if ':' not in line:
...         return [line.strip(), '']
...     record_type, remainder = munched(line, 1)
...     if record_type == 'time':
...         data = [remainder]
...     else:
...         data = munched(remainder)
...     return record_type, data
...
>>> for guff in """
... hostname:comp1
... time: Fri Jan 28 20:00:02 GMT 2011
... ip:xxx.xxx.xx.xx
... fs:good:45
...  memory : bad : 78 
... missing colon
... Mail:good""".splitlines(True):
...     print repr(guff), parse_line(guff)
...
'\n' ['', '']
'hostname:comp1\n' ('hostname', ['comp1'])
'time: Fri Jan 28 20:00:02 GMT 2011\n' ('time', ['Fri Jan 28 20:00:02 GMT 2011'])
'ip:xxx.xxx.xx.xx\n' ('ip', ['xxx.xxx.xx.xx'])
'fs:good:45\n' ('fs', ['good', '45'])
' memory : bad : 78 \n' ('memory', ['bad', '78'])
'missing colon\n' ['missing colon', '']
'Mail:good' ('Mail', ['good'])
>>>
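For what it's worth, here is a hedged variant of the same idea in Python 3 syntax. It makes two changes that are my own assumptions rather than part of the session above: it passes maxsplit=-1 (which str.split treats as "no limit") instead of the large sentinel, and it returns a (record_type, data) tuple for lines without a colon too, so callers always get the same shape back.

```python
def munched(s, n=-1):
    # maxsplit=-1 means "no limit" for str.split, so no sentinel value is needed.
    return [x.strip() for x in s.split(":", n)]


def parse_line(line):
    if ":" not in line:
        # Assumption: treat the whole line as the record type, with no data.
        return line.strip(), []
    record_type, remainder = munched(line, 1)
    if record_type == "time":
        data = [remainder]  # keep the timestamp as one field
    else:
        data = munched(remainder)
    return record_type, data


for guff in ["time: Fri Jan 28 20:00:02 GMT 2011\n", " memory : bad : 78 \n", "missing colon\n"]:
    print(repr(guff), parse_line(guff))
```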
If the time field is always the 2nd line, why can't you skip it and parse it separately?
Something like:
for i, line in enumerate(open(rpt, "r").read().splitlines()):
    if i == 1: # Special parsing for the time: line
        data = line[5:] # everything after the 'time:' prefix
    else:
        pass # your normal parsing logic goes here
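Here is a small self-contained sketch of that approach in Python 3 syntax. The io.StringIO object is just a stand-in for the opened report file, and the "normal" parsing is assumed to be a plain split(':') as in the question.

```python
import io

# Stand-in for an open report file; the contents are sample lines from the question.
rpt = io.StringIO(
    "hostname:comp1\n"
    "time: Fri Jan 28 20:00:02 GMT 2011\n"
    "fs:good:45\n"
)

parsed = []
for i, line in enumerate(rpt):
    line = line.rstrip()
    if i == 1:
        # Index 1 is assumed to be the time line; drop the "time:" prefix.
        parsed.append(["time", line[len("time:"):].strip()])
    else:
        parsed.append(line.split(":"))

print(parsed)
```

This only works as long as the time line really is always second; checking startswith('time:') is safer if that ever changes.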