How to extract data from an irregularly formatted data file in python
I need to extract certain data from a file, but this file is formatted to be read by humans, and is therefore irregular. First off there is a large amount of text before any of the data actually begins:
DL_POLY Version 2.20 Running on 10 nodes
*************** DLPOLY: LiNbO3 >***************
SIMULATION CONTROL PARAMETERS
simulation temperature 1.4500E+03
simulation pressure (katm) 0.0000E+00
selected number of timesteps 8000
equilibration period 500
data printing interval 80
statistics file interval 80
simulation timestep 5.0000E-04
Nose-Hoover (Melchionna) isotropic N-P-T thermostat relaxation time 1.0000E-01 barostat relaxation time 5.0000E-01
trajectory file option on
trajectory file start 1 trajectory file interval 80 trajectory file info key 2 ...
Then after a while there is the actual data but it is in this funny form:
step eng_tot temp_tot eng_cfg eng_vdw eng_cou eng_bnd > eng_ang eng_dih eng_tet time(ps) eng_pv temp_rot vir_cfg vir_vdw vir_cou vir_bnd >vir_ang vir_con vir_tet cpu (s) volume temp_shl eng_shl vir_shl alpha beta >gamma vir_pmf press
1 -1.1289E+05 1.4750E+03 -1.1386E+05 1.7276E+04 -1.3114E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 0.0 -1.1545E+05 0.0000E+00 9.6539E+03 -1.2118E+05 1.3083E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 0.8 5.3733E+04 1.2367E+02 0.0000E+00 0.0000E+00 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 -7.5549E+01
rolling -1.1289E+05 1.4750E+03 -1.1386E+05 1.7276E+04 -1.3114E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 averages -1.1545E+05 0.0000E+00 9.6539E+03 -1.2118E+05 1.3083E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 5.3733E+04 1.2367E+02 0.0000E+00 0.0000E+00 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 -7.5549E+01
80 -1.1290E+05 1.5021E+03 -1.1392E+05 2.1894E+04 -1.3726E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 0.0 -1.1256E+05 0.0000E+00 8.6671E+02 -1.3974E+05 1.3707E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 10.6 5.3149E+04 1.1377E+03 1.4419E+03 3.5382E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 1.1119E+01
rolling -1.1290E+05 1.6145E+03 -1.1398E+05 2.0750E+04 -1.3588E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 averages -1.1333E+05 0.0000E+00 3.3694E+03 -1.3512E+05 1.3565E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 5.3481E+04 1.0997E+03 1.1430E+03 2.8391E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 -1.2096E+01
160 -1.1287E+05 1.2629E+03 -1.1376E+05 2.1450E+04 -1.3633E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 0.1 -1.1249E+05 0.0000E+00 3.8761E+02 -1.3824E+05 1.3612E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 20.5 5.3375E+04 4.9015E+02 1.1243E+03 2.5052E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 1.2676E+01
rolling -1.1288E+05 1.4677E+03 -1.1389E+05 2.1589E+04 -1.3663E+05 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 averages -1.1235E+05 0.0000E+00 2.1147E+02 -1.3884E+05 1.3643E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 5.3152E+04 7.4818E+02 1.1440E+03 2.6211E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 1.7174E+01
On the 9th data interval there is a slight anomaly:
switching off temperature scaling at step 500
560 -1.1287E+05 1.4709E+03 -1.1390E+05 2.1600E+04 -1.3678E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 0.3 -1.1292E+05 0.0000E+00 1.9253E+03 -1.3743E+05 1.3656E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 68.4 5.4300E+04 1.5043E+02 1.2775E+03 2.7947E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 2.0576E-01
rolling -1.1286E+05 1.4784E+03 -1.1390E+05 2.1546E+04 -1.3673E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 averages -1.1298E+05 0.0000E+00 2.1361E+03 -1.3717E+05 1.3651E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 5.4303E+04 2.2261E+02 1.2785E+03 2.8027E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 -1.7421E+00
As you can see, there is a pair of '----' lines which may interfere with proper parsing of the data.
Let's say I want to get just the 'eng_tot' data from this file (the bolded numbers); how would I go about doing that in Python? The number is always in the same place in the file (second quantity, first row after the second set of ----s).
By the way, the header part with all the definitions in it repeats every 8 steps, except the first step, in which there are 9 lines. I'd like to just ignore the first step. For now, let's say I want to start with line 295, inclusive. Just so you know, I'm quite new to Python and programming in general, so any help you can provide is appreciated.
Here's the code I tried, but Eng_Total is still an empty set:
import re
import inspect

def lineno():
    """Returns the current line number"""
    linenum = inspect.currentframe().f_back.f_lineno

infile = open('FilePath/OUTPUT.01').read()

Eng_Total = []
for line in infile:
#    if 'eng_tot' in line.split():
    if re.match("\s+-+\s+", line):
        lineno(line)
        line = linenum+1
        sanitized_line = line[8:]
        eng_total = line.split()[0]
        Eng_Total.append(eng_total)
print Eng_Total
I'd probably do this:
- iterate over lines in the output
- search for one containing eng_tot: if 'eng_tot' in line.split(): process_blocks
- gobble up lines until one matches all dashes (with optional spaces on either side): if re.match("\s+-+\s+", line): process_metrics_block
- process the first line of metrics:
  - cut the first column off the line (it makes it harder to parse, because it might not be there): sanitized_line = line[8:]
  - eng_total = line.split()[0], the first column is now eng_total
- skip lines until you reach another line of dashes, then start again
After seeing your edits:
- You need to import the re (regular expression) module at the top of the file: import re
- The process_blocks and process_metrics_block were pseudo-code; those don't exist unless you define them. :) You don't need those functions exactly; you can avoid them using basic looping (while) and conditional (if) statements. A rough sketch follows this list.
- You'll have to make sure you understand what you're doing, not just copy from Stack Overflow! :)
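To make that concrete, here is a rough, untested sketch of the same idea as a single loop with a couple of flags instead of helper functions. The file name 'OUTPUT.01' and the "step number, then eng_tot" column order are assumptions based on the sample you posted:

import re

eng_totals = []
seen_header = False     # have we passed the line of column names containing 'eng_tot'?
want_metrics = False    # are we waiting for the first data row of a block?

with open('OUTPUT.01') as f:
    for line in f:
        parts = line.split()
        if 'eng_tot' in parts:
            # Column-name line; the data rows come after the next line of dashes.
            seen_header = True
            want_metrics = False
        elif seen_header and re.match(r"\s*-+\s*$", line):
            want_metrics = True
        elif want_metrics and parts and parts[0].isdigit():
            # First row of a metrics block: column 1 is the step, column 2 is eng_tot.
            eng_totals.append(parts[1])
            want_metrics = False    # ignore the rest of the block until the next dashes

print(eng_totals)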
It looks like you're trying to do something like this. It seems to work, but I'm sure with some effort, you can come up with something nicer:
import re

def find_header(lines):
    for (i, line) in enumerate(lines):
        if 'eng_tot' in line.split():
            return i
    return None

def find_next_separator(lines, start):
    for (i, line) in enumerate(lines[start+1:]):
        if re.match("\s*-+\s*", line):
            return i + start + 1
    return None

if __name__ == '__main__':
    totals = []
    lines = open('so.txt').readlines()

    header = find_header(lines)
    start = find_next_separator(lines, header+1)
    while True:
        end = find_next_separator(lines, start+1)
        if end is None: break

        # Pull out block, after line of dashes.
        metrics_block = lines[start+1:end]

        # Pull out 2nd column from 1st line of metrics.
        eng_total = metrics_block[0].split()[1]
        totals.append(eng_total)

        start = end

    print totals
You can use a generator to be a little more pythonic:
def metric_block_iter(lines):
    start = find_next_separator(lines, find_header(lines)+1)

    while True:
        end = find_next_separator(lines, start+1)
        if end is None: break

        yield (start, end)
        start = end

if __name__ == '__main__':
    totals = []
    lines = open('so.txt').readlines()

    for (start, end) in metric_block_iter(lines):
        # Pull out block, after line of dashes.
        metrics_block = lines[start+1:end]

        # Pull out 2nd column from 1st line of metrics.
        eng_total = metrics_block[0].split()[1]
        totals.append(eng_total)

    print totals
You're going to need to define the file format explicitly, and then you should be able to parse that easily.
The first step is figuring out where the data you need is defined. Then throw away everything up to that point. Then start reading.
If the eng_tot column can move, you need to figure out where in the block of useful data it is. So, read a line, entries = line.split(); location = entries.index('eng_tot'), then read the entry out of that location in the associated line in the output data.
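For example, here is a rough sketch of that lookup. The path 'OUTPUT.01' is a placeholder, and it assumes data rows start with an integer step number so they can be told apart from header and 'rolling' lines:

def read_column(path, name):
    values = []
    location = None
    with open(path) as f:
        for line in f:
            entries = line.split()
            if not entries:
                continue
            if name in entries:
                # Header line: remember which column the name is in.
                location = entries.index(name)
            elif location is not None and entries[0].isdigit():
                # Data row: the first token is the step number, so the
                # remaining columns line up with the header names.
                values.append(float(entries[location]))
    return values

print(read_column('OUTPUT.01', 'eng_tot'))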
The key is that you need to break down your problem into steps that you know you can do. When looking at something new it's easy to get overwhelmed. If you can just start doing something, you'll find that you can reach the solution without too much trouble after all.