How to extract data from an irregularly formatted data file in Python

I need to extract certain data from a file, but this file is formatted to be read by humans, and is therefore irregular. First off there is a large amount of text before any of the data actually begins:

   DL_POLY Version 2.20

                        Running on   10 nodes



*************** DLPOLY: LiNbO3 >***************




SIMULATION CONTROL PARAMETERS

simulation temperature 1.4500E+03

simulation pressure (katm) 0.0000E+00

selected number of timesteps 8000

equilibration period 500

data printing interval 80

statistics file interval 80

simulation timestep 5.0000E-04

Nose-Hoover (Melchionna) isotropic N-P-T thermostat relaxation time 1.0000E-01 barostat relaxation time 5.0000E-01

trajectory file option on

trajectory file start 1 trajectory file interval 80 trajectory file info key 2 ...

Then after a while there is the actual data but it is in this funny form:


step eng_tot temp_tot eng_cfg eng_vdw eng_cou eng_bnd > eng_ang eng_dih eng_tet time(ps) eng_pv temp_rot vir_cfg vir_vdw vir_cou vir_bnd >vir_ang vir_con vir_tet cpu (s) volume temp_shl eng_shl vir_shl alpha beta >gamma vir_pmf press


1 -1.1289E+05 1.4750E+03 -1.1386E+05 1.7276E+04 -1.3114E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 0.0 -1.1545E+05 0.0000E+00 9.6539E+03 -1.2118E+05 1.3083E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 0.8 5.3733E+04 1.2367E+02 0.0000E+00 0.0000E+00 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 -7.5549E+01

rolling -1.1289E+05 1.4750E+03 -1.1386E+05 1.7276E+04 -1.3114E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 averages -1.1545E+05 0.0000E+00 9.6539E+03 -1.2118E+05 1.3083E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 5.3733E+04 1.2367E+02 0.0000E+00 0.0000E+00 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 -7.5549E+01


80 -1.1290E+05 1.5021E+03 -1.1392E+05 2.1894E+04 -1.3726E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 0.0 -1.1256E+05 0.0000E+00 8.6671E+02 -1.3974E+05 1.3707E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 10.6 5.3149E+04 1.1377E+03 1.4419E+03 3.5382E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 1.1119E+01

rolling -1.1290E+05 1.6145E+03 -1.1398E+05 2.0750E+04 -1.3588E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 averages -1.1333E+05 0.0000E+00 3.3694E+03 -1.3512E+05 1.3565E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 5.3481E+04 1.0997E+03 1.1430E+03 2.8391E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 -1.2096E+01


160 -1.1287E+05 1.2629E+03 -1.1376E+05 2.1450E+04 -1.3633E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 0.1 -1.1249E+05 0.0000E+00 3.8761E+02 -1.3824E+05 1.3612E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 20.5 5.3375E+04 4.9015E+02 1.1243E+03 2.5052E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 1.2676E+01

rolling -1.1288E+05 1.4677E+03 -1.1389E+05 2.1589E+04 -1.3663E+05 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 averages -1.1235E+05 0.0000E+00 2.1147E+02 -1.3884E+05 1.3643E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 5.3152E+04 7.4818E+02 1.1440E+03 2.6211E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 1.7174E+01


On the 9th data interval there is a slight anomaly:


switching off temperature scaling at step 500


 560 -1.1287E+05  1.4709E+03 -1.1390E+05  2.1600E+04 -1.3678E+05  0.0000E+00  >0.0000E+00  0.0000E+00  0.0000E+00
 0.3 -1.1292E+05  0.0000E+00  1.9253E+03 -1.3743E+05  1.3656E+05  0.0000E+00  >0.0000E+00  0.0000E+00  0.0000E+00
68.4  5.4300E+04  1.5043E+02  1.2775E+03  2.7947E+03  5.6396E+01  5.6396E+01  >5.6396E+01  0.0000E+00  2.0576E-01

rolling -1.1286E+05 1.4784E+03 -1.1390E+05 2.1546E+04 -1.3673E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 averages -1.1298E+05 0.0000E+00 2.1361E+03 -1.3717E+05 1.3651E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 5.4303E+04 2.2261E+02 1.2785E+03 2.8027E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 -1.7421E+00



As you can see there is a pair of '----' lines which may interfere with proper parsing of the data.

Let's say I want to get just the 'eng_tot' data from this file (the bolded numbers). How would I go about doing that in Python? The number is always in the same place in the file (second quantity, first row after the second set of ----s).

By the way, the header part with all the definitions in it repeats every 8 steps, except the first step, in which there are 9 lines. I'd like to just ignore the first step. For now, let's say I want to start with line 295 inclusive. Just so you know, I'm quite new to Python and programming in general, so all the help you can provide is appreciated.

Here's the code I tried, but Eng_Total is still an empty set:

import re
import inspect

def lineno():
    """Returns the current line number"""
    linenum = inspect.currentframe().f_back.f_lineno
infile =  open('FilePath/OUTPUT.01').read()
Eng_Total = []
for line in infile:
#    if 'eng_tot' in line.split(): 
     if re.match("\s+-+\s+", line):
    lineno(line)
        line = linenum+1
        sanitized_line = line[8:]
        eng_total = line.split()[0]
        Eng_Total.append(eng_total)
print Eng_Total


I'd probably do this:

  • iterate over lines in the output
  • search for one containing eng_tot:
    • if 'eng_tot' in line.split(): process_blocks
  • gobble up lines until one matches all dashes (with optional spaces on either side)
    • if re.match(r"\s*-+\s*$", line): process_metrics_block
  • process the first line of metrics:
    • cut the first column off the line (it makes it harder to parse, because it might not be there)
      • sanitized_line = line[8:]
      • eng_total = sanitized_line.split()[0], the first column is now eng_total
  • skip lines until you reach another line of dashes, then start again
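
The steps in this list can also be sketched as a single pass over the lines, using two flags instead of helper functions. This is only a minimal sketch: the function name extract_eng_tot, the flag names, and the shortened sample are made up for illustration, and the layout (dashed separator lines around the header and before each data block) is an assumption based on the description in the question. Instead of slicing off a fixed-width first column, it keeps only rows whose first field is an integer step number.

```python
import re

# A separator is a line made only of dashes, with optional surrounding whitespace.
SEPARATOR = re.compile(r"\s*-+\s*$")

def extract_eng_tot(lines):
    """Return eng_tot (2nd field) from the first data row after each separator."""
    totals = []
    seen_header = False   # True once the line naming 'eng_tot' has been read
    armed = False         # True after a separator, until a data row is consumed
    for line in lines:
        fields = line.split()
        if not seen_header:
            seen_header = 'eng_tot' in fields
        elif SEPARATOR.match(line):
            armed = True
        elif armed and len(fields) > 1 and fields[0].isdigit():
            # Assumed guard: data rows start with an integer step number,
            # so repeated headers and status messages are skipped.
            totals.append(float(fields[1]))
            armed = False
    return totals

# Hypothetical, heavily shortened excerpt in the assumed layout.
sample = """\
 SIMULATION CONTROL PARAMETERS
 ----------
     step     eng_tot    temp_tot
 time(ps)      eng_pv    temp_rot
 ----------
        1 -1.1289E+05  1.4750E+03
      0.0 -1.1545E+05  0.0000E+00

  rolling -1.1289E+05  1.4750E+03
 averages -1.1545E+05  0.0000E+00
 ----------
 switching off temperature scaling at step 500
 ----------
       80 -1.1290E+05  1.5021E+03
      0.0 -1.1256E+05  0.0000E+00
""".splitlines()

print(extract_eng_tot(sample))  # [-112890.0, -112900.0]
```

To ignore the first step, as the question asks, slice the result: extract_eng_tot(lines)[1:].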

After seeing your edits:

  • You need to import the re (regular expression) module at the top of the file: import re
  • process_blocks and process_metrics_block were pseudocode. They don't exist unless you define them. :) You don't need those exact functions; you can avoid them by using basic looping (while) and conditional (if) statements.
  • You'll have to make sure you understand what you're doing, not just copy from Stack Overflow! :)

It looks like you're trying to do something like this. It seems to work, but I'm sure with some effort, you can come up with something nicer:

import re

def find_header(lines):
  """Return the index of the first line that names the eng_tot column."""
  for (i, line) in enumerate(lines):
    if 'eng_tot' in line.split():
      return i
  return None

def find_next_separator(lines, start):
  """Return the index of the first all-dashes line after index start, or None."""
  for (i, line) in enumerate(lines[start+1:]):
    # Anchor with $ so a row that merely begins with '-' is not a separator.
    if re.match(r"\s*-+\s*$", line):
      return i + start + 1
  return None

if __name__ == '__main__':
  totals = []
  with open('so.txt') as f:
    lines = f.readlines()

  # Assumes the header exists; find_header returns None otherwise.
  header = find_header(lines)
  start = find_next_separator(lines, header+1)

  while True:
    end = find_next_separator(lines, start+1)
    if end is None: break

    # Pull out block, after line of dashes.
    metrics_block = lines[start+1:end]

    # Pull out 2nd column from 1st line of metrics (kept as a string).
    eng_total = metrics_block[0].split()[1]
    totals.append(eng_total)

    start = end

  print(totals)

You can use a generator to be a little more pythonic:

def metric_block_iter(lines):
  start = find_next_separator(lines, find_header(lines)+1)
  while True:
    end = find_next_separator(lines, start+1)
    if end is None: break
    yield (start, end)
    start = end


if __name__ == '__main__':
  totals = []
  with open('so.txt') as f:
    lines = f.readlines()

  for (start, end) in metric_block_iter(lines):
    # Pull out block, after line of dashes.
    metrics_block = lines[start+1:end]

    # Pull out 2nd column from 1st line of metrics.
    eng_total = metrics_block[0].split()[1]
    totals.append(eng_total)

  print(totals)


You're going to need to define the file format explicitly, and then you should be able to parse that easily.

The first step is figuring out where the data you need is defined. Then throw away everything up to that point. Then start reading.
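
One way to throw away everything before the data is itertools.dropwhile, which skips items until a condition first fails and then yields the rest. A minimal sketch, assuming the first line that contains eng_tot as a whitespace-separated field marks the start of the useful part; skip_preamble and the short list of lines are hypothetical names and data used only for illustration:

```python
from itertools import dropwhile

def skip_preamble(lines, marker='eng_tot'):
    """Lazily drop every line before the first one that has marker as a field."""
    return dropwhile(lambda line: marker not in line.split(), lines)

# Hypothetical stand-in for the lines of the output file.
lines = [
    "DL_POLY Version 2.20",
    "simulation temperature 1.4500E+03",
    "step eng_tot temp_tot",
    "1 -1.1289E+05 1.4750E+03",
]
for line in skip_preamble(lines):
    print(line)
```

Because dropwhile is lazy, the same function should also work when passed an open file object instead of a list, so the whole file need not be read into memory first.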

If the eng_tot can move, you need to figure out where in the block of useful data it is. So, read a line, entries = line.split(); location = entries.index('eng_tot'), then read the entry at that location in the corresponding line of the output data.
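
The lookup by column name can be sketched as follows. This is an illustration only: column_value is a hypothetical helper, and the header and row strings are shortened versions of the lines shown in the question.

```python
def column_value(header_line, data_line, name='eng_tot'):
    """Return, as a float, the field of data_line that sits under name."""
    entries = header_line.split()
    location = entries.index(name)  # raises ValueError if name is absent
    return float(data_line.split()[location])

header = "step eng_tot temp_tot eng_cfg"
row = "80 -1.1290E+05 1.5021E+03 -1.1392E+05"
print(column_value(header, row))             # -112900.0
print(column_value(header, row, 'eng_cfg'))  # -113920.0
```

Note that this relies on every column name being a single whitespace-free token. A name such as "cpu (s)" splits into two entries, which would shift the positions of the names after it on that header line.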

The key is that you need to break down your problem into steps that you know you can do. When looking at something new it's easy to get overwhelmed. If you can just start doing something, you'll find that you can reach the solution without too much trouble after all.
