How to extract data from an irregularly formatted data file in python
I need to extract certain data from a file, but this file is formatted to be read by humans, and is therefore irregular. First off there is a large amount of text before any of the data actually begins:
DL_POLY Version 2.20 Running on 10 nodes
*************** DLPOLY: LiNbO3 >***************
SIMULATION CONTROL PARAMETERS
simulation temperature 1.4500E+03
simulation pressure (katm) 0.0000E+00
selected number of timesteps 8000
equilibration period 500
data printing interval 80
statistics file interval 80
simulation timestep 5.0000E-04
Nose-Hoover (Melchionna) isotropic N-P-T thermostat relaxation time 1.0000E-01 barostat relaxation time 5.0000E-01
trajectory file option on
trajectory file start 1 trajectory file interval 80 trajectory file info key 2 ...
Then after a while there is the actual data but it is in this funny form:
step eng_tot temp_tot eng_cfg eng_vdw eng_cou eng_bnd > eng_ang eng_dih eng_tet time(ps) eng_pv temp_rot vir_cfg vir_vdw vir_cou vir_bnd >vir_ang vir_con vir_tet cpu (s) volume temp_shl eng_shl vir_shl alpha beta >gamma vir_pmf press
1 -1.1289E+05 1.4750E+03 -1.1386E+05 1.7276E+04 -1.3114E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 0.0 -1.1545E+05 0.0000E+00 9.6539E+03 -1.2118E+05 1.3083E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 0.8 5.3733E+04 1.2367E+02 0.0000E+00 0.0000E+00 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 -7.5549E+01
rolling -1.1289E+05 1.4750E+03 -1.1386E+05 1.7276E+04 -1.3114E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 averages -1.1545E+05 0.0000E+00 9.6539E+03 -1.2118E+05 1.3083E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 5.3733E+04 1.2367E+02 0.0000E+00 0.0000E+00 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 -7.5549E+01
80 -1.1290E+05 1.5021E+03 -1.1392E+05 2.1894E+04 -1.3726E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 0.0 -1.1256E+05 0.0000E+00 8.6671E+02 -1.3974E+05 1.3707E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 10.6 5.3149E+04 1.1377E+03 1.4419E+03 3.5382E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 1.1119E+01
rolling -1.1290E+05 1.6145E+03 -1.1398E+05 2.0750E+04 -1.3588E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 averages -1.1333E+05 0.0000E+00 3.3694E+03 -1.3512E+05 1.3565E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 5.3481E+04 1.0997E+03 1.1430E+03 2.8391E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 -1.2096E+01
160 -1.1287E+05 1.2629E+03 -1.1376E+05 2.1450E+04 -1.3633E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 0.1 -1.1249E+05 0.0000E+00 3.8761E+02 -1.3824E+05 1.3612E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 20.5 5.3375E+04 4.9015E+02 1.1243E+03 2.5052E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 1.2676E+01
rolling -1.1288E+05 1.4677E+03 -1.1389E+05 2.1589E+04 -1.3663E+05 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 averages -1.1235E+05 0.0000E+00 2.1147E+02 -1.3884E+05 1.3643E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 5.3152E+04 7.4818E+02 1.1440E+03 2.6211E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 1.7174E+01
On the 9th data interval there is a slight anomaly:
switching off temperature scaling at step 500
560 -1.1287E+05 1.4709E+03 -1.1390E+05 2.1600E+04 -1.3678E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 0.3 -1.1292E+05 0.0000E+00 1.9253E+03 -1.3743E+05 1.3656E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 68.4 5.4300E+04 1.5043E+02 1.2775E+03 2.7947E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 2.0576E-01
rolling -1.1286E+05 1.4784E+03 -1.1390E+05 2.1546E+04 -1.3673E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 averages -1.1298E+05 0.0000E+00 2.1361E+03 -1.3717E+05 1.3651E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00 5.4303E+04 2.2261E+02 1.2785E+03 2.8027E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 -1.7421E+00
As you can see, there is a pair of '----' lines which may interfere with proper parsing of the data.
Let's say I want to get just the 'eng_tot' data from this file (the bolded numbers); how would I go about doing that in Python? The number is always in the same place in the file (second quantity, first row after the second set of ----s).
By the way, the header part with all the definitions in it repeats every 8 steps, except the first step, in which there are 9 lines. I'd like to just ignore the first step. For now, let's say I want to start with line 295, inclusive. Just so you know, I'm quite new to Python and programming in general, so any help you can provide is appreciated.
Here's the code I tried, but Eng_Total is still an empty set:
import re
import inspect

def lineno():
    """Returns the current line number"""
    linenum = inspect.currentframe().f_back.f_lineno

infile = open('FilePath/OUTPUT.01').read()

Eng_Total = []
for line in infile:
#    if 'eng_tot' in line.split():
    if re.match("\s+-+\s+", line):
        lineno(line)
        line = linenum+1
        sanitized_line = line[8:]
        eng_total = line.split()[0]
        Eng_Total.append(eng_total)
print Eng_Total
I'd probably do this:
- iterate over lines in the output
- search for one containing eng_tot: if 'eng_tot' in line.split(): process_blocks
- gobble up lines until one matches all dashes (with optional spaces on either side): if re.match("\s+-+\s+", line): process_metrics_block
- process the first line of metrics:
  - cut the first column off the line (it makes it harder to parse, because it might not be there): sanitized_line = line[8:]
  - eng_total = line.split()[0], the first column is now eng_total
- skip lines until you reach another line of dashes, then start again
After seeing your edits:
- You need to import the re (regular expression) module at the top of the file: import re
- The process_blocks and process_metrics_block were pseudo-code; those don't exist unless you define them. :) You don't need those functions exactly; you can avoid them using basic looping (while) and conditional (if) statements. A rough sketch follows this list.
- You'll have to make sure you understand what you're doing, not just copy from Stack Overflow! :)
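To make that concrete, here is a rough, untested sketch of the same idea as a single loop with a couple of flags instead of helper functions. The file name 'OUTPUT.01' and the "step number, then eng_tot" column order are assumptions based on the sample you posted:

import re

eng_totals = []
seen_header = False     # have we passed the line of column names containing 'eng_tot'?
want_metrics = False    # are we waiting for the first data row of a block?

with open('OUTPUT.01') as f:
    for line in f:
        parts = line.split()
        if 'eng_tot' in parts:
            # Column-name line; the data rows come after the next line of dashes.
            seen_header = True
            want_metrics = False
        elif seen_header and re.match(r"\s*-+\s*$", line):
            want_metrics = True
        elif want_metrics and parts and parts[0].isdigit():
            # First row of a metrics block: column 1 is the step, column 2 is eng_tot.
            eng_totals.append(parts[1])
            want_metrics = False    # ignore the rest of the block until the next dashes

print(eng_totals)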
It looks like you're trying to do something like this. It seems to work, but I'm sure with some effort, you can come up with something nicer:
import re

def find_header(lines):
    for (i, line) in enumerate(lines):
        if 'eng_tot' in line.split():
            return i
    return None

def find_next_separator(lines, start):
    for (i, line) in enumerate(lines[start+1:]):
        if re.match("\s*-+\s*", line):
            return i + start + 1
    return None

if __name__ == '__main__':
    totals = []
    lines = open('so.txt').readlines()

    header = find_header(lines)
    start = find_next_separator(lines, header+1)
    while True:
        end = find_next_separator(lines, start+1)
        if end is None: break

        # Pull out block, after line of dashes.
        metrics_block = lines[start+1:end]

        # Pull out 2nd column from 1st line of metrics.
        eng_total = metrics_block[0].split()[1]
        totals.append(eng_total)

        start = end

    print totals
You can use a generator to be a little more pythonic:
def metric_block_iter(lines):
    start = find_next_separator(lines, find_header(lines)+1)

    while True:
        end = find_next_separator(lines, start+1)
        if end is None: break

        yield (start, end)
        start = end

if __name__ == '__main__':
    totals = []
    lines = open('so.txt').readlines()

    for (start, end) in metric_block_iter(lines):
        # Pull out block, after line of dashes.
        metrics_block = lines[start+1:end]

        # Pull out 2nd column from 1st line of metrics.
        eng_total = metrics_block[0].split()[1]
        totals.append(eng_total)

    print totals
You're going to need to define the file format explicitly, and then you should be able to parse that easily.
The first step is figuring out where the data you need is defined. Then throw away everything up to that point. Then start reading.
If the eng_tot column can move, you need to figure out where in the block of useful data it is. So, read a line, entries = line.split(); location = entries.index('eng_tot'), then read the entry out of that location in the associated line in the output data.
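For example, here is a rough sketch of that lookup. The path 'OUTPUT.01' is a placeholder, and it assumes data rows start with an integer step number so they can be told apart from header and 'rolling' lines:

def read_column(path, name):
    values = []
    location = None
    with open(path) as f:
        for line in f:
            entries = line.split()
            if not entries:
                continue
            if name in entries:
                # Header line: remember which column the name is in.
                location = entries.index(name)
            elif location is not None and entries[0].isdigit():
                # Data row: the first token is the step number, so the
                # remaining columns line up with the header names.
                values.append(float(entries[location]))
    return values

print(read_column('OUTPUT.01', 'eng_tot'))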
The key is that you need to break down your problem into steps that you know you can do. When looking at something new it's easy to get overwhelmed. If you can just start doing something, you'll find that you can reach the solution without too much trouble after all.