Using Python to extract a few lines from a data file
I have a large file with an enormous amount of data in it. I need to extract 3 lines every 5000 or so lines. The format of the data file is as follows:
...
O_sh 9215 1.000000 -2.304400
-1.0680E+00 1.3617E+00 -5.7138E+00
O_sh 9216 1.000000 -2.304400
-8.1186E-01 -1.7454E+00 -5.8169E+00
timestep 501 9216 0 3 0.000500
20.54 -11.85 35.64
0.6224E-02 23.71 35.64
-20.54 -11.86 35.64
Li 1 6.941000 0.843200
3.7609E-02 1.1179E-01 4.1032E+00
Li 2 6.941000 0.843200
6.6451E-02 -1.3648E-01 1.0918E+01
...
What I need is the three lines after the line that starts with "timestep", so in this case I need the 3x3 array:
20.54 -11.85 35.64
0.6224E-02 23.71 35.64
-20.54 -11.86 35.64
in an output file for each time the word "timestep" appears.
Then I need the average of all those arrays in just one array. Just one array consisting of the average value of each element in the same position in every array for the whole file. I've been working on this for a while, but I haven't been able to extract the data correctly yet.
Thanks so much, and this is not for homework. Your advice will be helping the progress of science! =)
Thanks,
Assuming this is not homework, I think regex is overkill for the problem. If you know that you need three lines after one starts with 'timestep' why not approach the problem this way:
Matrices = []
with open('data.txt') as fh:
    for line in fh:
        # If we see timestep put the next three lines in our Matrices list.
        if line.startswith('timestep'):
            Matrices.append([next(fh) for _ in range(3)])
Per the comments: use next(fh) here so that the file iterator stays in sync when you pull the next three lines from it. Thanks!
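The list built above still holds raw text lines, so the averaging step remains to be done. A minimal sketch of one way to finish the job, assuming every "timestep" line is followed by three complete rows of three numbers; the function name average_blocks and the in-memory sample are made up for illustration and stand in for the real data file:

```python
import io

def average_blocks(fh):
    """Return the element-wise mean of every 3x3 block that follows a 'timestep' line."""
    totals = [[0.0] * 3 for _ in range(3)]
    count = 0
    for line in fh:
        if line.startswith('timestep'):
            # next(fh) keeps the iterator in sync; float() parses 0.6224E-02 directly
            block = [[float(x) for x in next(fh).split()] for _ in range(3)]
            for r in range(3):
                for c in range(3):
                    totals[r][c] += block[r][c]
            count += 1
    if count == 0:
        raise ValueError("no 'timestep' lines found")
    return [[v / count for v in row] for row in totals]

# Hypothetical sample standing in for data.txt
sample = io.StringIO(
    "timestep 501 9216 0 3 0.000500\n"
    "20.54 -11.85 35.64\n"
    "0.6224E-02 23.71 35.64\n"
    "-20.54 -11.86 35.64\n"
    "Li 1 6.941000 0.843200\n"
    "3.7609E-02 1.1179E-01 4.1032E+00\n"
)
avg = average_blocks(sample)
print(avg)
```

With a real file, the same function could be called as average_blocks(open('data.txt')), ideally inside a with block so the file is closed afterwards.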
I'd suggest using a coroutine (which is basically a generator that can accept values, if you are unfamiliar) to keep a running average as you iterate over your file.
def running_avg():
    # total instead of sum, so the built-in sum() is not shadowed
    count, total = 0, 0.0
    value = yield None
    while True:
        # compare against None so that a legitimate 0.0 is still counted
        if value is not None:
            total += value
            count += 1
        value = yield (total / count)

# array for keeping running average
array = [[running_avg() for y in range(3)] for x in range(3)]

# advance to first yield before we begin
[[next(elem) for elem in row] for row in array]

with open('data.txt') as f:
    idx = None
    for line in f:
        if idx is not None and idx < 3:
            for i, elem in enumerate(line.strip().split()):
                array[idx][i].send(float(elem))
            idx += 1
        if line.startswith('timestep'):
            idx = 0
To convert array into a list of averages, just call next() on each coroutine; it returns the current average:
averages = [[next(elem) for elem in row] for row in array]
And you'd get something like:
averages = [[20.54, -11.85, 35.64], [0.006224, 23.71, 35.64], [-20.54, -11.86, 35.64]]
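For anyone unfamiliar with coroutines, here is a minimal, self-contained sketch of how such a running-average coroutine behaves when fed values by hand. The sample numbers are made up for illustration and are not taken from the data file:

```python
def running_avg():
    """Coroutine: send() a number, get back the mean of everything sent so far."""
    total, count = 0.0, 0
    value = yield None  # priming yield; nothing to report yet
    while True:
        if value is not None:  # a bare next() sends None and leaves the mean unchanged
            total += value
            count += 1
        value = yield total / count

avg = running_avg()
next(avg)  # advance to the first yield before sending anything
results = [avg.send(x) for x in (2.0, 4.0, 9.0)]
print(results)  # [2.0, 3.0, 5.0]
```

Calling next(avg) after the values have been sent returns the current mean without changing it, which is how the list of averages is read back at the end.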
Okay, so you can do this:
Algorithm:
Read the file line by line
if the line starts with "timestep":
read the next three lines
take the average as needed
Code:
def getArrays(f):
    answer = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    count = 0
    line = f.readline()
    while line:
        if line.strip().startswith("timestep"):
            # the three lines after "timestep" form one 3x3 array
            rows = [getFloats(f.readline()) for _ in range(3)]
            # fold the new array into the running average, element by element
            for r in range(3):
                for c in range(3):
                    answer[r][c] = ((answer[r][c] * count) + rows[r][c]) / (count + 1)
            count += 1
        line = f.readline()
    return answer

def getFloats(line):
    # float() already understands scientific notation such as 0.6224E-02
    return [float(num) for num in line.split()]
answer is now a single 3x3 array holding the element-wise average of all the 3x3 arrays in the file. If you need a different kind of averaging, post the details and I can incorporate it into this algorithm. Otherwise, you can write a function that takes my array and computes the averages as required.
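On parsing values such as 0.6224E-02: Python's built-in float() accepts scientific notation directly, so a line can be converted in one pass without splitting on "E". A quick check using one row from the sample data:

```python
line = "0.6224E-02 23.71 35.64"

# float() handles both plain decimals and E-notation
values = [float(token) for token in line.split()]
print(values)  # [0.006224, 23.71, 35.64]
```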
Hope this helps
Building on inspectorG4dget's and g.d.d.c's posts, here's a version that should do the reading, parsing, and averaging. Please point out my bugs! :)
def averageArrays(filename):
    # initialize average variables then,
    # open the file and iterate through the lines until ...
    answer, count = [[0.0]*3 for _ in range(3)], 0
    with open(filename) as fh:
        for line in fh:
            if line.startswith('timestep'):  # ... we find 'timestep'!
                # so, we read the three lines and sanitize them
                # conversion to float happens here, which may be slow
                raw_mat = [next(fh).strip().split() for _ in range(3)]
                mat = []
                for row in raw_mat:
                    mat.append([float(item) for item in row])
                # now, update the running average, avoiding overflow as described at
                # http://invisibleblocks.wordpress.com/2008/07/30/long-running-averages-without-the-sum-of-preceding-values/
                # there are surely more pythonic ways to do this
                count += 1
                for r in range(3):
                    for c in range(3):
                        answer[r][c] += (mat[r][c] - answer[r][c]) / count
    return answer
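The update used above, answer += (x - answer) / count, is the incremental form of the arithmetic mean: each new value moves the mean by 1/count of the gap between them, so no running sum has to be stored. A small sketch, with made-up sample values, showing that it agrees with the plain sum/len mean up to rounding:

```python
import math

samples = [20.54, 80.80, -3.25, 41.0]  # made-up values for illustration

mean = 0.0
for count, x in enumerate(samples, start=1):
    # move the mean toward x by 1/count of the gap
    mean += (x - mean) / count

plain = sum(samples) / len(samples)
print(mean, plain)
print(math.isclose(mean, plain, rel_tol=1e-12))
```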
import re
text = '''O_sh 9215 1.000000 -2.304400
-1.0680E+00 1.3617E+00 -5.7138E+00
O_sh 9216 1.000000 -2.304400
-8.1186E-01 -1.7454E+00 -5.8169E+00
timestep 501 9216 0 3 0.000500
20.54 -11.85 35.64
0.6224E-02 23.71 35.64
-20.54 -11.86 35.64
Li 1 6.941000 0.843200
3.7609E-02 1.1179E-01 4.1032E+00
Li 2 6.941000 0.843200
6.6451E-02 -1.3648E-01 1.0918E+01
O_sh 9215 1.000000 -2.304400
-1.0680E+00 1.3617E+00 -5.7138E+00
O_sh 9216 1.000000 -2.304400
-8.1186E-01 -1.7454E+00 -5.8169E+00
timestep 501 9216 0 3 0.000500
80.80 -14580 42.28
7.5224E-01 777.1 42.28
140.54 -33.86 42.28
Li 1 6.941000 0.843200
3.7609E-02 1.1179E-01 4.1032E+00
Li 2 6.941000 0.843200
6.6451E-02 -1.3648E-01 1.0918E+01'''
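# one data line: three numeric fields separated by blanks or tabs
lin = r'\r?\n{0}*({1}+){0}+({1}+){0}+({1}+){0}*'
pat = ('^timestep.+' + 3*lin).format(r'[ \t]', r'[.\deE+-]')
regx = re.compile(pat, re.MULTILINE)

def moy(x):
    # mean of a sequence of numeric strings
    return sum(map(float, x)) / len(x)

# findall yields one 9-tuple per 'timestep' block; zip(*...) groups the
# values by position, so li holds the 9 position-wise means
li = [moy(col) for col in zip(*regx.findall(text))]
# regroup the 9 means into three rows of three
res = [tuple(li[i:i+3]) for i in range(0, len(li), 3)]
print(res)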
result
[(50.67, -7295.925, 38.96), (0.379232, 400.40500000000003, 38.96), (60.0, -22.86, 38.96)]