Extract specific lines from file and create sections of data in python

I'm trying to write a Python script to extract lines from a file. The file is a text dump of Python suds output.

I want to:

  1. strip all characters except words and numbers. I don't want any "\n", "[", "]", "{", "=", etc characters.
  2. find a section where it starts with "ArrayOf_xsd_string"
  3. remove the next line "item[] =" from the result
  4. grab the remaining 6 lines and create a dictionary based on the unique number on the fifth line (123456, 234567, 345678) using this number as the key and the remaining lines as the values (pardon my ignorance if I'm not explaining this in pythonic terminology)
  5. output the results to a file

Data in file is a list:

[(ArrayOf_xsd_string){
   item[] = 
      "001",
      "ABCD",
      "1234",
      "wordy type stuff",
      "123456",
      "more stuff, etc",
 }, (ArrayOf_xsd_string){
   item[] = 
      "002",
      "ABCD",
      "1234",
      "wordy type stuff",
      "234567",
      "more stuff, etc",
 }, (ArrayOf_xsd_string){
   item[] = 
      "003",
      "ABCD",
      "1234",
      "wordy type stuff",
      "345678",
      "more stuff, etc",
 }]

I tried doing a re.compile and here is my poor attempt at the code:

import re, string

f = open('data.txt', 'rb')
linelist = []
for line in f:
  line = re.compile('[\W_]+')
 line.sub('', string.printable)
 linelist.append(line)
 print linelist

newlines = []
for line in linelist:
    mylines = line.split()
    if re.search(r'\w+', 'ArrayOf_xsd_string'):
      newlines.append([next(linelist) for _ in range(6)])
      print newlines

I'm a Python newbie and haven't found any results on Google or Stack Overflow for how to extract a specific number of lines after finding specific text. Any help is most appreciated.

Please ignore my code as I am taking "shots in the dark" :)

Here is what I'd like to see as the results:

123456: 001,ABCD,1234,wordy type stuff,more stuff etc
234567: 002,ABCD,1234,wordy type stuff,more stuff etc
345678: 003,ABCD,1234,wordy type stuff,more stuff etc

I hope that helps with trying to interpret my flawed code.


Several suggestions on your code:

Stripping all non-alphanumeric characters is unnecessary and wastes time; there is no need to build linelist at all. Are you aware you can simply test each line with plain old line.find("ArrayOf_xsd_string") or re.search(...)?

  1. strip all characters except words and numbers. I don't want any "\n", "[", "]", "{", "=", etc characters.
  2. find a section where it starts with "ArrayOf_xsd_string"
  3. remove the next line "item[] =" from the result
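To make the point about detecting the marker concrete, here is a small sketch of three equivalent tests on a single line. The sample line is copied from the question's data; the variable names are only for illustration.

```python
import re

marker = 'ArrayOf_xsd_string'
line = ' }, (ArrayOf_xsd_string){\n'   # a line taken from the question's dump

found_in = marker in line                    # plain membership test
found_find = line.find(marker) >= 0          # find() returns -1 when absent
# re.escape is not strictly needed for this marker, but it is a safe habit
found_re = re.search(re.escape(marker), line) is not None

other = '   item[] = \n'
not_found = marker in other                  # False for a non-marker line
```

For a fixed piece of text like this, the `in` test is usually the simplest choice; a regex only pays off once the pattern itself varies.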

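Then as to your regex: _ counts as a word character, so \W alone does not match it, and [\W_] is the right class if you really want to drop underscores too (which would also turn ArrayOf_xsd_string into ArrayOfxsdstring). The real problem is that the following reassignment to line overwrites the line you just read: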

for line in f:
  line = re.compile('[\W_]+') # overwrites the line you just read??
  line.sub('', string.printable)
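If you did want the stripping step, a minimal sketch of one way to write it is to compile the pattern once, outside the loop, and keep the result of sub() for each line. The names non_alnum and clean are made up for this example, and the sample lines are copied from the question.

```python
import re

# [\W_]+ matches runs of anything that is not a letter or a digit.
# The underscore is listed explicitly because \w (and so "not \W") includes it.
non_alnum = re.compile(r'[\W_]+')

def clean(line):
    """Return line with every run of non-alphanumeric characters removed."""
    return non_alnum.sub('', line)

sample = ['   item[] = \n', '      "001",\n', '      "wordy type stuff",\n']
cleaned = [clean(line) for line in sample]
```

Note that this also removes the spaces inside "wordy type stuff", which is one more reason to strip only the quotes and commas instead.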

Here's my version, which reads the file directly, and also handles multiple matches:

with open('data.txt', 'r') as f:
    theDict = {}
    found = -1
    for (lineno,line) in enumerate(f):
        if found < 0:
            if line.find('ArrayOf_xsd_string')>=0:
                found = lineno
                entries = []
            continue
        # Grab following 6 lines...
        if 2 <= (lineno-found) <= 6+1:
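            # strip() removes the indent and trailing newline first,
            # then the second strip removes the quotes and the comma
            entry = line.strip().strip('",')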
            entries.append(entry)
        #then create a dict with the key from line 5
        if (lineno-found) == 6+1:
            key = entries.pop(4)
            theDict[key] = entries
            print key, ','.join(entries) # comma-separated, no quotes
            #break # if you want to end on first match
            found = -1 # to process multiple matches

The output is what you asked for, apart from the colon after the key and the comma kept inside the last field (the ','.join(entries) call produces the comma-separated values):

123456 001,ABCD,1234,wordy type stuff,more stuff, etc
234567 002,ABCD,1234,wordy type stuff,more stuff, etc
345678 003,ABCD,1234,wordy type stuff,more stuff, etc
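The question also asks for the results to go to a file, with a colon after the key. A possible sketch of that last step follows, in Python 3 syntax. The helper name write_records is hypothetical, the dictionary is sample data in the same shape as theDict, and io.StringIO stands in for a real file so the example runs anywhere.

```python
import io

def write_records(records, out):
    """Write one 'key: v1,v2,...' line per record to the file-like object out."""
    for key in sorted(records):              # sorted for a stable order
        out.write('%s: %s\n' % (key, ','.join(records[key])))

# Sample data shaped like theDict above (values copied from the question).
theDict = {
    '234567': ['002', 'ABCD', '1234', 'wordy type stuff', 'more stuff, etc'],
    '123456': ['001', 'ABCD', '1234', 'wordy type stuff', 'more stuff, etc'],
}

buffer = io.StringIO()                       # in-memory stand-in for a file
write_records(theDict, buffer)
text = buffer.getvalue()
```

To write to disk instead, something like `with open('results.txt', 'w') as out: write_records(theDict, out)` should work, where results.txt is just a placeholder name.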


If you want to extract a specific number of lines after a line that matches, you may as well read the file into a list with readlines(), loop through it to find the match, then take the next N lines from the list. Alternatively, you could use a while loop with readline(), which is preferable if the files can get large.
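As a rough sketch of the "find a match, then take the next N lines" idea, here is a variant that works on any iterable of lines (a list from readlines() or the open file itself), so a large file does not have to be loaded at once. The function name lines_after and its skip parameter are my own invention, and it uses itertools.islice rather than the while/readline loop mentioned above.

```python
from itertools import islice

def lines_after(lines, marker, count, skip=0):
    """For each line containing marker, yield a list of the count lines
    that follow it, after skipping skip lines directly below the marker."""
    it = iter(lines)
    for line in it:
        if marker in line:
            # islice consumes skip + count lines from the shared iterator
            yield list(islice(it, skip, skip + count))

# Shortened sample in the same layout as the question's data.
sample = [
    '[(ArrayOf_xsd_string){\n',
    '   item[] = \n',
    '      "001",\n',
    '      "ABCD",\n',
    ' }, (ArrayOf_xsd_string){\n',
    '   item[] = \n',
    '      "002",\n',
    '      "ABCD",\n',
    ' }]\n',
]
groups = list(lines_after(sample, 'ArrayOf_xsd_string', count=2, skip=1))
```

For the real file you would pass count=6 and skip=1 to drop the "item[] =" line and keep the six data lines.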

The following is the most straightforward fix to your code I can think of, but it's not necessarily the best overall implementation. I suggest following my tips above unless you have good reasons not to, or just want to get the job done ASAP by hook or by crook ;)

newlines = []
# linelist must hold the raw lines of the file, e.g. linelist = f.readlines()
for i in range(len(linelist)):
    # test the current line, not the literal marker string
    if 'ArrayOf_xsd_string' in linelist[i]:
        # skip the "item[] =" line, then take the 6 data lines
        for l in linelist[i+2:i+8]:
            newlines.append(l)
print newlines

This should do what you want, if I have interpreted your requirements properly. It says: for each matching line, skip the next line and take the 6 lines after it (indexes i+2 up to, but not including, i+8), appending them to newlines one at a time. Calling append with the whole slice would add that list as a single element of newlines; newlines.extend(...) would add its items individually.

Have fun and good luck :)


Let's have some fun with iterators!

class SudsIterator(object):
    """extracts xsd strings from suds text file, and returns a 
    (key, (value1, value2, ...)) tuple with key being the 5th field"""
    def __init__(self, filename):
        self.data_file = open(filename)
    def __enter__(self):  # __enter__ and __exit__ are there to support
        return self       # `with SudsIterator(filename) as blah` syntax
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.data_file.close()
    def __iter__(self):
        return self
    def next(self):     # in Python 3+ this should be __next__
        """looks for the next 'ArrayOf_xsd_string' item and returns it as a
        tuple fit for stuffing into a dict"""
        data = self.data_file
        for line in data:
            if 'ArrayOf_xsd_string' not in line:
                continue
            ignore = next(data)
            val1 = next(data).strip()[1:-2] # discard beginning whitespace,
            val2 = next(data).strip()[1:-2] #   quotes, and comma
            val3 = next(data).strip()[1:-2]
            val4 = next(data).strip()[1:-2]
            key = next(data).strip()[1:-2]
            val5 = next(data).strip()[1:-2]
            break
        else:
            self.data_file.close() # make sure file gets closed
            raise StopIteration()  # no more records: end the iteration
        return key, (val1, val2, val3, val4, val5)

data = dict()
for key, value in SudsIterator('data.txt'):
    data[key] = value

print data
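If the class feels heavy, a generator function can express the same idea in fewer lines. This is only a sketch of that alternative, not a drop-in replacement: the name iter_suds_records and the fields and key_index parameters are made up here, and it takes any iterable of lines instead of opening the file itself.

```python
def iter_suds_records(lines, fields=6, key_index=4):
    """Yield (key, values) for each 'ArrayOf_xsd_string' block in lines."""
    it = iter(lines)
    for line in it:
        if 'ArrayOf_xsd_string' not in line:
            continue
        next(it, None)                       # skip the 'item[] =' line
        # next(it, '') yields '' instead of raising if the data is cut short
        values = [next(it, '').strip().strip('",') for _ in range(fields)]
        key = values.pop(key_index)          # the 5th field becomes the key
        yield key, tuple(values)

# Two records in the same layout as the question's data.
sample = [
    '[(ArrayOf_xsd_string){\n',
    '   item[] = \n',
    '      "001",\n',
    '      "ABCD",\n',
    '      "1234",\n',
    '      "wordy type stuff",\n',
    '      "123456",\n',
    '      "more stuff, etc",\n',
    ' }, (ArrayOf_xsd_string){\n',
    '   item[] = \n',
    '      "002",\n',
    '      "ABCD",\n',
    '      "1234",\n',
    '      "wordy type stuff",\n',
    '      "234567",\n',
    '      "more stuff, etc",\n',
    ' }]\n',
]
data = dict(iter_suds_records(sample))
```

With a real file this would be used as `with open('data.txt') as f: data = dict(iter_suds_records(f))`, which leaves the closing of the file to the with statement.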