Python: Indexing a file that is tab delimited

I have a text file that is tab delimited and looks like:

1_0 NP_045689 100.00 279 0 0 18 296 18 296 3e-156 539

1_0 NP_045688 54.83 259 108 6 45 296 17 273 2e-61 224

I need to parse out specific columns such as column 2.

I've tried with the code below:

z = open('output.blast', 'r')
for line in z.readlines():
    for col in line:
        print col[1]
z.close()

But I get an index out of range error.


z = open('output.blast', 'r')
for line in z.readlines():
    # Split each line into its tab-separated fields.
    cols = line.split('\t')
    print(cols[1])
z.close()

You need to split() the line on tab characters first.
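
As a minimal sketch of the split approach, assuming the columns are separated by single tab characters: the helper name `column` and the sample lines below are made up for illustration. Stripping the line ending first matters when the column you want is the last one, since otherwise it keeps its trailing newline.

```python
def column(lines, index):
    """Return the field at position `index` (0-based) from each tab-delimited line."""
    values = []
    for line in lines:
        # Drop the line ending so the last field does not carry a newline.
        fields = line.rstrip('\r\n').split('\t')
        values.append(fields[index])
    return values


# Hypothetical sample data, shortened to four columns.
sample = [
    "1_0\tNP_045689\t100.00\t279\n",
    "1_0\tNP_045688\t54.83\t259\n",
]
print(column(sample, 1))   # second column
print(column(sample, 3))   # last column, without the newline
```

The same function works on an open file object, because iterating over a file yields one line at a time.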

Alternatively, you could use Python's csv module with a tab delimiter.


Check out the csv module. That should help you a lot if you plan on doing more stuff with your tab-delimited files, too. One nice thing is that you can assign names to the various columns.
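
One way to get named columns is `csv.DictReader` with an explicit `fieldnames` list. The sketch below assumes a twelve-column file like the one in the question. The column names are my own guesses for illustration and do not come from the original post, so rename them to suit your data.

```python
import csv
import io

# Hypothetical column names, one per field in the sample rows.
FIELDS = ['query', 'subject', 'identity', 'length', 'mismatches', 'gaps',
          'q_start', 'q_end', 's_start', 's_end', 'evalue', 'bitscore']

# Sample rows with explicit tab separators, wrapped as a file-like object.
text = ("1_0\tNP_045689\t100.00\t279\t0\t0\t18\t296\t18\t296\t3e-156\t539\n"
        "1_0\tNP_045688\t54.83\t259\t108\t6\t45\t296\t17\t273\t2e-61\t224\n")

reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS, delimiter='\t')

# Each row is a dict keyed by the names above.
subjects = [row['subject'] for row in reader]
print(subjects)
```

All values come back as strings, so numeric columns still need an explicit `float()` or `int()` conversion.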


import csv
import io

# Fields are separated by explicit tab characters.
text = """1_0\tNP_045689\t100.00\t279\t0\t0\t18\t296\t18\t296\t3e-156\t539
1_0\tNP_045688\t54.83\t259\t108\t6\t45\t296\t17\t273\t2e-61\t224"""

f = csv.reader(io.StringIO(text), delimiter='\t')
for row in f:
    print(row[1])

Two things of note:

The delimiter argument to csv.reader tells the csv module how to split each line of text. See the other arguments to csv.reader for more options (e.g. quotechar).

I use StringIO to wrap the example text as a file object; you don't need that if you are reading from an actual file.

For example:

f=csv.reader(open('./test.csv'),delimiter='\t')


This has already been answered, but I thought I'd share the use of namedtuples for this sort of situation, since they allow pleasant object.attribute style access.

from collections import namedtuple
import csv

# One field name per column; the count must match the columns in the file.
rec = namedtuple('rec', 'col1, col2, col3, col4, col5')
with open("myfile.tab", newline='') as f:
    for r in map(rec._make, csv.reader(f, delimiter='\t')):
        print(r.col2, r.col5)

See the Python collections documentation for more details.


This is why your code is going wrong:

for col in line:

will iterate over every CHARACTER in the line.

    print col[1]

A character is a string of length 1, so col[1] is always going to give an index out of range error.

As others have said, you either need to split the line on the TAB character '\t', or use the csv module, which will correctly handle quoted fields that may contain tabs or newlines.
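
To illustrate the difference on quoted fields, here is a small sketch using one made-up record whose middle field is quoted and contains a tab. It relies on the csv module's default quote character, the double quote.

```python
import csv
import io

# One made-up record; the second field is quoted and contains a tab.
line = 'id1\t"first\tsecond"\t42\n'

# Plain split() treats the tab inside the quotes as a separator.
naive = line.rstrip('\n').split('\t')

# csv.reader honours the quotes and keeps the field together.
parsed = next(csv.reader(io.StringIO(line), delimiter='\t'))

print(len(naive))    # number of pieces produced by split()
print(len(parsed))   # number of fields produced by csv.reader
print(parsed[1])
```

If your files never contain quoted fields, as is likely for the output shown in the question, the simpler split() approach is enough.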

I also recommend avoiding readlines(): it reads the entire file into memory, which may cause problems if the file is very large. You can iterate over the open file one line at a time instead:

z = open('output.blast', 'r')
for line in z:
    ...
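
A slightly fuller sketch of the same idea, assuming you want the second column: the `with` statement closes the file automatically, and the generator hands back one value per line without loading the whole file. The function name `second_column` and the throwaway temporary file are mine, added only so the example runs on its own.

```python
import os
import tempfile


def second_column(path):
    """Yield the second tab-delimited field of each line, one line at a time."""
    with open(path, 'r') as handle:      # closed automatically on exit
        for line in handle:              # reads lazily, not all at once
            yield line.rstrip('\r\n').split('\t')[1]


# Write a small throwaway file so the sketch can run on its own.
with tempfile.NamedTemporaryFile('w', suffix='.blast', delete=False) as tmp:
    tmp.write("1_0\tNP_045689\t100.00\n1_0\tNP_045688\t54.83\n")

result = list(second_column(tmp.name))
os.remove(tmp.name)
print(result)
```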