Python: Indexing a file that is tab delimited

2023-01-05 22:34 Q&A author:

I have a text file that is tab delimited and looks like:

1_0 NP_045689 100.00 279 0 0 18 296 18 296 3e-156 539

1_0 NP_045688 54.83 259 108 6 45 296 17 273 2e-61 224

I need to parse out specific columns such as column 2.

I've tried with the code below:

z = open('output.blast', 'r')
for line in z.readlines():
    for col in line:
        print col[1]
z.close()

But I get an index out of range error.

z = open('output.blast', 'r')
for line in z.readlines():
    # Split the line into its tab-separated fields
    cols = line.split('\t')
    print cols[1]
z.close()

You need to split() the line on tab characters first.
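As a minimal sketch of the split approach, assuming Python 3 (the code in this thread is Python 2) and a hypothetical helper named get_column that is not part of the original answer, the idea looks roughly like this:

```python
def get_column(lines, index):
    """Yield the field at position `index` from each tab-delimited line."""
    for line in lines:
        # Drop the trailing newline, then split on tab characters.
        fields = line.rstrip('\n').split('\t')
        if len(fields) > index:  # skip blank or short lines
            yield fields[index]


# Sample lines standing in for the contents of the file.
sample = [
    "1_0\tNP_045689\t100.00\t279\n",
    "1_0\tNP_045688\t54.83\t259\n",
]
ids = list(get_column(sample, 1))
print(ids)
```

An open file object can be passed in place of the sample list, since iterating over a file also yields one line at a time.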

Alternatively, you could use Python's csv module in tab-delimiters mode.

Check out the csv module. That should help you a lot if you plan on doing more stuff with your tab-delimited files, too. One nice thing is that you can assign names to the various columns.
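One way to assign names to columns is csv.DictReader. The sketch below assumes Python 3 and uses made-up column names (query, subject, identity, length) on a shortened four-column sample; neither the names nor the sample layout come from the original file.

```python
import csv
import io

# Assumed column names, chosen only for illustration.
FIELDNAMES = ['query', 'subject', 'identity', 'length']

# Shortened sample data with real tab characters between fields.
text = "1_0\tNP_045689\t100.00\t279\n1_0\tNP_045688\t54.83\t259\n"

reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDNAMES, delimiter='\t')
# Each row is a mapping from column name to the string value in that column.
subjects = [row['subject'] for row in reader]
print(subjects)
```

Every value comes back as a string, so numeric columns still need an explicit float() or int() conversion.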

import csv, StringIO
# Fields are separated by tab characters, written here as \t escapes
text = """1_0\tNP_045689\t100.00\t279\t0\t0\t18\t296\t18\t296\t3e-156\t539
1_0\tNP_045688\t54.83\t259\t108\t6\t45\t296\t17\t273\t2e-61\t224"""

f = csv.reader(StringIO.StringIO(text), delimiter='\t')
for row in f:
    print row[1]

Two things to note:

The delimiter argument to csv.reader tells the csv module how to split each line. Check the other arguments to csv.reader for more options (e.g. quotechar).

I use StringIO to wrap the example text as a file object; you don't need that if you are reading from an actual file.

For example:

f=csv.reader(open('./test.csv'),delimiter='\t')

This has already been answered, but I thought I'd share the use of namedtuples for this sort of situation, since they allow pleasant object.attribute style access.

from collections import namedtuple
import csv
# Declare one field name per column in the file; _make fails if the counts differ
rec = namedtuple('rec', 'col1, col2, col3, col4, col5')
for r in map(rec._make, csv.reader(open("myfile.tab", "rb"), delimiter='\t')):
    print r.col2, r.col5

See the Python collections documentation for more details.
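For readers on Python 3, a rough equivalent is sketched below. The field names and the four-column sample are assumptions made for illustration. With a real file in Python 3, the csv documentation recommends opening it in text mode with newline='' rather than "rb".

```python
import csv
import io
from collections import namedtuple

# Hypothetical field names; one per column in the sample data.
Hit = namedtuple('Hit', 'query subject identity length')

text = "1_0\tNP_045689\t100.00\t279\n1_0\tNP_045688\t54.83\t259\n"

# _make builds a Hit from each row (a list of strings) returned by csv.reader.
hits = [Hit._make(row) for row in csv.reader(io.StringIO(text), delimiter='\t')]
for hit in hits:
    print(hit.subject, hit.identity)
```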

This is why your code is going wrong:

for col in line:

will iterate over every CHARACTER in the line.

    print col[1]

A character is a string of length 1, so col[1] is always going to give an index out of range error.
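A small sketch, assuming Python 3 and a made-up sample line, shows the same failure in isolation:

```python
line = "1_0\tNP_045689\t100.00"

# Iterating over a string yields one-character strings.
chars = list(line)
print(chars[:3])

try:
    chars[0][1]  # a one-character string has no index 1
except IndexError as exc:
    error = str(exc)
    print(error)
```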

As others have said, you either need to split the line on the TAB character '\t', or use the csv module, which will correctly handle quoted fields that may contain tabs or newlines.
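To illustrate the difference on quoted fields, here is a sketch assuming Python 3 and an invented record whose second field contains a tab inside double quotes:

```python
import csv
import io

# One record whose second field is quoted and contains a tab character.
line = 'a\t"b\tc"\td\n'

# A plain split breaks the quoted field in two and keeps the quote marks.
naive = line.rstrip('\n').split('\t')

# csv.reader treats the quoted text as a single field.
parsed = next(csv.reader(io.StringIO(line), delimiter='\t'))

print(naive)
print(parsed)
```

If the file never contains quoted fields, as is likely for the output shown in the question, both approaches give the same result.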

I also recommend avoiding readlines(): it reads the entire file into memory, which may cause problems if the file is very large. You can iterate over the open file one line at a time instead:

z = open('output.blast', 'r')
for line in z:
    ...
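A fuller sketch of the line-by-line loop, assuming Python 3, might look like the following. It writes a small temporary file so that it runs on its own. The with statement is an addition not used in the answer above; it closes the file automatically.

```python
import os
import tempfile

# Write a small sample file so the sketch is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.blast', delete=False) as tmp:
    tmp.write("1_0\tNP_045689\t100.00\n1_0\tNP_045688\t54.83\n")
    path = tmp.name

second_column = []
# The with block closes the file even if an error occurs inside it.
with open(path) as z:
    for line in z:  # reads one line at a time
        second_column.append(line.rstrip('\n').split('\t')[1])

os.remove(path)
print(second_column)
```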
