
How to read this file using Python?

I have a DNA file in the following format:

>gi|5524211|gb|AAD44166.1| cytochrome
ACCAGAGCGGCACAGCAGCGACATCAGCACTAGCACTAGCATCAGCATCAGCATCAGC
CTACATCATCACAGCAGCATCAGCATCGACATCAGCATCAGCATCAGCATCGACGACT
ACACCCCCCCCGGTGTGTGTGGGGGGTTAAAAATGATGAGTGATGAGTGAGTTGTGTG
CTACATCATCACAGCAGCATCAGCATCGACATCAGCATCAGCATCAGCATCGACGACT
TTCTATCATCATTCGGCGGGGGGATATATTATAGCGCGCGATTATTGCGCAGTCTACG
TCATCGACTACGATCAGCATCAGCATCAGCATCAGCATCGACTAGCATCAGCTACGAC

How do I read this file and extract the DNA sequence part (ACCAGAGCGG...) without any newlines, for example:

ACCAGAGCGGCACAGCAGCGACATCAGCACTAGCACTAGCATCAGCATCAGCATCAGCCTACATCATCACAGCAGCATCA

Maybe regex isn't needed?


If there's always exactly one header line:

dnalines = text.split('\n')[1:]
dna = ''.join(dnalines)

Here text is the contents of your file (for example, text = open('yourfile').read()).
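The same idea can be wrapped in a small function that also closes the file when done. This is only a sketch: the function name read_single_fasta and the path argument are made up for illustration, and it assumes the file holds exactly one header line followed by sequence lines.

```python
def read_single_fasta(path):
    """Return the sequence from a file with one header line, newlines removed."""
    # The with-block closes the file even if reading fails.
    with open(path) as handle:
        # splitlines() drops the line endings and leaves no empty trailing entry.
        lines = handle.read().splitlines()
    # lines[0] is the '>' header; everything after it is sequence data.
    return ''.join(lines[1:])
```

Calling read_single_fasta('yourfile') should then give the sequence as one long string.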


I did some tests, and it appears that the following is more efficient than delroth's answer:

text.split('\n', 1)[1].replace('\n', '')
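As a quick sanity check, the two expressions can be run side by side on a shortened copy of the sample from the question. The variable names via_join and via_replace are made up for this sketch.

```python
# First two sequence lines of the sample file, preceded by its header.
text = (
    ">gi|5524211|gb|AAD44166.1| cytochrome\n"
    "ACCAGAGCGGCACAGCAGCGACATCAGCACTAGCACTAGCATCAGCATCAGCATCAGC\n"
    "CTACATCATCACAGCAGCATCAGCATCGACATCAGCATCAGCATCAGCATCGACGACT\n"
)

# delroth's version: split on every newline, drop the header, join the rest.
via_join = ''.join(text.split('\n')[1:])

# This answer's version: split off the header only, then delete the newlines.
via_replace = text.split('\n', 1)[1].replace('\n', '')

assert via_join == via_replace
assert '\n' not in via_replace
```

One difference to keep in mind: if text contains no newline at all, the second expression raises IndexError, while the first returns an empty string.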

Edit: wait, it's not so simple. I timed both methods, twice, using Python 2.6.4 and 3.1.1, on an ~30MB file:

  • Python 2.6.4, my version:

    $ python -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
    10 loops, best of 3: 221 msec per loop
    $ python -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
    10 loops, best of 3: 219 msec per loop
    
  • Python 2.6.4, delroth's version:

    $ python -m timeit -c "''.join(open('x').read().split('\n')[1:])"
    10 loops, best of 3: 392 msec per loop
    $ python -m timeit -c "''.join(open('x').read().split('\n')[1:])"
    10 loops, best of 3: 390 msec per loop
    
  • Python 3.1.1, my version:

    $ python3 -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
    10 loops, best of 3: 803 msec per loop
    $ python3 -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
    10 loops, best of 3: 798 msec per loop
    
  • Python 3.1.1, delroth's version:

    $ python3 -m timeit -c "''.join(open('x').read().split('\n')[1:])"
    10 loops, best of 3: 610 msec per loop
    $ python3 -m timeit -c "''.join(open('x').read().split('\n')[1:])"
    10 loops, best of 3: 610 msec per loop
    

Conclusion: in these tests Python 3.1.1 is much slower than Python 2.6.4, and which of the two snippets is faster depends on the Python version!
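Since the result depends on the interpreter, it may be worth repeating the comparison on your own version. Below is a rough sketch using the timeit module on a synthetic in-memory string instead of a real file, so it measures only the string handling and not the file read. The names join_version and replace_version and the test data are made up for illustration.

```python
import timeit

# Synthetic input: one header line, then 20000 sequence lines of 60 characters.
text = ">example header\n" + ("ACGT" * 15 + "\n") * 20000


def join_version():
    # Split on every newline, drop the header, join the remaining lines.
    return ''.join(text.split('\n')[1:])


def replace_version():
    # Split off the header only, then remove the remaining newlines.
    return text.split('\n', 1)[1].replace('\n', '')


# Both approaches should produce the same sequence.
assert join_version() == replace_version()

for name, func in (("join", join_version), ("replace", replace_version)):
    # Best of 3 runs, 10 calls per run, mirroring the command-line timings.
    best = min(timeit.repeat(func, number=10, repeat=3))
    print("%s: %.1f msec per loop" % (name, best / 10 * 1000))
```

The absolute numbers will differ from the ones above, since the input is smaller and nothing is read from disk.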
