Python: Removing characters from beginnings of sequences in fasta format

Posted: 2022-12-09 22:56

I have sequences in FASTA format that contain 17 bp primers at the beginning of each sequence, and the primers sometimes have mismatches. I therefore want to remove the first 17 characters of each sequence, leaving the FASTA header untouched.

The sequences look like this:

> name_name_number_etc
SEQUENCEFOLLOWSHERE
> name_number_etc
SEQUENCEFOLLOWSHERE
> name_name_number_etc
SEQUENCEFOLLOWSHERE

How can I do this in python?

Thanks! Jon

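If I understand correctly, you need to remove the primer, i.e. the first 17 characters, from each record, and a record's sequence may span several lines. That makes the task a bit harder than it looks: a simple line-by-line solution exists, but it can fail in some situations, for example if it trims every sequence line instead of only the first one of each record.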

My suggestion is: use Biopython to perform the parsing of the FASTA file. Adapted from the tutorial (updated for Python 3):

from Bio import SeqIO

# Parse the FASTA file and print basic information about each record
with open("ls_orchid.fasta") as handle:
    for seq_record in SeqIO.parse(handle, "fasta"):
        print(seq_record.id)
        print(repr(seq_record.seq))
        print(len(seq_record))

Then write each sequence back out with the first 17 letters removed. I don't have Biopython installed on my current machine, but if you take a look at the tutorial, it shouldn't take more than 15 lines of code in total.

If you want to go hardcore, and do it manually, you have to do something like this (from the first poster, modified)

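with open('sequence.fsa') as f:
    # True while the next line is the first sequence line of a record
    first_line = False
    for line in f:
        if line.startswith('>'):
            # Header line: print unchanged and flag the next line for trimming
            first_line = True
            print(line, end='')
        else:
            if first_line:
                # First sequence line of the record: drop the 17-character primer
                print(line[17:], end='')
            else:
                # Continuation line: print unchanged
                print(line, end='')
            first_line = False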
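
The loop above assumes the whole primer sits on the first sequence line of each record. A rough sketch of a variant that does not depend on line wrapping is shown here: it joins all sequence lines of a record, trims the first 17 characters, and re-wraps the result. The names trim_fasta and _emit, the parameters n and width, and the default wrap width of 60 are made up for illustration and are not from the thread.

```python
def _emit(header, chunks, n, width):
    """Yield one record: the header, then the trimmed sequence re-wrapped."""
    yield header
    sequence = "".join(chunks)[n:]  # drop the first n characters (the primer)
    for start in range(0, len(sequence), width):
        yield sequence[start:start + width]


def trim_fasta(lines, n=17, width=60):
    """Yield FASTA lines with the first n characters of each record removed.

    n=17 is the primer length from the question; width=60 is an assumed
    wrap width. Blank lines and sequence lines that appear before the
    first header are skipped.
    """
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield from _emit(header, chunks, n, width)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    # Emit the last record in the input
    if header is not None:
        yield from _emit(header, chunks, n, width)


# Small demo: a 24-character sequence wrapped at 10 characters per line
demo = [">seq1\n", "ACGTACGTAC\n", "GTACGTACGT\n", "TTTT\n"]
print("\n".join(trim_fasta(demo, n=17, width=10)))
# prints:
# >seq1
# CGTTTTT
```

Since trim_fasta accepts any iterable of lines, an open file object can be passed directly, e.g. trim_fasta(open('sequence.fsa')), and the yielded lines written to a new file.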

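with open('fasta_file') as f:
    for line in f:
        if line.startswith('>'):
            # Header lines are kept unchanged
            print(line, end='')
        else:
            # Drop the first 17 characters of each sequence line
            print(line[17:], end='')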

If your file looks like

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

and you want to remove the first 17 chars of every sequence line, you want to do something like this:

with open('sequence.txt') as f:
    for line in f:
        # Skip header lines; print every sequence line without its first 17 characters
        if not line.startswith('>'):
            print(line.strip()[17:])

I don't know if posting on this thread is pointless, but I came across a method that really helped me out when I started working with .fasta files.

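path = input('Input your fasta file: ')
with open(path) as fasta:
    lines = fasta.readlines()

# Skip the first line of the file (the header of the first record)
lines = lines[1:]

for line in lines:
    pass  # do something with each remaining line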
