Python: Remove rows based on lack of character

I would like to remove a row from my file if it contains a letter other than A, C, G, or T, so that ['TC', 'CY', 'GS', 'GA', 'CT'] becomes ['TC', 'GA', 'CT'].

The files will have an unknown number of rows, and each row will contain a pattern of 2 or more letters in any order. In addition, I do not know which other letters are present (Y, S, or something else).

How would I go about setting up a program for this, preferably in Python? I can already import my file and read the rows.

Thanks!


You can solve it with a simple regular expression and a list comprehension.

>>> import re
>>> data = ['TC', 'CY', 'GS', 'GA', 'CT']
>>> [x for x in data if re.match(r'^[ACGT]+$', x)]
['TC', 'GA', 'CT']
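
A possible variation, assuming Python 3.4 or later: re.fullmatch requires the whole string to match, so the ^ and $ anchors are not needed. It also rejects a row that still carries its trailing newline, which the $ anchor would tolerate. The name VALID_ROW is only illustrative.

```python
import re

# Precompile the pattern once; VALID_ROW is a hypothetical name for this sketch.
VALID_ROW = re.compile(r'[ACGT]+')

data = ['TC', 'CY', 'GS', 'GA', 'CT']

# Keep only the rows that consist entirely of A, C, G, or T.
valid = [x for x in data if VALID_ROW.fullmatch(x)]
print(valid)  # ['TC', 'GA', 'CT']
```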


How about this, as a one liner:

valid = [l.strip() for l in fh if all(c in 'ACGT' for c in l.strip())]

where fh is your file handle.
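
One thing to watch for: all(...) is true for an empty string, so a blank line in the file would come through as ''. Here is a rough sketch of the same idea that skips blank rows, assuming one pattern per line. The names keep_valid_rows and VALID_BASES are made up for this example, and io.StringIO stands in for a real file handle.

```python
import io

# Allowed letters, taken from the question; VALID_BASES is a hypothetical name.
VALID_BASES = frozenset("ACGT")

def keep_valid_rows(lines):
    """Return stripped rows made up only of A, C, G, or T, skipping blank rows."""
    kept = []
    for line in lines:
        row = line.strip()
        # An empty row would pass the subset check, so exclude it explicitly.
        if row and VALID_BASES.issuperset(row):
            kept.append(row)
    return kept

# io.StringIO plays the role of an open file here.
fh = io.StringIO("TC\nCY\nGS\n\nGA\nCT\n")
print(keep_valid_rows(fh))  # ['TC', 'GA', 'CT']
```

With a real file this would be something like `with open(path) as fh: valid = keep_valid_rows(fh)`, where path is whatever file you are already reading.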


A slightly slower one-liner, because each row is converted to a set (you can reduce the cost by building set("ACGT") once beforehand), but a short one:

>>> l
['TC', 'CY', 'GS', 'GA', 'CT']
>>> [i for i in l if not set(i) - set("ACGT")]
['TC', 'GA', 'CT']
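
For illustration, here is roughly what building the set once could look like. The subset test `<=` is equivalent to checking that the set difference is empty. The names allowed and rows are just placeholders.

```python
# Build the set of allowed letters once, instead of once per row.
allowed = set("ACGT")

rows = ['TC', 'CY', 'GS', 'GA', 'CT']

# set(row) <= allowed is True when every letter in the row is allowed.
valid = [row for row in rows if set(row) <= allowed]
print(valid)  # ['TC', 'GA', 'CT']
```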