Python: Remove rows containing characters other than A, C, G, or T
I would like to remove a row from my file if it contains a letter other than A, C, G, or T. For example, ['TC', 'CY', 'GS', 'GA', 'CT'] should become ['TC', 'GA', 'CT'].
The files will have an unknown number of rows, and each row will contain 2 or more letters in any order. In addition, I do not know in advance which other letters are present (Y, S, or something else).
How would I go about setting up a program for this, preferably in Python? I can already import my file and read the rows.
Thanks!
You can solve it with a simple regular expression and a list comprehension.
>>> import re
>>> data = ['TC', 'CY', 'GS', 'GA', 'CT']
>>> [x for x in data if re.match(r'^[ACGT]+$', x)]
['TC', 'GA', 'CT']
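One thing to watch with this pattern: `$` also matches just before a trailing newline, so a row read straight from a file, such as 'TC\n', would still pass. Below is a minimal sketch that uses re.fullmatch (available since Python 3.4) so the whole string has to match. The names VALID_ROW and keep_valid are made up for illustration.

```python
import re

# Compile once; a row must consist of one or more of A, C, G, T and nothing else.
VALID_ROW = re.compile(r'[ACGT]+')

def keep_valid(rows):
    """Return only the rows made up entirely of A, C, G, and T (hypothetical helper)."""
    # fullmatch succeeds only if the pattern covers the entire string,
    # so a trailing newline or any other character rejects the row.
    return [row for row in rows if VALID_ROW.fullmatch(row)]

data = ['TC', 'CY', 'GS', 'GA', 'CT']
print(keep_valid(data))
```

If your rows still carry their line endings, strip them first (for example with row.strip()) before filtering.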
How about this, as a one liner:
valid = [l.strip() for l in fh if all(c in 'ACGT' for c in l.strip())]
where fh is your file handle.
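A slightly longer sketch of the same idea, in case it helps: each line is stripped only once, and blank lines are skipped explicitly (all() is true for an empty string, so a blank line would otherwise pass the check). The function name valid_rows and the file name in the comment are assumptions, and io.StringIO stands in for a real file handle here.

```python
import io

ALLOWED = set('ACGT')

def valid_rows(lines):
    """Return stripped rows that are non-empty and contain only A, C, G, T."""
    # Strip each line once, lazily, before testing it.
    rows = (line.strip() for line in lines)
    return [row for row in rows if row and all(c in ALLOWED for c in row)]

# io.StringIO stands in for an open file. With a real file you might write:
#     with open('sequences.txt') as fh:   # hypothetical file name
#         valid = valid_rows(fh)
fh = io.StringIO('TC\nCY\n\nGS\nGA\nCT\n')
print(valid_rows(fh))
```

Because valid_rows accepts any iterable of strings, it works on a file handle and on a plain list alike.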
A compact one-liner, though a little slow because set("ACGT") is rebuilt for every item (you can avoid that by assigning set("ACGT") to a variable beforehand):
>>> l
['TC', 'CY', 'GS', 'GA', 'CT']
>>> [i for i in l if not set(i) - set("ACGT")]
['TC', 'GA', 'CT']
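For what it's worth, here is a sketch of the variant with the allowed set built once up front. It uses the subset operator `<=` instead of set difference, which expresses the same test. The names ALLOWED and only_allowed are made up for illustration.

```python
# Build the allowed alphabet once instead of on every iteration.
ALLOWED = frozenset('ACGT')

def only_allowed(items):
    """Return the items whose characters all belong to ALLOWED (hypothetical helper)."""
    # set(item) <= ALLOWED is True when set(item) is a subset of ALLOWED,
    # which is equivalent to `not set(item) - ALLOWED`.
    return [item for item in items if set(item) <= ALLOWED]

l = ['TC', 'CY', 'GS', 'GA', 'CT']
print(only_allowed(l))
```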