Puzzling dictionary matching problem
I've been stuck on this problem for a while and am hoping someone can help. I am trying to iterate through a column, row[1], in a CSV file called transcripts_test.csv and, for each string in row[1], find the same string in a dictionary called OCR_dict that I built from another CSV file called coors_test.csv.
transcripts_test.csv contains:
ENST00000347869,chr3,50126341,50156454,1
ENST00000452166,chr14,21679063,21737583,2
ENST00000452166,chr14,21679063,21737583,2
coors_test.csv contains:
chr3,141030221,141031065,Valid_10009,1000,+
chr6,141030221,141031065,Valid_10005,1000,+
chr14,141047080,141047610,Valid_10006,1000,+
This is my code:
    import csv
    with open('coors_test.csv', mode='r') as coors_infile:
        coors_reader = csv.reader(coors_infile)
        for row in coors_reader:
            chromo = row[0]
            start = row[1]
            end = row[2]
            coordinates_list = [chromo,start,end]
            OCR_dict = {row[3]:coordinates_list}
            for keys,values in OCR_dict.items():
                OCR_chromosome = values[0]
    with open('transcripts_test.csv', mode='r') as transcripts_infile:
        transcripts_reader = csv.reader(transcripts_infile)
        for row in transcripts_reader:
            transcript_chromosome = row[1]
            if transcript_chromosome == OCR_chromosome:
                print(transcript_chromosome, keys, OCR_chromosome)
When I execute the code above, the output I get is:
chr14 Valid_10006 chr14
chr14 Valid_10006 chr14
The output I am looking for is:
chr3 Valid_10009 chr3
chr14 Valid_10006 chr14
chr14 Valid_10006 chr14
Why doesn't my code match and print chr3 Valid_10009 chr3? Any help would be greatly appreciated. Thanks!
This is not what you want:
    coordinates_list = [chromo,start,end]
    OCR_dict = {row[3]:coordinates_list}
    for keys,values in OCR_dict.items():
        OCR_chromosome = values[0]
It creates a new dict on every iteration, and that dict has just a single key. You then loop over that one item and overwrite the same variable each time, so only the last row's value survives the loop.
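The effect can be reproduced without any files. The following is a minimal sketch that holds the three coors_test.csv rows from the question in an in-memory list (the names rows, rebuilt and accumulated are made up for illustration). It contrasts rebinding the dict on each pass with adding to one dict created before the loop.

```python
# Rows copied from coors_test.csv in the question, held in memory for illustration.
rows = [
    ["chr3", "141030221", "141031065", "Valid_10009", "1000", "+"],
    ["chr6", "141030221", "141031065", "Valid_10005", "1000", "+"],
    ["chr14", "141047080", "141047610", "Valid_10006", "1000", "+"],
]

# Pattern from the question: the name is rebound to a brand-new dict on each pass.
for row in rows:
    rebuilt = {row[3]: row[:3]}

# Alternative: create the dict once, then add one key per row.
accumulated = {}
for row in rows:
    accumulated[row[3]] = row[:3]

print(len(rebuilt))      # 1
print(list(rebuilt))     # ['Valid_10006']
print(len(accumulated))  # 3
```

Only the entry for the last row is left in rebuilt, which is consistent with chr14 being the only chromosome that matches in the question's output.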
What you want is probably more like this:
    import csv
    from collections import defaultdict

    OCR_dict = defaultdict(list)
    with open('coors_test.csv', mode='r') as coors_infile:
        coors_reader = csv.reader(coors_infile)
        for row in coors_reader:
            chromo = row[0]
            start = row[1]
            end = row[2]
            # OCR_dict is a mapping `chromo -> [(start,end), (start,end), ...]`
            OCR_dict[chromo].append((start,end))

    with open('transcripts_test.csv', mode='r') as transcripts_infile:
        transcripts_reader = csv.reader(transcripts_infile)
        for row in transcripts_reader:
            transcript_chromosome = row[1]
            # look that chromosome up in the dict and print it if it exists
            if transcript_chromosome in OCR_dict:
                print(transcript_chromosome, OCR_dict[transcript_chromosome])
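If the goal is the exact three-column output shown in the question (chromosome, region name, chromosome), one possible variation is to map each chromosome to the region names found on it. This is only a sketch under assumptions: the helper name match_regions is hypothetical, and io.StringIO stands in for the two CSV files so that the example runs on its own.

```python
import csv
import io
from collections import defaultdict

# Sample data copied from the question; io.StringIO stands in for the real files.
COORS = """chr3,141030221,141031065,Valid_10009,1000,+
chr6,141030221,141031065,Valid_10005,1000,+
chr14,141047080,141047610,Valid_10006,1000,+
"""
TRANSCRIPTS = """ENST00000347869,chr3,50126341,50156454,1
ENST00000452166,chr14,21679063,21737583,2
ENST00000452166,chr14,21679063,21737583,2
"""

def match_regions(coors_file, transcripts_file):
    """Return (chromosome, region_name) pairs, one per matching transcript row."""
    # Map chromosome -> list of region names seen on that chromosome.
    names_by_chromo = defaultdict(list)
    for row in csv.reader(coors_file):
        names_by_chromo[row[0]].append(row[3])

    matches = []
    for row in csv.reader(transcripts_file):
        chromo = row[1]
        # .get avoids creating empty entries for chromosomes with no regions.
        for name in names_by_chromo.get(chromo, []):
            matches.append((chromo, name))
    return matches

for chromo, name in match_regions(io.StringIO(COORS), io.StringIO(TRANSCRIPTS)):
    print(chromo, name, chromo)
```

With real files, the two io.StringIO objects would be replaced by the handles returned from open('coors_test.csv') and open('transcripts_test.csv').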
OCR_chromosome is set to the last value of chromo that's encountered. In other words, OCR_chromosome will be the first value in the last row of coors_test.csv, so chr14 will be the only value that can be matched. I'm not positive what exactly you're going for, but this should produce the chromo values you're looking for:
    import csv

    # collect every chromosome name that appears in coors_test.csv
    chromos = set()
    with open('coors_test.csv', mode='r') as coors_infile:
        for row in csv.reader(coors_infile):
            chromo = row[0]
            chromos.add(chromo)

    with open('transcripts_test.csv', mode='r') as transcripts_infile:
        for row in csv.reader(transcripts_infile):
            transcript_chromosome = row[1]
            # print only chromosomes that also occur in coors_test.csv
            if transcript_chromosome in chromos:
                print(transcript_chromosome)