Puzzling dictionary matching problem

I've been stuck on this problem for a while and am hoping someone can help. I am trying to iterate through one column, row[1], of a CSV file called transcripts_test.csv and, for each string in row[1], find the same string in a dictionary called OCR_dict that I build from another CSV file called coors_test.csv.

transcripts_test.csv contains:

ENST00000347869,chr3,50126341,50156454,1    
ENST00000452166,chr14,21679063,21737583,2  
ENST00000452166,chr14,21679063,21737583,2  

coors_test.csv contains:

chr3,141030221,141031065,Valid_10009,1000,+  
chr6,141030221,141031065,Valid_10005,1000,+  
chr14,141047080,141047610,Valid_10006,1000,+  

This is my code:

import csv

with open('coors_test.csv', mode='r') as coors_infile:
    coors_reader = csv.reader(coors_infile)
    for row in coors_reader:
            chromo = row[0]
            start = row[1]
            end = row[2]
            coordinates_list = [chromo,start,end]   
            OCR_dict = {row[3]:coordinates_list}
            for keys,values in OCR_dict.items():
                OCR_chromosome = values[0]
    with open('transcripts_test.csv', mode='r') as transcripts_infile:
        transcripts_reader = csv.reader(transcripts_infile)
        for row in transcripts_reader:
            transcript_chromosome = row[1]
            if transcript_chromosome == OCR_chromosome:
                print(transcript_chromosome, keys, OCR_chromosome)

When I execute the code above, the output I get is:

chr14 Valid_10006 chr14  
chr14 Valid_10006 chr14  

The output I am looking for is:

chr3 Valid_10009 chr3  
chr14 Valid_10006 chr14  
chr14 Valid_10006 chr14  

Why doesn't my code match and print chr3 Valid_10009 chr3? Any help would be greatly appreciated. Thanks!


This is not what you want:

        coordinates_list = [chromo,start,end]   
        OCR_dict = {row[3]:coordinates_list}
        for keys,values in OCR_dict.items():
            OCR_chromosome = values[0]

It creates a new dict on every iteration, and that dict has only a single key. You then loop over that one item and overwrite a variable, so once the loop over the rows has finished only the values from the last row remain.
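To see the effect in isolation, here is a minimal sketch that reproduces the loop from the question. The list `rows` is an assumption standing in for the csv reader (it holds the three lines of coors_test.csv) so the snippet runs without any files:

```python
# Rows from coors_test.csv, inlined so the sketch runs without the file.
rows = [
    ["chr3", "141030221", "141031065", "Valid_10009", "1000", "+"],
    ["chr6", "141030221", "141031065", "Valid_10005", "1000", "+"],
    ["chr14", "141047080", "141047610", "Valid_10006", "1000", "+"],
]

for row in rows:
    # Rebinding the name discards the dict built in the previous iteration.
    OCR_dict = {row[3]: [row[0], row[1], row[2]]}
    for keys, values in OCR_dict.items():
        OCR_chromosome = values[0]

# After the loop, only the last row survives.
print(OCR_dict)        # {'Valid_10006': ['chr14', '141047080', '141047610']}
print(OCR_chromosome)  # chr14
```

Because the transcripts file is only read after this loop has finished, every comparison is made against chr14.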

What you want is probably more like this:

import csv
from collections import defaultdict

OCR_dict = defaultdict(list)

with open('coors_test.csv', mode='r') as coors_infile:
    coors_reader = csv.reader(coors_infile)
    for row in coors_reader:
        chromo = row[0]
        start = row[1]
        end = row[2] 
        # OCR_dict is a mapping `chromo -> [(start,end), (start,end), ...]`
        OCR_dict[chromo].append((start,end))

with open('transcripts_test.csv', mode='r') as transcripts_infile:
    transcripts_reader = csv.reader(transcripts_infile)
    for row in transcripts_reader:
        transcript_chromosome = row[1]
        # look that chromosome up in the dict and print it if it exists
        if transcript_chromosome in OCR_dict:
            print(transcript_chromosome, OCR_dict[transcript_chromosome])
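If you also need the Valid_... name in the output, as in the expected output from the question, one possible variation is to store the names per chromosome instead of the coordinates. This is only a sketch: the function name `match_chromosomes` and the dict name `names_by_chromo` are made up for illustration, and the sample data is inlined with io.StringIO so it runs without the two files.

```python
import csv
import io
from collections import defaultdict

# Sample data from the question, inlined so the sketch runs without files.
COORS = """chr3,141030221,141031065,Valid_10009,1000,+
chr6,141030221,141031065,Valid_10005,1000,+
chr14,141047080,141047610,Valid_10006,1000,+
"""
TRANSCRIPTS = """ENST00000347869,chr3,50126341,50156454,1
ENST00000452166,chr14,21679063,21737583,2
ENST00000452166,chr14,21679063,21737583,2
"""


def match_chromosomes(coors_file, transcripts_file):
    """Return (transcript_chromosome, name, OCR_chromosome) tuples."""
    # Mapping `chromo -> [name, name, ...]`, built once from the coors data.
    names_by_chromo = defaultdict(list)
    for row in csv.reader(coors_file):
        names_by_chromo[row[0]].append(row[3])

    matches = []
    for row in csv.reader(transcripts_file):
        chromo = row[1]
        # .get() avoids adding empty entries to the defaultdict on a miss.
        for name in names_by_chromo.get(chromo, []):
            matches.append((chromo, name, chromo))
    return matches


matches = match_chromosomes(io.StringIO(COORS), io.StringIO(TRANSCRIPTS))
for match in matches:
    print(*match)
```

With the real files you would pass the two open file objects instead of the StringIO wrappers. For the sample data this prints the three lines the question asks for.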


OCR_chromosome is set to the last value of chromo that's encountered. In other words, OCR_chromosome will be the first value in the last row of coors_test.csv. chr14 will be the only value that can be matched. I'm not positive what exactly you're going for, but this should produce the chromo values you're looking for:

import csv

chromos = set()
with open('coors_test.csv', mode='r') as coors_infile:
    for row in csv.reader(coors_infile):
        # collect every chromosome name that appears in coors_test.csv
        chromos.add(row[0])

with open('transcripts_test.csv', mode='r') as transcripts_infile:
    for row in csv.reader(transcripts_infile):
        transcript_chromosome = row[1]
        if transcript_chromosome in chromos:
            print(transcript_chromosome)