
Count how many reads in a data file fall within an interval from a reference file (Python)

I am trying to count how many times a value from one file (one column) falls within an interval from another file (two columns).

I am completely stuck on how to map it.

I tried something like this:

for line in file1:
    if line[0] == line2[0] and line2[1] < line[1] < line2[2]:
        print(line)

I'm not sure if this is correct.

file 1:
elem1     39887
elem1     72111

file 2:
elem1     1     57898
elem1     57899 69887
elem2     69888 82111

In file1, elem1 is an element in my project. The value 39887 is the start coordinate.

In file2 elem1 is still an element in my project, but the values are start and end coordinates. File2 is only a reference file.

For every line in file2, I want to see if the "elem#"=="elem#" in file 1. If the elem# in file1 is equal to elem# in file2, then I want to continue in this loop and see if the corresponding value in file1 is between the start and end positions in file2.

For instance, in the first line of file1, elem1==elem1 in the first line of file2. Since they are equal, is 39887 between 1 and 57898? Yes it is, therefore count it. I need to do this for every line in file2.

In the end, I want to see how many elements are within each group of coordinates from file2.


Assuming your lines match up one-to-one (so you want to test whether the value on the first line of one file lies between the values on the first line of the other, second line to second line, etc.), you can zip the two files to iterate over them in step:

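with open(...) as interval_file, open(...) as value_file:
    # zip pairs line N of the interval file with line N of the value file
    for interval, value in zip(interval_file, value_file):
        elem, left, right = interval.split()
        value_elem, position = value.split()
        if elem == value_elem and int(left) <= int(position) <= int(right):
            pass  # do stuff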
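
To try the pairing without touching real files, here is a minimal sketch that runs the same idea on in-memory lines taken from the question's sample data. The name `count_paired_hits` is made up for illustration, and the code assumes whitespace-separated columns and inclusive bounds. Note that `zip` stops at the shorter input, so the third interval line is never examined.

```python
def count_paired_hits(interval_lines, value_lines):
    """Count line pairs where the value lies inside the same-numbered interval."""
    hits = 0
    # zip stops at the shorter input, so unmatched trailing lines are ignored
    for interval, value in zip(interval_lines, value_lines):
        elem, left, right = interval.split()
        value_elem, position = value.split()
        if elem == value_elem and int(left) <= int(position) <= int(right):
            hits += 1
    return hits


# Sample data from the question (file2 = intervals, file1 = values)
intervals = ["elem1 1 57898", "elem1 57899 69887", "elem2 69888 82111"]
values = ["elem1 39887", "elem1 72111"]
print(count_paired_hits(intervals, values))
```

If the two files are not guaranteed to line up row for row, this positional pairing is probably not what you want.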


Drop the concept of 'files' for a second and think about the data.

You have two groups of textual data, one that is one column and one that is two columns, correct? Assume for a second that you can work out how to separate the text into two columns. What you really have is three lists (after converting the strings to ints, let's say):

import random

c1 = [random.randint(0, 100) for i in range(100)]
c2 = [random.randint(0, 100) for i in range(100)]
c3 = [random.randint(0, 100) for i in range(100)]

If I understand correctly, you want to count the interval hits of the data in c1 against c2 and c3, correct? Now focus on what a 'hit' is. If you have 3 in c1, and you have [1,3,5,5,3,10] in c2, how many hits is that? Only the 3's? The interval between 1,3,5? Or the interval of 1,3,5,5,3? Or all of the above?

As a simple example, with the random int lists above, this prints every int in c1 that occurs in both c2 and c3:

for i in c1:
    if i in c2 and i in c3:
        print(i)

Once you further define what a 'hit' is, this basic structure will work. Once you have the basic data and the 'hit' structure working, then go back and deal with the files. Should be easy then.
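
For example, if a 'hit' means that a value lies between a start and an end taken from the same position in c2 and c3, a sketch could look like the following. Small fixed lists stand in for the random ones so the result is reproducible; `count_hits` is a hypothetical helper, and treating the bounds as inclusive is an assumption.

```python
def count_hits(values, starts, ends):
    """For each (start, end) pair, count the values with start <= value <= end."""
    return [sum(1 for v in values if lo <= v <= hi)
            for lo, hi in zip(starts, ends)]


c1 = [3, 7, 12, 40]   # values to test
c2 = [1, 10, 50]      # interval starts
c3 = [5, 45, 60]      # interval ends
print(count_hits(c1, c2, c3))
```

Swapping `<=` for `<` gives strict containment if the endpoints should not count.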

Edit: If I understand what you are trying to do (and that is a massive if), this is a framework:

with open("file2.txt") as interval_file:
    for interval_line in interval_file:
        elem, start, end = interval_line.split()
        # Rescan file1 for every interval in file2
        with open("file1.txt") as value_file:
            for value_line in value_file:
                value_elem, position = value_line.split()
                # Compare as ints; comparing the raw strings would be lexicographic
                if value_elem == elem and int(start) < int(position) < int(end):
                    print(interval_line)

Run against your sample data, the output is: elem1 1 57898

It is not clear to me whether you are trying to 1) compare the two files positionally, line by line, or 2) read each line of file 2 and compare it to each and every line of file 1. The example here does the latter.
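
If the goal is a count per interval rather than a printout, one possible sketch is to read file1 once, group its positions by element, and then tally each interval from file2. The name `count_reads_per_interval` is illustrative only, and inclusive bounds and whitespace-separated columns are assumptions. The demo uses in-memory lists built from the sample data in place of the two files.

```python
from collections import defaultdict


def count_reads_per_interval(value_lines, interval_lines):
    """Return a list of (elem, start, end, count) tuples, one per interval line."""
    # Group the read positions from file1 by element name
    positions = defaultdict(list)
    for line in value_lines:
        if line.strip():
            elem, position = line.split()
            positions[elem].append(int(position))

    counts = []
    for line in interval_lines:
        if not line.strip():
            continue  # skip blank lines
        elem, start, end = line.split()
        start, end = int(start), int(end)
        # Inclusive bounds are an assumption; use < for strict containment
        hits = sum(1 for p in positions.get(elem, []) if start <= p <= end)
        counts.append((elem, start, end, hits))
    return counts


# In-memory stand-ins for file1 and file2, taken from the sample data
file1 = ["elem1 39887", "elem1 72111"]
file2 = ["elem1 1 57898", "elem1 57899 69887", "elem2 69888 82111"]
for row in count_reads_per_interval(file1, file2):
    print(row)
```

With real files, pass the open file objects instead of the lists, since iterating over a file yields its lines. Reading file1 only once also avoids reopening it for every interval.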
