Count how many reads in a data file fall within an interval from a reference file (Python)
I am trying to count how many times a value in one file (one column) falls within an interval from another file (two columns).
I am completely stuck on how to map it.
I tried something like this:
    for line in file1:
        if line[0]=line2[0] and line2[1]<line[1]<line2[2]:
            print line
I'm not sure if this is correct.
file 1:
elem1 39887
elem1 72111
file 2:
elem1 1 57898
elem1 57899 69887
elem2 69888 82111
In file1, elem1 is an element in my project. The value 39887 is the start coordinate.
In file2 elem1 is still an element in my project, but the values are start and end coordinates. File2 is only a reference file.
For every line in file2, I want to see if the "elem#"=="elem#" in file 1. If the elem# in file1 is equal to elem# in file2, then I want to continue in this loop and see if the corresponding value in file1 is between the start and end positions in file2.
For instance, in the first line of file1, elem1==elem1 in the first line of file2. Since they are equal, is 39887 between 1 and 57898? Yes it is, therefore count it. I need to do this for every line in file2.
In the end, I want to see how many elements are within each group of coordinates from file2.
Assuming your lines match up one-to-one (so you want to test whether the value on the first line of one file lies between the values on the first line of the other, second line to second line, etc.), you can zip the two files to iterate over them in step:
    with open(...) as interval_file, open(...) as value_file:
        for interval, value in zip(interval_file, value_file):
            name, left, right = interval.split()
            value_name, position = value.split()
            if name == value_name and int(left) <= int(position) <= int(right):
                pass  # do stuff
Drop the concept of 'files' for a second and think about the data.
You have two groups of textual data, one that is one column and one that is two columns, correct? Assume for a second that you can work out how to separate the text into columns; what you really have is three lists (after converting the strings to ints, let's say):
    import random

    c1 = [random.randint(0,100) for i in range(100)]
    c2 = [random.randint(0,100) for i in range(100)]
    c3 = [random.randint(0,100) for i in range(100)]
If I understand, you want to count the interval hits of the data in c1 in c2 and c3, correct? Now focus on what a 'hit' is. If you have 3 in c1, and you have [1,3,5,5,3,10] in c2, how many hits is that? Only the 3's? The interval between 1,3,5? Or the interval of 1,3,5,5,3? Or all of the above?
As a simple example, with the random int lists above, this prints every int in c1 that occurs in both c2 and c3:
    for i in c1:
        if i in c2 and i in c3:
            print(i)
Once you further define what a 'hit' is, this basic structure will work. Once you have the basic data and the 'hit' structure working, then go back and deal with the files. Should be easy then.
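One possible definition of a 'hit', taken from the description in the question: a value counts for an interval when the element names match and the value lies between the start and end. Here is a minimal sketch on in-memory data, assuming inclusive bounds. The name count_hits and the tuple layout are only illustrative.

```python
def count_hits(values, intervals):
    """Count, for each interval, the values that fall inside it.

    values:    list of (name, position) tuples, as in file 1
    intervals: list of (name, start, end) tuples, as in file 2
    Bounds are treated as inclusive (an assumption).
    """
    counts = []
    for name, start, end in intervals:
        hits = sum(1 for value_name, pos in values
                   if value_name == name and start <= pos <= end)
        counts.append((name, start, end, hits))
    return counts


# Sample data from the question, already converted to ints.
values = [("elem1", 39887), ("elem1", 72111)]
intervals = [("elem1", 1, 57898),
             ("elem1", 57899, 69887),
             ("elem2", 69888, 82111)]

result = count_hits(values, intervals)
for row in result:
    print(row)
```

If the bounds should be exclusive, swap the two `<=` for `<`.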
Edit: If I understand what you are trying to do (and that is a massive if), this is a framework:
    with open("file2.txt") as interval_file:
        for interval_line in interval_file:
            name, start, end = interval_line.split()
            with open("file1.txt") as value_file:
                for value_line in value_file:
                    value_name, position = value_line.split()
                    # compare as integers, not as strings
                    if (value_name == name and
                            int(start) < int(position) < int(end)):
                        print(interval_line)
Running against your sample data, the output: elem1 1 57898
It is not clear to me whether you are trying to 1) compare the two files positionally, line by line, or 2) read each line of file 2 and compare it to each and every line of file 1. The example here does the latter.
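If comparing every line of file 2 against every line of file 1 is what you want, and the goal is a count per interval rather than printed lines, file 1 can be read once instead of being reopened for each interval. This is a rough sketch, assuming inclusive bounds and whitespace-separated columns. The names load_positions and count_per_interval are made up for illustration.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def load_positions(value_lines):
    """Group positions by element name and sort each group."""
    positions = defaultdict(list)
    for line in value_lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        positions[fields[0]].append(int(fields[1]))
    for name in positions:
        positions[name].sort()
    return positions


def count_per_interval(interval_lines, positions):
    """Yield (name, start, end, count) for every interval line."""
    for line in interval_lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        name, start, end = fields[0], int(fields[1]), int(fields[2])
        group = positions.get(name, [])
        # Number of sorted positions p with start <= p <= end.
        count = bisect_right(group, end) - bisect_left(group, start)
        yield name, start, end, count


# The sample data from the question, as lists of lines; open file
# objects can be passed in the same way.
file1_lines = ["elem1 39887", "elem1 72111"]
file2_lines = ["elem1 1 57898", "elem1 57899 69887", "elem2 69888 82111"]

counts = list(count_per_interval(file2_lines, load_positions(file1_lines)))
for name, start, end, count in counts:
    print(name, start, end, count)
```

Sorting each group once and using bisect keeps each interval lookup cheap, which may matter if the data file holds many reads.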