Reducing a list based on fuzzy values in Python
I have a list that contains groups of nearly identical numeric values, e.g. (1004.523, 1004.575, 1004.475, 791.385, 791.298, 791.301, 791.305, 791.299)
What I am trying to do is read through the list, find all the 1004.5 +- values, aggregate them, and find their average value, then continue on and find all the 791.0 +- values and do the same with them.
I do not know how many individual values there will be in each "group", nor do I know how many groups there will be.
The result I am looking for is another list containing the average value of each group. So in the example my result would be (1004.524, 791.3176).
The code I'm currently using is very kludgy, and it seems there should be a much better way to do it.
As you can see, I have to repeat code twice: once in the else and once after the loop, since the last set of numbers does not trigger the else. Plus, at the completion of the loop I need to add the last value.
If I use len(tones) rather than len(tones)-1 I get an out of range error.
Any thoughts or suggestions would be appreciated. Thanks for your help.
Ed
    toneLen = len(tones) - 1
    for i in range(0, toneLen):
        if abs(tones[i]-tones[i+1]) <= 2.0:
            tmpTones.append(tones[i])
        else:
            tmpTones.append(tones[i])  # last value of the group, otherwise it is dropped
            freq = mean(tmpTones)
            newTones.append(freq)
            tmpTones = []
    tmpTones.append(tones[i+1])
    freq = mean(tmpTones)
    newTones.append(freq)
    tones = newTones
UPDATE: First, I wanted to thank everyone who submitted suggestions. The response was very quick and helpful. I should probably have included some more info, which I am doing below. Thanks so much for your help.
Second, a quick explanation of what I am trying to do. Our local Fire Department is looking for a way to track dispatches for departments close to them. For the most part they use two-tone sequential paging, e.g. 1000 Hz followed by 500 Hz.
So I am using numpy's FFT to find the tone frequency. Since the accuracy of the tone appears to be about +- 2 Hz, I compare the calculated frequency to a list of known paging tones and pick the closest match. After all the tones have been matched to the paging tones, I look for matches to departments of interest.
One thing I did not know when I started this is that in any given dispatch the same tone can be repeated several times, so the order of the tones is important. An example: 707.3, 339.6, 707.3, 569.1, 447.2, 569.1 would be a typical dispatch. I then look to see if any of the tone pairs are ones I'm interested in; if so, I display a message.
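As a rough illustration of the matching step described above, here is a minimal sketch. The names KNOWN_TONES, nearest_tone and tone_pairs are made up for this example, the tone table only reuses the values from the example dispatch, and it assumes tones arrive as consecutive, non-overlapping pairs, which may not match the real paging plan.

```python
# Hypothetical table of known paging tones in Hz (values taken from the
# example dispatch above; a real table would come from the paging plan).
KNOWN_TONES = [339.6, 447.2, 569.1, 707.3]


def nearest_tone(freq, known=KNOWN_TONES, tolerance=2.0):
    """Return the known tone closest to freq, or None if none is within tolerance."""
    best = min(known, key=lambda k: abs(k - freq))
    return best if abs(best - freq) <= tolerance else None


def tone_pairs(freqs):
    """Match each measured frequency, keep the order, and pair up neighbours.

    Assumption: tones come in non-overlapping pairs (1st+2nd, 3rd+4th, ...).
    """
    matched = [nearest_tone(f) for f in freqs]
    matched = [m for m in matched if m is not None]  # drop unrecognised tones
    return list(zip(matched[0::2], matched[1::2]))


measured = [707.9, 339.1, 706.8, 569.4, 447.0, 568.7]
print(tone_pairs(measured))
```

Because the order is kept, a tone that repeats later in the dispatch ends up in its own pair rather than being merged with its earlier occurrence.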
Thanks again for all your help.
Ed
Perhaps you are looking for kmeans clustering.
In the code below, I use scipy.cluster.vq.kmeans to cluster the data into k groups.
If the distortion is greater than some set threshold amount, then we increase k by one, and redo the kmeans clustering. We repeat until we find groups whose total distortion is less than the threshold amount.
import scipy.cluster.vq as scv
import numpy as np
import collections
def auto_cluster(data,threshold=0.1):
    # There are more sophisticated ways of determining k
    # See http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
    k=1
    distortion=1e20
    while distortion>threshold:
        codebook,distortion=scv.kmeans(data,k)
        k+=1   
    code,dist=scv.vq(data,codebook)    
    groups=collections.defaultdict(list)
    for index,datum in zip(code,data):
        groups[index].append(datum)
    return groups
data=np.array((1004.523, 1004.575, 1004.475, 791.385, 791.298, 791.301, 791.305, 791.299))
groups=auto_cluster(data)    
for index in groups:
    print('{index}: ave({d}) = {ave}'.format(
        index=index,
        d=','.join(map('{0:g}'.format,groups[index])),
        ave=np.mean(groups[index]))
        )
yields
0: ave(791.385,791.298,791.301,791.305,791.299) = 791.3176
1: ave(1004.52,1004.58,1004.48) = 1004.52433333
This finds the borders between groups of nearly identical values and then computes the mean using slices on the original list.
from numpy import mean  # the question's code uses mean() without showing its import
tones = (1004.523, 1004.575, 1004.475, 791.385, 791.298, 791.301, 791.305, 791.299)
splits = [i for i in range(1, len(tones)) if abs(tones[i-1] - tones[i]) > 2]
splits = [0] + splits + [len(tones)]
tones = [mean(tones[splits[i-1]:splits[i]]) for i in range(1, len(splits))]
# [1004.5243333333333, 791.31759999999997]
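The same border-finding idea can be wrapped in a function, which makes it easier to reuse and to handle an empty list. This is only a sketch: the name average_runs and the default delta of 2.0 (taken from the question) are assumptions, and it uses statistics.mean from the standard library instead of numpy.

```python
from statistics import mean


def average_runs(tones, delta=2.0):
    """Average each run of consecutive values that differ by at most delta."""
    if not tones:
        return []
    # Indices where a new run starts: the jump from the previous value exceeds delta.
    borders = [i for i in range(1, len(tones))
               if abs(tones[i] - tones[i - 1]) > delta]
    starts = [0] + borders
    ends = borders + [len(tones)]
    return [mean(tones[s:e]) for s, e in zip(starts, ends)]


tones = (1004.523, 1004.575, 1004.475, 791.385, 791.298, 791.301, 791.305, 791.299)
print(average_runs(tones))
```

Since only neighbouring values are compared, a tone that shows up again later in the sequence starts a new group, so the order of the groups is preserved.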
This does without the intermediate temp list:
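DELTA = 2.0  # maximum difference between neighbours in the same group (value from the question)
assert tones
total = prev = tones[0]
count = 1
newlist = []
for i in range(1, len(tones)):  # xrange on Python 2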
    t = tones[i]
    if abs(t - prev) <= DELTA:
        total += t
        count += 1
        prev = t
    else:
        newlist.append(total / count)
        total = prev = t
        count = 1
newlist.append(total / count)
If you know which numbers may appear in the sequence, you can use this (exacttones is the list of expected values):
tones = (1004.523, 1004.575, 1004.475, 791.385, 791.298, 791.301, 791.305, 791.299)
exacttones = (1004.5, 791.3)
limit = 0.2
[sum(x)/len(x) for x in [[y for y in tones if abs((y-e))<=limit] for e in exacttones]]
# [1004.5243333333333, 791.31759999999997]
To analyze the sequence without knowing the exacttones, something like this will work:
def calc(d, value):
    for k in d:
        if abs(k-value) <= limit:
            d[k].append(value)
            return d
    d[value] = [value]
    return d
from functools import reduce  # reduce is a builtin on Python 2
[sum(x)/len(x) for x in reduce(calc, tones, {}).values()]
# [1004.5243333333333, 791.31759999999997]
Assuming that this is a list of audio tones, you probably want to use a ratio such as 1.059 (roughly one semitone, 2**(1/12)) to determine the range assigned to a group, rather than hard-coding an absolute difference like 2.0.
def average_tones(tones):
    threshold = 1.059
    average = 0
    count = 0
    for tone in sorted(tones):
        if count != 0 and tone >= average*threshold:
            yield average
            count = 0
        average = (average * count + tone) / (count + 1)
        count += 1
    if count != 0:
        yield average
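The generator above sorts its input, so repeated tones get merged and the original order is lost. If the order matters, as the question's update suggests, one possible variant is sketched below. The name average_tones_in_order is made up for this example, and the sketch assumes all frequencies are positive.

```python
SEMITONE = 2 ** (1 / 12)  # about 1.0595, one equal-tempered semitone


def average_tones_in_order(tones, ratio=SEMITONE):
    """Yield the mean of each run of consecutive tones within `ratio` of the run's mean.

    Assumption: every tone is a positive frequency.
    """
    total = 0.0
    count = 0
    for tone in tones:
        if count:
            avg = total / count
            # The input is not sorted, so compare the ratio in both directions.
            if max(tone, avg) / min(tone, avg) >= ratio:
                yield avg
                total, count = 0.0, 0
        total += tone
        count += 1
    if count:
        yield total / count


print(list(average_tones_in_order([707.3, 707.5, 339.6, 339.4, 707.1])))
```

Here the second 707 Hz tone is reported as its own group instead of being folded into the first one.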
 