Downsampling the number of entries in a list (without interpolation)
I have a Python list with a number of entries, which I need to downsample using either:
- A maximum number of rows. For example, limiting a list of 1234 entries to 1000.
- A proportion of the original rows. For example, making the list 1/3 its original length.
(I need to be able to do both ways, but only one is used at a time).
I believe that for the maximum number of rows I can just calculate the proportion needed and pass that to the proportional downsizer:
def downsample_to_max(self, rows, max_rows):
    return self.downsample_to_proportion(rows, max_rows / float(len(rows)))
...so I really only need one downsampling function. Any hints, please?
EDIT: The list contains objects, not numeric values so I do not need to interpolate. Dropping objects is fine.
SOLUTION:
def downsample_to_proportion(self, rows, proportion):
    counter = 0.0
    last_counter = None
    results = []
    for row in rows:
        counter += proportion
        if int(counter) != last_counter:
            results.append(row)
            last_counter = int(counter)
    return results
Thanks.
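One thing to watch with the float counter above: because the counter accumulates rounding error, the result can come out one row off the requested size when the function is used to enforce a maximum. A minimal sketch of an alternative, assuming an exact count matters, picks the indices with integer arithmetic instead. The names take_evenly and take_proportion are illustrative and not part of the original code:

```python
def take_evenly(rows, count):
    """Return exactly `count` evenly spaced items from rows, order preserved.

    Hypothetical helper: if count >= len(rows), a copy of rows is returned.
    """
    n = len(rows)
    if count >= n:
        return list(rows)
    # i * n // count is strictly increasing when count < n, so no index repeats.
    return [rows[i * n // count] for i in range(count)]


def take_proportion(rows, proportion):
    """Keep roughly `proportion` (0..1) of the rows, rounded to a whole count."""
    return take_evenly(rows, int(round(len(rows) * proportion)))


print(len(take_evenly(list(range(1234)), 1000)))  # 1000
print(take_evenly(list(range(10)), 3))            # [0, 3, 6]
```

Here the proportional version delegates to the count-based one, the reverse of the approach in the question, since a whole-number count is the easier quantity to hit exactly.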
You can use islice from itertools:
from itertools import islice

def downsample_to_proportion(rows, proportion=1):
    return list(islice(rows, 0, len(rows), int(1/proportion)))
Usage:
x = range(1,10)
print(downsample_to_proportion(x, 0.3))
# [1, 4, 7]
Instead of islice() + list(), it is more efficient to use slice syntax directly if the input is already a sequence type:
def downsample_to_proportion(rows, proportion):
    return rows[::int(1 / proportion)]
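Note that both stride-based versions round the stride down to a whole number, so only proportions of the form 1/n are hit exactly. A small sketch illustrating this; the name stride_downsample and the max(1, ...) guard (which avoids a zero step for proportions above 1) are my additions, not part of the answers above:

```python
def stride_downsample(rows, proportion):
    # Same technique as the slice answer: keep every step-th item.
    # The stride is truncated to an integer, which limits the achievable sizes.
    step = max(1, int(1 / proportion))
    return rows[::step]


rows = list(range(10))
print(len(stride_downsample(rows, 0.5)))  # 5
print(len(stride_downsample(rows, 0.4)))  # 5, not 4: the stride 2.5 is truncated to 2
print(len(stride_downsample(rows, 0.8)))  # 10: the stride 1.25 is truncated to 1
```

So for a target like 1234 rows down to 1000, a whole-number stride cannot drop anything, and a counter-based approach is needed.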
This solution might be overkill for the original poster, but I thought I would share the code that I've been using to solve this and similar problems.
It's a bit lengthy (about 90 lines), but if you often have this need, want an easy-to-use one-liner, and need a pure-Python, dependency-free solution, then I reckon it might be of use.
Basically, the only thing you have to do is pass your list to the function and tell it what length you want your new list to be, and the function will either:
- downsize your list by dropping items if the new length is smaller, much like the previous answers already suggested.
- stretch/upscale your list (the opposite of downsizing) if the new length is larger, with the added option that you can decide whether to:
  - linearly interpolate between the known values (automatically chosen if the list contains ints or floats)
  - duplicate each value so they occupy a proportional size of the new list (automatically chosen if the list contains non-numbers)
  - pull the original values apart and leave gaps in between
Everything is collected inside one function so if you need it just copy and paste it to your script and you can start using it right away.
For instance you might say:
origlist = [0,None,None,30,None,50,60,70,None,None,100]
resizedlist = ResizeList(origlist, 21)
print(resizedlist)
and get
[0, 5.00000000001, 9.9999999999900009, 15.0, 20.000000000010001, 24.999999999989999, 30, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70, 75.000000000010004, 79.999999999989996, 85.0, 90.000000000010004, 94.999999999989996, 100]
Note that minor inaccuracies will occur due to floating point limitations. Also, I wrote this for Python 2.x, so to use it on Python 3.x just add a single line that says xrange = range.
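If the same script has to run on both Python 2 and Python 3, one common way to add that alias is a guarded assignment placed before the function definition. This is only a sketch of that idiom, not part of the original code:

```python
# Define xrange on Python 3, where the builtin no longer exists.
try:
    xrange
except NameError:
    xrange = range

print(list(xrange(3)))  # [0, 1, 2]
```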
And here is a nifty trick to interpolate between positioned subitems in a list of lists. For instance, you can easily interpolate between RGB color tuples to create a color gradient with a given number of steps. Assuming a list of RGB 3-tuples named rgbtuples and a desired GRADIENTLENGTH variable, you do this with:
crosssections = zip(*rgbtuples)
grad_crosssections = ( ResizeList(spectrum,GRADIENTLENGTH) for spectrum in crosssections )
rgb_gradient = [list(each) for each in zip(*grad_crosssections)]
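To try the transpose trick on its own, here is a self-contained sketch. It assumes a simplified helper, lerp_resize, which is a hypothetical stand-in for ResizeList that only does plain linear interpolation (no gaps, at least two input values, newlength of at least 2). The sample colors are made up for illustration:

```python
def lerp_resize(values, newlength):
    """Linearly interpolate a list of numbers to newlength points.

    Hypothetical simplified stand-in for ResizeList: requires
    len(values) >= 2 and newlength >= 2, and does not handle gaps.
    """
    last = len(values) - 1
    out = []
    for i in range(newlength):
        pos = i * last / float(newlength - 1)  # fractional index into values
        lo = min(int(pos), last - 1)           # left neighbour, clamped at the end
        frac = pos - lo
        out.append(values[lo] + (values[lo + 1] - values[lo]) * frac)
    return out


rgbtuples = [(0, 0, 0), (255, 0, 0), (255, 255, 0)]  # black -> red -> yellow
GRADIENTLENGTH = 5

# Transpose into one list per channel, resize each channel, transpose back.
crosssections = zip(*rgbtuples)
grad_crosssections = (lerp_resize(list(spectrum), GRADIENTLENGTH) for spectrum in crosssections)
rgb_gradient = [list(each) for each in zip(*grad_crosssections)]
print(rgb_gradient)
```

The key step is zip(*...), which turns a list of (R, G, B) tuples into three per-channel sequences and back again.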
It could probably use quite a few optimizations; I had to do quite a bit of experimentation. If you feel you can improve it, feel free to edit my post. Here is the code:
def ResizeList(rows, newlength, stretchmethod="not specified", gapvalue=None):
    """
    Resizes (up or down) and returns a new list of a given size, based on an input list.
    - rows: the input list, which can contain any type of value or item (except if using the interpolate stretchmethod which requires floats or ints only)
    - newlength: the new length of the output list (if this is the same as the input list then the original list will be returned immediately)
    - stretchmethod: if the list is being stretched, this decides how to do it. Valid values are:
      - 'interpolate'
        - linearly interpolate between the known values (automatically chosen if list contains ints or floats)
      - 'duplicate'
        - duplicate each value so they occupy a proportional size of the new list (automatically chosen if the list contains non-numbers)
      - 'spread'
        - drags the original values apart and leaves gaps as defined by the gapvalue option
    - gapvalue: a value that will be used as gaps to fill in between the original values when using the 'spread' stretchmethod
    """
    #return input as is if no difference in length
    if newlength == len(rows):
        return rows
    #set auto stretchmode
    if stretchmethod == "not specified":
        if isinstance(rows[0], (int,float)):
            stretchmethod = "interpolate"
        else:
            stretchmethod = "duplicate"
    #reduce newlength
    newlength -= 1
    #assign first value
    outlist = [rows[0]]
    writinggapsflag = False
    if rows[1] == gapvalue:
        writinggapsflag = True
    relspreadindexgen = (index/float(len(rows)-1) for index in xrange(1,len(rows))) #warning a little hacky by skipping first index cus is assigned auto
    relspreadindex = next(relspreadindexgen)
    spreadflag = False
    gapcount = 0
    for outlistindex in xrange(1, newlength):
        #relative positions
        rel = outlistindex/float(newlength)
        relindex = (len(rows)-1) * rel
        basenr,decimals = str(relindex).split(".")
        relbwindex = float("0."+decimals)
        #determine equivalent value
        if stretchmethod=="interpolate":
            #test for gap
            maybecurrelval = rows[int(relindex)]
            maybenextrelval = rows[int(relindex)+1]
            if maybecurrelval == gapvalue:
                #found gapvalue, so skipping and waiting for valid value to interpolate and add to outlist
                gapcount += 1
                continue
            #test whether to interpolate for previous gaps
            if gapcount > 0:
                #found a valid value after skipping gapvalues so this is where it interpolates all of them from last valid value to this one
                startvalue = outlist[-1]
                endindex = int(relindex)
                endvalue = rows[endindex]
                gapstointerpolate = gapcount
                allinterpolatedgaps = ResizeList([startvalue,endvalue],gapstointerpolate+3)
                outlist.extend(allinterpolatedgaps[1:-1])
                gapcount = 0
                writinggapsflag = False
            #interpolate value
            currelval = rows[int(relindex)]
            lookahead = 1
            nextrelval = rows[int(relindex)+lookahead]
            if nextrelval == gapvalue:
                if writinggapsflag:
                    continue
                relbwval = currelval
                writinggapsflag = True
            else:
                relbwval = currelval + (nextrelval - currelval) * relbwindex #basenr pluss interindex percent interpolation of diff to next item
        elif stretchmethod=="duplicate":
            relbwval = rows[int(round(relindex))] #no interpolation possible, so just copy each time
        elif stretchmethod=="spread":
            if rel >= relspreadindex:
                spreadindex = int(len(rows)*relspreadindex)
                relbwval = rows[spreadindex] #spread values further apart so as to leave gaps in between
                relspreadindex = next(relspreadindexgen)
            else:
                relbwval = gapvalue
        #assign each value
        outlist.append(relbwval)
    #assign last value
    if gapcount > 0:
        #this last value also has to interpolate for previous gaps
        startvalue = outlist[-1]
        endvalue = rows[-1]
        gapstointerpolate = gapcount
        allinterpolatedgaps = ResizeList([startvalue,endvalue],gapstointerpolate+3)
        outlist.extend(allinterpolatedgaps[1:-1])
        outlist.append(rows[-1])
        gapcount = 0
        writinggapsflag = False
    else:
        outlist.append(rows[-1])
    return outlist
Keep a counter, which you increment by the second value. Floor it each time, and yield the value at that index.
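A minimal sketch of that idea as a generator. I am assuming "the second value" means the step between kept items (the reciprocal of the proportion); the name downsample_by_step is made up for illustration:

```python
import math


def downsample_by_step(rows, step):
    """Yield every step-th row, where step may be fractional (step >= 1)."""
    if step < 1:
        raise ValueError("step must be >= 1, otherwise rows would repeat")
    counter = 0.0
    while counter < len(rows):
        # Floor the running counter to get the index of the row to keep.
        yield rows[int(math.floor(counter))]
        # Repeated float addition can drift slightly for steps that are
        # not exactly representable in binary floating point.
        counter += step


print(list(downsample_by_step(list("abcdefghij"), 2.5)))  # ['a', 'c', 'f', 'h']
```

For a proportion such as 1/3, the step would be 3; for "1234 down to 1000" it would be 1234 / 1000.0.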
Can't random.choices() solve your problem? More examples are available here
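One caveat: random.choices() draws with replacement, so the same item can be picked more than once and the original order is not kept. If a random (rather than evenly spaced) subset is acceptable, random.sample() draws without replacement. A sketch under that assumption, with a hypothetical function name and an optional seed parameter added for repeatability:

```python
import random


def random_downsample(rows, max_rows, seed=None):
    """Keep at most max_rows randomly chosen rows, in their original order."""
    if max_rows >= len(rows):
        return list(rows)
    rng = random.Random(seed)
    # Sample distinct indices, then sort them so the output keeps list order.
    keep = sorted(rng.sample(range(len(rows)), max_rows))
    return [rows[i] for i in keep]


subset = random_downsample(list("abcdefghij"), 4, seed=0)
print(len(subset))  # 4
```

Unlike the counter-based approaches, the kept rows are not evenly spread, so this suits cases where any representative subset will do.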
With reference to answer from Ignacio Vazquez-Abrams:
Print 3 numbers from the 7 available:
import math

msg_cache = [0, 1, 2, 3, 4, 5, 6]
msg_n = 3
inc = len(msg_cache) / msg_n
inc_total = 0
for _ in range(0, msg_n):
    msg_downsampled = msg_cache[math.floor(inc_total)]
    print(msg_downsampled)
    inc_total += inc
Output:
0
2
4
Useful for down-sampling many log messages to a smaller subset.