Algorithm in Python to store and search daily occurrence for thousands of numbered events?
I'm investigating solutions of storing and querying a historical record of event occurrences for a large number of items.
This is the simplified scenario: I'm getting a daily log of 200 000 streetlamps (labeled sl1 to sl200000) which shows if the lamp was operational on the day or not. It does not matter for how long the lamp was in service only that it was on a given calendar day.
Other bits of information are stored for each lamp as well and the beginning of the Python class looks something like this:
class Streetlamp(object):
"""Class for streetlamp record"""
def __init__(self, **args):
self.location = args['location']
self.power = args['power']
self.inservice = ???
My py-foo is not too great and I would like to avoid a solution which is too greedy on disk/memory storage. So a solution with a dict of (year, month, day) tuples could be one solution, but I'm hoping to get pointers for a more efficient solution.
A record could be stored as a bit stream with each bit representing a day of a year starting with Jan 1. Hence, if a lamp was operational the first three days of 2010, then the record could be:
sl1000_up = dict('20开发者_运维知识库10': '11100000000000...', '2011':'11111100100...')
Search across year boundaries would need a merge, leap years are a special case, plus I'd need to code/decode a fair bit with this home grown solution. It seems not quiet right. speed-up-bitstring-bit-operations, how-do-i-find-missing-dates-in-a-list and finding-data-gaps-with-bit-masking where interesting postings I came across. I also investigated python-bitstring and did some googling, but nothing seems to really fit.
Additionally I'd like search for 'gaps' to be possible, e.g. 'three or more days out of action' and it is essential that a flagged day can be converted into a real calendar date.
I would appreciate ideas or pointers to possible solutions. To add further detail, it might be of interest that the back-end DB used is ZODB and pure Python objects which can be pickled are preferred.
Create a 2D-array in Numpy:
import numpy as np
nbLamps = 200000
nbDays = 365
arr = np.array([nbLamps, nbDays], dtype=np.bool)
It will be very memory-efficient and you can aggregate easily the days and lamps.
In order to manipulate the days even better, have a look at scikits.timeseries. They will allow you to access the dates with datetime objects.
I'd probably dictionary the lamps and have each of them contain a list of state changes where the first element is the time of the change and the second the value that's valid since that time.
This way when you get to the next sample you do nothing unless the state changed compared to the last item.
Searching is quick and efficient as you can use binary search approaches on the times.
Persisting it is also easy and you can append data to an existing and running system without any problems too as well as dictionary the lamp state lists to further reduce resource usage.
If you want to search for a gap you just go over all the items and compare the next and prev times - and if you decided to dictionary the state lists then you'll be able to do it just once for every different list rather then every lamp and then get all the lamps that had the same "offline" states with just one iteration which may sometimes help
精彩评论