Inequality Filters on a date and a number

I am trying to query my Google App Engine datastore [Python], whose entities have an item_name, a manufacturing_date and a number_of_items_shipped. There are ~1.0 million records in the datastore, and the number keeps growing.

The scenario: Get all the item_names which have been shipped more than x_items [user input] and manufactured after some_date [user input]. Basically, kind of an inventory check.

Effectively, that is 2 inequality filters on different properties. Because of the restrictions on queries in GAE, I am not able to do this.

I searched SO for this issue, but no luck so far. Have you come across this issue? If so, were you able to resolve it? Please let me know.

Also, in the Google I/O 2010 talk "Next Gen Queries", Alfred Fuller mentioned that they were going to remove this restriction soon. It's been more than 8 months, but unfortunately the restriction is still in place.

I would appreciate it if anyone who has managed to circumvent this restriction could post an answer.

Thanks a lot.


Building on Sudhir's answer, I'd probably assign each record to a manufacture date "bucket", based on the granularity you care about. If your range of manufacturing dates spans a couple of years, use monthly buckets, for example. If your range only covers the last year, use weekly buckets.

Now when you want to find records with > n sales and a manufacturing date in a given range, run your query once per bucket in that range, and postfilter out the items you are not interested in.

For example (totally untested):

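import datetime

BUCKET_SIZE_DAYS = 10

def put(self):
    # Store the bucket number next to the date so it can be matched with an equality filter.
    self.manufacture_bucket = self.manufacture_date.toordinal() // BUCKET_SIZE_DAYS
    # Caution: super(self.__class__, self) recurses forever if this model is ever
    # subclassed; name the class explicitly in real code.
    super(self.__class__, self).put()

def filter_date_after(self, date_start):
    first_bucket = date_start.toordinal() // BUCKET_SIZE_DAYS
    last_bucket = datetime.datetime.today().toordinal() // BUCKET_SIZE_DAYS

    # One equality query per bucket, which leaves the inequality free for another property.
    for this_bucket in range(first_bucket, last_bucket + 1):
        for found in self.filter("manufacture_bucket =", this_bucket):
            # Postfilter: the first bucket can contain dates earlier than date_start.
            if found.manufacture_date >= date_start:
                yield found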

You should then be able to use this like:

widgets.filter("sold >", 7).filter_date_after(datetime.datetime(2010,11,21))

Left as an exercise for the reader:

  • Making it play nicely with other filters added to the end
  • Multiple bucket sizes allowing you to always query ln(days in date range) buckets.
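
As a rough starting point for the second exercise, here is a minimal sketch of how a date range could be covered with as few aligned buckets as possible when each record stores one bucket number per size. The sizes in `BUCKET_SIZES` and the function name `cover_range` are my own assumptions, not part of the answer above, and this is plain Python with no datastore calls:

```python
import datetime

# Hypothetical bucket sizes in days, largest first. The last one must be 1
# so that any range can be covered exactly.
BUCKET_SIZES = (100, 10, 1)

def cover_range(start, end, sizes=BUCKET_SIZES):
    """Return (size, bucket_id) pairs whose buckets exactly cover
    every day from start to end, inclusive."""
    if sizes[-1] != 1:
        raise ValueError("the smallest bucket size must be 1 day")
    pos, last = start.toordinal(), end.toordinal()
    cover = []
    while pos <= last:
        for size in sizes:
            # Take the largest bucket that starts at pos and fits in the range.
            if pos % size == 0 and pos + size - 1 <= last:
                cover.append((size, pos // size))
                pos += size
                break
    return cover
```

Each returned pair would then map to one equality query on the bucket property for that size (for instance a hypothetical `manufacture_bucket_100`), so a long range needs far fewer queries than one per small bucket, and no postfilter on the date is needed because the cover is exact.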


Unfortunately, you can't circumvent this restriction, but I can help you model the data in a slightly different way.

First off, Bigtable is suited to very fast reads off large databases - the kind you do when you have a million people hitting your app at the same time. What you're trying to do here is a report on historical data. While I would recommend moving the reporting to an RDBMS, there is a way you can do it on Bigtable.

First, override the put() method on your item model to split the date before saving it. What you would do is something like

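def put(self):
  # Split the date into separate properties that can be matched with equality filters.
  self.manufacture_day = self.manufacture_date.day
  self.manufacture_month = self.manufacture_date.month
  self.manufacture_year = self.manufacture_date.year
  # Caution: super(self.__class__, self) recurses forever if this model is ever
  # subclassed; name the class explicitly in real code.
  super(self.__class__, self).put()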

You can do this to any level of granularity you want, even hours, minutes, seconds, whatever.

You can apply this retroactively to your database by just loading and saving your item entities. The mapper is very convenient for this.

Then change your query to use the inequality only on the item count, and select the days / months / years you want using normal equalities. You can do ranges by either firing multiple queries or using the IN clause (which does the same thing anyway).
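
To make the "multiple queries" idea concrete, here is a minimal sketch that lists the (year, month) pairs a range query would have to visit. The helper name `months_between` is my own assumption, and the code is plain Python with no datastore calls:

```python
import datetime

def months_between(start, end):
    """Yield (year, month) for every calendar month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1

# Each pair would become one query with equality filters on manufacture_year
# and manufacture_month, plus the single inequality on the shipped count.
for year, month in months_between(datetime.date(2010, 11, 21),
                                  datetime.date(2011, 2, 3)):
    print(year, month)
```

Note that the first month in the range is only partially covered by the start date, so results from that month still need to be postfiltered on manufacture_date (or queried day by day with manufacture_day).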

This does seem contrived and tough to do, but keep in mind that your reports will run almost instantaneously if you do this, even when millions of people try to run them at the same time. You might not need this kind of scale, but well... that's what you get :D
