开发者

Django Using Annotate Instead of Distinct()

I have read that distinct() API call has some performance issues at times. I wanted to try to rewrite a query through the orm which avoided using distinct (at least profile the difference).

My understanding is that values() performs a Group By under the hood. When I test out the two methods, though, the Count of objects differs depending on whether I use distinct() or values()/annotate().

   zip_codes = Location.objects.values('zip_code').annotate(zip_count=Count('zip_code')).exclude(zip_code=None).count()

VS.

  zip_codes = Location开发者_JS百科.objects.values_list('zip_code', flat=True).exclude(zip_code=None).distinct()

any thoughts on what is wrong here?

Thanks!


I just quickly checked your queries against a database I have with a similar query. The counts was identical so I'm not sure what about your data is resulting in issues.

I'd also be HIGHLY skeptical of the premise though. DISTINCT is indeed a cpu intensive query. However, so is COUNT(*) and your second query is going to first run an count aggregate with a group by and then run a COUNT on the results. I'd be put money on the single DISTINCT call being faster (I'd also check with whichever database backend you're using to see). All of this has very little to do with django's ORM and a whole heck of a lot more to do with your database backend.

Also think about this. The distinct based query is an order of magnitude clearer as to what it's accomplishing compared to the annotate based one. Do you have evidence to support that DISTINCT is going to be slow in your situation, or better still that it's forming a bottlneck right now? If not you're well into the range of premature optimization and should heavily reconsider your path.

Premature Optimization.

Optimization matters only when it matters. When it matters, it matters a lot, but until you know that it matters, don't waste a lot of time doing it. Even if you know it matters, you need to know where it matters. Without performance data, you won't know what to optimize, and you'll probably optimize the wrong thing.

The result will be obscure, hard to write, hard to debug, and hard to maintain code that doesn't solve your problem. Thus it has the dual disadvantage of (a) increasing software development and software maintenance costs, and (b) having no performance effect at all.

In other words write your software clearly and then when you find a problem trace it to the source and fix it. Anything you do before that is counterproductive. Spend your time worrying about which indexes are going to matter on your db, and where to use select_related. Those are 10000% more effective than what you are worrying about here (unless you are counting zip codes all the time, in which case let me introduce you to caching)

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜