How can I optimize multiple nested SELECTs in SQLite (w/Python)?

Posted 2022-12-18 23:12

I'm building a CGI script that polls a SQLite database and builds a table of statistics. The source database table is described below, as is the chunk of pertinent code. Everything works (functionally), but the CGI itself is very slow as I have multiple nested SELECT COUNT(id) calls. I figure my best shot at optimization is to ask the SO community as my time with Google has been relatively fruitless.

The table:

CREATE TABLE messages (
    id TEXT PRIMARY KEY ON CONFLICT REPLACE,
    date TEXT,
    hour INTEGER,
    sender TEXT,
    size INTEGER,
    origin TEXT,
    destination TEXT,
    relay TEXT,
    day TEXT);

(Yes, I know the table isn't normalized but it's populated with extracts from a mail log... I was happy enough to get the extract & populate working, let alone normalize it. I don't think the table structure has a lot to do with my question at this point, but I could be wrong.)

Sample row:

476793200A7|Jan 29 06:04:47|6|admin@mydomain.com|4656|web02.mydomain.pvt|user@example.com|mail01.mydomain.pvt|Jan 29

And, the Python code that builds my tables:

#!/usr/bin/python
print 'Content-type: text/html\n\n'

from datetime import date

import re
p = re.compile('(\w+) (\d+)')

d_month = {'Jan':1,'Feb':2,'Mar':3,'Apr':4,'May':5,'Jun':6,'Jul':7,'Aug':8,'Sep':9,'Oct':10,'Nov':11,'Dec':12}
l_wkday = ['Mo','Tu','We','Th','Fr','Sa','Su']

days = []
curs.execute('SELECT DISTINCT(day) FROM messages ORDER BY day')
for day in curs.fetchall():
    m = p.match(day[0]).group(1)
    m = d_month[m]
    d = p.match(day[0]).group(2)
    days.append([day[0],"%s (%s)" % (day[0],l_wkday[date.weekday(date(2010,int(m),int(d)))])])

curs.execute('SELECT DISTINCT(sender) FROM messages')
senders = curs.fetchall()
for sender in senders:
    curs.execute('SELECT COUNT(id) FROM messages WHERE sender=?', (sender[0],))
    print '  <div id="'+sender[0]+'">'
    print '   <h1>Stats for Sender: '+sender[0]+'</h1>'
    print '   <table><caption>Total messages in database: %d</caption>' % curs.fetchone()[0]
    print '    <tr><td>&nbsp;</td><th colspan=24>Hour of Day</th></tr>'
    print '    <tr><td class="left">Day</td><th>%s</th></tr>' % '</th><th>'.join(map(str,range(24)))
    for day in days:
            print '    <tr><td>%s</td>' % day[1]
            for hour in range(24):
                    sql = 'SELECT COUNT(id) FROM messages WHERE sender=? AND day=? AND hour=?'
                    curs.execute(sql, (sender[0], day[0], hour))
                    d = curs.fetchone()[0]
                    print '    <td>%s</td>' % (d>0 and str(d) or '')
            print '    </tr>'
    print '   </table></div>'

print ' </body>\n</html>\n'

I'm not sure if there are any ways I can combine some of the queries, or approach it from a different angle to extract the data. I had also thought about building a second table with the counts in it and just updating it when the original table is updated. I've been staring at this for entirely too long today so I'm going to attack it fresh again tomorrow, hopefully with some insight from the experts ;)

Edit: Using the GROUP BY answer provided below, I was able to get the data needed from the database in one query. I switched to Perl since Python's nested dict support just didn't work very well for the way I needed to approach this (building a set of HTML tables in a specific way). Here's a snippet of the revised code:

my %data;
my $rows = $db->selectall_arrayref("SELECT COUNT(id),sender,day,hour FROM messages GROUP BY sender,day,hour ORDER BY sender,day,hour");
for my $row (@$rows) {
    my ($ct, $se, $dy, $hr) = @$row;
    $data{$se}{$dy}{$hr} = $ct;
}
for my $se (keys %data) {
    print "Sender: $se\n";
    for my $dy (keys %{$data{$se}}) {
        print "Day: ",time2str('%a',str2time("$dy 2010"))," $dy\n";
        for my $hr (keys %{$data{$se}{$dy}}) {
            print "Hour: $hr = ".$data{$se}{$dy}{$hr}."\n";
        }
    }
    print "\n";
}

What once executed in about 28.024s now takes 0.415s!

First of all, you can use the GROUP BY clause:

select count(*), sender from messages group by sender;

With this you execute one query for all senders instead of one query for each sender. Another possibility could be:

select count(*), sender, day, hour
    from messages group by sender, day, hour
    order by sender, day, hour;

I didn't test it, but at least now you know about the GROUP BY clause. This should reduce the number of queries, and I think it is the first big step toward better performance.
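To make the single-query idea concrete on the Python side, here is a minimal sketch. It assumes Python 3 and the standard sqlite3 module, and it uses an in-memory database with a few made-up rows in place of the real mail-log table. The names counts and hour_row are hypothetical helpers, not part of the original script.

```python
import sqlite3

# Assumption: an in-memory database with invented rows stands in for the real table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    " id TEXT PRIMARY KEY ON CONFLICT REPLACE, date TEXT, hour INTEGER,"
    " sender TEXT, size INTEGER, origin TEXT, destination TEXT,"
    " relay TEXT, day TEXT)"
)
rows = [
    ("A1", "Jan 29 06:04:47", 6, "admin@mydomain.com", 4656,
     "web02.mydomain.pvt", "user@example.com", "mail01.mydomain.pvt", "Jan 29"),
    ("A2", "Jan 29 06:15:02", 6, "admin@mydomain.com", 1200,
     "web02.mydomain.pvt", "user@example.com", "mail01.mydomain.pvt", "Jan 29"),
    ("A3", "Jan 29 14:30:00", 14, "admin@mydomain.com", 900,
     "web02.mydomain.pvt", "user@example.com", "mail01.mydomain.pvt", "Jan 29"),
    ("A4", "Jan 30 09:00:00", 9, "other@mydomain.com", 300,
     "web02.mydomain.pvt", "user@example.com", "mail01.mydomain.pvt", "Jan 30"),
]
conn.executemany("INSERT INTO messages VALUES (?,?,?,?,?,?,?,?,?)", rows)

# One grouped query replaces the per-sender, per-day, per-hour COUNT queries.
counts = {}
for n, sender, day, hour in conn.execute(
    "SELECT COUNT(*), sender, day, hour FROM messages"
    " GROUP BY sender, day, hour"
):
    counts[(sender, day, hour)] = n


def hour_row(sender, day):
    """Return the 24 per-hour counts for one sender and day (0 where no mail)."""
    return [counts.get((sender, day, h), 0) for h in range(24)]
```

Keying a flat dict on the (sender, day, hour) tuple avoids nested dicts entirely, and hour_row fills in the empty cells when a table row is rendered.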

Second, create indexes on the columns you search by, in your case sender, day and hour.

If this isn't enough, use profiling tools to find where most of the time is spent. You should also consider using fetchmany instead of fetchall to keep memory consumption low. Remember that the sqlite module is coded in C, so let it do as much of the work as possible instead of looping in Python.
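As an illustration of the fetchmany suggestion, here is a small sketch, again assuming Python 3 and the standard sqlite3 module. The trimmed-down table, the generated rows, and the helper name iter_batches are all assumptions made for the example.

```python
import sqlite3

# Assumption: a reduced table with generated rows, just enough to show batching.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id TEXT PRIMARY KEY, sender TEXT, day TEXT, hour INTEGER)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?,?,?,?)",
    [("id%d" % i, "sender%d@example.com" % (i % 3), "Jan 29", i % 24)
     for i in range(100)],
)


def iter_batches(cursor, batch_size=25):
    """Yield rows one at a time while fetching at most batch_size rows per call."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        for row in batch:
            yield row


cur = conn.execute(
    "SELECT COUNT(*), sender, day, hour FROM messages"
    " GROUP BY sender, day, hour ORDER BY sender, day, hour"
)
total = 0
groups = 0
for n, sender, day, hour in iter_batches(cur):
    total += n      # messages seen so far
    groups += 1     # distinct (sender, day, hour) cells seen so far
```

Only one batch of rows is held in memory at a time, which matters more for large result sets than for the grouped query shown here.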

For starters, create an index:

CREATE INDEX messages_sender_by_day ON messages (sender, day);

(You probably don't need to include "hour" in there.)
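To check whether SQLite actually picks up the index, you can ask for the query plan. The sketch below assumes Python 3 and the standard sqlite3 module, and runs against an empty in-memory copy of the table, so it only shows which access path the planner chooses, not real timings.

```python
import sqlite3

# Assumption: an empty in-memory copy of the table is enough to inspect the plan.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    " id TEXT PRIMARY KEY ON CONFLICT REPLACE, date TEXT, hour INTEGER,"
    " sender TEXT, size INTEGER, origin TEXT, destination TEXT,"
    " relay TEXT, day TEXT)"
)
conn.execute("CREATE INDEX messages_sender_by_day ON messages (sender, day)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN"
    " SELECT COUNT(id) FROM messages WHERE sender=? AND day=? AND hour=?",
    ("admin@mydomain.com", "Jan 29", 6),
).fetchall()

# The human-readable description is the last column of each plan row.
details = [row[-1] for row in plan]
uses_index = any("messages_sender_by_day" in d for d in details)
for d in details:
    print(d)
```

If the printed plan mentions a scan of the whole table instead of the index name, the index is not being used for that query.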

If that doesn't help or you've already tried it, then please fix up your question a bit: give us some code to generate test data and SQL for all indexes on the table.

Maintaining a count cache is fairly common, but I can't tell if that's needed here.
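If a count cache does turn out to be worthwhile, one simple variant is to rebuild a summary table after each log import. This is only a sketch under assumptions: Python 3 with the standard sqlite3 module, a trimmed-down messages table, and the hypothetical names message_counts, refresh_counts and cached_count. Rebuilding from scratch is my choice here rather than anything the question prescribes. It sidesteps having to account for rows that get replaced through ON CONFLICT REPLACE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: reduced version of the messages table, keeping the REPLACE behaviour.
conn.execute(
    "CREATE TABLE messages ("
    " id TEXT PRIMARY KEY ON CONFLICT REPLACE, sender TEXT, day TEXT, hour INTEGER)"
)
# Hypothetical cache table: one row per (sender, day, hour) cell.
conn.execute(
    "CREATE TABLE message_counts ("
    " sender TEXT, day TEXT, hour INTEGER, n INTEGER,"
    " PRIMARY KEY (sender, day, hour))"
)


def refresh_counts(conn):
    """Rebuild the cache from scratch; meant to be called after each log import."""
    with conn:  # commit both statements together, or roll back on error
        conn.execute("DELETE FROM message_counts")
        conn.execute(
            "INSERT INTO message_counts (sender, day, hour, n)"
            " SELECT sender, day, hour, COUNT(*) FROM messages"
            " GROUP BY sender, day, hour"
        )


def cached_count(conn, sender, day, hour):
    """Read one cell from the cache, returning 0 when the cell is absent."""
    row = conn.execute(
        "SELECT n FROM message_counts WHERE sender=? AND day=? AND hour=?",
        (sender, day, hour),
    ).fetchone()
    return row[0] if row else 0


conn.executemany(
    "INSERT INTO messages VALUES (?,?,?,?)",
    [("A1", "admin@mydomain.com", "Jan 29", 6),
     ("A2", "admin@mydomain.com", "Jan 29", 6),
     ("A3", "admin@mydomain.com", "Jan 29", 7)],
)
refresh_counts(conn)
```

The CGI would then read from message_counts only, so the cost of counting is paid once per import instead of once per page view.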
