PostgreSQL CLUSTER command not clearing dead tuples
We have a background process (a Linux daemon in an infinite loop) that automatically takes all lines from CSV files placed in a certain directory and imports them into a table. The daemon processes any files that appear in the directory one by one, is written in Python, and uses psycopg2 to connect to our PostgreSQL database.
That process imports those records using INSERT statements, but first DELETEs any table records that have the same unique key as any of the records in the CSV file. In general the process DELETEs a record for every record it INSERTs, so while this daemon is running in the background it is constantly DELETING and then INSERTING rows. Every time it processes one file it explicitly commits the transaction, closes the cursor, and then closes the connection.
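For reference, each CSV row effectively goes through something like the following SQL (the table and column names here are simplified stand-ins for our real schema):

DELETE FROM imported_table WHERE unique_key = 'key from the csv row';
INSERT INTO imported_table (unique_key, value)
VALUES ('key from the csv row', 'value from the csv row');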
Periodically (twice a day) we want to run CLUSTER to remove the dead tuples and keep the table at a manageable on-disk size.
However, something in this process is stopping the CLUSTER command from removing the dead tuples for all the records that are being deleted while the process is running. We know this is happening because if we run CLUSTER while the process is running, the on-disk size of the table containing the imported data does not decrease and pg_stat_user_tables shows many dead tuples.
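This is roughly the check we run; the table name is a stand-in for our actual import table:

SELECT relname,
       n_live_tup,
       n_dead_tup,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
WHERE relname = 'imported_table';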
If we stop the process and then run CLUSTER, the on-disk size of the table decreases dramatically and pg_stat_user_tables reports that all of the dead tuples are gone.
What's strange is that we commit the transaction and close the connection every time we process a file, so I have no idea what is preventing the dead tuples from being removed while the process is running.
Also strange: if we stop the process, start it again, and then run CLUSTER, it will remove all of the dead tuples created by the previous run of the daemon; but any subsequent calls to CLUSTER will not clear the dead tuples created by the current run (while it is still running, of course).
So something is maintaining some kind of link to the dead tuples until the process is stopped, even though we have committed the transaction and closed all connections to PostgreSQL that created those dead tuples. pg_locks does not report any open locks and no running transactions are reported, so it doesn't seem to be a lock or open-transaction issue.
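For completeness, the checks we run are along these lines (the pg_stat_activity columns below are the PostgreSQL 9.2+ names; older releases report '<IDLE> in transaction' in current_query instead of having a state column):

-- any locks that have not been granted
SELECT locktype, relation::regclass, mode, granted
FROM pg_locks
WHERE NOT granted;

-- any sessions sitting idle inside an open transaction
SELECT pid, usename, xact_start, state, query
FROM pg_stat_activity
WHERE state = 'idle in transaction';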
At the end of the day, this is stopping us from periodically running CLUSTER on the table so that it doesn't keep growing and growing.
I'm sure there is a simple answer to this, but I can't find it anywhere. Some skeleton code for the process is below. It really is a simple process so I have no idea what is going on here. Any guidance would be greatly appreciated.
import os
import time

import psycopg2

# get_modified_time() and get_full_pathname() are helper functions defined
# elsewhere in the daemon (roughly os.path.getmtime() and path joining).
while True:
    l = [(get_modified_time(fname), fname) for fname in os.listdir('/tmp/data')]
    l.sort()
    for (t, fname) in l:
        conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
        cursor = conn.cursor()
        # Calls a postgresql function that reads a file and imports it into
        # a table via INSERT statements and DELETEs any records that have the
        # same unique key as any of the records in the file.
        cursor.execute("SELECT import(%s, %s);", (fname, t))
        conn.commit()
        cursor.close()
        conn.close()
        os.remove(get_full_pathname(fname))
    # brief pause before rescanning the directory
    time.sleep(0.100)
What's wrong with autovacuum? When autovacuum does its job, you don't have to use CLUSTER to clean up dead tuples. CLUSTER isn't made for this; VACUUM is.
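If autovacuum isn't keeping up with the delete/insert churn, it can be made more aggressive for just this table. A minimal sketch, assuming a release with per-table autovacuum settings (8.4+); the table name and thresholds are only examples to tune for your load:

ALTER TABLE imported_table SET (
    autovacuum_vacuum_scale_factor = 0.05,  -- vacuum after ~5% of rows are dead
    autovacuum_vacuum_threshold    = 1000
);

-- verify that autovacuum is actually visiting the table
SELECT relname, last_autovacuum, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'imported_table';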
If you change the process to UPDATE duplicates instead of DELETE + INSERT, things might get even better when combined with a lower FILLFACTOR: HOT updates. These are faster, reclaim space, keep the same order in storage, and remove the need for VACUUM or CLUSTER.
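A sketch of what that could look like; the table, column, and key values are placeholders and the fillfactor is just a starting point. Keep in mind that a lower fillfactor only affects newly written pages, and HOT only applies when the updated columns are not indexed:

ALTER TABLE imported_table SET (fillfactor = 70);

-- per CSV row: update in place if the key already exists...
UPDATE imported_table
SET    value = 'value from the csv row'
WHERE  unique_key = 'key from the csv row';

-- ...and insert only when the key was not there yet
INSERT INTO imported_table (unique_key, value)
SELECT 'key from the csv row', 'value from the csv row'
WHERE NOT EXISTS (
    SELECT 1 FROM imported_table WHERE unique_key = 'key from the csv row'
);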