PostgreSQL CLUSTER command not clearing dead tuples
We have a background process (a Linux daemon in an infinite loop) that automatically takes all lines from CSV files placed in a certain directory and imports them into a table. The daemon processes any files that appear in the directory one by one, is written in Python, and uses psycopg2 to connect to our PostgreSQL database.
That process imports those records using INSERT statements, but first DELETEs any table records that have the same unique key as any of the records in the CSV file. In general the process DELETEs a record for every record it INSERTs, so while this daemon is running in the background it is constantly DELETING and then INSERTING rows. Every time it processes one file it explicitly commits the transaction, closes the cursor, and then closes the connection.
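For reference, each CSV row effectively goes through something like the following SQL (the table and column names here are simplified stand-ins for our real schema):

DELETE FROM imported_table WHERE unique_key = 'key from the csv row';
INSERT INTO imported_table (unique_key, value)
VALUES ('key from the csv row', 'value from the csv row');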
Periodically (twice a day) we want to run CLUSTER to remove the dead tuples and keep the table at a manageable on-disk size.
However, something in this process is stopping the CLUSTER command from removing the dead tuples for all the records that are being deleted while the process is running. We know this is happening because if we run CLUSTER while the process is running, the on-disk size of the table containing the imported data does not decrease and pg_stat_user_tables shows many dead tuples.
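This is roughly the check we run; the table name is a stand-in for our actual import table:

SELECT relname,
       n_live_tup,
       n_dead_tup,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
WHERE relname = 'imported_table';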
If we stop the process and then run CLUSTER, the on-disk size of the table decreases dramatically and pg_stat_user_tables reports that all of the dead tuples are gone.
What's strange is that we commit the transaction and close the connection every time we process a file, so I have no idea what is preventing the dead tuples from being removed while the process is running.
Also strange: if we stop the process, start it again, and then run CLUSTER, it will remove all of the dead tuples created by the previous run of the daemon; but any subsequent calls to CLUSTER will not clear the dead tuples created by the current run (while it is still running, of course).
So something is maintaining some kind of link to the dead tuples until the process is stopped, even though we have committed the transaction and closed all connections to PostgreSQL that created those dead tuples. pg_locks does not report any open locks and no running transactions are reported, so it doesn't seem to be a lock or open-transaction issue.
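For completeness, the checks we run are along these lines (the pg_stat_activity columns below are the PostgreSQL 9.2+ names; older releases report '<IDLE> in transaction' in current_query instead of having a state column):

-- any locks that have not been granted
SELECT locktype, relation::regclass, mode, granted
FROM pg_locks
WHERE NOT granted;

-- any sessions sitting idle inside an open transaction
SELECT pid, usename, xact_start, state, query
FROM pg_stat_activity
WHERE state = 'idle in transaction';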
At the end of the day, this is stopping us from periodically running CLUSTER on the table so that it doesn't keep growing and growing.
I'm sure there is a simple answer to this, but I can't find it anywhere. Some skeleton code for the process is below. It really is a simple process so I have no idea what is going on here. Any guidance would be greatly appreciated.
import os
import time

import psycopg2

# get_modified_time() and get_full_pathname() are helper functions defined
# elsewhere in the daemon (roughly os.path.getmtime() and path joining).
while True:
    l = [(get_modified_time(fname), fname) for fname in os.listdir('/tmp/data')]
    l.sort()
    for (t, fname) in l:
        conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
        cursor = conn.cursor()
        # Calls a postgresql function that reads a file and imports it into
        # a table via INSERT statements and DELETEs any records that have the
        # same unique key as any of the records in the file.
        cursor.execute("SELECT import(%s, %s);", (fname, t))
        conn.commit()
        cursor.close()
        conn.close()
        os.remove(get_full_pathname(fname))
    # brief pause before rescanning the directory
    time.sleep(0.100)
What's wrong with autovacuum? When autovacuum does its job, you don't have to use CLUSTER to clean up dead tuples. CLUSTER isn't made for this; VACUUM is.
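If autovacuum isn't keeping up with the delete/insert churn, it can be made more aggressive for just this table. A minimal sketch, assuming a release with per-table autovacuum settings (8.4+); the table name and thresholds are only examples to tune for your load:

ALTER TABLE imported_table SET (
    autovacuum_vacuum_scale_factor = 0.05,  -- vacuum after ~5% of rows are dead
    autovacuum_vacuum_threshold    = 1000
);

-- verify that autovacuum is actually visiting the table
SELECT relname, last_autovacuum, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'imported_table';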
If you change the process to UPDATE duplicates instead of DELETE + INSERT, things might get even better when combined with a lower FILLFACTOR: HOT updates. These are faster, reclaim space, keep the same order in storage, and remove the need for VACUUM or CLUSTER.
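A sketch of what that could look like; the table, column, and key values are placeholders and the fillfactor is just a starting point. Keep in mind that a lower fillfactor only affects newly written pages, and HOT only applies when the updated columns are not indexed:

ALTER TABLE imported_table SET (fillfactor = 70);

-- per CSV row: update in place if the key already exists...
UPDATE imported_table
SET    value = 'value from the csv row'
WHERE  unique_key = 'key from the csv row';

-- ...and insert only when the key was not there yet
INSERT INTO imported_table (unique_key, value)
SELECT 'key from the csv row', 'value from the csv row'
WHERE NOT EXISTS (
    SELECT 1 FROM imported_table WHERE unique_key = 'key from the csv row'
);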