Introduction
Database systems like PostgreSQL provide a useful layer of abstraction: you can store and retrieve data using SQL (via the Django ORM) without having to worry about how the database manages the data under the hood. Database indexes enable fast queries, and automatic vacuuming cleans up old row versions and deleted rows. But as tables become extremely large, the abstraction starts to break down, and performance issues become apparent. In this talk, I'll describe what options are available (including partitioning) to manage large database tables, and how we're approaching this problem at Kraken Tech.
The problem
At Kraken Tech, one of our installations of Kraken has a table with about 9 billion rows, weighing in at about 3 TB. This table takes around 20 hours to vacuum, and it needs vacuuming about once a day. During a recent bigint conversion project (see my talk from DjangoCon Europe 2025), this long and frequent vacuuming interfered with the other maintenance work we were doing on the table. We'd like to reduce the time taken to vacuum, and other performance improvements would also be welcome. We'd also like the solution we adopt to be generally applicable to other large tables.
One approach would be to delete old data from the table and store it in another storage system instead (such as S3). However, that would also require application-level changes to enable access to the old data, and those changes wouldn't necessarily generalise to other tables.
Another approach would be to start with a brand new table for all new data. The new table will be fast (until it grows to the size of the old one), and the old table will no longer need vacuuming once writes to it stop. However, this approach also requires application-level changes to access both tables, and it only defers the problem rather than solving it.
A third approach is to partition the table using PostgreSQL's native support for partitioning. Partitioning splits a large table into a number of smaller tables, with each row assigned to one of the partitions based on the value of its partition key. These smaller tables have their own indexes and are vacuumed separately as needed, which addresses the problem of slow vacuuming. And accessing a partitioned table is, for the most part, transparent to the application.
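As a rough sketch of the routing idea, here is how range partitioning assigns rows in Python (table and partition names are invented for illustration; the half-open bounds mirror PostgreSQL's `FOR VALUES FROM ... TO ...` syntax, where the lower bound is inclusive and the upper exclusive):

```python
from datetime import date

# Hypothetical partitions of a "readings" table, keyed by a date column.
# Each partition covers a half-open range [lower, upper) of the key.
PARTITIONS = {
    "readings_2024": (date(2024, 1, 1), date(2025, 1, 1)),
    "readings_2025": (date(2025, 1, 1), date(2026, 1, 1)),
}

def route(key: date) -> str:
    """Return the name of the partition whose range contains the key."""
    for name, (lower, upper) in PARTITIONS.items():
        if lower <= key < upper:
            return name
    # PostgreSQL likewise raises an error when no partition matches.
    raise ValueError(f"no partition for key {key}")

print(route(date(2025, 6, 30)))  # → readings_2025
```

PostgreSQL performs this routing inside the database on every insert, which is why the application can mostly keep writing to the parent table as if nothing had changed.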
Requirements, choices and compromises
PostgreSQL supports three types of partitioning: range, list and hash; choosing between them is one decision we need to make. Another is how many partitions to use, and when to add new ones.
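To make the other two strategies concrete, here is a hedged Python sketch of how list and hash partitioning assign rows (partition names and key values are invented; PostgreSQL uses its own internal hash functions, so `crc32` here stands in purely for illustration):

```python
import zlib

# List partitioning: each partition owns an explicit set of key values.
LIST_PARTITIONS = {
    "accounts_gb": {"GB"},
    "accounts_eu": {"DE", "FR", "ES"},
}

def route_by_list(region: str) -> str:
    for name, values in LIST_PARTITIONS.items():
        if region in values:
            return name
    raise ValueError(f"no partition for region {region}")

# Hash partitioning: rows are spread evenly across a fixed number of
# partitions by hashing the key and taking the remainder.
def route_by_hash(key: str, modulus: int = 4) -> int:
    return zlib.crc32(key.encode()) % modulus

print(route_by_list("FR"))  # → accounts_eu
```

List partitioning suits keys with a small, known set of values; hash partitioning gives an even spread when no natural range or grouping exists.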
But the biggest choice, and the one with the most requirements and consequences, is which column(s) to use for the partition key. There will usually be a compromise here, since the partition key that offers the biggest performance improvement often won't support the constraints we want. And Django imposes requirements of its own that affect the choice of partition key.
Partitioning existing tables
For our use case at Kraken Tech, we would like to partition existing tables without imposing any system downtime. The PostgreSQL extension pg_partman supports "online" partitioning that enables this, but via manual processes that don't scale across the large number of installed systems we support. We have developed a Python package, psycopack, for replicating PostgreSQL tables and performing schema changes in the process; we are working on enhancing it to support partitioning as well.