Introduction

In one of the projects I used to work on, we employed the CQRS approach, with PostgreSQL as the write storage and a NoSQL database as the read storage. As a safety measure, we had a special endpoint that allowed us to regenerate the entire content of the read storage from the write storage, which is supposed to be the single source of truth.

The story started one day when we discovered that this safety handle had begun to fail with write storage timeouts caused by the several million rows accumulated there. Generally speaking, the solution proposed below is applicable to a wide range of data storages, but we’ll touch upon some Postgres-specific tooling, so let’s focus on that.

Partitioning overview

Partitioning is the process of splitting a dataset into smaller, more manageable pieces, which in turn yields reduced INSERT and SELECT costs. Partitioning can be accomplished in many ways. For instance, vertical partitioning separates data into columns. Horizontal partitioning, on the contrary, divides data into rows according to particular criteria (the partition key). When choosing a partition key, we can use the following strategies:
- range partitioning (splitting by date or numeric ranges)
- list partitioning (splitting by categorical values like region)
- hash partitioning (even distribution using a hash function)
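In Postgres terms, each of these strategies maps to a PARTITION BY clause on the parent table. A minimal sketch (table and column names here are illustrative, not from the project):

```sql
-- Range partitioning: rows are routed by a date or numeric interval.
CREATE TABLE measurements (
    id bigint NOT NULL,
    taken_at timestamptz NOT NULL
) PARTITION BY RANGE (taken_at);

-- List partitioning: rows are routed by a discrete categorical value.
CREATE TABLE customers (
    id bigint NOT NULL,
    region text NOT NULL
) PARTITION BY LIST (region);

-- Hash partitioning: rows are distributed evenly by a hash of the key.
CREATE TABLE events (
    id bigint NOT NULL
) PARTITION BY HASH (id);
```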

Partitioning vs sharding

While partitioning occurs within a single database instance, sharding is a way to split the data across multiple physical instances. Thus, sharding increases operational costs: you need a proxy that routes requests to shards, distributed transactions, and expensive cross-shard joins.

For that reason, we should employ sharding only when partitions exceed the storage limits of a single machine.

Choosing the right partitioning strategy

So our plan is to split the write storage into some manageable chunks so that none of those chunks will fail due to a timeout. How do we design those chunks?

The code of the safety handle reads items from the write storage in batches. I already blogged about how batching can be a neat solution for handling a large dataset. So ideally, we would like each partition to be equal to, or several times bigger than, the batch size, so we can read an entire partition in an integer number of iterations.
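The batched read I am referring to can be sketched as keyset pagination over the integer primary key (the batch size and the `:last_seen_id` placeholder are illustrative):

```sql
-- Read one batch of 200 000 rows, starting after the last Id seen so far.
-- With range partitioning on "Id", a batch of this size touches exactly
-- one partition (or an integer number of them).
SELECT "Id", "UserName", "TimeStamp", "Entity", "OrganizationInfo"
FROM public."Entities"
WHERE "Id" > :last_seen_id
ORDER BY "Id"
LIMIT 200000;
```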

For that reason, we have to discard the idea of list partitioning or range partitioning based on serial data (such as timestamps), as they generally produce an uneven distribution of data.

The other two options are hash partitioning and range partitioning based on an integer primary key. Since both of them offer an even distribution, we’ll stick with the simpler range-based approach.

Using pg_partman

The go-to solution for partitioning in PostgreSQL is the pg_partman extension.

First, let’s enable the extension.

CREATE SCHEMA partman;
CREATE EXTENSION pg_partman WITH SCHEMA partman;

To partition existing data, we’ll create a new partitioned table with the desired structure and migrate data from the old table to the new one.

ALTER TABLE public."Entities" RENAME CONSTRAINT "PK_Entities" TO "PK_Entities_old";

ALTER TABLE public."Entities" RENAME TO "Entities_old";

CREATE TABLE IF NOT EXISTS public."Entities"
(
    "Id" bigint GENERATED ALWAYS AS IDENTITY,
    "UserName" character varying(255) COLLATE pg_catalog."default" NOT NULL,
    "TimeStamp" timestamp with time zone NOT NULL,
    "Entity" jsonb,
    "OrganizationInfo" text COLLATE pg_catalog."default",
    CONSTRAINT "PK_Entities" PRIMARY KEY ("Id")
)  PARTITION BY RANGE ("Id");

Now we have to figure out how many partitions to create, which requires some math: we’ll take the number of existing rows and, given the row count of a single partition, calculate how many partitions we need.

DO $$
DECLARE
    v_row_count bigint;
    v_partition_size bigint := 200000;
    v_premake integer;
BEGIN
    SELECT count(*) INTO v_row_count
    FROM public."Entities_old";

    v_premake := ceil(v_row_count::numeric / v_partition_size) + 1;

    PERFORM partman.create_parent(
        p_parent_table := 'public.Entities',
        p_control      := 'Id',
        p_interval     := v_partition_size::text,
        p_premake      := v_premake
    );
END $$;

A caveat here is the p_premake parameter, which plays an important role: it is the number of partitions created in advance when performing partitioning. By default it is 5, which means that if we have 10 000 000 rows in the original table and we configure partitions of 200 000 rows each, we’ll end up with 5 partitions holding 1 000 000 rows between them, and the other 9 000 000 rows residing in the default partition!

p_premake is rarely mentioned in examples dedicated to pg_partman, as they tend to use quite a modest amount of data for illustration, so I spent a lot of time debugging and researching until I discovered it.
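A quick way to spot this problem is to count the rows that have leaked into the default partition. pg_partman names the default partition after the parent, so here I am assuming it is called "Entities_default":

```sql
-- A non-zero count means p_premake was too small
-- (or partition maintenance has not run yet).
SELECT count(*) FROM public."Entities_default";
```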

Before we wrap up by transferring the data into the new table, it should be noted that pg_partman does not create new partitions automatically. This means that if we have 5 partitions of 200 000 rows each, then when we insert row number 1 000 001, the 6th partition won’t be created; instead, the new record will fall into the default partition. For that reason, we have to perform partition maintenance regularly using pg_partman’s built-in functions.

To achieve this, we’ll install pg_cron extension, which initiates an internal scheduler.

CREATE EXTENSION pg_cron;

SELECT cron.schedule('@hourly', $$CALL partman.partition_data_proc('public.Entities');$$);

Calling partition_data_proc allows us to create new partitions and move data out of the default partition, if any.
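To verify that the schedule is doing its job, we can enumerate the child partitions pg_partman knows about (the exact output columns depend on the pg_partman version):

```sql
-- List the child partitions of the parent table, in partition order.
SELECT partition_schemaname, partition_tablename
FROM partman.show_partitions('public.Entities');
```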

With that done, the last step is migration to a new table, which is pretty straightforward.

INSERT INTO public."Entities"(
    "UserName", "TimeStamp", "Entity", "OrganizationInfo")
SELECT "UserName", "TimeStamp", "Entity", "OrganizationInfo"
FROM public."Entities_old"
ORDER BY "Id";

DROP TABLE IF EXISTS public."Entities_old";

Assessing the result

Before partitioning, we had 9 million rows in the table mentioned above, and selecting a single batch from it could take up to 30 seconds. Now, with a little bit of SQL glue, we have reduced this time to fractions of a second: an improvement of orders of magnitude.
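The speedup comes from partition pruning: when a query filters on the partition key, the planner scans only the relevant child tables. This is easy to confirm with EXPLAIN (the Id range below is an arbitrary example):

```sql
-- Only the partition(s) covering this Id range should appear in the plan;
-- the remaining children are pruned at planning time.
EXPLAIN
SELECT *
FROM public."Entities"
WHERE "Id" BETWEEN 400001 AND 600000;
```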