Scaling past the single machine

TL/DR

We cut down disk usage to 12% of the original by using btrfs with zstd compression, and organizing the table data on disk for optimal query performance.

By using PostgreSQL 10 native partitioning, postgres_fdw and btrfs for storage we cut down disk usage while maintaining performance.

Intro

Here at Modio AB we store a fair bit of data, just about hitting the big data size on some of our datasets.

We've got enough data that it's unfeasible to back it up, and it doesn't fit well in RAM either. It's enough that asking a DB server to read data from disk will cause enough IO load to incur damages on other clients, simply because of disk activity.

But all data are not equal. The data from the last few weeks is more likely to be accessed and showed up than data from a year ago, so all data doesn't have to be treated equally.

For this article, I'll use a simple data set table of the following (example) schema:

`id`: int	`clock`: int	`ns`: int	`value`: int
12	1525080407	0123123	0
1200	1525080407	0123123	12

The table will have an index over the pair (id, clock) and nothing else. To avoid inter-table lookups at insert time, id is not a primary key.

Since this is time series data, it will naturally come in sorted by the clock field (as events happen), so all data for one moment is stored togeter on disk.

Most queries are in the form, "outside temperature between 12 and 14", expressed in SQL as:
select * from history where id=12 and clock between (1525080407) and (1525080407+3600)

This would be better suited if all the data for a single sensor would be located together. Thus, our natural insert order causes data fragmentation on disk because it's not laid out in the way we query such data.

This is the cost of using a generic datastore vs. an time series datastore that knows more about the layout of the data. A columnar store, or a specialized time series storage layout performs a lot better here.

Partitioning

Native partitioning is a new feature in PostgreSQL 10, and it allows you to split out your data into smaller sub-groups. In our case, we use range based partitioning for the clock column, in order to split data into tables on a monthly basis.

Column	Type	Nullable	Storage
id	integer	not null	plain
clock	integer	not null	plain
value	integer	not null	plain
ns	integer	not null	plain

Partition key: RANGE (clock)
Partitions: history_y2018m01 FOR VALUES FROM (1514764800) TO (1517443200),
            history_y2018m02 FOR VALUES FROM (1517443200) TO (1519862400),
            history_y2018m03 FOR VALUES FROM (1519862400) TO (1522540800)

Partition maintenance

Once we've partitioned our data, we have smaller, more manageable datasets of ~10-15GB, each containing half a billion of datapoints, it should be more manageable.

This involves using the CLUSTER command. It will re-organize a table on disk, rewriting all data to appear in the order prescribed by an index on the table. There is also an extension called pg_repack that can do similar, but online ( pg_repack requires a primary key or a unique key, which we don't have).

CLUSTER takes an exclusive, database-level, read-and-write lock on the table, and thus any read or write to the primary partition will block while it happens. To prevent this, we start by detaching the partition, and attaching a new, empty table in it's place. This causes a gap in the data for the time period that the maintenance runs.

After this, we make sure that there's a btree index on the table for (itemid, clock), since CLUSTER cannot use the more space-efficient BRIN index.

Once that's done, we CLUSTER the table, add CHECK constraints, and return it to it's position in the primary table.

For the hairy details, you can see the source code.

The disk space issue

Now we have neatly ordered data on disk, but it's still using a lot of space. How do we deal with this? I set about to test out cstore_fdw, which uses the ORC format to store and compress data.

The results of such experiments were mixed, and tests with our usual queries were much slower than usual. Our SELECT benchmark resulted in, at best, 11 Transactions per second, compared to approximately 150-200 otherwise.

cstore_fdw gave good savings on disk space (8.5 GB on disk for 75GB stored data).

As an alternative to that, I ran a test with btrfs+zstd for archival tables, and while the compression was slightly worse ( approximately 3%), query performance was an order of magnitude higher.

The below table uses on disk size, not apparent size, and differ between various filesystems, so they are more for ballpark idea, than exact measures. The dataset for this test is ~75GB of time series, and the queries represent our typical workload. Machine and postgres configuration is the same for all tests. Data is much larger than available RAM.

solution	disk used	insert tps	select tps
no compression	75717528	3879	187
cstore_fdw	9604680	4565	11
btrfs (zstd)	9773740	4663	166

In other tests, compression gave a slight advantage to query perfomance, simply because of reduced IO latencies.

Putting it together

So, a simple solution wrapping this together would be to use tablespaces where archival data is on a different filesystem from normal data. For various reasons, we decided against this to begin with, a decision we might return to.

Instead, we're using postgres_fdw to use remote tables for archival data on a second database server.

Moving data from one database to another over postgres_fdw turned out to be slow, so I had to make a solution to efficiently copy data between the databases, using the COPY command and pipes.

The steps

So, the tool will:

Create tables on the archive database
Create foreign tables on the primary database
Detach source table from the parent table
Attach foreign table to the parent table
Use COPY to move and sort the data from primary to archive
Cluster the table after data injection
Clean up after itself

Overall, getting all the small details correct has been difficult. However, the end result is a reduction of disk usage to 12% of the original, while keeping query performance at around 90% of the original.

Compression Gains

A single month's data (September 2017) according to compsize:

Type	Perc	Disk Usage	Uncompressed	Referenced
TOTAL	11%	1.4G	12G	12G

BRIN index vs. Btree index

For the above table a brin index is approximately 4.1MB uncompressed, while a btree index for the same columns and dataset is 7.4GB.

What about dropping nanosecond resolution of data?
We can't just remove a column from the table, even if that would be desirable, since that would break the foreign tables. However, zeroing it in the table and letting disk compression sort it out can help there. However, this only saves ~1.1% in size.

All together now

Table partitioning in ranges
Use brin indexes on older data (disk savings)
Use btreeindexes on fresh data (performance)
CLUSTER tables on disk to further optimize brin index (perfromance)
Place older partitions on btrfs with zstd (disk savings)
Use postgres_fdw to offload older data to another DB (disk savings)