2022-05-13

pg_ivm 1.0 released!

pg_ivm v1.0 was officially released!

pg_ivm is an extension module that provides Incremental View Maintenance (IVM) feature, which is a way to make materialized views up-to-date in which only incremental changes are computed and applied on views rather than recomputing.

I have already explained pg_ivm in the previous post, but it was in alpha release stage at that time. The official v1.0 was released after a few bug fixes and code cleaning.

Also, RPM packages for pg_ivm are now available from PostgreSQL yum repository. So, you can also install pg_ivm for PostgreSQL 14 using yum or dnf intead of building from the source code, like:

$ sudo yum install pg_ivm_14

See the official instruction for details.

We would appreciate it if you could try pg_ivm v1.0 and give us any feedback!

2022-04-15

pg_ivm: a PostgreSQL extension providing Incremental View Maintenance feature

Introduction

As I introduced in the past posts ([1], [2], [3]), we have proposed to implement Incremental View Maintenance (IVM) support in PostgreSQL core. This project is aiming to add the new feature into PostgreSQL in future, but not to make a tool that enables the current PostgreSQL to use IVM. However, since publishing the code to GitHub repository, we have received several questions of whether we can use our IVM feature in the current PostgreSQL versions. So, for meeting this demand, we decided to launch a new project, that is named pg_ivm.

pg_ivm is the extension module that provides Incremental View Maintenance (IVM) feature for PostgreSQL, and this enables you to use IVM with the current PostgreSQL version!

This post describes what is pg_ivm and how to use it. The alpha version of pg_ivm 1.0 is released for public testing at the end of last month. So, you can try pg_ivm in the same way as explained in this post, and it would be so appreciated if you would give us any feedback for pg_ivm.

Installation

The installation is same as the other extension modules.

First, execute make install under in the module directory.

$ cd pg_ivm
$ make install USE_PGXS=1

If you installed PostgreSQL from rpm or deb, you will need the devel package (for example, postgresql14-devel or postgresql-server-dev-14). Set the PG_CONFIG variable (make PG_CONFIG=...) in case you want to install pg_ivm to a non-default PostgreSQL.

Then, execute CREATE EXTENSION under the super user.

CREATE EXTENSION pg_ivm;

How to Use

We call a materialized view that are maintained incrementally as an Incrementally Maintainable Materialized View (IMMV).

In our proposal for the PostgreSQL core ([4] ), an IMMV is a special type of materialized view and you can create one by executing CREATE INCREMENTAL MATERIALIZED VIEW command. On the other hand, in the pg_ivm extension, an IMMV is a special type of table and you can create one by calling the function create_immv.

For example, if you want to create IMMV with the same definition of a materialized view defined as;

test=# CREATE MATERIALIZED VIEW mv_normal(aid, bid, abalance, bbalance) AS
       SELECT a.aid, b.bid, a.abalance, b.bbalance
        FROM pgbench_accounts a JOIN pgbench_branches b USING(bid);
SELECT 10000000

call create_immv as below;

test=# SELECT create_immv('immv(aid, bid, abalance, bbalance)',                   
                   'SELECT a.aid, b.bid, a.abalance, b.bbalance
                    FROM pgbench_accounts a JOIN pgbench_branches b USING(bid)');
NOTICE:  created index "immv_index" on immv "immv"
 create_immv 
-------------
10000000
(1 row)

This function has two arguments of text and returns the number of rows in IMMV. The first is the name of the IMMV with the optional column names, and the second is the view definition query.

The created IMMV is updated automatically and incrementally when its base table is modified. For example, after updating a row in pgbench_accounts, immv is automatically updated like this;

test=# UPDATE pgbench_accounts SET abalance = 1234 WHERE aid = 1;
UPDATE 1
Time: 15.448 ms

test=# SELECT * FROM immv WHERE aid = 1;
 aid | bid | abalance | bbalance 
-----+-----+----------+----------
   1 |   1 |     1234 |        0
(1 row)

It took about only 15 ms, whereas REFRESH of the materialized view with the same definition required 20 sec.

test=# REFRESH MATERIALIZED VIEW mv_normal ;
REFRESH MATERIALIZED VIEW
Time: 20575.721 ms (00:20.576)

Summary

In this post, I explained the pg_ivm extension module that provides Incremental View Maintenance (IVM) feature for PostgreSQL.

It is still in alpha release stage and the official release is planed in this month. The first release will support only sub-set of features implemented in the original project, and aggregates will not be supported, for example. Also, it is compatible with only PostgreSQL 14 for now. After confirming that they works without problems, we are going to support the remaining features and other PostgreSQL versions.

2021-11-29

Transition Tables in Incremental View Maintenance (Part II): Multliple Tables Modification case

Introduction

In a previous post, I explained how we use transition tables in our implementation of Incremental View Maintenance (IVM) on PostgreSQL. Transition table is a features of AFTER trigger which allows trigger functions to refer to the changes of a table that occurred in a statement. We are using transition tables in order to extract table changes needed to calculate changes to be applied on views .

In this article I describes a more complicated situation, specifically how we handle transition tables when multiple tables are modified in a statement.

Single Table Modification

In a case where a single table is modified in a statement, the view maintenance process is simple. For example, suppose we have three tables R, S, and T. We also define a materialized view V = R ⨝ S ⨝ T that joins these tables as bellow:

SELECT x,y,z FROM R,S,T WHERE R.i=S.i AND S.j=T.j;

Then, suppose that table R was modified in a statement. This operation can be written as R ← R ∸ ∇R ⊎ ΔR, where ∇R is a bag of tuples deleted from R, and ΔR is a bag of tuples inserted into R in this statement. In this case, the changes are calculated as ∇V = ∇R ⨝ S ⨝ T and ΔV = ΔR ⨝ S ⨝ T, and we can update the view as V ← V ∸ ∇V ⊎ ΔV. The SQL representation of these calculations is as follows:

-- ∇V: tuples to be deleted from the view 
SELECT x,y,z FROM R_old,S,T WHERE R_old.i=S.i AND S.j=T.j;
-- ΔV: tuples to be inserted into the view
SELECT x,y,z FROM R_new,S,T WHERE R_new.i=S.i AND S.j=T.j;

where R_old and R_new are transition tables corresponding to ∇R and ΔR, respectively.

Multiple Tables Modification

Now, let’s see cases where multiple tables are modified in a statement. You can observe it when you use modifying CTEs (WITH clause), like:

WITH i1 AS (INSERT INTO R VALUES(1,10) RETURNING 1), 
     i2 AS (INSERT INTO S VALUES(1,100) RETURNING 1) 
SELECT;

In addition, multiple tables can be updated when you use triggers, or foreign key constraint.

Pre-Update State of Tables

At that time, we need the state of tables before the modification. For example, when some tuples, ΔR, ΔS, and ΔT are inserted into R, S and T respectively, the tuples that will be inserted into the view are calculated by the following three queries:

SELECT x,y,z FROM R_new,S_pre,T_pre WHERE R_new.i=S_pre.i AND S_pre.j=T_pre.j;
SELECT x,y,z FROM R,S_new,T_pre WHERE R.i=S_new.i AND S_new.j=T_pre.j;
SELECT x,y,z FROM R,S,T_new WHERE R.i=S.i AND S.j=T_new.j;

where R_new, S_new, T_new are transition tables corresponding to ΔR, ΔS and, ΔT respectively. S_pre, T_pre are the table states before the modification, R, S are the current states of tables, that is, after the modification.

In our implementation, the “pre-update state of table” is calculated by using some system columns. Specifically, tuples inserted into the table are filtered by cmin/xmin system columns. Also, tuples deleted from the table are put back by appending tuples stored in the old transition table.

Collecting Multiple Transition Tables

As shown above, we need transition tables as many as table modifications for view maintenance. Therefore, transition tables for each modification are collected in each AFTER trigger function call. Then, the view maintenance is performed in the last call of the trigger.

However, in the original PostgreSQL, lifespan of transition tables is not enough long to preserve all transition tables during multiple modifications. Therefore, in order to implement IVM feature, I fixed the trigger code in PostgreSQL to allow to prolong a transition table’s lifespan and prevent it from being freed from memory early.

Summary

In this article, I explained how transition table is used when multiple tables are modified simultaneously in our IVM implementation. Transition tables are useful feature to extract table changes, but when we wanted to use it to implement IVM, we had to devise methods for calculating pre-update state and prolonging transition tables lifespan. If there would be cases where these transition table’s features are useful other than IVM, implementing them in PostgreSQL core as an independent proposal might make sense. However, we don’t have good idea about use cases other than IVM for now, so they are just for IVM.

2021-10-07

Transition Tables in Incremental View Maintenance

Introduction

In a previous post, I explained that our implementation of Incremental View Maintenance (IVM) on PostgreSQL were using AFTER trigger and transition tables. In this article I describes how we use transition tables in the IVM implementation.

Transition Tables

In order to calculate changes to be applied to views, changes that occurred on the updated tables are extracted using AFTER triggers and transition tables.

The transition table is a features of AFTER trigger, which allows trigger functions to refer to the changes that occurred on the table like normal tables. There are two transition tables; one contains “tuples deleted from the table” and another contains “tuples inserted into the table”.

When you create a trigger using CREATE TRIGGER command, you can specify names of transition tables by REFERENCING clause. In the following example, old_table and new_table are used as transition table names.

CREATE TRIGGER update_trigger
    AFTER UPDATE ON tbl
    REFERENCING OLD TABLE AS old_table NEW TABLE AS new_table
    FOR EACH STATEMENT EXECUTE FUNCTION update_trigger_function();

If you define the function using PL/pgSQL, these transition tables can be referenced in SQL by your specified names like normal tables although they do not exist in the catalogs. Instead, tuples of transition tables are stored in tuplestore, which is in memory or temporary files. Such kind of table is called ephemeral named relation.

When you write a trigger function in C language instead of PL/pgSQL, you can use Server Programming Interface (SPI) to run SQL commands. (You can find an example in the documentation.) In order to use transition tables in trigger functions written in C, you have to call SPI_register_relation or SPI_register_trigger_data to make ephemeral named relation(s) available.

Transition Tables in IVM

In our IVM implementation, when a base table is modified, a trigger function is called for maintaining a materialized view. In this function, view delta tables that contain the changes that will occur on the view are calculated as explained in before.

The trigger function is written in C, but we don’t use SPI for this calculation because we don’t run SQL command directly. Instead, we use a view definition query stored as an internal structure (namely, Query Tree). For example, suppose we have a materialized view defined as bellow:

SELECT x,y,z FROM R,S where R.i = S.i;

This SQL query is stored as the view definition after transformed to a query tree. Then, suppose that table R is modified. Using the stored query tree, we can get the query to calculate the view delta tables by replacing R with transition tables (old_table_R and new_table_R) as bellow:

-- tuples to be deleted from the view 
SELECT x,y,z FROM old_table_R,S where old_table_R.i = S.i;
-- tuples to be inserted into the view
SELECT x,y,z FROM new_table_R,S where new_table_R.i = S.i;

Specifically, the query tree is rewritten so that the range table entry of R is replaced with it of a transition table, and run the rewritten query tree. The results are stored into tuplestores. Then, these tuplestores are transferred to ephemeral named relations, and applied to the view by SQL executed via SPI.

Summary

In this article, I explained what is transition table and how it is used in our IVM implementation. In the next post, I will explain a more complicated situation, that is, where multiple tables are modified simultaneously.

2021-09-15

Tuple Duplicates Support in Incremental View Maintenance

Introduction

In the previous posts ([1], [2], [3]), I have explained the Incremental View Maintenance (IVM) that we are proposing as a new feature of PostgreSQL. As I explained in [1], our IVM implementation supports tuple duplicates, that is, it allows views to contain duplicated tuples. This article describes how we are handling tuple duplicate in the IVM implementation.

Tuple Duplicates in Incremental View Maintenance

In a [3], I introduced the basic theory of Incremental View Maintenance (IVM). Now, I will briefly review it.

Suppose table R is updated and the changes are stored into two delta tables, ∇R and ΔR. ∇R is the collection of tuples deleted from R, and ΔR is the collection of tuples inserted into R. Using them, we can calculate the delta tables of the view, ∇V and ΔV, that contain the changes that will occur on the view.

Finally, we can update the view incrementally by applying the delta tables ∇V and ΔV to the view V. Specifically, tuples in ∇V are deleted from V, then tuples in ΔV are inserted into V.

Note here that tables and views may contain duplicate tuples. Because SQL is based on bag-semantics, not set-semantics. For example, suppose view V contains two tuples A and three tuples B. This is expressed as V = {A (2), B (3)}, which means that tuple A has a multiplicity of 2 and tuple B has a multiplicity of 3. Then, suppose you want to delete one tuple A and two tuple B. That is, ΔV = {A (1), B (2)}, and we want to perform the operation V ← V ∸ ΔV. (Here, ∸ is an operator called “monus”.)

How to Remove the Specified Number of Tuples?

Now, how can we get the result of this operation in PostgreSQL?

If you simply use DELETE as shown below, you will not be able to delete the specified number of tuples because all rows will be deleted.

DELETE FROM V WHERE x IN (SELECT x FROM ΔV);

One way to get the desired result is to use EXCEPT ALL as follows:

(SELECT * FROM V) EXCEPT ALL (SELECT * FROM ΔV);

However, while this gives the result of the deletion, it does not actually remove the tuple from V. Also, EXCEPT ALL is inefficient because it sorts the entire subquery result.

If you’re focus on a particular tuple, you can use ctid, which is a physical ID that has different values for tuples with the same content . For example, you can delete two tuples B with the following command:

DELETE FROM V WHERE ctid IN (SELECT ctid FROM ΔV WHERE x ='B' LIMIT 2);

However, if you want to delete multiple types of tuples, you need to execute this command multiple times, which is not efficient if there are many types of tuples to delete.

As shown above, there is no command to delete the specified number of tuples although SQL is based on bag-semantics that allows tuple duplication. Perhaps this was because there was little practical need.

Nevertheless, in order to implement IVM, we need the operation V ← V ∸ ΔV. Our IVM implementation solves this problem with the row_number() window function. Specifically, the following complex SQL is executed. The same tuples in the view V are numbered by row_number(). Then, by comparing that number (= __num__) with the multiplicity of tuples of ΔV (= __count__), the ctids of the tuple to be deleted are obtained.

DELETE FROM V WHERE ctid IN (
  SELECT tid FROM (
    SELECT row_number () OVER (PARTITION BY x) AS __num__,
                   V.ctid AS tid,
                   ΔV .__ count__
    FROM V, ΔV WHERE V.x = ΔV.x) sub
WHERE sub.__ num__ <= sub.__count__;

Although view V must have the appropriate index defined for efficiency because using the window function requires sorting the tuples in V, we can get the desired results by this way.

Summary

As explained in this article, even an operation that can be written by a theoretically simple formula such as V ← V ∸ ΔV will be complicated to implement. This is one of the challenges of implementing an IVM that supports tuple duplication.

2021-08-02

Reducing the size of IVM patch set

In the last three posts [1, 2, 3], I've covered the features, performance, and implementation of Incremental View Maintenance (IVM) that we are proposing as a new feature of PostgreSQL.

In the current code in the repository, the IVM implementation supports views whose definition query contains SELECT ... FROM ... WHERE ..., joins (including self-join and outer joins), some built-in aggregate functions (count, sum, avg, min and max), DISTINCT clause, some types of sub-queries, and CTEs. Also, the view can contain multiple tuples with the same content (duplicate tuple support).

As you can imagine, we made large changes for the original PostgreSQL code for implementing all the above features. As the result, the size of patch set proposed to the development community has grown too large. This would cause to make it hard to review the patch.

We hope the IVM feature will be adopted into PostgreSQL 15. Therefore, we decided to reduce the size of patch by omitting a part of features in the first release in order to make the review easier. Especially, I would like propose the IVM supporting inner join, DISTINCT, some built-in aggregates, and views with tuple duplicates as a new feature of PostgreSQL 15. Support for outer-join, sub-queries, and CTEs are excluded here. As a result, the size of the latest patch size became less than half of the previous.

2021-06-30

Implementing Incremental View Maintenance for PostgreSQL (Part III)

Introduction

In the previous post, I showed some performance evaluation of Incremental View Maintenance (IVM) feature that is under development. In this article, I will explain the implementation of IVM and how it works.

Simple Example of Incremental View Maintenance

First, I will explain the process of IVM using a simple example.

Suppose you have a view V defined by natural join on tables R and S as shown below.

Then, suppose that table R is updated as follows, and (1, one, I) is changed to (1, first, I). In the figure, ∇R is the set of tuples deleted from R, and ΔR is the set of tuples inserted into R. Using these delta tables (∇R and ΔR) , the contents of table S, and the view definition of V, we can calculate the change that will occur on the view, that is, ∇V and ΔV.

Finally, we can update the view incrementally by applying the delta tables ∇V and ΔV to the view V.

Steps in Incremental View Maintenance

In our implementation, Incremental view Maintenance is performed in the three steps:

Extract the changes that occurred on the updated table
Calculate the changes that will occurs on the view
Apply the changes to the view

Extract the table change

In our implementation, we use statement-level AFTER triggers and transition tables to extract the change that occurred in the table.

AFTER triggers are automatically created on all base tables contained in the view definition query when the view is created by CREATE INCREMENTAL MATERIALIZED VIEW command. The triggers are created for INSERT, DELETE, and UPDATE commands, but the TRUNCATE command is not currently supported.

The Transition Table is a features of AFTER trigger, which allows trigger functions to refer to the changes that occurred on the table like normal tables. There are two transition tables; one contains “tuples deleted from the table” and another contains “tuples inserted into the table”, which correspond to the delta tables∇R and ΔR explained above, respectively.

Calculate the view delta

Changes that occurs in views are calculated using the extracted table changes and the view definition queries, as described above. More specifically, we rewrite the view definition query and replace the update table in the query with the transition table. Actually, what is rewritten is not the query string written in SQL but the internal structure (struct Query) which represents the parsed query.

By executing These rewritten queries, we can get the delta tables that contains changes which will occur on the views. These view delta tables are corresponds to ∇V, ΔV.

Apply deltas to the view

The calculated view deltas are applied to the materialized view using a SQL query. Specifically, we use DELETE statement to search for tuples to be deleted (contents of ∇V) from the materialized view and delete them, and use INSERT statement to insert tuples to be inserted (contents of ΔV) into the materialized view.

For views on aggregate query, the aggregated values in the view are updated using UPDATE statement. The way to update depends on aggregate functions. For example, with regard to count() and sum(), we can calculate the new aggregated value by adding an aggregated value calculated from the view delta to the old aggregated value in the view. For calculating the new avg() value, we need to know new sum() and count() values beforehand, so these values are stored in the view as hidden columns.

Recall that a unique index is automatically created if possible when the view is defined. This index is necessary to identify tuples that should be removed from the view efficiently. Without the index, all tuples in the view would be scanned sequentially to search tuples to be deleted, which would be inefficient.

In our implementation, if all the primary key columns of all the tables appear in the target list of the view definition query, a unique index on these columns are created. In the case of a view on aggregate query, a unique index is created on the columns used in GROUP BY clause. We also create a unique index if the view has DISTINCT clause. Otherwise, a index is not automatically created, so you would need to create an appropriate index manually for efficient IVM.

Conclusion

In this article, I explained the overview of our IVM implementation. This feature is under discussion and we are continuously developing targeting PostgreSQL 15. If you want to learn more details, please see the GitHub repository and the discussion. in pgsql-hackers. Also, if you would like to test this feature, you could get a Docker image from Docker Hub.

We would appreciate any your suggestion, question, and feedback.