2021-10-07

Transition Tables in Incremental View Maintenance

Introduction

In a previous post, I explained that our implementation of Incremental View Maintenance (IVM) on PostgreSQL were using AFTER trigger and transition tables. In this article describes how we use transition tables in the IVM implementation.

Transition Tables

In order to calculate changes to be applied to views, changes that occurred on the updated tables are extracted using AFTER triggers and transition tables.

The transition table is a features of AFTER trigger, which allows trigger functions to refer to the changes that occurred on the table like normal tables. There are two transition tables; one contains “tuples deleted from the table” and another contains “tuples inserted into the table”.

When you create a trigger using CREATE TRIGGER command, you can specify names of transition tables by REFERENCING clause. In the following example, old_table and new_table are used as transition table names.

CREATE TRIGGER update_trigger
    AFTER UPDATE ON tbl
    REFERENCING OLD TABLE AS old_table NEW TABLE AS new_table
    FOR EACH STATEMENT EXECUTE FUNCTION update_trigger_function();

If you define the function using PL/pgSQL, these transition tables can be referenced in SQL by your specified names like normal tables although they do not exist in the catalogs. Instead, tuples of transition tables are stored in tuplestore, which is in memory or temporary files. Such kind of table is called ephemeral named relation.

When you write a trigger function in C language instead of PL/pgSQL, you can use Server Programming Interface (SPI) to run SQL commands. (You can find an example in the documentation.) In order to use transition tables in trigger functions written in C, you have to call SPI_register_relation or SPI_register_trigger_data to make ephemeral named relation(s) available.

Transition Tables in IVM

In our IVM implementation, when a base table is modified, a trigger function is called for maintaining a materialized view. In this function, view delta tables that contain the changes that will occur on the view are calculated as explained in before.

The trigger function is written in C, but we don’t use SPI for this calculation because we don’t run SQL command directly. Instead, we use a view definition query stored as an internal structure (namely, Query Tree). For example, suppose we have a materialized view defined as bellow:

SELECT x,y,z FROM R,S where R.i = S.i;

This SQL query is stored as the view definition after transformed to a query tree. Then, suppose that table R is modified. Using the stored query tree, we can get the query to calculate the view delta tables by replacing R with transition tables (old_table_R and new_table_R) as bellow:

-- tuples to be deleted from the view 
SELECT x,y,z FROM old_table_R,S where old_table_R.i = S.i;
-- tuples to be inserted into the view
SELECT x,y,z FROM new_table_R,S where new_table_R.i = S.i;

Specifically, the query tree is rewritten so that the range table entry of R is replaced with it of a transition table, and run the rewritten query tree. The results are stored into tuplestores. Then, these tuplestores are transferred to ephemeral named relations, and applied to the view by SQL executed via SPI.

Summary

In this article, I explained what is transition table and how it is used in our IVM implementation. In the next post, I will explain a more complicated situation, that is, where multiple tables are modified simultaneously.

2021-09-15

Tuple Duplicates Support in Incremental View Maintenance

Introduction

In the previous posts ([1], [2], [3]), I have explained the Incremental View Maintenance (IVM) that we are proposing as a new feature of PostgreSQL. As I explained in [1], our IVM implementation supports tuple duplicates, that is, it allows views to contain duplicated tuples. This article describes how we are handling tuple duplicate in the IVM implementation.

Tuple Duplicates in Incremental View Maintenance

In a [3], I introduced the basic theory of Incremental View Maintenance (IVM). Now, I will briefly review it.

Suppose table R is updated and the changes are stored into two delta tables, ∇R and ΔR. ∇R is the collection of tuples deleted from R, and ΔR is the collection of tuples inserted into R. Using them, we can calculate the delta tables of the view, ∇V and ΔV, that contain the changes that will occur on the view.

Finally, we can update the view incrementally by applying the delta tables ∇V and ΔV to the view V. Specifically, tuples in ∇V are deleted from V, then tuples in ΔV are inserted into V.

Note here that tables and views may contain duplicate tuples. Because SQL is based on bag-semantics, not set-semantics. For example, suppose view V contains two tuples A and three tuples B. This is expressed as V = {A (2), B (3)}, which means that tuple A has a multiplicity of 2 and tuple B has a multiplicity of 3. Then, suppose you want to delete one tuple A and two tuple B. That is, ΔV = {A (1), B (2)}, and we want to perform the operation V ← V ∸ ΔV. (Here, ∸ is an operator called “monus”.)

How to Remove the Specified Number of Tuples?

Now, how can we get the result of this operation in PostgreSQL?

If you simply use DELETE as shown below, you will not be able to delete the specified number of tuples because all rows will be deleted.

DELETE FROM V WHERE x IN (SELECT x FROM ΔV);

One way to get the desired result is to use EXCEPT ALL as follows:

(SELECT * FROM V) EXCEPT ALL (SELECT * FROM ΔV);

However, while this gives the result of the deletion, it does not actually remove the tuple from V. Also, EXCEPT ALL is inefficient because it sorts the entire subquery result.

If you’re focus on a particular tuple, you can use ctid, which is a physical ID that has different values for tuples with the same content . For example, you can delete two tuples B with the following command:

DELETE FROM V WHERE ctid IN (SELECT ctid FROM ΔV WHERE x ='B' LIMIT 2);

However, if you want to delete multiple types of tuples, you need to execute this command multiple times, which is not efficient if there are many types of tuples to delete.

As shown above, there is no command to delete the specified number of tuples although SQL is based on bag-semantics that allows tuple duplication. Perhaps this was because there was little practical need.

Nevertheless, in order to implement IVM, we need the operation V ← V ∸ ΔV. Our IVM implementation solves this problem with the row_number() window function. Specifically, the following complex SQL is executed. The same tuples in the view V are numbered by row_number(). Then, by comparing that number (= __num__) with the multiplicity of tuples of ΔV (= __count__), the ctids of the tuple to be deleted are obtained.

DELETE FROM V WHERE ctid IN (
  SELECT tid FROM (
    SELECT row_number () OVER (PARTITION BY x) AS __num__,
                   V.ctid AS tid,
                   ΔV .__ count__
    FROM V, ΔV WHERE V.x = ΔV.x) sub
WHERE sub.__ num__ <= xub.__count__;

Although view V must have the appropriate index defined for efficiency because using the window function requires sorting the tuples in V, we can get the desired results by this way.

Summary

As explained in this article, even an operation that can be written by a theoretically simple formula such as V ← V ∸ ΔV will be complicated to implement. This is one of the challenges of implementing an IVM that supports tuple duplication.

2021-08-02

Reducing the size of IVM patch set

In the last three posts [1, 2, 3], I've covered the features, performance, and implementation of Incremental View Maintenance (IVM)  that we are proposing as a new feature of PostgreSQL.
 

In the current code in the repository, the IVM implementation supports views whose definition query contains SELECT ... FROM ... WHERE ..., joins (including self-join and outer joins),  some built-in aggregate functions (count, sum, avg, min and max), DISTINCT clause, some types of sub-queries, and CTEs. Also,  the view can contain multiple tuples with the same content (duplicate tuple support).

As you can imagine, we made large changes for the original PostgreSQL code for implementing all the above features.  As the result, the size of patch set proposed to the development community  has grown too large. This would cause to make it hard to review the patch.

We hope the IVM feature will be adopted into PostgreSQL 15. Therefore, we decided to reduce the size of patch by omitting a part of features in the first release in order to make the review easier. Especially, I would like propose the IVM  supporting inner join, DISTINCT, some built-in aggregates, and views with tuple duplicates as a new feature of PostgreSQL 15. Support for outer-join, sub-queries, and CTEs are excluded here. As a result,  the size of the latest patch size became less than half of the previous.


2021-06-30

Implementing Incremental View Maintenance for PostgreSQL (Part III)

Introduction

In the previous post, I showed some performance evaluation of Incremental View Maintenance (IVM) feature that is under development. In this article, I will explain the implementation of IVM and how it works.

Simple Example of Incremental View Maintenance

First, I will explain the process of IVM using a simple example.

Suppose you have a view V defined by natural join on tables R and S as shown below.

Then, suppose that table R is updated as follows, and (1, one, I) is changed to (1, first, I). In the figure, ∇R is the set of tuples deleted from R, and ΔR is the set of tuples inserted into R. Using these delta tables (∇R and ΔR) , the contents of table S, and the view definition of V, we can calculate the change that will occur on the view, that is, ∇V and ΔV.

Finally, we can update the view incrementally by applying the delta tables ∇V and ΔV to the view V.

Steps in Incremental View Maintenance

In our implementation, Incremental view Maintenance is performed in the three steps:

  1. Extract the changes that occurred on the updated table
  2. Calculate the changes that will occurs on the view
  3. Apply the changes to the view

Extract the table change

In our implementation, we use statement-level AFTER triggers and transition tables to extract the change that occurred in the table.

AFTER triggers are automatically created on all base tables contained in the view definition query when the view is created by CREATE INCREMENTAL MATERIALIZED VIEW command. The triggers are created for INSERT, DELETE, and UPDATE commands, but the TRUNCATE command is not currently supported.

The Transition Table is a features of AFTER trigger, which allows trigger functions to refer to the changes that occurred on the table like normal tables. There are two transition tables; one contains “tuples deleted from the table” and another contains “tuples inserted into the table”, which correspond to the delta tables∇R and ΔR explained above, respectively.

Calculate the view delta

Changes that occurs in views are calculated using the extracted table changes and the view definition queries, as described above. More specifically, we rewrite the view definition query and replace the update table in the query with the transition table. Actually, what is rewritten is not the query string written in SQL but the internal structure (struct Query) which represents the parsed query.

By executing These rewritten queries, we can get the delta tables that contains changes which will occur on the views. These view delta tables are corresponds to ∇V, ΔV.

Apply deltas to the view

The calculated view deltas are applied to the materialized view using a SQL query. Specifically, we use DELETE statement to search for tuples to be deleted (contents of ∇V) from the materialized view and delete them, and use INSERT statement to insert tuples to be inserted (contents of ΔV) into the materialized view.

For views on aggregate query, the aggregated values in the view are updated using UPDATE statement. The way to update depends on aggregate functions. For example, with regard to count() and sum(), we can calculate the new aggregated value by adding an aggregated value calculated from the view delta to the old aggregated value in the view. For calculating the new avg() value, we need to know new sum() and count() values beforehand, so these values are stored in the view as hidden columns.

Recall that a unique index is automatically created if possible when the view is defined. This index is necessary to identify tuples that should be removed from the view efficiently. Without the index, all tuples in the view would be scanned sequentially to search tuples to be deleted, which would be inefficient.

In our implementation, if all the primary key columns of all the tables appear in the target list of the view definition query, a unique index on these columns are created. In the case of a view on aggregate query, a unique index is created on the columns used in GROUP BY clause. We also create a unique index if the view has DISTINCT clause. Otherwise, a index is not automatically created, so you would need to create an appropriate index manually for efficient IVM.

Conclusion

In this article, I explained the overview of our IVM implementation. This feature is under discussion and we are continuously developing targeting PostgreSQL 15. If you want to learn more details, please see the GitHub repository and the discussion. in pgsql-hackers. Also, if you would like to test this feature, you could get a Docker image from Docker Hub.

We would appreciate any your suggestion, question, and feedback.

2021-05-31

Implementing Incremental View Maintenance for PostgreSQL (Part II)

Introduction

In the previous post, I explained the Incremental View Maintenance (IVM) feature that is under development. In this article, I will show more about the performance evaluation of our IVM implementation.

Incrementally Maintainable Materialized Views on the TPC-H queries

In the previous post, I showed that the IVM feature enables materialized views to be updated automatically and more rapidly than the REFRESHcommand by using pgbench tables.
Here, as a more practical example, I will evaluate the performance using queries of TPC-H benchmark.

TPC-H queries

Firstly, we checked how many of the 22 queries could be used as a definition of an incrementally maintainable materialized view in our implementation. As the results, 9 of 22 queries are supported. Other queries are not supported in the current implementation, due to subqueries including aggregation, etc.

Among these queries, we used Q01 and Q09 here, which took a long time to execute the query. Q01 is a query that aggregates one large table,

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
    l_shipdate <= date '1998-12-01' - interval '78' day
group by
    l_returnflag,
    l_linestatus

and Q09 is a query that joins six tables and aggregates them.

select
    nation,
    o_year,
    sum(amount) as sum_profit
 rom
    (
       select
            n_name as nation,
            extract(year from o_orderdate) as o_year,
           l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
        from
           part,
           supplier,
           lineitem,
           partsupp,
           orders,
           nation
       where
           s_suppkey = l_suppkey
           and ps_suppkey = l_suppkey
           and ps_partkey = l_partkey
           and p_partkey = l_partkey
           and o_orderkey = l_orderkey
           and s_nationkey = n_nationkey
           and p_name like '%sandy%'
    ) as profit
 group by
    nation,
     o_year

In both queries, ORDER BY and LIMIT are removed because these clauses are not supported in the view definition. The scale factor of TPC-H was 1.

Performance of Updating View

I created an incremental maintainable materialized view on each query, and compare the execution times of 1) direct SELECT to the query, 2) SELECT to the view, 3) REFRESH of the view, and 4) updating a row of a table (lineitem). This table update triggers IVM and the view is updated automatically and incrementally.

Here is the results.

SELECT query SELECT view REFRESH command IVM
Q01 10.311 s 1.585 ms 36.811 s 17.104 ms
Q09 3.124 s 2.331 ms 6.153 s 29.301 ms

Issuing SELECT directly to a query took about 10 seconds for Q01 and 3 seconds for Q09. On the other hand, SELECTto a view took only 1.6 ms for Q01 and 2.3ms for Q09. These fast response time is the original benefit of materialized views.

However, REFRESH of the view took a long time, that is, about 37 seconds for Q01 and 6 seconds for Q09. This shows that On the other hand, IVM took only 17ms for Q01 and 29ms for Q09. This execution time includes both update of the table and update of the view contents, so it is very rapid. It was more than 2000 times faster for Q01, and more than 200 times faster than REFRESH command.

Performance of Updating Tables

As shown above, the IVM feature allows materialized views to be updated automatically and quickly. However, it obviously affects table update performance because a view is updated for each table update. In addition, the table update performance is significantly degraded in the simultaneous update by concurrent transactions because an exclusive lock is acquired when the view is updated, which is required to avoid inconsistent view update. This exclusive lock is not necessary if the view has only one table in the definition, though.

I evaluated the table update performance by letting pgbench run a custom script, which updated a tuple selected randomly from lineitem table in a transaction. I compared the performance between conditions where there was no materialized view and where an incrementally maintainable materialized view for Q01 was created.

This graph shows the result. The vertical line is TPS (transaction per second) and the horizontal axis is the number of connections. The blue line is the performance measured when there was no materialized view ,that is, the performance of simple table updates. The red line is the performance including IVM of the view defined on Q01 query. This is from two to five times slower than the simple table updates due to the overhead of updating the view. In addition, the yellow line is the performance of IVM where an exclusive lock is forced. In this case, as you can see, the performance doesn’t scale if the connection is increased. The exclusive lock is actually not required for the Q01 view because the view has only one table, but this is required if there are more than one table in the view definition like Q09. Note that this evaluation was performed under the condition where there was only one incrementally maintainable materialized view. When there are more than one views, we could see additional overhead on table update performance.

From these results, the current implementation of IVM is not suitable for write-heavy workload. However, It will be useful when table update is not so frequent but users would like know the analysis results immediately after a table update.

Conclusion

The benefit of IVM that enables rapid and automatic update of materialized views was shown for the practical views like TPC-H queries. On the other hand, it was also shown that it was unsuitable when table update is frequent due to the overhead of the view update.

The approach that update the view in the same transaction which updates a table like our IVM implementation is called “immediate maintenance”. On the other hand, there is another approach called “deferred maintenance” that updates a table after the transaction committed, for example, when a user command is executed, when the view is accessed, or when the maintenance is triggered periodically in background. I would like to implement also deferred approach to resolve the table update performance degression in future.

In the next post, I will explain the implementation of IVM and how it works in more detailed.

2021-04-30

Implementing Incremental View Maintenance for PostgreSQL (Part I)

Introduction

A materialized view is a view that stores results of SELECT query in the database, and this enables you to get the query results quickly comparing to executing the query every time. On the other hand, when a table underlying the view is updated, the data in the materialized view gets out of date, so you need to update the materialized view.

PostgreSQL provides REFRESH MATERIALIZED VIEW command as a way to update materialized views . This command refreshes the materialized view by re-running the query.

However, this is not very efficient because it recalculates all the data. If only a small part of the table is updated and only a part of the materialized view can be updated, this must be more efficient than recalculation.

This method is called Incremental View Maintenance (IVM). However, this feature is not implemented in the current PostgreSQL. So we are developing this feature in PostgreSQL and proposing it to the development community.

In this article, I will explain the IVM feature under development.

Note that the feature descried in this article is about the “under development” function, and not available in the current PostgreSQL . If you have any comments or requests for this feature, please give us your feedback!

Features

Overview

The proposed IVM feature allows materialized views to be updated automatically and incrementally just after a underlying table is modified. Because this is an incremental update, it is faster than running a regular REFRESH. And, of course, you don’t have to write trigger functions for your self.

Currently, the SQL queries used to define views are SELECT ... FROM ... WHERE ..., joins (inner joins, outer joins, self-joins), some built-in aggregate functions (count, sum, avg, min, max), GROUP BY clause, DISTINCT clause, simple sub-queries in FROM clause, simple CTEs ( WITH clauses), and EXISTSsub-queries inWHEREclause, where “simple” means that a sub-query or CTE does not include aggregates, outer joins, or DISTINCT. The view can contain multiple tuples with the same content (duplicate tuple support).

On the other hand, aggregates of other than the above (such as user-defined aggregate), complex subqueries or CTEs, window functions, ORDER BY, LIMIT, OFFSET, set operations ( UNION, EXCEPT, INTERSECT) are not supported.

Creating a View

UseCREATE INCREMENTAL MATERIALIZED VIEWcommand to create a materialized view that is updated automatically and incrementally . This INCREMENTAL is the new keyword added to CREATE MATERIALIZED VIEW in this implementation.

You can also check the syntax with the psql command.

postgres=# \h CREATE MATERIALIZED VIEW 
Command:     CREATE MATERIALIZED VIEW
Description: define a new materialized view
Syntax:
CREATE [ INCREMENTAL ] MATERIALIZED VIEW [ IF NOT EXISTS ] table_name
    [ (column_name [, ...] ) 
    [ USING method ]
    [ WITH ( storage_parameter [= value] [, ... ] ) ]
    [ TABLESPACE tablespace_name ]
    AS query
    [ WITH [ NO ] DATA ]

For example, we can create a materialized view using tables of pgbench as follows. Here, the scale factor is 100.

test=# CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm AS
       SELECT a.aid, b.bid, abalance, bbalance
       FROM pgbench_accounts a JOIN pgbench_branches b ON a.bid = b.bid:
NOTICE:  created index "mv_ivm_index" on materialized view "mv_ivm"
SELECT 10000000

This materialized view was created with IVM support because INCREMENTALoption was specified. In addition, an index was automatically created on this materialized view that is necessary for efficient IVM. These can be confirmed by \d meta command of psql command. We can see in the result “Incremental view maintenance: yes” and an unique index defined on aid and bid, which are primary key columns of the base tables.

test=# \d+ mv_ivm
                                    Materialized view "public.mv_ivm"
  Column  |  Type   | Collation | Nullable | Default | Storage | Compression | Stats target | Descrip
tion 
----------+---------+-----------+----------+---------+---------+-------------+--------------+--------
-----
 aid      | integer |           |          |         | plain   |             |              | 
 bid      | integer |           |          |         | plain   |             |              | 
 abalance | integer |           |          |         | plain   |             |              | 
 bbalance | integer |           |          |         | plain   |             |              | 
Indexes:
    "mv_ivm_index" UNIQUE, btree (aid, bid)
View definition:
 SELECT a.aid,
    b.bid,
    a.abalance,
    b.bbalance
   FROM pgbench_accounts a
     JOIN pgbench_branches b ON a.bid = b.bid;
Access method: heap
Incremental view maintenance: yes

Updating a View Incrementally

Let’s see the effect of IVM. First, when I executed REFRESH command on this view, it took about 33 seconds.

test=# REFRESH MATERIALIZED VIEW mv_ivm ;
REFRESH MATERIALIZED VIEW
Time: 32863.279 ms (00:32.863)

On the other hand, when I updated a row in pgbench_accounts, it took about 16ms, which was more than 1000 times faster than the REFRESH command. In addition, of course, the contents of the view was updated correctly.

test=# SELECT * FROM mv_ivm WHERE aid = 1;
 aid | bid | abalance | bbalance 
-----+-----+----------+----------
   1 |   1 |      100 |     1000
(1 row)

Time: 1.583 ms
test=# UPDATE pgbench_accounts SET abalance = 11111 WHERE aid = 1;
UPDATE 1
Time: 16.202 ms
test=# SELECT * FROM mv_ivm WHERE aid = 1;
 aid | bid | abalance | bbalance 
-----+-----+----------+----------
   1 |   1 |    11111 |     1000
(1 row)

Time: 1.465 ms

Aggregate View

Next, let’s create a materialized view on a aggregate query . Here, pgbench's scale factor is 1000.

test2=# CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm2 AS
        SELECT bid, count(abalance), sum(abalance), avg(abalance)
        FROM pgbench_accounts GROUP BY bid;
NOTICE:  created index "mv_ivm2_index" on materialized view "mv_ivm2"
SELECT 1000

It took more than 40 seconds for REFRESH of this view .

test2=# REFRESH MATERIALIZED VIEW mv_ivm2;
REFRESH MATERIALIZED VIEW
Time: 42427.666 ms (00:42.428)

On the other hand, when I updated a row in the table, the view was automatically updated due to the IVM function. It took about 18ms, which is more than 2000 times faster than REFRESH.

test2=# SELECT * FROM mv_ivm2 WHERE bid = 1;
 bid | count  | sum  |          avg           
-----+--------+------+------------------------
   1 | 100000 | 1000 | 0.01000000000000000000
(1 row)

Time: 2.199 ms
test2=# UPDATE pgbench_accounts SET abalance = abalance + 5000 WHERE aid = 1;
UPDATE 1
Time: 17.951 ms
test2=# SELECT * FROM mv_ivm2 WHERE bid = 1;
 bid | count  | sum  |          avg           
-----+--------+------+------------------------
   1 | 100000 | 6000 | 0.06000000000000000000
(1 row)

Time: 2.152 ms

Conclusion

In this article, I showed that the proposed IVM feature enables materialized views to be updated automatically and more rapidly than the REFRESH command. In the next post, I will explain the performance of IVM in more detailed.

Transition Tables in Incremental View Maintenance

Introduction In a previous post , I explained that our implementation of Incremental View Maintenance (IVM) on PostgreSQL were using AFTER...