2021-05-31

Implementing Incremental View Maintenance for PostgreSQL (Part II)

Introduction

In the previous post, I explained the Incremental View Maintenance (IVM) feature that is under development. In this article, I will show more about the performance evaluation of our IVM implementation.

Incrementally Maintainable Materialized Views on the TPC-H queries

In the previous post, I showed that the IVM feature enables materialized views to be updated automatically and more rapidly than the REFRESHcommand by using pgbench tables.
Here, as a more practical example, I will evaluate the performance using queries of TPC-H benchmark.

TPC-H queries

Firstly, we checked how many of the 22 queries could be used as a definition of an incrementally maintainable materialized view in our implementation. As the results, 9 of 22 queries are supported. Other queries are not supported in the current implementation, due to subqueries including aggregation, etc.

Among these queries, we used Q01 and Q09 here, which took a long time to execute the query. Q01 is a query that aggregates one large table,

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
    l_shipdate <= date '1998-12-01' - interval '78' day
group by
    l_returnflag,
    l_linestatus

and Q09 is a query that joins six tables and aggregates them.

select
    nation,
    o_year,
    sum(amount) as sum_profit
 rom
    (
       select
            n_name as nation,
            extract(year from o_orderdate) as o_year,
           l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
        from
           part,
           supplier,
           lineitem,
           partsupp,
           orders,
           nation
       where
           s_suppkey = l_suppkey
           and ps_suppkey = l_suppkey
           and ps_partkey = l_partkey
           and p_partkey = l_partkey
           and o_orderkey = l_orderkey
           and s_nationkey = n_nationkey
           and p_name like '%sandy%'
    ) as profit
 group by
    nation,
     o_year

In both queries, ORDER BY and LIMIT are removed because these clauses are not supported in the view definition. The scale factor of TPC-H was 1.

Performance of Updating View

I created an incremental maintainable materialized view on each query, and compare the execution times of 1) direct SELECT to the query, 2) SELECT to the view, 3) REFRESH of the view, and 4) updating a row of a table (lineitem). This table update triggers IVM and the view is updated automatically and incrementally.

Here is the results.

SELECT query SELECT view REFRESH command IVM
Q01 10.311 s 1.585 ms 36.811 s 17.104 ms
Q09 3.124 s 2.331 ms 6.153 s 29.301 ms

Issuing SELECT directly to a query took about 10 seconds for Q01 and 3 seconds for Q09. On the other hand, SELECTto a view took only 1.6 ms for Q01 and 2.3ms for Q09. These fast response time is the original benefit of materialized views.

However, REFRESH of the view took a long time, that is, about 37 seconds for Q01 and 6 seconds for Q09. This shows that On the other hand, IVM took only 17ms for Q01 and 29ms for Q09. This execution time includes both update of the table and update of the view contents, so it is very rapid. It was more than 2000 times faster for Q01, and more than 200 times faster than REFRESH command.

Performance of Updating Tables

As shown above, the IVM feature allows materialized views to be updated automatically and quickly. However, it obviously affects table update performance because a view is updated for each table update. In addition, the table update performance is significantly degraded in the simultaneous update by concurrent transactions because an exclusive lock is acquired when the view is updated, which is required to avoid inconsistent view update. This exclusive lock is not necessary if the view has only one table in the definition, though.

I evaluated the table update performance by letting pgbench run a custom script, which updated a tuple selected randomly from lineitem table in a transaction. I compared the performance between conditions where there was no materialized view and where an incrementally maintainable materialized view for Q01 was created.

This graph shows the result. The vertical line is TPS (transaction per second) and the horizontal axis is the number of connections. The blue line is the performance measured when there was no materialized view ,that is, the performance of simple table updates. The red line is the performance including IVM of the view defined on Q01 query. This is from two to five times slower than the simple table updates due to the overhead of updating the view. In addition, the yellow line is the performance of IVM where an exclusive lock is forced. In this case, as you can see, the performance doesn’t scale if the connection is increased. The exclusive lock is actually not required for the Q01 view because the view has only one table, but this is required if there are more than one table in the view definition like Q09. Note that this evaluation was performed under the condition where there was only one incrementally maintainable materialized view. When there are more than one views, we could see additional overhead on table update performance.

From these results, the current implementation of IVM is not suitable for write-heavy workload. However, It will be useful when table update is not so frequent but users would like know the analysis results immediately after a table update.

Conclusion

The benefit of IVM that enables rapid and automatic update of materialized views was shown for the practical views like TPC-H queries. On the other hand, it was also shown that it was unsuitable when table update is frequent due to the overhead of the view update.

The approach that update the view in the same transaction which updates a table like our IVM implementation is called “immediate maintenance”. On the other hand, there is another approach called “deferred maintenance” that updates a table after the transaction committed, for example, when a user command is executed, when the view is accessed, or when the maintenance is triggered periodically in background. I would like to implement also deferred approach to resolve the table update performance degression in future.

In the next post, I will explain the implementation of IVM and how it works in more detailed.