2021-09-15

Tuple Duplicates Support in Incremental View Maintenance

Introduction

In the previous posts ([1], [2], [3]), I have explained the Incremental View Maintenance (IVM) that we are proposing as a new feature of PostgreSQL. As I explained in [1], our IVM implementation supports tuple duplicates, that is, it allows views to contain duplicated tuples. This article describes how we are handling tuple duplicate in the IVM implementation.

Tuple Duplicates in Incremental View Maintenance

In a [3], I introduced the basic theory of Incremental View Maintenance (IVM). Now, I will briefly review it.

Suppose table R is updated and the changes are stored into two delta tables, ∇R and ΔR. ∇R is the collection of tuples deleted from R, and ΔR is the collection of tuples inserted into R. Using them, we can calculate the delta tables of the view, ∇V and ΔV, that contain the changes that will occur on the view.

Finally, we can update the view incrementally by applying the delta tables ∇V and ΔV to the view V. Specifically, tuples in ∇V are deleted from V, then tuples in ΔV are inserted into V.

Note here that tables and views may contain duplicate tuples. Because SQL is based on bag-semantics, not set-semantics. For example, suppose view V contains two tuples A and three tuples B. This is expressed as V = {A (2), B (3)}, which means that tuple A has a multiplicity of 2 and tuple B has a multiplicity of 3. Then, suppose you want to delete one tuple A and two tuple B. That is, ΔV = {A (1), B (2)}, and we want to perform the operation V ← V ∸ ΔV. (Here, ∸ is an operator called “monus”.)

How to Remove the Specified Number of Tuples?

Now, how can we get the result of this operation in PostgreSQL?

If you simply use DELETE as shown below, you will not be able to delete the specified number of tuples because all rows will be deleted.

DELETE FROM V WHERE x IN (SELECT x FROM ΔV);

One way to get the desired result is to use EXCEPT ALL as follows:

(SELECT * FROM V) EXCEPT ALL (SELECT * FROM ΔV);

However, while this gives the result of the deletion, it does not actually remove the tuple from V. Also, EXCEPT ALL is inefficient because it sorts the entire subquery result.

If you’re focus on a particular tuple, you can use ctid, which is a physical ID that has different values for tuples with the same content . For example, you can delete two tuples B with the following command:

DELETE FROM V WHERE ctid IN (SELECT ctid FROM ΔV WHERE x ='B' LIMIT 2);

However, if you want to delete multiple types of tuples, you need to execute this command multiple times, which is not efficient if there are many types of tuples to delete.

As shown above, there is no command to delete the specified number of tuples although SQL is based on bag-semantics that allows tuple duplication. Perhaps this was because there was little practical need.

Nevertheless, in order to implement IVM, we need the operation V ← V ∸ ΔV. Our IVM implementation solves this problem with the row_number() window function. Specifically, the following complex SQL is executed. The same tuples in the view V are numbered by row_number(). Then, by comparing that number (= __num__) with the multiplicity of tuples of ΔV (= __count__), the ctids of the tuple to be deleted are obtained.

DELETE FROM V WHERE ctid IN (
  SELECT tid FROM (
    SELECT row_number () OVER (PARTITION BY x) AS __num__,
                   V.ctid AS tid,
                   ΔV .__ count__
    FROM V, ΔV WHERE V.x = ΔV.x) sub
WHERE sub.__ num__ <= sub.__count__;

Although view V must have the appropriate index defined for efficiency because using the window function requires sorting the tuples in V, we can get the desired results by this way.

Summary

As explained in this article, even an operation that can be written by a theoretically simple formula such as V ← V ∸ ΔV will be complicated to implement. This is one of the challenges of implementing an IVM that supports tuple duplication.