Skip to content
This repository has been archived by the owner on May 22, 2024. It is now read-only.

v12 merge commit message

Heikki Linnakangas edited this page Sep 21, 2020 · 9 revisions

Merge with PostgreSQL version 12 (up to a point between beta2 and beta3).

This is the point where PostgreSQL REL_12_STABLE was branched off the master branch and v13 development started.

In addition to the changes below, there's a ton of little things marked with GPDB_12_MERGE_FIXME comments.

Planner changes

  • Big changes to partition planning and pruning, as we merged upstream partititioning code. Most GPDB partitioning code has been replaced with corresponding upstream. See section "Partitioning" for details.

  • PostgreSQL v12 planner learned to inline side-effect free CTEs. That is, if you "WITH foo (SELECT ...) SELECT * FROM foo, ...", the 'foo' subquery is inlined into the main query and planned as one. Previously, the CTE was always planned separately, and materialized in full. GPDB has always supported inlining CTEs, so this isn't new to GPDB, but we adopted the code changes related to that as much as possible. In PostgreSQL, the feature can be controlled by specifying "WITH foo AS MATERIALIZED (...)" or NOT MATERIALIZED, that now works in GPDB too.

    In GPDB, the inlining could previously be controlled with the "gp_cte_sharing" GUC. It still works, but if MATERIALIZED or NOT MATERIALIZED was specified in the query it takes precedence. Like in previous releases, Even if gp_cte_sharing is set to 'on', the subquery is inlined if there is only one reference to it in the query.

    When gp_cte_sharing = 'off' (the default), GPDB inlines the subqueries even if it's not strictly legal to do so. See https://github.com/greenplum-db/gpdb/issues/1349. This merge doesn't change that.

  • Partitionwise joins and aggregates were introduced in PostgreSQL v11. They work in GPDB too. The main benefit of partitionwise aggregation is that if the partition hierarchy contains a foreign table, the aggregation can be pushed down to the foreign server.

  • GROUPING SETS can now be implemented by hashing, or a mix of hashing and sorting. It shows up as a MixedAggregate node in EXPLAIN. However, the GPDB-specific multi-stage aggregate planning code (in cdbgroupingpaths.c) has not been updated to take advantage of that, so you only get such plans with one-stage aggregates, when the grouping keys happen to coincide with the table's distribution keys. This is a TODO.

Other changes

  • The default logging directory, 'pg_log', was renamed to just 'log' in PostgreSQL v10. That might affect tools that would try to look into the logs. (The 'pg_xlog' and 'pg_clog' directories were also renamed to 'pg_wal' and 'pg_xact', respectively, but external tools shouldn't be peeking into those directories directly)

  • Is log_line_prefix resurrected? (a3dd474470f35f87a69c11d8e292923389813cf8)

  • JIT compilation, introduced in PostgreSQL v11, works. We haven't done anything extra for JIT compilation of GPDB-specific code, but everything that evaluates expressions benefits from the upstream JIT support, also in GPDB-specific node types.

  • Hash indexes are enabled now. We had disabled them completely in GPDB, because they were not WAL-logged. Since PostgreSQL v10 they are, so we can enable them.

  • "MK sort", in tuplesort_mk.c, is gone. The "gp_enable_mk_sort" GUC is gone too. We always use the upstream sort code now. There have been performance improvements to it in the upstream releases, but this is still a performance regression in cases where the MK-sort was particularly good. That is, if you have multiple ORDER BY columns, with lots of duplicates. We might still reintroduce MK sort later in some form, as a separate PR. But that work should be made on pgsql-hackers for PostgreSQL first.

  • The GPDB-specific "Hybrid Hash Agg" code is gone. Its purpose was to support spilling large hash aggregates to disk, instead of running out of memory. That functionality has been replaced by backporting the functionally similar hash agg spilling feature from PostgreSQL v13. That consisted of a number of upstream commits:

    a10701c0e3 Allow simplehash to use already-calculated hash values. 81080df32b Add transparent block-level memory accounting 87b6b57191 Fix edge case leading to agg transitions skipping ExecAggTransReparent() calls. f76e8f8507 Refactor hash_agg_entry_size(). 4ac4bd07fa Introduce TupleHashTableHash() and LookupTupleHashEntryHash(). 5dddc85d78 Change signature of TupleHashTableHash(). d238458f6d Minor refactor of nodeAgg.c. a9e9e9bc8e Fixup for nodeAgg.c refactor. b27eec5ea6 Extend ExecBuildAggTrans() to support a NULL pointer check. 0a390a8e8b Introduce LogicalTapeSetExtend(). 5d9fd3ae3d Disk-based Hash Aggregation. 5471822147 Fixes for Disk-based Hash Aggregation. 8b6151eae7 Avoid allocating unnecessary zero-sized array. 04fced3688 Fix costing for disk-based hash aggregation. e4ee460db3 Include chunk overhead in hash table entry size estimate. 255871265e Create memory context for HashAgg with a reasonable maxBlockSize. 02f3d5670a Fix HashAgg regression from choosing too many initial buckets. f12207ca4e Logical Tape Set: use min heap for freelist.

    As a result, the nodeAgg.c code is more similar to the v13 version than v12.

    TODO: backpatch followup commits from REL_13_STABLE

  • TODO: createlang and droplang binaries were removed in PostgreSQL v10. Update GPDB docs.

  • Autoconf changes from upstream for building C++ code?? FIXME

  • Parallelism is still disabled in GPDB. That includes the new parallel CREATE INDEX code.

  • Fix behavior with UPDATE WHERE CURRENT OF. As you can see in the changes to expected output of the 'portals_updatable' test, some of the test results changed. As far as I can see, the old behaviour memorized in the expected output was incorrect. PostgreSQL has always returned the result we're now seeing on the merge branch, which is different from the result you get on master. The test had not been modified since 2012, when the test was added along with support for WHERE CURRENT OF. I could not find any explanation in git history or in test comments. I believe it was simply a bug when WHERE CURRENT OF support was added, and no one noticed that the test result was wrong.

  • SRFs in target lists. We adoped the new upstream node type, ProjectSet, to deal with SRFs in target lists. As a performance optimization, there were some GPDB changes in the planner earlier, to always use a Result node if there were SRFs in the target list, and the SRF-in-targetlist support was removed from other executor nodes. That was pretty much the same changes that were made in upstream, except that upstream uses ProjectSet instead of Result, and the behavior was changed to be more sane when you had multiple SRFs. Replace all the GPDB changes with upstream. Due to the behavior changes with multiple SRFs, some INSERT statements in GPDB tests had to modified so that they generate the same test data as before.

  • The GPDB implementation of "RANGE offset PRECEDING/FOLLOWING" in window functions was replaced with the upstream implementation that was introduced in version 11. The upstream implementation is more strict about the datatypes than the code in GPDB master. However, looking at old JIRA ticket MPP-5246 that was mentioned in one of the tests, it had a test case like this attached to it:

    explain select cn, count(*) over (order by dt range between '2 day'::interval preceding and 2 preceding) from mpp5246_sale; ERROR: can't coerce leading frame bound to type of trailing frame bound HINT: specify the leading and trailing frame bounds as the same type

    So apparently we were strict about the types back in 2009, and you got that error up to 5X_STABLE, but it got relaxed in 6X_STABLE. And now it becomes more strict with the upstream code again. I think that's fine, although we probably shouldn't have relaxed it in 6X_STABLE. Too late to change that.

  • Remove the undocumented 'gp_indexcheck_vacuum' GUC. A new contrib module, contrib/amcheck, was added in PostgreSQL v10, with similar functionality.

  • Logical replication is still disabled in GPDB.

  • The default of 'hot_standby' was changed to 'on' in upstream, but hot standby is still not supported GPDB, and the default in GPDB remains 'off'. It is now possible to turn it 'on' manually in mirrors, though. That allows tools like pg_basebackup to work, and it allows running the upstream replication tests in src/test/recovery. However, there's no user-visible, supported, new functionality here.

  • pg_basebackup got a feature to verify checksums, but that has been disabled in the wrapper python scripts. So checksums are verified by the management utilities that use pg_basebackup under the hood (gpexpand, gpinitstandby, gpconfigurenewsegment. Likewise, tests that use pg_basebackup were modified with the --no-verify-checkums flag. The reason for this is that the checksum verification is broken for AO/AOCO tables, which don't have page checksum like other relations do, and pg_basebackup cannot distinguish them. We ought to fix this somehow, but for now it's a FIXME.

  • pg_ctl -w now waits for distributed transaction recovery.

    When starting master in distributed mode, in GPDB pg_ctl -w should wait for distributed transaction recovery to complete. As only after that connections are allowed.

    Adding new state "dtmready" to be recorded in PID file to mark distributed transaction recovery completion. When dtx background worker exits successfully, postmaster writes this state to file.

    pg_ctl waits for this "dtmready" state instead of "ready" state is starting master in distribted mode which is conveyed using "-E" postmaster options.

    This patch should meet the previous expectations, that after successfull completion of gpstart, distributed transaction can be made.

  • If a client tries to connect while DTM recovery in-progress, you now get an error. Client connections used to wait for dtx recovery process to complete distributed transaction recovery. Depending on situation this waiting could be infinite. To avoid hung connections though in certain situation where dtx recovery can't complete the distributed transactions, better to error out and let client retry in this situation similar to local crash recovery situation. (Discussion: https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/dgU9il8l24A/NTAz2cl5BQAJ)

Source code changes

  • PostgreSQL v11 replaced the way the initial catalog contents are stored in the source code. Instead of the "DATA(insert ..." lines in the .h files, the initial contents are stored in JSON format, in separate .dat files. The JSON format is much more convenient to edit, and makes it more convenient to carry diffs against them, so we don't need the GPDB-specific perl scripts to provide defaults for GPDB-added catalog columns anymore, nor the "catullus.pl" script that was used to generate pg_proc.h contents for GPDB-added functions. That's all gone, and all GPDB-specific changes are made to the .dat files directly.

  • PostgreSQL v12 removed support for OIDs in user tables, and changed all catalog tables with OIDs so that the OID is a regular column. A lot of GPDB specific code and tests to deal with OIDs was removed with that.

  • A new Kerberos test suite was added to PostgreSQL v11. We had one in GPDB already, under src/test/gpkerberos. It's not clear how much overlap there is between the two, we could possibly get rid of the GPDB test suite now, but for now, keep them both.

  • Support for passing around MemTuples in TupleTableSlots was removed. MemTuple is still the native format of AO (row) tables, but is deformed and stored in a virtual tuple slot as soon as it's read from disk. MemTuple is no longer used as the on-wire format in Motions either, nor as the on-disk format for temp files when sorting; we now use upstream MinimalTuple for those.

  • HeapTuple->t_tableOid has been resurrected. We had removed it in previous GPDB versions for performance reasons, I don't believe that the overhead is really noticeable, at least not in recent PostgreSQL versions.

  • The pg_rewind regression tests were previously implemented in GPDB as a shell script. It was done that way when we backported the upstream tests but didn't have the TAP test framework yet. Now we do, so replace the shell implementation with the upstream TAP tests.

  • The way ALTER TABLE commands are dispatched was refactored. Some ALTER TABLE subcommands need to dispatch some extra information from QD to QEs. Previously, the ATExec* functions modified the sub-commands in the QD, and the modified subcommands were dispatched to the QEs. The problem with this approach is that the QEs try to run the ATPrep* phase on the already- prepped commands, so we had to hack the ATPrep* functions to cope with that. That was causing merge conflicts, which could be fixed, but I was not happy with the diff footprint of that approach. Instead, dispatch the AlteredTableInfo structs that were constructed in the QD's prep phase. That way, the execute phase can modifiy the sub-commands and those modifications are dispatched.

  • Upstream isolation tests (src/test/isolation) are now run in utility mode. Instead of disabling or heavily modifying isolation upstream test better to run them in utility mode (kind of single node postgres) and get the code coverage. Still doesn't test GPDB clustered behavior. Need to enhance the framework for it to understand locking across segments and also work with global deadlock detector.

Partitioning

All the GPDB-specific code to deal with partitioning has been ripped out and replaced with the new PostgreSQL partitioning code. Compatibility for the old GPDB partitioning syntax was reimplemented on top of the upstream code, in the parse analysis phase.

  • Multi-column list partitioning is no longer supported. As a work-around, you can create a composite type and use that as the partitioning key, like this:

    create type foo_partkey as (b int, c int); create table foo (a int, b int, c int) partition by list ((row(b, c)::foo_partkey));

    Partition pruning might not work in all the cases that it used to with that work-around though.

  • The gp_enable_exchange_default_partition GUC was removed. With upstream partition code exchanging (essentially attaching new) default partition comes with no risk of wrong data. Attaching a default partition code validates that the data does not violate the partition constraints, so blocking the exchange of default partition under GUC is no longer required. It's same as exchanging any other partition.

  • pg_dump will use the upstream syntax now.

  • FOR (RANK (1)) addressing no longer supported

  • PARTITION BY is now accepted on both sides of DISTRIBUTED BY

  • START/END/EVERY is converted in the parser to upstream primitives

  • PostgreSQL range partitions don't support START EXCLUSIVE or END INCLUSIVE. The start boundary is always inclusive and end exclusive. To support the old syntax, we convert EXCLUSIVE to INCLUSIVE by doing "+ 1 " on the boundary value. That only works for datatypes that have a suitable "+" operator.

  • Unlogged tables are now allowed with regular tables in partition hierarchy. Temporary and permanent tables still cannot be mixed.

  • Unlike PostgreSQL, we allow COPY TO on a partitioned table. That worked in GPDB before, so we needed to not break it. Re-implement it by automatically transforming the "COPY

    TO ..." command to "COPY (SELECT * FROM
    ) TO ...". The hint in the upstream error message suggests doing that, but we now do it automatically. There is already code to do that transformation, for COPY TO when Row-Level Security is enabled, so this was a one line change.
  • REINDEX TABLE on a partitioned table recurses to the partitions, and reindexes all indexes on all partitions. This is different from upstream, where REINDEX TABLE on a partitioned emits a warnings and does nothing. It is the same behavior as on previous versions of GPDB.

  • REINDEX INDEX on a partitioned index is a no-op, as a partitioned table doesn't contain any data. In PostgreSQL, it throws an error, but we have allowed it historically in GPDB.

  • In PostgreSQL, GRANT/REVOKE on a partitioned table only affects the partitioned table itself, not the partitions. Permission on the partitioned table is enough to query the partitioned table and all its children, but it makes a difference if you query the partition directly, or if the partition is detached. We kept the old GPDB behavior where GRANT/REVOKE recurses on the child partitions. (There is no ONLY syntax on GRANT/REVOKE that could be used to control it.)

  • Similarly, when you create a new partition with CREATE TABLE PARTITION OF, the permissions of the new parent table are copied to the new partition.

  • Inserting into an intermediate partition is now allowed. The row will be routed to the correct partition, like in PostgreSQL. It used to be forbidden in GPDB.

  • Removed a lot of NOTICEs when partitions are created or dropped.

  • commit b23036066df7d4851855402c0dbed245c1e63f3b Author: Alexandra Wang [email protected] Date: Fri Jun 5 21:03:32 2020 -0700

    Support inclusive END range partition boundary
    
    END..INCLUSIVE can be specified for CREATE TABLE, ADD PARTITION and
    SPLIT PARTITION commands for range partitions.
    
    For now, this patch only supports partitions keys that are compatible
    with the "+1" operator, where "1" is of type Integer. For example, int
    and date data types work, timestamp does not work.
    
    During parsing, if the END value is INCLUSIVE, we canonicalize it into
    an EXCLUSIVE value. We need to do this because the upperdatums stored in
    the catalog are always EXCLUSIVE. This implementation leverages the same
    code that supports "EVERY", the only difference is that EVERY operates
    "+interval" with the user specified interval, whereas the
    canonicalization operates "+1". The "1" is currently hardcoded as an
    A_Const of type Integer which is not ideal, A FIXME is added for the
    future.
    
    • partition names vs table names. Might differ

    • pg_partition_def() is gone. old catalogs are done. What to use in queries to replace them? pg_partitions no more exists, use pg_partition_tree() instead.

    • Partitioned tables have no access method. They used to be heap or AO or AOCO too, even though they contain no data.

    • Triggers? changed behavior I think. See 'triggers_gp' test.

    • how do templates work now?

    • Partition conditions are no longer represented as CHECK constraints.
      
    • Moving row to another partition with UPDATE is now supported.

    • error messages are different in many cases

    commit 9f89426cd939af14cfa038cbe6d711c0b29fbba9 Author: Ashwin Agrawal [email protected] Date: Mon Jul 13 23:55:14 2020 -0700

    Partition test EVERY clause related accept behavior
    
    Rounding of EVERY clause argument similar to START/FROM and END/TO
    seems fine behavior. So, modify the test and answer file to reflect
    this change is behavior and okay to be lenient is this respect.
    
    Co-authored-by: Alexandra Wang <[email protected]>
    

    commit be0edb898a2e7653fb42781f21851878f0f97576 Author: Heikki Linnakangas [email protected] Date: Thu Jun 25 19:56:33 2020 +0300

    A "point hole" between partitions is now allowed, fix test.
    
    I don't understand why it was not allowed before. But I see no reason
    to not allow it now.
    

    commit f31ceca3f5eb17740ae4314621d725a815fe9371 Author: Ashwin Agrawal [email protected] Date: Tue Jun 9 11:38:22 2020 -0700

    Store sub-partition template in catalog
    
    New catalog table gp_partition_template now stores the sub-partition
    template, if specified during CREATE TABLE. This table directly stores
    the top parent's relid, partition level and serialized version of
    `GpPartitionDefinition` node corresponding to template for each level
    of partition specified. That's easiest and most convenient, given it
    includes all the information needed for template like encoding
    clauses, with options, tablespace,.... CREATE TABLE is only writer of
    this template table.
    
    ADD PARTITION is only reader for this template table and uses it to
    create new partitions.
    
    Entry from gp_partition_template is removed when partition table is
    deleted.
    

    Partition Selectors

    GPDB-specific Partition Selector nodes are still created, for partition pruning on joins, but the planner code to create them was largely rewritten, to leverage the upstream partition pruning code as much as possible:

    • In executor, Partition Selectors now use the upstream partition pruning infrastructure to compute which partitions can contain tuples that pass the join qual. The partitions are recorded in a Bitmapset, like the upstream partition pruning code does, and the Bitmapset is passed from the PartitionSelector to the Append node using a special executor Param. The old hash table that contained the partition OIDs is gone, as is the PartSelected expression. The Append node now directly skips the pruned partitions

    • In planner, use the upstream code to construct the PartitionPruneInfo steps to do the pruning. Thanks to this, join pruning now also works for quals with <, > operators, and for tables partitioned on expressions rather than plain columns.

    • In planner, refactor where the decision to do join pruning is made. It's now decided in create_append_plan(), where run-time pruning is decided, too. To pass the partition pruning info to the Partition Selector above the Append, maintain a stack of joins that are candidates for Partition Selectors as we recurse in create_plan().

    ORCA only supports very simple static partition pruning currently, and falls back for most cases involving partitioned tables. Work is scheduled to re-integrate more partitioning support later.

    Append-optimized and AOCO tables and the new Table AM API

    PostgreSQL v12 introduced a new API for table access methods that allows replacing the usual heap storage with something else. The AO and AOCO table code was refactored to sit behind the new API. There are now two extra table AMs in GPDB in addition to the 'heap': 'ao_row' and 'ao_column'. The old syntax, "WITH (appendonly=true)", is now mapped to "CREATE TABLE ... USING ao_row", and similarly for column-orinted AOCO tables.

    The 'appendonly' and 'orientation' are no longer accepted in the 'gp_default_storage_options' GUC. Its interactions with 'default_table_access_method' and options given in the DDL command became too unwieldy, so it was deemed better to stop supportin git. You can use the upstream 'default_table_access_method' GUC instead.

    There are some other small differences in what table options are accepted, where the old behavior was too difficult to maintain. For example, 'compresslevel' and 'compresstype' options are no longer accepted on AO row-oriented tables. It used to be accepted, but had no effect, except if it was a partitioned table and some partitions were column-oriented.

    Table inheritance is now allowed for AOCO tables. It was previously disallowed for no discernible reason, and it seems to just work now.

    The table AM API doesn't cover all the places where we had had to modify upstream code previously. There are still some RelationIsAppendOptimized() calls sprinkled in the backend code, but much fewer than before. One big difference is that updates to AO tables need a call to appendonly_dml_init() or aoco_dml_init() to be called before modifying the table, and appendonly_dml_finish() or aoco_dml_finish() afterwards.

    The access method is now represented in the catalogs like in upstream, in the pg_class.relam field. The old GPDB-specific pg_class.relstorage column has been removed.

    • CLUSTER on AO status? Mention Andrey as author.

    • ANALYZE?