Conversation

@gertjanal (Contributor) commented Oct 28, 2025

Description

Since Trino 477, the Hive connector supports reading sequence files containing protobuf bytes (#26353). With that PR, records can be read using the protobuf deserializer.

The code was written for Hive tables created with Twitter's Elephant Bird: https://github.com/twitter/elephant-bird/wiki/How-to-use-Elephant-Bird-with-Hive. In Hive, such tables also use the serialization.class protobuf class to determine the table's columns, a feature called "Dynamic Schemas". An example of such a table in Hive:

create table users
  row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
  with serdeproperties (
    "serialization.class"="com.example.proto.gen.Storage$User")
  stored as
    inputformat "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat";

Running show create table X; in Hive lists all resolved columns. When the protobuf definition gains new fields, Hive picks up the new columns automatically. Trino, however, asks the Hive Metastore for the table columns, and I found that the metastore returns the columns determined when the table was originally created. Fields added to the protobuf later showed up when querying Hive directly, but through the Hive Metastore Trino only saw the original schema. Hence this PR.

This PR adds dynamic schema support to Trino: when the SEQUENCEFILE_PROTOBUF storage format is detected, the columns are derived from the protobuf fields, as in the Hive implementation.
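
To give a sense of the mapping involved, here is a minimal, self-contained sketch of deriving column definitions from a protobuf Descriptor. The names and the exact type mapping are illustrative assumptions; the PR's DynamicSchemaLoader maps FieldDescriptors to Trino Column objects rather than the plain pairs used here:

```java
import com.google.protobuf.Descriptors.Descriptor;
import com.google.protobuf.Descriptors.FieldDescriptor;

import java.util.List;

// Illustrative sketch only: walk the descriptor's fields and produce
// (name, hiveType) pairs. The real loader produces Trino Column objects,
// and the Hive type mapping below is an assumption.
final class DynamicSchemaSketch
{
    private DynamicSchemaSketch() {}

    static List<String[]> columnsFor(Descriptor descriptor)
    {
        return descriptor.getFields().stream()
                .map(field -> new String[] {field.getName(), hiveTypeFor(field)})
                .toList();
    }

    private static String hiveTypeFor(FieldDescriptor field)
    {
        String baseType = switch (field.getJavaType()) {
            case INT -> "int";
            case LONG -> "bigint";
            case FLOAT -> "float";
            case DOUBLE -> "double";
            case BOOLEAN -> "boolean";
            case STRING, ENUM -> "string";
            case BYTE_STRING -> "binary";
            case MESSAGE -> "struct<...>"; // nested messages would need recursion
        };
        return field.isRepeated() ? "array<" + baseType + ">" : baseType;
    }
}
```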

Release notes

( X ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

Summary by Sourcery

Enable dynamic schema support for Hive SEQUENCEFILE_PROTOBUF tables by deriving column definitions from Protobuf descriptors instead of relying on static metastore schemas.

New Features:

  • Derive table and partition column schemas at runtime for SEQUENCEFILE_PROTOBUF storage by reading Protobuf descriptor fields
  • Introduce DynamicSchemaHiveMetastore wrapper that fetches and caches dynamic schemas from Protobuf descriptors
  • Add DynamicSchemaLoader utility to map Protobuf FieldDescriptors to Trino Column definitions

Enhancements:

  • Refactor ProtobufDeserializerFactory to expose descriptor retrieval via getDescriptor and introduce ProtobufDeserializerFactoryLoader for dependency injection

Build:

  • Add protobuf-java dependency and configure protoc-jar-maven-plugin to generate test Protobuf descriptors

Tests:

  • Add TestDynamicSchemaHiveMetastore to validate dynamic schema extraction and include Protobuf proto test fixtures

@cla-bot cla-bot bot added the cla-signed label Oct 28, 2025
@gertjanal gertjanal requested a review from dain October 28, 2025 16:13
sourcery-ai bot commented Oct 28, 2025

Reviewer's Guide

This PR implements dynamic schema support for SEQUENCEFILE_PROTOBUF Hive tables by wrapping the Hive metastore to load and cache protobuf descriptors at runtime, inferring table and partition columns from descriptor fields instead of relying on static metastore columns. It adds protobuf build and runtime dependencies, introduces a ProtobufDeserializerFactoryLoader for DI, modifies HiveMetadataFactory to use DynamicSchemaHiveMetastore, implements dynamic schema classes (wrapper, loader, cache), and includes tests verifying inferred schemas.

Sequence diagram for dynamic schema column resolution on table read

```mermaid
sequenceDiagram
    participant Client
    participant HiveMetadataFactory
    participant DynamicSchemaHiveMetastore
    participant ProtobufDeserializerFactory
    participant DescriptorFile
    Client->>HiveMetadataFactory: Request table metadata
    HiveMetadataFactory->>DynamicSchemaHiveMetastore: getTable(databaseName, tableName)
    alt Table is SEQUENCEFILE_PROTOBUF
        DynamicSchemaHiveMetastore->>ProtobufDeserializerFactory: getDescriptor(serializationClass)
        ProtobufDeserializerFactory->>DescriptorFile: Load protobuf descriptor
        ProtobufDeserializerFactory-->>DynamicSchemaHiveMetastore: Descriptor
        DynamicSchemaHiveMetastore->>DynamicSchemaLoader: fieldToColumn(descriptor.fields)
        DynamicSchemaHiveMetastore-->>HiveMetadataFactory: Table with dynamic columns
    else Table is not SEQUENCEFILE_PROTOBUF
        DynamicSchemaHiveMetastore-->>HiveMetadataFactory: Table with static columns
    end
    HiveMetadataFactory-->>Client: Return table metadata
```

ER diagram for dynamic schema column inference

```mermaid
erDiagram
    TABLE {
        string databaseName
        string tableName
        map storage
        list columns
    }
    PROTOBUF_DESCRIPTOR {
        string fullName
        list fields
    }
    COLUMN {
        string name
        string type
        string comment
    }
    TABLE ||--o{ COLUMN : has
    PROTOBUF_DESCRIPTOR ||--o{ COLUMN : infers
```

Class diagram for dynamic schema metastore components

```mermaid
classDiagram
    class HiveMetastore {
        <<interface>>
    }
    class HiveMetastoreWrapper {
        - delegate : HiveMetastore
        + HiveMetastoreWrapper(HiveMetastore)
        ...
    }
    class DynamicSchemaHiveMetastore {
        - dynamicSchemaCache : LoadingCache<TableReference, List<Column>>
        + DynamicSchemaHiveMetastore(HiveMetastore, ProtobufDeserializerFactory, Duration)
        + getTable(databaseName, tableName)
        + getPartition(table, partitionValues)
        + getPartitionsByNames(table, partitionNames)
    }
    class TableReference {
        - table : Table
        + getFullTableName()
        + getSerializationClass()
    }
    class DynamicSchemaLoader {
        + fieldToColumn(FieldDescriptor) : Column
    }
    class ProtobufDeserializerFactory {
        + getDescriptor(serializationClass) : Descriptor
    }
    class ProtobufDeserializerFactoryLoader {
        - factory : ProtobufDeserializerFactory
        + get() : ProtobufDeserializerFactory
    }
    HiveMetastoreWrapper --|> HiveMetastore
    DynamicSchemaHiveMetastore --|> HiveMetastoreWrapper
    DynamicSchemaHiveMetastore --> TableReference
    DynamicSchemaHiveMetastore --> DynamicSchemaLoader
    DynamicSchemaHiveMetastore --> ProtobufDeserializerFactory
    ProtobufDeserializerFactoryLoader --> ProtobufDeserializerFactory
```

File-Level Changes

Add protobuf build and DI setup for descriptor-based schema loading
  • Add protobuf-java dependency and protoc-jar-maven-plugin configuration to the Hive plugin POM
  • Include build-helper-maven-plugin to register generated test sources
  • Bind ProtobufDeserializerFactoryLoader as a singleton in HiveModule
plugin/trino-hive/pom.xml
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveModule.java
Expose descriptor access in ProtobufDeserializerFactory and update sequence-file reader injection
  • Define SERIALIZATION_CLASS constant and add getDescriptor method
  • Use SERIALIZATION_CLASS constant when reading serdeproperties
  • Inject ProtobufDeserializerFactoryLoader into ProtobufSequenceFilePageSourceFactory instead of direct instantiation
  • Adjust HiveTestUtils to use the loader for page source factory
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/protobuf/ProtobufDeserializerFactory.java
plugin/trino-hive/src/main/java/io/trino/plugin/hive/line/ProtobufSequenceFilePageSourceFactory.java
plugin/trino-hive/src/test/java/io/trino/plugin/hive/HiveTestUtils.java
Wire DynamicSchemaHiveMetastore into HiveMetadataFactory
  • Inject ProtobufDeserializerFactoryLoader into HiveMetadataFactory and store factory
  • In create(), wrap the delegate metastore in DynamicSchemaHiveMetastore with a fixed cache duration
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadataFactory.java
Implement dynamic schema support via metastore wrappers and loaders
  • Add HiveMetastoreWrapper as a base decorator
  • Implement DynamicSchemaHiveMetastore with a LoadingCache to infer columns from protobuf descriptors and override getTable/getPartition methods (see the sketch after the file list below)
  • Create DynamicSchemaLoader to map protobuf FieldDescriptor to Trino Column definitions
  • Add ProtobufDeserializerFactoryLoader to centralize factory construction
lib/trino-metastore/src/main/java/io/trino/metastore/HiveMetastoreWrapper.java
plugin/trino-hive/src/main/java/io/trino/plugin/hive/metastore/dynamic/DynamicSchemaHiveMetastore.java
plugin/trino-hive/src/main/java/io/trino/plugin/hive/metastore/dynamic/DynamicSchemaLoader.java
plugin/trino-hive/src/main/java/io/trino/plugin/hive/metastore/dynamic/ProtobufDeserializerFactoryLoader.java
Add integration tests and protobuf resources for dynamic schema
  • Introduce TestDynamicSchemaHiveMetastore to verify table- and partition-level dynamic schema inference
  • Provide sample person.proto in test resources for descriptor generation
plugin/trino-hive/src/test/java/io/trino/plugin/hive/metastore/dynamic/TestDynamicSchemaHiveMetastore.java
plugin/trino-hive/src/test/resources/protobuf/sources/person.proto
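
The wrapper-plus-cache pattern behind DynamicSchemaHiveMetastore, reduced to a self-contained sketch. All types below are simplified stand-ins for the Trino ones (the real wrapper delegates the full HiveMetastore interface and returns proper Table objects); only the decorator shape and the Guava LoadingCache idiom are the point:

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

import java.time.Duration;
import java.util.List;

// Stand-in for the metastore interface; the real one has many more methods.
interface Metastore
{
    TableInfo getTable(String databaseName, String tableName);
}

// Stand-in for Trino's Table; serializationClass is null for non-protobuf tables.
record TableInfo(String databaseName, String tableName, String serializationClass, List<String> columns)
{
    boolean isProtobufSequenceFile()
    {
        return serializationClass != null;
    }
}

// Decorator: pass non-protobuf tables through untouched, and replace the
// stored columns of protobuf tables with ones derived from the descriptor,
// caching the derived schema per serialization class.
class DynamicSchemaMetastoreSketch
        implements Metastore
{
    private final Metastore delegate;
    private final LoadingCache<String, List<String>> dynamicSchemaCache;

    DynamicSchemaMetastoreSketch(Metastore delegate, Duration expiration, CacheLoader<String, List<String>> schemaLoader)
    {
        this.delegate = delegate;
        this.dynamicSchemaCache = CacheBuilder.newBuilder()
                .expireAfterWrite(expiration)
                .build(schemaLoader);
    }

    @Override
    public TableInfo getTable(String databaseName, String tableName)
    {
        TableInfo table = delegate.getTable(databaseName, tableName);
        if (!table.isProtobufSequenceFile()) {
            return table;
        }
        List<String> columns = dynamicSchemaCache.getUnchecked(table.serializationClass());
        return new TableInfo(databaseName, tableName, table.serializationClass(), columns);
    }
}
```

A caller would supply something like CacheLoader.from(serializationClass -> ...) that resolves the descriptor and maps its fields to columns, along the lines of the earlier column-mapping sketch.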


@github-actions github-actions bot added the hive Hive connector label Oct 28, 2025
@sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • Avoid hardcoding the dynamic schema cache expiration in HiveMetadataFactory (currently Duration.valueOf("1h"))—expose it via HiveConfig instead of leaving it as a TODO.
  • The new HiveMetastoreWrapper duplicates every HiveMetastore method and adds a lot of boilerplate; consider using a dynamic proxy or default interface methods to reduce manual delegation.
  • The ProtobufDeserializerFactoryLoader is just a thin wrapper around ProtobufDeserializerFactory and HiveConfig—inject the factory directly to simplify bindings and remove the extra loader class.
## Individual Comments

### Comment 1
<location> `plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadataFactory.java:241-244` </location>
<code_context>
     public TransactionalMetadata create(ConnectorIdentity identity, boolean autoCommit)
     {
-        HiveMetastore hiveMetastore = createPerTransactionCache(metastoreFactory.createMetastore(Optional.of(identity)), perTransactionCacheMaximumSize);
+        HiveMetastore hiveMetastore = new DynamicSchemaHiveMetastore(
+                createPerTransactionCache(metastoreFactory.createMetastore(Optional.of(identity)), perTransactionCacheMaximumSize),
+                protobufDeserializerFactory,
+                Duration.valueOf("1h")); // TODO make configurable

         DirectoryLister directoryLister = transactionScopeCachingDirectoryListerFactory.get(this.directoryLister);
</code_context>

<issue_to_address>
**suggestion:** Hardcoded cache expiration duration should be configurable.

Consider exposing the cache expiration setting through HiveConfig or a similar configuration option to support workload-specific tuning.

Suggested implementation:

```java
        this.protobufDeserializerFactory = requireNonNull(protobufDeserializerFactory, "protobufDeserializerFactory is null");
        this.dynamicSchemaCacheExpiration = requireNonNull(hiveConfig.getDynamicSchemaCacheExpiration(), "dynamicSchemaCacheExpiration is null");

```

```java
    @Override
    public TransactionalMetadata create(ConnectorIdentity identity, boolean autoCommit)
    {
        HiveMetastore hiveMetastore = new DynamicSchemaHiveMetastore(
                createPerTransactionCache(metastoreFactory.createMetastore(Optional.of(identity)), perTransactionCacheMaximumSize),
                protobufDeserializerFactory,
                dynamicSchemaCacheExpiration);

        DirectoryLister directoryLister = transactionScopeCachingDirectoryListerFactory.get(this.directoryLister);
        SemiTransactionalHiveMetastore metastore = new SemiTransactionalHiveMetastore(

```

1. In `HiveConfig` (likely in `plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveConfig.java`), add a property:
   ```java
   private Duration dynamicSchemaCacheExpiration = Duration.valueOf("1h");

   @Config("hive.dynamic-schema-cache-expiration")
   public HiveConfig setDynamicSchemaCacheExpiration(Duration duration) {
       this.dynamicSchemaCacheExpiration = duration;
       return this;
   }

   public Duration getDynamicSchemaCacheExpiration() {
       return dynamicSchemaCacheExpiration;
   }
   ```
2. Pass `hiveConfig` to the `HiveMetadataFactory` constructor if not already done.
3. Add a field to `HiveMetadataFactory`:
   ```java
   private final Duration dynamicSchemaCacheExpiration;
   ```
4. Update any relevant configuration documentation to describe the new property.
</issue_to_address>

### Comment 2
<location> `plugin/trino-hive/src/main/java/io/trino/plugin/hive/metastore/dynamic/DynamicSchemaHiveMetastore.java:111-120` </location>
<code_context>
+    private record TableReference(Table table)
+    {
+        private String getFullTableName()
+        {
+            return String.format("%s.%s", table.getDatabaseName(), table.getTableName());
+        }
+
+        private String getSerializationClass()
+        {
+            String serializationClass = table.getStorage().getSerdeParameters().get(SERIALIZATION_CLASS);
</code_context>

<issue_to_address>
**issue (bug_risk):** TableReference cache key only uses database and table name, ignoring storage/serde parameters.

Including storage and serde parameters in the cache key will help prevent incorrect cache hits when tables with the same name have different formats.
</issue_to_address>
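
A minimal sketch of what a richer cache key could look like (a hypothetical record; the assumption is that the serialization class and serde parameters are what distinguish two same-named tables):

```java
import java.util.Map;

// Hypothetical cache key that also captures the serde configuration, so two
// tables with the same name but different serialization classes or serde
// parameters never share a cached schema.
record SchemaCacheKey(
        String databaseName,
        String tableName,
        String serializationClass,
        Map<String, String> serdeParameters)
{
}
```

Records derive equals and hashCode from all components, so the serde configuration participates in cache-key identity for free.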



@gertjanal (Contributor, Author) replied to the comment about the hardcoded cache duration:
I will turn this TODO into a proper config property once reviewers agree with the rest of the code.

@gertjanal force-pushed the dynamic-schema-support-in-hive branch from 059ebf6 to a60ff5f on November 1, 2025 21:10
