Add Parquet reader #373

Merged
merged 1 commit into from
Mar 7, 2025

Contributor

@andythsu andythsu commented Mar 3, 2025

Successor of #91
This PR revives the aforementioned PR, updating it to the latest libraries to streamline integration. The core logic remains unchanged.

Contributor Author

andythsu commented Mar 3, 2025

I'm running into

Caused by: java.lang.ClassCastException: class org.apache.parquet.io.MessageColumnIO cannot be cast to class org.apache.parquet.io.PrimitiveColumnIO (org.apache.parquet.io.MessageColumnIO and org.apache.parquet.io.PrimitiveColumnIO are in unnamed module of loader 'app')
	at io.trino.parquet.ParquetTypeUtils.constructField(ParquetTypeUtils.java:359)
	at io.trino.parquet.ParquetTypeUtils.constructField(ParquetTypeUtils.java:286)
	at org.ebyhr.trino.storage.operator.ParquetPlugin.lambda$getReader$2(ParquetPlugin.java:155)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:215)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:215)
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1709)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:570)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:560)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:636)
	at java.base/java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:291)
	at java.base/java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:656)
	at java.base/java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:662)
	at java.base/java.util.stream.ReferencePipeline.toList(ReferencePipeline.java:667)
	at org.ebyhr.trino.storage.operator.ParquetPlugin.getReader(ParquetPlugin.java:156)
	... 37 more

when running the testSelectParquet unit test. Could you advise? @nineinchnick
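
For readers unfamiliar with the exception above: in parquet-mr, `MessageColumnIO` extends `GroupColumnIO`, while `PrimitiveColumnIO` is a separate subclass of `ColumnIO`, so a cast from the schema root to a primitive column can never succeed. The following is a minimal, self-contained sketch of that situation using hypothetical stand-in classes (`ColumnNode`, `GroupNode`, `MessageNode`, `PrimitiveNode`); it does not use the real Parquet or Trino APIs and does not claim to reproduce the cause of the failure in this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class ColumnIoCastSketch {
    // Hypothetical stand-ins that mirror only the shape of the parquet-mr hierarchy:
    // ColumnIO <- GroupColumnIO <- MessageColumnIO, and ColumnIO <- PrimitiveColumnIO.
    static abstract class ColumnNode {
        final String name;
        ColumnNode(String name) { this.name = name; }
    }

    static class PrimitiveNode extends ColumnNode {
        PrimitiveNode(String name) { super(name); }
    }

    static class GroupNode extends ColumnNode {
        private final Map<String, ColumnNode> children = new LinkedHashMap<>();
        GroupNode(String name) { super(name); }
        GroupNode add(ColumnNode child) { children.put(child.name, child); return this; }
        Optional<ColumnNode> child(String name) { return Optional.ofNullable(children.get(name)); }
    }

    static class MessageNode extends GroupNode {
        MessageNode(String name) { super(name); }
    }

    // Returns true when a blind cast to PrimitiveNode throws ClassCastException.
    static boolean unsafeCastFails(ColumnNode node) {
        try {
            PrimitiveNode ignored = (PrimitiveNode) node;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    // Looks up a child by name and returns it only if it really is a primitive node.
    static Optional<PrimitiveNode> primitiveChild(GroupNode group, String column) {
        return group.child(column)
                .filter(PrimitiveNode.class::isInstance)
                .map(PrimitiveNode.class::cast);
    }

    public static void main(String[] args) {
        MessageNode root = new MessageNode("schema");
        root.add(new PrimitiveNode("id"));

        System.out.println(unsafeCastFails(root));                       // root is a group, not a primitive
        System.out.println(primitiveChild(root, "id").isPresent());      // child is a primitive
        System.out.println(primitiveChild(root, "missing").isPresent()); // no such child
    }
}
```

The sketch only shows that the node handed to a primitive-only code path must be a leaf column rather than the schema root or a group. Whether the mismatch here comes from the test file or from how the column is looked up is not established in this thread.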

@nineinchnick
Collaborator

I don't remember where I got the alltypes_plain.parquet file. Maybe we should create a new one with the latest Trino version by inserting into a Hive or Iceberg table.

* this class is copied from io.trino.parquet.reader.FileParquetDataSource
* because it is in test scope
*/
public static class FileParquetDataSource
Contributor Author

@andythsu andythsu Mar 6, 2025

As mentioned in the code comment, this class is copied here to avoid introducing a test-jar dependency into this repo. Once we integrate this connector into Trino, we can get rid of this class.

Contributor Author

@andythsu andythsu Mar 6, 2025

Also, if we add the test-jar to this pom and then include this connector in Trino, Trino complains during startup:

java.lang.NullPointerException: Cannot invoke "org.sonatype.aether.artifact.Artifact.getGroupId()" because "artifact" is null
	at io.airlift.resolver.ArtifactResolver.getArtifactKey(ArtifactResolver.java:243)
	at io.airlift.resolver.ArtifactResolver.lambda$resolvePom$0(ArtifactResolver.java:184)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:215)
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1709)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:570)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:560)
	at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.forEachRemaining(StreamSpliterators.java:315)
	at java.base/java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:735)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:570)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:560)
	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:265)
	at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:727)
	at io.airlift.resolver.ArtifactResolver.resolvePom(ArtifactResolver.java:185)
	at io.trino.server.DevelopmentPluginsProvider.buildClassLoaderFromPom(DevelopmentPluginsProvider.java:101)
	at io.trino.server.DevelopmentPluginsProvider.doBuildClassLoader(DevelopmentPluginsProvider.java:88)
	at io.trino.server.DevelopmentPluginsProvider.buildClassLoader(DevelopmentPluginsProvider.java:76)
	at io.trino.server.DevelopmentPluginsProvider.lambda$loadPlugins$0(DevelopmentPluginsProvider.java:67)
	at io.trino.server.PluginManager.loadPlugin(PluginManager.java:161)
	at io.trino.server.DevelopmentPluginsProvider.lambda$loadPlugins$1(DevelopmentPluginsProvider.java:67)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:79)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1575)

I'm not sure of the exact reason, though.

Contributor Author

andythsu commented Mar 6, 2025

This is what it looks like once it's integrated into Trino:

[image]

Collaborator

@nineinchnick nineinchnick left a comment

Please squash all commits

@andythsu andythsu force-pushed the add_parquet branch 2 times, most recently from 190eb8d to 5ed038c on March 6, 2025 18:40
Contributor Author

andythsu commented Mar 6, 2025

> Please squash all commits

@nineinchnick I've squashed the commits and resolved all comments. Please take another look.

@nineinchnick nineinchnick merged commit 563cc4f into snowlift:master Mar 7, 2025
1 check passed