
[Bug] AuthZ RowFilter causes org.apache.spark.sql.AnalysisException: [MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION] in spark 3.5 #6889

Open
lanklaas opened this issue Jan 10, 2025 · 2 comments
Labels
kind:bug, priority:major

Comments

lanklaas commented Jan 10, 2025

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

Hi,

Using Kyuubi 1.10.1 with Spark 3.5.2 appears to have a regression compared to Kyuubi with Spark 3.4.4. I have a view with a row filter, and when I query the view as two subqueries of itself I get the error shown in the engine log below.

I was able to reproduce this minimally using the source tag v1.10.1 with a default build of Kyuubi and Ranger running in Docker.

To reproduce the error, first create tables from these zipped Parquet files (the DDL below assumes they are extracted under /tmp/chinook/):

test-data.zip

Here is the SQL to create the tables:

create table if not exists Album
  USING org.apache.spark.sql.parquet
  OPTIONS (
  path ("/tmp/chinook/alb.parquet")
  );

create table if not exists Artist
  USING org.apache.spark.sql.parquet
  OPTIONS (
  path ("/tmp/chinook/art.parquet")
  );

create table if not exists Track
  USING org.apache.spark.sql.parquet
  OPTIONS (
  path ("/tmp/chinook/trk.parquet")
  );
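
As an optional sanity check (not required for the repro), the three tables should be queryable once the archive is extracted to /tmp/chinook:

select count(*) from Album;
select count(*) from Artist;
select count(*) from Track;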

Then you create a view on top of these tables:

CREATE VIEW myview
as
SELECT
    `E95676`.`ArtistId` `ArtistId`
,   `E95676`.`Name`     `ArtistName`
,   `E95675`.`AlbumId`  `AlbumId`
,   `E95675`.`Title`    `AlbumTitle`
,   `E95685`.`TrackId`  `TrackId`
,   `E95685`.`Name`     `TrackName`
FROM
    `Album` `E95675`
LEFT OUTER JOIN
    `Artist`    `E95676`
ON
    `E95675`.`ArtistId` =   `E95676`.`ArtistId`
LEFT OUTER JOIN
    `Track` `E95685`
ON
    `E95685`.`AlbumId`  =   `E95675`.`AlbumId`

Then a row filter should be added in Ranger for the myview table, like so:

[Screenshot: Ranger row filter policy applied to the myview table]
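
Judging from the Filter (albumid#328L = cast(117 as bigint)) node under RowFilterMarker in the engine log below, the policy's row filter expression appears to be equivalent to AlbumId = 117, so each reference to the view is effectively constrained to something like the following (inferred from the analyzed plan, not copied from the actual Ranger policy):

SELECT * FROM myview WHERE AlbumId = 117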

The query that causes the error is this:

SELECT T0.C1, T1.F1
FROM (
select a.TrackName C1 from myview a
	) T0
LEFT OUTER JOIN (	
select b.TrackName F1 from myview b
) T1 ON T0.C1 = T1.F1

The strange thing is that changing the case of a single character in the view name in the second subquery (myview -> Myview) makes the query work:

SELECT T0.C1, T1.F1
FROM (
select a.TrackName C1 from myview a
	) T0
LEFT OUTER JOIN (	
select b.TrackName F1 from Myview b
) T1 ON T0.C1 = T1.F1

Unfortunately, I do not have control over the casing used in these queries, so this workaround is not an option for us.

I tested in our k8s environment against Spark 3.4.4 and the issue does not occur there. I have not yet tested against a local build for Spark 3.4; I will provide those details once the build completes.

Affects Version(s)

1.10.1

Kyuubi Server Log Output

No response

Kyuubi Engine Log Output

org.apache.spark.sql.AnalysisException: [MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION] Resolved attribute(s) "TrackName" missing from "ArtistId", "ArtistName", "AlbumId", "AlbumTitle", "TrackId", "TrackName" in operator !Project [TrackName#331 AS F1#319]. Attribute(s) with the same name appear in the operation: "TrackName".
Please check if the right attribute(s) are used.; line 1 pos 88;
Project [C1#318, F1#319]
+- Join LeftOuter, (C1#318 = F1#319)
   :- SubqueryAlias T0
   :  +- Project [TrackName#331 AS C1#318]
   :     +- SubqueryAlias a
   :        +- SubqueryAlias spark_catalog.default.myview
   :           +- Filter (albumid#328L = cast(117 as bigint))
   :              +- RowFilterMarker
   :                 +- PermanentViewMarker
   :                       +- View (`spark_catalog`.`default`.`myview`, [ArtistId#326L,ArtistName#327,AlbumId#328L,AlbumTitle#329,TrackId#330L,TrackName#331])
   :                          +- Project [cast(ArtistId#320L as bigint) AS ArtistId#326L, cast(ArtistName#321 as string) AS ArtistName#327, cast(AlbumId#322L as bigint) AS AlbumId#328L, cast(AlbumTitle#323 as string) AS AlbumTitle#329, cast(TrackId#324L as bigint) AS TrackId#330L, cast(TrackName#325 as string) AS TrackName#331]
   :                             +- Project [ArtistId#91L AS ArtistId#320L, Name#92 AS ArtistName#321, AlbumId#88L AS AlbumId#322L, Title#89 AS AlbumTitle#323, TrackId#93L AS TrackId#324L, Name#94 AS TrackName#325]
   :                                +- Join LeftOuter, (AlbumId#95L = AlbumId#88L)
   :                                   :- Join LeftOuter, (ArtistId#90L = ArtistId#91L)
   :                                   :  :- SubqueryAlias E95675
   :                                   :  :  +- SubqueryAlias spark_catalog.default.album
   :                                   :  :     +- Relation spark_catalog.default.album[AlbumId#88L,Title#89,ArtistId#90L] parquet
   :                                   :  +- SubqueryAlias E95676
   :                                   :     +- SubqueryAlias spark_catalog.default.artist
   :                                   :        +- Relation spark_catalog.default.artist[ArtistId#91L,Name#92] parquet
   :                                   +- SubqueryAlias E95685
   :                                      +- SubqueryAlias spark_catalog.default.track
   :                                         +- Relation spark_catalog.default.track[TrackId#93L,Name#94,AlbumId#95L,MediaTypeId#96L,GenreId#97L,Composer#98,Milliseconds#99L,Bytes#100L,UnitPrice#101] parquet
   +- SubqueryAlias T1
      +- !Project [TrackName#331 AS F1#319]
         +- SubqueryAlias b
            +- SubqueryAlias spark_catalog.default.myview
               +- Filter (albumid#348L = cast(117 as bigint))
                  +- RowFilterMarker
                     +- PermanentViewMarker
                           +- Project [cast(ArtistId#326L as bigint) AS ArtistId#346L, cast(ArtistName#327 as string) AS ArtistName#347, cast(AlbumId#328L as bigint) AS AlbumId#348L, cast(AlbumTitle#329 as string) AS AlbumTitle#349, cast(TrackId#330L as bigint) AS TrackId#350L, cast(TrackName#331 as string) AS TrackName#351]
                              +- View (`spark_catalog`.`default`.`myview`, [ArtistId#326L,ArtistName#327,AlbumId#328L,AlbumTitle#329,TrackId#330L,TrackName#331])
                                 +- Project [cast(ArtistId#320L as bigint) AS ArtistId#326L, cast(ArtistName#321 as string) AS ArtistName#327, cast(AlbumId#322L as bigint) AS AlbumId#328L, cast(AlbumTitle#323 as string) AS AlbumTitle#329, cast(TrackId#324L as bigint) AS TrackId#330L, cast(TrackName#325 as string) AS TrackName#331]
                                    +- Project [ArtistId#335L AS ArtistId#320L, Name#336 AS ArtistName#321, AlbumId#332L AS AlbumId#322L, Title#333 AS AlbumTitle#323, TrackId#337L AS TrackId#324L, Name#338 AS TrackName#325]
                                       +- Join LeftOuter, (AlbumId#339L = AlbumId#332L)
                                          :- Join LeftOuter, (ArtistId#334L = ArtistId#335L)
                                          :  :- SubqueryAlias E95675
                                          :  :  +- SubqueryAlias spark_catalog.default.album
                                          :  :     +- Relation spark_catalog.default.album[AlbumId#332L,Title#333,ArtistId#334L] parquet
                                          :  +- SubqueryAlias E95676
                                          :     +- SubqueryAlias spark_catalog.default.artist
                                          :        +- Relation spark_catalog.default.artist[ArtistId#335L,Name#336] parquet
                                          +- SubqueryAlias E95685
                                             +- SubqueryAlias spark_catalog.default.track
                                                +- Relation spark_catalog.default.track[TrackId#337L,Name#338,AlbumId#339L,MediaTypeId#340L,GenreId#341L,Composer#342,Milliseconds#343L,Bytes#344L,UnitPrice#345] parquet

        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:711)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2$adapted(CheckAnalysis.scala:215)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:243)
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:243)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
        at scala.collection.IterableLike.foreach(IterableLike.scala:74)
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:243)
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:243)
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:243)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
        at scala.collection.IterableLike.foreach(IterableLike.scala:74)
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:243)
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:243)
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:243)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
        at scala.collection.IterableLike.foreach(IterableLike.scala:74)
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:243)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:215)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:197)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:202)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:193)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:171)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:202)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:225)
        at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:222)
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:77)
        at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:138)
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:219)
        at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546)
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:219)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
        at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:218)
        at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:77)
        at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74)
        at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66)
        at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
        at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
        at org.apache.spark.sql.SparkSession.$anonfun$sql$4(SparkSession.scala:691)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
        at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:682)
        at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:713)
        at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:744)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.$anonfun$executeStatement$1(ExecuteStatement.scala:90)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.kyuubi.engine.spark.operation.SparkOperation.$anonfun$withLocalProperties$1(SparkOperation.scala:174)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
        at org.apache.kyuubi.engine.spark.operation.SparkOperation.withLocalProperties(SparkOperation.scala:158)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.executeStatement(ExecuteStatement.scala:85)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement$$anon$1.run(ExecuteStatement.scala:113)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

Kyuubi Server Configurations

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

## Kyuubi Configurations

#
# kyuubi.authentication                    NONE
#
kyuubi.frontend.bind.host                0.0.0.0
# kyuubi.frontend.protocols                THRIFT_BINARY,REST
# kyuubi.frontend.thrift.binary.bind.port  10009
# kyuubi.frontend.rest.bind.port           10099
#
# kyuubi.engine.type                       SPARK_SQL
# kyuubi.engine.share.level                USER
# kyuubi.session.engine.initialize.timeout PT3M
#
kyuubi.ha.addresses                      localhost:2181
# kyuubi.ha.namespace                      kyuubi
#

# Details in https://kyuubi.readthedocs.io/en/master/configuration/settings.html
spark.sql.extensions=org.apache.kyuubi.plugin.spark.authz.ranger.RangerSparkExtension
spark.executor.extraClassPath=/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/apiguardian-api-1.1.2.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/gethostname4j-1.0.0.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jackson-annotations-2.15.0.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jackson-core-2.15.0.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jackson-databind-2.15.0.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jackson-jaxrs-1.9.13.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jackson-jaxrs-base-2.15.0.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jackson-jaxrs-json-provider-2.15.0.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jersey-bundle-1.19.4.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jna-5.13.0.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jna-platform-5.13.0.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/kyuubi-spark-authz-shaded_2.12-1.10.1.jar
# spark.executor.extraClassPath=/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/kyuubi-spark-authz-shaded_2.12-1.10.1.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jackson-jaxrs-1.9.13.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jackson-jaxrs-json-provider-2.15.0.jar
spark.driver.extraClassPath=/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/apiguardian-api-1.1.2.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/gethostname4j-1.0.0.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jackson-annotations-2.15.0.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jackson-core-2.15.0.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jackson-databind-2.15.0.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jackson-jaxrs-1.9.13.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jackson-jaxrs-base-2.15.0.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jackson-jaxrs-json-provider-2.15.0.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jersey-bundle-1.19.4.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jna-5.13.0.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jna-platform-5.13.0.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/kyuubi-spark-authz-shaded_2.12-1.10.1.jar
# spark.driver.extraClassPath=/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/kyuubi-spark-authz-shaded_2.12-1.10.1.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jackson-jaxrs-1.9.13.jar:/home/me/git/hub/kyuubi/extensions/spark/kyuubi-spark-authz-shaded/target/jackson-jaxrs-json-provider-2.15.0.jar

Kyuubi Engine Configurations

## ranger-spark-security.xml


<configuration>
  <property>
    <name>ranger.plugin.spark.policy.rest.url</name>
    <value>http://localhost:6080</value>
  </property>
  <property>
    <name>ranger.plugin.spark.service.name</name>
    <value>spark</value>
  </property>
  <property>
    <name>ranger.plugin.spark.policy.cache.dir</name>
    <value>/tmp/policycache</value>
  </property>
  <property>
    <name>ranger.plugin.spark.policy.pollIntervalMs</name>
    <value>1000</value>
  </property>
  <property>
    <name>ranger.plugin.spark.policy.source.impl</name>
    <value>org.apache.ranger.admin.client.RangerAdminRESTClient</value>
  </property>
  <property>
    <name>ranger.plugin.spark.enable.implicit.userstore.enricher</name>
    <value>true</value>
    <description>Enable UserStoreEnricher for fetching user and group attributes if using macros or scripts in row-filters since Ranger 2.3</description>
  </property>
  <property>
    <name>ranger.plugin.hive.policy.cache.dir</name>
    <value>/tmp/policycache</value>
    <description>As Authz plugin reuses hive service def, a policy cache path is required for caching UserStore and Tags for &quot;hive&quot; service def, while &quot;ranger.plugin.spark.policy.cache.dir config&quot; is the path for caching policies in service. </description>
  </property>
</configuration>

Additional context

No response

Are you willing to submit PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
  • No. I cannot submit a PR at this time.
lanklaas added the kind:bug and priority:major labels on Jan 10, 2025

Hello @lanklaas,
Thanks for finding the time to report the issue!
We really appreciate the community's efforts to improve Apache Kyuubi.

lanklaas (Author) commented:

Just did a build for Spark 3.4 with ./build/mvn clean package -Pspark-3.4 -DskipTests and can confirm the error does not happen.

lanklaas changed the title from "AuthZ RowFilter causes org.apache.spark.sql.AnalysisException: [MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION] in spark 3.5" to "[Bug] AuthZ RowFilter causes org.apache.spark.sql.AnalysisException: [MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION] in spark 3.5" on Jan 13, 2025