Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-8836][CH] Fix partition values with escape char #8840

Merged
merged 1 commit into from
Mar 5, 2025

Conversation

lwz9103
Copy link
Contributor

@lwz9103 lwz9103 commented Feb 27, 2025

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

(Fixes: #8836)

How was this patch tested?

unit tests

Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Copy link

Run Gluten ClickHouse CI on ARM

@lwz9103
Copy link
Contributor Author

lwz9103 commented Feb 27, 2025

Run Gluten ClickHouse CI on x86

@PHILO-HE PHILO-HE changed the title GLUTEN-8836 Fix partition values with escape char [GLUTEN-8836][CH] Fix partition values with escape char Feb 27, 2025
Copy link

#8836

Copy link

github-actions bot commented Mar 4, 2025

Run Gluten ClickHouse CI on ARM

Copy link

github-actions bot commented Mar 4, 2025

Run Gluten ClickHouse CI on ARM

Copy link

github-actions bot commented Mar 5, 2025

Run Gluten ClickHouse CI on ARM

@lwz9103
Copy link
Contributor Author

lwz9103 commented Mar 5, 2025

@CodiumAI-Agent /review

@CodiumAI-Agent
Copy link

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

8836 - Partially compliant

Compliant requirements:

  • Added tests covering escaped characters in partition values.
  • Implemented URI decoding for partition names in Scala.
  • Introduced a dedicated C++ function to perform escaping.

Non-compliant requirements:

[]

Requires further human verification:

  • Edge-case behavior for malformed URIs and non-ascii characters.
  • Consistency of escaping logic between Scala and C++ implementations.
⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

URI Decoding Safety

The new apply method in MergeTreePartSplit decodes the name and directory using URI. It would be beneficial to confirm that all invalid or unexpected URI inputs are handled gracefully.

object MergeTreePartSplit {
  def apply(
      name: String,
      dirName: String,
      targetNode: String,
      start: Long,
      length: Long,
      bytesOnDisk: Long
  ): MergeTreePartSplit = {
    // Ref to org.apache.spark.sql.delta.files.TahoeFileIndex.absolutePath
    val uriDecodeName = new Path(new URI(name)).toString
    val uriDecodeDirName = new Path(new URI(dirName)).toString
    new MergeTreePartSplit(uriDecodeName, uriDecodeDirName, targetNode, start, length, bytesOnDisk)
  }
}
Test Coverage

The added test "test partitioned with escaped characters" covers various escape characters. Verify that edge cases, such as null or empty strings and multiple escape scenarios, are adequately validated.

test("test partitioned with escaped characters") {

  val schema = StructType(
    Seq(
      StructField.apply("id", IntegerType, nullable = true),
      StructField.apply("escape", StringType, nullable = true),
      StructField.apply("bucket/col", StringType, nullable = true),
      StructField.apply("part=col1", DateType, nullable = true),
      StructField.apply("part_col2", StringType, nullable = true)
    ))

  val data: Seq[Row] = Seq(
    Row(1, "=", "00000", Date.valueOf("2024-01-01"), "2024=01/01"),
    Row(2, "/", "00000", Date.valueOf("2024-01-01"), "2024=01/01"),
    Row(3, "#", "00000", Date.valueOf("2024-01-01"), "2024#01:01"),
    Row(4, ":", "00001", Date.valueOf("2024-01-02"), "2024#01:01"),
    Row(5, "\\", "00001", Date.valueOf("2024-01-02"), "2024\\01\u000101"),
    Row(6, "\u0001", "000001", Date.valueOf("2024-01-02"), "2024\\01\u000101"),
    Row(7, "", "000002", null, null)
  )

  val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
  df.createOrReplaceTempView("origin_table")
  spark.sql("select * from origin_table").show()

  nativeWrite {
    format =>
      val table_name = table_name_template.format(format)
      spark.sql(s"drop table IF EXISTS $table_name")
      writeAndCheckRead("origin_table", table_name, schema.fieldNames.map(f => s"`$f`")) {
        _ =>
          spark
            .table("origin_table")
            .write
            .format(format)
            .partitionBy("part=col1", "part_col2")
            .bucketBy(2, "bucket/col")
            .saveAsTable(table_name)
      }

      val table_name_vanilla = table_name_vanilla_template.format(format)
      spark.sql(s"drop table IF EXISTS $table_name_vanilla")
      withSQLConf((GlutenConfig.NATIVE_WRITER_ENABLED.key, "false")) {
        withNativeWriteCheck(checkNative = false) {
          spark
            .table("origin_table")
            .write
            .format(format)
            .partitionBy("part=col1", "part_col2")
            .bucketBy(2, "bucket/col")
            .saveAsTable(table_name_vanilla)
        }
        compareWriteFilesSignature(format, table_name, table_name_vanilla, "sum(id)")
      }
  }
}

test("test bucketed by constant") {
Escaping Logic

The C++ implementation for escaping uses a bitset and stream formatting. It is recommended to review edge cases (e.g., handling of non-ASCII characters) and ensure the formatting flags do not inadvertently affect later operations.

}();

static bool needsEscaping(char c) {
    return c >= 0 && c < SparkPartitionEscape::ESCAPE_BITSET.size()
        && SparkPartitionEscape::ESCAPE_BITSET.test(c);
}

static std::string escapePathName(const std::string & path) {
    std::ostringstream builder;
    for (char c : path) {
        if (needsEscaping(c)) {
            builder << '%' << std::uppercase << std::setw(2) << std::setfill('0') << std::hex << (int)c;
        } else {
            builder << c;
        }
    }

@baibaichen baibaichen merged commit be909b6 into apache:main Mar 5, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CH] Unsupported partition values with escape char
3 participants