Conversation

@khj809
Contributor

@khj809 khj809 commented Jan 7, 2026

Summary

Fixes a critical bug in StringViewToDLDataType_ that causes undefined behavior when parsing dtype strings from a non-null-terminated std::string_view. This bug particularly affects Electron applications due to aggressive memory reuse in Chromium's allocators.

Problem

The original implementation used strtoul() to parse numeric values (bits and lanes) directly from std::string_view::data(). However, strtoul() requires null-terminated strings and will continue reading beyond the string_view bounds until it finds a null terminator or non-digit character.

Solution

Replace strtoul() with manual digit parsing that respects string_view boundaries:

auto parse_digits = [](const char*& ptr, const char* end) -> uint32_t {
  uint32_t value = 0;
  while (ptr < end && *ptr >= '0' && *ptr <= '9') {
    value = value * 10 + (*ptr - '0');
    ptr++;
  }
  return value;
};
  • Never reads beyond str.data() + str.length()
  • Works correctly with non-null-terminated strings
  • No memory allocation overhead
  • Comparable performance to strtoul(), without the out-of-bounds read

Testing

New test TEST(DType, NonNullTerminatedStringView) validates:

  • Parsing with digit garbage after the string ("float16999888777" length 7 -> bits=16, not 16999888777)
  • Parsing with lane specifications outside bounds ("int32x4extradata" length 5 -> lanes=1, not 4)
  • Correct parsing when lanes are within bounds ("bfloat16x2" length 10 -> lanes=2)
  • Non-null-terminated buffers ("float64AAAAA" length 7 -> float64)
  • Edge cases with various dtype types (int, uint, float, bfloat)

@gemini-code-assist
Contributor

Summary of Changes

Hello @khj809, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical data type parsing bug that could cause crashes or incorrect behavior when std::string_view instances, particularly those from aggressively reused memory, are not null-terminated. The solution involves a robust, boundary-aware parsing mechanism and is backed by new tests to ensure the integrity and safety of data type conversions.

Highlights

  • Critical Bug Fix: Addresses an out-of-bounds read vulnerability in the StringViewToDLDataType_ function, which could lead to undefined behavior when processing non-null-terminated std::string_view inputs.
  • Safe Parsing Implementation: Replaces the unsafe strtoul() calls with a custom parse_digits lambda function. This new utility ensures that numeric parsing respects the explicit boundaries of std::string_view, preventing reads beyond its allocated length.
  • Enhanced Test Coverage: Introduces a new comprehensive test suite, TEST(DType, NonNullTerminatedStringView), specifically designed to validate the correct behavior of DLDataType parsing under various non-null-terminated string scenarios, including those with trailing garbage data.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly fixes a critical out-of-bounds read bug in StringViewToDLDataType_ by replacing strtoul with a custom, bounds-aware digit parser. The fix is sound and is accompanied by a comprehensive set of new tests. My review includes suggestions to further improve the robustness of the parser by adding checks for integer overflows and validating the parsed values for bits and lanes to prevent silent data truncation and invalid configurations.

@junrushao
Member

Interesting! @tqchen could you take a look?

@tqchen
Member

tqchen commented Jan 7, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses a critical out-of-bounds read vulnerability in StringViewToDLDataType_ by replacing the unsafe strtoul with a custom, bounds-aware parsing function for non-null-terminated std::string_views. The accompanying tests are thorough and validate the fix for various edge cases. My review focuses on ensuring the new code is fully robust. I've identified one potential out-of-bounds read that was introduced and an opportunity to reduce code duplication for better maintainability.

@junrushao
Member

/gemini Review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This is an excellent fix for a critical out-of-bounds read vulnerability. Replacing strtoul with a custom, bounds-aware parsing implementation is the right approach to ensure safety with non-null-terminated std::string_views. The new helper functions parse_digits and parse_lanes are well-structured and make the code cleaner. The accompanying tests are very thorough and cover a wide range of edge cases, which gives high confidence in the fix. I have one suggestion to make the digit parsing even more robust against potential overflows.

Comment on lines +216 to +219
    while (*ptr < end && **ptr >= '0' && **ptr <= '9') {
      value = value * 10 + (**ptr - '0');
      (*ptr)++;
    }
Contributor

high

The current implementation of parse_digits has a subtle bug where value can overflow uint64_t if the input string contains a very large number (more than 19 digits). If value overflows, it will wrap around (which is defined behavior for unsigned integers), and the subsequent check value > UINT32_MAX might fail to detect the overflow, leading to incorrect parsing.

To make this function more robust, you should check for potential overflow before performing the multiplication and addition. Since the final value should fit in a uint32_t, you can check against UINT32_MAX within the loop.

    while (*ptr < end && **ptr >= '0' && **ptr <= '9') {
      uint8_t digit = **ptr - '0';
      if (value > UINT32_MAX / 10 || (value == UINT32_MAX / 10 && digit > UINT32_MAX % 10)) {
        // Number is too large for uint32_t, set to overflow and consume rest of digits.
        value = (uint64_t)UINT32_MAX + 1;
        while (*ptr < end && **ptr >= '0' && **ptr <= '9') {
          (*ptr)++;
        }
        break;
      }
      value = value * 10 + digit;
      (*ptr)++;
    }

Member

Gemini is getting too much

@junrushao junrushao merged commit ae30cd6 into apache:main Jan 8, 2026
7 checks passed
@khj809 khj809 deleted the fix/dtype-from-non-null-terminated-string branch January 9, 2026 01:34
3 participants