Change read_text() return value to BytesText #944

Merged
Mingun merged 1 commit into tafia:master from Mingun:flexible-read-text
Feb 22, 2026

Mingun (Collaborator) commented Feb 20, 2026

Previously we decoded the bytes that were read by read_text(), but correct processing should probably also include EOL normalization (because, according to the specification, an XML parser should operate on normalized input). So now the user can choose how to process the content:

  • use decode() to get only decoded text (as it was before)
  • use xml_content() to get the text according to the XML standard
  • use Deref implementation to get the underlying bytes
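
The three access paths can be pictured with a small stand-in type. This is only a sketch: `RawText` is a hypothetical name, it is not quick-xml's `BytesText`, it assumes UTF-8 input, and its method bodies are illustrative rather than the crate's implementation. The normalization rule it applies is the XML 1.0 one: `\r\n` and any lone `\r` both become `\n`.

```rust
use std::borrow::Cow;
use std::ops::Deref;
use std::str::Utf8Error;

/// Hypothetical stand-in for the value returned by `read_text()`.
/// NOT quick-xml's `BytesText`; it only mirrors the three access paths
/// and assumes the underlying bytes are UTF-8.
pub struct RawText<'a>(pub &'a [u8]);

impl<'a> RawText<'a> {
    /// Decoding only: bytes -> text, line endings left untouched.
    pub fn decode(&self) -> Result<Cow<'a, str>, Utf8Error> {
        std::str::from_utf8(self.0).map(Cow::Borrowed)
    }

    /// Decoding plus XML 1.0 end-of-line normalization.
    /// Borrows when no '\r' is present, allocates otherwise.
    pub fn xml_content(&self) -> Result<Cow<'a, str>, Utf8Error> {
        let text = std::str::from_utf8(self.0)?;
        if text.contains('\r') {
            // Replace CR LF pairs first so the remaining CRs are all lone ones.
            Ok(Cow::Owned(text.replace("\r\n", "\n").replace('\r', "\n")))
        } else {
            Ok(Cow::Borrowed(text))
        }
    }
}

/// Access to the underlying bytes, with no processing at all.
impl<'a> Deref for RawText<'a> {
    type Target = [u8];

    fn deref(&self) -> &[u8] {
        self.0
    }
}
```

With `RawText(b"a\r\nb")`, `decode()` keeps the `\r\n`, `xml_content()` yields `a\nb`, and `&*text` gives back the original bytes.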

codecov-commenter commented Feb 20, 2026

Codecov Report

❌ Patch coverage is 42.85714% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.38%. Comparing base (2b21d40) to head (c5506c3).
⚠️ Report is 24 commits behind head on master.

Files with missing lines | Patch % | Lines
examples/read_nodes.rs | 0.00% | 4 Missing ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #944      +/-   ##
==========================================
+ Coverage   55.00%   56.38%   +1.38%     
==========================================
  Files          44       44              
  Lines       16816    17580     +764     
==========================================
+ Hits         9249     9913     +664     
- Misses       7567     7667     +100     
Flag | Coverage Δ
unittests | 56.38% <42.85%> (+1.38%) ⬆️

dralley (Collaborator) commented Feb 21, 2026

Tangential to this PR, the XML spec reads as though EOL handling (replacing CR-LF sequences, etc.) ought to be a step that precedes any parsing of the XML. That is, it's not really intended to be something which happens during the processing of individual elements.

Is that understanding correct?

This PR would still be a reasonable change given that no such mechanism currently exists (same as with decoding in general), but it might be relevant to e.g. #441 (comment)

I could imagine having a series of layers where

Layer 1) Decoding to utf-8 OR validating utf-8 OR assuming utf-8 (e.g. slice reader)
Layer 2) Preprocessing including EOL normalization, perhaps newline tracking for more useful errors
Layer 3) XML Parsing, which can universally assume pre-normalized utf-8

Pros:

  • simpler
  • probably more standard compliant

Cons:

  • slice reader becomes slightly less useful
    • you would either still need a step involving buffered processing, or enforce pre-processing the document using a function that returns some special type wrapped around Cow<str> that might allocate a new copy
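
A minimal sketch of layers 1 and 2 for the slice-reader case, assuming UTF-8 input. The function names `validate`, `normalize_eol` and `preprocess` are hypothetical and are not part of quick-xml. It shows the trade-off named in the last bullet: the result is a `Cow<str>` that borrows when the document has no `\r` and allocates a new copy otherwise.

```rust
use std::borrow::Cow;
use std::str::Utf8Error;

/// Layer 1: validate once that the whole input is UTF-8 (no copy is made).
fn validate(input: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(input)
}

/// Layer 2: XML 1.0 end-of-line normalization over the whole document.
/// Borrows when there is no '\r'; allocates a new copy otherwise.
fn normalize_eol(input: &str) -> Cow<'_, str> {
    if !input.contains('\r') {
        return Cow::Borrowed(input);
    }
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\r' {
            // "\r\n" collapses to a single "\n"; a lone "\r" becomes "\n".
            if chars.peek() == Some(&'\n') {
                chars.next();
            }
            out.push('\n');
        } else {
            out.push(c);
        }
    }
    Cow::Owned(out)
}

/// Layers 1 and 2 combined. Layer 3 (the parser) would consume the result
/// and could assume normalized UTF-8 throughout.
fn preprocess(input: &[u8]) -> Result<Cow<'_, str>, Utf8Error> {
    Ok(normalize_eol(validate(input)?))
}
```

A document that already uses `\n` only passes through without any allocation, which is the case in which the slice reader keeps its zero-copy property.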

Mingun force-pushed the flexible-read-text branch from 5e7c0cf to c5506c3 on February 22, 2026 12:30
Mingun (Collaborator, Author) commented Feb 22, 2026

> Is that understanding correct?

Correct.

> I could imagine having a series of layers where

Although this is logical from an architectural point of view, it seems to me that one of the key performance features when parsing non-UTF-8 encoded documents is working with bytes, not characters. Since we only support encodings in which XML control bytes are encoded exactly as in ASCII (and in UTF-8), we can postpone the decoding step until it is actually needed. The same goes for the EOL normalization phase -- all encodings that we support are ASCII-compatible, so we can work with bytes and do normalization at the byte level too. The benefit of postponing all this work is especially noticeable when we skip a lot of events during processing. Naturally, there are also disadvantages -- we do not check the correctness of the encoded stream. It would be ideal to leave the validation step optional so that those who do not need it can turn it off.
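
The byte-level approach can be sketched as follows. The helper names are hypothetical and none of this is quick-xml code. ISO-8859-1 is chosen as the example encoding only because it can be decoded by hand without a dependency. The sketch assumes the encoding never uses the bytes 0x0D or 0x0A as part of another character, which holds for UTF-8 and ISO-8859-1.

```rust
use std::borrow::Cow;

/// Byte-level XML 1.0 end-of-line normalization, done before any decoding.
/// Assumes CR (0x0D) and LF (0x0A) are single bytes with their ASCII meaning.
fn normalize_eol_bytes(raw: &[u8]) -> Cow<'_, [u8]> {
    if !raw.contains(&b'\r') {
        return Cow::Borrowed(raw);
    }
    let mut out = Vec::with_capacity(raw.len());
    let mut i = 0;
    while i < raw.len() {
        if raw[i] == b'\r' {
            out.push(b'\n');
            // Skip the LF of a CR LF pair so it is not emitted twice.
            if raw.get(i + 1) == Some(&b'\n') {
                i += 1;
            }
        } else {
            out.push(raw[i]);
        }
        i += 1;
    }
    Cow::Owned(out)
}

/// ISO-8859-1 decoding written by hand: every byte maps to the Unicode
/// code point with the same value.
fn decode_latin1(raw: &[u8]) -> String {
    raw.iter().map(|&b| b as char).collect()
}

/// Hypothetical consumer: skipped events cost nothing beyond indexing;
/// only the requested one is normalized and decoded.
fn nth_text(events: &[&[u8]], wanted: usize) -> Option<String> {
    events
        .get(wanted)
        .map(|raw| decode_latin1(&normalize_eol_bytes(raw)))
}
```

Events that are skipped are never normalized or decoded, which is where the saving comes from. As the comment notes, nothing here checks that the byte stream is well-formed in its encoding.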

dralley (Collaborator) commented Feb 22, 2026

> Although this is logical from an architectural point of view, it seems to me that one of the key performance features when parsing non-UTF-8 encoded documents is working with bytes, not characters. Since we only support encodings in which XML control bytes are encoded exactly as in ASCII (and in UTF-8), we can postpone the decoding step until it is actually needed. The same goes for the EOL normalization phase -- all encodings that we support are ASCII-compatible, so we can work with bytes and do normalization at the byte level too. The benefit of postponing all this work is especially noticeable when we skip a lot of events during processing. Naturally, there are also disadvantages -- we do not check the correctness of the encoded stream. It would be ideal to leave the validation step optional so that those who do not need it can turn it off.

I don't mean that we would be working with char or anything. The point is that if you know everything is UTF-8, and the parser knows where the relevant separators are (<, >, etc.), then any "raw bytes" between those known boundaries can be assumed to be valid UTF-8 (unsafe fn std::str::from_utf8_unchecked()) without any additional performance cost.
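
As a rough illustration of that point, the sketch below validates the input once in bulk and then hands out `&str` slices between ASCII separators without re-validating them. `text_runs` is a hypothetical toy scanner, not a real XML parser and not quick-xml code. It only collects the runs found between a `>` and the next `<`.

```rust
use std::str::Utf8Error;

/// Returns the text runs found between '>' and the following '<'.
/// Toy scanner for illustration only: it ignores comments, CDATA, etc.
fn text_runs(doc: &[u8]) -> Result<Vec<&str>, Utf8Error> {
    // One bulk validation pass over the whole input.
    std::str::from_utf8(doc)?;

    let mut runs = Vec::new();
    let mut start: Option<usize> = None; // index just past the most recent '>'
    for (i, &b) in doc.iter().enumerate() {
        match b {
            b'>' => start = Some(i + 1),
            b'<' => {
                if let Some(s) = start.take() {
                    if s < i {
                        // SAFETY: `doc` was validated above, and '<' and '>'
                        // are ASCII, so they never occur inside a multi-byte
                        // UTF-8 sequence. `s` and `i` are therefore character
                        // boundaries and the slice is valid UTF-8.
                        runs.push(unsafe { std::str::from_utf8_unchecked(&doc[s..i]) });
                    }
                }
            }
            _ => {}
        }
    }
    Ok(runs)
}
```

The unchecked conversion is sound only because the bulk validation ran first. If that step were made optional, the slices would have to stay as bytes or be validated individually.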

The other goal behind pre-decoding would be to be able to support encodings like UTF-16 in the first place, as right now we could not do so.

I don't have recent hard data to back this up, but a foundational premise is also that the overall performance cost of validating OR decoding would be reduced dramatically by doing it in bulk rather than repeatedly on many small items, due to the nature of caches, vectorization, required allocations, etc.

> It would be ideal to leave the validation step optional so that those who do not need it can turn it off.

Anyway, that can be discussed further on my other draft PR, where there's something to look at.

dralley (Collaborator) left a comment

LGTM, my discussion is not strictly related

@Mingun Mingun merged commit 6238d8a into tafia:master Feb 22, 2026
7 checks passed
@Mingun Mingun deleted the flexible-read-text branch February 22, 2026 16:46