Serializations for mixed content documents #94
Closed
funkyfuture added the enhancement (New feature or request) and design (Proposals and discussion of API changes) labels on Sep 18, 2024
funkyfuture commented on Sep 18, 2024
funkyfuture commented on Sep 19, 2024
- moves example files to a separate folder
- emphasizes annotated lines
- fixes a glitch where i forgot to create a metadata container

also sets a lower resource default
JKatzwinkel previously approved these changes on Sep 24, 2024
JKatzwinkel approved these changes on Oct 30, 2024
funkyfuture commented on Oct 30, 2024
thanks! i'm resolving the conflicts locally and will push that to the main branch.
so, i'm mostly done with what started in #54. given how long this sat on my desk and in my drawers, and the several moments where i was under the impression that what i sought wasn't sanely doable, i'm very happy to finally be at this point.
imo it is sufficient to review these changes by studying and criticising:
my guess is that the latter one is functioning, as it continuously yielded errors, which were then fixed, on each code iteration over the 360k-something documents with a total volume of ~4.1GB. to be explicit: all these documents were parsed, one non-altered and two whitespace-altering variants were produced, each of these was reparsed (where the latter two received whitespace normalization as per the TEI recommendation) and finally compared successfully against the originating documents.
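to illustrate the shape of such a round-trip check, here is a minimal sketch. it is not the code from this branch: it uses only the standard library's `xml.etree.ElementTree` instead of delb, produces just one whitespace-altering variant, and the names `normalize`, `equivalent`, `alter_whitespace` and `roundtrips` are made up for this example. the normalization simply collapses whitespace runs to a single space, which is an assumed simplification of the TEI recommendation.

```python
import re
import xml.etree.ElementTree as ET

WHITESPACE_RUN = re.compile(r"\s+")


def normalize(text):
    """Collapse every whitespace run to one space; None becomes ''."""
    return WHITESPACE_RUN.sub(" ", text or "")


def equivalent(a, b, norm=lambda text: text or ""):
    """Recursively compare two elements; `norm` is applied to text and tail."""
    return (
        a.tag == b.tag
        and a.attrib == b.attrib
        and norm(a.text) == norm(b.text)
        and norm(a.tail) == norm(b.tail)
        and len(a) == len(b)
        and all(equivalent(x, y, norm) for x, y in zip(a, b))
    )


def alter_whitespace(root):
    """Replace each whitespace run in text and tail with a newline and indent."""
    for node in root.iter():
        if node.text:
            node.text = WHITESPACE_RUN.sub("\n    ", node.text)
        if node.tail:
            node.tail = WHITESPACE_RUN.sub("\n    ", node.tail)


def roundtrips(source):
    """Parse, serialize unaltered and altered, reparse both, compare."""
    original = ET.fromstring(source)

    # variant 1: serialized as-is, compared strictly after reparsing
    unaltered = ET.tostring(original, encoding="unicode")

    # variant 2: whitespace altered, compared after normalization
    altered_tree = ET.fromstring(source)
    alter_whitespace(altered_tree)
    altered = ET.tostring(altered_tree, encoding="unicode")

    return equivalent(original, ET.fromstring(unaltered)) and equivalent(
        original, ET.fromstring(altered), normalize
    )
```

calling `roundtrips("<p>Hello  <hi>big</hi> <hi>world</hi>!</p>")` exercises the mixed content case, where whitespace between inline elements is significant and must survive as at least one space. note that the default `ElementTree` parser drops comments and processing instructions, so this sketch says nothing about those.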
(just two unimportant insights from the process: if one had tried to achieve that based on lxml's data model, they'd certainly have gone nuts, and the `:=` operator can be a super powerful tool for concise expressions; what was all the fuss about?)

anyway, don't look too much at the implementation. its architecture is fundamentally wrong (we really need an event-based writer and some state-machine-ish connectors) and inefficient.
but the current structure allows targeted debugging, and that's what i did at length. i would consider this kind of a breakthrough (showing what is possible) and the establishment of a distinguishing feature for libraries that operate on the basic level. in that regard, you can pitch me other suitable libraries (regardless of their language) to include in the comparison.
hence i'd say the implementation is good enough to move on.
i promise not to force-push to this branch. but i may consolidate and merge it locally at the end.
please contact me directly if you'd like an in-person discussion.