Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduces XSD schema loading and validation wrappers #64

Closed
wants to merge 4 commits into from

Conversation

lweberk
Copy link

@lweberk lweberk commented Dec 11, 2019

In terms of level of abstraction of the API, it is kept as low as
possible without exposing intermediate structures which are required to
exist somewhere while validation process is in progress. Which is mostly
the Schema, wrapping over xmlSchemaPtr.

There are still improvements that can be made over some error handling
which will be coming as this module gets used in a downstream project.

In terms of level of abstraction of the API, it is kept as low as
possible without exposing intermediate structures which are required to
exist somewhere while validation process is in progress. Which is mostly
the Schema, wrapping over xmlSchemaPtr.

There are still improvements that can be made over some error handling
which will be coming as this module gets used in a downstream project.
@dginev
Copy link
Member

dginev commented Dec 11, 2019

@lweberk Thanks for a very notable upgrade! While I'm reading the code, do you feel upto quickly enumerating the changes as an addition to CHANGELOG.md and adding that to the PR? It's definitely worth publishing a 0.2.13 version of the crate, which I'll do once merged.

@lweberk
Copy link
Author

lweberk commented Dec 11, 2019

Sure thing

@dginev
Copy link
Member

dginev commented Dec 12, 2019

Sorry for the delay, I'll take a detailed look over the weekend - and I assume will merge - travelling days at the moment.

@lweberk
Copy link
Author

lweberk commented Dec 12, 2019

No worries. Take your time.

@lweberk
Copy link
Author

lweberk commented Dec 19, 2019

Anything on this one?

@dginev
Copy link
Member

dginev commented Dec 19, 2019

So far my only reaction is that you may want to run a cargo fmt --all and commit again, since you're not really using the rustfmt.toml conventions for the crate. It's one of those 2-vs-4-indent-spaces aesthetic details...

src/error.rs Outdated

/// Wrapper around xmlErrorPtr
#[derive(Debug)]
pub struct XmlStructuredError(*mut bindings::_xmlError);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familiar enough with the approach of "structured errors", which is my only hesitation in merging right away. Are these worth adopting for all errors produced by libxml2, or are they specific to the schema reporting? If the latter, they ought to be under libxml::schema::error, and if the former, I would need to think about which of the other error interfaces should be refactored to use these - and it they're the right abstraction.

You're also using a FFI-like name with XmlStructuredError, which is something nicely compartmentalized in bindings.rs -- all "higher" modules stay away from that naming convention. If these were to be the main error interface of the crate, simply naming the struct Error would be completely acceptable.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They represent a generalized way of reporting errors in libxml2. But to be frank I myself find the whole error handling thing in libxml2 a little confusing and difficult to work with. I get the distinct feeling, that it also grew organically out of the necessities of each module being implemented along the way.

Perhaps we should dig deeper and come up with the cleanest unified approach possible.

As for changing the name, sure thing. No problem. On the other hand I'd prefer not to go for "Error" for it being to generic. Would "StructuredError" be ok? Error is something more reserved to error handling the Rust way, which is not the case unless we make it so.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing; would you be opposed to me opening an issue for discussion on the level of abstraction being aimed at in this library wrapper? I'm currently having trouble getting a clear distinction between high and low levels and what they are. It feels a little awkward. Sometimes the wrapping is high level sometimes low. Perhaps it would be sensible to split the project into two layers?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I was just discussing the low vs high levels in #60 , and we may indeed benefit from a discussion and a convention. This wrapper is also growing organically, and is starting to try and service several competing use patterns, so it may be a good time to take a step back and discuss.

Copy link
Member

@dginev dginev Dec 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also StructuredError is a less generic name than Error, but it is also a somewhat poor name from the libxml2 folks, more confusing than informative.

It is also available in bindings.rs even in master, so I am unsure what benefit there is to draw from recreating it as part of the wrapper interfaces - might as well unify the error-handling for the whole crate before we move forward here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I was just discussing the low vs high levels in #60 , and we may indeed benefit from a discussion and a convention. This wrapper is also growing organically, and is starting to try and service several competing use patterns, so it may be a good time to take a step back and discuss.

Ok, I'll draft a proposal for discussion in a new issue. Will take some time, since I do not want to jump into discussion without having a clear picture on the topic myself.

Also StructuredError is a less generic name than Error, but it is also a somewhat poor name from the libxml2 folks, more confusing than informative.

Agreed.

It is also available in bindings.rs even in master, so I am unsure what benefit there is to draw from recreating it as part of the wrapper interfaces - might as well unify the error-handling for the whole crate before we move forward here.

You are right. Clarity in the way forward might help to answer the question whether it is a useful approach or not.

@lweberk
Copy link
Author

lweberk commented Dec 20, 2019

So far my only reaction is that you may want to run a cargo fmt --all and commit again, since you're not really using the rustfmt.toml conventions for the crate. It's one of those 2-vs-4-indent-spaces aesthetic details...

Sorry did not see the rustfmt file. Sure thing. Will do.

Leonhard Weber added 2 commits December 20, 2019 09:04
to stay away from naming schema used by the bindings::*
@dginev
Copy link
Member

dginev commented Dec 20, 2019

Least pleasant feedback last - it also looks like the memory model has to be carefully analyzed and valgrind be made happy.

I'll check if I can quickly spot the cause.

Here are the couple of minor leaks revealed by running the test:

$ valgrind --leak-check=full target/debug/schema_tests-576da011e655fa22

==28136== 
==28136== HEAP SUMMARY:
==28136==     in use at exit: 16,897 bytes in 193 blocks
==28136==   total heap usage: 631 allocs, 438 frees, 190,142 bytes allocated
==28136== 
==28136== 56 (8 direct, 48 indirect) bytes in 1 blocks are definitely lost in loss record 138 of 193
==28136==    at 0x483A7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==28136==    by 0x15F99B: alloc::alloc::alloc (alloc.rs:84)
==28136==    by 0x15F90B: alloc::alloc::exchange_malloc (alloc.rs:206)
==28136==    by 0x15A3A5: new<alloc::rc::Weak<core::cell::RefCell<alloc::vec::Vec<libxml::error::StructuredError>>>> (boxed.rs:123)
==28136==    by 0x15A3A5: libxml::schemas::parser::SchemaParserContext::from_raw (parser.rs:82)
==28136==    by 0x11BE80: libxml::schemas::parser::SchemaParserContext::from_buffer (parser.rs:45)
==28136==    by 0x11B570: schema_tests::schema_from_string (schema_tests.rs:43)
==28136==    by 0x119CE9: schema_tests::schema_from_string::{{closure}} (schema_tests.rs:37)
==28136==    by 0x11999D: core::ops::function::FnOnce::call_once (function.rs:227)
==28136==    by 0x126B9E: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once (boxed.rs:922)
==28136==    by 0x1755A9: __rust_maybe_catch_panic (lib.rs:80)
==28136==    by 0x1413AD: try<(),std::panic::AssertUnwindSafe<alloc::boxed::Box<FnOnce<()>>>> (panicking.rs:271)
==28136==    by 0x1413AD: catch_unwind<std::panic::AssertUnwindSafe<alloc::boxed::Box<FnOnce<()>>>,()> (panic.rs:394)
==28136==    by 0x1413AD: test::run_test::run_test_inner::{{closure}} (lib.rs:1413)
==28136==    by 0x11C414: std::sys_common::backtrace::__rust_begin_short_backtrace (backtrace.rs:126)
==28136== 
==28136== 56 (8 direct, 48 indirect) bytes in 1 blocks are definitely lost in loss record 139 of 193
==28136==    at 0x483A7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==28136==    by 0x15F99B: alloc::alloc::alloc (alloc.rs:84)
==28136==    by 0x15F90B: alloc::alloc::exchange_malloc (alloc.rs:206)
==28136==    by 0x15B854: new<alloc::rc::Weak<core::cell::RefCell<alloc::vec::Vec<libxml::error::StructuredError>>>> (boxed.rs:123)
==28136==    by 0x15B854: libxml::schemas::validation::SchemaValidationContext::from_raw (validation.rs:104)
==28136==    by 0x15B3C7: libxml::schemas::validation::SchemaValidationContext::from_parser (validation.rs:41)
==28136==    by 0x11B5BB: schema_tests::schema_from_string (schema_tests.rs:44)
==28136==    by 0x119CE9: schema_tests::schema_from_string::{{closure}} (schema_tests.rs:37)
==28136==    by 0x11999D: core::ops::function::FnOnce::call_once (function.rs:227)
==28136==    by 0x126B9E: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once (boxed.rs:922)
==28136==    by 0x1755A9: __rust_maybe_catch_panic (lib.rs:80)
==28136==    by 0x1413AD: try<(),std::panic::AssertUnwindSafe<alloc::boxed::Box<FnOnce<()>>>> (panicking.rs:271)
==28136==    by 0x1413AD: catch_unwind<std::panic::AssertUnwindSafe<alloc::boxed::Box<FnOnce<()>>>,()> (panic.rs:394)
==28136==    by 0x1413AD: test::run_test::run_test_inner::{{closure}} (lib.rs:1413)
==28136==    by 0x11C414: std::sys_common::backtrace::__rust_begin_short_backtrace (backtrace.rs:126)
==28136== 
==28136== LEAK SUMMARY:
==28136==    definitely lost: 16 bytes in 2 blocks
==28136==    indirectly lost: 96 bytes in 2 blocks
==28136==      possibly lost: 0 bytes in 0 blocks
==28136==    still reachable: 16,785 bytes in 189 blocks
==28136==         suppressed: 0 bytes in 0 blocks

@lweberk
Copy link
Author

lweberk commented Dec 20, 2019

Oha, looks like pointers are being leaked. Let me have a look.

@dginev
Copy link
Member

dginev commented Dec 20, 2019

Hey @lweberk just found a great simplification that also leaks nothing under valgrind. You can actually frame the errors without any of the Rc<RefCell<>> machinery at all, as follows:

    let mut errors: Vec<StructuredError> = Vec::new();

    unsafe {
      bindings::xmlSchemaSetParserStructuredErrors(
        parser,
        Some(common::structured_error_handler),
        errors.as_mut_ptr() as *mut c_void,
      );
    }
    //...

and the errlog field is just the minimal:

  errlog: Vec<StructuredError>,

This reframing keeps things simple in C land - and I don't think you ever need to clone errlog anywhere really. You could also rewrite the handler as:

pub fn structured_error_handler(ctx: *mut c_void, error: bindings::xmlErrorPtr) {
  let vec_mut = ctx as *mut Vec<StructuredError>;
  if let Some(vec_ptr) = unsafe { vec_mut.as_mut() } {
    let error = StructuredError::from_raw(error);
    vec_ptr.push(error);
  } else {
    panic!("Underlying error log should not have outlived callback registration");
  }
}

Which avoids any ownership questions, as you only work on the pointer.
I'm guessing the lost feature is what RefCell's ".borrow_mut()" gives you, which is a more readable runtime panic - but I wonder if it's truly worth the complexity at that end? Wouldn't it be cleaner to guard the SchemaParserContext and SchemaValidationContext so that only a single mutating call could be executed at a time? That is in fact very easy to do, even already done - one only needs the methods to require &mut selfas the receiver, in which case you could only execute a single op at a time.

@lweberk
Copy link
Author

lweberk commented Dec 21, 2019

There is something that worries me about that approach;

Modifying the vector may cause its buffer to be reallocated, which would also make any pointers to it invalid.

Since xmlSchemaValidCtxt keeps the pointer for as long as it lives, the chances of the underlying vector buffer to get invalidated are arbitrary (not guaranteeable). It is not the Vec that gets its type erased and reinterpreted but actually the underlying buffer. Though the approach does give a lovely way of removing all that complexity, it makes a ton of assumptions.

Will try to elaborate on your proposal, perhaps there is a close enough way at will allow us to remove all that RefCell machinery.

@dginev
Copy link
Member

dginev commented Dec 21, 2019

Could you actually devise a test for this reallocation scenario, so that we have a clear target of the behavior we're guarding against? It sounds a bit like a vague threat at the moment ... And I would always be tempted to patch the memory leak I see against potential undefined behavior I can't emulate.

Thanks for looking into the details again!

@lweberk
Copy link
Author

lweberk commented Dec 23, 2019

Could you actually devise a test for this reallocation scenario, so that we have a clear target of the behavior we're guarding against? It sounds a bit like a vague threat at the moment ... And I would always be tempted to patch the memory leak I see against potential undefined behavior I can't emulate.

Unfortunately its not a potential memory leak, but a segfault due to access after free. Sorry have not been able to get to it yet.

@cbarber
Copy link
Contributor

cbarber commented Jan 16, 2020

@lweberk @dginev thank you both for your effort implementing schema validation support. I've resolved the memory leak issue and added some test coverage here: #67

@lweberk
Copy link
Author

lweberk commented Jan 16, 2020

@cbarber Thanks a lot. I learned a lot from your changes. I'm still relatively new to the dark arts of FFI in Rust :D.

@dginev
Copy link
Member

dginev commented Jan 16, 2020

Lovely, moving to a review in #67 then, thanks again to everyone for the contributions!

@dginev dginev closed this Jan 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants