Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve/Fix the get method to accept a Reader #104

Open
Aargonian opened this issue Dec 6, 2024 · 1 comment
Open

Improve/Fix the get method to accept a Reader #104

Aargonian opened this issue Dec 6, 2024 · 1 comment

Comments

@Aargonian
Copy link

Aargonian commented Dec 6, 2024

Hello! I see a potential issue with the current interface for get that I think should be solvable. I have noticed the get function the crate seems to be based around only accepts an &[u8]. I'm not as super familiar with Rust at this level yet as I would like to be, so forgive me if I have misunderstood, but that implies I need to read (potentially) the entire file into memory before it can be matched. Looking at the implementation for get_from_file(), this appears to be what that function abstracts.

#58 makes note that all tests seem to pass when test files have been truncated to only 4kb, even if the files themselves are now invalid. DOC/PPT/XLS Seem to be the exceptions as they are parsed rather than read as binary.

Of course, that is a much easier ask, to read only 4kb, but there is no guarantee this will hold into the future, and the crate interface does not make clear that this is all that may be needed to identify a file.

My thought is it would certainly be much better if we could either pass an iterator or Reader of some kind to get instead, allowing us to control the amount of data kept in memory. Even better, it seems like it may be possible (though perhaps not trivial/easy, I cannot say with my current understanding) to add some information such as the "max required size" to each matcher, denoting that the matcher should never need more than X bytes to determine if the file matches or not. Obviously this does not fly for all file types (although, again, the previous pull request seems to imply it works for pretty much all currently supported types), but that can be abstracted behind some enum or other interface for those looking to use the optimization. Perhaps by making the matcher a trait with an associated constant, the default of which is 0 to denote requiring the whole file?

@bojand
Copy link
Owner

bojand commented Dec 8, 2024

I think #58 is really a separate issue in some regards... I am not a fan of arbitrarily truncating all the test files, as to me philosophically, they should be real world representations of test data... So then the question is whether the test data should be included when publishing or not... Originally they were not included, but then they were required for packaging (see #47 and #54), so they were included. Unfortunately I am not aware of best practice or prescribed approach to this, unless something has changed. Quick search yields similar questions on the matter [1, 2, 3].

With respect to improving the API, yup I've had this idea for a while, just haven't been able to get to it.

Yes we should support a Reader and that would improve the API and make it more flexible.

Along with that, if we did support dynamic reading of data, then we could also define the minimum required length for each matcher where we know, and also know how much we've read, and and if we need to, we can read more from the input data.

In fact there are some branches where I experimented with some of this...

Another option that's possible which doesn't seem like it's in any of the branches, and should be backwards compatible I think, is something like this:

use std::fs;
use std::io;
use std::io::Read;

fn main() -> std::io::Result<()> {
    let bytes = fs::read("sample.exe").unwrap();
    println!("is_exe: {}", is_exe(&bytes));
    println!("is_exe_read: {}", is_exe_read(&bytes[..])?);

    let buf = [0xFF, 0xD8, 0xFF, 0xAA];

    println!("is_exe: {}", is_exe(&buf));
    println!("is_exe_read: {}", is_exe_read(&buf[..])?);
    Ok(())
}

macro_rules! build_fn_api {
    (
        $(#[$outer:meta])*
        ($name:tt, $impl_fn:ident, $sz:literal, $sz_check:literal)
    ) => {

        $(#[$outer])*
        pub fn $name(buf: &[u8]) -> bool {
            if ($sz_check && buf.len() <= $sz) {
                return false;
            }

            match $impl_fn(buf) {
                Ok(val) => val,
                Err(_) => false,
            }
        }
    };
}

/// Returns whether a data from reader is an EXE.
/// DLL and EXE have the same magic number, so returns true also for a DLL.
///
/// # Examples
///
/// ```rust
/// use std::fs;
/// use std::io::prelude::*;
/// use std::fs::File;
///
/// fn main() -> std::io::Result<()> {
///     let mut f = File::open("testdata/sample.exe")?;
///     let exe = infer::app::is_exe_read(&mut f).unwrap();
///     assert!(exe);
///     Ok(())
/// }
/// ```
fn is_exe_read<T: Read>(reader: T) -> io::Result<bool> {
    let mut buf = [0; 2];
    reader.take(2).read_exact(&mut buf)?;
    Ok(buf.len() > 1 && buf[0] == 0x4D && buf[1] == 0x5A)
}

build_fn_api!(
    /// Returns whether a buffer is an EXE. DLL and EXE have the same magic number, so returns true also for a DLL.
    ///
    /// # Example
    ///
    /// ```rust
    /// use std::fs;
    /// assert!(infer::app::is_exe(&fs::read("testdata/sample.exe").unwrap()));
    /// ```
    (is_exe, is_exe_read, 2, true)
);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants