
Roadmap for v1.6.0 #145

Open
kspalaiologos opened this issue Dec 15, 2024 · 7 comments

@kspalaiologos
Owner

kspalaiologos commented Dec 15, 2024

This ticket is meant to be a collective TODO list for the v1.6.0 release, covering all the major features that I am planning.

  • Bite into supporting machine-specific and OS-specific code. Copy over some code from xpar - particularly the NASM detection in the m4 files, the CRC32C implementations and the CPU feature detection. We could gate it behind the same architecture-specific flag options as xpar. Unfortunately, xpar is currently packaged only for NixOS, where, due to NixOS specifics, machine-specific optimisations can never execute. As such, it is actually more portable(!) to add assembly source units to the program. Not to mention the problems with compilers miscompiling the hot loops.
    • Add hardware CRC32 support.
    • Add some code to detect the number of available processors to use with -j 0. We want a C analogue of std::thread::hardware_concurrency(). Maybe determine the number of CPUs by task affinity (sched_getaffinity - Linux-specific), sysconf (GNU-only), get_nprocs (also GNU), or maybe read /proc/cpuinfo... Another possibility is pthread_getaffinity_np, or sched_getaffinity_np on NetBSD 5+; on Windows we would want GetProcessAffinityMask. In practice, sysconf appears to work on glibc, Mac OS X 10.5, FreeBSD, AIX, OSF/1, Solaris, Cygwin and Haiku. HP-UX would require pstat_getdynamic, IRIX uses sysmp; as a fallback on Windows we could use GetSystemInfo. There is possibly some m4 code of interest when it comes to making heads or tails of this mess. (A rough sketch follows after this list.)
    • Improved work-stealing concurrency for parallel encoding and decoding in the CLI tool. Currently we read the blocks in one go, then encode them in one go, and then write them back in a single whole stage. It would be an improvement to perform the encoding and decoding in parallel with the I/O operations, to get more mileage out of parallelism on slow disks.
    • Preserve the ownership, permissions and atime/ctime/mtime metadata of input files in the output files. I am skeptical, because I don't want to spend a lot of time catering to portability and making sure that we also respect, for example, Windows ACLs. I think that doing so in certain cases requires elevated privileges too.
    • Memory-mapped I/O for faster operation in the CLI stub.
  • Stuff that requires a new format:
    • Undo the arithmetic coding stage if it doesn't yield satisfactory results. This way we store the data verbatim and, at some overhead in the encoding-speed department, we drastically improve the decode performance on incompressible segments. (See the fallback sketch after this list.)
    • Add a special "end of file" block marker that preserves data integrity on truncated streams.
    • Unify frame/CLI tool formats.
  • Miscellany
    • Document the current file format better, getting us a few steps closer to a 3rd party being able to produce a valid encoder or decoder independently of the current source code.
    • Investigate libcubwt.
    • Clean up the code.
  • Finished tasks:
    • Update the soname for ABI-breaking versions (DONE IN COMMIT f3b4730).
    • Rework the code to use yarg instead of the getopt_long shim (DONE IN COMMIT 249b173).
      • Potentially handle OOMs in yarg within the CLI tool.
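
A rough sketch of what the -j 0 processor detection above could look like, covering only a few of the listed targets (bz3_nproc is a hypothetical name, not an existing bzip3 API):

```c
/* Before any system header: needed for sched_getaffinity() / CPU_COUNT(). */
#if defined(__linux__)
#define _GNU_SOURCE
#include <sched.h>
#endif

#if defined(_WIN32)
#include <windows.h>
#else
#include <unistd.h>
#endif

/* A C analogue of std::thread::hardware_concurrency(); returns at least 1. */
static int bz3_nproc(void) {
#if defined(_WIN32)
    SYSTEM_INFO si;
    GetSystemInfo(&si); /* the Windows fallback mentioned above */
    return (int) si.dwNumberOfProcessors;
#elif defined(__linux__)
    /* Respect task affinity: a process pinned to 2 of 16 CPUs gets 2. */
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) == 0) return CPU_COUNT(&set);
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (int) n : 1;
#elif defined(_SC_NPROCESSORS_ONLN)
    /* glibc, Mac OS X 10.5+, FreeBSD, AIX, Solaris, Cygwin, Haiku, ... */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (int) n : 1;
#else
    return 1; /* HP-UX (pstat_getdynamic) and IRIX (sysmp) need own branches */
#endif
}
```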
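
Likewise, a minimal sketch of the verbatim-block fallback, under an invented one-flag-byte framing (not the real bzip3 format; `coder` stands in for the arithmetic-coding stage):

```c
#include <stdint.h>
#include <string.h>

enum { BLOCK_CODED = 0, BLOCK_VERBATIM = 1 };

/* Run the coder, then keep whichever representation is smaller. dst must hold
   one framing byte plus the worst-case coder output. Returns bytes written. */
static size_t encode_block(const uint8_t * src, size_t src_len, uint8_t * dst,
                           size_t (*coder)(const uint8_t *, size_t, uint8_t *)) {
    size_t coded = coder(src, src_len, dst + 1);
    if (coded >= src_len) {
        /* Incompressible: store verbatim, so the decoder can memcpy()
           instead of running the arithmetic decoder. */
        dst[0] = BLOCK_VERBATIM;
        memcpy(dst + 1, src, src_len);
        return 1 + src_len;
    }
    dst[0] = BLOCK_CODED;
    return 1 + coded;
}
```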
@Sewer56
Contributor

Sewer56 commented Dec 15, 2024

> It would be an improvement to perform the encoding and decoding in parallel with the I/O operations, to get more mileage out of parallelism on slow disks.

If you were to ask me, ideally, on supported platforms you would want to decode directly into memory-mapped files. That's both the easiest and the simplest approach.

Technically the data may not hit the disk immediately, but it will be available for use immediately, and the user sees a faster completion (negligibly so, unless it's an HDD or a huge block). That end-of-file marker would be useful here.
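
A minimal POSIX sketch of that idea: grow the output file to its final size, map it, and let the decoder write straight into the mapping (map_output is a made-up helper; this assumes the total decoded size is known up front, and on Windows you'd go through CreateFileMapping/MapViewOfFile instead):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static void * map_output(const char * path, size_t decoded_size, int * fd_out) {
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return NULL;
    /* Extend the file to its final size so the mapping is fully backed. */
    if (ftruncate(fd, (off_t) decoded_size) != 0) { close(fd); return NULL; }
    void * p = mmap(NULL, decoded_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return NULL; }
    *fd_out = fd;
    return p; /* decode blocks directly into p; munmap() + close() when done */
}
```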

@kspalaiologos
Owner Author

kspalaiologos commented Dec 15, 2024

@Sewer56 TL;DR: everything you mention is what xpar already does. It just lacks use and evaluation for me to tell whether it's disruptive to people or not.

To give you some more context: I started working on bzip3 when I was in high school (2022). Being a schoolkid, I had plenty of time; now I have applied to a PhD program :-). The codebase has some rough edges, and I fixed most of the problems I have with its current shortcomings in xpar, but I am simply wary of introducing changes that could hose the whole thing for someone. As the other project lacks review and evaluation, especially from distro maintainers, I am not entirely sure whether I can move all the nice stuff from there over here.

So what is blocking us is not knowing whether the changes will break someone's workflow or introduce unnecessary complexity that will be difficult for maintainers or users to work around. Writing the code is the easy part.

@Sewer56
Contributor

Sewer56 commented Dec 15, 2024

I cannot be expected to know the details of your other projects 😅, so I was commenting on what came to mind without that knowledge.

Edit: the response above was edited with more context.

@Sewer56
Contributor

Sewer56 commented Dec 15, 2024

Hey, you're doing pretty well :p

Although I like being involved in making the cutting edge, I never felt like academia was for me, personally at least. I think I just don't enjoy crunching through papers enough, haha.

I personally started by tinkering with games from a young age: first as an end user, and then I learned to code via reversing, to data-mine. I can also get a bit competitive, hence the emphasis on optimization.

In any case, I've not previously packaged software for distros (I generally do libraries); so I'm not very experienced in the subject matter.

So, is the concern interop/user confusion? I.e., produce a file on one distro, then read it on another with an outdated bzip3. Or, as another example, a bash script which pipes the output of bzip3 to another program, and the other program gets a bytestream it doesn't understand because it suddenly received a newer format.

I believe the Unix philosophy is 'do one thing and do it well' (plus the other two relevant points). So if that's the worry, I think the approach here would be a separate CLI tool/command for the new format, e.g. bzip3v2 or bz3v2.

@kspalaiologos
Owner Author

kspalaiologos commented Dec 15, 2024

> In any case, I've not previously packaged software for distros (I generally do libraries); so I'm not very experienced in the subject matter.

The problem is that every distro comes with its own set of hacks and limbs and hairs that make everything more difficult than it should be.

Easy example: memory mapping. Some environments (all the Windows GNUs like Cygwin or MinGW) provide MapViewOfFile and mmap simultaneously. More of the same:

  • Some environments don't support AT&T assembly syntax (Intel Macs with Xcode).
  • Conventions for CPU feature detection differ: you can't just use CPUID, because the kernel could have disabled features that the CPU technically supports. E.g. aarch64 is a very particular kind of wild west - sysctlbyname, getauxval, etc. - all disjoint.
  • macOS compilers/linkers do funny stuff with exports, prefixing symbols with an underscore (_symbol); other platforms don't.
  • With separate assembly units you need the .note.GNU-stack section to mark the stack as non-executable on ELF targets.
  • Windows translates newlines on stdin/stdout, so we need a hack to allow pipe operation there.
  • Windows has _commit, Linux & co. have fsync - and sometimes there's both.
  • Windows and Linux use different calling conventions (the MS ABI and the SysV ABI), and compiler support for enforcing a convention is bad, because C code running on Windows has to interface with the libc/WinAPI, which uses the MS ABI. This breaks assembly code in some circumstances and requires more unportable hacks.
  • Querying the number of CPUs sucks, and every target has its own function (there are about 12 in total that I counted).
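
Two of those hacks in miniature (illustrative names, not bzip3's actual code):

```c
#include <stdio.h>
#if defined(_WIN32)
#include <fcntl.h>
#include <io.h>
#else
#include <unistd.h>
#endif

/* Windows opens stdin/stdout in text mode and translates \n <-> \r\n,
   which corrupts compressed data piped through them. */
static void set_binary_stdio(void) {
#if defined(_WIN32)
    _setmode(_fileno(stdin), _O_BINARY);
    _setmode(_fileno(stdout), _O_BINARY);
#endif
}

/* _commit() vs. fsync(): flush file contents to stable storage. */
static int sync_fd(int fd) {
#if defined(_WIN32)
    return _commit(fd);
#else
    return fsync(fd);
#endif
}
```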

If you are tempted not to use assembly, thinking that intrinsics are enough, you are deeply mistaken. I made xpar about 4 times faster by rewriting its hot loops in assembly. Compiler optimisations are brittle, distros will insist on using ancient compilers at low optimisation settings (e.g. -O2 -march=generic), etc. - negating all of the work you put into making the damn thing fast.


Etc., etc...

If I once again have to figure out why the thing segfaults specifically on FreeBSD 13.0 mipsel64, without any feedback from the maintainers, the amount of grey hair on my head will quadruple - and I don't want to inconvenience the users because of packaging problems. All of these platform-specific hacks add liability, because I can't reasonably test them all; they introduce an exponential blowup in the number of possible software configurations. On top of that, I don't want to shift the liability of testing onto the maintainers, because they never do it, and if there's a breaking problem on their specific platform they might be tempted to inexpertly fix and patch it themselves instead of asking upstream for help. Which is dangerous for a compression program.

@kspalaiologos
Owner Author

If I were to release a new format version, then bzip3 -d and the frame APIs would automatically detect which decoder to dispatch to, while bzip3 -e would encode in the new format and bzip3 -e --legacy-format would encode in the legacy format.
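
For illustration, the dispatch could be as simple as sniffing the magic bytes ("BZ3v1" is the current frame signature; "BZ3v2" is a made-up magic for the hypothetical new format):

```c
#include <stdio.h>
#include <string.h>

typedef enum { FMT_LEGACY, FMT_NEW, FMT_UNKNOWN } bz3_fmt;

static bz3_fmt detect_format(FILE * in) {
    char magic[5];
    if (fread(magic, 1, 5, in) != 5) return FMT_UNKNOWN;
    if (!memcmp(magic, "BZ3v1", 5)) return FMT_LEGACY; /* current format */
    if (!memcmp(magic, "BZ3v2", 5)) return FMT_NEW;    /* hypothetical */
    return FMT_UNKNOWN;
}
```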

@Sewer56
Contributor

Sewer56 commented Dec 15, 2024

Ah, I see. So you're trying to support all of the targets directly from upstream: pure C, no higher-level abstraction library, no other abstractions, straight to the target.

Yeah, that's a lot of portability caveats; but it makes sense, since you've gone all the way to standardize the format, even. You have my condolences; I've been in similar situations. Sometimes fixing something for macOS aarch64, for instance, can be massively painful without the hardware; so I rely on CI to run that environment, and it usually takes 3-4 minutes to get feedback on any change.

I also get the appeal of writing code without overhead. I wrote a Rust library the other week for opening handles and mmap(s) just to save around 3-8 KB of code, because std's implementation internally used a builder pattern and therefore included unnecessary logic. (It frustrated me; I don't like overhead.)

Intrinsics are hit and miss: they may introduce extra bloat, or they may not; it just depends on your luck. If you're working with C, I guess you have it extra bad, since there are many compilers and some distros will ship ancient ones; that is a yikes. You're better off with ASM at that point, even if it looks fine with your Clang.

The calling conventions, I imagine, are awkward. I'm not sure how far your options go for forcing a calling convention in a compiler-agnostic way in C. Even if you can set it, I imagine you'd need an ugly hack, like making a function with forced no-inline and a different convention, just to perform a 'switch' for a small amount of 'context'.

On that note, I've never really understood the whole 'shadow space' thing in the Microsoft x64 calling convention. It's a pain to work with: not only must the stack be aligned to 16 bytes (presumably for SSE), but there must also be at least 32 bytes reserved before the return address, damn. While I understand it's supposed to make debugging easier by giving the callee somewhere to spill the 4 register arguments, it's just unnecessary overhead for release builds: function calls now require an additional stack allocation (sub rsp) in a world where register spills aren't that common.

On a fun note, I wrote a small JIT that generates optimal stubs for converting between calling conventions as part of my 80%-done WIP cross-platform, cross-architecture hooking library (on a bit of a long hiatus, though, in favour of other, more urgent projects). I don't have any background in compiler construction whatsoever, but that's one part that came out quite well.

> This breaks assembly code in some circumstances and requires more unportable hacks.

I actually ran across this just last week.
I was making Rust bindings for 7-Zip's LZMA decoder, which features some assembly files for a decode speedup. Those assembly files have a bunch of defines specifically for handling the various ABIs (it's a small mess) and require you to install a NASM-like assembler that preprocesses the defines. On some setups a dev might even need to self-compile a compatible assembler - pain.

At that point, thinking about how much of a hassle it would be for the bindings' users (in this case roughly equivalent to maintainers, I suppose), I just chose to pre-compile every single variation myself and link against that. So I totally understand the pain.
