fix Android support #3396

jtracey · 2022-04-13T17:34:47Z

This PR is to fix and improve Android support. There are some final points that still need discussion that I'll add in another comment, but it's basically done.

I've broken the PR into four commits, each of which can, if need be, be broken out into their own PRs, but make the most sense together.

Commit 1 fixes a potential null pointer deref from an assumption that the getpw* family of syscalls always returns valid C strings or errors, when null pointer strings are also a possibility. So far, Android is the only platform I know of that does this.
Commit 2 is the result of going through and finding everywhere Android diverges from Linux, figuring out why, and patching to accommodate those differences or lack thereof. This includes fixing everything preventing Android builds (modulo Commit 3), and enabling/modifying tests to take Android into account. There are a few tests that don't pass yet because some design decisions need to be made, which is the main point that needs discussion.
Commit 3 performs some fixes that are specific to libc/nix with Android on 32 bit arches, which have some weird edge cases not found on other platforms (it's possible some are upstream bugs, but for now they're easy to work around).
Commit 4 adds Android to the CICD to make sure it stays supported into the future. It won't be able to check that Android maintains parity with Linux, so e.g. someone may add a Linux-specific performance enhancement they forgot to also enable on Android, but it will make sure that builds continue to work and all the enabled tests pass.

In case anyone reading isn't familiar: Android is built on the Linux kernel, but is set up very differently from most Unix environments. Perhaps most notably, each application is installed as and runs under its own Unix user, with very few permissions. Access control is supposed to be done through Android's permissions framework, not the classic file mode bits or capabilities. That said, there is a shell environment, with access to a Unix-style interface, but a lot of things one might expect to work, don't. E.g., hard links are disabled for everyone but root (and even then, won't work in shared directories), there is no privilege escalation mechanism like sudo/su by default, parts of the standard Unix directory layout aren't accessible (e.g., /proc) or don't exist at all (e.g., /tmp).

The last commit is somewhat hacky. There isn't a lot of precedence for CI of terminal-based applications on an Android emulator. The biggest problem is the limitations of the default shell, which would be difficult to set up an environment to run the tests in. Instead, we install the Termux app, which provides something more closely approaching a typical Linux environment (including the GNU coreutils). One of the most useful things Termux provides is a package manager, which allows us to install dependencies, as well as rust/cargo itself, so we can test building on the emulated device (rustup doesn't seem to work). However, there is no mechanism for using the Termux shell directly, so we use the debug bridge to enter keyboard strokes, then query for files to tell when the commands terminate, what their exit codes were, etc.. One adjustment to this flow we could make would be adding a step setting up sshd in Termux, which would allow us to run commands in a more typical manner for a remote host. The reason I haven't done so is we would still need the key strokes + query style of interaction to set up sshd, and keeping to one method of interaction means less code to debug (preventing problems if future changes to Termux, the emulator, or the runner require reconfiguring ssh or something). However, if it is preferred, I can rewrite that commit to use that instead, and the key strokes + query stuff would be contained to the cache setup step. Which, on that note, the CI step is set up to cache the emulator image after installing Termux, Rust, and dependencies. This means we don't have to wait for that every time, we reduce downloads from the package repos, and we reduce noise from changing installed package versions. Any time we want to update the cache, we just need to change the cache key string, e.g. by using a different Termux version, or just adding something like "-v2" to the end.

The action we use to run the Android emulator is popular, with a high install rate (69th most popular action of 12,708 as of this writing) and a good list of notable projects relying on it (including projects from Google and the Android team itself). I have, however, found it to be a bit temperamental in getting it properly configured. Android versions that work on my local machine don't work with it, I couldn't get either ARM architecture working on it, and the shutdown step seems to hang most runs when performing the caching step. However, the configuration in the commit seems to work consistently (with the hang addressed by an unfortunate but effective pkill -9). I suspect the configuration won't need changing often, and the project is still actively maintained, so hopefully some of these issues are worked out by the time it does.

I tried to make sure to leave code comments on anything potentially confusing, but feel free to request more anywhere, or to ask questions here why something was necessary. Stay tuned for a comment on the minimum that still needs to be addressed before merging...

jtracey · 2022-04-13T17:35:56Z

Misc:

The Android CI job currently depends on fewer jobs than it should for consistency with the other platforms. I'll fix this once we're closer to merging, right now it's just convenient for the faster turnaround time.

Failing CI tests:

First, there's the quick and dirty blanket solution to discuss: just disable all of these on Android, at least for now. Simply not worrying about the fact that something could be tested on Android, but isn't, until someone actually encounters a bug as a result seems like a possibly acceptable solution, given how broken things are now anyway. The rest of this discussion assumes that's not the tack we're taking; I've listed "don't test on Android" as an option for the failing tests where I think that could make sense on an individual basis.

Second, there's the matter of root on Android. Some of the tests below could pass by only running them as the root user. We should make a decision on what our policy is on this -- namely, do we support rooted Android? Root shells are somewhat rare on Android, used almost exclusively by enthusiasts who want to tweak their ROMs in various ways. On the one hand, it seems excessive to cater support to such a small minority of a use case, but on the other, such enthusiasts are perhaps the most likely to need a well-tested set of coreutils on Android. Were we to decide to run these tests when root, we would also need to decide whether that should be accommodated in the CICD. I haven't yet looked at what it takes to get a root Termux shell in the emulator, but I suspect it would at least be possible.

Upstream bug in xattr: test_cp::test_cp_archive, test_cp::test_cp_archive_recursive: The xattr library intentionally fails at runtime for any platform not explicitly supported, and Android is not on the list of supported platforms. This is a bit of a tough call; the library hasn't been touched since 2018, and the only link in the README is broken, so I suspect it's just not maintained. Our options are:
- try to patch the upstream library: add Android as supported platform Stebalien/xattr#23
- remove the dependency, and implement xattr support ourselves
- don't support xattrs on Android
- don't test on Android for now, file a new PR to enable the tests if/when a new xattr release is available
Seccomp restrictions: test_date::test_date_set_permissions_error, test_chroot::test_enter_chroot_fails: These rely on syscalls blocked by Android's seccomp filter, triggering an unhandled SIGSYS. Our options are:
- handle SIGSYS signals and do something with them (most likely, print an error)
- remove the seccomp filter in the CICD emulator (would still fail on any device with the filter enabled, which is basically all of them)
- don't test on Android for now, leave the first option for a later PR
- don't test on Android ever, the SIGSYS crash is enough for a user to figure out the problem
Hard links: test_du::test_du_hard_link, test_link::test_link_existing_file, test_cp::test_cp_arg_link, test_cp::test_cp_link_backup: Android only supports creating hard links by the root user, and on non-shared directories (e.g., not /sdcard). Our options are:
- require that the test run as root on Android (and in a non-shared directory)
- don't test on Android
test_cp_overriding_arguments: Similar to "Hard links", but the test also tests options aside from ones involving links. Our options are:
- make --symbolic-link --link argument in this test conditional on not Android
- require that the test run as root on Android (and in a non-shared directory)
- don't test at all on Android
test_ln::test_backup_force: The test currently uses a hard link (see "Hard links" above), but I think if a symlink was used instead, it would still test what was intended -- the fact it's using a symlink then a hard link just made me cautious enough to double check first. Our options are:
- make the test use symlinks for both commands
- require that the test run as root on Android (and in a non-shared directory)
- don't test on Android
test_touch::test_touch_no_dereference: Comes from a bug in the filetime crate. I've filed a PR, which has already been merged. Our options are:
- wait for upstream to publish a new release, then bump in Cargo.toml
- don't test on Android for now, and file another PR once upstream is ready
Use of /dev/stdout: test_touch::test_touch_changes_time_of_file_in_stdout, test_ls::test_ls_devices: /dev/stdout doesn't exist or isn't accessible on some versions of Android (I think older versions, the API 28 emulator doesn't allow it but my API 30 device does). Our options are:
- track down the code accessing /dev/stdout and make it use /proc/self/fd/1 on Android (maybe Linux too)
- only support Android versions that have /dev/stdout (would mean figuring out why the more recent Android emulators aren't working in the Github environment and fixing it)
- check at runtime for /dev/stdout before testing
- don't test on Android
Cleared env in tests: test_install::test_install_and_strip, test_install::test_install_and_strip_with_program: install -s invokes an external call to "strip". Part of the process of setting up the test environment is to clear all environment variables. This means that $PATH becomes unset. I'm not sure how it gets reset before the tests are run, but whatever is resetting it seems to be giving the $PATH of the Android root shell, since strace shows a bunch of failed execve searches for "strip" in paths we don't have access to (e.g., /apex/com.android.runtime/bin/strip). I'd have to investigate further, but I suspect something similar is causing the BSD failure mentioned in the "FixME"s. Some possible solutions are:
- preserve the $PATH environment variable in the test environment
- preserve the $LD_PRELOAD environment variable in the test environment (Termux populates it with a library that fixes these sorts of things)
- either of the above on Android specifically

Failing non-CI tests:

These are tests that pass on the CI environment, but fail on at least one emulator or device. They're included here mostly for completion's sake, I don't think it should be necessary to resolve them before merging, and can be broken out into their own issues. If you'd like me to change anything to fix them now though, feel free to say so.

Empty but non-0 sized files: test_du::test_du_h_flag_empty_file, test_ls::test_ls_allocation_size, test_ls::test_ls_long_total_size: On one emulator image, empty files take 4K according to du (ours and GNU's). Something similar happens with the two ls tests.
test_du::test_du_time: Same as above, but could easily be fixed by checking for a contained string instead of a matching string.
test_du::test_du_threshold: Similar to the above, the problem here is that on these particular emulators, deeper_words.txt takes 8k, and the inode for the dir also takes 8K, bumping the total above the 16K, above 10K threshold. If deeper_words.txt were instead an empty file, we could bump the threshold to e.g. 14K; or we could increase the size of one of the files in links/ and bump the threshold to something else between 16K and whatever it is.
CPU counts: No failing tests right now, but our nproc is reporting the wrong number of CPUs on my physical device. This is because of an upstream bug in the num_cpus crate. See How to know the number of big and little cores in ARM big.LITTLE architecture? seanmonstar/num_cpus#115 and Should hyperthreads be counted when disabled? seanmonstar/num_cpus#118 for more, but the tl;dr is that most arm devices these days use big.LITTLE architectures, where there are two sets of cores (one for high power, one for low power), only one of which is active at any time. The library doesn't support querying for the actual number of usable CPUs (i.e., discounting disabled cores), so reports the wrong number.

sylvestre · 2022-04-13T19:56:58Z

wahou, terrific, bravo :)

sylvestre · 2022-04-14T20:00:52Z

I think you can ignore "root test" :)

jtracey · 2022-04-14T20:51:53Z

@sylvestre Noted, thanks. :) Could you also confirm that the test_ln::test_backup_force test is fine using symlinks for both commands? Here's the original commit.

sylvestre · 2022-04-14T21:10:28Z

Probably? :)

jtracey · 2022-04-15T00:20:21Z

I've filed a PR to fix the xattr problem upstream: Stebalien/xattr#23

jtracey · 2022-04-16T07:52:00Z

It's alive! \o/

I went ahead and picked what I thought were reasonable options for the remaining blockers (and also fixed one more problem uncovered: on x86, only 32 bit ELF files can be read by strip by default; I swapped out the relevant test file, since x86-64 can do both). Feel free to request changes to any of those decisions, or anything else.

sylvestre · 2022-04-16T13:26:12Z

I will have look asap.
Would it be possible to add code coverage in the ci too? Thanks

jtracey · 2022-04-16T17:50:02Z

Unfortunately, I don't think code coverage is possible yet. As I mentioned before, rustup doesn't seem to work on Android right now, so we're using what's available in the package manager. It's kept up to date, e.g. it has rust 1.60.0, but that means we can't use nightly features/flags. Rust 1.60.0 did add support for code coverage profiling, so I was hoping we could at least get something, but it didn't work:

$ RUSTFLAGS="-C instrument-coverage" cargo build
   Compiling libc v0.2.121
   Compiling cfg-if v1.0.0
error[E0463]: can't find crate for `profiler_builtins`
  |
  = note: the compiler may have been built without the profiler runtime

For more information about this error, try `rustc --explain E0463`.
error: could not compile `cfg-if` due to previous error
warning: build failed, waiting for other jobs to finish...
error: build failed

It sounds like this is somewhat expected, based on the documentation on the profiler. So we'll need to wait for (or fix ourselves) either rustup support, or a rustc with profiling enabled in the package manager.

sylvestre · 2022-04-20T06:44:31Z

Do you want to land this now and iterate from it? thanks

The code for creating a Passwd from the fields of the raw syscall result assumed that the syscall would return valid C strings in all non-error cases. This is not true, and at least one platform (Android) will populate the fields with null pointers where they are not supported. To fix this and prevent the error from happening again, this commit changes `cstr2string(ptr)` to check for a null pointer, and return an `Option<String>`, with `None` being the null pointer case. While arguably it should be the caller's job to check for a null pointer before calling (since the safety precondition is that the pointer is to a valid C string), relying on the type checker to force remembering this edge case is safer in the long run.

jtracey · 2022-04-20T17:22:21Z

If it looks good to you, then yeah I think that's the way to go, so we don't have to keep looking out for new breakage.

I think I almost have code coverage working, I'll just file a new PR for that in the near future.

jtracey force-pushed the android2 branch from 6bed875 to 8fd067d Compare April 13, 2022 17:57

jtracey force-pushed the android2 branch from 8fd067d to 5270a4d Compare April 14, 2022 23:18

jtracey mentioned this pull request Apr 14, 2022

Bump to 0.2.16 alexcrichton/filetime#78

Merged

jtracey force-pushed the android2 branch 4 times, most recently from a700910 to 2fdf388 Compare April 16, 2022 07:10

jtracey marked this pull request as ready for review April 16, 2022 07:46

jtracey mentioned this pull request Apr 16, 2022

rust: enable profiler termux/termux-packages#10051

Merged

jtracey added 5 commits April 20, 2022 08:44

get android builds to compile and pass tests

2a0d58d

address libc weirdness on 32 bit android

1f025c1

add Android to CICD

4caeb2f

preserve LD_PRELOAD in the test env

67a01a5

sylvestre force-pushed the android2 branch from 2fdf388 to 67a01a5 Compare April 20, 2022 06:44

sylvestre merged commit 48a5212 into uutils:main Apr 22, 2022

This was referenced Apr 27, 2022

instrument-coverage doesn't quite work #3452

Open

cp: --archive support on Android #3477

Open

jtracey mentioned this pull request Sep 26, 2022

Android CI: regression on test_chroot::test_preference_of_userspec #3936

Closed

jtracey deleted the android2 branch October 10, 2024 18:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix Android support #3396

fix Android support #3396

jtracey commented Apr 13, 2022

jtracey commented Apr 13, 2022 •

edited

Loading

sylvestre commented Apr 13, 2022

sylvestre commented Apr 14, 2022

jtracey commented Apr 14, 2022

sylvestre commented Apr 14, 2022 via email •

edited

Loading

jtracey commented Apr 15, 2022 •

edited

Loading

jtracey commented Apr 16, 2022

sylvestre commented Apr 16, 2022

jtracey commented Apr 16, 2022

sylvestre commented Apr 20, 2022

jtracey commented Apr 20, 2022

fix Android support #3396

fix Android support #3396

Conversation

jtracey commented Apr 13, 2022

jtracey commented Apr 13, 2022 • edited Loading

Misc:

Failing CI tests:

Failing non-CI tests:

sylvestre commented Apr 13, 2022

sylvestre commented Apr 14, 2022

jtracey commented Apr 14, 2022

sylvestre commented Apr 14, 2022 via email • edited Loading

jtracey commented Apr 15, 2022 • edited Loading

jtracey commented Apr 16, 2022

sylvestre commented Apr 16, 2022

jtracey commented Apr 16, 2022

sylvestre commented Apr 20, 2022

jtracey commented Apr 20, 2022

jtracey commented Apr 13, 2022 •

edited

Loading

sylvestre commented Apr 14, 2022 via email •

edited

Loading

jtracey commented Apr 15, 2022 •

edited

Loading