
NodeJS performance on alpine/arch running on v86 #1005

Closed

farhan-ct opened this issue Mar 19, 2024 · 28 comments

@farhan-ct

Hi,

I've been trying to run some nodejs programs and projects on v86. So far, this has involved using npm/pnpm to install packages both online and offline (from the npm cache), and trying to run simple http and express server programs, etc.

Below are some of my experiences.

  1. Most of the stuff, like installing packages or running node express programs, takes ~2 minutes just to get started.
  2. Installing npm packages takes way too long, even with a webproxy with throttling removed. It's slow even when I'm installing packages from the local npm cache.
  3. Even if the chunks of the os image are fetched from disk cache, the performance of nodejs programs is very slow.
    (ignore boot times, I am using a saved booted state to load)

I know all this sounds very vague and perhaps unrelated to v86 but I want to dig deeper, explore and perhaps fix/improve things if I can.

@copy @SuperMaxusa

I kindly request you to point me to where I should start looking and studying to improve v86 performance, at least for nodejs temporarily (though it's understandable that it's running in a web browser).

Thank you!

@copy
Owner

copy commented Mar 19, 2024

Some ideas:

  • A minimal reproducing testcase would be nice. Ideally a small standalone JavaScript program.
  • Compare performance to qemu-tcg (without kvm). If it's the same speed as v86, then it's not a v86 issue.
  • If you're running in the linux console, try an X session and run it in xterm instead
  • Try with chromium on localhost with COOP/COEP enabled (add_header Cross-Origin-Opener-Policy same-origin; add_header Cross-Origin-Embedder-Policy require-corp;)
  • Mount the directories used by npm to a tmpfs, to make sure it's not an issue with the 9pfs (see the sketch after this list)
  • Try a different guest OS, for example freebsd or haiku
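
A minimal sketch of the tmpfs idea, assuming npm's cache lives in /root/.npm and the project sits in /root/app (both paths are assumptions; adjust them to the actual guest layout):

# Shadow npm's cache and the project's node_modules with RAM-backed tmpfs,
# so neither npm nor node touches the 9pfs for these paths:
mount -t tmpfs -o size=256m tmpfs /root/.npm
mkdir -p /root/app/node_modules
mount -t tmpfs -o size=256m tmpfs /root/app/node_modules

If npm becomes fast afterwards, the bottleneck was filesystem I/O rather than CPU emulation.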

If it's none of the above, it's probably cpu bound. Note that a 10x slowdown compared to native is expected. Profiling could help (but we really need that minimal testcase):

  • Use node's built-in profiler (see the example after this list)
  • Firefox profiler is best for wasm profiling (use with the release build: make all)
  • Build with make debug-with-profiler and run the program in debug.html. It has a bunch of profiling information at the bottom
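
For the first point, node's built-in V8 sampling profiler can be run entirely inside the guest (standard node flags; the isolate log file name varies per run):

# Record a profile, then render it into a readable summary:
node --prof pitest.js
node --prof-process isolate-0x*-v8.log > profile.txt

The summary splits time between JS, C++ and GC, which helps tell a generally slow node apart from one slow workload.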

@SuperMaxusa
Contributor

SuperMaxusa commented Mar 20, 2024

running node express programs takes ~2 minutes just to get started

You probably need to load node_modules into the cache early, or put it on a tmpfs, as suggested before.

As for node's own long startup, I have an idea (tested only on 9pfs): cache the node binary and libraries manually, the way a cold boot would, e.g. cat /usr/lib/libnode.* /usr/bin/node > /dev/null, or via vmtouch (if the problem is I/O-bound).
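
For example with vmtouch (assuming the package is available in the guest; the library name matches the Arch build shown below):

# Load the binary and its shared library into the page cache ahead of time (-t = touch):
vmtouch -vt /usr/bin/node /usr/lib/libnode.so.111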

By the way, I found that a node binary compiled without the --shared flag (for example, v20.11.1 from unofficial-builds.nodejs.org) is much faster than one compiled with it (for example, the older version in Arch Linux (x32) - nodejs, commit: 26a5d0):

log
root@localhost:~# vmtouch -v /usr/bin/node
/usr/bin/node
[    ] 0/4

           Files: 1
     Directories: 0
  Resident Pages: 0/4  0/16K  0%
         Elapsed: 0.0654 seconds

root@localhost:~# vmtouch -v /usr/lib/libnode.so.111
/usr/lib/libnode.so.111
[                                                            ] 0/12781

           Files: 1
     Directories: 0
  Resident Pages: 0/12781  0/49M  0%
         Elapsed: 0.10016 seconds

root@localhost:~# time node /root/hello.js
Hello, world

real    1m53.508s
user    0m39.370s
sys     0m3.309s

root@localhost:~# time node /root/hello.js
Hello, world

real    0m30.253s
user    0m25.674s
sys     0m1.164s

root@localhost:~# vmtouch -v /node                  
/node
[                                                            ] 0/22540

           Files: 1
     Directories: 0
  Resident Pages: 0/22540  0/88M  0%
         Elapsed: 0.12426 seconds

root@localhost:~# time /node /root/hello.js
Hello, world

real    0m39.985s
user    0m6.341s
sys     0m2.483s

root@localhost:~# time /node /root/hello.js
Hello, world

real    0m5.739s
user    0m4.173s
sys     0m0.796s

root@localhost:~# /node -v
v20.11.1

root@localhost:~# node -v
v19.0.0

And I wrote a simple program that calculates Pi with the Leibniz formula, as a benchmark:

pitest.js
var pi = 0, k = 1;

function picalc(n) {
    for (var i = 0; i < n; i++) {
        if (i % 2 == 0) {
            pi += (4 / k);
        } else {
            pi -= (4 / k);
        }
        
        k += 2;
    }
}

console.time("pi");
picalc(10000000);
console.timeEnd("pi");

Results:

root@localhost:~# node /pitest.js 
pi: 55.320s

root@localhost:~# node /pitest.js 
pi: 17.297s

root@localhost:~# /node /pitest.js 
pi: 15.025s

root@localhost:~# /node /pitest.js 
pi: 13.192s

And out of interest, I tested on Void Linux (on an unpartitioned image); unlike Arch, it builds without the --shared flag (config from Void Linux), i.e. without a separate libnode.so.

results
# cold start
[supermaxusa@void ~]$ time node hello.js
Hello, world


real    0m38.685s
user    0m15.197s
sys     0m10.290s

[supermaxusa@void ~]$ time node hello.js
Hello, world


real    0m24.024s
user    0m16.345s
sys     0m5.421s

[supermaxusa@void ~]$ node -v
v18.19.1

[supermaxusa@void ~]$ node pitest.js 
pi: 21.739s

[supermaxusa@void ~]$ node pitest.js 
pi: 20.642s

npm is also slow; in my case, I installed express.js from the global node_modules into the project's node_modules (ext4 -> ext4), so I guess it's an I/O bottleneck.

package.json
{
  "name": "npm-test",
  "version": "1.0.0",
  "description": "",
  "main": "/dev/null",
  "scripts": {},
  "author": "",
  "license": "ISC",
  "dependencies": {
    "express": "*"
  }
}
[supermaxusa@void express-test]$ time npm install

up to date, audited 65 packages in 3m

12 packages are looking for funding
  run `npm fund` for details

found 0 vulnerabilities

real    4m19.227s
user    2m57.512s
sys     0m17.530s

...but with precached files the result is somewhat better in various cases. For example, I tried to import the express module: the first run is a cold start, the second a warm start.

[supermaxusa@void ~]$ time NODE_PATH=/usr/lib/node_modules node -e "var e = require('express')"

real    1m4.244s
user    0m30.458s
sys     0m7.530s

[supermaxusa@void ~]$ time NODE_PATH=/usr/lib/node_modules node -e "var e = require('express')"

real    0m11.048s
user    0m7.352s
sys     0m1.954s

@farhan-ct
Author

Thank you guys for the pointers, I'll go through each and every point, do some testing and update here with the results. Thanks again!

@SuperMaxusa
Contributor

@farhan-ct, to clarify, did you try to test this on a 9p root fs? So far I have only been thinking about I/O bottlenecks (I think CPU bottlenecks don't matter yet).

  • If you're running in the linux console, try an X session and run it in xterm instead

You mean that node needs to run in an X session instead of in a ttyS0 or tty1 session?

@copy
Owner

copy commented Mar 20, 2024

You mean that node needs to run in an X session instead of in a ttyS0 or tty1 session?

Yes, the Linux framebuffer console emulator is pretty slow (might be relevant if npm prints progress bars). Using ttyS0 could also work.
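
For reference, npm's progress bar can be turned off with a standard npm setting, which sidesteps the console bottleneck entirely:

npm config set progress false   # persistently
npm install --no-progress       # or per invocation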

@farhan-ct
Author

farhan-ct commented Mar 20, 2024

@SuperMaxusa

to clarify, did you try to test this on a 9p root fs?

I ran all these tests on Alpine Standard Linux, both on my packer version and on the docker version you shared in the discussions (both in v86). However, my packer version doesn't have 9pfs at all, so the performance issues exist both with and without 9p.

Also, I am having issues running the node v20.11.1 downloaded from the unofficial builds site. However, the nodejs version installed using apk add is the same version for me. And trying to load the node binary into memory using vmtouch didn't help in my case either.

I just saw that you edited your above answer, let me go through your results once again 😃

but with precached files, result is some better in different cases

What does 'precached files' here stand for? Is it the files that were loaded into memory when installing modules from global to project?

@farhan-ct
Author

the Linux framebuffer console emulator is pretty slow (might be relevant if npm prints progress bars). Using ttyS0 could also work.

@copy I am using pnpm instead; it doesn't print too much output, and it gave me relatively better performance in most cases.

  • I am using serial0 (/dev/ttyS0), with 2 different setups in 2 different cases
  1. Using the 'serial_container_xtermjs' option for V86Starter
  2. Adding a listener on the emulator's bus for the serial0-output-char event and writing the characters manually to an xterm (xterm js)

Both cases have the same performance issues.

I am trying to prepare a well-written document that includes testcases (minimal, simple imports, and a server) and benchmarks covering the pointers that you and @SuperMaxusa provided.

@SuperMaxusa
Contributor

Also, I am having issues running the node v20.11.1 downloaded from the unofficial builds site.

Maybe gcompat and some dependencies (ldd node) are missing?

Yes, the Linux framebuffer console emulator is pretty slow (might be relevant if npm prints progress bars). Using ttyS0 could also work.

Interesting, thanks for the info! My previous tests were run on ttyS0. About progress output: on xterm, npm runs a bit faster with progress=false than without it (warmed):

[screenshot: xterm]

...but the results on the framebuffer were (for some reason) the opposite.

[screenshot: tty0]

Based on these results, I tried to reduce npm's output and logging, and this helps.

[screenshot: xterm2]

What does 'precached files' here stand for? Is it the files that were loaded into memory when installing modules from global to project?

This is file caching (the page cache), which reduces repeated I/O accesses; see https://www.linuxatemyram.com/ and https://en.wikipedia.org/wiki/Page_cache.
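
The effect is easy to observe in the guest with standard commands:

free -h                          # note the buff/cache column
cat /usr/bin/node > /dev/null    # first read is slow and fills the page cache
cat /usr/bin/node > /dev/null    # repeated read is served from RAM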

@SuperMaxusa
Contributor

3. Even if the chunks of the os image are fetched from disk cache, the performance of nodejs programs is very slow.
(ignore boot times, I am using a saved booted state to load)

Did you try using different chunk sizes for the image? #710

@farhan-ct
Author

farhan-ct commented Mar 21, 2024

Did you try using different chunk sizes for the image?

I split my image into fixed-size 1MB chunks, with a filename format taken from one of the windows 95/98 demos on copy.sh.

While testing, I've been trying both the split version and providing async: true plus size, which effectively auto-chunks the image (done by v86 itself, on the fly).
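
For reference, the on-the-fly variant looks roughly like this in the v86 config (a sketch: the URL, sizes and surrounding options are assumptions; the hda/async/size option names follow the v86 API):

var emulator = new V86Starter({
    wasm_path: "build/v86.wasm",
    memory_size: 512 * 1024 * 1024,
    hda: {
        url: "images/alpine.img",
        async: true,      // fetch chunks of the image on demand
        size: 1073741824, // total image size in bytes
    },
    initial_state: { url: "images/state.bin" },
});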

Maybe gcompat and some dependencies (ldd node) are missing?

Yes, I had to install gcompat for the missing loader, and the node binary at least executed after that. But when I try any node command, even node --help, I constantly get the following error: Fatal javascript OOM in MemoryChunk allocation failed during deserialization.

@SuperMaxusa
Contributor

Yes, I had to install gcompat for the missing loader, and the node binary at least executed after that,

This node binary is built for glibc, so it may work incorrectly with gcompat. Anyway, nodejs on Alpine is compiled without --shared: https://git.alpinelinux.org/aports/tree/main/nodejs/APKBUILD#n200

@SuperMaxusa
Contributor

If this is useful: node with mimalloc2 (gliderlabs/docker-alpine#161 (comment)) or jemalloc (https://stackoverflow.com/questions/53234410/how-to-use-node-js-with-jemalloc) is faster (in some cases) than without (tested on Alpine v3.18.6 and Node 18.19.1).

localhost:~# time node hello.js
Hello, world

real    0m 10.31s
user    0m 7.15s
sys     0m 2.23s

localhost:~# LD_PRELOAD=/usr/lib/libmimalloc.so.2 time node hello.js 
Hello, world

real    0m 7.81s
user    0m 6.34s
sys     0m 0.92s

localhost:~# LD_PRELOAD=/usr/lib/libjemalloc.so.2 time node hello.js
Hello, world

real    0m 7.43s
user    0m 6.06s
sys     0m 0.87s

localhost:~# time node pitest.js 
pi: 13.077s
real    0m 21.50s
user    0m 16.69s
sys     0m 2.51s

localhost:~# LD_PRELOAD=/usr/lib/libjemalloc.so.2 time node pitest.js 
pi: 9.576s
real    0m 16.20s
user    0m 14.06s
sys     0m 1.20s

localhost:~# LD_PRELOAD=/usr/lib/libmimalloc.so.2 time node pitest.js
pi: 9.961s
real    0m 16.19s
user    0m 14.28s
sys     0m 1.02s

@farhan-ct
Author

Thank you @SuperMaxusa for the clear details and tests, will try it out :D

Repository owner deleted a comment from max25blu Apr 10, 2024
@copy
Owner

copy commented Apr 24, 2024

I've pushed a fix for slow disk IO into the wip branch (dfacd61). Please test.

@farhan-ct
Author

Thank you @copy

Will test now

@farhan-ct
Author

I've pushed a fix for slow disk IO into the wip branch (dfacd61). Please test.

Hi @copy

Thanks for the update on the fix. I built the wip branch and did some quick testing using the provided docker debian example, and took some screenshots. I couldn't arrive at a fair conclusion after testing the changes.

@SuperMaxusa, could you take a look at this and test/confirm whether you see similar results?

[screenshot: test results]

All 4 tests were performed under the same conditions.

@SuperMaxusa
Contributor

SuperMaxusa commented Apr 25, 2024

fix for slow disk IO into the wip branch (dfacd61)

Here are my results (browser: Chrome 124 on localhost, guest OS: Alpine Linux 3.19.1, fs: ext4 with relatime); after every test command I ran sync and echo 3 > /proc/.../drop_caches:

Test                                          | master 9a9d6f3                      | wip dfacd61
dd [write]                                    | 6.6 MB/s (20 sec)                   | 7.2 MB/s (18 sec)
dd [write dsync]                              | 8.6 MB/s (15 sec)                   | 8.2 MB/s (16 sec)
cat 128MB [read]                              | real 7.65s, user 0.14s, sys 6.17s   | real 6.02s, user 0.14s, sys 5.29s
dd [read]                                     | 22.7 MB/s (6 sec)                   | 23.7 MB/s (5.6 sec)
time node hello.js [cold start, mostly read]  | real 38.4s, user 18.66s, sys 9.02s  | real 33.08s, user 20.91s, sys 9.20s
some graphs: [bar chart: bar1, full size: svg] [bar chart: bar2, full size: svg]

test commands
# apk add coreutils
# hello.js is the same as in upper post

# 1. dd [write]
dd if=/dev/zero of=/testfile bs=1M count=128 status=progress

# 2. dd [write dsync]
dd if=/dev/zero of=/testfile bs=1M count=128 status=progress oflag=dsync

# 3. cat [read]
time cat /testfile > /dev/zero

# 4. dd [read]
dd if=/testfile of=/dev/zero bs=1M count=128 status=progress

# 5. start hello.js (cold start)
time node hello.js

Could you take a look at this and test/confirm whether you see similar results?

I changed the version in the Dockerfile to trixie and also added apt install nodejs (server: redbean 2.2.0 with gzip compression). The first run takes about 1 minute.

Some ideas:

[screenshot: node debian]

@copy
Owner

copy commented Apr 26, 2024

To be clear, the fix mostly improves performance when files are actually downloaded from the server. So I wouldn't expect writes to be any faster, nor reads of any files that have been previously read (v86 caches downloaded files in most cases).

So the cold performance looks pretty solid now (in comparison to warm). I think much of the remaining cold start overhead can be explained by jit warmup.

The warm performance numbers you posted still look pretty bad, and much worse than my numbers. For example:

localhost:~# time node hello.js # cold (disk + jit warmup)
Hello, world!
real    0m 3.71s
user    0m 2.05s
sys     0m 0.97s
localhost:~# time node hello.js # warm
Hello, world!
real    0m 1.61s
user    0m 1.17s
sys     0m 0.34s
localhost:~# time node -v # cold (disk + jit warmup)
v20.12.1
real    0m 0.98s
user    0m 0.20s
sys     0m 0.36s
localhost:~# time node -v # warm
v20.12.1
real    0m 0.18s
user    0m 0.09s
sys     0m 0.07s

This is on Alpine 3.19.1 built with the docker script (I had to add modules=virtio_pci to the kernel command line).

@SuperMaxusa
Contributor

Note: my previous test was run on a hard drive image loaded from the browser; I haven't tested the speed on 9pfs yet.

I have now tested on a 9pfs rootfs with Alpine Linux 3.19.1 as you said (+ COOP/COEP headers).
Rechecking with other cache modes gives roughly the same results (the test with cache=mmap is below):

localhost:~# cat /proc/mounts 
< ... >
host9p / 9p rw,relatime,cache=5,access=client,msize=512000,trans=virtio 0 0
< ... >

localhost:~# time node hello.js # cold
Hello World!
real    0m 32.07s
user    0m 16.04s
sys     0m 7.63s

localhost:~# time node hello.js # warm
Hello World!
real    0m 30.33s
user    0m 15.17s
sys     0m 7.17s

localhost:~# time node -v
v20.12.1
real    0m 5.74s
user    0m 0.75s
sys     0m 1.28s

(v86 caches downloaded files in most cases)

Is this caching done by the browser or by the filestorage's cache? (sorry if the question is incorrect)

/**
* Add a read-only file to the filestorage.
* @param {string} sha256sum
* @param {!Uint8Array} data
* @return {!Promise}
*/
FileStorageInterface.prototype.cache = function(sha256sum, data) {};

@copy
Owner

copy commented Apr 27, 2024

Is this caching done by the browser or by the filestorage's cache? (sorry if the question is incorrect)

Hmm, 9pfs data is not cached currently, my previous statement was incorrect. However, files are only re-downloaded if the 9pfs closes the file and it hasn't been written to.

Let's try to narrow down why node runs ~20x slower for you:

  • I've pushed my full alpine build process and examples/alpine.html to the wip branch (there's probably not much difference to yours)
  • I get roughly the same performance on firefox (dev edition) and chromium
  • What's your cpu and ram size? (I'm on a fairly fast cpu with 32GB of ram)
  • Could you add $(CLOSURE_READABLE)\ after $(CLOSURE_FLAGS)\ in the libv86.js target in the makefile, rebuild build/libv86.js (you may need to delete the file) and profile a run of node with firefox profiler?

@SuperMaxusa
Contributor

SuperMaxusa commented Apr 27, 2024

I've pushed my full alpine build process and examples/alpine.html to the wip branch (there's probably not much difference to yours)

Thanks, previously I had added a profile in src/browser/main.js and used it for my tests. It's almost the same; at least I also have "irqpoll" in the kernel params, because otherwise loading gets stuck after the nmi selftest.

  • What's your cpu and ram size?

For now I have only tested on an i3 with 6 GB of RAM (running only the browser and an HTTP server) on Windows, and this can be the main problem. However, a test with qemu-tcg (single-threaded, qemu-system-i386 -accel tcg,thread=single) gives these results:

localhost:~$ time node hello.js # cold (no attention in our case now)
Hello World!
real    0m 6.97s
user    0m 3.14s
sys     0m 1.53s

localhost:~$ time node hello.js # warm
Hello World!
real    0m 2.48s
user    0m 1.51s
sys     0m 0.85s

localhost:~$ time node -v # warm
v20.12.1
real    0m 0.33s
user    0m 0.17s
sys     0m 0.13s
  • Could you add $(CLOSURE_READABLE)\ after $(CLOSURE_FLAGS)\ in the libv86.js target in the makefile, rebuild build/libv86.js (you may need to delete the file) and profile a run of node with firefox profiler?

Tried with the wip branch (34ebae1) and the new alpine container; for some reason, I get around 11 seconds of warm node.js start on Firefox (release and Developer Edition) and around 18 seconds on Chrome (can't record properly with the performance profiler):

localhost:~# time node hello.js 
Hello, world!
real    0m 18.18s
user    0m 13.20s
sys     0m 3.06s

localhost:~# time node -v
v20.12.1
real    0m 1.23s
user    0m 0.54s
sys     0m 0.20s

I have no idea why this happens, because I use the same conditions for the tests (just a server with COOP and a browser with a clean profile, without extensions).

@copy
Owner

copy commented Apr 30, 2024

For now I have only tested on an i3 with 6 GB of RAM

I think that explains most of the performance problems. You could try setting MAX_PAGES to 1 or 2, which might improve performance on low-end systems:

static mut MAX_PAGES: u32 = 3;

v86 isn't as fast as qemu here, but there are three layers of jit warmup going on (js->x86, x86->wasm, wasm->hostcpu). I don't think there is any single optimisation that will be an "easy win" here. The following could net a few percent:

  • adding some sse instructions used by node to the jit

@farhan-ct I'd suggest playing around with the alpine image and the various suggestions. Disk performance should be fine now

@farhan-ct
Author

Will do @copy

Thank you for sharing lots of info that I can learn :D

I've been using debian trixie until now and it's been fine so far, except I am creating states after a nodejs cold run and saving them without clearing caches. The state size is big, but it works nicely. I will try this updated alpine too, with the optimizations and suggestions provided.

Maybe out of context but here are a few observations:

  1. The state file size is smaller when v86 is run under nodejs instead of in the browser
  2. Network isn't working properly in debian when running v86 on node

@SuperMaxusa
Contributor

SuperMaxusa commented Apr 30, 2024

  • adding some sse instructions used by node to the jit

It seems that v8 uses SSE2 (v8/v8@3fa6100), SSE3 (can be turned on/off at runtime with popcnt), FPU (not sure how often this is used; at least it's required for all CPU arches: https://forum.openwrt.org/t/building-node-js-for-ramips/183164, https://github.com/v8/v8/blob/main/test/unittests/base/cpu-unittest.cc#L64) and SSSE3 (#1038).

The state size is big, but it works nicely.

With zstd -19 compression too?

@farhan-ct
Author

With zstd -19 compression too?

I did it with -22 (zstd --ultra -22 state.bin), and for some reason, it seems that the loaded cache is being lost after compression.

Example: I boot for the 1st time, run some node command, and create a state without clearing vm caches and such. The state will obviously be big, and when I boot using this initial_state, node is fast. But if I compress it using zstd and then boot with the initial_state, node is slower again.
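
For comparison, the -19 setting suggested above (standard zstd usage; state.bin as in your command):

zstd -19 state.bin -o state.bin.zst   # -19 uses a much smaller window, so decompression needs far less memory than --ultra -22
zstd -l state.bin.zst                 # inspect the frame and window size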

@SuperMaxusa
Contributor

it seems that the loaded cache is being lost after compression.

Can you try to compare free -h before and after zstd?

2. Network isn't working properly in debian when running v86 on node

Node.JS 22.0 (or 21.x with node --experimental-websocket) has a WebSocket client, so you can add network_relay_url to the V86 constructor and run with working networking on this version (tested by me on Buildroot).
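
A minimal sketch of that setup (the require path and relay URL follow the v86 README examples and are assumptions; all other options omitted):

// Run with node >= 22, or node --experimental-websocket on 21.x:
const { V86 } = require("./build/libv86.js");

const emulator = new V86({
    // ... bios/image options as usual ...
    network_relay_url: "wss://relay.widgetry.org/",
});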

@farhan-ct
Author

Node.JS 22.0 (or 21.x with node --experimental-websocket)

This worked, thank you!

Will update about the free -h too once I check

@copy
Owner

copy commented Sep 18, 2024

From the latest tests, v86 takes about one second to run a warm nodejs hello world on a low-end machine, which looks reasonable to me considering the constraints.

I'll gently close this issue, but feel free to create a new issue if you have another testcase where v86 is significantly slower than qemu-tcg.

Also, if you have any tuning suggestions that improve fs9p performance on Linux, please send a PR to update examples/alpine.html or docs/archlinux.md.
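
For reference, the knobs in question are mostly the 9p mount options; a typical line to experiment with (the option values are examples only; the host9p tag matches the mount shown earlier in this thread):

mount -t 9p -o trans=virtio,cache=loose,msize=512000 host9p /mnt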

I did it with -22 (zstd --ultra -22 state.bin), and for some reason, it seems that the loaded cache is being lost after compression.

Just a note on this. zstd -22 uses significantly more memory than -19 (by design). That wouldn't usually be a problem, however wasm currently doesn't have an api to give memory back to the OS (and as wasm memory is a single chunk, the browser doesn't know which parts are "unused"). So the amount of memory used by zstd decompression will be used until the page is refreshed. In my tests, I didn't find the extra memory usage of zstd -22 worth it for the compression it provides.

@copy copy closed this as completed Sep 18, 2024