Changes required to run SIMX on HPCAC #71

Draft
wants to merge 1 commit into
base: master

ztiffany

Signed-off-by: Zach Tiffany [email protected]

This is a dirty set of changes that were made to get MKT to run on HPCAC. Do not merge.

vim \
iperf \
crash \
zstd \

This is a development convenience.

Maybe it would be useful to let users provide their own Dockerfile that incrementally adds what they need on top of the existing image.

Collaborator

It was always in the back of my mind, but I didn't investigate how to do it without rebuilding all the images.

Can you just apply something on top of the existing image?
For example, run one additional Dockerfile over the existing image to extend it, and make the new image the "current" one?
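A minimal sketch of the overlay idea discussed above, assuming the existing image can serve as a `FROM` base. The helper names, the image tag `mkt/simx:latest`, and the `dnf` installer are illustrative assumptions and are not part of MKT.

```python
import shlex


def overlay_dockerfile(base_image, packages, installer="dnf install -y"):
    """Return Dockerfile text that extends base_image with extra packages.

    The installer command is an assumption; adjust it to the base distro.
    """
    lines = ["FROM {}".format(base_image)]
    if packages:
        quoted = " ".join(shlex.quote(p) for p in sorted(packages))
        lines.append("RUN {} {}".format(installer, quoted))
    return "\n".join(lines) + "\n"


def overlay_build_cmd(tag, dockerfile_path, context_dir="."):
    """Return a `docker build` command line that tags the overlay image."""
    return ["docker", "build", "-t", tag, "-f", dockerfile_path, context_dir]


if __name__ == "__main__":
    # Hypothetical base image name; the packages come from the diff above.
    print(overlay_dockerfile("mkt/simx:latest", ["vim", "iperf", "crash", "zstd"]))
    print(" ".join(overlay_build_cmd("mkt/simx:dev", "Dockerfile.user")))
```

Because the base image is reused as-is, only the extra layer is built, so the existing images would not need to be rebuilt.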

RUN /root/basic-setup.sh && /root/kvm-setup.sh
RUN /root/basic-setup.sh

RUN /root/kvm-setup.sh

This can be ignored.

@@ -42,6 +42,12 @@ def make_simx(args):

subprocess.call(cmd + ['-j%d' %(args.num_jobs)])

def make_rdmo_app(args):
Author

@ztiffany ztiffany Jun 15, 2021

I started throwing in some stuff to make MKT build my rdmo app. I abandoned that, though. Ignore references to rdmo-app and the packages added to support.Dockerfile.

I added packages to the VM image to build rdmo-app inside my VM instead.

Collaborator

Building inside the VM looks simpler, but it misses the MKT concept. We wanted to separate the build environment from the run environment. That lets us benefit from specific optimizations and makes runs fast.

# git_url: http://l-gerrit.mtl.labs.mlnx:8080/simx
# git_commit: 41f602dc05b3c115b176ac3f7869e8bd390cbd92
# git_url: /global/home/users/ztiffany/test/simx
# git_commit: 3f3c2c9338f3bbb73cf3bd298152e020e394086f

This can be ignored.

@@ -18,7 +18,7 @@ From simx.git
%build
./mlnx_infra/config.status.mlnx --target=x86 --prefix=/opt/simx
make %{?_smp_mflags}
make %{?_smp_mflags} -C mellanox/
make %{?_smp_mflags} -C mellanox/ SIMX_PROJECT=mlx5

This tells SimX to only build the NIC part. I think it makes sense unless the switch part is planned to be used.

Collaborator

Yeah, it makes sense for now. A long time ago I pitched this project to the switch team. They even tried it, but decided to stick with VMs because of the difference in technical expertise between the development team and the verification team.

# git_url: git://repo.or.cz/smatch.git
# git_commit: 9bb66fa2d7c73b3338a27fd6b38d7d509b2a1c1b
# git_url: /global/home/users/artemp/scratch/.cache/mellanox/mkt/smatch.git
# git_commit: 72c21a144a812cadbe349801da1b24bc331af256

@artpol84 artpol84 Jun 15, 2021

For some reason, the site where we were building this can't access the original URL.
This is specific to that site and shouldn't be considered, probably.
Especially given that "mkt images" is not a must.

Contributor

IIRC, if you preload the normal cache directory, it doesn't require network access as long as the commit_id is already present. So these weird disconnected cases are solved by transferring the cache directory from a network-connected machine.

What is the normal cache directory?

Collaborator

➜ kernel git:(master) ls ~/.cache/mellanox/mkt
iproute2-next.git rdma-core.git simx.git smatch.git sparse.git tc-build.git

That's the issue: downloading from git fails there with "connection refused".
Again, I don't think we should handle this in MKT. This is obviously an issue on that site.

We just haven't cleaned up the version we ended up with, to save time.
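The preloading suggestion above depends on the pinned commit already being present in the cached repository. A hedged sketch of how that could be checked before going offline, using `git cat-file -e`; the helper names are hypothetical and this is not how MKT itself performs the check.

```python
import os
import subprocess

# Default location shown in the listing above.
CACHE_DIR = os.path.expanduser("~/.cache/mellanox/mkt/")


def commit_check_cmd(repo_name, commit, cache_dir=CACHE_DIR):
    """Build a git command that exits 0 only if `commit` exists in the cached repo."""
    git_dir = os.path.join(cache_dir, repo_name)
    return ["git", "--git-dir=" + git_dir, "cat-file", "-e", commit + "^{commit}"]


def commit_is_cached(repo_name, commit, cache_dir=CACHE_DIR):
    """Return True if the cached repo already holds the commit, False otherwise."""
    cmd = commit_check_cmd(repo_name, commit, cache_dir)
    try:
        rc = subprocess.call(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except OSError:
        # git is not installed or not executable on this machine.
        return False
    return rc == 0


if __name__ == "__main__":
    # Commit id taken from the smatch header in the diff above.
    print(commit_is_cached("smatch.git", "9bb66fa2d7c73b3338a27fd6b38d7d509b2a1c1b"))
```

If the check passes on a network-connected machine, copying the whole cache directory to the disconnected site should carry the needed objects with it.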

@@ -1,7 +1,7 @@
#!/bin/bash
# ---
# git_url: git://git.kernel.org/pub/scm/devel/sparse/sparse.git
# git_commit: 8af2432923486c753ab52cae70b94ee684121080
# git_url: /global/home/users/artemp/scratch/.cache/mellanox/mkt/sparse.git

Same as above.

@@ -79,9 +85,13 @@ def setup_from_pickle(args, pickle_params):
subprocess.check_output(['make', 'headers_install',
'INSTALL_HDR_PATH=/usr'], cwd=args.kernel)

if not os.path.isdir('/images/ztiffany/ccache'):

This is likely not needed, based on my later experience.

@@ -64,11 +64,17 @@ def remove_mounts():


def is_passable_mount(v):
print ("Checking mount: {}".format(v))
if v[2] == "nfs" or v[2] == "nfs4":
Author

qemu-system-x86_64: -fw_cfg etc/sercon-port,string=2: warning: externally provided fw_cfg item names should be prefixed with "opt/"
qemu-system-x86_64: -device virtio-9p-pci,fsdev=host_bind_fs0,mount_tag=bind0: cannot initialize fsdev 'host_bind_fs0': failed to open '<snip>': Permission denied

Collaborator

"Permission denied": let's debug, it shouldn't happen.


@artpol84 artpol84 Jun 16, 2021

If I am root on the node, I cannot ls my user's home directory.

Running ls fails with "Permission denied" as well.
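Since qemu only reports the failure when it tries to open the bind directory, a pre-flight check could surface the problem earlier. The sketch below is an assumption about how such a check might look; `unreadable_bind_dirs` is a hypothetical helper, not an MKT function.

```python
import os


def unreadable_bind_dirs(dirs):
    """Return the directories from `dirs` that the current user cannot list.

    Listing is attempted for real, which mirrors the failing `ls` above
    instead of relying on permission bits alone.
    """
    bad = []
    for d in dirs:
        try:
            os.listdir(d)
        except OSError:
            # Covers permission errors as well as missing paths.
            bad.append(d)
    return bad


if __name__ == "__main__":
    for d in unreadable_bind_dirs([os.path.expanduser("~")]):
        print("cannot pass {} to the VM: not listable by uid {}".format(d, os.getuid()))
```

Running this as the same user that launches qemu would show whether the directory is unreadable before any VM is started.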

if v[1].startswith("/images/"):
print ("YES!!!")
return True
if v[1].startswith("/plugins"):
Author

HPCAC nodes are diskless. Here is how plugins are mounted:

Evaluating: /plugins
v is: ['tmpfs', '/plugins', 'tmpfs', 'ro,relatime,mode=555', '0', '0']
Passing: /plugins

Here is from a working system:

['/dev/sda5', '/plugins', 'ext3', 'ro,relatime', '0', '0']

Collaborator

AFAIK, Docker can't mount tmpfs; we need to think about a workaround.

It does work if we add the above.

So I think it should work as is.
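To make the mount discussion concrete, here is a simplified sketch of a mount filter over the six-field entries shown above. It is not the actual MKT `is_passable_mount`; the configurable prefix list is an assumption that stands in for the hard-coded `/images/` and `/plugins` checks in the diff.

```python
NETWORK_FS = ("nfs", "nfs4")
# Site-specific mount points taken from the diff above; treated here as config.
EXTRA_PREFIXES = ("/images/", "/plugins")


def parse_mount_line(line):
    """Split one /proc/mounts style line into its six fields."""
    return line.split()


def is_passable_mount(v, extra_prefixes=EXTRA_PREFIXES):
    """Decide whether a mount entry should be passed through.

    v[1] is the mount point and v[2] is the filesystem type.
    """
    if v[2] in NETWORK_FS:
        return True
    return v[1].startswith(tuple(extra_prefixes))


if __name__ == "__main__":
    diskless = ['tmpfs', '/plugins', 'tmpfs', 'ro,relatime,mode=555', '0', '0']
    working = parse_mount_line("/dev/sda5 /plugins ext3 ro,relatime 0 0")
    print(is_passable_mount(diskless), is_passable_mount(working))
```

Keeping the prefixes in one tuple would let a diskless site add its tmpfs mount points without editing the function body.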

@@ -64,11 +64,17 @@ def remove_mounts():


def is_passable_mount(v):
Author

On HPCAC, this was needed to get the rdma-core directory passed through:

mkt run --dir /images/ztiffany/src/rdma-core/

Is this expected?

Collaborator

No, it means your config file is incomplete, or there is another bug. We mount the whole src directory.

It could be because /images is on tmpfs, but I don't think we even saw an attempt to mount it.

@ztiffany ztiffany changed the title Changes required to run SIMX on HPCAI Changes required to run SIMX on HPCAC Jun 15, 2021
@@ -97,3 +107,5 @@ def setup_from_pickle(args, pickle_params):
make_rdma_core(args)
if args.project == "simx":
make_simx(args)
if args.project == "rdmo-app":
Author

Ignore this.

@@ -67,6 +67,7 @@ def run_ci_cmd(self, supos):
"rdma": "iproute2",
"kernel": "kernel",
"mlnx_infra": "simx",
"rdmo-app": "rdmo-app",
Author

Ignore.

@@ -27,7 +27,7 @@ def get_cache_fn(fn):
an impact on the operation of mkt - at worst it will run slower."""
global cache_dir
if cache_dir is None:
cache_dir = os.path.expanduser("~/.cache/mellanox/mkt/")
cache_dir = '/images/ztiffany/.cache/mellanox/mkt/'

The home directory is too small to hold these caches.
Is there a way to point the cache somewhere else?

Collaborator

@rleon rleon Jun 16, 2021

.cache is a general mechanism; it is worth making a symlink.

Makes sense
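As an alternative to hard-coding the path, the cache location could be resolved from the environment. This is only a sketch: `MKT_CACHE_DIR` is a hypothetical variable that MKT does not define, while `XDG_CACHE_HOME` comes from the XDG Base Directory convention.

```python
import os


def get_cache_dir(environ=os.environ):
    """Resolve the cache directory, always returned with a trailing slash.

    Order: hypothetical MKT_CACHE_DIR override, then XDG_CACHE_HOME,
    then the ~/.cache default used by the original code.
    """
    override = environ.get("MKT_CACHE_DIR")  # hypothetical variable name
    if override:
        return os.path.join(override, "")
    base = environ.get("XDG_CACHE_HOME") or os.path.expanduser("~/.cache")
    return os.path.join(base, "mellanox", "mkt", "")


if __name__ == "__main__":
    print(get_cache_dir())
    print(get_cache_dir({"MKT_CACHE_DIR": "/images/ztiffany/.cache/mellanox/mkt"}))
```

The symlink approach suggested above needs no code change at all; the sketch only shows what an explicit override might look like if one were wanted.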

@jgunthorpe
Contributor

@ztiffany please fix your git config to use your nvidia address, thanks

@artpol84

@jgunthorpe this is not intended for merge. It’s an FYI to indicate what we had to hack to make it work on the particular system.

we agreed with @rleon that we will open this one

@@ -69,4 +69,5 @@ cat <<EOF > /etc/sysctl.d/hugepages.conf
vm.nr_hugepages=2
EOF

rpm -U /opt/rpms/*.rpm
#rpm -U /opt/rpms/*.rpm
rpm -U --force /opt/rpms/*.rpm
Collaborator

Fixed in c2f86ca

4 participants