This repository has been archived by the owner on Mar 2, 2022. It is now read-only.

The second memory node not working when trying 1P-2M-1S with a GMM #10

Closed
fyc1007261 opened this issue Aug 29, 2019 · 10 comments

Comments

@fyc1007261

Hi @lastweek ,

We have successfully deployed 1P-1M-1S on CloudLab and are now experimenting with multiple processor/memory nodes. We tried five nodes, with #0 as processor; #1 and #4 as memory; #2 as storage; and #3 as the global resource monitor. We also configured linux-modules/monitor/include/monitor_config.h so the GMM knows the IDs of the memory nodes. After rebooting the processor and memory nodes, we ran make fit_install on the storage and GMM nodes, then make monitor_install on the GMM node and make storage_install on the storage node. However, when we ran an application that required a large amount of memory, node #1 (the default memory node configured on all machines) used up all its memory and panicked, while node #4 did not seem to be doing any work. Is there anything we left unconfigured, or did we do something wrong?

Thanks very much for your help!

@fyc1007261
Author

By the way, when we tried multiple processor nodes with GPM configured as 'Y', LegoOS failed to compile. It seems that some code uses the ibapi_reply_message function, which is declared in fit/ibapi.h, but the declaration is only visible when CONFIG_COMP_MEMORY is set (in line 88). Is there any way to solve this problem? Thanks so much!

@lastweek
Contributor

@fyc1007261, I'm traveling this week and will get back to this next Wed/Thu. Sorry for the delay. Meanwhile, @hythzz, will you have time to help out?

@fyc1007261
Author

Hi @lastweek ,

We are still struggling with the multiple-P/M problem. Could you please help check where our configuration might be wrong, or provide some instructions on this?

@lastweek
Contributor

Hi @fyc1007261,

Sorry for the delay; I have been moving around a lot recently. From your first post, it seems that at least all the machines are connected. You mentioned that "the #1 node (the default memory node configured on all machines) used up all its memory and panicked"; did you see an OOM message? I may have a clue about where the issue is, but I need to take a look at your .config files.

Could you share your P and M's .config files with me? Thank you.

@fyc1007261
Author

Hi @lastweek ,

I have put all the config files and logs that I consider important at the link below. We are running the program test.cc, which tries to allocate 20 GB of memory. Since we have two memory nodes with 16 GB each, the allocation should have succeeded, but it failed. If there are any additional files I should provide, please let me know.

Thanks a lot for your help!!

Config files and logs:
https://1drv.ms/u/s!ApeLgKxbjBKilr544vQG_9UQKxctfA?e=iCWf4I

@dothyt
Member

dothyt commented Sep 16, 2019

Hi @fyc1007261 ,

I just checked your config files. It looks like you didn't enable CONFIG_DISTRIBUTED_VMA on either the processor node or the memory nodes, which is necessary for running multiple Ms. Here are sample configs needed for two Ms to work:

Processor node side:

CONFIG_DISTRIBUTED_VMA=y
CONFIG_DISTRIBUTED_VMA_PROCESSOR=y
CONFIG_VM_GRANULARITY_ORDER=30
CONFIG_MEM_NR_NODES=2

Memory node side:

CONFIG_DISTRIBUTED_VMA=y
CONFIG_DISTRIBUTED_VMA_MEMORY=y
CONFIG_VM_GRANULARITY_ORDER=30
CONFIG_MEM_NR_NODES=2
CONFIG_VMA_CACHE_AWARENESS=n

Please keep CONFIG_VM_GRANULARITY_ORDER consistent across all nodes. 30 means 2^30 bytes (1 GB), which is the default setting.

CONFIG_VMA_CACHE_AWARENESS is an optional config that makes VMA allocation cache-aware but increases virtual address fragmentation.

CONFIG_MEM_NR_NODES tells each memory node how many memory nodes exist in the cluster.

@fyc1007261
Author

Hi @lastweek @hythzz ,

Thanks for your help! It works now!

@lastweek
Contributor

Hi @fyc1007261, we are back on schedule and will update the repo more frequently.

@dothyt
Member

dothyt commented Sep 17, 2019

Hi @fyc1007261, if your issue has been solved, please close this thread.

@fyc1007261
Author

Hi @hythzz,
Sorry, I forgot. I'll close it now.
