federated-learning-job run error: http://yolo-v5-aggregation.default:7363 connection failed #459

Open
victorming666 opened this issue Dec 26, 2024 · 16 comments

Comments

@victorming666

I rebuilt the Docker images for the federated learning job. The pods run on both the cloud node and the edge nodes:
image

the pod on cloud node:
image

but the pod on the edge node reports errors:
image
Can anybody help? Many thanks!

@victorming666
Author

This is a DNS failure on a K8s + KubeEdge + EdgeMesh + Sedna cluster. Cluster details:

  1. Kubernetes: 1.24.16
  2. KubeEdge: 1.13.0
  3. EdgeMesh: 1.13.0
  4. Sedna: 0.6.0

Two cloud nodes and two edge nodes:
image
Why can't the edge nodes find the DNS server of the cloud node? I tested with EdgeMesh's tcp-echo example and it works!
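
One way to tell a DNS failure apart from a plain connection failure is to test name resolution and the TCP connect as two separate steps from inside the edge pod. The following is a minimal sketch using only the Python standard library. The host `yolo-v5-aggregation.default` and port `7363` come from the error in the issue title; the function name `check_endpoint` and the 3-second timeout are illustrative assumptions and not part of Sedna or EdgeMesh.

```python
import socket


def check_endpoint(host: str, port: int, timeout: float = 3.0) -> str:
    """Return 'dns_failed', 'connect_failed', or 'ok' for host:port."""
    try:
        # Step 1: name resolution only. This is the step that fails when
        # the pod cannot reach a working cluster DNS.
        addrinfo = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failed"

    # Step 2: try a TCP connect to each resolved address in turn.
    for family, socktype, proto, _canonname, sockaddr in addrinfo:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            continue
    return "connect_failed"
```

Calling `check_endpoint("yolo-v5-aggregation.default", 7363)` inside the edge pod would then return `dns_failed` if the name cannot be resolved, and `connect_failed` if the name resolves but the service cannot be reached.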

@victorming666
Author

Is this project dead? Why no replies for all these issues?

@victorming666
Author

BTW, the cluster itself is OK, since EdgeMesh's 'cloud-edge echo' test case passes:
Cloud calling edge:
image
Edge calling cloud:
image

@victorming666
Author

I finally got this test to run. Here are the logs from the cloud node:
image
and here is the log from one of the edge nodes:
image
A lot of tough stuff...

@MooreZheng
Contributor

Is this project dead? Why no replies for all these issues?

Congratulations on another successful bug fix. Deploying a complicated system such as OpenStack, K8s, KubeEdge, or Sedna is usually done for real-world cloud services and handled by professional experts in large enterprises. It is indeed not an easy task for newcomers.

Nevertheless, while we understand that one may be confronted with urgent issues, participants in the KubeEdge community should try to be understanding and show respect to others, following the code of conduct. Experts usually have important duties at their companies, and it is not feasible to expect a 24-hour on-call reply from them, e.g., within two hours as in this case. Those who deploy successfully are encouraged to submit blogs or documents, which are highly appreciated and help members of this community use Sedna.

@MooreZheng
Contributor

MooreZheng commented Dec 27, 2024

BTW, what would be the opinion from @tangming1996 and @SherlockShemol : could there be any chance that this issue is related to the recent merge of #446 ?

@SherlockShemol
Contributor

In my opinion it's not related to the recent PR. When I first deployed Sedna applications such as joint inference and federated learning, I ran into what seems to be the same DNS problem, which is caused by EdgeMesh. I solved it by following a Q&A manual on Zhihu. Hope it helps.

@victorming666
Author

@MooreZheng @SherlockShemol @tangming1996 Thank you for all you have done for this project; AI running on edge devices has booming prospects. I hope this project keeps evolving. But I find it hard to use Sedna in my own project, as there is little documentation for app developers. A tutorial guide for app development would be very helpful.

@SherlockShemol
Contributor

@MooreZheng @SherlockShemol @tangming1996 Thank you for all you have done for this project; AI running on edge devices has booming prospects. I hope this project keeps evolving. But I find it hard to use Sedna in my own project, as there is little documentation for app developers. A tutorial guide for app development would be very helpful.

Can you provide more details about your project? Does it fit within the frameworks Sedna currently provides (joint inference, federated learning, etc.)? I am still a beginner in the Sedna project, so I may not be able to give you an answer immediately. However, during my learning process I will pay attention to the parts you mentioned, and maybe someday I will improve them.

@tangming1996
Contributor

BTW, what would be the opinion from @tangming1996 and @SherlockShemol : could there be any chance that this issue is related to the recent merge of #446 ?

It shouldn't matter, as our new feature hasn't been released yet.

@victorming666
Author

@SherlockShemol We are working on projects that deploy AI models on edge devices (RK3568), but the network is not stable, and we don't want to share some of the data with the cloud. So we turned to KubeEdge and Sedna. But these frameworks seem rather difficult to use, as the demos are just toy-like examples. That's why we are asking all of you for help.

@tangming1996
Contributor

It seems that the network of one edge node has a problem. First confirm that the edgemesh-agent status is normal, and then confirm that all test cases run through. If the problem persists, compare the configuration of the two edge nodes for consistency, since the network of the other edge node is normal.
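
The configuration comparison can be done mechanically once the relevant config from each edge node has been copied to one place. Below is a minimal sketch with Python's standard `difflib`; the function name `diff_configs`, the labels `edge-ok` and `edge-failing`, and the sample keys are all hypothetical and do not reflect any real KubeEdge or EdgeMesh config schema.

```python
import difflib
from typing import List


def diff_configs(good: str, bad: str,
                 good_name: str = "edge-ok",
                 bad_name: str = "edge-failing") -> List[str]:
    """Return unified-diff lines between two config dumps (empty if identical)."""
    return list(difflib.unified_diff(
        good.splitlines(), bad.splitlines(),
        fromfile=good_name, tofile=bad_name, lineterm="",
    ))


# Hypothetical config contents, for illustration only.
working_node = "dns_enabled: true\nlisten_interface: eth0\n"
failing_node = "dns_enabled: false\nlisten_interface: eth0\n"

for line in diff_configs(working_node, failing_node):
    print(line)
```

Only the lines that differ between the two nodes show up with a `-` or `+` prefix, which narrows down where to look.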

@victorming666
Author

@tangming1996 Yeah, only one edge node runs to completion, and the other edge node is stuck in an error state. But there is no aggregated model output on the cloud node. I don't know what's wrong with this demo, as the logs seem OK.
It has already taken us almost 2 weeks to test these demos. Now we are trying to integrate Sedna into our AIoT project, and we are facing the following issues:

  1. How to integrate Sedna into an AI RESTful web service framework like Flask or FastAPI?
  2. How to run this Sedna AI service so that it can be deployed on a K8s + KubeEdge + EdgeMesh + Sedna cluster?
  3. How to observe the running progress and show the output to a customer, who can then verify that federated learning, cloud-edge co-inference, or incremental learning works?

In short, are there any production-level use cases of Sedna?

@tangming1996
Contributor

@victorming666 Cloud-side aggregation is triggered only after the models on all edge nodes have been trained successfully. Because some nodes in your environment have problems and cannot upload their models to the cloud, the cloud cannot complete the aggregation process.
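
The behaviour described here, where nothing is aggregated until every edge node has reported, can be illustrated with a toy synchronous aggregator. This is not Sedna's aggregation code: the class name `SyncAggregator`, the client IDs, and the choice of sample-weighted averaging are assumptions made for this sketch.

```python
from typing import Dict, List, Optional, Tuple


class SyncAggregator:
    """Toy aggregator that averages only once every expected client has reported."""

    def __init__(self, expected_clients: List[str]) -> None:
        self.expected = set(expected_clients)
        self.updates: Dict[str, Tuple[List[float], int]] = {}

    def submit(self, client_id: str, weights: List[float],
               num_samples: int) -> Optional[List[float]]:
        if client_id not in self.expected:
            raise ValueError(f"unexpected client: {client_id}")
        self.updates[client_id] = (weights, num_samples)
        # Barrier: keep waiting while any expected client is missing.
        if set(self.updates) != self.expected:
            return None
        total = sum(n for _, n in self.updates.values())
        # Sample-weighted average of each weight position across clients.
        averaged = [
            sum(w[i] * n for w, n in self.updates.values()) / total
            for i in range(len(weights))
        ]
        self.updates.clear()  # start a fresh round
        return averaged


agg = SyncAggregator(["edge1", "edge2"])
print(agg.submit("edge1", [1.0, 2.0], 10))  # None: edge2 has not uploaded yet
print(agg.submit("edge2", [3.0, 4.0], 30))  # [2.5, 3.5]
```

If `edge2` never manages to upload, `submit` keeps returning `None`, which matches the symptom of no aggregated model appearing on the cloud node.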

@SherlockShemol
Contributor

SherlockShemol commented Jan 9, 2025

@victorming666
video: https://www.bilibili.com/video/BV1hg4y1b78L
article: https://github.com/jaypume/article/blob/main/sedna/%E8%BE%B9%E4%BA%91%E5%8D%8F%E5%90%8CAI%E6%A1%86%E6%9E%B6Sedna%E6%BA%90%E7%A0%81%E8%A7%A3%E6%9E%90/README.MD

This is a public lecture by Mr. jaypume that gives an overview of Sedna. If you want to create applications with Sedna, you can go directly to the Sedna Lib Source Code Analysis section. Hope it helps.

@victorming666
Author

@SherlockShemol Thank you very much! We are digging into the Sedna code to figure out how to integrate it into an AIoT project. This helps a lot.
