python/go grpc interface for async uploads #408
go/pkg/project/project.go
		return nil, errors.IncompatibleRepositoryVersion(p.repository.RootURL())
	}

	hostIP, err := localIP()
This changes behavior, doesn't it? IIRC this got bumped because we wanted to come up with some sensible behavior for localhost. See #203.
The problem in the previous review still applies, I think. The main problem, off the top of my head, is that experiments run on your laptop will have different hosts when you have different local IPs, which is very odd behavior. The "HOST" column will suddenly appear when you connect to a different network or get a new DHCP lease.
Maybe we should make RFC 1918 addresses the blank string until we come up with a better solution?
I think this function will always return local addresses, so maybe the solution is to just blank it out for now. It's a shame, since it's useful information when you're running on multiple hosts, but I can't think of a good solution.
Broadly looks good! I realize there is a lot of unfinished stuff in here so I won't review in any detail.
A few high-level thoughts:
- Have you thought about how the user interface might work? Maybe some of it is in here, but I can't see it. I'm thinking of things like how errors are displayed, what is printed when the training process has finished and things are still uploading, etc.
- This is also currently unresolved, but have you put any more thought into partial writes? This might become more of an issue if things are done in the background. Checking out a partially written checkpoint might be particularly destructive (you could lose your current work, and the checked out stuff is corrupted!)
- We need to make sure there's a bit of developer documentation, otherwise it is going to be very hard for people to add, e.g., a new bit of metadata to an experiment.
Signed-off-by: Andreas Jansson <[email protected]>
What I as a user want from the logs is to know that something successfully finished, not that it started. I want to trust that Replicate has uploaded my data and that the checkpoint is consistent. I agree that it reads a little strange how messages show up out of order, but since uploads happen asynchronously I'm expecting that. In a sense it's nice, because it tells me that Replicate isn't blocking my training loop. Putting the step number in the Replicate log message would make that abundantly clear.
On the other hand, you could argue that if Replicate prints nothing when you create a checkpoint, it looks broken. The broader point is that this is a change in behavior, and I don't think we should change behavior. The old behavior didn't print a message on success either.
There is a change in the logic though, in that checkpoints are now uploaded in the background. I think that should be reflected in the log output.
TODO, discussed on Zoom: copy to a temp directory before uploading (and block while copying).
I have created a bunch of issues for things mentioned in this PR. They are mentioned in the reference messages above.
For future reference, this was fixed in https://github.com/replicate/replicate/pull/464
Removes duplicate logic in Python, reusing the Go implementation via a gRPC API.
Apologies for this massive PR. I couldn't see a way of splitting it up, since it touches everything.
Closes #317
Closes #344