Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

python/go grpc interface for async uploads #408

Merged
merged 22 commits into from
Jan 6, 2021

Conversation

andreasjansson
Copy link
Member

@andreasjansson andreasjansson commented Dec 15, 2020

Removes duplicate logic in Python, reusing the Go implementation via a grpc API.

Apologies for this massive PR, I couldn't see a way of splitting it up since it touches everything.

Closes #317
Closes #344

return nil, errors.IncompatibleRepositoryVersion(p.repository.RootURL())
}

hostIP, err := localIP()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is changing behavior I think isn't it? I think this got bumped because we wanted to come up with some sensible behavior for localhost, IIRC. #203

Copy link
Member

@bfirsh bfirsh Jan 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem in the previous review still applies, I think. The main problem, off the top of my head, is that experiments run on your laptop will have different hosts when you have different local IPs, which is very odd behavior. The "HOST" column will suddenly appear when you connect to a different network or get a new DHCP lease.

Maybe we should make RFC 1918 addresses the blank string until we come up with a better solution?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function always will return local addresses, so maybe the solution is to just blank it out for now. It's a shame since it's useful information when you're running on multiple hosts, but I can't think of a good solution.

Copy link
Member

@bfirsh bfirsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Broadly looks good! I realize there is a lot of unfinished stuff in here so I won't review in any detail.

A few high-level thoughts:

  • Have you thought about how the user interface might work? Maybe there is some of it in here but I can't see it obviously. I can think of things like displaying errors, what is printed when the training process is finished and things are still uploading, etc
  • This is also currently unresolved, but have you put any more thought into partial writes? This might become more of an issue if things are done in the background. Checking out a partially written checkpoint might be particularly destructive (you could lose your current work, and the checked out stuff is corrupted!)
  • We need to make sure there's a bit of developer documentation, otherwise it is going to very hard for people to add, e.g., a new bit of metadata to an experiment.

This was referenced Dec 30, 2020
@bfirsh bfirsh added this to the v0.3.0 milestone Dec 30, 2020
@andreasjansson
Copy link
Member Author

Since the data hasn't yet been uploaded before saving the checkpoint, it doesn't have much use to the user. But I see your point about it being weird the way it looks now.

What I as a user want from the logs is to know that something successfully finished, not that it started. I want to trust that Replicate has uploaded my data and that the checkpoint is consistent.

I agree that it reads a little strange how messages show up out of order, but since uploads happen asynchronously I'm expecting that. In a sense it's nice, because it tells me that Replicate isn't blocking my training loop. Putting the step number in the Replicate log message would make that abundantly clear.

Signed-off-by: Andreas Jansson <[email protected]>
Signed-off-by: Andreas Jansson <[email protected]>
@bfirsh
Copy link
Member

bfirsh commented Jan 6, 2021

What I as a user want from the logs is to know that something successfully finished, not that it started. I want to trust that Replicate has uploaded my data and that the checkpoint is consistent.

On the other hand, you could argue that by Replicate not printing anything when you create a checkpoint, it looks broken because it doesn't print anything.

The broader point is that this is a change in behavior, and I don't think we should change behavior. The old behavior didn't print a message on success either.

@andreasjansson
Copy link
Member Author

The broader point is that this is a change in behavior, and I don't think we should change behavior. The old behavior didn't print a message on success either.

There is a change in the logic though, in that checkpoints are now uploaded in the background. I think that should be reflected in the log output.

@bfirsh
Copy link
Member

bfirsh commented Jan 6, 2021

TODO, discussed on zoom: Copy to temp directory before uploading (and block)

Signed-off-by: Andreas Jansson <[email protected]>
@bfirsh
Copy link
Member

bfirsh commented Jan 11, 2021

I have created a bunch of issues for things mentioned in this PR. They are mentioned in the reference messages above.

@bfirsh
Copy link
Member

bfirsh commented Jan 11, 2021

TODO, discussed on zoom: Copy to temp directory before uploading (and block)

for future reference, this was fixed in https://github.com/replicate/replicate/pull/464

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants