feat: w3compute protocol #110

Open: wants to merge 3 commits into base: main

vasco-santos
Contributor

Adds compute/* protocol allowing an implementer to perform simple computations over data on behalf of an issuer.
This aims to enable clients to hire compute services to delegate some work, as well as for w3-up platform to hire third party compute services to verify client side offered computations if desirable.

The main goal of this initial proposal is to compute PieceCidV2. We want to move from the current state, where we have a centralized PieceCid computation on Bucket Event, to something that is not tightly coupled with either the data write trigger or the data location. However, this can also be applied to other future computations, such as computing indexes for given content.

Of course, an implementer can have a custom resolution implementation for:

  • finding the data close to where it runs, such as preferring a Cloudflare location if running in CF Workers
  • finding the cheapest way to read the data (e.g. try Roundabout)

Note that the discovery process by actors looking for services providing given computations is for now out of scope of this spec. But, this is something I would like to see later on, as this could open the doors of our protocol to run in multiple places, such as Filecoin Station. Computations are the easiest path to try decentralizing the service to run anywhere.

@vasco-santos force-pushed the feat/w3-compute-protocol branch from 99a26e4 to b454b10 on February 1, 2024 13:43
@olizilla
Contributor

olizilla commented Feb 5, 2024

Can you add more context on the cost of computing the PieceCID? The bias I have that this spec needs to add some words on to overcome is that I think it is very cheap to calculate the PieceCID if you have the CAR bytes.

I believe (unverified, bias alert) that it's more expensive to move the CAR than it is to calculate the PieceCID. I also have reservations about the overhead of creating and signing additional UCANs vs just calculating the piece cid locally.

This spec would be more compelling if it made the case about "it's very important to have a trusted PieceCID for each CAR early in the pipeline, as it's more expensive if we create a ~32GiB aggregate and then find that one of the PieceCIDs was wrong." Perhaps a client has to calculate it themselves and then invoke compute/piececid to get a second opinion and provide both signatures as evidence?

Can we calculate that the sum of the costs of repeated pieceCID calculations is less than re-building an aggregate if we find a bad one?

@vasco-santos
Contributor Author

vasco-santos commented Feb 5, 2024

> Can you add more context on the cost of computing the PieceCID? The bias I have that this spec needs to add some words on to overcome is that I think it is very cheap to calculate the PieceCID if you have the CAR bytes.
>
> I believe (unverified, bias alert) that it's more expensive to move the CAR than it is to calculate the PieceCID. I also have reservations about the overhead of creating and signing additional UCANs vs just calculating the piece cid locally.

As you say, it is relatively cheap to calculate the PieceCID if you have the CAR bytes, and likely more expensive to move the CAR bytes. Calculating the PieceCID locally already happens and will continue to happen, as this is the trigger that kicks off the pipeline. As the w3-filecoin spec mentions, the PieceCID MAY be recomputed for validation, or not; that is an implementation detail, not a requirement.

> This spec would be more compelling if it made the case about "it's very important to have a trusted PieceCID for each CAR early in the pipeline, as it's more expensive if we create a ~32GiB aggregate and then find that one of the PieceCIDs was wrong."

This is a protocol spec, more specifically on how to ask a third party service for computations. As a first provider capability, it can execute pieceCid computations on behalf of others. It was not written to convince any implementer to use it instead of whatever they may do today. Therefore, I would say this is out of scope for a spec and belongs in implementation documentation. What do you think?

For completeness (this is also covered in the implementation proposal document previously shared), the intention today is to have clients submit the Piece computation and to have us, as Storefront, validate it. This already happens indirectly today, as the user submission of a Piece is a NOP detached from the flow of the Bucket event that computes the PieceCID. Moreover, today we decided that Storefront (w3up) MUST validate pieces. This may change in the future, but current product requirements, together with using Spade and the typical SP flow, make a validation process essential. The main reasons are:

  • Product has a strong top level requirement for SLA for an ingested piece of content to land into a SP in less than 72H
  • Spade today has no SLA guarantees
  • We rely on aggregation per FIP 0069. In short, this means that if a wrong pieceCid for a given contentCid gets into the pipeline, the whole 32GiB aggregate will be invalid. All the "good" pieces get delayed (possibly past the SLA) and need to go back into the aggregation queue.
  • There is no spec'ed Report API from SPs for when a given Aggregate has a problem. There are some alternatives that some SPs use, but they are not required today. This leaves the w3filecoin pipeline completely blind to why a given Aggregate may fail, apart from an alert if it did not get on chain within some alerting threshold. It is also not possible to query for a state or an error case.
  • Per the above limitations, w3filecoin still does not have an implementation for retries.
  • With the current throughput of ~3 hours per created Aggregate, it is an easy attack vector for a bad actor to submit small invalid Pieces every couple of hours and completely stall the pipeline

> Perhaps a client has to calculate it themselves and then invoke compute/piececid to get a second opinion and provide both signatures as evidence?

It is not a client problem if their claim is bad. They can just hire a malicious computer anyway... It must be the service that decides who it trusts, if the service can be penalized for malicious content being broadcast. As previously stated, the client will already compute the PieceCid themselves on upload, when they have the bytes. They will not send the bytes with the filecoin/offer, just pointers (CIDs) to the data. Storefront MAY decide to grab those and validate them, hire a third party validator, or even run a validator ourselves.
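
The validation option described above, where the Storefront checks a claimed piece before it reaches aggregation, could be sketched roughly as follows. The `Offer` shape, `partitionOffers`, and the injected `computePiece` function are hypothetical names used only for illustration; a real validator would derive the PieceCID from the CAR bytes, either locally or through a hired compute service.

```typescript
// Hypothetical shape of an offer: pointers (CIDs) only, no bytes.
interface Offer {
  content: string // CID of the uploaded CAR
  piece: string // PieceCID claimed by the client
}

// Injected validator (assumption): returns the recomputed PieceCID, or
// undefined when the content cannot be found. How it reads the bytes is
// left to the implementer.
type ComputePiece = (content: string) => string | undefined

// Split offers into those safe to queue for aggregation and those rejected,
// so that a single bad claim never ends up inside an aggregate.
function partitionOffers(offers: Offer[], computePiece: ComputePiece) {
  const accepted: Offer[] = []
  const rejected: Offer[] = []
  for (const offer of offers) {
    const actual = computePiece(offer.content)
    if (actual !== undefined && actual === offer.piece) {
      accepted.push(offer)
    } else {
      rejected.push(offer)
    }
  }
  return { accepted, rejected }
}
```

Whether `computePiece` runs in-process or wraps a `compute/piececid` invocation to a third party is exactly the trust decision that stays with the service.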

> Can we calculate that the sum of the costs of repeated pieceCID calculations is less than re-building an aggregate if we find a bad one

I think reasons were clear before, but in short:

  • There is no visibility today into which aggregates failed, or why they failed.
  • Waiting for a 72H timeout before looking into it will break the SLA
  • Beyond breaking the SLA, there is no info on why it failed. So we would have to download the 32GiB, validate every single piece, put everything back into the aggregation queue, and hope there won't be one more bad PieceCid

@Gozala
Collaborator

Gozala commented Feb 5, 2024

Here are a few links to some prior ideas I had on a similar subject

https://gozala.io/workspace/#/page/w3-machine
https://hackmd.io/@gozala/invocation-router
https://github.com/web3-storage/RFC/pull/3/files

I think this doc is out of date, but IPVM had some relevant work also
https://github.com/ipvm-wg/workflow/blob/v0.1/README.md


The `w3-compute` protocol aims to enable clients to hire compute services to delegate some work, as well as for `w3-up` platform to hire third party compute services to verify client side offered computations if desirable.

Note that the discovery process by actors looking for services providing given computations is for now out of scope of this spec.

Collaborator

I have been trying to conceptualize w3 space as namespace where you can install various capabilities through subscriptions. Think of it as installing a software on your machine, it gets to read / write data on disk and use computational resources and you get to use the software and pay either subscription or one time fee it was sold for.

In this conceptual model, provider/add is the way to install a service providing a set of capabilities. Once it is installed, capabilities that are invoked get handled by that service. More details are here https://github.com/web3-storage/specs/blob/main/w3-provider.md#provider-add

One subtle nuance here is that invocation audience is meant to be a service provider DID that will handle invoked capability, that way you could have multiple services providing same capabilities installed.

I think it would be nice if we could reconcile this proposal with that conceptual model. In that case, piece CID compute would be a capability provided by a service that could be installed in your space. It might also be interesting to consider a version where you don't have to install a provider, but instead delegate them access to a resource in your space so they could run compute over it, but that would probably require a bit more thought.
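
One way to picture the provider model described here is a space that keeps a table of installed providers and routes each invocation by its audience DID. This is only a sketch under assumptions: the `Space` class, its `install` and `invoke` methods, and the example DIDs are hypothetical and do not reflect the actual `provider/add` implementation.

```typescript
type Result = { ok: Record<string, unknown> } | { error: { message: string } }

interface Invocation {
  aud: string // DID of the provider expected to handle the capability
  can: string // capability name, e.g. "compute/piececid"
  nb: Record<string, unknown>
}

type Handler = (invocation: Invocation) => Result

class Space {
  // provider DID -> capability name -> handler
  private providers = new Map<string, Map<string, Handler>>()

  // Stands in for provider/add: install a provider and what it serves.
  install(providerDid: string, handlers: Record<string, Handler>): void {
    this.providers.set(providerDid, new Map(Object.entries(handlers)))
  }

  // Route by audience, so several providers may serve the same capability.
  invoke(invocation: Invocation): Result {
    const handler = this.providers.get(invocation.aud)?.get(invocation.can)
    if (!handler) {
      const message = `${invocation.aud} does not provide ${invocation.can}`
      return { error: { message } }
    }
    return handler(invocation)
  }
}
```

Because routing keys on `aud` rather than on the capability name alone, two installed providers can both offer `compute/piececid` without ambiguity.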

Contributor Author

Good context. While I see that what you say may make sense, I am not sure whether this would be the use case. I think that would make sense in the context of paying to compute block level indexes, or whatever other thing you would like as a user. In this case, I think it is different because it can even be a service provider wanting to run it for validation, or the w3filecoin pipeline deciding whether it trusts a user computed piece. Therefore, it is not in direct contact with a user, or with a space where something runs.

With the above, I don't know how we should proceed to accommodate both angles. Probably too early to have this discussion and we should just compute pieces out of band until we have this

Collaborator

@Gozala Gozala left a comment

@mikeal and I have previously mused on the idea of deploying functions and invoking them. https://gozala.io/workspace/#/page/w3-machine

The general idea then was that you could effectively deploy a function as HTTP POST and make it available for invocation by deploying it on web3.storage. When invoked, we'd POST arguments in a CAR and let the endpoint do its work.

From today's perspective (and specifically for piece compute) I would probably reconsider posting CARs and would instead post S3 or R2 pre-signed URLs for input and output. I wonder if exploring this from that perspective might be more interesting. I would also perhaps presume that the HTTP endpoint is a ucanto endpoint to keep things simpler, and we could make some ucanto server wrapper that knows to read args from a URL and write results to a URL, as opposed to assuming request in, response out.
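
A rough sketch of the "read args from URL, write results to URL" wrapper idea follows. Everything named here (`UrlTask`, `Transport`, `runUrlTask`, the `input` and `output` fields) is a hypothetical illustration, not ucanto API; the transport is injected so the sketch does not depend on a particular HTTP client or on real pre-signed URLs.

```typescript
// Hypothetical task shape: pre-signed URLs instead of inline CAR bytes.
interface UrlTask {
  can: string
  nb: { input: string; output: string }
}

// Injected I/O (assumption): in practice this might wrap fetch GET and PUT
// against S3 or R2 pre-signed URLs.
interface Transport {
  get(url: string): Promise<Uint8Array>
  put(url: string, body: Uint8Array): Promise<void>
}

type Compute = (input: Uint8Array) => Uint8Array

type UrlTaskResult = { ok: { output: string } } | { error: { message: string } }

// Read the input from its URL, run the computation, write the result to the
// output URL, and report only where the result was written.
async function runUrlTask(
  task: UrlTask,
  compute: Compute,
  io: Transport
): Promise<UrlTaskResult> {
  try {
    const input = await io.get(task.nb.input)
    await io.put(task.nb.output, compute(input))
    return { ok: { output: task.nb.output } }
  } catch (cause) {
    return { error: { message: String(cause) } }
  }
}
```

The response stays small regardless of payload size, since only the output location travels back to the caller.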

"can": "compute/piececid",
"nb": {
/* CID of the uploaded content */
"content": { "/": "bag...car" }

Collaborator

I think this should be a link to a UCAN that gives you read access to the content. While we do allow public reads right now, this seems like overkill. However, if reads are charged and potentially accelerated, it would make a lot of sense to pass the "readable resource" itself, which in the UCAN world would be a capability on a resource giving you read access.

Collaborator

I often like to think of capability groups like TS interfaces with some methods e.g. upload/* is something like

interface Upload {
   add(upload): Promise<Result<...>>
   list(): Promise<....>
   ....
}

When I delegate access I give you reference to either Upload instance or just selected methods from it.
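
The analogy can be made concrete with a small TypeScript sketch. The simplified `Upload` signatures, `createUpload`, and `delegateList` are assumptions for illustration and are not the real `upload/*` capability definitions; the point is only that a delegation hands over a narrower reference.

```typescript
// Simplified stand-in for the capability group (signatures are assumptions).
interface Upload {
  add(root: string): void
  list(): string[]
}

// A full Upload "instance" backed by an in-memory list.
function createUpload(): Upload {
  const roots: string[] = []
  return {
    add: (root) => {
      roots.push(root)
    },
    list: () => roots.slice(),
  }
}

// Delegating "just selected methods": the recipient only ever sees list().
type UploadList = Pick<Upload, "list">

function delegateList(upload: Upload): UploadList {
  return { list: () => upload.list() }
}
```

The holder of an `UploadList` can observe uploads but has no reference through which `add` could be called, which mirrors delegating `upload/list` without `upload/add`.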

Contributor Author

I think here is where we will need to map lots of assumptions around what the requirements and needs are. I have seen this as a service that can choose where to read from according to its preferences. If I run in CF I prefer to read from R2, but the service can be deployed anywhere. So it would not really be up to the caller to infer that.

Probably we need to consider better what we want and what are the requirements from perf and cost before deciding how these flows would be


### Compute PieceCID Failure

The service MAY fail the invocation if the linked `content` is not found. Implementer can rely on IPFS gateways, location claims or any other service to try to find the CAR bytes.

Collaborator

If you delegate read access to the content this problem goes away, it may only fail if authorization was invalid or expired.

Contributor Author

It may also fail if the delegation is for non-existent content, which is more what I meant here. I would like to go in your direction, but when we put costs and efficiency into the equation, the choice of where to read from should probably be made by the service and not by the user. Unless, of course, they give several UCANs for where to read from
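
A minimal sketch of service-side source selection, assuming the service ranks its own candidate sources and only fails once all of them miss. The `Source` interface, the `cost` field, `readContent`, and the `ContentNotFound` error name are all hypothetical; the spec text only says the service MAY fail when the linked `content` is not found.

```typescript
type ReadResult =
  | { ok: Uint8Array }
  | { error: { name: "ContentNotFound"; content: string } }

// A candidate place to read CAR bytes from (gateway, R2, location claims...).
interface Source {
  name: string
  cost: number // relative cost as seen by this deployment; lower is preferred
  read(content: string): Uint8Array | undefined // undefined when not found
}

// Try sources from cheapest to most expensive; fail only if all of them miss.
function readContent(content: string, sources: Source[]): ReadResult {
  const byPreference = sources.slice().sort((a, b) => a.cost - b.cost)
  for (const source of byPreference) {
    const bytes = source.read(content)
    if (bytes) return { ok: bytes }
  }
  return { error: { name: "ContentNotFound", content } }
}
```

A deployment running in CF Workers would simply give its R2 source the lowest cost, while the caller's invocation stays the same.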

"piece": { "/": "commitment...car" }
}
}
}

Collaborator

This seems reasonable, but it is currently tricky with ucanto, as it expects you to keep the connection open to respond with a result. That needs to change, but right now long running tasks will be tough.

It is also worth considering whether this thing is an atomic operation or a composite one. If the latter, it's probably better to use effects to delimit execution. In this case I'd argue it is delimited. The first step needs to read the content out, which can succeed or fail, and the second one will compute the piece from the read content. If we want to report progress between steps, employing effects is probably a good call.
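
The two-step split could be sketched with receipts that carry effects. The `Receipt` shape below (`ran`, `out`, `fx` with `fork` and `join`) is loosely modelled on the UCAN invocation draft and should be read as an assumption, and `content/read` is a hypothetical capability name for the first step.

```typescript
type Result<T> = { ok: T } | { error: { message: string } }

interface Task {
  can: string
  nb: Record<string, unknown>
}

interface Receipt {
  ran: Task
  out: Result<Record<string, unknown>>
  // Effects: follow-up work the task asks for (forks are ignored in this sketch).
  fx: { fork: Task[]; join?: Task }
}

type Handler = (task: Task) => Receipt

// Run a task, then keep following `join` effects while each step succeeds.
// Every step yields its own receipt, so progress is observable between steps.
// Assumes handlers do not produce a cycle of joins.
function runToCompletion(
  task: Task,
  handlers: { [can: string]: Handler | undefined }
): Receipt[] {
  const receipts: Receipt[] = []
  let next: Task | undefined = task
  while (next) {
    const handler = handlers[next.can]
    if (!handler) {
      const message = "no handler for " + next.can
      receipts.push({ ran: next, out: { error: { message } }, fx: { fork: [] } })
      break
    }
    const receipt = handler(next)
    receipts.push(receipt)
    next = "ok" in receipt.out ? receipt.fx.join : undefined
  }
  return receipts
}
```

With this shape a failed read ends the chain with a single receipt, while a successful read produces a second receipt for the piece computation.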

Contributor Author

yeah, I think here we would likely need to design with effects


The `compute/` namespace contains capabilities relating to computations.

## `compute/*`

Collaborator

I'm not sure compute namespace in itself makes sense. It is clear that unlike many other capabilities it is a pure computation as it has no side effects, yet not sure if namespace like this makes sense.

I would personally have added capability into the filecoin namespace somewhere and said that various providers could implement it.

Contributor Author

Fair point, I actually started there. But then I was thinking more from the angle of having services that do compute spec'ed together. I am starting to see a lot of specifics with each type of computation, though, which may make that difficult
