
CDN-friendly HTTP API #20

Closed
mildbyte opened this issue Jul 19, 2022 · 11 comments · Fixed by #31

@mildbyte
Contributor

Implement a read-only HTTP API that obeys HTTP cache semantics and can benefit from any CDN / cache like Varnish:

Client sends a query like:

GET /q/[SHA hash of the query text]
X-Seafowl-Query: [actual query text]

The server receives the query, checks that it's a SELECT query and that the hash matches the query text, then executes it and sends the results back with an ETag that is a function of the versions of all tables that participate in the query:

200 OK
[query results]
ETag: [hash of the version of the table(s)]
Cache-Control: max-age=3600 [optional]

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control#directives
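The client-side flow above can be sketched in Python. The `X-Seafowl-Query` header and `/q/[hash]` path are from this proposal; the exact hash algorithm and any query normalization are assumptions (the hashes in the logs below are 64 hex characters, consistent with SHA-256):

```python
import hashlib


def query_hash(sql):
    """SHA-256 hex digest of the raw query text (no normalization -- an assumption)."""
    return hashlib.sha256(sql.encode("utf-8")).hexdigest()


def build_get_request(base_url, sql):
    """Return the (url, headers) pair for the cache-friendly GET endpoint."""
    url = f"{base_url}/q/{query_hash(sql)}"
    headers = {"X-Seafowl-Query": sql}
    return url, headers


url, headers = build_get_request("http://localhost:80", "SELECT 1;")
```

Whether the server hashes the query verbatim or strips whitespace/trailing semicolons first would need to match between client and server, since a mismatch produces a 400 (as in the transcript below).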

When the client's browser queries the data again, it can then pass the ETag to see if the query results have changed:

GET /q/[SHA hash of the query text]
X-Seafowl-Query: [actual query text]
If-None-Match: [etag]

Intermediate caches/CDNs might not even forward the query to the origin, instead serving it from cache if it's not stale. If it reaches the server, the server can cheaply revalidate the entry by recomputing the etag and responding with a 304 Not Modified if the tables in the query haven't changed their versions (without having to execute the query).

This only works for SELECT queries. INSERT and other writes should come in via POST requests.
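A revalidating client could look like this (hypothetical helper names; the 304 handling is standard HTTP conditional-request semantics rather than anything Seafowl-specific):

```python
def conditional_headers(sql, etag=None):
    """Headers for a revalidating GET: the query itself, plus If-None-Match
    when we hold an ETag from a previous response."""
    headers = {"X-Seafowl-Query": sql}
    if etag is not None:
        headers["If-None-Match"] = etag
    return headers


def handle_response(status, body, cache, key):
    """On 304 Not Modified, reuse the cached body; on 200, update the cache."""
    if status == 304:
        return cache[key]
    cache[key] = body
    return body
```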

@milesrichardson

Putting the query text in a header value X-Seafowl-Query looks really weird and unexpected. But I guess it's probably not much different from including the whole query in the URL. Some intermediate caches might have different length limits for headers and URL. Dunno.

I understand the reason for it is to retain all the benefits that come along with the caching semantics of GET. Is that the only benefit? Because if so, maybe we can include a payload in the body of the GET request. Interestingly, it's not technically a violation of the HTTP spec for a client to send a GET request with a payload:

A payload within a GET request message has no defined semantics; sending a payload body on a GET request might cause some existing implementations to reject the request.

So, it's unexpected, and therefore you probably couldn't expect an intermediate cache to respect it. But AFAICT, we don't expect the intermediate cache to do anything with the query text anyway, right? Or is it necessary because we want to Vary: X-Seafowl-Query?

That Stack Overflow discussion also points out that Elasticsearch uses GET with a request body. You can read their justification for it in the Elastic docs.

@mildbyte mildbyte self-assigned this Jul 26, 2022
@mildbyte
Contributor Author

I got this working with Varnish (see https://github.com/splitgraph/seafowl/blob/20-cdn-friendly-http-api/src/http.rs#L109):

Setup: Varnish running on :80, proxying to Seafowl

vcl 4.1;

backend seafowl {
  .host = "127.0.0.1";
  .port = "3030";
}

sub vcl_recv {
  if (req.method != "GET" && req.method != "HEAD") {
    /* We only deal with GET and HEAD by default */
    return (pass);
  }
}

sub vcl_backend_response {
  set beresp.ttl = 10s;
  set beresp.grace = 1h;
  if (beresp.status >= 400) {
    set beresp.ttl = 0s;
    set beresp.grace = 0s;
  }
}

Attempt 1: query doesn't match the hash

$ curl -v -H "X-Seafowl-Query: SELECT 1;" \
    http://localhost:80/q/e0bacdcc01df9f006b491d116954ed55006b51fd540c9bad61ff2c639bd11593

*   Trying 127.0.0.1:80...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET /q/e0bacdcc01df9f006b491d116954ed55006b51fd540c9bad61ff2c639bd11593 HTTP/1.1
> Host: localhost
> User-Agent: curl/7.68.0
> Accept: */*
> X-Seafowl-Query: SELECT 1;
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 400 Bad Request
< content-type: text/plain; charset=utf-8
< content-length: 13
< date: Tue, 26 Jul 2022 11:51:20 GMT
< X-Varnish: 2
< Age: 0
< Via: 1.1 varnish (Varnish/6.0)
< Connection: keep-alive
< 
* Connection #0 to host localhost left intact

#  DEBUG seafowl::http                         > Received query: SELECT 1;, URL hash e0bacdcc01df9f006b491d116954ed55006b51fd540c9bad61ff2c639bd11593, actual hash 17db4fd369edb9244b9f91d9aeed145c3d04ad8ba6e95d06247f07a63527d11a

Correct hash

$ time curl -v \
    -H "X-Seafowl-Query: SELECT exporter, port, SUM(volume) FROM supply_chains_copy GROUP BY 1, 2 ORDER BY 3 DESC LIMIT 10;"\
    http://localhost:80/q/e0bacdcc01df9f006b491d116954ed55006b51fd540c9bad61ff2c639bd11593
*   Trying 127.0.0.1:80...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET /q/e0bacdcc01df9f006b491d116954ed55006b51fd540c9bad61ff2c639bd11593 HTTP/1.1
> Host: localhost
> User-Agent: curl/7.68.0
> Accept: */*
> X-Seafowl-Query: SELECT exporter, port, SUM(volume) FROM supply_chains_copy GROUP BY 1, 2 ORDER BY 3 DESC LIMIT 10;
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-type: application/octet-stream
< etag: 589ffd3acf522ae81192579f297589e93b0c41f748b224d1b4bb5c857a2f70cb
< content-length: 1017
< date: Tue, 26 Jul 2022 11:51:52 GMT
< X-Varnish: 32770
< Age: 0
< Via: 1.1 varnish (Varnish/6.0)
< Accept-Ranges: bytes
< Connection: keep-alive
< 
{"port":"UNKNOWN PORT BRAZIL","SUM(supply_chains_copy.volume)":347063571.00081986}
{"SUM(supply_chains_copy.volume)":200905169.44735152}
{"exporter":"RAIZEN ENERGIA S.A","port":"SANTOS","SUM(supply_chains_copy.volume)":76494596.85595703}
{"exporter":"ADM DO BRASIL","port":"SANTOS","SUM(supply_chains_copy.volume)":46015909.2426928}
{"exporter":"BUNGE ALIMENTOS S/A","port":"SANTOS","SUM(supply_chains_copy.volume)":35109151.771828905}
{"exporter":"VICENTIN SAIC.","port":"SAN LORENZO","SUM(supply_chains_copy.volume)":35088856.436010584}
{"exporter":"COOPERATIVA DE PRODUTORES DE CANA-DE-ACUCAR, ACUCAR E A","port":"SANTOS","SUM(supply_chains_copy.volume)":34715153.24749756}
{"exporter":"CARGILL AGRICOLA SA","port":"SANTOS","SUM(supply_chains_copy.volume)":34151079.2334495}
{"exporter":"USINA DE ACUCAR SANTA TEREZINHA LTDA","port":"PARANAGUA","SUM(supply_chains_copy.volume)":31801358.93359375}
{"exporter":"LOUIS DREYFUS COMPANY BRASIL S.A.","port":"SANTOS","SUM(supply_chains_copy.volume)":30384562.039429814}
* Connection #0 to host localhost left intact


#  DEBUG seafowl::http                                   > Received query: SELECT exporter, port, SUM(volume) FROM supply_chains_copy GROUP BY 1, 2 ORDER BY 3 DESC LIMIT 10;, URL hash e0bacdcc01df9f006b491d116954ed55006b51fd540c9bad61ff2c639bd11593, actual hash e0bacdcc01df9f006b491d116954ed55006b51fd540c9bad61ff2c639bd1159
#  DEBUG seafowl::http                         > Extracted table versions: [7]
#  DEBUG seafowl::http                         > ETag: 589ffd3acf522ae81192579f297589e93b0c41f748b224d1b4bb5c857a2f70cb, if-none-match header: None
# 

Same curl again (Varnish is within the grace period, so it serves the cached result immediately but sends a background request to Seafowl, passing the ETag even though we didn't supply one):

< etag: 589ffd3acf522ae81192579f297589e93b0c41f748b224d1b4bb5c857a2f70cb
< content-length: 1017
< date: Tue, 26 Jul 2022 11:51:52 GMT
< X-Varnish: 32773 32771
< Age: 69
< Via: 1.1 varnish (Varnish/6.0)
< Accept-Ranges: bytes
< Connection: keep-alive
< 
...
real    0m0.009s
user    0m0.000s
sys     0m0.008s


#  DEBUG seafowl::http                                   > Extracted table versions: [7]
#  DEBUG seafowl::http                                   > ETag: 589ffd3acf522ae81192579f297589e93b0c41f748b224d1b4bb5c857a2f70cb, if-none-match header: Some("589ffd3acf522ae81192579f297589e93b0c41f748b224d1b4bb5c857a2f70cb")

curl with a changed ETag -- note that Varnish ignores the ETag we pass and serves from its cache:

$ time curl -v\
    -H "X-Seafowl-Query: SELECT exporter, port, SUM(volume) FROM supply_chains_copy GROUP BY 1, 2 ORDER BY 3 DESC LIMIT 10;"\
    -H "If-None-Match: invalid-etag"\
    http://localhost:80/q/e0bacdcc01df9f006b491d116954ed55006b51fd540c9bad61ff2c639bd11593

*   Trying 127.0.0.1:80...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET /q/e0bacdcc01df9f006b491d116954ed55006b51fd540c9bad61ff2c639bd11593 HTTP/1.1
> Host: localhost
> User-Agent: curl/7.68.0
> Accept: */*
> X-Seafowl-Query: SELECT exporter, port, SUM(volume) FROM supply_chains_copy GROUP BY 1, 2 ORDER BY 3 DESC LIMIT 10;
> If-None-Match: invalid-etag
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-type: text/plain; charset=utf-8
< date: Tue, 26 Jul 2022 11:53:02 GMT
< etag: 589ffd3acf522ae81192579f297589e93b0c41f748b224d1b4bb5c857a2f70cb
< content-length: 1017
< X-Varnish: 8 32774
< Age: 69
< Via: 1.1 varnish (Varnish/6.0)
< Accept-Ranges: bytes
< Connection: keep-alive
< 
{"port":"UNKNOWN PORT BRAZIL","SUM(supply_chains_copy.volume)":347063571.00081986}
{"SUM(supply_chains_copy.volume)":200905169.44735152}
{"exporter":"RAIZEN ENERGIA S.A","port":"SANTOS","SUM(supply_chains_copy.volume)":76494596.85595703}
{"exporter":"ADM DO BRASIL","port":"SANTOS","SUM(supply_chains_copy.volume)":46015909.2426928}
{"exporter":"BUNGE ALIMENTOS S/A","port":"SANTOS","SUM(supply_chains_copy.volume)":35109151.771828905}
{"exporter":"VICENTIN SAIC.","port":"SAN LORENZO","SUM(supply_chains_copy.volume)":35088856.436010584}
{"exporter":"COOPERATIVA DE PRODUTORES DE CANA-DE-ACUCAR, ACUCAR E A","port":"SANTOS","SUM(supply_chains_copy.volume)":34715153.24749756}
{"exporter":"CARGILL AGRICOLA SA","port":"SANTOS","SUM(supply_chains_copy.volume)":34151079.2334495}
{"exporter":"USINA DE ACUCAR SANTA TEREZINHA LTDA","port":"PARANAGUA","SUM(supply_chains_copy.volume)":31801358.93359375}
{"exporter":"LOUIS DREYFUS COMPANY BRASIL S.A.","port":"SANTOS","SUM(supply_chains_copy.volume)":30384562.039429814}
* Connection #0 to host localhost left intact

real    0m0.008s
user    0m0.007s
sys     0m0.000s

Update the dataset in Seafowl by inserting an empty row into it:

seafowl=> SELECT COUNT(*) FROM supply_chains_copy;
 COUNT(UInt8(1)) 
-----------------
         2989191
(1 row)

seafowl=> INSERT INTO supply_chains_copy(exporter_id) VALUES(999999);
--
(0 rows)

seafowl=> SELECT COUNT(*) FROM supply_chains_copy;
 COUNT(UInt8(1)) 
-----------------
         2989192
(1 row)

Make the same request again to Varnish (twice, since the first time the reply is still within grace):

$ time curl -v     -H "X-Seafowl-Query: SELECT exporter, port, SUM(volume) FROM supply_chains_copy GROUP BY 1, 2 ORDER BY 3 DESC LIMIT 10;"    http://localhost:80/q/e0bacdcc01df9f006b491d116954ed55006b51fd540c9bad61ff2c639bd11593
*   Trying 127.0.0.1:80...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET /q/e0bacdcc01df9f006b491d116954ed55006b51fd540c9bad61ff2c639bd11593 HTTP/1.1
> Host: localhost
> User-Agent: curl/7.68.0
> Accept: */*
> X-Seafowl-Query: SELECT exporter, port, SUM(volume) FROM supply_chains_copy GROUP BY 1, 2 ORDER BY 3 DESC LIMIT 10;
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-type: application/octet-stream
< etag: 3db4e2d15128484c18e33cde50510caf0a36044278b0f9e265f49f31e9bb1ee0
< content-length: 1017
< date: Tue, 26 Jul 2022 11:55:23 GMT
< X-Varnish: 98306 65539
< Age: 1
< Via: 1.1 varnish (Varnish/6.0)
< Accept-Ranges: bytes
< Connection: keep-alive
< 
{"port":"UNKNOWN PORT BRAZIL","SUM(supply_chains_copy.volume)":347063571.00081986}
{"SUM(supply_chains_copy.volume)":200905169.44735152}
{"exporter":"RAIZEN ENERGIA S.A","port":"SANTOS","SUM(supply_chains_copy.volume)":76494596.85595703}
{"exporter":"ADM DO BRASIL","port":"SANTOS","SUM(supply_chains_copy.volume)":46015909.2426928}
{"exporter":"BUNGE ALIMENTOS S/A","port":"SANTOS","SUM(supply_chains_copy.volume)":35109151.771828905}
{"exporter":"VICENTIN SAIC.","port":"SAN LORENZO","SUM(supply_chains_copy.volume)":35088856.436010584}
{"exporter":"COOPERATIVA DE PRODUTORES DE CANA-DE-ACUCAR, ACUCAR E A","port":"SANTOS","SUM(supply_chains_copy.volume)":34715153.24749756}
{"exporter":"CARGILL AGRICOLA SA","port":"SANTOS","SUM(supply_chains_copy.volume)":34151079.2334495}
{"exporter":"USINA DE ACUCAR SANTA TEREZINHA LTDA","port":"PARANAGUA","SUM(supply_chains_copy.volume)":31801358.93359375}
{"exporter":"LOUIS DREYFUS COMPANY BRASIL S.A.","port":"SANTOS","SUM(supply_chains_copy.volume)":30384562.039429814}
* Connection #0 to host localhost left intact

real    0m0.008s
user    0m0.000s
sys     0m0.007s

Note the changed table version and the different etag (which caused the query to execute)

#  DEBUG seafowl::http                                   > Extracted table versions: [21]
#  DEBUG seafowl::http                                   > ETag: 3db4e2d15128484c18e33cde50510caf0a36044278b0f9e265f49f31e9bb1ee0, if-none-match header: Some("589ffd3acf522ae81192579f297589e93b0c41f748b224d1b4bb5c857a2f70cb")

@mildbyte
Contributor Author

Putting the query text in a header value X-Seafowl-Query looks really weird and unexpected. But I guess it's probably not much different from including the whole query in the URL. Some intermediate caches might have different length limits for headers and URL. Dunno.
[...] maybe we can include a payload in the body of the GET request

It's a good question. For reference, I checked some header and URL length limits in CDN docs; the request body limits are much higher (e.g. 100MB on the Cloudflare Free plan). Do we trust intermediate caches to forward the GET request body more than X-... headers?

Or is it necessary because we want to Vary: X-Seafowl-Query?

The current implementation doesn't do that, but there is a case where, if the query doesn't match the hash in the URL, it returns a 400 error. Varnish seems to cache this by default (poisoning that URL hash). I had to explicitly tell it not to cache 4xx responses; an alternative is varying on X-Seafowl-Query, which we can't do if the query is in the body.

@mildbyte
Contributor Author

Here's an interesting thing I found yesterday: Cloudflare, by default, without getting (metered) page rules involved, caches assets based on their extension and doesn't cache HTML files: https://developers.cloudflare.com/cache/about/default-cache-behavior/#default-cached-file-extensions. So having a URL like /q/fairly_short_hash.(csv|bin), with the actual query in the body/headers, would make it more likely to be cached.

@mildbyte mildbyte linked a pull request Jul 27, 2022 that will close this issue
@mildbyte
Contributor Author

(keeping it open, branch merged to keep main up to date)

@mildbyte mildbyte reopened this Jul 27, 2022
@mildbyte
Contributor Author

mildbyte commented Aug 3, 2022

After sketching out the docs and thinking about what a good first tutorial would be, as well as how people would add data to Seafowl, the following looks like a decent HTTP API:

  • POST /q (SQL in query body)
    • supports all read/write queries (writes by default with a mandatory Authorization header)
    • by default uncached by CDNs/HTTP caches
    • basic intro API that doesn't require the user to hash the query
  • GET /q/[hash][.optional extension]
    • SQL in query body or as an X-Seafowl-Query header
      • both of these are hacks, depending on what the user's CDN of choice / cache passes through -- I've only tested X-Seafowl-Query on Varnish / Cloudflare.
      • The optional extension is to fool Cloudflare into caching the GET asset without using page rules.

some other considerations:

  • ETag should change if a query involves a function and that function's definition changes
  • There should be an "escape hatch" to invalidate the caches (be it via changing the ETag generation code, setting a smaller Cache-Control: max-age, etc.)
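From a client's perspective, the split above could look like this (endpoint shapes are from this comment; the helper names, the `csv` extension default, and the bearer-token scheme are illustrative assumptions):

```python
import hashlib


def read_request(base_url, sql, extension="csv"):
    """Cacheable read: GET /q/[hash].[ext] with the query in a header.
    The extension nudges extension-based CDN caching (e.g. Cloudflare)."""
    digest = hashlib.sha256(sql.encode("utf-8")).hexdigest()
    url = f"{base_url}/q/{digest}.{extension}"
    return "GET", url, {"X-Seafowl-Query": sql}, None


def write_request(base_url, sql, token):
    """Uncached write: POST /q with the SQL in the body and a mandatory
    Authorization header."""
    headers = {"Authorization": f"Bearer {token}"}
    return "POST", f"{base_url}/q", headers, sql
```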

@milesrichardson

milesrichardson commented Aug 3, 2022

Makes sense to me. That's basically in line with how Elasticsearch handles it (in the sense of offering the more "normal" API for basic curl usage). I guess the risk is that if developers write their initial hello world demo with the POST API, then that code is likely to seep into later iterations too. So we should make sure to encourage usage of the less ergonomic but more cacheable API with the hash in the URL.

We might be able to make the GET API more friendly without losing caching semantics. If the request is GET /q/SELECT%201 (i.e., the parameter is the raw query, not a hash), then we could hash the parameter and return a 301 redirect to GET /q/[hash]. This would retain all the caching semantics at the cost of an additional round-trip. But if we're encouraging it for development/debuggability purposes, that seems no worse than offering the POST API.
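Sketching the proposed redirect (hypothetical; assumes the same SHA-256 hash as the existing endpoint, with the raw query URL-encoded in the path):

```python
import hashlib
from urllib.parse import unquote


def redirect_location(raw_path_param):
    """Map the path parameter of GET /q/SELECT%201 to a Location header
    value pointing at the canonical GET /q/[hash] URL."""
    sql = unquote(raw_path_param)
    return "/q/" + hashlib.sha256(sql.encode("utf-8")).hexdigest()
```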

@mildbyte
Contributor Author

We might be able to make the GET API more friendly without losing caching semantics. If the request is GET /q/SELECT%201 (i.e., the parameter is the raw query, not a hash), then we could hash the parameter and return a 301 redirect to GET /q/[hash]

That's a decent idea actually, though the 301 would have to indicate to the requester to not only resubmit to GET /q/[hash] but also include the query as a header or GET body. Is there a way to force a 301 / 307 to do that?

@milesrichardson

I'm not aware of a way for the server to provide the client with a new body to send. But a 307 redirect will reuse the method and body of the request, so it's not perfect but it works if the client provides a copy of the query in both the body and URL parameter (or even only in the body).

@milesrichardson

There is 303 See Other (wiki), which is interesting but doesn't seem to allow changing the body of the next request. Worth a read, though.

@mildbyte
Contributor Author

I'm closing this to get the satisfaction of ticking something off and because this is now usable (see https://observablehq.com/@mildbyte/hello-seafowl). I created some follow-up issues which we might address before the first release:

#55
#56
#57
