
slow (or even hanging) dhall-kubernetes based generation #1960

Closed
uwedeportivo opened this issue Jul 31, 2020 · 26 comments
Labels
performance All the slow and inefficient things

Comments

@uwedeportivo

Note: possibly related to #1890

setup and steps to reproduce

  • dhall version 1.33.1
  • grab repo https://github.com/sourcegraph/deploy-sourcegraph-dhall at branch the_rest_generate
  • execute dhall --file ./generate-uwe.dhall in the root of the repo
  • execute dhall --file ./generate-plain.dhall in the root of the repo

The repo is conceptually similar to a Helm setup: in generate-uwe.dhall you define customizations that override records in src/base, while generate-plain.dhall is the same but with no overrides (empty customizations). Each subdirectory in src/base corresponds to one service, with its deployment/statefulset and everything else it needs (volume claims, config maps, RBAC roles, etc.). Each such subdirectory defines a configuration.dhall, which is used by a generate.dhall to produce the appropriate dhall-kubernetes record. Those records in turn are used to generate the actual YAML files when we want to deploy to k8s.
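For orientation, the per-service layout described above might look roughly like the following sketch (invented field names and a placeholder output record; the actual files in deploy-sourcegraph-dhall are more involved):

```dhall
-- Schematic only: src/base/<service>/configuration.dhall exposes a schema of
-- per-service knobs, roughly of this shape:
let Configuration =
      { Type = { image : Optional Text, replicas : Natural }
      , default = { image = None Text, replicas = 1 }
      }

-- src/base/<service>/generate.dhall then turns a completed configuration into
-- the dhall-kubernetes records for that service; here the output is reduced to
-- a placeholder record instead of real Deployment/StatefulSet types:
let generate =
      \(c : Configuration.Type) ->
        { Deployment = { replicas = c.replicas, image = c.image } }

in  generate Configuration::{ replicas = 2 }
```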

what to observe

  • generate-uwe.dhall takes 40 min to finish and takes up 4GB of memory:
	Command being timed: "dhall-to-yaml --file ./generate-uwe.dhall"
	User time (seconds): 2426.85
	System time (seconds): 324.75
	Percent of CPU this job got: 91%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 50:12.85
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 13459224
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1660
	Minor (reclaiming a frame) page faults: 61758253
	Voluntary context switches: 456316
	Involuntary context switches: 1687648
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 63
	Page size (bytes): 4096
	Exit status: 0

By contrast, generate-plain.dhall takes 1 sec to finish.

We would love feedback and advice on our setup and how we can address this problem, and also on how we can help debug it and improve the performance.

@sjakobi
Collaborator

sjakobi commented Jul 31, 2020

Many thanks for the report, @uwedeportivo! :)

Here's a diff of the two files:

diff --git a/generate-plain.dhall b/generate-uwe.dhall
index 16c26f6..22ee9ff 100644
--- a/generate-plain.dhall
+++ b/generate-uwe.dhall
@@ -1,5 +1,34 @@
-let Generate = ./src/base/generate.dhall
+let Generate =
+      ./src/base/generate.dhall sha256:1438389a0f50298789f5deb1adee571accb5aefc552df89aa3ed68cf45180702
 
-let Configuration/global = ./src/configuration/global.dhall
+let Configuration/global =
+      ./src/configuration/global.dhall sha256:78ac8cb500fe1399ed094b01eff2bc6ed42ba93f1bde2ca4c3fc2958d18d7157
 
-in  Generate Configuration/global::{=}
+let c =
+      Configuration/global::{=}
+      with GithubProxy.Deployment.Containers.GithubProxy.image = Some
+          "index.docker.io/sourcegraph/github-proxy:insiders@sha256:9bc0fab1ef7cddd6a09d45d47fe81314be60a7a42ca5b48b4fd3c33b45527dda"
+      with Prometheus.Deployment.Containers.Prometheus.resources.limits.memory = Some
+          "2G"
+      with Prometheus.Deployment.Containers.Prometheus.resources.requests.memory = Some
+          "500M"
+      with Prometheus.Deployment.Containers.Prometheus.image = Some
+          "index.docker.io/sourcegraph/prometheus:insiders@sha256:8906de7028ec7ecfcfecb63335dc47fe70dbf50d8741699eaaa17ea2ddfa857e"
+      with Prometheus.Deployment.Containers.Prometheus.resources.requests.ephemeralStorage = Some
+          "1Gi"
+      with Prometheus.Deployment.Containers.Prometheus.resources.limits.ephemeralStorage = Some
+          "1Gi"
+      with Gitserver.StatefulSet.persistentVolumeSize = Some "4Ti"
+      with Gitserver.StatefulSet.sshSecretName = Some "gitserver-ssh"
+      with Grafana.StatefulSet.Containers.Grafana.image = Some
+          "index.docker.io/sourcegraph/grafana:insiders@sha256:3a9f472109f9ab1ab992574e5e55b067a34537a38a0872db093cd877823ac42e"
+      with Grafana.StatefulSet.Containers.Grafana.resources.limits.memory = Some
+          "100Mi"
+      with Grafana.StatefulSet.Containers.Grafana.resources.requests.memory = Some
+          "100Mi"
+      with Grafana.StatefulSet.Containers.Grafana.resources.limits.ephemeralStorage = Some
+          "1Gi"
+      with Grafana.StatefulSet.Containers.Grafana.resources.requests.ephemeralStorage = Some
+          "1Gi"
+
+in  Generate c

It would be useful to reduce the differences between the files so it becomes clearer what the root of the performance issue is.

  • Are the SHA256 hashes significant? (I suspect not).
  • What happens if you remove, say, 50% or 75% of the with-expressions?

@sjakobi added the performance label on Jul 31, 2020
@ggilmore

@sjakobi The execution time grows dramatically when we add more with expressions. It'll take some time to get exact numbers because the execution time is so long.

@sjakobi
Collaborator

sjakobi commented Jul 31, 2020

I've done some quick timings with time dhall type --quiet --file ./generate-uwe.dhall:

  • 4 withs: 0.27user 0.01system 0:00.28elapsed 99%CPU (0avgtext+0avgdata 47592maxresident)k
  • 5 withs: 0.99user 0.00system 0:01.00elapsed 100%CPU (0avgtext+0avgdata 48528maxresident)k
  • 6 withs: 6.35user 0.06system 0:06.41elapsed 99%CPU (0avgtext+0avgdata 189440maxresident)k
  • 7 withs: 18.81user 0.09system 0:18.91elapsed 99%CPU (0avgtext+0avgdata 359808maxresident)k

My current suspicion is that desugarWith is somehow more expensive than it ought to be…

-- | Desugar all @with@ expressions
desugarWith :: Expr s a -> Expr s a
desugarWith = Optics.rewriteOf subExpressions rewrite
  where
    rewrite e@(With record (key :| []) value) =
        Just (Prefer (PreferFromWith e) record (RecordLit [ (key, makeRecordField value) ]))
    rewrite e@(With record (key0 :| key1 : keys) value) =
        Just
            (Prefer (PreferFromWith e) record
                (RecordLit
                    [ (key0, makeRecordField $ With (Field record key0) (key1 :| keys) value) ]
                )
            )
    rewrite _ = Nothing

@uwedeportivo
Author

I've pushed three files to that repo:

  • generate-uwe-10sec.dhall
  • generate-uwe-50sec.dhall
  • generate-uwe-40min.dhall

Each has gradually more with expressions.

I also took out the sha256 hashes (that was my mistake; I was playing with and misunderstanding freeze).

@uwedeportivo
Author

I will try to test it tomorrow without the with expressions (just defining the record directly).

@Gabriella439
Collaborator

Gabriella439 commented Jul 31, 2020

Yeah, I'm fairly certain that the extended use of with is causing this problem. Remember that just a single with expression like x with a.b.c = e is syntactic sugar for x // { a = x.a // { b = x.a.b // { c = e } } }, and if you have a large number of chained withs you will get a blowup in the size of the expression that is exponential in the number of withs.
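As a rough illustration (a hypothetical two-field record, not code from the repo), chaining just two with expressions already duplicates the base expression several times once the sugar is expanded:

```dhall
-- `x with a.b = 1 with a.c = 2` groups as `(x with a.b = 1) with a.c = 2`.
-- Desugaring the outer `with` copies the already-desugared inner one:
--
--   (x // { a = x.a // { b = 1 } })
--     // { a = (x // { a = x.a // { b = 1 } }).a // { c = 2 } }
--
-- so each additional chained `with` roughly re-duplicates everything built so far.
let x = { a = { b = 0, c = 0 } }

in  x with a.b = 1 with a.c = 2
```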

This is one of the things that came up in the discussion for dhall-lang/dhall-lang#923

There is one change we could potentially make to the standard to make this sort of chained with more efficient, which is the idea originally suggested in dhall-lang/dhall-lang#923
to desugar the with expression to use an intermediate let binding, like this:

x with a.b.c = e
  = let _ = x in _ // { a = _.a // { b = _.a.b // { c = e } } }

That would prevent the exponential blowup.

@sjakobi
Collaborator

sjakobi commented Jul 31, 2020

An additional let-binding is indeed helpful but AFAICT not sufficient:

--- a/generate-uwe-50sec.dhall
+++ b/generate-uwe-50sec.dhall
@@ -2,8 +2,10 @@ let Generate = ./src/base/generate.dhall
 
 let Configuration/global = ./src/configuration/global.dhall
 
-let c =
+let b =
       Configuration/global::{=}
+
+let c = b
       with GithubProxy.Deployment.Containers.GithubProxy.image = Some
           "index.docker.io/sourcegraph/github-proxy:insiders@sha256:9bc0fab1ef7cddd6a09d45d47fe81314be60a7a42ca5b48b4fd3c33b45527dda"
       with Prometheus.Deployment.Containers.Prometheus.resources.limits.memory = Some

This shortens time dhall type --quiet --file ./generate-uwe-50sec.dhall from 80s to 8s on my system.

IIUC the reason for this speedup is that it avoids type-checking Configuration/global::{=} over and over again.

8s is still pretty slow though. The 40min version, changed to use the same trick, still OOMs with the version of dhall I'm currently using (master, I believe).

@Gabriella439
Collaborator

@sjakobi: Right, this is why it requires language support. It needs to desugar to one let binding for each with

@sjakobi
Collaborator

sjakobi commented Jul 31, 2020

@sjakobi: Right, this is why it requires language support. It needs to desugar to one let binding for each with

Ah, good point! With one let per with I get

$ time dhall type --quiet --file ./generate-uwe-50sec.dhall 
1.87user 0.04system 0:02.59elapsed 73%CPU (0avgtext+0avgdata 130332maxresident)k

Which is hardly slower than generate-plain.dhall.
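For reference, the one-let-per-with workaround looks roughly like this (a hand-edited sketch of generate-uwe-50sec.dhall with shortened placeholder image strings, not the exact file):

```dhall
let Generate = ./src/base/generate.dhall

let Configuration/global = ./src/configuration/global.dhall

let c0 = Configuration/global::{=}

-- One `let` per `with`, so each step overrides an already-bound variable
-- instead of re-expanding the whole expression built so far:
let c1 =
      c0 with GithubProxy.Deployment.Containers.GithubProxy.image = Some
        "index.docker.io/sourcegraph/github-proxy:insiders"

let c2 = c1 with Gitserver.StatefulSet.persistentVolumeSize = Some "4Ti"

in  Generate c2
```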

@uwedeportivo
Author

uwedeportivo commented Jul 31, 2020

Thanks for looking into this.

First, an update: we can confirm that a let for each with works and is acceptably fast. I also removed all with expressions entirely, and that version is as fast as plain but unwieldy to define.

Since I have your attention, I would like some advice on how to set things up.

Let me briefly describe what we are doing and what we are trying to accomplish:

kubernetes customizations

We have a set of base manifests (https://github.com/sourcegraph/deploy-sourcegraph/tree/master/base) that define what a normal Sourcegraph installation needs to run in a k8s cluster. Our customers use these manifests to install Sourcegraph on-prem in their own clusters. This naturally leads to customizations that customers want to specify on top of the base, from things like replica counts and specific resources such as memory, all the way to custom namespaces, storage class names, image registries, etc. (see https://github.com/sourcegraph/sourcegraph/issues/12000 for a more complete list). There are various ways to support this, and the Kubernetes community has come up with several solutions, of which we have tried a couple over the years. We were not too happy with what we tried so far: they all have their drawbacks, and customers struggle to keep up to date with our base changes while maintaining their customizations (RFC 141 is slightly outdated but gives you an idea: https://docs.google.com/document/d/1tsksAlqe77NmdPLw7oyy2u0-1rYDQeVpPeDo1T0Xt50/edit?usp=sharing).

current state

Our current state: we encourage customers to keep base clean (no modifications, so it's easy to merge upstream changes from our deploy-sourcegraph repo). For customizations, we include some examples and machinery to let them define overlays using kustomize (https://kustomize.io/), and we provide common overlays in the repo.

You can get a picture of that from https://github.com/sourcegraph/deploy-sourcegraph/tree/master/overlays

dhall

We are trying to do better than that and have been evaluating Dhall for this (among other technologies). The repo https://github.com/sourcegraph/deploy-sourcegraph-dhall has our current approach with Dhall. For now we are trying to get to the point where we can define customizations for one of our dogfood clusters and then use Dhall to generate the manifests for it. Here's a gist with some needed customizations for this dogfood cluster: https://gist.github.com/uwedeportivo/a4298942f75496a3de6ab24da65fc3aa (this is with the shell command variant and the json, jq, jy, etc. commands mentioned in RFC 141).

Our goal is to provide an ergonomic way for customers to specify these. They shouldn't need to know more than some minimal subset of Dhall, and the minimal syntax they would use should be obvious from example customizations.

We see three ways to approach this:

  1. Expose a complete configuration record type and have customers fill in the parts they want to customize (similar to values.yaml in Helm charts: https://github.com/helm/charts/blob/master/stable/grafana/values.yaml)
  2. Have them list out paths into a configuration record with with statements
  3. Have a simplified configuration from which we derive and populate the complete configuration record

Each has its pluses and minuses. Number 1 is complete but overwhelming, and it exposes too much configuration API that we might want to change later, which would break customers; it also might require too much Dhall knowledge from our customers. Number 2 suffers from a lack of discoverability of the allowed paths into those records, and it feels repetitive, especially with the let workaround for each with. Number 3 might complicate our Dhall, and some things might not be possible. (A rough sketch of what 1 and 2 could look like for a customer follows below.)
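Concretely, a hypothetical example (a drastically simplified schema with invented field names, not the real deploy-sourcegraph-dhall configuration) might look like:

```dhall
-- A hypothetical, drastically simplified configuration schema:
let Resources = { memory : Optional Text, cpu : Optional Text }

let Configuration =
      { Type = { replicas : Natural, resources : Resources }
      , default =
          { replicas = 1, resources = { memory = None Text, cpu = None Text } }
      }

-- Option 1: the customer fills in the record via the schema, spelling out
-- whole sub-records even when only one nested field changes:
let option1 =
      Configuration::{ resources = { memory = Some "2G", cpu = None Text } }

-- Option 2: the customer starts from the defaults and overrides individual
-- paths with `with`:
let option2 = Configuration::{=} with resources.memory = Some "2G"

in  { option1 = option1, option2 = option2 }
```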

My question: do you have any advice on how we should set this up? What's a nice, Dhall-ergonomic way to do something like this?

Thanks in advance :-)

@sjakobi
Collaborator

sjakobi commented Aug 2, 2020

@uwedeportivo

2. Have them list out paths into a configuration record with with statements

That sounds reasonable to me, at least once this perf issue is fixed. I'm not very familiar with using Dhall in practice though, so my opinion on this is unlikely to be of much value. I think you might get more feedback if you posed your question at https://discourse.dhall-lang.org/ (and that might also help keep this issue focused on the perf problem).


@Gabriel439: Improving the desugaring in this way seems like a good way forward to me.

I wonder whether a standard change would be necessary or even helpful for this though. AFAIK the additional let-bindings are only helpful with the non-standard normalization of let-bindings in the Haskell implementation. With the substitution-based standard normalization rules, the additional let-bindings wouldn't help at all IIUC.

@Gabriella439
Collaborator

@sjakobi: The standard change is necessary because otherwise we'd fail this test:

https://github.com/dhall-lang/dhall-lang/blob/ccb9f5d54b0ecba05a6493e84442ce445e411e9e/tests/parser/success/unit/WithA.dhall

Also, although the standard does specify one way to normalize let bindings, it does clarify that it permits (even encourages) alternative evaluation strategies if they are more efficient.

More importantly, even if we didn't mind the slight deviation from the standard (we've deviated in other small ways), I wouldn't want other implementations to have to rediscover this problem and independently reinvent the same solution. That would discourage them and also put them at a disadvantage to the Haskell implementation.

@Gabriella439
Collaborator

That said, I plan to have a branch ready with the Haskell-side changes in parallel so that at least @uwedeportivo and @ggilmore will have something to try while waiting on the standardization change.

@Gabriella439
Collaborator

@uwedeportivo: I would go with (2). If discoverability is the only issue, that can be fixed by language server support to auto-complete with expressions

@Gabriella439
Collaborator

@uwedeportivo: I have a pull request up that you can use in the interim while I standardize this change: #1964

@uwedeportivo
Author

Wow, thank you very much. Will try it out.

@Gabriella439
Collaborator

@uwedeportivo: You're welcome! 🙂

Gabriella439 added a commit to dhall-lang/dhall-lang that referenced this issue Aug 2, 2020
The motivation for this change is:

dhall-lang/dhall-haskell#1960

… where a deeply nested and chained `with` expression caused a blow-up
in the size of the desugared syntax tree.

This alternative desugaring was actually an idea we contemplated in the
original change that standardized `with`:

#923

… but at the time we elected to go with the simpler desugaring, reasoning
that users could work around performance issues with explicit `let`
binding:

#923 (comment)

However, for chained `with` expressions that work-around is not
ergonomic, as illustrated in
dhall-lang/dhall-haskell#1960.  Instead,
this adds language support for the `let` work-around.

The change to the `WithMultiple` test demonstrates that this leads to
a more compact desugared expression.  Before this change the
expression `{ a.b = 1, c.d = 2 } with a.b = 3 with c.e = 4` would
desugar to:

```dhall
  { a.b = 1, c.d = 2 }
⫽ { a = { a.b = 1, c.d = 2 }.a ⫽ { b = 3 } }
⫽ { c =
        ({ a.b = 1, c.d = 2 } ⫽ { a = { a.b = 1, c.d = 2 }.a ⫽ { b = 3 } }).c
      ⫽ { e = 4 }
  }
```

… and now the same expression desugars to:

```dhall
let _ = let _ = { a.b = 1, c.d = 2 } in _ ⫽ { a = _.a ⫽ { b = 3 } }

in  _ ⫽ { c = _.c ⫽ { e = 4 } }
```
@PierreR
Contributor

PierreR commented Aug 2, 2020

If discoverability is the only issue, that can be fixed by language server support to auto-complete with expressions

@uwedeportivo let me know if LSP works for you. The last time I tried it, it was pretty bad (see #1817).

@ari-becker

My question: do you have any advice on how we should set this up? What's a nice, Dhall-ergonomic way to do something like this?

Food for thought: our current approach to Kubernetes extensibility among Dhall-unfamiliar users is as follows:

  1. A user expresses a desire or requirement to change something, e.g. to add a raw list of environment variables to a specific container.
  2. We build a "feature" to enable the behavior, e.g. raw-environment-variables. What is a feature? In essence, something like FeatureParams -> Kubernetes.Container -> Kubernetes.Container: we build a type called FeatureParams that includes the data needed to make it happen, then use that along with the base Kubernetes.Container to get an intermediate (possibly final) Kubernetes.Container. Because some features affect multiple manifests (e.g. "add a ConfigMap" both creates a ConfigMap and alters a Deployment to mount it), we can write multiple functions per feature, as needed. (A rough sketch of this shape follows below.)
  3. The user provides the information for the FeatureParams either in a YAML file named after the feature, e.g. raw-environment-variables.yaml, which we run yaml-to-dhall on (some users are more comfortable with YAML /shrug), or directly in a raw-environment-variables.dhall file.
  4. We iterate over all of the features to generate the final manifests, with the help of a bash script that glues everything together.

It takes a couple of minutes for the generation process to finish, which is a little slow, but still workable for our current purposes. In exchange, we get:
a) a stable user-facing configuration (FeatureParams) that need not change even if other parts do (e.g. the change in dhall-kubernetes to return to Optional List from List)
b) the decorator pattern, which prevents any individual piece from exploding in size or complexity and becoming difficult to maintain
c) extensibility: if we want to add more functionality, we just write another feature
d) the ability to work with users to eventually deprecate features, if we so desire. For example, the aforementioned raw-environment-variables is a Pareto escape hatch that we can use to expose the same information (e.g. the location of a database) in different ways (e.g. different environment variables for different services, to avoid code changes). Over time, we can work with internal upstream to reach a standard (enabled through a different feature) so that we no longer need the escape hatch.
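Here is a minimal sketch of that "feature" shape, using a stand-in Container type and invented names (real code would use the dhall-kubernetes Container type and richer FeatureParams):

```dhall
-- Stand-in for the dhall-kubernetes Container type, for illustration only:
let EnvVar = { name : Text, value : Text }

let Container = { name : Text, env : List EnvVar }

-- User-facing parameters for the hypothetical raw-environment-variables feature:
let FeatureParams = { variables : List EnvVar }

-- A feature is a function FeatureParams -> Container -> Container:
let rawEnvironmentVariables
    : FeatureParams -> Container -> Container
    = \(params : FeatureParams) ->
      \(container : Container) ->
        container with env = container.env # params.variables

-- Features compose by plain function application, decorating a base container:
let base = { name = "frontend", env = [] : List EnvVar }

in  rawEnvironmentVariables
      { variables = [ { name = "PGHOST", value = "pgsql" } ] }
      base
```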

@sjakobi
Collaborator

sjakobi commented Aug 3, 2020

Also, although the standard does specify one way to normalize let bindings, it does clarify that it permits (even encourages) alternative evaluation strategies if they are more efficient.

@Gabriel439, I wasn't aware of these recommendations for let. Where in the standard are they?

@uwedeportivo
Author

@ari-becker thanks for that; it's valuable input for us. And sorry, everyone, that I hijacked this issue a little to discuss something not quite relevant to it :-).

@Gabriella439
Collaborator

@uwedeportivo @ggilmore @PierreR @ari-becker: I created a Discourse thread for the Kubernetes roadmap here:

https://discourse.dhall-lang.org/t/roadmap-for-improved-kubernetes-support/313

@uwedeportivo
Author

@Gabriel439 I used your branch to build new binaries and ran it over generate-uwe-40min.dhall; it finishes in 1.7s and uses no more memory than plain. Thank you very much, that fixes it.

@Gabriella439
Collaborator

@uwedeportivo: You're welcome! Keep in mind it might be a week or two before that change gets merged into master since we're discussing upstream exactly how it should be standardized (assuming that people approve)

@Gabriella439
Collaborator

@sjakobi: When I wrote that, I was thinking of this part from the β-normalization section:

Also, note that the semantics specifies several built-in types, functions and operators that conforming implementations must support. Implementations are encouraged to implement the following functions and operators in more efficient ways than the following reduction rules so long as the result of normalization is the same.

However, on reading that more closely, it doesn't actually mention that this also applies to keywords like let, so perhaps I should generalize that text.

Gabriella439 added a commit to dhall-lang/dhall-lang that referenced this issue Aug 13, 2020
The motivation for this change is:

dhall-lang/dhall-haskell#1960

This change to the standard gives implementations greater freedom
in how they desugar a `with` expression, in order to permit more
efficient implementations.  The standard now also suggests a more
efficient implementation to help implementation authors.

This is technically a breaking change to the binary encoding of
a non-normalized `with` expression, but semantic integrity checks
are unaffected by this change.
Gabriella439 added a commit to dhall-lang/dhall-lang that referenced this issue Aug 19, 2020
The motivation for this change is:

dhall-lang/dhall-haskell#1960

This change to the standard gives implementations greater freedom
in how they desugar a `with` expression, in order to permit more
efficient implementations.  The standard now also suggests a more
efficient implementation to help implementation authors.

This is technically a breaking change to the binary encoding of
a non-normalized `with` expression, but most semantic integrity checks
are unaffected by this change.  The exceptions are those where
an imported expression contains a `with` expression with an
abstract left-hand side (see the `WithDesugar` test for an example of
code that would have a new integrity check after this change).
@Gabriella439
Collaborator

This was fixed by #1993

This issue was closed.