Distribution: 3.19 Tracking issue #11954

Closed
39 of 55 tasks
pecigonzalo opened this issue Jul 6, 2020 · 36 comments

@pecigonzalo
Contributor

pecigonzalo commented Jul 6, 2020

Plan

Support new and existing deployments

This is an ongoing expense; we anticipate it taking no more than 10d of work spread across the entire team.

Reduce upgrade overhead

Upgrading Kubernetes deployments requires customers to spend a lot of engineering time converging our released Kubernetes manifests with their fork, as documented in RFC-141.

We will finish the Dhall investigation and make a decision by the end of 3.19.

Increase our e2e test frequency

To increase our release cadence, we need to be able to run e2e tests more frequently. This is currently not possible as our CI infrastructure causes tests to be unreliable.

Support per-team alerts

To allow teams to support and monitor the services and features they ship, we need to be able to route alerts to the relevant teams as described in RFC-189.

Availability

Period is from July 20th to August 19th (23 working days). Please write the days you won't be working and the number of working days for the period.

  • Dave: 22d (off July 20)
  • Robert: 23d
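
As a sanity check on the figures above, here is a minimal sketch that counts the working days in the period. It assumes a Monday-to-Friday week and ignores public holidays; the function name `working_days` is made up for this illustration.

```python
from datetime import date, timedelta


def working_days(start: date, end: date) -> int:
    """Count Monday-Friday days in the inclusive range [start, end].

    Assumption: public holidays are not excluded.
    """
    days = 0
    current = start
    while current <= end:
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            days += 1
        current += timedelta(days=1)
    return days


# Iteration period stated in the issue: July 20th to August 19th, 2020.
iteration_days = working_days(date(2020, 7, 20), date(2020, 8, 19))

# Dave is off on July 20, so he has one day fewer.
dave_days = iteration_days - 1
```

Under these assumptions `iteration_days` is 23 and `dave_days` is 22, matching the numbers listed.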

Workload

@bobheadxi: 2.50d

  • Approved: Proposal: RFC-189: Support per-team alerts and on-call rotations #12010
  • monitoring: migrate existing alert rules to generator #12117
  • Dogfood the monitoring we ship with Sourcegraph #5370 2d
  • cadvisor: investigate collecting IO metrics #12163
  • Better account for peaks / max resource usage in monitoring #12032 0.5d
  • Make license check not fail when our node version is upgraded #12318
  • monitoring: generate alerts as native prometheus alerts, let alert_count depend on alert rules #12336
  • sourcegraph/server:3.18 provisioning indicators panel missing #12421 🐛
  • deploy-sourcegraph: prometheus missing in up metric #12482
  • monitoring: no easy way to trigger alerts for testing #12423
  • monitoring: link back to grafana service dashboards in alert notifications #12235
  • monitoring: better advice for alerts that do not have an entry in alert_solutions #12236
  • frontend: hard errors alerts as ratio/percentage instead of absolute-value #12158
  • monitoring: change relevant hard threshold alerts to ratio-based alerts #12865
  • monitoring: index_queue_growth_rate firing without hitting threshold #12868 🐛
  • monitoring: remove custom alertmanager from cloud #12160
  • monitoring: difficult to understand 7d provisioning panel #12692

@davejrt

  • Bare-metal Buildkite agents capable of running Docker and VMs #12101
  • deploy-sourcegraph-dhall: Add CI job to generate diff between master and feature branch #12716
  • baremetal buildkite agent networking / instability issues #12996

@daxmc99

  • Sourcegraph.com - add redis-store & precise-code-intel-bundle-manager snapshotting #10450

@efritz

  • sourcegraph/customer #72 🐛👩

@ggilmore

  • deploy-sourcegraph-dhall: incrementally migrate sourcegraph.com services over to dhall #12113
  • [draft] deploy-sourcegraph-dhall: implement migrate-to-non-root overlay #12283
  • deploy-sourcegraph-dhall: grafana: implement new configuration logic for k8s.sgdev.org #12109
  • deploy-sourcegraph-dhall: frontend: implement new configuration logic for k8s.sgdev.org #12105

@keegancsmith

  • sourcegraph/customer #69 👩

@pecigonzalo

  • Reduce the impact of unplanned work #11904
  • Migrate terraform state to GCP #12496

@slimsag: 7.50d

  • sourcegraph/customer #62 2d 👩
  • Formalize managed instances #12495 5d
  • sourcegraph/customer #73 👩
  • sourcegraph/customer #53 0.5d 👩
  • sourcegraph/customer #66 👩
  • sourcegraph/customer #74 👩
  • sourcegraph/customer #85 👩
  • sourcegraph/customer #90 🐛👩
  • Run e2e tests on bare-metal Buildkite agents on every commit to master (non-blocking) #12339
  • Run e2e "regression" tests on bare-metal Buildkite agents on every commit to master (non-blocking) #12340
  • distribution: add monitoring architecture page #1221 :shipit:
  • distribution roadmap #1104 :shipit:
  • Document when to introduce new services or not #5487 :shipit:

@uwedeportivo: 8.00d

  • sourcegraph/customer #65 👩
  • deploy-sourcegraph-dhall: symbols: implement generate reading from config #12076 0.5d
  • deploy-sourcegraph-dhall: searcher: implement generate reading from config #12075 0.5d
  • deploy-sourcegraph-dhall: replacer: implement generate reading from config #12074 0.5d
  • deploy-sourcegraph-dhall: repo-updater: implement generate reading from config #12073 0.5d
  • deploy-sourcegraph-dhall: query-runner: implement generate reading from config #12072 0.5d
  • deploy-sourcegraph-dhall: precise-code-intel: implement generate reading from config #12071 0.5d
  • deploy-sourcegraph-dhall: postgres: implement generate reading from config #12070 0.5d
  • deploy-sourcegraph-dhall: jaeger: implement generate reading from config #12069 0.5d
  • deploy-sourcegraph-dhall: github-proxy: implement generate reading from config #12066 0.5d
  • deploy-sourcegraph-dhall: cadvisor: implement generate reading from config #12065 0.5d
  • ci-db-backcomp.sh is not branch-aware #12302
  • sourcegraph/customer #57 3d 👩

Legend

  • 👩 Customer issue
  • 🐛 Bug
  • 🧶 Technical debt
  • 🛠️ Roadmap
  • 🕵️ Spike
  • 🔒 Security issue
  • :shipit: Pull Request
@pecigonzalo pecigonzalo self-assigned this Jul 6, 2020
@pecigonzalo pecigonzalo added this to the 3.19 milestone Jul 6, 2020
@pecigonzalo pecigonzalo changed the title WIP: Distribution: 3.18 Tracking issue WIP: Distribution: 3.19 Tracking issue Jul 6, 2020
@pecigonzalo pecigonzalo changed the title WIP: Distribution: 3.19 Tracking issue Distribution: 3.19 Tracking issue Jul 16, 2020
@pecigonzalo
Contributor Author

cc/ @christinaforney @dadlerj

@bobheadxi
Member

bobheadxi commented Jul 17, 2020

this week (week of jul 13)

Landed a range of improvements to alerting for 3.18 (alert silencing, notifications, unifying some out-of-band alerts). Worked on justifying and fleshing out some details for supporting alert ownership for RFC189, planned tasks for 3.19, and am currently working on documenting the what, how, and why of our monitoring stack (sourcegraph/about#1221). Also paired with @keegancsmith on potential improvements to one of our alerts and the generator as a whole.

next week

Hopefully resolve https://github.com/sourcegraph/sourcegraph/issues/12158, since it is currently one of the more frequent critical alerts that isn't entirely actionable/valid on Cloud, and wrap up https://github.com/sourcegraph/sourcegraph/issues/5370 by setting up a time to take on-call and set up OpsGenie. Also finalize how we are supporting alert ownership for RFC189 and finish documenting our monitoring stack.

@davejrt
Contributor

davejrt commented Jul 20, 2020

Week July 13

Landing blackbox exporter into the sourcegraph.com environment (sourcegraph/deploy-sourcegraph-dot-com#2984). Lots of time spent on calls with $CUSTOMER working through issues with their deployment related to indexed search starting. Also implemented the fix suggested by @pecigonzalo regarding pv/pvc and their respecting claimref from namespaces.

Week July 20
Continuing to work through $CUSTOMER issues and beginning on #12101, in addition to fine-tuning some of the blackbox alerts to ensure they reflect exactly what we need. These seemed to function well, though, in light of last week's Cloudflare outage.
Continuing to work with $CUSTOMER to resolve any further issues

@pecigonzalo
Contributor Author

pecigonzalo commented Jul 20, 2020

Week July 13

I worked on planning 3.19 which included starting this experiment for tracking project progress. I have also started RFC-202 for standardizing configuration across our services.

Week July 20

My focus this week will be setting our team goals and planning the retrospective for 3.18. I'll also start experimenting with using projects to track unplanned tasks and close old backlog items/projects.

@uwedeportivo
Contributor

pretend this is last friday:

this week:

mainly release with dax and i finished all my dhall assignments from geoffrey (i used geoffrey's nicely done framework to capture in dhall the customizations we need to support dog-food k8s)

next week:

3.19 planning, some small remainders from release process, continue with geoffrey on dhall (we will probably start converting dog-food and start capturing all the kustomize customizations and twitter and apple customization)

@ggilmore
Contributor

This past week, I lost some time due to medical issues and follow-up. The rest of the week was spent catching up on rescheduled meetings, more medical follow-up, following up on @uwe's good work on deploy-sourcegraph-dhall, and 3.19 planning.

This week:

@slimsag
Member

slimsag commented Jul 21, 2020

Copying my update over from the 3.18 tracking issue for posterity / visibility:

Last week

I spent ~60% of my time on https://github.com/sourcegraph/customer/issues/62#issuecomment-661266320 and made great progress, but with many context switches / interruptions throughout. I was sidetracked regularly by:

This week

I intend to:

@pecigonzalo
Contributor Author

Priorities update

As discussed with CE, https://github.com/sourcegraph/customer/issues/65 is now our top priority.

@davejrt
Contributor

davejrt commented Jul 24, 2020

Week July 20 - Last week

Week July 27- Next week

  • Begin work on sourcegraph/sourcegraph#12101
  • Look more closely into Dhall and learning the language
  • Investigate and complete sourcegraph/sourcegraph#12160
  • Continue to provide assistance related to sourcegraph/customer#65

@ggilmore
Contributor

This week:

Next week:
I should have fewer medical appointments, and I'll continue dedicating my time to investigating https://github.com/sourcegraph/customer/issues/65 until that's solved

@uwedeportivo
Contributor

Week July 20 - Last week

Week July 27- Next week

  • do some dhall work (prepare more customizations)
  • debug indexed search on megakube.sgdev and bigdata.sgdev

@slimsag
Member

slimsag commented Jul 27, 2020

Last week

I spent the vast majority of my time helping with https://github.com/sourcegraph/customer/issues/65 and nearly completed https://github.com/sourcegraph/customer/issues/62 (literally only a few hours away).

This week

I hope to finish https://github.com/sourcegraph/customer/issues/62 , the distribution roadmap, the monitoring architecture documentation, and catch up with Robert on monitoring.

@pecigonzalo
Contributor Author

pecigonzalo commented Jul 27, 2020

Week July 20

Last week's focus was working with the team to set our team goals. The test of using GitHub projects for tracking progress seems to be working, and I'll continue with it for the rest of the iteration.
I have also been talking with Chayim about the secrets loading implementation.

Week July 27

I'll continue to focus on setting our team goals; we have settled on them but are still working out the details. I will also try to finalize RFC-202 and the review of RFC-199.

Team update

Issue https://github.com/sourcegraph/customer/issues/65 has been resolved, but for this week our focus remains on the sub-issues it created: https://github.com/sourcegraph/customer/issues/69 and https://github.com/sourcegraph/customer/issues/70.

@slimsag
Member

slimsag commented Jul 31, 2020

This week

I helped $CUSTOMER with Uwe and Dave to ensure their demo went smoothly, completely finished work on managed instances and am ready to ship it to $MY_CUSTOMER_0, caught up with Robert on monitoring and next steps there, shipped Sourcegraph to $MY_CUSTOMER_1, began data collection for https://github.com/sourcegraph/customer/issues/71 - and started looking into release automation for deploy-sourcegraph-docker as well as brainstorming production readiness ideas.

I did not complete/merge the distribution roadmap or monitoring architecture documentation, but still intend to do so.

Next week

Review https://github.com/sourcegraph/sourcegraph/pull/12581 - merge the distribution roadmap, architecture docs, get feedback on production readiness and move forward on that, continue looking into monitoring and release automation.

@bobheadxi
Member

bobheadxi commented Aug 1, 2020

This week

I've been working on a range of improvements and polish to alerting (formatting, bug, regression, migration, testing, etc).

I've also landed the core functionality for the RFC 189's per-team alerts (via routing implementation) and have prepared a pull request to configure team-based paging. The final pieces of this are also up for review: migrate the rest of our out-of-band alerts and drop our custom alerting.
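
To illustrate the idea of per-team routing, here is a minimal sketch, not the actual implementation. It assumes each alert carries an owner and a level, and that a routing table maps that pair to a receiver. The team names, receiver strings, and the `route` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    level: str  # "warning" or "critical"
    owner: str  # team that owns the alerting service (hypothetical label)


# Hypothetical routing table: (owner, level) -> notification receiver.
ROUTES = {
    ("distribution", "critical"): "pager:distribution",
    ("distribution", "warning"): "slack:#distribution-alerts",
    ("search", "critical"): "pager:search",
}

# Alerts with no matching route fall back to a shared channel.
DEFAULT_RECEIVER = "slack:#alerts"


def route(alert: Alert) -> str:
    """Return the receiver for an alert, falling back to the default."""
    return ROUTES.get((alert.owner, alert.level), DEFAULT_RECEIVER)


paged = route(Alert("frontend_hard_errors", "critical", "distribution"))
fallback = route(Alert("unowned_alert", "warning", "unknown-team"))
```

In this sketch a critical alert owned by a team with a paging route reaches that team's pager, while alerts without an owner entry still land somewhere visible.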

I commented on this last week, but I've hit a wall with cadvisor IO metrics (https://github.com/sourcegraph/sourcegraph/issues/12163) and don't really see a way forward - my update on that issue includes possible alternatives

Next week

Land everything related to per-team alerts and work with each team to get rotations and alerts set up (per @nicksnyder's request). I imagine there will be problematic alerts / other issues, and will likely focus on follow-up work. This will (finally!) close out dogfooding.

@ggilmore
Contributor

ggilmore commented Aug 1, 2020

This week:

  • helped out with the $customer issue investigations
  • more medical appointments and meetings
  • synced with @uwedeportivo on his dhall work and talked about next steps
  • discovered some dhall perf issues with @uwedeportivo and filed slow (or even hanging) dhall-kubernetes based generation dhall-lang/dhall-haskell#1960, hopefully we can find a decent resolution for this
  • task-switched back to infrastructure work for dhall dogfooding: added dhall plugins to our CI agents, preparing terraform code for catfood.sgdev.org (dogfood clone) - holding off on deploying it until next week to save on costs

@davejrt
Contributor

davejrt commented Aug 1, 2020

THIS WEEK

I'd say my time was split fairly evenly between working on tasks for $customer and replicating their setup internally. I haven't gotten back around to doing anything with Dhall, which is unfortunate, but I hope to make more progress on that next week. I made a decent start on #12101 and have a working agent that, aside from needing some fine-tuning, will be good to start using and iterating on.

I did take a cursory glance at removing alertmanager, but Robert kindly informed me that this was wrapped up with a bunch of other PRs he'll land next week, so it's in his more than capable hands for now.

NEXT WEEK

I'll have a PR ready to land for the baremetal CI agent which will be ready to start running jobs. I'm going to resync with geoffrey and/or uwe re dhall and make an effort to pick that back up again.

@uwedeportivo
Contributor

this week:

tinkered together with geoffrey on dhall (https://github.com/sourcegraph/deploy-sourcegraph-dhall/tree/the_rest_generate). we're in contact with the dhall core devs about some issues we hit and we also asked them for advice on how to set up the dhall interface for our customers (dhall-lang/dhall-haskell#1960 (comment)). we have tried a couple of things and we'll settle on something for the POC for the dogfood cluster. we should be able to tie things up for the POC evaluation sometime next week.

did some debugging, overlay creation and general support for our bigdata $customer.

next week:

i want us to finish up the dhall POC and do some evaluation of suitability and if/how we proceed. i'm leaning towards proceeding. i think dhall advantages outweigh some of the difficulties. but i don't want to inject priors into the evaluation process so disregard my last sentence :-)

@pecigonzalo
Contributor Author

pecigonzalo commented Aug 3, 2020

Last week

We finished our initial team goals, and I also finalized the review of RFC-199. We will test using microVMs with ignite for a v0 and will have to review the outcome of that testing before we can move to v1 and define how we deploy/support/HA/etc.

This week

We will kick off our 360 review cycle and I will focus on that. I'll be working on the roadmap and a product readiness document with Stephen and will pair with Geoffrey to get more familiar with our Dhall implementation.
I have not been able to make progress on RFC-202, and if time allows I would like to finish that up.

Team update

The high-priority sourcegraph/customer#69 from last week has been resolved, and we will return to our tracking issue priorities. sourcegraph/customer#70 remains unclear because we can't reproduce it consistently, and it has been deprioritized for the moment.

@ggilmore
Contributor

ggilmore commented Aug 8, 2020

This week:

@slimsag
Member

slimsag commented Aug 8, 2020

This week:

I spent most of my time, maybe 70% discussing things (distribution things, CE things, security things, code intel things, and more.) I spent 10% of my time helping customers, and 10% thinking about how to onboard CE folks. I made slight progress on release automation, but no progress on the other things I set out to do this week in my last update.

@bobheadxi
Member

This week

I followed up on last week's update and have finalized most of the work for per-team alerts, and have been dogfooding it (to myself). I have pinged each team to set up on-call rotations so that we can switch over completely to the new alerting stack and remove our old alerting by end of 3.19 or early 3.20. I have also made a range of improvements to our alerting, including making our provisioning alerts more informative, converting some of our hard-threshold alerts to ratios, and improving our alert solutions documentation. I also looked into adjusting resources for some of our services that seem like they could use it.
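
The difference between a hard-threshold alert and a ratio-based one can be sketched as follows. This is an illustration only: the function names and all numbers are made up, and the real alerts are defined in the monitoring generator rather than in code like this.

```python
def absolute_alert(errors: int, threshold: int) -> bool:
    """Fire when the raw error count reaches a fixed threshold."""
    return errors >= threshold


def ratio_alert(errors: int, total: int, max_ratio: float) -> bool:
    """Fire when errors make up at least max_ratio of all requests."""
    if total == 0:
        return False  # no traffic, nothing to alert on
    return errors / total >= max_ratio


# High-traffic instance: 50 errors out of 100,000 requests (0.05%).
busy_absolute = absolute_alert(50, threshold=20)
busy_ratio = ratio_alert(50, 100_000, max_ratio=0.01)

# Low-traffic instance: 5 errors out of 10 requests (50%).
quiet_absolute = absolute_alert(5, threshold=20)
quiet_ratio = ratio_alert(5, 10, max_ratio=0.01)
```

With these made-up numbers, the absolute alert fires on the busy instance even though its error rate is tiny, and stays silent on the quiet instance where half the requests fail. The ratio alert behaves the other way around, which is why it tends to be less noisy on high-traffic deployments.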

Next week

See if I can help others wrap up any outstanding tasks for this iteration, maybe work on converting more of our noisy alerts to be ratio-based, and start looking at what I can do in 3.20

@uwedeportivo
Contributor

this week:

did some dhall work experimenting with unit tests and some more encompassing customizations that span more than one resource. did some initial bootstrap for marek for the 3.19 release. some customer work with cap1.

@pecigonzalo
Contributor Author

pecigonzalo commented Aug 10, 2020

Last week

Kicked off the 360 review cycle and was focused on that. I paired with Geoffrey to get more familiar with our Dhall implementation and architecture. I met with Eric to talk about running code intel on firecracker VMs and how we would deploy those.

This week

I'll be working mainly on our 360 reviews and 3.20 planning. I'd also like to spend time on Dhall and do more testing now that I understand its structure better. I'm also working on improving our incidents pipeline so it's easier to track the status and number of active incidents.

Team update

Given the number of customer issues, we are over our original estimates for support time, which will likely impact "Increase our e2e test frequency" and potentially "Reduce upgrade overhead" although we are looking to make a decision at the end of the sprint anyway.
We started planning 3.20 last week and should finish by the end of the week.

@marekweb
Contributor

Dear all,

This is your release captain speaking. 🚂🚂🚂

Branch cut for the 3.19 release is scheduled for tomorrow.

Is this issue / PR going to make it in time? Please change the milestone accordingly.
When in doubt, reach out!

Thank you

@slimsag
Member

slimsag commented Aug 15, 2020

This week:

I was out Mon and half of Tue due to a family emergency. Tue was spent getting caught up. I discussed Dhall with the team and helped to determine next steps, interviewed two candidates (eng and CE), and helped customers (#85 (completed), #62 (completed), #74 (in progress), #12999 (in progress), #73 (in progress)). I reduced the reliance on myself for shipping Sourcegraph to some customers, and am running behind on other areas (support, distribution roadmap, monitoring arch docs, service arch docs, etc.) due to the life incident I had earlier in the week.

@ggilmore
Contributor

This week, I got catfood.sgdev.org in a demo-able state for the Dhall PoC (along with the accompanying https://github.com/sourcegraph/deploy-sourcegraph-dhall changes). Following the Wednesday demo, the @sourcegraph/distribution team has enough confidence to move forward with the dhall implementation. My next priorities are to sync with the Dhall maintainers and develop the Dhall roadmap/tracking issue

@pecigonzalo
Contributor Author

Last Week

I have mostly been working on the 3.20 plan and 360 reviews. I also worked on a couple of incidents (hostError on K8s nodes, https://github.com/sourcegraph/infrastructure/pull/2060, https://github.com/sourcegraph/customer/issues/75) and have been actively reviewing other Kubernetes notifications to identify fixes (e.g. sourcegraph/deploy-sourcegraph#816).

This Week

My priority for this week is finalizing the plans for 3.20 and meeting with the team to close the reviews. Additionally, I'll continue to work through our alerts and find action items from them.

@bobheadxi
Member

last week

A lot of misc. work and doing investigations around alerts discussions in Slack (often around alerts frequencies). Fiddled around with Dhall, and made other misc. improvements to tooling (docsite, license_finder). Brainstormed a potential idea for next iteration around improving release dogfooding / our deploy-sourcegraph forks

next week

Stay up to date on 3.20 plans and see how I can pitch in. Probably follow up on teams setting up their opsgenie alerts, and continue following up on issues with alerts that get raised

@davejrt
Contributor

davejrt commented Aug 17, 2020

last week
Landed this PR, which provides the code to build a buildkite-agent running on GCP, as well as the terraform code to deploy the autoscaling group. Also spent a number of sessions with Kimberly helping her get Sourcegraph running locally in her wsl2 environment.

this week
Stephen pointed out the performance issues with the buildkite-agents, especially with relation to booting a vagrant box as well as pulling docker images. I suspect this is related to the nested virt and some virtual box networking that will need to be tuned. Captured in #12996

@slimsag
Member

slimsag commented Aug 22, 2020

This week:

I made the pure-docker vagrant tests reliable and used them + the new documentation to ship v3.18.0 to https://app.hubspot.com/contacts/2762526/company/407948923/ further reducing their upgrade delay and requirement on myself. These also played a fundamental role in identifying a release blocker for 3.19.0 later in the week.

I spent some time helping a customer, fixing a bug for another customer, and a fair amount of time in meetings and interviewing candidates.

All of Thursday was spent helping @uwedeportivo with the 3.19.0 release which had a number of bad issues which plagued it. We managed to fix them in time to release on the 20th as promised. I also sync'd with a customer and did some live debugging to resolve a major issue for them

On Friday I addressed some tech debt, helped Rijnard with testing some structural search ideas, sync'd with a customer and responded to a customer P0

@bobheadxi
Member

bobheadxi commented Aug 24, 2020

this week

some debug sessions, follow-ups on per-team alerting and general improvements on that front (docs and tweaks). did some planning for k8s dogfooding in 3.20, and made a proof-of-concept for what it might look like

next week

wrap up the dogfooding automation for deploy-sourcegraph => dogfood, and work with someone to get it deployed proper

@davejrt
Contributor

davejrt commented Aug 24, 2020

Last week 17 August

I'd estimate 90% of my time was spent debugging the GCP/vagrant/docker e2e testing issues #12996. PR is up for review here. Doesn't appear to be directly related to networking as first thought, and a high CPU count has an impact based on my testing

This week 24 August

Iron out any remaining issues with this portion of the e2e testing and sync with Uwe around improving the release process. I need to look more closely at our autoscaling capabilities in GCP to ensure we're getting maximum value and reliability out of the config.

@pecigonzalo
Contributor Author

pecigonzalo commented Aug 24, 2020

Last Week

Last week was focused on closing 3.19, planning 3.20 and closing our 360 review packages. I was able to make some progress in the Prometheus issue and sent a fix to the customer (thanks @bobheadxi and @uwe for the help here).
I also started to organize our GCP Projects, so it's easier to set permissions by using groups instead of users, and started merging deploy-sourcegraph into deploy-sourcegraph-dot-com, which will continue this week.

This Week

I'll work on adding our new on-call rotation to OpsGenie and sending a PR with the updated information to help finish per-team alerts, and get some help to finalize the deploy-sourcegraph-dot-com merge.
@bobheadxi and I will sync later this week so I can start helping on Dogfood Kubernetes deployments.

@uwedeportivo
Contributor

last week

fighting really hard to get 3.19 out the door. we had several issues: it started with the segfault in redis-cache in single-server, DNS issues in single-server, regression tests not logging in, regression tests referencing wrong test repos, build issues, update-docker-image-tags issues. after we finished tagging the final release for 3.19, stephen discovered a race condition in DB migration, so we had to fix it, retract 3.19.0 and release 3.19.1

this week

i started looking into the dhall migration/keep-up-to-date tool. will continue playing with that for a couple of days. also help geoffrey with planning and fleshing out the dhall project for real now


7 participants