Distribution: 3.19 Tracking issue #11954
Comments
this week (week of jul 13): Landed a range of improvements to alerting for 3.18 (alert silencing, notifications, unifying some out-of-band alerts). Worked on justifying and fleshing out some details for supporting alert ownership for RFC189, planned tasks for 3.19, and am currently working on documenting the what, how, and why of our monitoring stack (sourcegraph/about#1221). Also paired with @keegancsmith on potential improvements to one of our alerts and the generator as a whole. next week: Hopefully resolve https://github.com/sourcegraph/sourcegraph/issues/12158, since it is currently one of the more frequent critical alerts that isn't entirely actionable/valid on Cloud, and wrap up https://github.com/sourcegraph/sourcegraph/issues/5370 by setting up a time to take on-call and set up OpsGenie. Also finalize how we are supporting alert ownership for RFC189 and finish documenting our monitoring stack. |
Week July 13: Landing blackbox exporter into sourcegraph.com environment (sourcegraph/deploy-sourcegraph-dot-com#2984). Lots of time spent on calls with $CUSTOMER working through issues with their deployment related to indexed search starting. Also implemented the fix suggested by @pecigonzalo regarding pv/pvc and their respecting claimref from namespaces. Week July 20 |
Week July 13: I worked on planning 3.19, which included starting this experiment for tracking project progress. I have also started RFC-202 for standardizing configuration across our services. Week July 20: My focus this week will be setting our team goals and planning the retrospective for 3.18. I'll also start experimenting with using projects to track unplanned tasks and close old backlog items/projects. |
pretend this is last friday: this week: mainly release with dax and i finished all my dhall assignments from geoffrey (i used geoffrey's nicely done framework to capture in dhall the customizations we need to support dog-food k8s) next week: 3.19 planning, some small remainders from release process, continue with geoffrey on dhall (we will probably start converting dog-food and start capturing all the kustomize customizations and twitter and apple customization) |
This past week, I lost some time due to medical issues and follow-up. The rest of the week was spent catching up on rescheduled meetings, more medical follow-up, following up with @uwe's good work on deploy-sourcegraph-dhall, and 3.19 planning. This week:
|
Copying my update over from the 3.18 tracking issue for posterity / visibility: Last week: I spent ~60% of my time on https://github.com/sourcegraph/customer/issues/62#issuecomment-661266320 and made great progress but with many context switches / interruptions throughout. I was sidetracked regularly by:
This week: I intend to:
|
Priorities update: As discussed with CE, https://github.com/sourcegraph/customer/issues/65 is now our top priority. |
Week July 20 - Last week
Week July 27 - Next week
|
This week:
Next week: |
Week July 20 - Last week
Week July 27 - Next week
|
Week of July 20: Added the first step towards supporting per-team alerts, worked on migrating custom alerts into our generator, investigated some provisioning issues on dot-com, investigated collecting IO metrics to support https://github.com/sourcegraph/customer/issues/65, and added support for ... Next week: Wrap up outstanding PRs and implement support for per-team alerts + dogfood it alongside the migrated custom alerts. |
Last week: I spent the vast majority of my time helping with https://github.com/sourcegraph/customer/issues/65 and nearly completed https://github.com/sourcegraph/customer/issues/62 (literally only a few hours away). This week: I hope to finish https://github.com/sourcegraph/customer/issues/62, the distribution roadmap, the monitoring architecture documentation, and catch up with Robert on monitoring. |
Week July 20: Last week's focus was working with the team to set our team goals. The test of using GitHub projects for tracking progress seems to be working and I'll continue with this during the rest of the iteration. Week July 27: I'll continue to focus on setting our team goals; we have settled on them but we are still working out the details. I will also try to finalize RFC-202 and the review of RFC 199. Team update: Issue https://github.com/sourcegraph/customer/issues/65 has been resolved, but our focus for this week remains on the sub-issues created by it, https://github.com/sourcegraph/customer/issues/69 and https://github.com/sourcegraph/customer/issues/70. |
This week: I helped $CUSTOMER with Uwe and Dave to ensure their demo went smoothly, completely finished work on managed instances and am ready to ship it to $MY_CUSTOMER_0, caught up with Robert on monitoring and next steps there, shipped Sourcegraph to $MY_CUSTOMER_1, began data collection for https://github.com/sourcegraph/customer/issues/71, and started looking into release automation for deploy-sourcegraph-docker as well as brainstorming production readiness ideas. I did not complete/merge the distribution roadmap or monitoring architecture documentation, but still intend to do so. Next week: Review https://github.com/sourcegraph/sourcegraph/pull/12581, merge the distribution roadmap and architecture docs, get feedback on production readiness and move forward on that, and continue looking into monitoring and release automation. |
This week: I've been working on a range of improvements and polish to alerting (formatting, bug, regression, migration, testing, etc). I've also landed the core functionality for RFC 189's per-team alerts (via routing implementation) and have prepared a pull request to configure team-based paging. The final pieces of this are also up for review: migrate the rest of our out-of-band alerts and drop our custom alerting. I commented on this last week, but I've hit a wall with cadvisor IO metrics (https://github.com/sourcegraph/sourcegraph/issues/12163) and don't really see a way forward; my update on that issue includes possible alternatives. Next week: Land everything related to per-team alerts and work with each team to get rotations and alerts set up (per @nicksnyder's request). I imagine there will be problematic alerts / other issues, and will likely focus on follow-up work. This will (finally!) close out dogfooding. |
This week:
|
THIS WEEK I'd say a fairly even portion of my time was spent between working on tasks for $customer and replicating their setup internally. I haven't gotten back around to do anything with Dhall which is unfortunate but I hope to make more progress on that next week. I made a decent start on #12101 and a working agent that aside from needing some fine tuning will be good to start using and iterate on. I did take a tertiary glance at removing alertmanager but Robert kindly informed me that this was wrapped up with a bunch of other PRs he'll land next week so it's in his more than capable hands for now. NEXT WEEK I'll have a PR ready to land for the baremetal CI agent which will be ready to start running jobs. I'm going to resync with geoffrey and/or uwe re dhall and make an effort to pick that back up again. |
this week: tinkered together with geoffrey on dhall (https://github.com/sourcegraph/deploy-sourcegraph-dhall/tree/the_rest_generate). we're in contact with the dhall core devs about some issues we hit and we also asked them for advice on how to set up the dhall interface for our customers (dhall-lang/dhall-haskell#1960 (comment)). we have tried a couple of things and we'll settle on something for the POC for the dogfood cluster. we should be able to tie things up for the POC evaluation sometime next week. did some debugging, overlay creation and general support for our bigdata $customer. next week: i want us to finish up the dhall POC and do some evaluation of suitability and if/how we proceed. i'm leaning towards proceeding. i think dhall advantages outweigh some of the difficulties. but i don't want to inject priors into the evaluation process so disregard my last sentence :-) |
Last week: We finished our initial team goals, and I also finalized the review of RFC-199. We will test using microVMs with ignite for a v0 and will have to review the outcome of that testing before we can move to v1 and define how we deploy/support/HA/etc. This week: We will kick off our 360 review cycle and I will focus on that. I'll be working on the roadmap and a product readiness document with Stephen and will pair with Geoffrey to get more familiar with our Dhall implementation. Team update: The high priority sourcegraph/customer#69 from last week has been resolved, and we will return to our tracking issue priorities. sourcegraph/customer#70 remains unclear as we can't reproduce it consistently, and it has been deprioritized for the moment. |
This week:
|
This week: I spent most of my time, maybe 70% discussing things (distribution things, CE things, security things, code intel things, and more.) I spent 10% of my time helping customers, and 10% thinking about how to onboard CE folks. I made slight progress on release automation, but no progress on the other things I set out to do this week in my last update. |
This week: I followed up on last week's update and have finalized most of the work for per-team alerts, and have been dogfooding it (to myself). I have pinged each team to set up on-call rotations so that we can switch over completely to the new alerting stack and remove our old alerting by end of 3.19 or early 3.20. I have also made a range of improvements to our alerting, including: making our provisioning alerts more informative, converting some of our hard-threshold alerts to be ratios, and improving our alert solutions documentation. I also looked into adjusting resources for some of our services that seem like they could use it. Next week: See if I can help others wrap up any outstanding tasks for this iteration, maybe work on converting more of our noisy alerts to be ratio-based, and start looking at what I can do in 3.20. |
this week: did some dhall work experimenting with unit tests and some more encompassing customizations that span more than one resource. did some initial bootstrap for marek for the 3.19 release. some customer work with cap1. |
Last week: Kicked off the 360 review cycle and I was focused on that. I paired with Geoffrey to get more familiar with our Dhall implementation and architecture. I met with Eric to talk about running code intel on firecracker VMs and how we would deploy those. This week: I'll be working mainly on our 360 reviews and 3.20 planning. I'd also like to work on Dhall; I would like to do more testing now that I understand its structure better. I'm also working on improving our incidents pipeline so it's easier to track the status and number of active incidents. Team update: Given the number of customer issues, we are over our original estimates for support time, which will likely impact "Increase our e2e test frequency" and potentially "Reduce upgrade overhead", although we are looking to make a decision at the end of the sprint anyway. |
Dear all, This is your release captain speaking. 🚂🚂🚂 Branch cut for the 3.19 release is scheduled for tomorrow. Is this issue / PR going to make it in time? Please change the milestone accordingly. Thank you |
This week: I was out Mon and half of Tue due to a family emergency. Tue was spent getting caught up. I discussed Dhall with the team and helped to determine next steps, interviewed two candidates (eng and CE), and helped customers (#85 (completed), #62 (completed), #74 (in progress), #12999 (in progress), #73 (in progress)). I reduced the reliance on myself for shipping Sourcegraph to some customers, and am running behind on other areas (support, distribution roadmap, monitoring arch docs, service arch docs, etc.) due to the life incident I had earlier in the week. |
This week, I got catfood.sgdev.org in a demo-able state for the Dhall PoC (along with the accompanying https://github.com/sourcegraph/deploy-sourcegraph-dhall changes). Following the Wednesday demo, the @sourcegraph/distribution team has enough confidence to move forward with the dhall implementation. My next priorities are to sync with the Dhall maintainers and develop the Dhall roadmap/tracking issue |
Last Week: I have mostly been working on the 3.20 plan and 360 reviews. I worked on a couple of incidents as well. This Week: My priority for this week is finalizing the plans for 3.20 and meeting with the team to close the reviews. Additionally, I'll continue to work through our alerts and find action items from them. |
last week: A lot of misc. work and investigations around alert discussions in Slack (often around alert frequencies). Fiddled around with Dhall, and made other misc. improvements to tooling (docsite, license_finder). Brainstormed a potential idea for next iteration around improving release dogfooding / our deploy-sourcegraph forks. next week: Stay up to date on 3.20 plans and see how I can pitch in. Probably follow up on teams setting up their opsgenie alerts, and continue following up on issues with alerts that get raised. |
last week this week |
This week: I made the pure-docker vagrant tests reliable and used them + the new documentation to ship v3.18.0 to https://app.hubspot.com/contacts/2762526/company/407948923/, further reducing their upgrade delay and requirement on myself. These also played a fundamental role in identifying a release blocker for 3.19.0 later in the week. I spent some time helping a customer, fixing a bug for another customer, and a fair amount of time in meetings and interviewing candidates. All of Thursday was spent helping @uwedeportivo with the 3.19.0 release, which was plagued by a number of bad issues. We managed to fix them in time to release on the 20th as promised. I also sync'd with a customer and did some live debugging to resolve a major issue for them. On Friday I addressed some tech debt, helped Rijnard with testing some structural search ideas, sync'd with a customer and responded to a customer P0. |
this week: some debug sessions, follow-ups on per-team alerting and general improvements on that front (docs and tweaks). did some planning for k8s dogfooding in 3.20, and made a proof-of-concept for what it might look like. next week: wrap up the dogfooding automation for deploy-sourcegraph => dogfood, and work with someone to get it deployed proper |
Last week (17 July): I'd estimate 90% of my time was spent debugging the GCP/vagrant/docker e2e testing issues #12996. PR is up for review here. It doesn't appear to be directly related to networking as first thought, and a high CPU count has an impact based on my testing. This week (24 July): Iron out any remaining issues with this portion of the e2e testing and sync with Uwe around improving the release process. I need to look more closely at our autoscaling capabilities in GCP to ensure we're getting maximum value and reliability out of the config. |
Last Week: Last week was focused on closing 3.19, planning 3.20 and closing our 360 review packages. I was able to make some progress on the Prometheus issue and sent a fix to the customer (thanks @bobheadxi and @uwe for the help here). This Week: I'll work on adding our new on-call rotation to OpsGenie and sending a PR with the updated information to help finish per-team alerts and get some help to finalize the |
last week: fighting really hard to get 3.19 out the door. we had several issues: it started with the segfault in redis-cache in single-server, DNS issues in single-server, regression tests not logging in, regression tests referencing wrong test repos, build issues, update-docker-image-tags issues. after we finished tagging the final release for 3.19, stephen discovered a race condition in DB migration, so we had to fix it, redact 3.19.0 and release 3.19.1. this week: i started looking into the dhall migration/keep-up-to-date tool. will continue playing with that for a couple of days. also help geoffrey with planning and fleshing out the dhall project for real now |
Plan
Support new and existing deployments
This is an ongoing expense; we anticipate it taking no more than 10d of work spread across the entire team.
Reduce upgrade overhead
Upgrading Kubernetes deployments requires customers to spend a lot of engineering time converging our released Kubernetes manifests with their fork, as documented in RFC-141.
We will finish the Dhall investigation and make a decision by the end of 3.19.
Increase our e2e test frequency
To increase our release cadence, we need to be able to run e2e tests more frequently. This is currently not possible as our CI infrastructure causes tests to be unreliable.
Support per-team alerts
To allow teams to support and monitor the services and features they ship, we need to be able to route alerts to the relevant teams as described in RFC-189.
Availability
Period is from July 20th to August 19th (23 working days). Please write the days you won't be working and the number of working days for the period.
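As a quick sanity check on the 23-working-day figure, here is a minimal sketch that counts Monday-to-Friday days in the period. The year 2020 is an assumption (the issue only gives month and day), the helper name `working_days` is hypothetical, and public holidays and personal time off are deliberately not subtracted.

```python
from datetime import date, timedelta


def working_days(start: date, end: date) -> int:
    """Count Monday-Friday days from start to end, both inclusive."""
    count = 0
    current = start
    while current <= end:
        # weekday() returns 0 for Monday through 6 for Sunday.
        if current.weekday() < 5:
            count += 1
        current += timedelta(days=1)
    return count


# Assumption: the iteration falls in 2020; the issue does not state the year.
PERIOD_START = date(2020, 7, 20)
PERIOD_END = date(2020, 8, 19)

print(working_days(PERIOD_START, PERIOD_END))
```

Each person's availability would then be this total minus the days they list as not working.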
Workload
@bobheadxi: 2.50d
- cadvisor: investigate collecting IO metrics #12163
- up metric #12482
- monitoring: remove custom alertmanager from cloud #12160
@davejrt
- Bare-metal Buildkite agents capable of running Docker and VMs #12101
- deploy-sourcegraph-dhall: Add CI job to generate diff between master and feature branch #12716
- baremetal buildkite agent networking / instability issues #12996
@daxmc99
@efritz
@ggilmore
- deploy-sourcegraph-dhall: incrementally migrate sourcegraph.com services over to dhall #12113
- [draft] deploy-sourcegraph-dhall: implement migrate-to-non-root overlay #12283
- deploy-sourcegraph-dhall: grafana: implement new configuration logic for k8s.sgdev.org #12109
- deploy-sourcegraph-dhall: frontend: implement new configuration logic for k8s.sgdev.org #12105
@keegancsmith
@pecigonzalo
- Reduce the impact of unplanned work #11904
@slimsag: 7.50d
- sourcegraph/customer#74 👩
- sourcegraph/customer#90 🐛👩
- Run e2e tests on bare-metal Buildkite agents on every commit to master (non-blocking) #12339
- Run e2e "regression" tests on bare-metal Buildkite agents on every commit to master (non-blocking) #12340
@uwedeportivo: 8.00d
Legend