indexserver: add debug endpoint for deleting repository shards #485

Open

wants to merge 11 commits into main
Conversation

@ggilmore (Contributor) commented Nov 17, 2022

This PR adds a new debug endpoint /debug/delete?id=$REPO_ID. When a user calls this URL, all the shards associated with this ID will be deleted (or tombstoned if they're in compound shards).

Given the frequent memory-map incidents that we had this week, this PR should make it easier for users to recover when they're in that state.

They should be able to:

  1. Grab a list of all IDs / repositories that are indexed on the instance:
    wget -q -O - http://localhost:6072/debug/indexed
    ID              Name
    736075603       wut
    1309754292      whatever
  2. Extract some subset of repo IDs via some method like awk | tail/head:
    wget -q -O - http://localhost:6072/debug/indexed | awk '{print $1}' | tail -n +2 | head -n 2
    736075603
    1309754292
  3. Pipe their chosen ids to this new debug route:
    wget -q -O - http://localhost:6072/debug/indexed | awk '{print $1}' | tail -n +2 | head -n 2 | xargs -I {} -- wget -q -O -  "http://localhost:6072/debug/delete?id={}"

This should also be useful for customers who want to try incremental indexing (and need a faster way to revert than waiting for a new build).
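
Roughly, the handler boils down to something like this (a minimal sketch; handleDebugDelete and the response text are illustrative, only the deleteShards helper added in this PR is assumed):

    func handleDebugDelete(indexDir string) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            // /debug/delete?id=$REPO_ID
            id, err := strconv.ParseUint(r.URL.Query().Get("id"), 10, 32)
            if err != nil {
                http.Error(w, "invalid id", http.StatusBadRequest)
                return
            }
            // delete (or tombstone) every shard associated with this repository ID
            if err := deleteShards(indexDir, uint32(id)); err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            fmt.Fprintf(w, "deleted shards for repository %d\n", id)
        }
    }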


(notes)

  • This PR still isn't final; I still need to add basic tests for deleteShards.

  • I decided to put this logic entirely in main.go. It's quite similar to the logic used in build/builder.go, but:

    • I didn't want to export this as a public function; I can't see anyone else needing to reuse this shard-deletion logic (this debug endpoint should be the only user).
    • Builder has its own custom logic around shard deletion that's battle-tested (deleting shards only when it successfully finishes a new build).
  • I think the use of build.Options here isn't great. That object is so huge, and we ultimately only need the indexDir, repoID, and repoName out of it - everything else is irrelevant for this use case. I tried experimenting with refactoring some of those interfaces, but I ended up just creating a bunch of pass-through methods instead (from https://www.amazon.com/Philosophy-Software-Design-John-Ousterhout/dp/1732102201). What are your thoughts on this? Is it worth the refactor? Can we create a smaller "options" object that doesn't have so many extraneous fields? (Rough sketch of what I mean below.)
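
For illustration only, one possible shape for that narrower options object (the name and fields are hypothetical, not something in this PR):

    // deleteOpts is the small subset of build.Options that shard deletion
    // actually needs; everything else in build.Options is irrelevant here.
    type deleteOpts struct {
        indexDir string
        repoID   uint32
        repoName string
    }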

@ggilmore ggilmore requested a review from a team November 17, 2022 23:08
@stefanhengl (Member)

> I think the use of build.Options here isn't great. That object is so huge, and we ultimately only need the indexDir, repoID, and repoName out of it - everything else is irrelevant for this use case. I tried experimenting with refactoring some of those interfaces, but I ended up just creating a bunch of pass-through methods instead (from https://www.amazon.com/Philosophy-Software-Design-John-Ousterhout/dp/1732102201). What are your thoughts on this? Is it worth the refactor? Can we create a smaller "options" object that doesn't have so many extraneous fields?

I would leave it as is. It is very "Zoekt" to pass around options structs. I believe you could make do with IndexOptions, but you need FindAllShards(), which doesn't really have to be implemented on top of build.Options and should maybe be a function instead.

However, I think at some point Zoekt would profit from a unified concept of a shard, i.e. operations (find(id), delete(id), ...) that abstract over compound shards, simple shards, and delta shards.
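
Purely as a sketch of what I mean (names are made up, nothing like this exists in Zoekt today):

    // shardStore is a made-up name for that unified concept.
    type shardStore interface {
        // Find returns every shard (simple, compound, or delta) that contains
        // the given repository.
        Find(repoID uint32) []shard
        // Delete removes the repository: simple shards are unlinked, compound
        // shards are tombstoned, and delta shard chains are removed as a unit.
        Delete(repoID uint32) error
    }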

@stefanhengl (Member)

You might as well add your nice recipe "How to delete a repo" from the PR description to the documentation of the debug endpoint.

@@ -124,6 +250,109 @@ func TestMain(m *testing.M) {
os.Exit(m.Run())
}

func createTestNormalShard(t *testing.T, indexDir string, r zoekt.Repository, numShards int, optFns ...func(options *build.Options)) []string {
@ggilmore (Contributor, Author)

Mostly copied this from the builder package; should we create a testutil package?

Member

yup, maybe it's time. Not blocking for this PR though.
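
For reference, a rough sketch of what a shared helper in a hypothetical testutil package could look like, based on the public build API (the package name and exact shape are assumptions, not part of this PR):

    package testutil

    import (
        "testing"

        "github.com/sourcegraph/zoekt"
        "github.com/sourcegraph/zoekt/build"
    )

    // CreateTestShard indexes one trivial document for r into indexDir and
    // returns the resulting shard paths.
    func CreateTestShard(t *testing.T, indexDir string, r zoekt.Repository, optFns ...func(*build.Options)) []string {
        t.Helper()

        opts := build.Options{IndexDir: indexDir, RepositoryDescription: r}
        opts.SetDefaults()
        for _, fn := range optFns {
            fn(&opts)
        }

        b, err := build.NewBuilder(opts)
        if err != nil {
            t.Fatal(err)
        }
        if err := b.Add(zoekt.Document{Name: "main.go", Content: []byte("package main\n")}); err != nil {
            t.Fatal(err)
        }
        if err := b.Finish(); err != nil {
            t.Fatal(err)
        }
        return opts.FindAllShards()
    }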

@ggilmore ggilmore marked this pull request as ready for review November 19, 2022 01:54
@ggilmore ggilmore requested review from a team and stefanhengl November 19, 2022 01:54
@ggilmore (Contributor, Author)

@sourcegraph/search-core PTAL

@stefanhengl (Member) left a comment

LGTM!

Resolved review thread on cmd/zoekt-sourcegraph-indexserver/main.go (outdated)

@keegancsmith (Member) left a comment

We already have very similar logic to this in cleanup.go, which is run when we are told to stop indexing a repo. It feels like you should be able to reuse that? I took a look at cleanup and I believe it would be relatively straightforward. Additionally, the function for this living in cleanup.go sounds good to me, rather than growing main.go.

Two resolved review threads on cmd/zoekt-sourcegraph-indexserver/main.go (outdated)
@ggilmore (Contributor, Author)

@sourcegraph/search-core PTAL

I have:

  1. moved the logic from main.go into cleanup.go
  2. changed the deletion logic to use zoekt.IndexFilePaths instead of stat-ing directly
  3. added a new sub-routine to cleanup() that ensures that we delete any metadata files that don't have corresponding shards (this is necessary after the change introduced in "2" since there is no longer an enforced order for how we delete files)
  4. changed the logic for deleteShards to scan the entire indexDir instead of using build.Options.FindAllShards() (following the style of other functions in cleanup)
  5. changed the debugHandler to take out a global indexDir lock
  6. wired up the shardLogger to the deleteShards implementation

// Is this repository inside a compound shard? If so, set a tombstone
// instead of deleting the shard outright.
if zoekt.ShardMergingEnabled() && maybeSetTombstone([]shard{s}, repoID) {
	shardsLog(indexDir, "tomb", []shard{s})
@ggilmore (Contributor, Author) commented Nov 29, 2022

Note: looking at the implementation of shardsLog, I'm unsure whether it's thread-safe (what happens if there are two writers to the same file?).

The requirement that callers of deleteShards hold the global index lock protects us against this, but I wonder if we should make this more explicit (e.g. a dedicated mutex just for shardsLog).
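
Something like this is what I mean by a dedicated mutex (a stripped-down sketch; the real shardsLog writes via a rolling logger and its format may differ):

    // shardsLogMu is a hypothetical guard serializing writers within this process.
    var shardsLogMu sync.Mutex

    func shardsLog(indexDir, action string, shards []shard) {
        shardsLogMu.Lock()
        defer shardsLogMu.Unlock()

        // Append one line per shard (file name and format are illustrative).
        f, err := os.OpenFile(filepath.Join(indexDir, "shards.log"), os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
        if err != nil {
            return
        }
        defer f.Close()
        for _, s := range shards {
            // assumes shard exposes its on-disk path
            fmt.Fprintf(f, "%s %s %s\n", time.Now().UTC().Format(time.RFC3339), action, s.Path)
        }
    }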

Member

Quoting https://github.com/natefinch/lumberjack:

> Lumberjack assumes that only one process is writing to the output files. Using the same lumberjack configuration from multiple processes on the same machine will result in improper behavior.

// isn't present in indexDir.
//
// Users must hold the global indexDir lock before calling deleteShards.
func deleteShards(indexDir string, repoID uint32) error {
@stefanhengl (Member) commented Nov 30, 2022

Quoting from cleanup:

simple := shards[:0]
for _, s := range shards {
	if shardMerging && maybeSetTombstone([]shard{s}, repo) {
		shardsLog(indexDir, "tombname", []shard{s})
	} else {
		simple = append(simple, s)
	}
}

if len(simple) == 0 {
	continue
}

removeAll(simple...)

Can't we just call that instead? I guess that is what @keegancsmith was referring to(?). We could factor it out into its own function and call it from your handler and from cleanup. WDYT?

@ggilmore (Contributor, Author)

@stefanhengl PTAL at my latest commit which tries to factor out the logic.

To be honest, I don't like the shape of this shared interface much at all: indexDir and shards are both passed in, there is no contract that indexDir needs to contain those shards, indexDir is only passed for logging purposes, etc. LMK what you think.

Comment on lines +161 to +162
// remove any Zoekt metadata files in the given dir that don't have an
// associated shard file
Member

Is this unrelated cleanup? I.e., should it maybe be another PR?

@ggilmore (Contributor, Author)

No, I added this functionality since we switched the implementation. Before, the implementation made sure to delete the metadata file before its associated shard file. Since we switched to using zoekt.IndexFilePaths that order isn't guaranteed anymore (leaving open the possibility of a metadata file not having an associated shard if we delete the shard and then crash). I added this additional logic to cleanup so that we would have a background process that would reap those "stranded" files.

I am happy to pull this bit into a separate PR though.
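
The new cleanup subroutine boils down to something like this (a simplified sketch assuming the ".meta" suffix convention; the helper in the diff may differ):

    // removeStrandedMeta deletes any metadata file in indexDir whose
    // corresponding shard file no longer exists.
    func removeStrandedMeta(indexDir string) error {
        metas, err := filepath.Glob(filepath.Join(indexDir, "*.meta"))
        if err != nil {
            return err
        }
        for _, meta := range metas {
            shardPath := strings.TrimSuffix(meta, ".meta")
            if _, err := os.Stat(shardPath); errors.Is(err, os.ErrNotExist) {
                if err := os.Remove(meta); err != nil {
                    return err
                }
            }
        }
        return nil
    }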

Comment on lines +335 to +336
// Deleting shards in reverse sorted order (2 -> 1 -> 0) always ensures that we don't leave an inconsistent
// state behind even if we crash.
Member

But if we do crash, we will have partial state and be none the wiser anyway. So I think this is just extra complexity without making us any safer than before.

In fact, I'd argue deleting shard 0 first is the best bet, since that is what we actually check in other parts. So if that is missing, we end up deleting the other ones. Then adding a check in cleanup to remove stranded indexes would make sense.

At the end of the day we are relying on the fact that os.Remove is super fast. Not ideal; really we should introduce some other way of being atomic.
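
In code, "delete 0 first" is just removing the paths in ascending sorted order (a sketch; the helper name is illustrative):

    // removeShardFiles deletes shard files in ascending order so that shard 0
    // (the file other code checks for) disappears first; a crash midway then
    // leaves only higher-numbered shards for a cleanup pass to reap.
    func removeShardFiles(paths []string) {
        sort.Strings(paths)
        for _, p := range paths {
            if err := os.Remove(p); err != nil {
                log.Printf("failed to remove %s: %v", p, err)
            }
        }
    }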

@ggilmore (Contributor, Author)

Maybe we can add an extra file that acts as a tombstone?

For a given repo, if repo.tombstone.json exists, Zoekt should for all intents and purposes treat the repo as deleted: webserver should unload all of its associated shards, and cleanup should have a reaper that would eventually delete all the shards + metafiles before deleting the tombstone file.
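
A minimal sketch of that idea (the marker file name is hypothetical):

    // isRepoTombstoned reports whether a repo-level tombstone marker exists for
    // repoID. If it does, webserver would skip the repo's shards and cleanup
    // would eventually remove the shards, the metadata files, and finally the
    // marker itself.
    func isRepoTombstoned(indexDir string, repoID uint32) bool {
        marker := filepath.Join(indexDir, fmt.Sprintf("r%d.tombstone.json", repoID))
        _, err := os.Stat(marker)
        return err == nil
    }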

//
// If one of the provided shards is a compound shard and the repository is contained within it,
// the repository is tombstoned instead.
func deleteOrTombstone(indexDir string, repoID uint32, shardMerging bool, shards ...shard) {
Member

I see you mentioned you didn't like that we have to pass in indexDir and shardMerging to this func. Maybe instead it should be part of some helper struct that holds that state.

Minor nit: I'd make repoID come after the shardMerging param, i.e. the more "static" something is, the closer it should be to the front of the arg list. Just a habit from functional languages and currying.

@ggilmore (Contributor, Author)

Got it...I'll play around with passing around some sort of helper to see if that's less awkward. Something like:

type ServerConfig struct {
	shardMerging bool
	indexDir     string
}
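
...and then deleteOrTombstone could hang off it, so callers stop threading indexDir and shardMerging through every call (sketch only):

    func (c *ServerConfig) deleteOrTombstone(repoID uint32, shards ...shard) {
        deleteOrTombstone(c.indexDir, repoID, c.shardMerging, shards...)
    }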

// If one of the provided shards is a compound shard and the repository is contained within it,
// the repository is tombstoned instead.
func deleteOrTombstone(indexDir string, repoID uint32, shardMerging bool, shards ...shard) {
var simple []shard
Member

I see you got rid of the filtering hack; I guess just because now that it is a func it is safer that way?

@ggilmore (Contributor, Author)

By filtering hack I assume you mean the shardMap := getShards(indexDir) line? Yes, I decided to pass the list of shards associated with the ID directly since cleanup() already has this information. If we're sharing the function implementation, gathering the entire shardMap seemed redundant.

Resolved review threads on cmd/zoekt-sourcegraph-indexserver/main.go and cmd/zoekt-sourcegraph-indexserver/cleanup.go (outdated)
@keegancsmith (Member)

What is the status of this PR? What can I do to help unblock/land?
