Error during host startup can cause a deadlock in the restart flow, leaving the host unhealthy until a manual restart #10766

Open
cjaliaga opened this issue Jan 24, 2025 · 0 comments

When a host startup fails (e.g., due to storage connectivity issues), WebJobsScriptHostService sets the host state to Error and initiates a new host startup. This new startup acquires _hostStartSemaphore and calls BuildHost(). During that process, WorkerFunctionMetadataProvider.GetFunctionMetadataAsync() detects no active worker channels and calls RestartHostAsync(). However, RestartHostAsync() tries to cancel the same in-progress startup that is awaiting it, and then waits on _hostStartSemaphore. Because the build path has no ThrowIfCancellationRequested check, the cancellation never takes effect and _hostStartSemaphore is never released. The restart remains blocked, leaving the host in the Error state until it is manually restarted.

Repro steps

  1. First Host Start
    The host begins to initialize (loads metadata, starts worker channels, etc.).

  2. Failure Connecting to Storage
    A transient error (e.g., DNS or storage connection issue) leads to an aborted startup and moves the host state to Error.

  3. Host Startup Canceled
    Because of the error, the system transitions the existing startup to a canceled state (shutting down worker channels).

  4. New Host Scheduled
    The system schedules a new host startup after a short delay.

  5. Second Host Startup
    This new host startup acquires the _hostStartSemaphore and begins another BuildHost() process.

  6. Metadata Provider Finds No Channels
    Inside WorkerFunctionMetadataProvider.GetFunctionMetadataAsync(), the code detects that no channels exist (they were previously shut down).

    if (channels?.Any() != true)
    {
        if (_scriptHostManager.State is ScriptHostState.Default
            || _scriptHostManager.State is ScriptHostState.Starting
            || _scriptHostManager.State is ScriptHostState.Initialized)
        {
            // We don't need to restart if the host hasn't even been created yet.
            _logger.LogDebug("Host is starting up, initializing language worker channel");
            await _channelManager.InitializeChannelAsync(workerConfigs, _workerRuntime);
        }
        else
        {
            // During the restart flow, GetFunctionMetadataAsync gets invoked
            // again through a new script host initialization flow.
            _logger.LogDebug("Host is running without any initialized channels, restarting the JobHost.");
            await _scriptHostManager.RestartHostAsync();
        }
    }

  7. RestartHostAsync() Called
    Since no channels are active, RestartHostAsync() is invoked to re-initialize the host.

  8. Cancellation Attempt
    RestartHostAsync() attempts to cancel the active startup operation by calling Cancel() on its CancellationTokenSource. Cancellation is cooperative, though: Cancel() only signals the token. Because the call happens within the same call stack/async flow as the current BuildHost(), and that flow has no ThrowIfCancellationRequested check or natural yield point, the build operation is never aborted.

    foreach (var startupOperation in ScriptHostStartupOperation.ActiveOperations)
    {
        _logger.CancelingStartupOperationForRestart(startupOperation.Id);
        try
        {
            startupOperation.CancellationTokenSource.Cancel();
        }
        catch (ObjectDisposedException)
        {
            // This can be disposed at any time.
        }
    }
    try
    {
        await _hostStartSemaphore.WaitAsync();
        // ...

  9. Semaphore Deadlock
    The second startup still holds _hostStartSemaphore and never releases it, because the cancellation is ineffective and the startup is itself waiting for RestartHostAsync() to return. The restart attempt therefore blocks indefinitely when it tries to acquire that semaphore.

At this point, the host remains in an Error state until it is manually restarted, since the restart logic is effectively deadlocked.
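
The re-entrant wait described in the steps above can be reproduced in miniature. The following is only a sketch, not the host's code: RestartDeadlockSketch, StartAsync, and RestartAsync are hypothetical stand-ins for the real service and its methods, and the timeout on WaitAsync is an assumption added so the example terminates (the excerpt above calls WaitAsync() with no timeout, which is why the real flow never returns).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the host service; not the real WebJobsScriptHostService.
public static class RestartDeadlockSketch
{
    // One permit, like a semaphore that guards host startup.
    private static readonly SemaphoreSlim HostStartSemaphore = new SemaphoreSlim(1, 1);

    // Mimics the restart path: signal cancellation, then wait for the start semaphore.
    // The timeout is an assumption so the sketch ends; returns false if the wait timed out.
    private static async Task<bool> RestartAsync(CancellationTokenSource activeStartup, TimeSpan timeout)
    {
        // Cancel() only sets the token's flag; it does not interrupt running code.
        activeStartup.Cancel();

        bool acquired = await HostStartSemaphore.WaitAsync(timeout);
        if (acquired)
        {
            HostStartSemaphore.Release();
        }
        return acquired;
    }

    // Mimics the startup path: take the semaphore, then "build" the host,
    // which re-enters the restart path while the semaphore is still held.
    public static async Task<bool> StartAsync(TimeSpan timeout)
    {
        using var startup = new CancellationTokenSource();
        await HostStartSemaphore.WaitAsync();
        try
        {
            // Nothing here observes startup.Token, so the cancellation has no effect.
            return await RestartAsync(startup, timeout);
        }
        finally
        {
            // Runs only after RestartAsync returns, which is too late to help it.
            HostStartSemaphore.Release();
        }
    }

    public static async Task Main()
    {
        bool restartAcquired = await StartAsync(TimeSpan.FromMilliseconds(200));
        Console.WriteLine(restartAcquired
            ? "restart acquired the semaphore"
            : "restart timed out waiting on the semaphore held by its own caller");
    }
}
```

In this sketch the restart always times out: the permit is released only in the finally block of StartAsync, and that block cannot run until RestartAsync has returned. Without the timeout, the wait would never complete, which matches the behavior reported here.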

Example Call Stack

Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.RestartHostAsync
Microsoft.Azure.WebJobs.Script.WorkerFunctionMetadataProvider.GetFunctionMetadataAsync
Microsoft.Azure.WebJobs.Script.WebHost.FunctionMetadataProvider.GetFunctionMetadataAsync
Microsoft.Azure.WebJobs.Script.FunctionMetadataManager.LoadFunctionMetadata
Microsoft.Azure.WebJobs.Script.DependencyInjection.ScriptStartupTypeLocator.GetExtensionsStartupTypesAsync
Microsoft.Azure.WebJobs.Script.WebHost.DefaultScriptHostBuilder.BuildHost
Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.BuildHost
Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.StartHostAsync