Retention service hangs and does not remove old shards #25054

gwossum · 2024-06-11T18:00:54Z

Under certain conditions, the retention service can hang while waiting for a shard's reference count to drop to zero. When this happens, the retention service cannot remove any other shards, which can eventually result in high disk usage.

The attached goroutine trace shows a system exhibiting the issue. The retention service is stuck waiting on the WaitGroup used to indicate that the references to the shard have dropped to zero.
goroutine.txt

Fix an issue that can cause the retention service to hang waiting on a `Shard.Close` call. When this occurs, no other shards will be deleted by the retention service. This is usually noticed as an increase in disk usage because old shards are not cleaned up.

The fix adds two new methods to `Store`: `SetShardNewReadersBlocked` and `InUse`. `InUse` can be used to poll whether a shard has active readers, which the retention service uses to skip over in-use shards so that it does not hang. `SetShardNewReadersBlocked` determines whether new read access may be granted to a shard. This is required to prevent race conditions between the `InUse` check and the deletion of shards.

If the retention service skips over a shard because it is in use, the shard will be checked again the next time the retention service runs, and it can be deleted on a subsequent check once it is no longer in use. If a shard is stuck in use, the retention service will not be able to delete it, which can be observed in the logs for manual intervention. Other shards can still be deleted by the retention service even if a shard is stuck with readers.

closes: #25054
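
The skip-and-retry behaviour described in this commit message can be sketched roughly as follows. This is not the code from the fix: `fakeStore`, `Acquire`, `Release`, `DeleteShard`, and `purgeExpired` are hypothetical, and the signatures shown for `SetShardNewReadersBlocked` and `InUse` are assumptions made for this example. The sketch also assumes that new readers are blocked before `InUse` is checked, so that no reader can arrive between the check and the delete.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeStore is a hypothetical in-memory stand-in for the real Store.
type fakeStore struct {
	mu      sync.Mutex
	readers map[uint64]int  // active reader count per shard
	blocked map[uint64]bool // shards currently refusing new readers
}

func newFakeStore(ids ...uint64) *fakeStore {
	s := &fakeStore{readers: map[uint64]int{}, blocked: map[uint64]bool{}}
	for _, id := range ids {
		s.readers[id] = 0
	}
	return s
}

// Acquire grants read access unless the shard is missing or blocked.
func (s *fakeStore) Acquire(id uint64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.readers[id]; !ok || s.blocked[id] {
		return false
	}
	s.readers[id]++
	return true
}

// Release drops one reader reference.
func (s *fakeStore) Release(id uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.readers[id] > 0 {
		s.readers[id]--
	}
}

// SetShardNewReadersBlocked (assumed signature) toggles new read access.
func (s *fakeStore) SetShardNewReadersBlocked(id uint64, blocked bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.blocked[id] = blocked
}

// InUse (assumed signature) reports whether the shard has active readers.
func (s *fakeStore) InUse(id uint64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.readers[id] > 0
}

// DeleteShard removes the shard and its bookkeeping.
func (s *fakeStore) DeleteShard(id uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.readers, id)
	delete(s.blocked, id)
}

// purgeExpired deletes idle shards and skips in-use ones instead of waiting.
func purgeExpired(s *fakeStore, expired []uint64) (deleted, skipped []uint64) {
	for _, id := range expired {
		s.SetShardNewReadersBlocked(id, true) // close the race window first
		if s.InUse(id) {
			s.SetShardNewReadersBlocked(id, false) // restore access, retry next run
			skipped = append(skipped, id)
			continue
		}
		s.DeleteShard(id)
		deleted = append(deleted, id)
	}
	return deleted, skipped
}

func main() {
	s := newFakeStore(1, 2, 3)
	s.Acquire(2) // shard 2 has an active reader

	deleted, skipped := purgeExpired(s, []uint64{1, 2, 3})
	fmt.Println("first run: deleted", deleted, "skipped", skipped)

	s.Release(2)
	deleted, skipped = purgeExpired(s, []uint64{2})
	fmt.Println("second run: deleted", deleted, "skipped", skipped)
}
```

In this sketch a shard that stays in use is simply reported as skipped on every run, which corresponds to the log-and-intervene case mentioned above, while the remaining shards are still removed.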


davidby-influx · 2024-06-13T20:42:39Z

Can you create port issues for all the repos and branches, @gwossum?


gwossum added kind/bug area/storage 1.x team/edge labels Jun 11, 2024

gwossum self-assigned this Jun 11, 2024

gwossum mentioned this issue Jun 11, 2024

fix: prevent retention service from hanging #25055

Merged
