@mkilp
Contributor

@mkilp mkilp commented Dec 23, 2025

Hi everyone,

I ran into a fairly big issue while developing my ECS service. My containers are pretty heavy (multiple gigabytes each), so I did a sanity check inside ECR to see what kind of storage I am looking at.

Lo and behold, I noticed there are dozens of images that never get deleted.
This PR adds a lifecycle policy to ECR that expires untagged images, which as far as I can tell are the ones we can safely remove.

I do foresee a problem with rollbacks, since SST uses the digest to attach the image to the task.

I am happy to take input on the exact lifecycle policy to use. I understand this is a pretty big change since it adds to the bootstrap. I tested it and it's working:

[Screenshot: CleanShot 2025-12-23 at 18 23 39]

My change essentially gives 30 days of rollback time. We could also change the rule to keep at least x untagged images. Note: we can only target untagged images with one rule.
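To make the "keep at least x untagged images" alternative concrete, here is a minimal sketch of what that policy document could look like. It uses the `imageCountMoreThan` count type from ECR lifecycle policies in place of `sinceImagePushed`. The helper name `keepLastUntagged` and the value 10 are assumptions for illustration only and are not part of this PR.

```typescript
// Shape of a single ECR lifecycle rule, limited to the fields used here.
interface UntaggedRule {
  rulePriority: number;
  description: string;
  selection: {
    tagStatus: "untagged";
    countType: "imageCountMoreThan";
    countNumber: number;
  };
  action: { type: "expire" };
}

// Hypothetical helper (not an SST API): keep the newest `keep` untagged
// images and expire everything beyond that count.
function keepLastUntagged(keep: number): { rules: UntaggedRule[] } {
  if (!Number.isInteger(keep) || keep < 1) {
    throw new Error("keep must be a positive integer");
  }
  return {
    rules: [
      {
        rulePriority: 1,
        description: `Keep only the last ${keep} untagged images`,
        selection: {
          tagStatus: "untagged",
          countType: "imageCountMoreThan",
          countNumber: keep,
        },
        action: { type: "expire" },
      },
    ],
  };
}

// The JSON string is what a lifecycle policy resource expects as `policy`.
const policy = JSON.stringify(keepLastUntagged(10));
console.log(policy);
```

Going by the note above, this would replace the 30-day rule rather than sit next to it. A count-based rule bounds storage regardless of deploy frequency, while the age-based rule bounds the rollback window.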

I do believe this is very important since ECR storage is pretty expensive at $0.10 per GB-month.

cheers,

marv

@ekaya97
Contributor

ekaya97 commented Dec 28, 2025

+1

as a workaround, the following can be added to sst.config.ts

async run() {
  new aws.ecr.LifecyclePolicy("sst-asset-lifecycle", {
    repository: "sst-asset", // ECR repository name from the SST bootstrap
    policy: JSON.stringify({
      "rules": [
        {
          "rulePriority": 1,
          "description": "Expire untagged images pushed over 30 days ago",
          "selection": {
            "tagStatus": "untagged",
            "countType": "sinceImagePushed",
            "countUnit": "days",
            "countNumber": 30
          },
          "action": {"type": "expire"}
        }
      ]
    }),
  });
},

@vimtor vimtor self-assigned this Jan 10, 2026
@vimtor
Collaborator

vimtor commented Jan 12, 2026

i'm wondering if this should be more explicit. something like:

const vpc = new sst.aws.Vpc("MyVpc");
const cluster = new sst.aws.Cluster("MyCluster", { vpc });

new sst.aws.Service("MyService", {
  cluster,
  image: {
    context: "./app",
    dockerfile: "Dockerfile",
    expiresIn: "30 days"
  }
});

what do you think?

@ekaya97
Contributor

ekaya97 commented Jan 12, 2026

@vimtor

yes, this is much better.

with enums like "x days", "on push", etc.
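As a rough sketch of how such a value could be translated, the function below turns a string like "30 days" into the lifecycle rule used in the workaround above. The `expiresIn` option does not exist in SST today, and `expiresInToRule` is a hypothetical name. Only the "x days" form is handled, since the thread does not define what "on push" would mean.

```typescript
// Hypothetical translation of the proposed `expiresIn` option into an
// ECR lifecycle rule for untagged images. Not an existing SST API.
function expiresInToRule(expiresIn: string) {
  // Accept "1 day", "30 days", etc. ECR lifecycle rules count in days.
  const match = /^(\d+) days?$/.exec(expiresIn.trim());
  if (!match) {
    throw new Error(`Unsupported expiresIn value: "${expiresIn}"`);
  }
  const days = Number(match[1]);
  if (days < 1) {
    throw new Error("expiresIn must be at least 1 day");
  }
  return {
    rulePriority: 1,
    description: `Expire untagged images pushed over ${days} days ago`,
    selection: {
      tagStatus: "untagged",
      countType: "sinceImagePushed",
      countUnit: "days",
      countNumber: days,
    },
    action: { type: "expire" },
  };
}

console.log(JSON.stringify({ rules: [expiresInToRule("30 days")] }, null, 2));
```

Rejecting unknown values up front keeps a typo such as "30days" from silently producing no policy at all.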

@vimtor
Collaborator

vimtor commented Jan 12, 2026

exactly! @ekaya97

do any of you wanna give it a try?

@mkilp
Contributor Author

mkilp commented Jan 12, 2026

@vimtor I'll take care of it tonight!

@ekaya97
Contributor

ekaya97 commented Jan 12, 2026

couple questions:

  1. putting this into the service might give the impression that it applies only to this service, but it applies to the entire ECR repository. it would need to work with tags and make sure custom image tags and existing untagged images are handled properly.
  2. how do you handle modifications? deprecate/delete the old policy? update the old policy?
  3. do you allow a separate ECR repository for each Cluster and apply the policy at that level? do you add a new ECR component entirely and remove the bootstrapped ECR repository?
  4. how do you confirm that the policy was applied? policies are not instant and may take up to 24h. fire and forget, or actually check?

@vimtor
Collaborator

vimtor commented Jan 12, 2026

probably if we do it by service/task we should add a unique tag based on the component's name and create the lifecycle targeting it. maybe there's another way but that's the only one that comes to mind atm
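A minimal sketch of what tag-based targeting could look like, assuming each component's images carry a tag that starts with the component name and that those tags persist. It relies on the `tagStatus: "tagged"` and `tagPrefixList` selection fields of ECR lifecycle rules. The helper `ruleForComponent` and the component names are hypothetical.

```typescript
// Hypothetical helper: build one lifecycle rule per component, matching
// images whose tag starts with the component's name.
function ruleForComponent(component: string, days: number, priority: number) {
  return {
    rulePriority: priority, // must be unique within a policy
    description: `Expire ${component} images pushed over ${days} days ago`,
    selection: {
      tagStatus: "tagged",
      tagPrefixList: [component],
      countType: "sinceImagePushed",
      countUnit: "days",
      countNumber: days,
    },
    action: { type: "expire" },
  };
}

// One policy document covering two hypothetical services.
const componentPolicy = JSON.stringify({
  rules: [
    ruleForComponent("MyService", 30, 1),
    ruleForComponent("MyWorker", 7, 2),
  ],
});
console.log(componentPolicy);
```

Note that `tagPrefixList` matches prefixes, so a rule for "MyService" would also match a tag such as "MyServiceV2". Component names that share a prefix would need a delimiter or a different tag scheme.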

@mkilp
Contributor Author

mkilp commented Jan 21, 2026

@vimtor @ekaya97

Sorry, things got busy for me so I didn't get around to it yet but wanted to contribute to the convo:

My 2 cents: SST/pulumi untags all images that are not in use by a task in a service. I opened this PR against the current setup, where a single 'sst-asset' ECR repository is managed by SST. There it seems that when pulumi deletes the task definition during an update, it also automatically untags the image behind that definition (I'm not 100% sure on that, so it is still to be confirmed). I am not sure whether custom tags inside the service definition will persist between task definition updates.

Happy to add it to the service definition instead of the bootstrap, but the tagging/targeting might become an issue with how pulumi works. SST already tags the image with the container name, using the same 'tags' property a custom tag would use, but that tag gets removed anyway. Hence I created the 'untagged' rule.

If we wanted service-based expiration, we may need to change how SST structures images under the hood.

@vimtor
Collaborator

vimtor commented Jan 21, 2026

great analysis @mkilp

i'm not sure what the best approach is here

tbh i don't think it's worth changing how we tag images for a feature like this

a simple fix could be adding an example to the docs using @ekaya97's snippet

still, i like the PR change - everyone will need this, so it fits as a core feature. i dislike making it a default without any config, but i don't know where to put that config, which is probably telling

on the other hand, most people rolling back will just sst deploy an old commit instead of hand-picking the image, so letting untagged images expire feels pretty safe

what do you think?

@mkilp
Contributor Author

mkilp commented Jan 21, 2026

thanks @vimtor

I did a more in-depth trial on my AWS account and found the following:

  • Custom tags that are not the service name persist
  • However, they do need to be unique
  • A pretty big find was this: image tags are sometimes reused across stages. Let's take the following example:
    Service A deployed on Stage X -> Image tagged "A"
    Change made to Image
    Service A deployed on Stage Y -> Image tagged "A" again AND Image deployed on Stage X becomes untagged

This might actually cause huge issues if we expire all untagged images and you then try to scale out Service A on Stage X: the task definition would not be able to find the image SHA it references.
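The scenario above can be replayed with a toy in-memory model of a repository. This is only a sketch: the `Repo` class and the digests are made up, and the single assumption it encodes is that a tag points at exactly one image, so pushing a new image under an existing tag leaves the previous image untagged.

```typescript
// Toy model of an image repository: each tag points at exactly one digest.
class Repo {
  private tags = new Map<string, string>(); // tag -> digest
  private digests = new Set<string>();

  // Push an image under a tag. Reusing a tag moves it to the new digest.
  push(digest: string, tag: string): void {
    this.digests.add(digest);
    this.tags.set(tag, digest);
  }

  // Digests that no tag points at any more.
  untagged(): string[] {
    const tagged = new Set(Array.from(this.tags.values()));
    return Array.from(this.digests).filter((d) => !tagged.has(d));
  }
}

const repo = new Repo();
repo.push("sha256:aaa", "A"); // Service A deployed on Stage X
repo.push("sha256:bbb", "A"); // changed image, Service A deployed on Stage Y

// Stage X's task definition still references sha256:aaa by digest, but the
// image is now untagged and therefore eligible for an "untagged" expiry rule.
const atRisk = repo.untagged();
console.log(atRisk); // [ 'sha256:aaa' ]
```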

The only solution I can currently find is changing the internal fargate component and adding the stage to the tag that is applied to the image build. That way we might have more active images at all times, but we at least prevent accidentally expiring in-use images, since tags from other stages will persist. Overall it should be a net gain on savings since we still expire completely unused images.
In the future we might be able to change the service component to allow custom rules and work with more custom generated tags (deployId, commitId, something like that), but I think the scope of such a change requires a lot more thought.
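A rough sketch of what adding the stage to the image tag could look like. The function name `stageScopedTag` and the `stage-container` format are assumptions and not how SST builds tags today. The cleanup step follows the usual image tag constraints: letters, digits, underscores, periods and dashes, no leading period or dash, and at most 128 characters.

```typescript
// Hypothetical helper: build an image tag that is unique per stage so that
// deploying one stage never moves a tag used by another stage.
function stageScopedTag(stage: string, container: string): string {
  const raw = `${stage}-${container}`;
  const cleaned = raw
    .replace(/[^A-Za-z0-9_.-]/g, "-") // replace characters not allowed in tags
    .replace(/^[.-]+/, ""); // a tag may not start with a period or dash
  if (cleaned.length === 0) {
    throw new Error("stage and container produce an empty tag");
  }
  return cleaned.slice(0, 128); // tags are limited to 128 characters
}

console.log(stageScopedTag("production", "A")); // production-A
console.log(stageScopedTag("dev/marv", "web")); // dev-marv-web
```

Truncating to 128 characters could make two very long stage names collide, so a real implementation might append a short hash of the untruncated value instead.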

Let me know what you think about the fargate component approach and I can get the change into this PR.

@vimtor
Collaborator

vimtor commented Jan 22, 2026

i'm thinking that maybe the best approach is to provide a utility method like sst.aws.getECRRepository to avoid leaking that implementation detail and then just add an example in the docs:

new aws.ecr.LifecyclePolicy("sst-asset-lifecycle", {
  repository: sst.aws.getECRRepository().apply(repo => repo.name),
  policy: JSON.stringify({
    "rules": [
      {
        "rulePriority": 1,
        "description": "Expire untagged images pushed over 30 days ago",
        "selection": {
          "tagStatus": "untagged",
          "countType": "sinceImagePushed",
          "countUnit": "days",
          "countNumber": 30
        },
        "action": {"type": "expire"}
      }
    ]
  }),
});
