What is AWS ECR and why should you care?
AWS ECR (Elastic Container Registry) is Amazon's managed container image storage. Think of it like a private Docker Hub, but built into AWS. Every time you run a container on ECS Fargate, the task has to pull an image from somewhere — and ECR is where most teams keep theirs.
As of 1 July 2026, container orchestration is table stakes for any serious cloud deployment. If you're running ECS Fargate, you're almost certainly using ECR too. But here's the thing: ECR is kind of invisible. It works until it doesn't, and when it breaks, the error messages are unhelpful. A task fails to start with ResourceInitializationError. Your teammates shrug. Six months later, you discover five years of untagged images sitting in your registry, quietly charging you $400 a month in storage. This guide covers the parts the ECR documentation glosses over: how pulls actually work, why private-subnet tasks fail, and how to keep your bill reasonable.
How ECR works with ECS Fargate
When an ECS Fargate task starts, it needs to pull a container image. The task doesn't have Docker installed locally — it's running on Amazon's managed infrastructure. So the Fargate agent reaches out to ECR, authenticates, and downloads the image.
That authentication is the piece people tend to overlook. Your Fargate task doesn't log in with a username and password that you manage. Instead, the pull is authorized through an IAM role: specifically, the execution role you assigned to the task. ECR checks that role's permissions. If the role can read the image, great. If not, the task stops before your container ever starts, and all you get is a generic error.
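To make the authentication step a bit more concrete: the token returned by GetAuthorizationToken is a base64 string that decodes to a username and password pair, with AWS as the username. On Fargate this exchange is handled for you with the execution role, so the sketch below is only meant to show the shape of the token. The token value and the function name are made up for illustration.

```python
import base64
from typing import Tuple


def decode_ecr_token(authorization_token: str) -> Tuple[str, str]:
    """Split a base64-encoded ECR authorization token into (username, password)."""
    decoded = base64.b64decode(authorization_token).decode("utf-8")
    username, _, password = decoded.partition(":")
    return username, password


# Fabricated token for illustration only; a real one comes from GetAuthorizationToken.
fake_token = base64.b64encode(b"AWS:example-password").decode("ascii")
print(decode_ecr_token(fake_token))
```

The decoded pair is what a Docker client would pass to a registry login. You never see it on Fargate, which is why a missing permission shows up as a vague startup failure rather than a failed login.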
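An ECR repository also lives in a specific AWS region, and that region is part of the image URI. If your task is in us-east-1 and your image is in eu-west-1, the task definition has to reference the eu-west-1 URI, the task needs a network path to that region's endpoint, and the pull will be slower and incur cross-region data transfer charges. That's not usually a problem, but it's worth knowing.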
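Because the region is encoded in a private ECR image URI, a region mismatch can be spotted before a task is ever launched. Here is a minimal sketch of that check. The helper names are hypothetical, and the pattern only covers the standard `<account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>` form (not digest references or other AWS partitions).

```python
import re
from typing import NamedTuple, Optional


class EcrImageRef(NamedTuple):
    account_id: str
    region: str
    repository: str
    tag: Optional[str]


# Matches <account>.dkr.ecr.<region>.amazonaws.com/<repo>[:tag]
_ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repo>[^:@]+)(?::(?P<tag>[^@]+))?$"
)


def parse_ecr_uri(uri: str) -> Optional[EcrImageRef]:
    """Return the parts of a private ECR image URI, or None if it doesn't match."""
    match = _ECR_URI.match(uri)
    if match is None:
        return None
    return EcrImageRef(match["account"], match["region"], match["repo"], match["tag"])


def is_cross_region(uri: str, task_region: str) -> bool:
    """True when the URI is a recognizable ECR URI in a region other than the task's."""
    ref = parse_ecr_uri(uri)
    return ref is not None and ref.region != task_region


image = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/myimage:latest"
print(parse_ecr_uri(image))
print(is_cross_region(image, "us-east-1"))
```

The account ID here is a dummy value. Running a check like this in CI against your task definitions is one cheap way to catch a wrong-region URI early.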
IAM permissions: the invisible wall
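Here's where most people stumble. Your ECS task needs an execution role: the identity ECS uses on the task's behalf to pull images, write logs, and fetch secrets before your container starts. (The separate task role is what your application code uses to talk to other AWS services.) The execution role needs specific permissions to read from ECR.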
The bare minimum looks like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:BatchCheckLayerAvailability"
      ],
      "Resource": "arn:aws:ecr:us-east-1:REPLACE_WITH_ACCOUNT_ID:repository/REPLACE_WITH_REPO_NAME"
    },
    {
      "Effect": "Allow",
      "Action": "ecr:GetAuthorizationToken",
      "Resource": "*"
    }
  ]
}
That second statement is critical. GetAuthorizationToken doesn't apply to a specific repository, so it can't be scoped to a repository ARN and needs "Resource": "*". Without it, your task can't even authenticate to ECR, and everything else fails.
If you mess this up, your task sits in PROVISIONING for a minute, then fails with ResourceInitializationError. There's no log saying "hey, your IAM role can't read ECR." The error is vague. So the first thing to check when a task won't start is the execution role.
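Since a missing action fails so quietly, it can help to lint the execution role's policy document before deploying. The following is a rough sketch, not a full IAM evaluator: it only looks at Allow statements and action names, and it ignores Resource, Condition, NotAction, and explicit Deny. The function name is hypothetical.

```python
from fnmatch import fnmatchcase

REQUIRED_ECR_ACTIONS = {
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
}


def missing_ecr_pull_actions(policy: dict) -> set:
    """Return the required pull actions that no Allow statement in `policy` grants."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM allows a single statement object
        statements = [statements]

    allowed_patterns = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # Action may be a string or a list
            actions = [actions]
        # IAM action names are case-insensitive and may contain wildcards
        allowed_patterns.extend(action.lower() for action in actions)

    return {
        required
        for required in REQUIRED_ECR_ACTIONS
        if not any(fnmatchcase(required.lower(), p) for p in allowed_patterns)
    }


policy_without_token = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "ecr:BatchCheckLayerAvailability",
            ],
            "Resource": "arn:aws:ecr:us-east-1:REPLACE_WITH_ACCOUNT_ID:repository/REPLACE_WITH_REPO_NAME",
        }
    ],
}
print(sorted(missing_ecr_pull_actions(policy_without_token)))
```

An empty result means every required action is granted by at least one statement. It does not prove the resource scoping is right, so treat it as a first-pass check only.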
Private subnet failures and why they happen
Say your Fargate task is in a private subnet (good security practice). It has no direct internet access — all outbound traffic goes through a NAT gateway. Now your task tries to pull an image from ECR.
ECR is an AWS service with public endpoints (it also supports private VPC endpoints). When your private-subnet task reaches out to ECR's public endpoint, the request goes out through the NAT gateway, the response comes back the same way, and the pull should work. Usually it does.
But sometimes it doesn't. If your route tables are misconfigured (for example, the task's subnet has no default route to a NAT gateway), the request never reaches ECR and the pull times out instead of failing with a clear error. A NAT gateway in a different AZ from your task works, but only if the task's subnet actually routes to it. And if your security groups block egress on port 443 (HTTPS), the request never leaves the task at all.
The fix is usually one of these:
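- Use VPC endpoints for ECR. Fargate then pulls the image through the endpoints instead of the public internet. You need two interface endpoints (ecr.api and ecr.dkr) plus an S3 gateway endpoint, because image layers are served from S3. This keeps traffic on the AWS network and is often faster.
- Ensure your NAT gateway is reachable. Check route tables: the private subnet's default route (0.0.0.0/0) should point at the NAT gateway.
- Open outbound HTTPS. Your task's security group needs egress on port 443 to reach ECR.
- Check the stopped reason and CloudWatch logs. The task's stopped reason in the ECS console usually carries the most detail; the CloudWatch log group only has output if the container actually started.
If you're in a private subnet, VPC endpoints are the gold standard. They keep all traffic inside AWS, avoid NAT data processing charges on image pulls (interface endpoints have their own hourly and per-GB charges), and are often faster. Set them up once, forget about them.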
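In practice, a Fargate pull over private networking involves more than one endpoint: two interface endpoints for ECR, a gateway endpoint for S3 (where the layers live), and an interface endpoint for CloudWatch Logs if the task uses the awslogs driver. The sketch below just builds that checklist for a region so you can compare it with what exists in your VPC. The function name is hypothetical, and it assumes the standard `com.amazonaws.<region>.<service>` naming.

```python
from typing import List, Tuple


def ecr_private_subnet_endpoints(region: str, uses_awslogs: bool = True) -> List[Tuple[str, str]]:
    """Return (service name, endpoint type) pairs a private-subnet Fargate pull relies on."""
    endpoints = [
        (f"com.amazonaws.{region}.ecr.api", "Interface"),  # ECR API calls, including auth
        (f"com.amazonaws.{region}.ecr.dkr", "Interface"),  # registry (docker) traffic
        (f"com.amazonaws.{region}.s3", "Gateway"),         # image layers are stored in S3
    ]
    if uses_awslogs:
        # Needed only when the task definition ships logs with the awslogs driver
        endpoints.append((f"com.amazonaws.{region}.logs", "Interface"))
    return endpoints


for service_name, endpoint_type in ecr_private_subnet_endpoints("us-east-1"):
    print(f"{endpoint_type:<10} {service_name}")
```

Interface endpoints also need a security group that allows inbound 443 from the task's subnets, which is a common thing to miss.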
Understanding ECR pricing and image cleanup
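ECR costs are simple: you mostly pay for stored image data. There's no per-pull or per-push request charge, and pulls from within the same region are free. Data transferred out to other regions or to the internet is billed at standard data transfer rates.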
The catch: every image you push is stored until you delete it. If you push the same tag repeatedly (like "latest"), each push with new content creates a new image, and the previous one stays behind as an untagged image. After a few years, you've got thousands of untagged images sitting there.
A single image can be 500 MB to 2 GB depending on the base OS and packages. Multiply that by hundreds or thousands of builds, and you're looking at terabytes. At roughly $0.10 per GB per month, that's expensive.
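To put numbers on that, here is a back-of-the-envelope estimate using the $0.10 per GB-month rate quoted above. Treat it as an upper bound: it assumes every image is stored in full, whereas layers shared between images in a repository are stored once. The function name and the sample figures are illustrative.

```python
PRICE_PER_GB_MONTH = 0.10  # rate quoted above; check current pricing for your region


def monthly_storage_cost(image_count: int, avg_image_gb: float,
                         price_per_gb_month: float = PRICE_PER_GB_MONTH) -> float:
    """Rough monthly storage cost, ignoring layer sharing between images."""
    return image_count * avg_image_gb * price_per_gb_month


# Example: 4,000 retained images averaging 1 GB each (about 4 TB in total)
print(f"${monthly_storage_cost(4000, 1.0):.2f} per month")
```

Under these assumptions, roughly 4 TB of retained images is what it takes to reach a bill in the $400 a month range.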
The solution is a lifecycle policy. It's a set of rules such as "delete any untagged image pushed more than 30 days ago" or "keep only the last 10 images for a given tag prefix." Here's a reasonable example:
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Keep last 10 images with the given tag prefix",
      "selection": {
        "tagStatus": "tagged",
        "tagPrefixList": ["REPLACE_WITH_TAG_PREFIX"],
        "countType": "imageCountMoreThan",
        "countNumber": 10
      },
      "action": {
        "type": "expire"
      }
    },
    {
      "rulePriority": 2,
      "description": "Delete untagged images after 30 days",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 30
      },
      "action": {
        "type": "expire"
      }
    }
  ]
}
This keeps the 10 most recent images whose tags start with the given prefix and deletes any untagged ones older than 30 days. ECR requires a tagPrefixList or tagPatternList on every rule with tagStatus "tagged", so replace REPLACE_WITH_TAG_PREFIX with a prefix your tags actually use. Apply this when you set up ECR, and you're far less likely to see a surprise $400 bill.
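A malformed lifecycle policy is rejected when you try to apply it, so it can be worth checking the basics locally first. Below is a minimal sketch that checks three rules from the ECR lifecycle policy reference: priorities must be unique, tagged rules need a tag prefix or pattern list, and sinceImagePushed needs a countUnit. The helper names are hypothetical. The apply step uses boto3's `put_lifecycle_policy`, and assumes boto3 is installed and credentials are configured.

```python
import json


def lifecycle_policy_problems(policy: dict) -> list:
    """Return a list of human-readable problems found in a lifecycle policy dict."""
    problems = []
    rules = policy.get("rules", [])

    priorities = [rule.get("rulePriority") for rule in rules]
    if len(priorities) != len(set(priorities)):
        problems.append("rulePriority values must be unique")

    for rule in rules:
        priority = rule.get("rulePriority")
        selection = rule.get("selection", {})
        if selection.get("tagStatus") == "tagged" and not (
            selection.get("tagPrefixList") or selection.get("tagPatternList")
        ):
            problems.append(f"rule {priority}: tagged rules need tagPrefixList or tagPatternList")
        if selection.get("countType") == "sinceImagePushed" and "countUnit" not in selection:
            problems.append(f"rule {priority}: sinceImagePushed needs countUnit")
    return problems


def apply_lifecycle_policy(repository_name: str, policy: dict) -> None:
    """Validate locally, then attach the policy to the repository."""
    import boto3  # third-party; imported lazily so the validator works without it

    problems = lifecycle_policy_problems(policy)
    if problems:
        raise ValueError("; ".join(problems))
    boto3.client("ecr").put_lifecycle_policy(
        repositoryName=repository_name,
        lifecyclePolicyText=json.dumps(policy),
    )


example_policy = {
    "rules": [
        {
            "rulePriority": 1,
            "selection": {"tagStatus": "tagged", "countType": "imageCountMoreThan", "countNumber": 10},
            "action": {"type": "expire"},
        }
    ]
}
print(lifecycle_policy_problems(example_policy))
```

The example policy is deliberately missing a tag prefix, so the validator reports one problem. ECR also offers a lifecycle policy preview, which is a safer way to see what a policy would expire before it deletes anything.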
Debugging ECR issues
When a task fails to start, the error message is usually useless. But the answer is almost always one of these:
- Bad IAM permissions. The execution role can't read the repository.
- Wrong repository or region. The task is looking for app.example.com/myimage:latest in us-east-1, but you pushed it to eu-west-1.
- Network issues. Private subnet, security groups, or route tables blocking access to ECR.
- Image doesn't exist. You pushed a tag, then deleted it, then tried to use it.
- ECR API throttling. Rare, but if you're pulling hundreds of images at once, ECR might rate-limit you.
To troubleshoot:
- Check the task's execution role in the IAM console. Does it have ECR permissions?
- Check the CloudWatch log group for the task. Often there's a clue buried in the awslogs output.
- Try pulling the image manually from an EC2 instance in the same VPC. If that works, the problem is networking or IAM in the task configuration.
- Look at the VPC endpoint (if you're using one). Is it reachable from the task's subnet?
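The checklist above can be partly automated by reading the task's stopped reason and sorting it into one of the usual buckets. The sketch below is a heuristic only: the keyword lists are my own guesses at common phrasing rather than an official list of error strings, and the function names are hypothetical. The fetch step uses boto3's `describe_tasks` and assumes boto3 and credentials are available.

```python
from typing import Optional


def likely_cause(stopped_reason: str) -> str:
    """Map a task's stopped reason to a rough category using keyword heuristics."""
    text = stopped_reason.lower()
    if "not authorized" in text or "accessdenied" in text or "access denied" in text:
        return "iam"
    if "timeout" in text or "timed out" in text or "dial tcp" in text:
        return "network"
    if "not found" in text or "does not exist" in text:
        return "missing-image"
    if "toomanyrequests" in text or "rate exceeded" in text or "throttl" in text:
        return "throttling"
    return "unknown"


def fetch_stopped_reason(cluster: str, task_arn: str) -> Optional[str]:
    """Look up the stoppedReason field for one task, or None if it isn't available."""
    import boto3  # third-party; imported lazily so likely_cause works without it

    response = boto3.client("ecs").describe_tasks(cluster=cluster, tasks=[task_arn])
    tasks = response.get("tasks", [])
    return tasks[0].get("stoppedReason") if tasks else None


# Made-up message in the general style of a pull failure, for illustration only
sample = "ResourceInitializationError: request to registry failed: dial tcp: i/o timeout"
print(likely_cause(sample))
```

Stopped tasks are only retained for a short time, so capture the reason soon after the failure. An "unknown" result just means none of the keywords matched, not that the problem lies elsewhere.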
Conclusion
ECR is not complicated, but it's easy to miss the details. It sits between your ECS task and the outside world, and if it's configured wrong, everything stops. The three things to get right are: IAM permissions (including GetAuthorizationToken), network access (especially in private subnets), and image cleanup (lifecycle policies). Do those three, and ECR becomes invisible in the good way — it just works, and your bill stays reasonable.
Merits
- Fully managed — no container registry to operate yourself
- Integrated with IAM — fine-grained access control built in
- Cheap — only pay for storage, no per-pull charges
- Fast — images live in the same AWS region as your tasks
- VPC endpoints available — keep traffic inside AWS for security and speed
- Lifecycle policies — automatic cleanup prevents bill surprises
Demerits
- Error messages are opaque — "ResourceInitializationError" tells you nothing
- Regional — images in one region aren't visible to tasks in another without workarounds
- No built-in notifications — your image storage can balloon without warning
- IAM complexity — the execution role has many moving parts
- VPC endpoint setup is manual — should be easier to secure private-subnet access
- Untagged image cleanup is manual — lifecycle policies don't activate by default
Caution
This guide uses placeholder values: repository names are REPLACE_WITH_REPO_NAME, account IDs are REPLACE_WITH_ACCOUNT_ID, and regions are generic examples. Always substitute real values from your AWS account. Before deploying any of these configurations to production, test them in a sandbox environment first. IAM changes and networking configurations can disrupt running workloads, so make changes carefully and have a rollback plan. Proceed at your own risk, and verify that all placeholders are replaced with actual, correct values.
Frequently asked questions
- How do I push an image to ECR from my CI/CD pipeline?
- What's the difference between a public and private ECR repository?
- Can I pull an ECR image from outside AWS?
- How do I scan ECR images for security vulnerabilities?
- Why is my ECR bill so high all of a sudden?
- Do I need to use ECR, or can I use Docker Hub or another registry?
- How do I grant another AWS account access to my ECR repository?
- What happens if I delete an image that a running task is using?

