Every ML team I’ve seen gets IAM wrong in the same two ways.
The first way: a shared IAM user with AdministratorAccess attached, which the whole team uses for notebooks, training jobs, and experiments. It works fine until someone accidentally deletes a production S3 bucket.
The second way: a wildly over-permissioned role — s3:* on *, ecr:* on *, sometimes iam:* — because the docs said to add these permissions and nobody questioned it.
Neither approach will survive a security review. And if you’re aiming for a role at a company doing ML at scale, IAM hygiene is something you’ll be asked about.
Execution role vs. user role — the distinction most people miss
When you run a SageMaker training job, two identities are involved:
- Your user/role: the identity you assume in the AWS Console or CLI. This is what starts the job. It needs sagemaker:CreateTrainingJob, plus iam:PassRole on the execution role.
- The execution role: the role that SageMaker assumes on your behalf while the job runs inside AWS infrastructure. This is the role attached to the training job itself. It’s what actually reads your S3 data, pulls your ECR image, and writes model artifacts.
These are different IAM roles. Most tutorials conflate them or give both the same permissions. The right approach is a user role with narrow SageMaker control-plane permissions, and an execution role with the data-plane permissions the training job actually needs.
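To make the split concrete, here is a minimal sketch of what the narrow user-side policy might look like, built as a Python dict so it can be dumped to JSON. The function name build_caller_policy, the extra Describe/Stop actions, and the example ARN are assumptions for illustration, not something the setup above prescribes.

```python
import json


def build_caller_policy(execution_role_arn: str) -> dict:
    """Identity policy for the human or CI role that starts training jobs.

    Control-plane only: no S3, ECR, or CloudWatch Logs data-plane actions.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ManageTrainingJobs",
                "Effect": "Allow",
                # Assumption: Describe/Stop are included for convenience;
                # only CreateTrainingJob is strictly needed to start a job.
                "Action": [
                    "sagemaker:CreateTrainingJob",
                    "sagemaker:DescribeTrainingJob",
                    "sagemaker:StopTrainingJob",
                ],
                "Resource": "arn:aws:sagemaker:*:*:training-job/*",
            },
            {
                "Sid": "PassOnlyTheExecutionRole",
                "Effect": "Allow",
                "Action": "iam:PassRole",
                # Only this one role may be handed over...
                "Resource": execution_role_arn,
                # ...and only to the SageMaker service.
                "Condition": {
                    "StringEquals": {
                        "iam:PassedToService": "sagemaker.amazonaws.com"
                    }
                },
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical account ID and role name.
    arn = "arn:aws:iam::123456789012:role/sagemaker-execution-role"
    print(json.dumps(build_caller_policy(arn), indent=2))
```

The iam:PassedToService condition keeps the caller from passing the same role to some other service that might be easier to abuse.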
What a SageMaker training job actually touches
Strip away the magic and a training job does four things:
- Pulls a Docker image from ECR
- Reads training data from S3
- Writes model artifacts to S3
- Writes logs to CloudWatch
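That’s it. Those four steps map directly onto the permissions the execution role needs. AWS’s documented minimum for training jobs also lists cloudwatch:PutMetricData and logs:DescribeLogStreams; add them if your job reports a denied action for metrics or log streams.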
The minimum working policy for the execution role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3DataAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-ml-bucket",
        "arn:aws:s3:::your-ml-bucket/*"
      ]
    },
    {
      "Sid": "ECRImagePull",
      "Effect": "Allow",
      "Action": [
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetAuthorizationToken"
      ],
      "Resource": "*"
    },
    {
      "Sid": "CloudWatchLogs",
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:log-group:/aws/sagemaker/*"
    }
  ]
}
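Note: ecr:GetAuthorizationToken does not support resource-level permissions, so it must use "Resource": "*". That’s an AWS constraint, not laziness. The other three ECR actions do support resource-level permissions and can be scoped to a specific repository ARN.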
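If you want to tighten the ECR statement further, one option is to split it in two: the token action stays on "*", and the image-pull actions are pinned to a single repository. The sketch below assumes a hypothetical helper name (build_ecr_statements) and a made-up repository ARN.

```python
import json


def build_ecr_statements(repository_arn: str) -> list:
    """Split ECR access into an account-wide token grant and a repo-scoped pull grant."""
    return [
        {
            "Sid": "ECRAuthToken",
            "Effect": "Allow",
            # This action cannot be scoped to a resource, so "*" is required.
            "Action": "ecr:GetAuthorizationToken",
            "Resource": "*",
        },
        {
            "Sid": "ECRImagePull",
            "Effect": "Allow",
            "Action": [
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "ecr:BatchCheckLayerAvailability",
            ],
            # These actions can be limited to one repository.
            "Resource": repository_arn,
        },
    ]


if __name__ == "__main__":
    # Hypothetical region, account ID, and repository name.
    repo = "arn:aws:ecr:us-east-1:123456789012:repository/my-training-image"
    print(json.dumps(build_ecr_statements(repo), indent=2))
```

With this split, a compromised job can still fetch a registry token but can only pull layers from the one repository it was meant to use.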
The trust policy mistake that breaks jobs silently
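The execution role needs a trust relationship with the SageMaker service. iam:PassRole only checks whether you are allowed to hand the role to SageMaker; it says nothing about whether SageMaker can assume it. Without the trust policy, the PassRole check on your user role can still pass, but SageMaker cannot assume the execution role, and the job fails with a permissions error that doesn’t obviously point at the trust policy.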
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
This is the trust policy on the execution role — not on your user role. Easy to mix up, especially when you’re new to AWS.
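Because this mistake is easy to make, a small pre-flight check can help. The sketch below takes a trust policy document as a dict and reports whether it lets the SageMaker service assume the role. The function name trusts_sagemaker is hypothetical, and the check is deliberately simplified: it ignores Condition blocks, Deny statements, and wildcard actions.

```python
def _as_list(value):
    """IAM allows a single string or a list in most fields; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def trusts_sagemaker(trust_policy: dict) -> bool:
    """Return True if some Allow statement lets sagemaker.amazonaws.com call sts:AssumeRole."""
    for stmt in _as_list(trust_policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        if not isinstance(principal, dict):
            continue  # e.g. "Principal": "*" is not what we are looking for
        services = _as_list(principal.get("Service"))
        actions = _as_list(stmt.get("Action"))
        if "sagemaker.amazonaws.com" in services and "sts:AssumeRole" in actions:
            return True
    return False


if __name__ == "__main__":
    good = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }
    print(trusts_sagemaker(good))
```

You could feed it the AssumeRolePolicyDocument of the execution role, however you choose to fetch it, before launching a job.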
Why scope matters beyond the obvious
s3:* on * means your training job can read, overwrite, or delete objects in any bucket in the account, including production data, user uploads, or another team’s confidential datasets. A misconfigured training script or a compromised container image then has a broad blast radius.
Scoping to a specific bucket prefix (your-ml-bucket/experiments/*) means a compromised job can only touch experiment data. The policy above scopes to the whole bucket; narrowing the object ARN to a prefix is the next step. That’s the point of least privilege.
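As a rough illustration of prefix scoping, the sketch below builds the two S3 statements for a given bucket and prefix: object reads and writes are limited to the prefix, and listing is limited with the s3:prefix condition key. The helper name build_s3_statements is an assumption for illustration.

```python
import json


def build_s3_statements(bucket: str, prefix: str) -> list:
    """S3 statements that confine a job to one prefix inside one bucket."""
    prefix = prefix.strip("/")
    if not prefix:
        raise ValueError("prefix must be non-empty; otherwise the whole bucket is exposed")
    return [
        {
            "Sid": "ObjectAccessUnderPrefix",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            # Object-level actions apply to object ARNs under the prefix.
            "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
        },
        {
            "Sid": "ListOnlyThePrefix",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            # ListBucket applies to the bucket ARN; the condition narrows it.
            "Resource": f"arn:aws:s3:::{bucket}",
            "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
        },
    ]


if __name__ == "__main__":
    print(json.dumps(build_s3_statements("your-ml-bucket", "experiments"), indent=2))
```

Note that s3:ListBucket is granted on the bucket ARN while the object actions are granted on object ARNs; mixing those up is a common cause of access-denied errors.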
End project: Terraform module for a SageMaker execution role
Here’s a minimal but production-ready Terraform module. Drop this in modules/sagemaker-execution-role/.
main.tf:
variable "bucket_name" {
  description = "S3 bucket the SageMaker jobs will read/write"
  type        = string
}

variable "role_name" {
  description = "Name of the IAM execution role"
  type        = string
  default     = "sagemaker-execution-role"
}

resource "aws_iam_role" "sagemaker_execution" {
  name = var.role_name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "sagemaker.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "sagemaker_least_privilege" {
  name = "sagemaker-least-privilege"
  role = aws_iam_role.sagemaker_execution.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "S3DataAccess"
        Effect = "Allow"
        Action = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
        Resource = [
          "arn:aws:s3:::${var.bucket_name}",
          "arn:aws:s3:::${var.bucket_name}/*"
        ]
      },
      {
        Sid    = "ECRImagePull"
        Effect = "Allow"
        Action = [
          "ecr:GetDownloadUrlForLayer",
          "ecr:BatchGetImage",
          "ecr:BatchCheckLayerAvailability",
          "ecr:GetAuthorizationToken"
        ]
        Resource = "*"
      },
      {
        Sid    = "CloudWatchLogs"
        Effect = "Allow"
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ]
        Resource = "arn:aws:logs:*:*:log-group:/aws/sagemaker/*"
      }
    ]
  })
}

output "role_arn" {
  value = aws_iam_role.sagemaker_execution.arn
}
Call it from your root module:
module "sagemaker_role" {
  source      = "./modules/sagemaker-execution-role"
  bucket_name = "my-ml-training-data"
  role_name   = "my-project-sagemaker-role"
}
Then pass the ARN exposed as module.sagemaker_role.role_arn as the role argument in your sagemaker.estimator.Estimator() call, or select the role in the SageMaker Console.
Run terraform plan and verify it creates exactly one role and one inline policy: no managed policies, no wildcards on S3. If the training job fails, check the job’s FailureReason (shown in the Console and returned by DescribeTrainingJob) and the CloudWatch log group /aws/sagemaker/TrainingJobs. The error usually names the denied action.
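If you would rather not eyeball the plan, one way to automate the first check is to export it as JSON (terraform plan -out=plan.tfplan, then terraform show -json plan.tfplan > plan.json) and count the resources it creates. The sketch below assumes the plan contains only this module; the function names and the plan.json filename are hypothetical.

```python
import json


def count_created_resources(plan: dict) -> dict:
    """Count resources the plan will create, grouped by resource type."""
    counts = {}
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "create" in actions:
            counts[rc["type"]] = counts.get(rc["type"], 0) + 1
    return counts


def plan_is_minimal(plan: dict) -> bool:
    """True if the plan creates exactly one IAM role and one inline role policy.

    Assumption: nothing else in the root module is being created in this plan.
    Any managed-policy attachment would show up as an extra resource type.
    """
    expected = {"aws_iam_role": 1, "aws_iam_role_policy": 1}
    return count_created_resources(plan) == expected


if __name__ == "__main__":
    # plan.json is assumed to come from `terraform show -json plan.tfplan`.
    with open("plan.json") as fh:
        print(plan_is_minimal(json.load(fh)))
```

This only covers the resource count; checking for S3 wildcards would still mean inspecting the policy document itself.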
This is the kind of thing you document in your project README and walk through in interviews. IAM isn’t glamorous, but engineers who understand it stand out from those who just copy-paste AdministratorAccess.