Getting a model to run in a notebook is the easy part.
The hard part is getting it to serve predictions reliably at 2 AM when traffic spikes, without you having to do anything about it. That’s the gap between a demo and a deployed system, and it’s the gap most ML learners never close.
This article covers exactly how to close it on AWS SageMaker.
Why local inference doesn’t scale
Running model.predict() in a notebook has three production problems:
- It’s tied to your machine. The notebook closes, the service stops. You can’t call it from another service or expose it as an API without significant extra work.
- It doesn’t handle concurrency. A single Python process serving predictions one at a time is fine for prototypes. Real traffic requires batching, queuing, or horizontal scaling.
- There’s no observability. No latency metrics, no error rates, no auto-restart on crash.
SageMaker endpoints solve all three. They wrap your model in a managed container, expose an HTTPS endpoint, and handle scaling based on CloudWatch metrics.
SageMaker inference modes: pick the right one
SageMaker offers several inference options. The three below cover most workloads (a fourth, asynchronous inference, queues requests with large payloads or long processing times). Picking wrong is expensive.
Real-time endpoints: persistent, always-on HTTPS endpoints. Best for user-facing applications where latency matters and you need a response in under a second. You pay for every instance-hour, even when the endpoint is idle.
Serverless inference: no persistent instance. SageMaker spins up a container on demand, runs inference, and scales back to zero when idle. Best for intermittent traffic where cold starts (typically 1–5 seconds, longer for large models) are acceptable. You pay for the compute time each request uses and the data processed, not for hours of uptime.
Batch transform: runs inference on a full dataset asynchronously. No endpoint is created. Best for overnight processing, data pipelines, or evaluating a model against a test set. Usually the cheapest option for high-volume offline work.
For a production API serving user requests, use real-time endpoints. For a weekend project or internal tool with sporadic usage, serverless is usually far cheaper, often by an order of magnitude, because you stop paying for idle hours.
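To make the cost trade-off concrete, here is a rough break-even sketch. Every price in it is a placeholder assumption, not a quoted AWS rate, and the function names are hypothetical; substitute current numbers from the SageMaker pricing page for your region.

```python
def realtime_monthly_cost(hourly_rate, instances=1, hours=730):
    """Always-on endpoint: billed for every hour, regardless of traffic."""
    return hourly_rate * instances * hours


def serverless_monthly_cost(requests, seconds_per_request, price_per_second):
    """Serverless endpoint: billed for the compute time actually used.

    Ignores data-processing charges and cold-start time for simplicity.
    """
    return requests * seconds_per_request * price_per_second


# ASSUMED prices, for illustration only (not real AWS rates).
ASSUMED_HOURLY_RATE = 0.12          # currency units per instance-hour
ASSUMED_PRICE_PER_SECOND = 0.00008  # currency units per second of inference

realtime = realtime_monthly_cost(ASSUMED_HOURLY_RATE)
serverless = serverless_monthly_cost(
    requests=50_000,
    seconds_per_request=0.2,
    price_per_second=ASSUMED_PRICE_PER_SECOND,
)
print(f"real-time:  {realtime:.2f} per month")
print(f"serverless: {serverless:.2f} per month")
```

The gap narrows as utilization rises: once an endpoint is busy most of the hour, the always-on instance tends to win.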
The 4 steps from trained model to live endpoint
Step 1: Package the model artifact
SageMaker expects a custom model as a model.tar.gz archive uploaded to S3, containing the weights and any preprocessing or inference code. For HuggingFace models published on the Hub, you can skip the tarball: the HuggingFace container downloads the model by ID at startup. For fine-tuned or custom models, you build and upload the archive yourself.
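For the custom-model case, a minimal packaging sketch using only the standard library might look like this. The function name and the directory layout are assumptions for illustration; check your framework container's documentation for the exact layout it expects inside the archive.

```python
import tarfile
from pathlib import Path


def package_model(model_dir, output_path="model.tar.gz"):
    """Bundle every file under model_dir into a gzip-compressed tarball.

    Files are stored relative to model_dir, so they sit at the archive root
    rather than under a nested folder.
    """
    model_dir = Path(model_dir)
    # Collect the file list first so the archive never tries to include itself.
    files = sorted(p for p in model_dir.rglob("*") if p.is_file())
    with tarfile.open(output_path, "w:gz") as tar:
        for path in files:
            tar.add(path, arcname=path.relative_to(model_dir).as_posix())
    return output_path
```

After packaging, the archive can be uploaded with boto3's `upload_file(Filename, Bucket, Key)` on an S3 client, and the resulting S3 URI is what you pass as the model data location.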
Step 2: Define the model container
SageMaker provides pre-built Docker images for popular frameworks (PyTorch, TensorFlow, HuggingFace). You specify which one matches your model. If your model needs custom dependencies, you extend one of these images and push to ECR.
Step 3: Create the endpoint configuration
This defines the instance type, initial instance count, and traffic routing. You can configure multiple production variants and split traffic between them by weight for A/B testing. Most teams never use this capability, but it is genuinely useful when rolling out a new model version.
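As a sketch of how weighted variants work, the snippet below builds a `ProductionVariants` list in the shape that boto3's `create_endpoint_config` accepts and computes the traffic share each variant would receive. The helper functions, variant names, and model names are hypothetical.

```python
def build_production_variants(variants, instance_type="ml.m5.large"):
    """Turn (variant_name, model_name, weight) tuples into a
    ProductionVariants list for create_endpoint_config."""
    return [
        {
            "VariantName": name,
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "InitialVariantWeight": float(weight),
        }
        for name, model_name, weight in variants
    ]


def traffic_shares(production_variants):
    """Traffic fraction per variant: its weight divided by the sum of weights."""
    total = sum(v["InitialVariantWeight"] for v in production_variants)
    return {
        v["VariantName"]: v["InitialVariantWeight"] / total
        for v in production_variants
    }


# Hypothetical names: send 90% of traffic to v1 and 10% to the new v2.
variants = build_production_variants([
    ("model-v1", "sentiment-model-v1", 9),
    ("model-v2", "sentiment-model-v2", 1),
])
print(traffic_shares(variants))

# To create the configuration (needs boto3 and AWS credentials):
# import boto3
# boto3.client("sagemaker").create_endpoint_config(
#     EndpointConfigName="sentiment-ab-config",  # hypothetical name
#     ProductionVariants=variants,
# )
```

Weights are relative, so 9 and 1 behave the same as 0.9 and 0.1.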
Step 4: Deploy the endpoint
SageMaker provisions the instances, downloads your model artifact, starts the container, and marks the endpoint InService once it passes health checks. From this point, you call runtime.invoke_endpoint() (on a boto3 sagemaker-runtime client) to get predictions.
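The SageMaker Python SDK hides most of this, but steps 2 to 4 map onto three low-level API calls. The sketch below shows that mapping with a boto3 SageMaker client passed in as `sm`; the function name is hypothetical, and the image URI, S3 path, and role ARN are values you would supply.

```python
def deploy_with_boto3(sm, name, image_uri, model_data_url, role_arn,
                      instance_type="ml.m5.large"):
    """Steps 2-4 as low-level API calls; `sm` is a boto3 SageMaker client."""
    # Step 2: bind the container image to the model artifact from step 1.
    sm.create_model(
        ModelName=name,
        PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data_url},
        ExecutionRoleArn=role_arn,
    )
    # Step 3: describe the hardware and traffic routing.
    config_name = f"{name}-config"
    sm.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
        }],
    )
    # Step 4: provision the endpoint and block until it is InService.
    sm.create_endpoint(EndpointName=name, EndpointConfigName=config_name)
    sm.get_waiter("endpoint_in_service").wait(EndpointName=name)
    return name
```

In practice you would call it as `deploy_with_boto3(boto3.client("sagemaker"), ...)`. Passing the client in keeps the function easy to test with a stub.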
The cold start problem — and how to handle it
Real-time endpoints have no per-request cold start. Their instances are always running.
Serverless endpoints incur a cold start, typically 1–5 seconds for a small model, when a request arrives after a period of inactivity. This is usually fine for internal tools. It’s not fine for user-facing applications where the first request of the day waits 5 seconds or hits a client-side timeout.
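If you need serverless pricing but can’t tolerate cold starts, there’s a pattern: an EventBridge (formerly CloudWatch Events) scheduled rule that triggers a small Lambda function, which invokes your endpoint every 5 minutes with a dummy payload. This keeps one container warm, although a burst of concurrent requests can still cold-start additional containers. The cost is negligible: 288 requests per day at serverless pricing.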
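The keep-warm pattern needs something to make the call, since a scheduled rule cannot invoke an endpoint by itself; a small Lambda function on a 5-minute schedule is one common choice. A minimal handler sketch follows. The endpoint name is hypothetical, and the optional `runtime` argument is an assumption added here so the function can be exercised without AWS.

```python
import json

ENDPOINT_NAME = "sentiment-classifier-serverless"  # hypothetical endpoint name


def lambda_handler(event, context, runtime=None):
    """Invoke the endpoint with a dummy payload so a container stays warm.

    `runtime` is injectable for testing; inside Lambda it defaults to a
    real sagemaker-runtime client.
    """
    if runtime is None:
        import boto3  # bundled with the Lambda Python runtime
        runtime = boto3.client("sagemaker-runtime")

    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"inputs": "warm-up"}),
    )
    # Read the body so the request completes before the function exits.
    response["Body"].read()
    return {"status": "warm", "endpoint": ENDPOINT_NAME}
```

The function's execution role needs permission for `sagemaker:InvokeEndpoint` on the target endpoint.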
For real-time endpoints, the equivalent delay occurs when the endpoint is first deployed or when a scaling event adds an instance. Bringing up a new instance can take several minutes, but it happens in the background while existing instances continue serving.
End project: deploy a HuggingFace model to a real-time endpoint with autoscaling
This script deploys distilbert-base-uncased-finetuned-sst-2-english (a sentiment classification model) to a SageMaker real-time endpoint and configures autoscaling. It runs with any AWS account that has the IAM permissions from the previous article.
#!/usr/bin/env python3
"""
Deploy a HuggingFace model to SageMaker real-time endpoint with autoscaling.

Requirements:
    pip install sagemaker boto3

Environment:
    AWS_DEFAULT_REGION set, or pass region_name below.
    SAGEMAKER_EXECUTION_ROLE_ARN: ARN of your SageMaker execution role.
"""
import json
import os

import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# --- Config ---
REGION = os.environ.get("AWS_DEFAULT_REGION", "ap-south-1")
ROLE_ARN = os.environ["SAGEMAKER_EXECUTION_ROLE_ARN"]
ENDPOINT_NAME = "sentiment-classifier-v1"

# HuggingFace model hub ID
HF_MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
HF_TASK = "text-classification"

# Instance type: ml.m5.large is a low-cost general-purpose CPU option
# Use ml.g4dn.xlarge for GPU if latency is critical
INSTANCE_TYPE = "ml.m5.large"
INITIAL_INSTANCE_COUNT = 1

# Autoscaling config
MIN_CAPACITY = 1
MAX_CAPACITY = 3
TARGET_INVOCATIONS_PER_INSTANCE = 50  # scale out when avg > 50 req/min per instance


def deploy_endpoint():
    sess = sagemaker.Session(boto_session=boto3.Session(region_name=REGION))

    # With HF_MODEL_ID set, the container pulls the model from the Hub at
    # startup, so no S3 upload is needed
    hf_model = HuggingFaceModel(
        model_data=None,  # None = pull from Hub at runtime
        env={
            "HF_MODEL_ID": HF_MODEL_ID,
            "HF_TASK": HF_TASK,
        },
        role=ROLE_ARN,
        transformers_version="4.37",
        pytorch_version="2.1",
        py_version="py310",
        sagemaker_session=sess,
    )

    print(f"Deploying {HF_MODEL_ID} to endpoint '{ENDPOINT_NAME}'...")
    predictor = hf_model.deploy(
        initial_instance_count=INITIAL_INSTANCE_COUNT,
        instance_type=INSTANCE_TYPE,
        endpoint_name=ENDPOINT_NAME,
        wait=True,  # block until the endpoint is InService
    )
    print(f"Endpoint deployed: {ENDPOINT_NAME}")
    return predictor


def configure_autoscaling(endpoint_name):
    client = boto3.client("application-autoscaling", region_name=REGION)
    # "AllTraffic" is the default variant name created by the SageMaker SDK
    resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"

    # Register the endpoint as a scalable target
    client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=MIN_CAPACITY,
        MaxCapacity=MAX_CAPACITY,
    )

    # Target tracking policy: scale based on invocations per instance
    client.put_scaling_policy(
        PolicyName=f"{endpoint_name}-scaling-policy",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": TARGET_INVOCATIONS_PER_INSTANCE,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,  # seconds to wait between scale-in actions
            "ScaleOutCooldown": 60,  # seconds to wait between scale-out actions
        },
    )
    print(f"Autoscaling configured: {MIN_CAPACITY}–{MAX_CAPACITY} instances")


def test_endpoint(endpoint_name):
    runtime = boto3.client("sagemaker-runtime", region_name=REGION)
    test_inputs = [
        "This product is absolutely fantastic, highly recommend!",
        "Terrible experience, the item broke after one day.",
        "It works as described, nothing more nothing less.",
    ]

    print("\nRunning test predictions:")
    for text in test_inputs:
        payload = json.dumps({"inputs": text})
        response = runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=payload,
        )
        # The response is a list with one {"label", "score"} dict per input
        result = json.loads(response["Body"].read())
        label = result[0]["label"]
        score = round(result[0]["score"], 3)
        print(f"  Input: {text[:60]}")
        print(f"  Output: {label} (confidence: {score})\n")


def main():
    deploy_endpoint()
    configure_autoscaling(ENDPOINT_NAME)
    test_endpoint(ENDPOINT_NAME)

    print("\nEndpoint is live. Invoke it from any service:")
    print(f"  Endpoint name: {ENDPOINT_NAME}")
    print(f"  Region: {REGION}")
    print("\nTo delete the endpoint and stop billing:")
    print("  predictor.delete_endpoint()  # predictor is returned by deploy_endpoint()")
    print("Or via CLI:")
    print(f"  aws sagemaker delete-endpoint --endpoint-name {ENDPOINT_NAME}")


if __name__ == "__main__":
    main()
Save the script as deploy.py and run it with python deploy.py. Deployment takes 5–8 minutes. After that, you have a live HTTPS endpoint that any caller with IAM permission to invoke it can reach (a Lambda function, an EC2 instance, an ECS container) using runtime.invoke_endpoint().
The autoscaling config means that if traffic suddenly triples, SageMaker adds instances automatically, up to the maximum of 3. When traffic drops, it scales back in gradually, waiting at least the 300-second cooldown between removals. You don’t touch anything.
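The arithmetic behind target tracking can be approximated in a few lines. This is a simplified model, not the exact algorithm Application Auto Scaling runs (which also applies alarm evaluation periods and cooldowns); the function name is hypothetical.

```python
import math


def desired_instances(total_invocations_per_minute, target_per_instance,
                      min_capacity, max_capacity):
    """Approximate the instance count that target tracking steers toward."""
    needed = math.ceil(total_invocations_per_minute / target_per_instance)
    # Clamp to the registered scalable-target bounds.
    return max(min_capacity, min(max_capacity, needed))


# Values from the script: target of 50 req/min per instance, 1 to 3 instances.
for load in (30, 90, 400):
    count = desired_instances(load, 50, 1, 3)
    print(f"{load} req/min -> {count} instance(s)")
```

Note the last case: demand beyond the maximum capacity is simply not met by scaling, so the remaining instances absorb the overload.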
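Delete the endpoint when you’re done experimenting. An ml.m5.large instance costs roughly ₹7–8 per hour in ap-south-1 (check the SageMaker pricing page for current rates), and you are billed for every hour the endpoint exists, so running it overnight adds up.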
The habit of knowing how to tear down infrastructure is just as important as knowing how to create it.