Model Deployment & Endpoints

Duration: 65 min

Deploying models to production requires choosing the right endpoint type. This module covers real-time endpoints, serverless endpoints, async inference, and multi-model endpoints for different use cases.
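As a rough summary of the trade-offs covered in this module, the sketch below maps a few workload traits to one of the four endpoint types. The function name, its parameters, and the default thresholds are illustrative assumptions rather than SageMaker limits; check the current service quotas before relying on any number.

```python
def choose_endpoint_type(payload_mb, processing_time_s, steady_traffic, model_count=1,
                         sync_payload_limit_mb=5.0, sync_time_limit_s=60.0):
    """Suggest an endpoint type from coarse workload traits (illustrative only)."""
    # Large inputs or slow predictions do not fit a synchronous request
    if payload_mb > sync_payload_limit_mb or processing_time_s > sync_time_limit_s:
        return "async"
    # Many models sharing one fleet is the multi-model use case
    if model_count > 1:
        return "multi-model"
    # Steady traffic justifies always-on instances; spiky traffic favors serverless
    return "real-time" if steady_traffic else "serverless"


print(choose_endpoint_type(payload_mb=0.1, processing_time_s=0.2, steady_traffic=False))
```

Real decisions usually weigh cost and cold-start tolerance as well, so treat the result as a starting point rather than a rule.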

Real-Time Endpoints

Real-time endpoints provide low-latency predictions for synchronous requests. They maintain warm instances ready to serve predictions immediately.

from sagemaker.estimator import Estimator
import sagemaker

session = sagemaker.Session()
role = 'arn:aws:iam::123456789012:role/SageMakerRole'

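# Train a model with the built-in XGBoost container.
# image_uris.retrieve looks up the registry path for the session's region;
# the version shown is an example, so pin whichever release you have validated.
estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve(
        framework='xgboost',
        region=session.boto_region_name,
        version='1.7-1'
    ),
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path='s3://my-bucket/output',
    sagemaker_session=session
)

# The built-in XGBoost algorithm reads its training data from the 'train' channel
estimator.fit({'train': 's3://my-bucket/train-data/'})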

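# Deploy as real-time endpoint
from sagemaker.serializers import CSVSerializer

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    endpoint_name='xgboost-realtime-endpoint',
    serializer=CSVSerializer()  # send requests as text/csv, which the XGBoost container accepts
)

# Make predictions: one comma-separated row of feature values
test_data = '5.1,3.5,1.4,0.2'
response = predictor.predict(test_data)  # raw bytes unless a deserializer is set
print(f"Prediction: {response}")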
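When a request carries more than one record, the CSV payload can be built from rows of feature values instead of being typed by hand. The helper below is a minimal sketch using only the standard library; the name `rows_to_csv` and the sample values in the second row are made up for illustration.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize an iterable of feature rows into a CSV string, one row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    # Drop the trailing newline so a single row matches the inline example format
    return buffer.getvalue().rstrip("\n")


payload = rows_to_csv([[5.1, 3.5, 1.4, 0.2], [6.2, 2.9, 4.3, 1.3]])
print(payload)
```

The resulting string can be passed to predictor.predict in place of test_data, provided the predictor sends it as text/csv.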

Serverless Endpoints

Serverless endpoints scale automatically with traffic, down to zero when idle, so there are no instances to manage. They suit intermittent or unpredictable workloads that can tolerate an occasional cold start.

from sagemaker.serverless import ServerlessInferenceConfig

# Create serverless endpoint
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=1024,
    max_concurrency=10
)

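from sagemaker.serializers import CSVSerializer

# Instance type and count are omitted: capacity comes from the serverless config
predictor = estimator.deploy(
    serverless_inference_config=serverless_config,
    endpoint_name='xgboost-serverless-endpoint',
    serializer=CSVSerializer()  # send requests as text/csv
)

# Invoke serverless endpoint (the first request after an idle period may incur a cold start)
response = predictor.predict(test_data)
print(f"Prediction: {response}")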
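Serverless settings are easy to get wrong because only certain values are accepted. The check below is a sketch that catches bad values before deployment. The limits it encodes (memory in 1 GB steps from 1024 to 6144 MB, concurrency between 1 and 200) are assumptions based on the documented service limits and may change; the function name is hypothetical.

```python
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)  # assumed valid sizes
MAX_CONCURRENCY_LIMIT = 200  # assumed per-endpoint ceiling


def validate_serverless_config(memory_size_in_mb, max_concurrency):
    """Return a list of problems; an empty list means the values look valid."""
    problems = []
    if memory_size_in_mb not in ALLOWED_MEMORY_MB:
        problems.append(f"memory_size_in_mb must be one of {ALLOWED_MEMORY_MB}")
    if not 1 <= max_concurrency <= MAX_CONCURRENCY_LIMIT:
        problems.append(f"max_concurrency must be between 1 and {MAX_CONCURRENCY_LIMIT}")
    return problems


print(validate_serverless_config(memory_size_in_mb=1024, max_concurrency=10))
```

An empty list for the values used in this section indicates they fall inside the assumed limits.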

Async Inference

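Async inference handles large payloads and long-running predictions. Each request points to an input object in S3; the endpoint queues the request, processes it when capacity is available, and writes the result to the configured S3 output path.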

from sagemaker.async_inference.async_inference_config import AsyncInferenceConfig

# Configure async inference
async_config = AsyncInferenceConfig(
    output_path='s3://my-bucket/async-output/',
    max_concurrent_invocations_per_instance=10
)

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    async_inference_config=async_config,
    endpoint_name='xgboost-async-endpoint'
)

# Invoke async endpoint: the input must already be uploaded to S3
input_location = 's3://my-bucket/async-input/test-data.json'
response = predictor.predict_async(input_path=input_location)
output_location = response.output_path
print(f"Output will be at: {output_location}")
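Because the result is written to S3 some time after the call returns, the caller has to wait for the output object to appear. The sketch below shows one way to poll for it. `fetch` is a hypothetical callable supplied by the caller (for example, a wrapper around an S3 download that returns None while the object does not exist yet), and the delay and attempt counts are arbitrary defaults.

```python
import time
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def wait_for_output(output_uri, fetch, delay_s=5.0, max_attempts=60, sleep=time.sleep):
    """Poll until fetch(bucket, key) returns something other than None."""
    bucket, key = split_s3_uri(output_uri)
    for attempt in range(max_attempts):
        result = fetch(bucket, key)
        if result is not None:
            return result
        # Do not sleep after the final attempt
        if attempt < max_attempts - 1:
            sleep(delay_s)
    raise TimeoutError(f"no result at {output_uri} after {max_attempts} attempts")


print(split_s3_uri("s3://my-bucket/async-output/result.out"))
```

Passing `sleep` as a parameter keeps the loop testable without real waiting. In production, a notification on completion is often preferable to polling.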

Multi-Model Endpoints

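Multi-model endpoints host multiple models behind a single endpoint and a shared container, reducing costs and simplifying management. Model artifacts are stored under a common S3 prefix and loaded on demand when a request targets them.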

from sagemaker.multidatamodel import MultiDataModel

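# Define the multi-model container; 'name' becomes the SageMaker model name
multi_model = MultiDataModel(
    name='multi-model-endpoint',
    model_data_prefix='s3://my-bucket/models/',  # every artifact under this prefix is servable
    image_uri=sagemaker.image_uris.retrieve(
        framework='xgboost',
        region=session.boto_region_name,
        version='1.7-1'  # example version; use the one your models were trained with
    ),
    role=role,
    sagemaker_session=session
)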

# Add models: each call copies a local or S3 artifact into model_data_prefix
multi_model.add_model('model-1.tar.gz')
multi_model.add_model('model-2.tar.gz')

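# Deploy
multi_model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large'
)

# deploy() returns a predictor only when a predictor class was configured,
# so build one explicitly for the new endpoint
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer

predictor = Predictor(
    endpoint_name=multi_model.endpoint_name,
    sagemaker_session=session,
    serializer=CSVSerializer()
)

# Invoke specific model: target_model is the artifact path relative to model_data_prefix
response = predictor.predict(
    test_data,
    target_model='model-1.tar.gz'
)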
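Outside the SageMaker Python SDK, the model is selected through the TargetModel parameter of the runtime invoke_endpoint call. The sketch below only assembles the request arguments so it can run without AWS access; the helper name is hypothetical, and the endpoint and artifact names are the placeholders used in this section.

```python
def build_invoke_args(endpoint_name, target_model, csv_body):
    """Assemble keyword arguments for the runtime invoke_endpoint call."""
    return {
        "EndpointName": endpoint_name,
        "TargetModel": target_model,  # artifact path relative to the S3 model prefix
        "ContentType": "text/csv",
        "Body": csv_body.encode("utf-8"),
    }


args = build_invoke_args("multi-model-endpoint", "model-2.tar.gz", "5.1,3.5,1.4,0.2")
print(args["TargetModel"])

# With AWS credentials configured, the request could be sent as:
#   boto3.client("sagemaker-runtime").invoke_endpoint(**args)
```

Switching models is then a matter of changing one argument per request, which is what lets many models share the same instances.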

Endpoint Configuration

{
  "endpoint_config": {
    "endpoint_name": "my-endpoint",
    "endpoint_config_name": "my-endpoint-config",
    "production_variants": [
      {
        "variant_name": "variant-1",
        "model_name": "my-model",
        "initial_instance_count": 1,
        "instance_type": "ml.m5.large",
        "initial_variant_weight": 1.0
      }
    ],
    "data_capture_config": {
      "enabled": true,
      "initial_sampling_percentage": 100,
      "destination_s3_uri": "s3://my-bucket/data-capture/"
    }
  }
}
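The configuration above uses snake_case keys, whereas the SageMaker CreateEndpointConfig API expects PascalCase request fields. The sketch below shows one possible translation. The function name is hypothetical, and capturing both input and output is an assumption because the configuration does not specify capture modes. The endpoint_name key is not part of this request; it belongs to the separate call that creates the endpoint.

```python
import json


def to_create_endpoint_config_request(cfg):
    """Translate the snake_case config into CreateEndpointConfig-style keys."""
    request = {
        "EndpointConfigName": cfg["endpoint_config_name"],
        "ProductionVariants": [
            {
                "VariantName": v["variant_name"],
                "ModelName": v["model_name"],
                "InitialInstanceCount": v["initial_instance_count"],
                "InstanceType": v["instance_type"],
                "InitialVariantWeight": v["initial_variant_weight"],
            }
            for v in cfg["production_variants"]
        ],
    }
    capture = cfg.get("data_capture_config")
    if capture and capture.get("enabled"):
        request["DataCaptureConfig"] = {
            "EnableCapture": True,
            "InitialSamplingPercentage": capture["initial_sampling_percentage"],
            "DestinationS3Uri": capture["destination_s3_uri"],
            # Assumption: capture both requests and responses
            "CaptureOptions": [{"CaptureMode": "Input"}, {"CaptureMode": "Output"}],
        }
    return request


config_text = """
{
  "endpoint_config": {
    "endpoint_name": "my-endpoint",
    "endpoint_config_name": "my-endpoint-config",
    "production_variants": [
      {"variant_name": "variant-1", "model_name": "my-model",
       "initial_instance_count": 1, "instance_type": "ml.m5.large",
       "initial_variant_weight": 1.0}
    ],
    "data_capture_config": {
      "enabled": true,
      "initial_sampling_percentage": 100,
      "destination_s3_uri": "s3://my-bucket/data-capture/"
    }
  }
}
"""

request = to_create_endpoint_config_request(json.loads(config_text)["endpoint_config"])
print(json.dumps(request, indent=2))

# With AWS credentials configured, the request could be sent as:
#   boto3.client("sagemaker").create_endpoint_config(**request)
```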

Auto-Scaling Endpoints

import boto3

autoscaling = boto3.client('application-autoscaling')

# Register endpoint for auto-scaling ('AllTraffic' is the default variant name for SDK-deployed endpoints)
autoscaling.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/my-endpoint/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=10
)

# Create scaling policy
autoscaling.put_scaling_policy(
    PolicyName='endpoint-scaling-policy',
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/my-endpoint/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 70.0,  # average invocations per instance per minute
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        }
    }
)
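To get a feel for what a target of 70 invocations per instance means, the arithmetic below estimates the instance count the policy would tend toward for a given request rate. It is a simplified approximation that ignores cooldowns and alarm evaluation periods, and the function name is hypothetical. The defaults mirror the target value and capacity bounds used above.

```python
import math


def estimate_instance_count(invocations_per_minute, target_per_instance=70.0,
                            min_capacity=1, max_capacity=10):
    """Approximate the instance count a target-tracking policy would settle on."""
    # Enough instances that each one stays at or below the target rate
    needed = math.ceil(invocations_per_minute / target_per_instance)
    # The registered scalable target bounds the result
    return max(min_capacity, min(max_capacity, needed))


for rate in (10, 350, 351, 10000):
    print(rate, estimate_instance_count(rate))
```

At 350 invocations per minute the estimate is 5 instances; at 10,000 it is capped at the MaxCapacity of 10, at which point each instance runs well above the target.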

Quiz 1

❓ What is the primary advantage of real-time endpoints?

Low-latency synchronous predictions
Automatic scaling
Batch processing
Cost savings

Quiz 2

❓ When should you use serverless endpoints?

For high-throughput, consistent traffic
For variable or unpredictable traffic
For batch processing
For real-time low-latency predictions

Quiz 3

❓ What is async inference best for?

Real-time predictions
Small payloads
Large payloads and long-running predictions
Low-latency requirements

Quiz 4

❓ What is the main benefit of multi-model endpoints?

Host multiple models on one endpoint, reducing costs
Faster predictions
Automatic model selection
Better accuracy

Quiz 5

❓ What does auto-scaling do for endpoints?

Automatically trains new models
Adjusts instance count based on traffic
Monitors model performance
Deploys new models