AWS Deadline Cloud CMF Workers in a closed VPC need an STS VPC endpoint
Introduction
I'd like to share a problem I encountered when running AWS Deadline Cloud CMF (Customer-Managed Fleet) Workers in a closed VPC with no internet egress.
The Worker Agent service started, and the Worker joined the Fleet in the IDLE state, so everything appeared normal at first glance. However, when I actually submitted a job, the task failed at the input file synchronization stage with the following error:
Connect timeout on endpoint URL: "https://sts.ap-northeast-1.amazonaws.com/"
I had understood that the Deadline Cloud Worker Agent was designed to operate without calling STS (AWS Security Token Service), so it was surprising to see an STS connection suddenly appear during job execution.
To state the conclusion upfront: the Worker Agent itself indeed does not call STS. The job attachments process that handles input file synchronization, however, internally falls back to STS. Therefore, in a closed VPC, the solution is to add an STS interface VPC endpoint.
What is AWS Deadline Cloud
AWS Deadline Cloud is a managed service that allows you to build a render farm on the cloud for 3DCG/VFX production. It provides features necessary for render farm operations, including Worker auto-scaling, job management, and license supply.
Target Audience
- Those who are building or considering building CMF in a closed VPC with AWS Deadline Cloud
- Those struggling to isolate issues where the Worker appears normal but jobs keep failing
- Those who want to know how Deadline Cloud Workers obtain AWS credentials
References
- AWS Deadline Cloud Developer Guide: Customer-managed fleets
- AWS Deadline Cloud Developer Guide: Storage profiles and job attachments
- AWS PrivateLink: Creating interface VPC endpoints
- aws-deadline/deadline-cloud-worker-agent
- aws-deadline/deadline-cloud (job attachments implementation)
Prerequisites
To understand the reproduction conditions, I'll first explain three prerequisites.
- CMF: Unlike SMF (Service-Managed Fleet) where AWS launches Workers, this is a method where users register their own EC2 instances as Workers in the Fleet.
- Closed VPC: In this article, I use this term to refer to a VPC that has no internet gateway or NAT and cannot communicate outbound to the internet. A typical private subnet can communicate outbound via NAT, but here even that is blocked, and AWS APIs are only reachable via interface VPC endpoints (PrivateLink).
- job attachments: A Deadline Cloud mechanism that synchronizes inputs such as scene files to Workers via S3. Input synchronization runs on the Worker as an action called syncInputJobAttachments.
The configuration looks like the following diagram. It separately illustrates the route through which the Worker Agent itself obtains credentials via the deadline endpoint, and the route through which job attachments calls S3 and STS.
Symptoms
When a job is submitted, it progresses from READY to RUNNING. However, the task then ends in the FAILED state at the input synchronization stage, and the following message appears in the progressMessage in CloudWatch Logs:
An issue occurred with AWS service request while downloading binary file:
Connect timeout on endpoint URL: "https://sts.ap-northeast-1.amazonaws.com/"
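To confirm that the timeout is a network reachability problem rather than a permissions problem, a plain TCP connection test from the Worker host can help. The following is a minimal sketch using only the Python standard library. The function names regional_sts_host and can_connect are hypothetical helpers introduced here for illustration, and the 3-second timeout is an arbitrary assumption.

```python
import socket


def regional_sts_host(region: str) -> str:
    """Return the regional STS hostname, e.g. sts.ap-northeast-1.amazonaws.com."""
    return f"sts.{region}.amazonaws.com"


def can_connect(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts alike.
        return False
```

On a Worker in the closed VPC, calling can_connect(regional_sts_host("ap-northeast-1")) should return False while the STS endpoint is missing. This only checks TCP reachability, not credentials or IAM permissions.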
Why the Worker Agent Itself Does Not Call STS
Since the Worker Agent has started and joined the Fleet, it must have obtained some AWS credentials. However, that route does not go through STS. When the Worker Agent starts, it calls deadline's AssumeFleetRoleForWorker API and receives temporary credentials for the fleet worker role via the deadline endpoint. The deadline service itself takes over the credential issuance, and sts:AssumeRole is not called directly.
The following logs appear in the Worker Agent:
[AWSCreds.Query] Requesting AWS Credentials
[deadline:AssumeFleetRoleForWorker] (200) accessKeyId=ASIA... sessionToken=...
[AWSCreds.Query] Obtained temporary Worker AWS Credentials.
For this reason, even in a closed VPC without an STS endpoint, the Worker Agent itself operates without issues and the Worker is added to the Fleet in the IDLE state. This is the source of the confusing situation where the Worker appears normal but only jobs fail.
Why Only job attachments Calls STS
STS appears during input synchronization because the AWS account ID is needed for the ExpectedBucketOwner header attached during S3 GetObject. The account ID lookup that follows only surfaces as a failure in a closed VPC, where STS is unreachable.
The job attachments process first tries to extract the account ID from the session. However, the credentials file issued by the Worker Agent contains only five fields: Version, AccessKeyId, SecretAccessKey, SessionToken, and Expiration, and does not include an AccountId field. Furthermore, the profile configuration written for credential retrieval does not have aws_account_id either. As a result, botocore determines the account ID to be None and calls sts.get_caller_identity() as a last resort.
The full chain of events is summarized as follows:
- job attachments requests the account ID for S3 GetObject's ExpectedBucketOwner.
- The credentials file has no AccountId, and the profile configuration has no aws_account_id.
- botocore determines the account ID to be None and calls sts.get_caller_identity() as a fallback.
- Since there is no STS endpoint in the closed VPC, the TCP connection to STS times out.
- job attachments catches this exception and marks the task as FAILED due to input file download failure.
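The chain above can be modeled in a few lines of Python. This is a simplified sketch of the fallback order described in this article, not the actual botocore or job attachments code. The function resolve_account_id, the stand-in sts_unreachable, and all credential values are hypothetical placeholders.

```python
from typing import Any, Callable, Mapping, Optional


def resolve_account_id(
    credentials: Mapping[str, Any],
    profile: Mapping[str, Any],
    sts_lookup: Callable[[], str],
) -> str:
    """Simplified model of the fallback chain; not the library's actual code."""
    # 1. Account ID carried in the credentials themselves, if any.
    account_id: Optional[str] = credentials.get("AccountId")
    # 2. Account ID configured on the profile, if any.
    if account_id is None:
        account_id = profile.get("aws_account_id")
    # 3. Last resort: ask STS, which needs network access to the STS endpoint.
    if account_id is None:
        account_id = sts_lookup()
    return account_id


def sts_unreachable() -> str:
    """Stand-in for sts.get_caller_identity() in a VPC without an STS endpoint."""
    raise TimeoutError(
        'Connect timeout on endpoint URL: "https://sts.ap-northeast-1.amazonaws.com/"'
    )


# Credentials shaped like the file issued by the Worker Agent:
# five fields, no AccountId. All values are placeholders.
worker_credentials = {
    "Version": 1,
    "AccessKeyId": "ASIAEXAMPLE",
    "SecretAccessKey": "example-secret",
    "SessionToken": "example-token",
    "Expiration": "2030-01-01T00:00:00Z",
}
```

With these inputs, resolve_account_id(worker_credentials, {}, sts_unreachable) raises TimeoutError, which corresponds to the failed task. Supplying the account ID at step 1 or step 2 returns it without ever calling the STS stand-in.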
Resolution
The solution is to add the STS interface VPC endpoint (com.amazonaws.<region>.sts) to your VPC. This lets the fallback get_caller_identity() call reach STS from inside the VPC, and input synchronization completes successfully. An interface endpoint is billed hourly per Availability Zone plus a data processing charge, which for this low-traffic use is a small monthly cost, so this is the most straightforward solution.
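As one way to create the endpoint, the following sketch uses the EC2 CreateVpcEndpoint API through boto3. The helper names sts_endpoint_params and create_sts_endpoint are hypothetical and not from the original setup. The sketch assumes that private DNS is enabled so that the regional STS hostname resolves to the endpoint, that the VPC has DNS support and DNS hostnames turned on, and that the given security groups allow inbound HTTPS (443) from the Workers.

```python
from typing import Any, Dict, Sequence


def sts_endpoint_params(
    region: str,
    vpc_id: str,
    subnet_ids: Sequence[str],
    security_group_ids: Sequence[str],
) -> Dict[str, Any]:
    """Build the CreateVpcEndpoint parameters for an STS interface endpoint."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.sts",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Assumption: private DNS is wanted so sts.<region>.amazonaws.com
        # resolves to the endpoint's private IP addresses.
        "PrivateDnsEnabled": True,
    }


def create_sts_endpoint(
    region: str,
    vpc_id: str,
    subnet_ids: Sequence[str],
    security_group_ids: Sequence[str],
) -> str:
    """Create the endpoint and return its ID. Requires boto3 and EC2 permissions."""
    import boto3  # imported here so the parameter builder works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.create_vpc_endpoint(
        **sts_endpoint_params(region, vpc_id, subnet_ids, security_group_ids)
    )
    return response["VpcEndpoint"]["VpcEndpointId"]
```

The call must be made from a machine that can reach the EC2 API, not from the closed VPC itself unless an EC2 endpoint exists there. The same parameters map directly onto CloudFormation, Terraform, or the console if you manage the VPC that way.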
The following alternatives are also worth considering:
- Instead of delivering scene files via job attachments, bake them into the AMI or download them directly from S3, thereby avoiding running syncInputJobAttachments altogether
- Write aws_account_id into the profile configuration to skip the STS fallback (however, this delves into the library's internal behavior and requires significant effort, so it is not recommended)
Summary
When setting up CMF Workers in a closed VPC, a Worker joining the Fleet in the IDLE state and jobs completing successfully are two separate matters. While the Worker Agent itself bypasses STS for credential issuance, the input synchronization of job attachments falls back to STS through a separate route. When working with a closed VPC, make sure to prepare the STS interface VPC endpoint.
I hope this proves useful for those building Deadline Cloud CMF Workers in a closed VPC.