I tried Zero-ETL integration in a cross-account environment between DynamoDB and Amazon SageMaker Lakehouse

2025.09.09

Introduction

I'm kasama from the Data Business Division. In this article, I'll try out Zero-ETL integration between DynamoDB and Amazon SageMaker Lakehouse in a cross-account environment.

Prerequisites

DynamoDB-To-S3

In this configuration, the DynamoDB table and the Glue Data Catalog live in different accounts: data is replicated via Zero-ETL into S3 on the target side and can then be queried through Athena.

The target for Zero-ETL integration is specified as an AWS Glue Database. In the context of SageMaker Lakehouse, this functions as a "managed catalog with S3 as storage," where data is stored in Apache Iceberg format on S3 and can be accessed via Athena.
https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/lakehouse-how.html
https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/lakehouse-components.html
By default, a Zero-ETL integration is governed by IAM and AWS Glue resource policies, with Lake Formation optional, so we'll stick with the default configuration.
https://docs.aws.amazon.com/glue/latest/dg/zero-etl-limitations.html

We'll follow the configuration methods described in these documents:
https://docs.aws.amazon.com/ja_jp/glue/latest/dg/zero-etl-target.html
https://aws.amazon.com/jp/blogs/aws/new-amazon-dynamodb-zero-etl-integration-with-amazon-sagemaker-lakehouse/

CloudFormation Deployment

For the build-out, we'll use IaC where possible and the AWS CLI for everything else.

First, let's define the source DynamoDB table. For Zero-ETL integration, the AWS Glue service needs to be able to perform the following operations:

  • Read the DynamoDB table structure
  • Export data using the Point-in-Time Recovery feature

These permissions can be granted directly via the DynamoDB table's resource-based policy.
https://docs.aws.amazon.com/glue/latest/dg/zero-etl-sources.html#zero-etl-config-source-dynamodb
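
Incidentally, if the table already existed, the same policy could be attached after the fact with the DynamoDB CLI. A minimal sketch, assuming the policy document is saved as table-resource-policy.json (a hypothetical file mirroring the policy in the template below):

# Sketch: attach the resource-based policy to an existing table
aws dynamodb put-resource-policy \
  --resource-arn arn:aws:dynamodb:ap-northeast-1:<SOURCE_ACCOUNT_ID>:table/cm-kasama-test-transactions \
  --policy file://table-resource-policy.json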

cm-kasama-dynamodb.yml
AWSTemplateFormatVersion: "2010-09-09"
Description: Source account resources for Glue Zero-ETL (DynamoDB table with PITR).

Parameters:
  TableName:
    Type: String
    Default: cm-kasama-test-transactions
    Description: DynamoDB table name (source)

Resources:
  SourceTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: !Ref TableName
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: transaction_id
          AttributeType: S
        - AttributeName: timestamp
          AttributeType: N
      KeySchema:
        - AttributeName: transaction_id
          KeyType: HASH
        - AttributeName: timestamp
          KeyType: RANGE
      PointInTimeRecoverySpecification:
        PointInTimeRecoveryEnabled: true
      ResourcePolicy:
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
            - Sid: AllowGlueZeroETLFromIntegration
              Effect: Allow
              Principal:
                Service: glue.amazonaws.com
              Action:
                - dynamodb:ExportTableToPointInTime
                - dynamodb:DescribeTable
                - dynamodb:DescribeExport
              Resource: "*"
              Condition:
                StringEquals:
                  aws:SourceAccount: !Ref AWS::AccountId
                StringLike:
                  aws:SourceArn: !Sub arn:aws:glue:${AWS::Region}:${AWS::AccountId}:integration:*

I deployed it using CloudFormation.
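
For reference, the deploy command was along these lines (the stack name is arbitrary):

aws cloudformation deploy \
  --region ap-northeast-1 \
  --stack-name cm-kasama-dynamodb \
  --template-file cm-kasama-dynamodb.yml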

Next, let's define the S3 bucket, IAM role, and Glue database on the target side.
The IAM role's policy is configured based on the following document:
https://docs.aws.amazon.com/glue/latest/dg/zero-etl-prerequisites.html#zero-etl-setup-target-resources-target-iam-role

cm-kasama-zero-etl-target.yml
AWSTemplateFormatVersion: "2010-09-09"
Description: Target account resources for Glue Zero-ETL (S3, Glue DB, IAM Role). 
Parameters:
  GlueDatabaseName:
    Type: String
    Default: cm_kasama_zero_etl_db
    Description: Glue database name to receive data
  S3BucketName:
    Type: String
    Description: S3 bucket to store data
  TargetRoleName:
    Type: String
    Description: IAM role used by Glue on target side

Resources:
  DataBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Ref S3BucketName

  TargetRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Ref TargetRoleName
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: glue.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: glue-target-inline
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Sid: GlueCatalogAccess
                Effect: Allow
                Action:
                  - glue:GetDatabase
                  - glue:GetDatabases
                  - glue:GetTable
                  - glue:GetTables
                  - glue:CreateTable
                  - glue:UpdateTable
                  - glue:DeleteTable
                  - glue:CreatePartition
                  - glue:BatchCreatePartition
                  - glue:UpdatePartition
                  - glue:GetPartition
                  - glue:GetPartitions
                Resource:
                  - !Sub arn:aws:glue:${AWS::Region}:${AWS::AccountId}:catalog
                  - !Sub arn:aws:glue:${AWS::Region}:${AWS::AccountId}:database/${GlueDatabaseName}
                  - !Sub arn:aws:glue:${AWS::Region}:${AWS::AccountId}:table/${GlueDatabaseName}/*
              - Sid: S3Write
                Effect: Allow
                Action:
                  - s3:ListBucket
                  - s3:GetBucketLocation
                  - s3:PutObject
                  - s3:GetObject
                  - s3:DeleteObject
                Resource:
                  - !Sub arn:aws:s3:::${S3BucketName}
                  - !Sub arn:aws:s3:::${S3BucketName}/*
              - Sid: LogsAndMetrics
                Effect: Allow
                Action:
                  - cloudwatch:PutMetricData
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: "*"

  GlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref GlueDatabaseName
        Description: Zero-ETL target database
        LocationUri: !Sub s3://${S3BucketName}/

I also deployed this using CloudFormation.
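
This template creates a named IAM role, so the deploy needs CAPABILITY_NAMED_IAM; roughly (stack name arbitrary):

aws cloudformation deploy \
  --region ap-northeast-1 \
  --stack-name cm-kasama-zero-etl-target \
  --template-file cm-kasama-zero-etl-target.yml \
  --parameter-overrides S3BucketName=<S3_BUCKET> TargetRoleName=<IAM_ROLE_NAME> \
  --capabilities CAPABILITY_NAMED_IAM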

Target Side Setup with AWS CLI

Now I'll proceed with the setup using AWS CLI commands in CloudShell.
Let's start with the target side.
Paste the following into CloudShell and run it:

export AWS_REGION=ap-northeast-1

# === Target: run this in the target (Glue) account's CloudShell ===
# TARGET_ACCOUNT_ID is resolved automatically; fill in the source account ID (SOURCE_ACCOUNT_ID) manually
export TARGET_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export SOURCE_ACCOUNT_ID=<SOURCE_AWS_ACCOUNT_ID>

# Resource names (assumes common naming across accounts; change as needed)
export DYNAMODB_TABLE=cm-kasama-test-transactions
export GLUE_DB_NAME=cm_kasama_zero_etl_db
export TARGET_ROLE=<IAM_ROLE_NAME>
export S3_BUCKET=<S3_BUCKET>
export INTEGRATION_NAME=cm-kasama-cross-account-dynamodb-glue

echo "A=${TARGET_ACCOUNT_ID} B=${SOURCE_ACCOUNT_ID} REGION=${AWS_REGION}"

Setting up Glue Resource-based Policy

Since the Glue Data Catalog resource-based policy isn't supported by CloudFormation, we'll configure it with AWS CLI commands.
https://docs.aws.amazon.com/glue/latest/dg/security_iam_resource-based-policy-examples.html


cat > catalog-resource-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowCreateInboundFromSourceAccount",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::${SOURCE_ACCOUNT_ID}:root" },
      "Action": "glue:CreateInboundIntegration",
      "Resource": [
        "arn:aws:glue:${AWS_REGION}:${TARGET_ACCOUNT_ID}:catalog",
        "arn:aws:glue:${AWS_REGION}:${TARGET_ACCOUNT_ID}:database/${GLUE_DB_NAME}"
      ]
    },
    {
      "Sid": "AllowGlueServiceAuthorize",
      "Effect": "Allow",
      "Principal": { "Service": "glue.amazonaws.com" },
      "Action": "glue:AuthorizeInboundIntegration",
      "Resource": [
        "arn:aws:glue:${AWS_REGION}:${TARGET_ACCOUNT_ID}:catalog",
        "arn:aws:glue:${AWS_REGION}:${TARGET_ACCOUNT_ID}:database/${GLUE_DB_NAME}"
      ]
    }
  ]
}
EOF

aws glue put-resource-policy \
  --region "${AWS_REGION}" \
  --policy-in-json file://catalog-resource-policy.json

aws glue get-resource-policy --region "${AWS_REGION}"

AllowCreateInboundFromSourceAccount allows the Zero-ETL integration creator on the data-source side to create an inbound integration against the target Glue resources. We're specifying the account root here; to lock this down further, you could instead specify the IAM user or role that will actually run the aws glue create-integration command on the source side.
AllowGlueServiceAuthorize allows the AWS Glue service principal (glue.amazonaws.com) to authorize the inbound integration on behalf of the target account.
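
For example, narrowing the first statement down to the specific role that will run create-integration would mean swapping the Principal for something like this (the role name is a placeholder):

"Principal": { "AWS": "arn:aws:iam::${SOURCE_ACCOUNT_ID}:role/<INTEGRATION_CREATOR_ROLE>" },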

Setting up create-integration-resource-property

https://docs.aws.amazon.com/cli/latest/reference/glue/create-integration-resource-property.html

aws glue create-integration-resource-property \
  --resource-arn arn:aws:glue:${AWS_REGION}:${TARGET_ACCOUNT_ID}:database/${GLUE_DB_NAME} \
  --target-processing-properties RoleArn=arn:aws:iam::${TARGET_ACCOUNT_ID}:role/${TARGET_ROLE}

aws glue get-integration-resource-property \
  --resource-arn arn:aws:glue:${AWS_REGION}:${TARGET_ACCOUNT_ID}:database/${GLUE_DB_NAME}

The create-integration-resource-property command attaches integration properties to the target resource (here, the Glue database). Registering the IAM role through TargetProcessingProperties is what allows the integration service to write to the target Glue database and the underlying S3 bucket.

I'm setting this up with the AWS CLI because, as of this writing, I couldn't attach the target IAM role when creating the Zero-ETL integration from the source account's management console.

Source Side Setup with AWS CLI

Please paste the following into CloudShell and run it:

export AWS_REGION=ap-northeast-1

# === Source: run this in the source (DynamoDB) account's CloudShell ===
# SOURCE_ACCOUNT_ID is resolved automatically; fill in the target account ID (TARGET_ACCOUNT_ID) manually
export SOURCE_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export TARGET_ACCOUNT_ID=<TARGET_ACCOUNT_ID>   # ← Target (Glue) account ID

# Resource names (assumes common naming across accounts; change as needed)
export DYNAMODB_TABLE=cm-kasama-test-transactions
export GLUE_DB_NAME=cm_kasama_zero_etl_db
export TARGET_ROLE=<IAM_ROLE_NAME>
export S3_BUCKET=<S3_BUCKET>
export INTEGRATION_NAME=cm-kasama-cross-account-dynamodb-glue

echo "TARGET_ACCOUNT_ID=${TARGET_ACCOUNT_ID} SOURCE_ACCOUNT_ID=${SOURCE_ACCOUNT_ID} REGION=${AWS_REGION}"

Inserting DynamoDB Data

Let's insert three test records using the CLI:

cat > seed-items.json <<'EOF'
{
  "cm-kasama-test-transactions": [
    {"PutRequest": {"Item": {
      "transaction_id": {"S": "txn-001"},
      "timestamp": {"N": "1735300800"},
      "user_id": {"S": "user_1"},
      "amount": {"N": "5000"},
      "currency": {"S": "USD"},
      "status": {"S": "completed"},
      "merchant": {"S": "merchant_1"},
      "category": {"S": "shopping"},
      "location": {"M": {"country": {"S": "US"}, "city": {"S": "New York"}}}
    } }},
    {"PutRequest": {"Item": {
      "transaction_id": {"S": "txn-002"},
      "timestamp": {"N": "1735301400"},
      "user_id": {"S": "user_2"},
      "amount": {"N": "3500"},
      "currency": {"S": "EUR"},
      "status": {"S": "pending"},
      "merchant": {"S": "merchant_2"},
      "category": {"S": "food"},
      "location": {"M": {"country": {"S": "UK"}, "city": {"S": "London"}}}
    } }},
    {"PutRequest": {"Item": {
      "transaction_id": {"S": "txn-003"},
      "timestamp": {"N": "1735302000"},
      "user_id": {"S": "user_3"},
      "amount": {"N": "8000"},
      "currency": {"S": "JPY"},
      "status": {"S": "completed"},
      "merchant": {"S": "merchant_3"},
      "category": {"S": "transport"},
      "location": {"M": {"country": {"S": "JP"}, "city": {"S": "Tokyo"}}}
    } }}
  ]
}
EOF

aws dynamodb batch-write-item \
  --region "${AWS_REGION}" \
  --request-items file://seed-items.json
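
A quick count scan confirms the three items landed:

aws dynamodb scan \
  --region "${AWS_REGION}" \
  --table-name "${DYNAMODB_TABLE}" \
  --select COUNT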

Setting up glue create-integration

https://docs.aws.amazon.com/cli/latest/reference/glue/create-integration.html

aws glue create-integration \
  --region "${AWS_REGION}" \
  --integration-name "${INTEGRATION_NAME}" \
  --source-arn arn:aws:dynamodb:${AWS_REGION}:${SOURCE_ACCOUNT_ID}:table/${DYNAMODB_TABLE} \
  --target-arn arn:aws:glue:${AWS_REGION}:${TARGET_ACCOUNT_ID}:database/${GLUE_DB_NAME}

Now we actually create the Zero-ETL integration between DynamoDB and Glue with the create-integration command. We're not passing any detailed configuration here, but you could, for example, change the CDC refresh interval from the default 15 minutes by specifying RefreshInterval in --integration-config.
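
For example, stretching the interval to 60 minutes would look something like this; I haven't verified the accepted values, so treat it as a sketch based on the CLI reference:

aws glue create-integration \
  --region "${AWS_REGION}" \
  --integration-name "${INTEGRATION_NAME}" \
  --source-arn arn:aws:dynamodb:${AWS_REGION}:${SOURCE_ACCOUNT_ID}:table/${DYNAMODB_TABLE} \
  --target-arn arn:aws:glue:${AWS_REGION}:${TARGET_ACCOUNT_ID}:database/${GLUE_DB_NAME} \
  --integration-config '{"RefreshInterval": "60"}'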

Results

I confirmed that the integration was successfully created from the data source account.
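
The same can be checked from the CLI; describe-integrations should list the integration and its status:

aws glue describe-integrations --region "${AWS_REGION}"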

It can also be confirmed from the target account.

Two folders were created in S3.

cm_kasama_test_transactions contains the actual data from the DynamoDB table. zetl_integration_table_state contains the status of the Zero-ETL integration.
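
Listing the bucket root from the target account shows the same two prefixes:

aws s3 ls s3://${S3_BUCKET}/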

When queried with Athena, cm_kasama_test_transactions shows the actual values from the DynamoDB table.
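
If you'd rather stay in the CLI, the query can be kicked off like this (the query-results bucket is a placeholder you'd need to supply):

aws athena start-query-execution \
  --region "${AWS_REGION}" \
  --query-string "SELECT * FROM cm_kasama_test_transactions LIMIT 10" \
  --query-execution-context Database=${GLUE_DB_NAME} \
  --result-configuration OutputLocation=s3://<ATHENA_QUERY_RESULTS_BUCKET>/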

In zetl_integration_table_state, we can see detailed status of the Zero-ETL integration.

Verifying Incremental Updates

Let's also verify incremental updates. As before, we'll manipulate data from the data source account's CloudShell.

export AWS_REGION=ap-northeast-1
export DYNAMODB_TABLE=cm-kasama-test-transactions

Update existing item (refund/amount change for txn-001)

NOW=$(date +%s)
cat > update_txn_001.json <<EOF
{
  "TableName": "${DYNAMODB_TABLE}",
  "Key": { "transaction_id": {"S": "txn-001"}, "timestamp": {"N": "1735300800"} },
  "UpdateExpression": "SET #status = :s, #amount = :a, #updated_at = :u",
  "ExpressionAttributeNames": { "#status": "status", "#amount": "amount", "#updated_at": "updated_at" },
  "ExpressionAttributeValues": { ":s": {"S": "refunded"}, ":a": {"N": "5500"}, ":u": {"N": "${NOW}"} },
  "ReturnValues": "ALL_NEW"
}
EOF
aws dynamodb update-item --region "${AWS_REGION}" --cli-input-json file://update_txn_001.json

Insert new item with arrays (txn-004)

cat > put_txn_004.json <<EOF
{
  "TableName": "${DYNAMODB_TABLE}",
  "Item": {
    "transaction_id": {"S": "txn-004"},
    "timestamp": {"N": "1735302600"},
    "user_id": {"S": "user_1"},
    "amount": {"N": "1200"},
    "currency": {"S": "USD"},
    "status": {"S": "completed"},
    "merchant": {"S": "merchant_4"},
    "category": {"S": "entertainment"},
    "location": {"M": {"country": {"S": "US"}, "city": {"S": "Seattle"}}},
    "payment_methods": {"L": [
      {"S": "credit_card"},
      {"S": "apple_pay"}
    ]},
    "tags": {"SS": ["movie", "imax", "weekend"]}
  }
}
EOF
aws dynamodb put-item --region "${AWS_REGION}" --cli-input-json file://put_txn_004.json

Delete existing item (delete txn-002)

cat > delete_txn_002.json <<EOF
{
  "TableName": "${DYNAMODB_TABLE}",
  "Key": { "transaction_id": {"S": "txn-002"}, "timestamp": {"N": "1735301400"} }
}
EOF
aws dynamodb delete-item --region "${AWS_REGION}" --cli-input-json file://delete_txn_002.json

I confirmed that the data was updated in DynamoDB around 9:44 on 2025/9/9.

I confirmed the updates in CloudWatch and S3 around 9:53 on 2025/9/9.

I also confirmed the updates to the cm_kasama_test_transactions table in Athena. Arrays are stored in their original form.
The zetl_integration_table_state table is also updated.

Conclusion

Since this is a relatively new service, there may be feature additions and specification changes in the future. Please consider this article as a reference as of September 2025.
