[AWS CDK] Attempting to retrieve Glue Data Catalog data across accounts using Step Functions and Athena

2025.11.09


Introduction

This is kasama from the Data Business Division.
In this post, I use AWS CDK to build a mechanism in which AWS Step Functions and Athena insert data from a Glue Data Catalog in a separate account (the data source) into a Glue Table in our own account (the target).

Prerequisites

First, I considered two approaches to implement this architecture.

Assume Role Method

glue-db-to-glue-db
In this method, an IAM Role that can access the Catalog and S3 is created in the source account, and Step Functions in the target account assumes that Role to execute queries. A SELECT-only query succeeds, because the Source Role only needs to read from the Source Catalog. An INSERT fails, however: the Source Role would also have to write to the Target Catalog and S3, and it has no permissions there.

glue-db-to-glue-db-with-lambda
To work around this limitation of the Assume Role method, we can use Lambda or Glue Python Shell.
Specifically, the logic inside Lambda first assumes the Source Role to retrieve data from the source account, then switches to the Target Role to insert that data into the target account's table. This avoids the permission issue because each Role only accesses resources in its own account. For this implementation, however, I prioritized simplicity and chose the Glue Data Catalog Resource Policy method. The Lambda approach remains a valid option when more granular control is needed or when a Glue Data Catalog Resource Policy cannot be configured in your environment.

Glue Data Catalog Resource Policy Method

glue-db-to-glue-db-with-resource-policy
In this method, Step Functions in the target account executes Athena queries directly with the Target Role. A Glue Data Catalog Resource Policy is configured in the source account to allow the Target Role to access the Source Catalog. As a result, the Target Role can both write to the Target Catalog and S3 (permissions in its own account) and read from the Source Catalog (permissions granted via the Resource Policy). I chose this method for its implementation simplicity.

This follows the same procedure as the official documentation.
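
The difference between the two methods comes down to which Role runs the query and which policies let it in. The following is a toy model of that reasoning, not how IAM actually evaluates policies: the types, the `isAllowed` and `canInsertFromSource` helpers, and the role names are all hypothetical, and real evaluation has more inputs (explicit denies, S3 bucket policies, and so on).

```typescript
// Toy model only: names and rules here are illustrative assumptions.
type Account = 'source' | 'target';
type Access = 'read' | 'write';

interface Grant {
  account: Account; // account that owns the catalog and data
  access: Access;
}

interface RoleModel {
  roleName: string;
  homeAccount: Account;
  identityGrants: Grant[]; // what the role's own IAM policies allow
}

// Resource policy entries: the owning account lets a named outside role in.
interface ResourceGrant extends Grant {
  roleName: string;
}

function isAllowed(role: RoleModel, needed: Grant, resourceGrants: ResourceGrant[]): boolean {
  const identityOk = role.identityGrants.some(
    (g) => g.account === needed.account && g.access === needed.access
  );
  if (!identityOk) return false;
  // Same account: the identity policy is enough in this model.
  if (role.homeAccount === needed.account) return true;
  // Cross-account: the owning account must also allow the role.
  return resourceGrants.some(
    (g) =>
      g.roleName === role.roleName &&
      g.account === needed.account &&
      g.access === needed.access
  );
}

// INSERT INTO target ... SELECT FROM source needs both checks to pass.
function canInsertFromSource(role: RoleModel, resourceGrants: ResourceGrant[]): boolean {
  return (
    isAllowed(role, { account: 'source', access: 'read' }, resourceGrants) &&
    isAllowed(role, { account: 'target', access: 'write' }, resourceGrants)
  );
}

// Assume Role method: the Source Role can only read its own account.
const sourceRole: RoleModel = {
  roleName: 'source-role',
  homeAccount: 'source',
  identityGrants: [{ account: 'source', access: 'read' }],
};

// Resource Policy method: the Target Role plus a grant from the source account.
const targetRole: RoleModel = {
  roleName: 'target-role',
  homeAccount: 'target',
  identityGrants: [
    { account: 'source', access: 'read' },
    { account: 'target', access: 'write' },
  ],
};

const sourceResourcePolicy: ResourceGrant[] = [
  { roleName: 'target-role', account: 'source', access: 'read' },
];
```

In this model, `canInsertFromSource(sourceRole, sourceResourcePolicy)` is false because the Source Role has no write access in the target account, while `canInsertFromSource(targetRole, sourceResourcePolicy)` is true. Dropping the resource policy entry makes the Target Role fail as well.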

Implementation

The following is the implementation of the Glue Data Catalog Resource Policy method.
The implementation code is stored on GitHub.

Project Structure

$ tree
.
├── cdk
│   ├── bin
│   │   └── app.ts
│   ├── cdk.json
│   ├── lib
│   │   └── cross-account-glue-athena-stack.ts
│   ├── package-lock.json
│   ├── package.json
│   ├── parameter.ts
│   └── tsconfig.json
├── README.md
├── source
│   ├── create-database-and-table.sql
│   ├── glue-resource-policy.json
│   ├── s3.yml
│   └── sample-data
│       └── sales.csv
└── target
    ├── create-database-and-table.sql
    └── insert-query.sql

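The target account's IAM Role, S3 bucket, Athena resources, and state machine are defined with CDK. The source account's S3 bucket is defined with CloudFormation, the Glue Databases and Tables in both accounts are created with SQL, and the source account's Glue Resource Policy is applied from a JSON file.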

Target Account

cdk/bin/app.ts

#!/usr/bin/env node
import * as cdk from 'aws-cdk-lib';
import { CrossAccountGlueAthenaStack } from '../lib/cross-account-glue-athena-stack';
import { devParameter } from '../parameter';

const app = new cdk.App();

// Get sourceAccountId from context (runtime argument)
const sourceAccountId = app.node.tryGetContext('sourceAccountId');

// Validate sourceAccountId
if (!sourceAccountId) {
  throw new Error(
    `sourceAccountId is required. Please provide it via context:
  npx cdk deploy --context sourceAccountId=111111111111`
  );
}

// Stack name using project name
const stackName = `${devParameter.projectName}-stack`;

new CrossAccountGlueAthenaStack(app, stackName, {
  stackName: stackName,
  description:
    'Target Account - Cross-account Glue Data Catalog access with Athena and Step Functions (tag:cross-account-glue)',
  env: {
    account: devParameter.env?.account || process.env.CDK_DEFAULT_ACCOUNT,
    region: devParameter.env?.region || process.env.CDK_DEFAULT_REGION,
  },
  tags: {
    Project: devParameter.projectName,
    Environment: devParameter.envName,
    Repository: 'blog-code-58',
  },
  projectName: devParameter.projectName,
  envName: devParameter.envName,
  parameter: devParameter,
  sourceAccountId,
});

app.ts reads sourceAccountId from the CDK context, stops with an error if it is missing, and then defines the stack.
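
The check in app.ts only tests that a value is present. As a sketch of a stricter variant, the hypothetical `parseAccountId` helper below (not part of the original code) also rejects values that are not 12 digits, so a typo fails before anything is deployed.

```typescript
// Hypothetical helper, not part of the original app.ts.
function parseAccountId(value: unknown): string {
  // Only accept a string of exactly 12 digits.
  if (typeof value !== 'string' || !/^\d{12}$/.test(value)) {
    throw new Error(
      `sourceAccountId must be a 12-digit AWS account ID, got: ${JSON.stringify(value)}`
    );
  }
  return value;
}
```

In app.ts this could wrap the context lookup, for example `parseAccountId(app.node.tryGetContext('sourceAccountId'))`.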

cdk/lib/cross-account-glue-athena-stack.ts

import * as fs from 'node:fs';
import * as path from 'node:path';
import * as cdk from 'aws-cdk-lib';
import * as athena from 'aws-cdk-lib/aws-athena';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import type { Construct } from 'constructs';
import type { AppParameter } from '../parameter';

export interface CrossAccountGlueAthenaStackProps extends cdk.StackProps {
  projectName: string;
  envName: string;
  parameter: AppParameter;
  sourceAccountId: string;
}

export class CrossAccountGlueAthenaStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props: CrossAccountGlueAthenaStackProps) {
    super(scope, id, props);

    const { projectName, envName, sourceAccountId } = props;

    // ========================================
    // IAM Role - Step Functions Execution Role
    // ========================================
    const stepFunctionsExecutionRole = new iam.Role(this, 'StepFunctionsExecutionRole', {
      roleName: `${projectName}-${envName}-sfn-execution-role`,
      assumedBy: new iam.ServicePrincipal('states.amazonaws.com'),
      description: 'Execution role for Step Functions to execute cross-account Athena queries',
    });

    // ========================================
    // Target Account - S3 Bucket for data storage
    // ========================================
    const targetDataBucket = new s3.Bucket(this, 'TargetDataBucket', {
      bucketName: `${projectName}-${envName}-target-data`,
      encryption: s3.BucketEncryption.S3_MANAGED,
      blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
      autoDeleteObjects: true,
    });

    // Grant Step Functions execution role access to target bucket
    targetDataBucket.grantReadWrite(stepFunctionsExecutionRole);

    // Target database name (created manually via SQL)
    const targetDatabaseName = 'cm_kasama_cross_account_target_db';

    // ========================================
    // Target Account - Athena Workgroup with AWS Managed Storage
    // ========================================
    const targetWorkgroup = new athena.CfnWorkGroup(this, 'TargetAthenaWorkgroup', {
      name: `${projectName}-${envName}-target-workgroup`,
      description: 'Athena workgroup with AWS managed storage for target account queries',
      workGroupConfiguration: {
        enforceWorkGroupConfiguration: true,
        publishCloudWatchMetricsEnabled: true,
        engineVersion: {
          selectedEngineVersion: 'AUTO',
        },
        managedQueryResultsConfiguration: {
          enabled: true,
        },
      },
    });

    // Grant Athena permissions to Step Functions execution role
    stepFunctionsExecutionRole.addToPolicy(
      new iam.PolicyStatement({
        sid: 'AllowAthenaQueryExecution',
        effect: iam.Effect.ALLOW,
        actions: [
          'athena:StartQueryExecution',
          'athena:GetQueryExecution',
          'athena:GetQueryResults',
          'athena:StopQueryExecution',
          'athena:GetDataCatalog',
        ],
        resources: [
          `arn:aws:athena:${this.region}:${this.account}:workgroup/${targetWorkgroup.name}`,
          `arn:aws:athena:${this.region}:${this.account}:datacatalog/*`,
        ],
      })
    );

    // Grant Glue permissions to Step Functions execution role
    // - Target Catalog: Read/Write for INSERT query
    // - Source Catalog: Read-only via registered Data Catalog (enabled by Source Account's Resource Policy)
    stepFunctionsExecutionRole.addToPolicy(
      new iam.PolicyStatement({
        sid: 'AllowTargetGlueCatalogAccess',
        effect: iam.Effect.ALLOW,
        actions: [
          'glue:GetDatabase',
          'glue:GetTable',
          'glue:GetPartitions',
          'glue:BatchCreatePartition',
        ],
        resources: [
          `arn:aws:glue:${this.region}:${this.account}:catalog`,
          `arn:aws:glue:${this.region}:${this.account}:database/${targetDatabaseName}`,
          `arn:aws:glue:${this.region}:${this.account}:table/${targetDatabaseName}/*`,
        ],
      })
    );

    stepFunctionsExecutionRole.addToPolicy(
      new iam.PolicyStatement({
        sid: 'AllowSourceGlueCatalogReadAccess',
        effect: iam.Effect.ALLOW,
        actions: ['glue:GetDatabase', 'glue:GetTable', 'glue:GetPartitions'],
        resources: [
          `arn:aws:glue:${this.region}:${sourceAccountId}:catalog`,
          `arn:aws:glue:${this.region}:${sourceAccountId}:database/*`,
          `arn:aws:glue:${this.region}:${sourceAccountId}:table/*/*`,
        ],
      })
    );

    // Grant S3 read access to Source Account bucket
    stepFunctionsExecutionRole.addToPolicy(
      new iam.PolicyStatement({
        sid: 'AllowSourceS3BucketReadAccess',
        effect: iam.Effect.ALLOW,
        actions: ['s3:GetObject', 's3:ListBucket'],
        resources: [
          `arn:aws:s3:::${projectName}-${envName}-data`,
          `arn:aws:s3:::${projectName}-${envName}-data/*`,
        ],
      })
    );

    // ========================================
    // Athena Data Catalog - Register Source Account Catalog
    // ========================================
    const sourceCatalogName = 'source_catalog';
    new athena.CfnDataCatalog(this, 'SourceDataCatalog', {
      name: sourceCatalogName,
      type: 'GLUE',
      description: 'Cross-account Glue Data Catalog from Source Account',
      parameters: {
        'catalog-id': sourceAccountId,
      },
    });

    // ========================================
    // Step Functions - Athena Query State Machine
    // ========================================
    // Load INSERT query from file
    const insertTemplate = fs.readFileSync(
      path.join(__dirname, '../../target/insert-query.sql'),
      'utf-8'
    );
    const insertQuery = insertTemplate
      .replace('${TARGET_DATABASE}', targetDatabaseName)
      .replace('${SOURCE_CATALOG}', sourceCatalogName);

    // Execute INSERT query with .sync integration (waits for completion automatically)
    const executeInsertQuery = new tasks.AthenaStartQueryExecution(this, 'ExecuteInsertQuery', {
      queryString: insertQuery,
      workGroup: targetWorkgroup.name,
      integrationPattern: sfn.IntegrationPattern.RUN_JOB,
      comment: 'Execute cross-account INSERT query and wait for completion',
    });

    // Create state machine
    new sfn.StateMachine(this, 'AthenaInsertStateMachine', {
      stateMachineName: `${projectName}-${envName}-athena-insert`,
      definitionBody: sfn.DefinitionBody.fromChainable(executeInsertQuery),
      role: stepFunctionsExecutionRole,
      timeout: cdk.Duration.minutes(10),
      comment:
        'Cross-account Athena INSERT: Copy data from Source Account Glue Catalog to Target Account table',
    });
  }
}

In cross-account-glue-athena-stack.ts, the following resources are created:

IAM Role for Step Functions execution
Target account S3 bucket (for INSERT destination data storage)
Athena Workgroup
Athena Data Catalog (Source Account Catalog registration)
Step Functions State Machine

The Athena Workgroup sets managedQueryResultsConfiguration.enabled: true, which stores Athena query results in AWS managed storage. Previously, you had to specify an S3 bucket for query results yourself, so this feature simplifies operations. For more details, refer to the following article.

The Step Functions execution role is granted the following permissions:

Read and write permissions to the target account's Glue Catalog
Read-only permissions to the source account's Glue Catalog
Read and write permissions to the target account's S3
Read permissions to the source account's S3

Because the Source Catalog belongs to another account, the IAM Role permissions alone are insufficient. The source account must also explicitly allow access through its Glue Data Catalog Resource Policy.
The AWS::Athena::DataCatalog resource registers the Source Account Catalog in Athena. With the source account ID in the catalog-id parameter, the target account can reference tables across accounts in the format source_catalog.database.table. Alternatively, if the query specifies the catalog by ARN, as in "glue:arn:aws:glue:us-east-1:999999999999:catalog", this registration is unnecessary. This is also mentioned in the official AWS documentation introduced earlier.
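
To make the two reference styles concrete, here is a small sketch that builds both table references as strings. The helper names are hypothetical, and the `aws` partition is hard-coded as an assumption.

```typescript
// Hypothetical helpers for illustration only.

// Reference through a catalog registered with AWS::Athena::DataCatalog.
function registeredCatalogTable(catalogName: string, database: string, table: string): string {
  return `${catalogName}.${database}.${table}`;
}

// Reference by catalog ARN, which needs no registration.
// Assumes the standard "aws" partition.
function arnCatalogTable(
  region: string,
  accountId: string,
  database: string,
  table: string
): string {
  return `"glue:arn:aws:glue:${region}:${accountId}:catalog".${database}.${table}`;
}
```

For the tables in this post, `registeredCatalogTable('source_catalog', 'cm_kasama_cross_account_db', 'sales')` gives the form used in insert-query.sql.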

In the Step Functions State Machine, integrationPattern: sfn.IntegrationPattern.RUN_JOB is specified to automatically wait until the Athena query is completed.
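
For reference, RUN_JOB corresponds to the `.sync` service integration in Amazon States Language. The sketch below builds a simplified task state of that shape. It is an approximation under my own assumptions: the `athenaSyncTask` helper is hypothetical, the partition is hard-coded to `aws`, and the definition CDK actually synthesizes contains more fields.

```typescript
// Simplified shape of an Athena task state; not the exact CDK output.
interface AthenaTaskState {
  Type: 'Task';
  Resource: string;
  Parameters: { QueryString: string; WorkGroup: string };
  End: true;
}

// Hypothetical helper for illustration only.
function athenaSyncTask(queryString: string, workGroup: string): AthenaTaskState {
  return {
    Type: 'Task',
    // The ".sync" suffix makes Step Functions wait until the query finishes.
    Resource: 'arn:aws:states:::athena:startQueryExecution.sync',
    Parameters: { QueryString: queryString, WorkGroup: workGroup },
    End: true,
  };
}
```

Without the `.sync` suffix, the state would finish as soon as the query is started, and waiting for completion would have to be implemented separately.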

parameter.ts

import { Environment } from 'aws-cdk-lib';

export interface AppParameter {
  env?: Environment;
  envName: string;
  projectName: string;
}

export const devParameter: AppParameter = {
  envName: 'dev',
  projectName: 'cm-kasama-cross-account',
  env: {},
};

export const prodParameter: AppParameter = {
  envName: 'prod',
  projectName: 'cm-kasama-cross-account',
  env: {},
};

parameter.ts defines parameters to be used for each environment.
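
app.ts in this post always uses devParameter. If you wanted to switch environments at deploy time, one possible sketch is a lookup by environment name. Everything here is an assumption for illustration: the `selectParameter` helper, the trimmed-down `AppParameterSketch` type, and the idea of passing the name via context.

```typescript
// Trimmed-down copy of AppParameter so the sketch is self-contained.
interface AppParameterSketch {
  envName: string;
  projectName: string;
}

const parametersByEnv = new Map<string, AppParameterSketch>([
  ['dev', { envName: 'dev', projectName: 'cm-kasama-cross-account' }],
  ['prod', { envName: 'prod', projectName: 'cm-kasama-cross-account' }],
]);

// Hypothetical helper: fail loudly on an unknown environment name.
function selectParameter(envName: string): AppParameterSketch {
  const parameter = parametersByEnv.get(envName);
  if (parameter === undefined) {
    const known = Array.from(parametersByEnv.keys()).join(', ');
    throw new Error(`Unknown environment "${envName}". Known environments: ${known}`);
  }
  return parameter;
}
```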

insert-query.sql

-- Insert data from source account to target account
-- Using registered Athena Data Catalog to access cross-account Glue Catalog
INSERT INTO ${TARGET_DATABASE}.sales_copy
SELECT * FROM ${SOURCE_CATALOG}.cm_kasama_cross_account_db.sales;

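This defines the INSERT statement that Athena executes. The CDK stack replaces the ${TARGET_DATABASE} and ${SOURCE_CATALOG} placeholders when it loads the file.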
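
The stack substitutes the placeholders with two chained replace calls, each of which replaces only the first occurrence. As a sketch of a more defensive variant, the hypothetical `renderSql` helper below replaces every occurrence and fails if a placeholder is left unresolved. It is not part of the original code.

```typescript
// Hypothetical helper, not part of the original stack.
function renderSql(template: string, values: Record<string, string>): string {
  let sql = template;
  for (const [key, value] of Object.entries(values)) {
    // split/join replaces every occurrence and inserts the value literally.
    sql = sql.split('${' + key + '}').join(value);
  }
  // Catch typos such as a placeholder with no matching value.
  const leftover = sql.match(/\$\{[A-Z_]+\}/);
  if (leftover) {
    throw new Error(`Unresolved placeholder in SQL template: ${leftover[0]}`);
  }
  return sql;
}
```

Called with `{ TARGET_DATABASE: 'cm_kasama_cross_account_target_db', SOURCE_CATALOG: 'source_catalog' }`, it produces the same query string that the stack passes to Athena.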

Source Account

source/s3.yml

AWSTemplateFormatVersion: "2010-09-09"
Description: "S3 Bucket for data storage"

Parameters:
  ProjectName:
    Type: String
    Default: cm-kasama-cross-account
    Description: Project name for resource naming
  EnvName:
    Type: String
    Default: dev
    Description: Environment name (dev, prod, etc.)
  TargetAccountId:
    Type: String
    Description: Target Account ID for cross-account access

Resources:
  DataBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub "${ProjectName}-${EnvName}-data"
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256

  DataBucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket: !Ref DataBucket
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Sid: AllowTargetAccountRead
            Effect: Allow
            Principal:
              AWS: !Sub "arn:aws:iam::${TargetAccountId}:role/${ProjectName}-${EnvName}-sfn-execution-role"
            Action:
              - s3:GetObject
              - s3:ListBucket
            Resource:
              - !GetAtt DataBucket.Arn
              - !Sub "${DataBucket.Arn}/*"

This creates an S3 bucket and Bucket Policy in the source account. It takes TargetAccountId as a parameter and grants s3:GetObject and s3:ListBucket permissions to the Step Functions execution role in the target account.

source/sample-data/sales.csv

id,product,amount,date
1,ProductA,1000,2025-01-01
2,ProductB,2000,2025-01-02
3,ProductC,1500,2025-01-03
4,ProductA,1200,2025-01-04
5,ProductD,3000,2025-01-05
6,ProductB,1800,2025-01-06
7,ProductC,2200,2025-01-07
8,ProductE,2500,2025-01-08
9,ProductA,1100,2025-01-09
10,ProductD,2800,2025-01-10

This is sample data for testing.

source/glue-resource-policy.json

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<TARGET_ACCOUNT_ID>:role/cm-kasama-cross-account-dev-sfn-execution-role"
      },
      "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions"],
      "Resource": [
        "arn:aws:glue:ap-northeast-1:<SOURCE_ACCOUNT_ID>:catalog",
        "arn:aws:glue:ap-northeast-1:<SOURCE_ACCOUNT_ID>:database/*",
        "arn:aws:glue:ap-northeast-1:<SOURCE_ACCOUNT_ID>:table/*/*"
      ]
    }
  ]
}

This Resource Policy grants the Step Functions execution role in the target account read-only permissions (glue:GetDatabase, glue:GetTable, glue:GetPartitions) on the source account's Glue Catalog. It can be applied with the AWS CLI or from the AWS console.
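
Instead of editing the placeholders by hand, the policy document could also be generated. The sketch below assumes a hypothetical `buildGlueResourcePolicy` helper and hard-codes the `aws` partition. The role name must match the roleName set in the CDK stack.

```typescript
interface GluePolicyInput {
  targetAccountId: string;
  sourceAccountId: string;
  region: string;
  roleName: string; // must match the role created by the CDK stack
}

interface GluePolicyStatement {
  Effect: 'Allow';
  Principal: { AWS: string };
  Action: string[];
  Resource: string[];
}

interface GluePolicyDocument {
  Version: '2012-10-17';
  Statement: GluePolicyStatement[];
}

// Hypothetical helper that mirrors source/glue-resource-policy.json.
function buildGlueResourcePolicy(input: GluePolicyInput): GluePolicyDocument {
  const glueArnPrefix = `arn:aws:glue:${input.region}:${input.sourceAccountId}`;
  return {
    Version: '2012-10-17',
    Statement: [
      {
        Effect: 'Allow',
        Principal: { AWS: `arn:aws:iam::${input.targetAccountId}:role/${input.roleName}` },
        // Read-only catalog actions, same as the JSON file.
        Action: ['glue:GetDatabase', 'glue:GetTable', 'glue:GetPartitions'],
        Resource: [
          `${glueArnPrefix}:catalog`,
          `${glueArnPrefix}:database/*`,
          `${glueArnPrefix}:table/*/*`,
        ],
      },
    ],
  };
}
```

`JSON.stringify(buildGlueResourcePolicy(input), null, 2)` yields text that can be saved as glue-resource-policy.json.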

Deployment

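The source account's S3 Bucket Policy and Glue Resource Policy both name the Step Functions execution role as their principal, so that IAM Role has to exist before they are applied. We therefore deploy to the target account first.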

Target Account CDK Deployment

cd cdk

# Install dependencies
npm install

# CDK Deploy (specify source account ID)
npx cdk deploy \
  --context sourceAccountId=<SOURCE_ACCOUNT_ID> \
  --require-approval never \
  --profile <TARGET_ACCOUNT_PROFILE>

After execution, the CloudFormation stack is created.

Source Account S3 Bucket Creation

cd source

aws cloudformation create-stack \
  --stack-name cm-kasama-cross-account-s3 \
  --template-body file://s3.yml \
  --parameters \
    ParameterKey=TargetAccountId,ParameterValue=<TARGET_ACCOUNT_ID> \
    ParameterKey=EnvName,ParameterValue=dev \
  --profile <SOURCE_ACCOUNT_PROFILE>

Upload Sample Data to Source Account

aws s3 cp sample-data/sales.csv s3://<SOURCE_BUCKET_NAME>/data/sales.csv \
  --profile <SOURCE_ACCOUNT_PROFILE>

Create Glue Database/Table in Source Account

Execute SQL in the source account's Athena console.

source/create-database-and-table.sql

CREATE DATABASE IF NOT EXISTS cm_kasama_cross_account_db
COMMENT 'Database for cross-account access testing'
LOCATION 's3://<SOURCE_BUCKET_NAME>/data/';

CREATE EXTERNAL TABLE IF NOT EXISTS cm_kasama_cross_account_db.sales (
  id INT COMMENT 'Sales ID',
  product STRING COMMENT 'Product name',
  amount INT COMMENT 'Sales amount',
  date STRING COMMENT 'Sales date'
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://<SOURCE_BUCKET_NAME>/data/'
TBLPROPERTIES (
  'skip.header.line.count' = '1',
  'classification' = 'csv'
);

You can verify the data with a SELECT query.

Create Glue Database/Table in Target Account

Execute SQL in the target account's Athena console.

target/create-database-and-table.sql

CREATE DATABASE IF NOT EXISTS cm_kasama_cross_account_target_db
COMMENT 'Target database for cross-account data copy'
LOCATION 's3://<TARGET_BUCKET_NAME>/';

CREATE EXTERNAL TABLE IF NOT EXISTS cm_kasama_cross_account_target_db.sales_copy (
  id INT COMMENT 'Sales ID',
  product STRING COMMENT 'Product name',
  amount INT COMMENT 'Sales amount',
  date STRING COMMENT 'Sales date (YYYY-MM-DD)'
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://<TARGET_BUCKET_NAME>/sales_copy/'
TBLPROPERTIES (
  'classification' = 'csv'
);

Configure Glue Data Catalog Resource Policy in Source Account

Replace the placeholders in source/glue-resource-policy.json with actual values and run the following command:

cd source

aws glue put-resource-policy \
  --policy-in-json file://glue-resource-policy.json \
  --enable-hybrid TRUE \
  --profile <SOURCE_ACCOUNT_PROFILE>

Alternatively, you can set this up through the AWS Glue console by navigating to "Data catalog" → "Catalog settings" → "Permissions" and configuring the content of source/glue-resource-policy.json.

Post-Deployment Verification

I started the state machine from the AWS console, and the execution completed successfully in about a minute.

I confirmed that data was successfully inserted using the following query:

SELECT * FROM cm_kasama_cross_account_target_db.sales_copy;


We can also confirm that the copied data files are stored in the target account's S3 bucket.

Conclusion

Through this verification, I gained a better understanding of the approaches to cross-account access for Glue Data Catalogs and the characteristics of each. I hope this helps anyone working on a similar problem.