[Update] I Actually Tried Amazon SageMaker Data Agent Now That It Can Discover Data from Business Terminology

[Update] I Actually Tried Amazon SageMaker Data Agent Now That It Can Discover Data from Business Terminology

Amazon SageMaker Data Agent can now leverage business context, enabling data discovery from business terminology rather than technical table names. I actually tried it out, so I'll share the verification results.
2026.06.14

This page has been translated by machine translation. View original

This is Ishikawa from the Cloud Business Division. Amazon SageMaker Data Agent is now able to reference business context (glossary terms, metadata, and README) in SageMaker Catalog to discover datasets and generate SQL/Python from business terminology rather than technical table names, so I tried it out.

https://aws.amazon.com/jp/about-aws/whats-new/2026/06/amazon-sagemaker-data-agent-bdc/

SageMaker Data Agent is an interactive agent that enables data exploration, analysis, and visualization using natural language from notebooks and Query Editor in SageMaker Unified Studio. This Data Agent can now reference business context and metadata accumulated in SageMaker Catalog.

For example, data users no longer need to remember technical table names like tbl_cust_001. By simply querying with business terminology such as "What data is available about customer churn?" or "Calculate customer retention rate," the agent identifies the appropriate tables and columns and generates more accurate code. A notable feature is that metadata synchronized from Collibra, Atlan, and Alation can also be leveraged.

What is the SageMaker Data Agent Business Context Integration?

According to the official documentation, when you ask the Data Agent a question, tables are identified through the following process:

  1. Search for accessible tables from technical metadata in AWS Glue Data Catalog or Redshift
  2. Query business context to find assets matching business terminology
  3. Integrate technical metadata and business metadata to identify the correct table
  4. Generate SQL / PySpark referencing the correct catalog, database, table, and columns

The business context queried by the agent consists of the following four elements registered in SageMaker Catalog:

  • Glossary terms
  • Custom metadata forms
  • Asset summary
  • README

The flow can be illustrated as follows:

This integration feature is available in SageMaker Unified Studio notebooks (Data Notebook) and Query Editor. Note that Data Agent is a feature that operates exclusively within the SageMaker Unified Studio UI, and as of the time of writing this article, there is no means to call the agent directly from the CLI or SDK. Therefore, in this article, business metadata is assigned using the AWS CLI, and queries to the agent are performed through the Query Editor interface.

Trying It Out

Prerequisites

  • SageMaker Unified Studio domain (IAM-based domain used this time)
  • Projects and data assets must be published in SageMaker Catalog
  • Verification environment: region ap-northeast-1

For the verification subject, we use the quick_workshop.order_info table prepared for a workshop. This is a Glue View that joins order details (order) with a customer industry master (customer_industry) on customer ID, containing 19 columns of Japanese sales data including customer name, sales, profit, industry, and sector.

Verification Scenario

While this table has columns like sales and profit, there is no column that directly corresponds to the business concept of "LTV (Customer Lifetime Value)." Therefore, we will:

  1. Attach business metadata (glossary tags, summary, README) to order_info,
  2. Query the Data Agent using only the business term "What data can I use for LTV analysis?", and
  3. Verify whether the agent can identify order_info from the term "LTV," which does not exist in any technical column name.

If this identification succeeds, it serves as proof that business context integration is working.

Step 1: Checking the Initial State (Without Business Metadata)

First, let's check the state before any metadata is attached. Since order_info is managed as a DataZone asset on SageMaker Catalog, we use get-asset to check it.

$ aws datazone get-asset \
  --domain-identifier dzd-d3w2uawk8vo7a1 \
  --identifier 4iwqc4p3huask9 \
  --query '{description:description, glossaryTerms:glossaryTerms, forms:formsOutput[].formName}' \
  --output json
{
    "description": null,
    "glossaryTerms": null,
    "forms": [
        "GlueViewForm",
        "SubscriptionTermsForm",
        "DataSourceReferenceForm",
        "AssetCommonDetailsForm"
    ]
}

Both description and glossaryTerms are empty, and the asset only has technical metadata forms. The domain had a glossary called "Customer Metrics" with three registered terms — LTV, Churn Rate, and ARR — but none of them were linked to order_info.

Step 2: Attaching Business Metadata

This is the technical core of this verification. DataZone assets follow an immutable revision model, and there is no API like update-asset. Metadata is attached by creating a new revision with create-asset-revision.

The following three items are attached:

  • --description: Asset description (describing the business purpose)
  • Glossary tag: Link the "LTV" term using --glossary-terms
  • Summary and README: Store in the summary / readMe fields of AssetCommonDetailsForm

The fact that summary and README go into AssetCommonDetailsForm can be confirmed by checking the form type definition. Both summary (up to 5,000 characters) and readMe (up to 10,240 characters) have the @searchable attribute, making them exactly the fields that the Data Agent targets for searching.

One important note: since create-asset-revision creates a new revision, omitting existing forms from --forms-input will cause them to be deleted. Therefore, the request is structured to re-supply the existing four forms (such as GlueViewForm) as-is, while only adding summary and readMe to AssetCommonDetailsForm. The execution was done by saving the content to a JSON file and passing it with --cli-input-json.

The README attached is the following Markdown. Business terms are intentionally included here.

## order_info (Order / Customer Analysis View)

A foundational view for sales and customer analysis, combining order details with customer industry classification.

### Main Use Cases
- Calculating Customer Lifetime Value (LTV): Aggregate by customer name × sales
- Sales and profit analysis by industry/sector
- Visualization of profit margin (profit ÷ sales)

### Related Business Terms
- LTV (Customer Lifetime Value): The customer name × sales columns in this view are the source data for calculation

We run create-asset-revision to create a new version of the asset already in the existing catalog with updated metadata.

$ aws datazone create-asset-revision \
  --region ap-northeast-1 \
  --cli-input-json file://create_asset_revision_request.json
{
    "revision": "2",
    "description": "受注明細(order)と顧客産業マスタ(customer_industry)を顧客IDで結合した分析用ビューです。顧客別の売上・利益・ディスカウント、産業・セクター別の販売動向、顧客生涯価値(LTV)算出の基盤データとして利用します。",
    "glossaryTerms": ["btm8669qf3vrfd"],
    "formsOutput": [ "GlueViewForm", "SubscriptionTermsForm", "DataSourceReferenceForm", "AssetCommonDetailsForm" ]
}

Revision 2 was created, with description, glossaryTerms (LTV), and summary/readMe attached, while the existing four forms were preserved.

Step 3: Verifying Reflection in the Catalog

What the Data Agent searches is the listings that have been published to the catalog. To reflect the created revision 2 in the catalog, we re-publish it.

$ aws datazone create-listing-change-set \
  --domain-identifier dzd-d3w2uawk8vo7a1 \
  --entity-type ASSET --entity-identifier 4iwqc4p3huask9 \
  --entity-revision 2 --action PUBLISH
{
    "listingId": "didt0hra2stxg9",
    "listingRevision": "3"
}

After reflecting the changes, order_info now appears in catalog searches using business terminology.

![20260609-amazon-sagemaker-data-agent-bdc-1](/Users/ishikawa.satoru/workspaces/cc/blog/20260606-amazon-sagemaker-data-agent-bdc-2/img/20260609-amazon-sagemaker-data-agent-bdc-1.gif)$ aws datazone search-listings \
  --domain-identifier dzd-d3w2uawk8vo7a1 \
  --search-text "顧客生涯価値 LTV" \
  --query 'items[].assetListing.{Name:name, GTerms:glossaryTerms[].name}' \
  --output json
[
    {
        "Name": "order_info",
        "GTerms": [ "LTV" ]
    }
]

At this point, we have established a state where order_info can be reached via the business term "LTV."

Step 4: Querying the Data Agent with Business Terminology in Query Editor

Sign in to the SageMaker Unified Studio portal and open the project's "Query Editor." The Data Agent chat panel is displayed on the right side of the screen. We submit the following question. No table name is specified at all — we query using only the business term "LTV."

I want to analyze Customer Lifetime Value (LTV). Which table should I use?
Please tell me why you chose that table, and also provide SQL to estimate LTV.

Select "Athena (SQL)" from the connection selector, and submit the same question again. The agent then explored the Glue catalog → databases → tables, checked the business metadata, and recommended order_info as follows.

Please see the entire flow below.

20260609-amazon-sagemaker-data-agent-bdc-1

Key excerpts from the response:

  1. quick_workshop.order_info (Most recommended)

This table has the "LTV" term tag attached and is designed as the core dataset for customer purchase behavior analysis and sales analysis.

Reasons for selection:

  • Customer ID (顧客id) enables data aggregation per customer
  • Contains sales data (売上), the fundamental metric for LTV
  • Order date (オーダー日付) enables period-based analysis
  • Profit (利益) data is also included, enabling more detailed analysis
  • Industry and sector information is included, enabling segment-based LTV analysis
  • Business metadata explicitly states that this is tailored for LTV analysis

The selection reasons are noteworthy. The first point explicitly cites the "LTV" term tag attached in Step 2, and the second point quotes directly from the attached summary (core dataset for customer purchase behavior analysis and sales analysis). Despite the fact that no column in order_info is named "LTV," it was confirmed that the agent was able to identify the table based on business metadata. This is the effect of business context integration.

Step 5: Running the Generated SQL (A Stumbling Block)

The agent generated a "Customer LTV Analysis" description and SQL, and inserted it into a cell.

Customer Lifetime Value (LTV) Analysis

Using the quick_workshop.order_info table to calculate LTV per customer. This query calculates the following metrics:

  • Total Sales: Cumulative sales per customer
  • Total Profit: Cumulative profit per customer
  • Purchase Count: Number of orders per customer
  • Avg Order Value: Average purchase amount per transaction
  • Customer LTV: Total sales used as the LTV metric
SELECT 
    "顧客id" AS "Customer_ID",
    "顧客名" AS "Customer_Name",
    "産業" AS "Industry",
    "セクター" AS "Sector",
    COUNT(DISTINCT "オーダーid") AS "Purchase_Count",
    SUM("売上") AS "Total_Sales_LTV",
    SUM("利益") AS "Total_Profit",
    AVG("売上") AS "Avg_Order_Value",
    MIN("オーダー日付") AS "First_Purchase_Date",
    MAX("オーダー日付") AS "Last_Purchase_Date"
FROM 
    "quick_workshop"."order_info"
GROUP BY 
    "顧客id",
    "顧客名",
    "産業",
    "セクター"
ORDER BY 
    "Total_Sales_LTV" DESC
LIMIT 100

Press the [Accept and run] button on the right to execute the query.

20260609-amazon-sagemaker-data-agent-bdc-2

The following error occurred. I suspected it was either because a column named no did not exist, or that the necessary permissions had not been granted.

COLUMN_NOT_FOUND: Column 'no' cannot be resolved or requester is not authorized to access requested resources

Initially, I tried fixing this using the [Fix with AI] button and through the chat, but ultimately the root cause of the error was that the SageMaker IAM role AmazonSageMakerAdminIAMExecutionRole_1 had not been granted Lake Formation permissions on the table. After granting the permissions and retrying, it succeeded.

20260609-amazon-sagemaker-data-agent-bdc-3

The actual retrieved data is as follows. (An excerpt of the first few rows.)

Customer_ID Customer_Name Industry Sector Purchase_Count Total_Sales_LTV Total_Profit Avg_Order_Value First_Purchase_Date Last_Purchase_Date
1029 シーユーシー ヘルスケア機器・サービス ヘルスケア 58 55719.21 5953.2038 415.815 2018/1/14 2021/12/24
1009 日産自動車 製造業 資本財・サービス 102 43807.5424 3970.0102 253.2227884 2018/3/3 2021/12/25
1080 りそな 各種金融 金融 102 42904.167 5531.8834 223.4592031 2018/3/17 2021/12/30
1011 ゆうちょ銀行 銀行 金融 65 41255.9471 6449.8554 312.5450538 2018/6/15 2021/12/22
1093 Valero Energy エネルギー エネルギー 58 41220.4188 10308.6297 392.5754171 2018/3/18 2021/12/16
1060 ビックカメラ 小売 一般消費財・サービス 58 40360.163 7937.4915 366.9105727 2018/1/21 2021/12/24
1088 日本包材株式会社 コンシューマ製品 生活必需品 87 40326.344 1690.5775 219.164913 2018/3/21 2021/12/31

We were able to confirm the entire flow from querying with business terminology, through table discovery and SQL generation, to the calculation of actual data.

Analysis

Here is a summary of the insights gained from this verification.

  • Business context integration clearly works: We were able to identify order_info from the business concept "LTV," which does not exist as a column name. Moreover, the agent specifically cited the attached glossary tag and summary as the reasons for its selection, confirming that the matching is not based solely on technical metadata. The design philosophy — where glossary terms and READMEs maintained by data stewards directly improve agent accuracy — was tangibly felt.
  • Investment in metadata preparation pays off: Conversely, if the glossary, summary, and README are left empty, this feature will not show its true value. In this verification, the catalog search before metadata was attached could not reach order_info via business terminology. This behavior is exactly as announced — existing catalog preparation investments can be leveraged directly.
  • A connection must be selected first in Query Editor: To have the Data Agent search the catalog, it was necessary to first select an Athena or Redshift connection.
  • AWS Lake Formation permission grants are required: Even if asset creation and table metadata retrieval succeed, querying will fail with an error without access permissions, so permission grants are required. This cannot be changed from the [Fix with AI] button or through chat.

As a limitation, the Data Agent can only be used through the SageMaker Unified Studio UI (Data Notebook / Query Editor), and direct invocation from CLI / SDK is not available. On the other hand, attaching the business metadata that the agent references can be fully automated and codified using the datazone API as shown in this article. I felt that scripting metadata preparation for multiple tables would make it easier to extend the benefits of this feature across the entire organization.

Conclusion

I tried out the Amazon SageMaker Data Agent business context integration using quick_workshop.order_info as the subject. As a result of attaching glossary tags and a summary, the agent correctly identified order_info — a table with no "LTV" column — from the business terminology query "What data can I use for LTV analysis?" and cited the attached metadata as the basis for its decision.

The ability for business users who don't know technical table names to reach data through well-maintained business terminology is of great value. At the same time, even if asset creation and table metadata retrieval succeed, querying will fail with an error without access permissions, so permission grants are required. Since the role of data consumers and the role of those managing data governance are separate, this could also be considered correct behavior.

Since the metadata attachment itself can be automated with datazone create-asset-revision, why not move forward with building the foundation for self-service analytics leveraging the Data Agent, alongside catalog preparation? I hope this article is helpful to someone.

Share this article

AWSのお困り事はクラスメソッドへ