[Update] I Actually Tried Amazon SageMaker Data Agent Now That It Can Discover Data from Business Terminology

[Update] I Actually Tried Amazon SageMaker Data Agent Now That It Can Discover Data from Business Terminology

Amazon SageMaker Data Agent can now leverage business context, enabling data discovery from business terminology rather than technical table names. I actually tried it out, so I'll share the verification results.
2026.06.14

This page has been translated by machine translation. View original

This is Ishikawa from the Cloud Business Division. Amazon SageMaker Data Agent can now reference business context (glossary terms, metadata, and README) from SageMaker Catalog to discover datasets and generate SQL/Python from business terminology rather than technical table names, so I tried it out.

https://aws.amazon.com/jp/about-aws/whats-new/2026/06/amazon-sagemaker-data-agent-bdc/

SageMaker Data Agent is an interactive agent that enables data exploration, analysis, and visualization using natural language from notebooks and the Query Editor in SageMaker Unified Studio. This Data Agent can now reference business context and metadata accumulated in SageMaker Catalog.

For example, even if data users don't remember technical table names like tbl_cust_001, they can simply ask in business terms such as "What data is available about customer churn?" or "Calculate customer retention rate," and the agent will identify the appropriate tables and columns to generate more accurate code. A notable feature is the ability to leverage metadata synchronized from Collibra, Atlan, and Alation.

What is SageMaker Data Agent Business Context Integration?

According to the official documentation, when you ask the Data Agent a question, tables are identified through the following process:

  1. Search for accessible tables from technical metadata in AWS Glue Data Catalog or Redshift
  2. Query business context to find assets matching business terminology
  3. Integrate technical metadata and business metadata to identify the correct table
  4. Generate SQL/PySpark referencing the correct catalog, database, table, and columns

The business context queried by the agent consists of the following four elements registered in SageMaker Catalog:

  • Glossary terms
  • Custom metadata forms
  • Asset summaries
  • README

This flow can be illustrated as follows:

This integration feature is available in SageMaker Unified Studio notebooks (Data Notebook) and the Query Editor. Note that Data Agent is a feature that operates within the SageMaker Unified Studio UI, and at the time of writing this article, there is no way to directly call the agent from the CLI or SDK. Therefore, in this article, business metadata is added using the AWS CLI, and queries to the agent are performed through the Query Editor interface.

Trying It Out

Prerequisites

  • SageMaker Unified Studio domain (IAM-based domain used this time)
  • Project and data assets already published in SageMaker Catalog
  • Verification environment: Region ap-northeast-1

For the verification subject, I used the quick_workshop.order_info table prepared for a workshop. This is a Glue View that joins order details (order) with a customer industry master (customer_industry) on customer ID, and contains 19 columns of Japanese sales data including customer name, sales, profit, industry, and sector.

Verification Scenario

While this table has columns such as sales and profit, there is no column that directly corresponds to the business concept of "LTV (customer lifetime value)." Therefore, the verification steps are:

  1. Attach business metadata (glossary tags, summary, README) to order_info
  2. Query the Data Agent with the business term "What data is available for LTV analysis?"
  3. Confirm whether the agent can identify order_info from the term "LTV," which does not exist in the technical column names

If it can identify this, it serves as proof that business context is working.

Step 1: Confirming the Initial State (Without Business Metadata)

First, let's confirm the state before adding metadata. Since order_info is managed as a DataZone asset on SageMaker Catalog, we check it with get-asset.

$ aws datazone get-asset \
  --domain-identifier dzd-d3w2uawk8vo7a1 \
  --identifier 4iwqc4p3huask9 \
  --query '{description:description, glossaryTerms:glossaryTerms, forms:formsOutput[].formName}' \
  --output json
{
    "description": null,
    "glossaryTerms": null,
    "forms": [
        "GlueViewForm",
        "SubscriptionTermsForm",
        "DataSourceReferenceForm",
        "AssetCommonDetailsForm"
    ]
}

Both description and glossaryTerms are empty, and the asset only has technical metadata forms. The domain had a glossary called "Customer Metrics" with three terms registered — LTV, Churn Rate, and ARR — but none were linked to order_info.

Step 2: Adding Business Metadata

This is the technical core of this verification. DataZone assets follow an immutable revision model, and there is no API like update-asset. Adding metadata is done by creating a new revision with create-asset-revision.

The following three items are added:

  • --description: Asset description (describing business purpose)
  • Glossary tags: Link the "LTV" term via --glossary-terms
  • Summary and README: Stored in the summary / readMe fields of AssetCommonDetailsForm

The fact that summary and README go into AssetCommonDetailsForm can be confirmed by checking the form type definition. Both summary (up to 5,000 characters) and readMe (up to 10,240 characters) have the @searchable attribute, making them exactly the fields that the Data Agent targets for search.

One important note: since create-asset-revision creates a new revision, omitting existing forms from --forms-input will delete them. Therefore, the request is structured to re-supply the existing four forms (such as GlueViewForm) while adding only summary and readMe to AssetCommonDetailsForm. The execution was done by organizing the content into a JSON file and passing it via --cli-input-json.

The added README is the following Markdown. Business terminology is intentionally included here:

## order_info (Order/Customer Analysis View)

A foundational view for sales and customer analysis, combining order details with customer industry classification.

### Main Use Cases
- Calculating Customer Lifetime Value (LTV): Aggregate customer name × sales
- Sales and profit analysis by industry/sector
- Visualization of profit margin (profit ÷ sales)

### Related Business Terms
- LTV (Customer Lifetime Value): The customer name × sales in this view is the source data for calculation

Run create-asset-revision to create a new version with updated metadata for the asset that exists in the current catalog.

create_asset_revision_request.json (Click to expand)
{
  "domainIdentifier": "dzd-d3w2uawk8vo7a1",
  "identifier": "4iwqc4p3huask9",
  "name": "order_info",
  "description": "An analysis view joining order details (order) with the customer industry master (customer_industry) on customer ID. Used as the base data for per-customer sales, profit, and discount analysis, industry/sector sales trends, and customer lifetime value (LTV) calculation.",
  "glossaryTerms": [
    "btm8669qf3vrfd"
  ],
  "formsInput": [
    {
      "formName": "GlueViewForm",
      "typeIdentifier": "amazon.datazone.GlueViewFormType",
      "typeRevision": "12",
      "content": "{\"catalogId\":\"<AWS-ACCOUNT_ID>\",\"databaseName\":\"quick_workshop\",\"columns\":[{\"dataType\":\"int\",\"columnName\":\"no\"},{\"dataType\":\"string\",\"columnName\":\"オーダーid\"},{\"dataType\":\"string\",\"columnName\":\"オーダー日付\"},{\"dataType\":\"int\",\"columnName\":\"日付キー\"},{\"dataType\":\"string\",\"columnName\":\"都道府県\"},{\"dataType\":\"string\",\"columnName\":\"\"},{\"dataType\":\"string\",\"columnName\":\"地域\"},{\"dataType\":\"int\",\"columnName\":\"郵便番号\"},{\"dataType\":\"string\",\"columnName\":\"連絡先\"},{\"dataType\":\"string\",\"columnName\":\"顧客名\"},{\"dataType\":\"int\",\"columnName\":\"顧客id\"},{\"dataType\":\"string\",\"columnName\":\"プロダクト\"},{\"dataType\":\"string\",\"columnName\":\"ライセンス\"},{\"dataType\":\"double\",\"columnName\":\"売上\"},{\"dataType\":\"int\",\"columnName\":\"品数\"},{\"dataType\":\"double\",\"columnName\":\"ディスカウント\"},{\"dataType\":\"double\",\"columnName\":\"利益\"},{\"dataType\":\"string\",\"columnName\":\"産業\"},{\"dataType\":\"string\",\"columnName\":\"セクター\"}],\"query\":\"{\\\"originalSql\\\":\\\"SELECT\\\\n  o.\\\\\\\"no\\\\\\\"\\\\n, o.\\\\\\\"オーダーid\\\\\\\"\\\\n, o.\\\\\\\"オーダー日付\\\\\\\"\\\\n, o.\\\\\\\"日付キー\\\\\\\"\\\\n, o.\\\\\\\"都道府県\\\\\\\"\\\\n, o.\\\\\\\"\\\\\\\"\\\\n, o.\\\\\\\"地域\\\\\\\"\\\\n, o.\\\\\\\"郵便番号\\\\\\\"\\\\n, o.\\\\\\\"連絡先\\\\\\\"\\\\n, o.\\\\\\\"顧客名\\\\\\\"\\\\n, o.\\\\\\\"顧客id\\\\\\\"\\\\n, o.\\\\\\\"プロダクト\\\\\\\"\\\\n, o.\\\\\\\"ライセンス\\\\\\\"\\\\n, o.\\\\\\\"売上\\\\\\\"\\\\n, o.\\\\\\\"品数\\\\\\\"\\\\n, o.\\\\\\\"ディスカウント\\\\\\\"\\\\n, o.\\\\\\\"利益\\\\\\\"\\\\n, ci.\\\\\\\"産業\\\\\\\"\\\\n, ci.\\\\\\\"セクター\\\\\\\"\\\\nFROM\\\\n  (\\\\\\\"quick_workshop\\\\\\\".\\\\\\\"order\\\\\\\" o\\\\nINNER JOIN \\\\\\\"quick_workshop\\\\\\\".\\\\\\\"customer_industry\\\\\\\" ci ON (o.\\\\\\\"顧客id\\\\\\\" = ci.\\\\\\\"顧客id\\\\\\\"))\\\\n\\\",\\\"catalog\\\":\\\"awsdatacatalog\\\",\\\"schema\\\":\\\"quick_workshop\\\",\\\"columns\\\":[{\\\"name\\\":\\\"no\\\",\\\"type\\\":\\\"integer\\\"},{\\\"name\\\":\\\"オーダーid\\\",\\\"type\\\":\\\"varchar\\\"},{\\\"name\\\":\\\"オーダー日付\\\",\\\"type\\\":\\\"varchar\\\"},{\\\"name\\\":\\\"日付キー\\\",\\\"type\\\":\\\"integer\\\"},{\\\"name\\\":\\\"都道府県\\\",\\\"type\\\":\\\"varchar\\\"},{\\\"name\\\":\\\"\\\",\\\"type\\\":\\\"varchar\\\"},{\\\"name\\\":\\\"地域\\\",\\\"type\\\":\\\"varchar\\\"},{\\\"name\\\":\\\"郵便番号\\\",\\\"type\\\":\\\"integer\\\"},{\\\"name\\\":\\\"連絡先\\\",\\\"type\\\":\\\"varchar\\\"},{\\\"name\\\":\\\"顧客名\\\",\\\"type\\\":\\\"varchar\\\"},{\\\"name\\\":\\\"顧客id\\\",\\\"type\\\":\\\"integer\\\"},{\\\"name\\\":\\\"プロダクト\\\",\\\"type\\\":\\\"varchar\\\"},{\\\"name\\\":\\\"ライセンス\\\",\\\"type\\\":\\\"varchar\\\"},{\\\"name\\\":\\\"売上\\\",\\\"type\\\":\\\"double\\\"},{\\\"name\\\":\\\"品数\\\",\\\"type\\\":\\\"integer\\\"},{\\\"name\\\":\\\"ディスカウント\\\",\\\"type\\\":\\\"double\\\"},{\\\"name\\\":\\\"利益\\\",\\\"type\\\":\\\"double\\\"},{\\\"name\\\":\\\"産業\\\",\\\"type\\\":\\\"varchar\\\"},{\\\"name\\\":\\\"セクター\\\",\\\"type\\\":\\\"varchar\\\"}],\\\"owner\\\":\\\"<AWS-ACCOUNT_ID>\\\",\\\"runAsInvoker\\\":false,\\\"properties\\\":{},\\\"isProtected\\\":false,\\\"isMultiDialect\\\":false}\",\"region\":\"ap-northeast-1\",\"tableArn\":\"arn:aws:glue:ap-northeast-1:<AWS-ACCOUNT_ID>:table/quick_workshop/order_info\",\"tableName\":\"order_info\"}"
    },
    {
      "formName": "SubscriptionTermsForm",
      "typeIdentifier": "amazon.datazone.SubscriptionTermsFormType",
      "typeRevision": "2",
      "content": "{\"approvalRequired\":\"YES\"}"
    },
    {
      "formName": "DataSourceReferenceForm",
      "typeIdentifier": "amazon.datazone.DataSourceReferenceFormType",
      "typeRevision": "5",
      "content": "{\"dataSourceIdentifier\":{\"id\":\"6mguy6thu4gtjt\",\"version\":\"latest\"},\"filterableDataSourceId\":\"6mguy6thu4gtjt\",\"dataSourceType\":\"GLUE\"}"
    },
    {
      "formName": "AssetCommonDetailsForm",
      "typeIdentifier": "amazon.datazone.AssetCommonDetailsFormType",
      "typeRevision": "8",
      "content": "{\"sourceIdentifier\": \"arn:aws:glue:ap-northeast-1:<AWS-ACCOUNT_ID>:table/quick_workshop/order_info\", \"summary\": \"This is the core dataset for customer purchasing behavior analysis and sales analysis. It holds customer name, product, sales, quantity, discount, and profit for each order, with customer industry/sector attributes already joined. It can be used for customer lifetime value (LTV), industry sales rankings, profit margin analysis, and customer segment analysis.\", \"readMe\": \"## order_info (Order/Customer Analysis View)\\n\\nA foundational view for sales and customer analysis, combining order details with customer industry classification.\\n\\n### Main Use Cases\\n- Calculating Customer Lifetime Value (LTV): Aggregate customer name × sales\\n- Sales and profit analysis by industry/sector\\n- Visualization of profit margin (profit ÷ sales)\\n- Analysis of discount impact on profit\\n\\n### Key Columns\\n- Customer Name / Customer ID: Customer identification\\n- Sales / Profit / Discount: Monetary metrics\\n- Industry / Sector: Customer industry classification\\n- Product / License: Sold products\\n\\n### Related Business Terms\\n- LTV (Customer Lifetime Value): The customer name × sales in this view is the source data for calculation\"}"
    }
  ],
  "clientToken": "blog-writer-20260609-cff3d144-enrich-001"
}
% aws datazone create-asset-revision \
  --region ap-northeast-1 \
  --cli-input-json file://create_asset_revision_request.json \
  --query '{revision:revision, description:description, glossaryTerms:glossaryTerms, formsOutput:formsOutput[].formName}' \
  --output json
{
    "revision": "2",
    "description": "An analysis view joining order details (order) with the customer industry master (customer_industry) on customer ID. Used as the base data for per-customer sales, profit, and discount analysis, industry/sector sales trends, and customer lifetime value (LTV) calculation.",
    "glossaryTerms": [
        "btm8669qf3vrfd"
    ],
    "formsOutput": [
        "GlueViewForm",
        "SubscriptionTermsForm",
        "DataSourceReferenceForm",
        "AssetCommonDetailsForm"
    ]
}

Revision 2 was created, with description, glossaryTerms (LTV), and summary/readMe added, while the existing four forms were preserved.

Step 3: Confirming Reflection in the Catalog

What the Data Agent searches is the published listings in the catalog. To reflect the created revision 2 in the catalog, re-publish it.

$ aws datazone create-listing-change-set \
  --domain-identifier dzd-d3w2uawk8vo7a1 \
  --entity-type ASSET --entity-identifier 4iwqc4p3huask9 \
  --entity-revision 2 --action PUBLISH
{
    "listingId": "didt0hra2stxg9",
    "listingRevision": "3"
}

After reflecting the changes, searching the catalog with business terminology now returns order_info as a hit.

$ aws datazone search-listings \
  --domain-identifier dzd-d3w2uawk8vo7a1 \
  --search-text "Customer Lifetime Value LTV" \
  --query 'items[].assetListing.{Name:name, GTerms:glossaryTerms[].name}' \
  --output json
[
    {
        "Name": "order_info",
        "GTerms": [ "LTV" ]
    }
]

At this point, we have established a state where order_info can be reached from the business term "LTV."

Step 4: Querying the Data Agent with Business Terminology in the Query Editor

Sign in to the SageMaker Unified Studio portal and open the "Query Editor" for the project. The Data Agent chat panel is displayed on the right side of the screen. Submit the following question. No table name is specified at all — the query uses only the business term "LTV."

I want to analyze Customer Lifetime Value (LTV). Which table should I use?
Please also explain why you chose that table and provide SQL to approximate LTV.

Select "Athena (SQL)" from the connection selector and send the same question again. The agent then explores from the Glue catalog → database → tables, checks the business metadata, and recommends order_info as follows:

Please see the entire flow below.

20260609-amazon-sagemaker-data-agent-bdc-1

Key points from the response are quoted below:

  1. quick_workshop.order_info (Most recommended)

This table has the "LTV" term tag applied and is designed as the core dataset for customer purchasing behavior analysis and sales analysis.

Reasons for selection:

  • Customer ID (顧客id) enables data aggregation per customer
  • Contains sales data (売上), the basic LTV metric
  • Order date (オーダー日付) enables period analysis
  • Also contains profit (利益) data for more detailed analysis
  • Includes industry and sector information, enabling segment-based LTV analysis
  • Business metadata explicitly states it is dedicated to LTV analysis

What is noteworthy is the reason for selection. The first point explicitly cites the "LTV" term tag added in Step 2, and the second point directly quotes the added summary (core dataset for customer purchasing behavior and sales analysis). Despite the absence of a column named "LTV" in order_info, we were able to confirm that the table was identified based on business metadata. This is the effect of business context integration.

Step 5: Running the Generated SQL (A Stumbling Point)

The agent generated a "Customer LTV Analysis" description and its SQL, and inserted it into a cell.

Customer Lifetime Value (LTV) Analysis

Calculate LTV per customer using the quick_workshop.order_info table. This query calculates the following metrics:

  • Total Sales: Cumulative sales per customer
  • Total Profit: Cumulative profit per customer
  • Purchase Count: Number of orders per customer
  • Average Order Value: Average purchase amount per order
  • Customer LTV: Total sales used as the LTV metric
SELECT 
    "顧客id" AS "Customer_ID",
    "顧客名" AS "Customer_Name",
    "産業" AS "Industry",
    "セクター" AS "Sector",
    COUNT(DISTINCT "オーダーid") AS "Purchase_Count",
    SUM("売上") AS "Total_Sales_LTV",
    SUM("利益") AS "Total_Profit",
    AVG("売上") AS "Avg_Order_Value",
    MIN("オーダー日付") AS "First_Purchase_Date",
    MAX("オーダー日付") AS "Last_Purchase_Date"
FROM 
    "quick_workshop"."order_info"
GROUP BY 
    "顧客id",
    "顧客名",
    "産業",
    "セクター"
ORDER BY 
    "Total_Sales_LTV" DESC
LIMIT 100

Press the [Accept and run] button on the right to execute the query.

20260609-amazon-sagemaker-data-agent-bdc-2

The following error occurred. The suspected cause was either a missing column called no, or that permissions had not been granted.

COLUMN_NOT_FOUND: Column 'no' cannot be resolved or requester is not authorized to access requested resources

Initially, I tried to fix the issue using the [Fix with AI] button and chat, but ultimately the root cause was that Lake Formation permissions for the table had not been granted to the SageMaker IAM role AmazonSageMakerAdminIAMExecutionRole_1. After granting the permissions and retrying, it succeeded.

20260609-amazon-sagemaker-data-agent-bdc-3

The actual data retrieved is as follows (excerpt of the first few rows):

Customer_ID Customer_Name Industry Sector Purchase_Count Total_Sales_LTV Total_Profit Avg_Order_Value First_Purchase_Date Last_Purchase_Date
1029 シーユーシー Healthcare Equipment & Services Healthcare 58 55719.21 5953.2038 415.815 2018/1/14 2021/12/24
1009 日産自動車 Manufacturing Capital Goods & Services 102 43807.5424 3970.0102 253.2227884 2018/3/3 2021/12/25
1080 りそな Diversified Financials Financials 102 42904.167 5531.8834 223.4592031 2018/3/17 2021/12/30
1011 ゆうちょ銀行 Banks Financials 65 41255.9471 6449.8554 312.5450538 2018/6/15 2021/12/22
1093 Valero Energy Energy Energy 58 41220.4188 10308.6297 392.5754171 2018/3/18 2021/12/16
1060 ビックカメラ Retail Consumer Discretionary 58 40360.163 7937.4915 366.9105727 2018/1/21 2021/12/24
1088 日本包材株式会社 Consumer Products Consumer Staples 87 40326.344 1690.5775 219.164913 2018/3/21 2021/12/31

We were able to confirm the entire flow from querying with business terminology to table discovery, SQL generation, and actual data calculation.

Discussion

Here is a summary of the findings obtained from this verification.

  • Business context integration clearly works: We were able to identify order_info from the business concept "LTV," which does not exist as a column name. Moreover, the agent specifically cited the glossary tags and summary we added as the reasons for selection, confirming that the matching is not based solely on technical metadata. The design philosophy — where glossaries and READMEs maintained by data stewards directly improve agent accuracy — was palpable.
  • Investment in metadata maintenance pays off: Conversely, if glossary terms, summaries, and README are left empty, the true value of this feature will not be realized. In this verification, before adding the metadata, catalog searches using business terminology could not reach order_info. This behavior matched exactly with the announcement that existing investments in catalog maintenance can be leveraged directly.
  • A connection must be selected first in Query Editor: To have the Data Agent search the catalog, it was necessary to first select an Athena or Redshift connection.
  • AWS Lake Formation permission grants are required: Even if asset creation and table metadata retrieval are possible, queries will fail with an error without access permissions, so permission grants are required. This cannot be changed via the [Fix with AI] button or chat.

As a limitation, the Data Agent can only be used in the SageMaker Unified Studio UI (Data Notebook / Query Editor), and direct invocation from CLI/SDK is not available. On the other hand, since the business metadata that the agent references can be fully automated and codified using the datazone API as shown in this article, scripting metadata maintenance for multiple tables makes it easier to extend the benefits of this feature across the entire organization.

Closing

I tried out Amazon SageMaker Data Agent's business context integration using quick_workshop.order_info as the subject. As a result of adding glossary tags and a summary, the agent correctly identified order_info — which has no column named LTV — from the business-terminology question "What data is available for LTV analysis?" and cited the added metadata as its reasoning.

It is a great value for business users who don't know technical table names to be able to reach data through well-maintained business terminology. At the same time, even if asset creation and table metadata retrieval are possible, queries will fail with an error without access permissions, so permission grants are required. Since the roles of data consumers and those managing data governance are separate, this can also be considered correct behavior.

Since adding metadata itself can be automated with datazone create-asset-revision, why not move forward with building the foundation for self-service analytics leveraging Data Agent, alongside catalog maintenance? I hope this article is helpful to someone.

Share this article

AWSのお困り事はクラスメソッドへ