[Update] I tried out AWS Glue Data Catalog's new support for business context and semantic search (Preview)
This is Ishikawa from the Cloud Business Division. AWS Glue Data Catalog now lets you attach a glossary (business context) to tables and search for data by "meaning" with semantic search. Both features are available in preview, so I tried them out.
Until now, table searches in Data Catalog were centered around technical metadata such as table names and column names. With this update, you can attach "business meaning" such as glossary terms, descriptions, and custom metadata to tables, and use the new Glue Search API (search-assets) to discover data based on meaning. AWS says the motivation is to give data business context so that AI agents can reason from trusted definitions.
The preview is available in the following regions: US East (N. Virginia), US East (Ohio), US West (Oregon), and Europe (Ireland).
What Are Business Context and Semantic Search
The main elements added this time are the following three:
- Glossary and glossary terms: A container that gives standard definitions to business concepts such as "active user," and the individual definitions within it.
- Association with assets: By linking terms to assets such as tables, you can represent "what data a table contains" in the catalog.
- Semantic search (search-assets): An API that searches for data based on the "meaning" of the attached business context and descriptions, rather than the table name itself.
In addition, you can use custom metadata (form types) and skill assets to add extra context such as data usage rules and join paths. These are also intended for use by AI agents (MCP-compatible agents and the AWS Agent Toolkit), but this article verifies the core flow of "glossary → association → semantic search".
Trying It Out
Prerequisites
- At least one table registered in Data Catalog (this article creates a new table for verification)
- An IAM role or user with permissions such as glue:CreateGlossary, glue:CreateGlossaryTerm, glue:AssociateGlossaryTerms, glue:Search, and glue:GetAsset
- Verification environment:
- Region: us-east-1 (N. Virginia)
- AWS CLI: aws-cli/2.35.9
The official documentation includes sample IAM policies for the required permissions. According to the official documentation, this feature cannot be used from the management console during the preview, so the verification is done with the AWS CLI.
Creating a Database and Table for Verification
First, prepare a table to be targeted by semantic search. Since actual data is not needed, we only create a table definition with a minimal schema.
% aws s3 mb s3://cm-datalake-20260622 --region us-east-1
make_bucket: cm-datalake-20260622
% aws glue create-database --region us-east-1 \
--database-input '{"Name":"blog_demo","Description":"For technical verification"}'
% aws glue create-table --region us-east-1 --database-name blog_demo \
--table-input '{
"Name":"sales_transactions",
"Description":"Daily sales transactions per user",
"TableType":"EXTERNAL_TABLE",
"Parameters":{"classification":"csv","skip.header.line.count":"1"},
"StorageDescriptor":{
"Columns":[
{"Name":"user_id","Type":"string"},
{"Name":"login_date","Type":"date"},
{"Name":"amount","Type":"double"}
],
"Location":"s3://cm-datalake-20260622/sales_transactions/",
"InputFormat":"org.apache.hadoop.mapred.TextInputFormat",
"OutputFormat":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
"SerdeInfo":{
"SerializationLibrary":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
"Parameters":{"field.delim":","}
}
}
}'
The table was created successfully.
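To confirm that the definition was stored as intended, one option is to read the column names back with get-table. The following is a minimal sketch. The function name table_columns and the GLUE_REGION variable are names chosen here for illustration and do not appear in the original steps.

```shell
# Hypothetical helper: print the column names of a Glue table.
# GLUE_REGION is an illustrative variable; it defaults to the region used in this article.
table_columns() {
  # $1 = database name, $2 = table name
  aws glue get-table \
    --region "${GLUE_REGION:-us-east-1}" \
    --database-name "$1" \
    --name "$2" \
    --query 'Table.StorageDescriptor.Columns[].Name' \
    --output text
}

# Example call (requires AWS credentials):
# table_columns blog_demo sales_transactions
```

For the table above, this should list user_id, login_date, and amount.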
Creating a Glossary
Create a glossary to store business definitions.
% aws glue create-glossary --region us-east-1 \
--name "enterprise_data_glossary" \
--description "Standardized business definitions for data assets across the organization."
{
"Id": "5sbt95rvaddx2f",
"Name": "enterprise_data_glossary",
"Description": "Standardized business definitions for data assets across the organization."
}
The glossary Id is returned in the response. This Id is used when creating terms later.
Creating a Term
Register the business concept "Active User" as a term with a definition. Specify the glossary Id from earlier in --glossary-identifier.
% aws glue create-glossary-term --region us-east-1 \
--glossary-identifier "5sbt95rvaddx2f" \
--name "Active User" \
--short-description "Users who logged in within the last 30 days." \
--long-description "A user who has logged in at least once in the past 30 days."
{
"Id": "bnm5ugtxv5kzvr",
"GlossaryId": "5sbt95rvaddx2f",
"Name": "Active User",
"ShortDescription": "Users who logged in within the last 30 days.",
"LongDescription": "A user who has logged in at least once in the past 30 days."
}
The term Id is returned.
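Because the glossary Id feeds into the term creation, copying Ids by hand can be avoided by capturing them with the CLI's global --query and --output options. The sketch below reuses only the Glue flags shown above. The function name, the GLUE_REGION variable, and the output format are illustrative assumptions.

```shell
# Sketch: create a glossary and one term, keeping the returned Ids in variables.
# Function and variable names are illustrative; the glue flags are those used above.
create_glossary_and_term() {
  # $1 = glossary name, $2 = glossary description,
  # $3 = term name, $4 = term short description, $5 = term long description
  region="${GLUE_REGION:-us-east-1}"

  # --query 'Id' --output text prints only the Id field of the response.
  glossary_id=$(aws glue create-glossary --region "$region" \
    --name "$1" \
    --description "$2" \
    --query 'Id' --output text) || return 1

  term_id=$(aws glue create-glossary-term --region "$region" \
    --glossary-identifier "$glossary_id" \
    --name "$3" \
    --short-description "$4" \
    --long-description "$5" \
    --query 'Id' --output text) || return 1

  printf 'glossary_id=%s\nterm_id=%s\n' "$glossary_id" "$term_id"
}
```

The function stops at the first failing call, so a term is never created against an empty glossary Id.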
Associating a Term with a Table
Associate the created term with the target table. There was one gotcha here. The official documentation tutorial shows an example of specifying the table ARN with --identifier, but when checking the aws command help (aws glue associate-glossary-terms help), the correct argument name was --asset-identifier.
% aws glue associate-glossary-terms --region us-east-1 \
--asset-identifier "arn:aws:glue:us-east-1:123456789012:table/blog_demo/sales_transactions" \
--glossary-term-identifiers "bnm5ugtxv5kzvr"
{
"AssetIdentifier": "arn:aws:glue:us-east-1:123456789012:table/blog_demo/sales_transactions",
"GlossaryTerms": [
"bnm5ugtxv5kzvr"
]
}
When the association succeeds, a list of terms linked to the target asset is returned.
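The asset identifier is a table ARN, which can be assembled from its parts rather than typed by hand. This is a small sketch. The function name glue_table_arn is hypothetical, and the format is simply the one visible in the command above (standard aws partition assumed).

```shell
# Hypothetical helper: build a Glue table ARN in the form used in this article.
# Assumes the standard "aws" partition.
glue_table_arn() {
  # $1 = region, $2 = 12-digit account ID, $3 = database name, $4 = table name
  printf 'arn:aws:glue:%s:%s:table/%s/%s\n' "$1" "$2" "$3" "$4"
}

# Example: the account ID can be looked up with STS instead of being hard-coded.
# account_id=$(aws sts get-caller-identity --query Account --output text)
# glue_table_arn us-east-1 "$account_id" blog_demo sales_transactions
```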
Searching with Semantic Search
Now for the semantic search. While the What's New announcement and documentation refer to the "Glue Search API" and "aws glue search", when I ran aws glue search just in case, the subcommand did not exist, as shown below.
% aws glue search --region us-east-1 --search-text "active users"
aws: [ERROR]: An error occurred (ParamValidation): argument operation: Found invalid choice 'search'
usage: aws [options] <command> <subcommand> [<subcommand> ...] [parameters]
To see help text, you can run:
aws help
aws <command> help
aws <command> <subcommand> help
When checking the aws command help (aws glue help), the correct command appears to be search-assets. Let's search using the keyword "active users," which is close to the business concept attached to the term.
% aws glue search-assets --region us-east-1 --search-text "active users"
{
"Items": [
{
"Id": "arn:aws:glue:us-east-1:123456789012:table/blog_demo/sales_transactions",
"AssetName": "sales_transactions",
"AssetDescription": "Daily sales transactions per user",
"UpdatedAt": "2026-06-22T22:26:02.452000+09:00",
"AssetTypeId": "Table"
}
]
}
Only the table to which the term "Active User" was attached appeared as a hit. The table name is sales_transactions, and the string "active users" appears in neither the table name nor the column names. It still matched because the search draws on the business context provided through the glossary.
Next, I also checked whether searching in Japanese with "アクティブユーザー" would work, but no results were returned.
% aws glue search-assets --region us-east-1 --search-text "アクティブユーザー"
{
"Items": []
}
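When comparing several phrasings like this, it can help to print just the number of hits per query. The following is a rough sketch using the same search-assets call with the global --query option. The function name count_hits and the GLUE_REGION variable are illustrative. It also assumes the response always contains an Items array and fits in a single page.

```shell
# Sketch: print how many assets each query returns, one "count<TAB>query" line per query.
# Assumes the response has an Items array and is not split across pages.
count_hits() {
  for q in "$@"; do
    n=$(aws glue search-assets \
      --region "${GLUE_REGION:-us-east-1}" \
      --search-text "$q" \
      --query 'length(Items)' \
      --output text) || return 1
    printf '%s\t%s\n' "$n" "$q"
  done
}

# Example call (requires AWS credentials):
# count_hits "active users" "アクティブユーザー"
```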
Checking the Contents of an Asset
Use get-asset to check what information the table found in the search actually holds. Note that the argument for get-asset is --identifier, which differs from associate-glossary-terms (--asset-identifier).
% aws glue get-asset --region us-east-1 \
--identifier "arn:aws:glue:us-east-1:123456789012:table/blog_demo/sales_transactions"
{
"Id": "arn:aws:glue:us-east-1:123456789012:table/blog_demo/sales_transactions",
"Name": "sales_transactions",
"Description": "Daily sales transactions per user",
"CreatedAt": "2026-06-22T22:26:02.452000+09:00",
"UpdatedAt": "2026-06-22T22:26:02.452000+09:00",
"AssetTypeId": "amazon.glue::Table",
"GlossaryTerms": [
"bnm5ugtxv5kzvr"
],
"Forms": {
"amazon.glue::GlueTable": {
"FormTypeId": "GlueTable",
"Content": "{\"lakeFormationRegistration\":\"NotRegistered\",\"materializedViewStatus\":\"NotMaterializedView\",\"databaseName\":\"blog_demo\",\"serdeLibrary\":\"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe\",\"multiDialectViewStatus\":\"Disabled\",\"inputFormat\":\"org.apache.hadoop.mapred.TextInputFormat\",\"outputFormat\":\"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat\",\"retention\":0}"
},
"amazon::Table": {
"FormTypeId": "Table",
"Content": "{\"dataLocation\":\"s3://cm-datalake-20260622/sales_transactions/\",\"dataFormat\":\"csv\",\"type\":\"EXTERNAL_TABLE\"}"
}
},
"Attachments": {},
"IterableForms": {
"columns": {
"FormTypeId": "columns"
}
}
}
In addition to the Id of the associated term (GlossaryTerms), we can see that the table format and column information are stored as Forms.
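Each form's Content is itself a JSON document stored as an escaped string, which is hard to read in the raw response. As a sketch, the global --query option can pull out a single form's Content, and --output text prints the string without the outer escaping. The function name asset_form_content and the GLUE_REGION variable are hypothetical.

```shell
# Hypothetical helper: print the raw JSON stored in one form of an asset.
asset_form_content() {
  # $1 = asset identifier (table ARN), $2 = form key such as amazon::Table
  # The form key contains colons, so it is quoted inside the JMESPath expression.
  aws glue get-asset \
    --region "${GLUE_REGION:-us-east-1}" \
    --identifier "$1" \
    --query "Forms.\"$2\".Content" \
    --output text
}

# Example call (requires AWS credentials):
# asset_form_content "arn:aws:glue:us-east-1:123456789012:table/blog_demo/sales_transactions" "amazon::Table"
```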
Filtering by Structure (Asset Type)
Semantic search can also filter by structural attributes in addition to business context. Let's try limiting results to only those where the asset type is Table using --filter-clause.
% aws glue search-assets --region us-east-1 \
--search-text "active users" \
--filter-clause '{"AttributeFilter":{"Attribute":"type","Operator":"equals","Value":{"StringValue":"Table"}}}' \
--max-items 10
{
"Items": [
{
"Id": "arn:aws:glue:us-east-1:123456789012:table/blog_demo/sales_transactions",
"AssetName": "sales_transactions",
"AssetDescription": "Daily sales transactions per user",
"UpdatedAt": "2026-06-22T22:26:02.452000+09:00",
"AssetTypeId": "Table"
}
]
}
As intended, the target table of type Table was returned.
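The filter clause is plain JSON, so it can be generated instead of hand-quoted on the command line. The sketch below reproduces exactly the clause used above and only parameterizes the asset type value. The function name asset_type_filter is hypothetical, and other attributes or operators are not covered because this article did not verify them.

```shell
# Hypothetical helper: build the asset-type filter clause used above.
asset_type_filter() {
  # $1 = asset type value, for example Table
  # The value is inserted as-is, so it must not contain double quotes or backslashes.
  printf '{"AttributeFilter":{"Attribute":"type","Operator":"equals","Value":{"StringValue":"%s"}}}\n' "$1"
}

# Example call (requires AWS credentials):
# aws glue search-assets --region us-east-1 \
#   --search-text "active users" \
#   --filter-clause "$(asset_type_filter Table)"
```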
Discussion
Here is a summary of the findings obtained through the verification.
- With a keyword close to the glossary term ("active users"), we were able to pinpoint and discover the table that had the term attached, even though the string was not present anywhere in the table name. We confirmed that search based on business context is working.
- On the other hand, when searching with a rephrased query of the same meaning ("people who signed in recently"), the table we attached did not appear in the results, and unrelated tables were returned instead. While this was not the expected result, it is likely because search accuracy depends on how semantically close the attached context (term name and description) is to the search query. To make data more discoverable, it seems important to enrich term names and descriptions to align with anticipated search terms.
- In catalogs with many existing tables, searching with loosely related keywords returned a wide range of tables in the results. Search results are simply ranked candidates, and the quality of the attached business context directly impacts "discoverability."
- Hits appeared immediately after associating a term (associate-glossary-terms), with no noticeable indexing delay. Given the preview status, I had expected the index to take some time to update, but there were no issues in this small-scale verification.
- Gotchas encountered included: aws glue search in the documentation is actually search-assets in the CLI; the association argument is --asset-identifier rather than --identifier as shown in the documentation; and the argument names are inconsistent between associate-glossary-terms (--asset-identifier) and get-asset (--identifier). As this is a preview, the discrepancies between documentation and CLI notation are expected to be resolved going forward.
The following are points I would like to try in the future:
- Standardization of attributes using custom metadata (form types) and filtered search by attribute
- Attaching data usage rules for AI agents using skill assets
- Integration with catalog search from MCP-compatible agents (aws-data-analytics plugin)
Closing
I went through the entire flow of AWS Glue Data Catalog's business context and semantic search (preview), from creating a glossary to associating it with a table and discovering it via semantic search.
Being able to discover data based on business meaning rather than relying on table names or column names is a significant step forward for both data users and AI agents that reference data. At the same time, since search accuracy depends on the quality of the attached business context, maintaining term names and descriptions should be planned as part of adopting the feature.
All operations are performed on Glue Data Catalog metadata, and no additional infrastructure is required. I confirmed that the resources created during verification (term associations, terms, glossaries, tables, and databases) are also removed from search results when deleted in reverse dependency order. When trying the preview, I recommend doing a small-scale verification in a supported region and making sure to clean up afterward.
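For the last part of that cleanup, the table, database, and bucket created at the start can be removed with long-standing commands. The sketch below covers only those three resources and is meant to run after the term association, term, and glossary have been removed. The commands for those preview resources are omitted here because they are not shown in this article. The function name and the GLUE_REGION variable are illustrative.

```shell
# Sketch: remove the demo table, database, and (empty) bucket, in that order.
# Run this only after the term association, term, and glossary have been deleted.
cleanup_demo_table() {
  # $1 = database name, $2 = table name, $3 = bucket name
  region="${GLUE_REGION:-us-east-1}"
  aws glue delete-table --region "$region" --database-name "$1" --name "$2" &&
  aws glue delete-database --region "$region" --name "$1" &&
  aws s3 rb "s3://$3" --region "$region"
}

# Example call (requires AWS credentials):
# cleanup_demo_table blog_demo sales_transactions cm-datalake-20260622
```

The && chain stops at the first failure, so the database is not deleted if removing the table fails.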
