I checked how the metadata synchronized with Amazon Bedrock Knowledge Bases is stored in Amazon OpenSearch Service

2025.12.20

This page has been translated by machine translation. View original

 IntroductionHello, this is Jinno from the Consulting Department, a big fan of supermarkets.
When you set up a metadata file (.metadata.json) with Bedrock Knowledge Bases, the metadata is synchronized to Amazon OpenSearch Service, right?
While you typically use this metadata for filtering with the Retrieve or RetrieveAndGenerate commands, there are cases where you might want to query OpenSearch directly. At least I did...!
When you want to filter by metadata using your own logic before making a request to Bedrock, and then perform custom processing afterward
When you want to perform unsupported operations like document aggregation or grouping
In these cases, if you don't know how the information defined in .metadata.json is represented in the fields when synchronized to OpenSearch through Knowledge Bases, you won't know how to write your OpenSearch queries.
So, I decided to check this and verify the behavior by executing queries against the actual metadata I've set up.
 What We'll DoWe'll check how metadata placed in S3 is synchronized to OpenSearch,

and confirm the behavior by directly executing queries from the Dev Tools in the OpenSearch dashboard.
 EnvironmentAmazon Bedrock Knowledge Bases (using OpenSearch Serverless as the vector store)
Built with AWS CDK
The sample CDK code is available here:
https://github.com/yuu551/sample-knowledge
 Resources Created by CDKThis CDK stack uses @cdklabs/generative-ai-cdk-constructs to create the following resources:


Resource
Description


S3 Bucket
For document storage. Sample documents are also automatically deployed

Bedrock Knowledge Bases
Automatically creates OpenSearch Serverless as vector store

S3 Data Source
Links Knowledge Base with S3 bucket

OpenSearch Serverless Collection
Vector store automatically built when creating Knowledge Base

The main settings for the Knowledge Base are as follows:


Setting
Value


Embedding Model
Amazon Titan Text Embeddings V2 (1024 dimensions)

Vector Store
OpenSearch Serverless

Chunking Strategy
Fixed Size

Max Tokens
500

Overlap Percentage
20%

We use VectorKnowledgeBase from generative-ai-cdk-constructs to create OpenSearch Serverless collections, indexes, data access policies, etc.
 Sample DocumentsWe've prepared three types of documents and metadata files for testing:


File
document_type
department
is_public
Notes


annual-report-2024.txt
report
finance
false
Annual report. Has year attribute

tech-blog-aws-bedrock.txt
blog
engineering
true
Technical blog. Has tags attribute (STRING_LIST)

product-manual-v2.txt
manual
product
true
Product manual. Has version attribute

We've set different metadata attributes for each document to verify the behavior of each type.

The sample document metadata will also be uploaded when you deploy the CDK.
 S3 Metadata File FormatTo set metadata for Bedrock Knowledge Bases, place a metadata file named filename.extension.metadata.json in the same S3 path as the target document.

For example, for annual-report-2024.txt, place annual-report-2024.txt.metadata.json in the same location.
Let's look at an example of the metadata file we used:
{
  "metadataAttributes": {
    "document_type": {
      "value": { "type": "STRING", "stringValue": "report" }
    },
    "priority": {
      "value": { "type": "NUMBER", "numberValue": 95 }
    },
    "is_public": {
      "value": { "type": "BOOLEAN", "booleanValue": false }
    },
    "department": {
      "value": { "type": "STRING", "stringValue": "finance" }
    },
    "year": {
      "value": { "type": "NUMBER", "numberValue": 2024 }
    },
    "tags": {
      "value": {
        "type": "STRING_LIST",
        "stringListValue": ["aws", "bedrock", "generative-ai"]
      }
    }
  }
}
Four types can be specified for type: STRING, NUMBER, BOOLEAN, and STRING_LIST.
Note that there are constraints on metadata. You can have up to 50 attribute names, strings up to 2048 characters, and STRING_LIST elements up to 10.
https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_DocumentMetadata.html
https://docs.aws.amazon.com/ja_jp/bedrock/latest/APIReference/API_agent_MetadataAttributeValue.html
 Deployment Steps Prerequisites and Versions UsedNode.js v24.10.0
 Installationgit clone https://github.com/yuu551/sample-knowledge.git
cd sample-knowledge
pnpm install
 Parameter ConfigurationIn lib/parameter.ts, set access permissions for the OpenSearch Serverless dashboard.

Add the ARN of your IAM user or role:
lib/parameter.ts
export const parameter = {
  /**
   * IAM principals allowed to access the OpenSearch Serverless dashboard
   */
  opensearchAccessPrincipals: [
    'arn:aws:iam::123456789012:user/your-username',
    // 'arn:aws:iam::123456789012:role/Admin',
  ] as string[],

  // ... other settings
};
 DeploymentIf it's your first time, you need to bootstrap.

Run the deploy command to execute the deployment:
# First time only
cdk bootstrap

# Deploy
cdk deploy
After deployment is complete, the following output will be displayed:
Outputs:
SampleKnowledgeStack.KnowledgeBaseId = XXXXXXXXXX
SampleKnowledgeStack.DataSourceId = XXXXXXXXXX
SampleKnowledgeStack.DocumentBucketName = sampleknowledgestack-docbucket...
SampleKnowledgeStack.OpenSearchCollectionArn = arn:aws:aoss:...
The infrastructure environment is now created! Let's synchronize from the Knowledge Bases screen.
 Data SynchronizationAfter deployment, execute data source synchronization from the console.

This will vectorize the data and store it in OpenSearch.
Bedrock console → Knowledge bases → Select the created Knowledge Base
Data source → Execute Sync
 Accessing the OpenSearch DashboardAfter data source synchronization is complete, let's check how metadata is stored by accessing the OpenSearch Serverless dashboard.
Open OpenSearch Service in the AWS console, select Serverless > Collections from the left menu. Select the collection created by CDK and click Dashboard.
Once you access the dashboard, open Dev Tools from the left menu.

Here you can issue queries directly to OpenSearch.
Enter your query in the left editor and press the play button to see the results on the right.
If you can't access the dashboard, check if your IAM role has been added to the data access policy.

In this CDK, we grant access permissions to the roles specified in parameter.ts.
 Checking the Storage Format in OpenSearchNow let's see how metadata is actually stored in OpenSearch.
 Checking the MappingFirst, let's check the index mapping. Run the following query in Dev Tools to see how data is registered with what mapping:
GET /bedrock-knowledge-base-default-index/_mapping
Looking at the results, we can see that in addition to the fields automatically created by Bedrock, attributes defined in metadata have been added as fields.
{
  "bedrock-knowledge-base-default-index": {
    "mappings": {
      "properties": {
        "AMAZON_BEDROCK_METADATA": {
          "type": "text",
          "index": false
        },
        "AMAZON_BEDROCK_TEXT_CHUNK": {
          "type": "text"
        },
        "bedrock-knowledge-base-default-vector": {
          "type": "knn_vector",
          "dimension": 1024,
          "method": {
            "engine": "faiss",
            "space_type": "l2",
            "name": "hnsw",
            "parameters": {}
          }
        },
        "document_type": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "department": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "priority": { "type": "float" },
        "is_public": { "type": "boolean" },
        "year": { "type": "float" },
        "tags": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        }
      }
    }
  }
}
I see how it's registered now.

AMAZON_BEDROCK_TEXT_CHUNK is the chunked document text, and bedrock-knowledge-base-default-vector is its vector representation. And attributes defined in metadata like document_type and priority are expanded as individual fields.
This shows that Bedrock reads the .metadata.json during data source synchronization and automatically creates fields for us.
 Checking the Actual DataNext, let's see how the actual documents are stored.
GET /bedrock-knowledge-base-default-index/_search
We can see that metadata is stored as individual attributes in the _source field.
{
  "_source": {
    "document_type": "report",
    "department": "finance",
    "priority": 95,
    "is_public": false,
    "year": 2024,
    "AMAZON_BEDROCK_TEXT_CHUNK": "Document text...",
    "bedrock-knowledge-base-default-vector": [...],
    "tags": ["aws", "bedrock", "generative-ai", "rag"]
  }
}
The metadata values are stored directly as field values!

This means we can filter and aggregate using standard OpenSearch queries.
 Type MappingLet's summarize how metadata types are mapped in OpenSearch.


Metadata Type
OpenSearch Mapping


STRING
text type (with keyword subfield)

NUMBER
float type

BOOLEAN
boolean type

STRING_LIST
text type (with keyword subfield)

The key point is that string-based types become text type.
OpenSearch's text type is tokenized for full-text search, so if you want to search for an exact match, you need to use the .keyword subfield. For example, to search for documents where document_type is report, you query against document_type.keyword.
Also, direct mapping to date type is not supported. If you want to handle dates, store them as NUMBER type as UNIX timestamps or in YYYYMMDD format. In our sample, we defined year as a number.
 Trying Queries in Dev ToolsNow let's actually issue some queries and see if we can filter by metadata.
 Exact Match SearchFirst, let's search for documents where document_type is report and is_public is false.
GET /bedrock-knowledge-base-default-index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "document_type.keyword": "report" } },
        { "term": { "is_public": false } }
      ]
    }
  }
}
Note that you need to add .keyword to string fields. As we saw in the mapping earlier, strings are stored as text type, so you need to use the keyword subfield for exact matching. If you forget this, you won't get the intended results.
Looking at the execution results, we got one hit for the annual report document as expected.
{
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "hits": [
      {
        "_source": {
          "x-amz-bedrock-kb-source-uri": "s3://sampleknowledgestack-docbucket.../annual-report-2024.txt",
          "year": 2024,
          "is_public": false,
          "priority": 95,
          "department": "finance",
          "AMAZON_BEDROCK_TEXT_CHUNK": "2024 Annual Report  Overview This report summarizes our business activities, financial situation, and future outlook for the 2024 fiscal year...",
          "document_type": "report"
        }
      }
    ]
  }
}
 Filtering by Numeric FieldsNext, let's search for documents from 2024 using the year field.
GET /bedrock-knowledge-base-default-index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "year": 2024 } }
      ]
    }
  }
}
For numeric fields, .keyword is not needed. Metadata defined as NUMBER type is mapped as float type, so you can search directly with a number.
We've successfully retrieved the metadata for the 2024 data.
 Searching STRING_LISTLet's also try fields like tags that are of STRING_LIST type. We'll search for documents where department is engineering and tags includes aws.
GET /bedrock-knowledge-base-default-index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "department.keyword": "engineering" } },
        { "term": { "tags.keyword": "aws" } }
      ]
    }
  }
}
STRING_LIST is also stored internally as an array of text type, so .keyword is needed for exact matching.

If any element in the array matches, it will be a hit, so we can search for "documents containing the aws tag."
{
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "hits": [
      {
        "_source": {
          "x-amz-bedrock-kb-source-uri": "s3://sampleknowledgestack-docbucket.../tech-blog-aws-bedrock.txt",
          "is_public": true,
          "priority": 70,
          "department": "engineering",
          "AMAZON_BEDROCK_TEXT_CHUNK": "Tech Blog: Getting Started with Generative AI on Amazon Bedrock  Introduction Amazon Bedrock is a fully managed generative AI service provided by AWS...",
          "document_type": "blog",
          "tags": ["aws", "bedrock", "generative-ai", "rag"]
        }
      }
    ]
  }
}
 Aggregation QueriesOpenSearch also supports aggregation queries.

You can group documents and get statistical information.
GET /bedrock-knowledge-base-default-index/_search
{
  "size": 0,
  "aggs": {
    "by_type": {
      "terms": { "field": "document_type.keyword" }
    },
    "by_department": {
      "terms": { "field": "department.keyword" }
    },
    "avg_priority": {
      "avg": { "field": "priority" }
    }
  }
}
Specifying size: 0 returns only the aggregation results without the documents themselves.

This can be useful for getting information like counts by document type or department, or average priority.
Here's what the results look like:
{
  "took": 80,
  "timed_out": false,
  "_shards": {
    "total": 0,
    "successful": 0,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "avg_priority": {
      "value": 81.66666666666667
    },
    "by_department": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "engineering",
          "doc_count": 1
        },
        {
          "key": "finance",
          "doc_count": 1
        },
        {
          "key": "product",
          "doc_count": 1
        }
      ]
    },
    "by_type": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "blog",
          "doc_count": 1
        },
        {
          "key": "manual",
          "doc_count": 1
        },
        {
          "key": "report",
          "doc_count": 1
        }
      ]
    }
  }
}
 ConclusionToday we looked at how Bedrock Knowledge Bases metadata is stored in OpenSearch Serverless!
Knowledge Bases reads the contents of .metadata.json and automatically expands it into OpenSearch fields. It's important to note that you need to use the .keyword subfield for exact string matching.
For cases where you need to issue queries directly to OpenSearch, it's essential to understand how the data is structured.
This has been educational for me! I hope this article was helpful to you too! Thank you for reading to the end!!

I checked how the metadata synchronized with Amazon Bedrock Knowledge Bases is stored in Amazon OpenSearch Service

Introduction

What We'll Do

Environment

Resources Created by CDK

Sample Documents

S3 Metadata File Format

Deployment Steps

Prerequisites and Versions Used

Installation

Parameter Configuration

Deployment

Data Synchronization

Accessing the OpenSearch Dashboard

Checking the Storage Format in OpenSearch

Checking the Mapping

Checking the Actual Data

Type Mapping

Trying Queries in Dev Tools

Exact Match Search

Filtering by Numeric Fields

Searching STRING_LIST

Aggregation Queries

Conclusion

Related articles

AWS Topics

Trending Topics

Products & Services

Features and Series

Resource	Description
S3 Bucket	For document storage. Sample documents are also automatically deployed
Bedrock Knowledge Bases	Automatically creates OpenSearch Serverless as vector store
S3 Data Source	Links Knowledge Base with S3 bucket
OpenSearch Serverless Collection	Vector store automatically built when creating Knowledge Base

Setting	Value
Embedding Model	Amazon Titan Text Embeddings V2 (1024 dimensions)
Vector Store	OpenSearch Serverless
Chunking Strategy	Fixed Size
Max Tokens	500
Overlap Percentage	20%

File	document_type	department	is_public	Notes
annual-report-2024.txt	report	finance	false	Annual report. Has `year` attribute
tech-blog-aws-bedrock.txt	blog	engineering	true	Technical blog. Has `tags` attribute (STRING_LIST)
product-manual-v2.txt	manual	product	true	Product manual. Has `version` attribute

Metadata Type	OpenSearch Mapping
STRING	text type (with keyword subfield)
NUMBER	float type
BOOLEAN	boolean type
STRING_LIST	text type (with keyword subfield)