I checked how the metadata synchronized with Amazon Bedrock Knowledge Bases is stored in Amazon OpenSearch Service


2025.12.20


Introduction

Hello, this is Jinno from the Consulting Department, a big fan of supermarkets.

When you set up a metadata file (.metadata.json) with Bedrock Knowledge Bases, the metadata is synchronized to Amazon OpenSearch Service, right?

You typically use this metadata for filtering with the Retrieve or RetrieveAndGenerate API operations, but there are cases where you might want to query OpenSearch directly. At least I did...!

  • When you want to filter by metadata with your own logic before making a request to Bedrock, and then apply custom processing afterward
  • When you want to perform operations that Knowledge Bases doesn't support, such as aggregating or grouping documents

In these cases, you can't write OpenSearch queries unless you know which fields the attributes defined in .metadata.json end up in once Knowledge Bases synchronizes them to OpenSearch.


So, I decided to check this and verify the behavior by executing queries against the actual metadata I've set up.

What We'll Do

We'll check how metadata placed in S3 is synchronized to OpenSearch, and confirm the behavior by running queries directly from Dev Tools in the OpenSearch dashboard.

Environment

  • Amazon Bedrock Knowledge Bases (using OpenSearch Serverless as the vector store)
  • Built with AWS CDK

The sample CDK code is available here:

https://github.com/yuu551/sample-knowledge

Resources Created by CDK

This CDK stack uses @cdklabs/generative-ai-cdk-constructs to create the following resources:

| Resource | Description |
|---|---|
| S3 Bucket | For document storage. Sample documents are also automatically deployed |
| Bedrock Knowledge Bases | Automatically creates OpenSearch Serverless as vector store |
| S3 Data Source | Links Knowledge Base with S3 bucket |
| OpenSearch Serverless Collection | Vector store automatically built when creating Knowledge Base |

The main settings for the Knowledge Base are as follows:

| Setting | Value |
|---|---|
| Embedding Model | Amazon Titan Text Embeddings V2 (1024 dimensions) |
| Vector Store | OpenSearch Serverless |
| Chunking Strategy | Fixed Size |
| Max Tokens | 500 |
| Overlap Percentage | 20% |

We use VectorKnowledgeBase from generative-ai-cdk-constructs to create OpenSearch Serverless collections, indexes, data access policies, etc.

Sample Documents

We've prepared three types of documents and metadata files for testing:

| File | document_type | department | is_public | Notes |
|---|---|---|---|---|
| annual-report-2024.txt | report | finance | false | Annual report. Has year attribute |
| tech-blog-aws-bedrock.txt | blog | engineering | true | Technical blog. Has tags attribute (STRING_LIST) |
| product-manual-v2.txt | manual | product | true | Product manual. Has version attribute |

We've set different metadata attributes for each document so that we can verify the behavior of each type.
The metadata files for the sample documents are uploaded along with the documents when you deploy the CDK stack.

S3 Metadata File Format

To set metadata for Bedrock Knowledge Bases, place a metadata file named filename.extension.metadata.json in the same S3 path as the target document.
For example, for annual-report-2024.txt, place annual-report-2024.txt.metadata.json in the same location.
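The naming rule above is easy to get wrong when generating metadata files in bulk, so here is a minimal sketch of it in TypeScript. The helper names and the docs/ prefix are my own for illustration and are not part of the sample repository.

```typescript
// Suffix appended to the full document file name (extension included).
const METADATA_SUFFIX = ".metadata.json";

// Returns the S3 key of the metadata file for a given document key.
// The metadata file sits next to the document, under the same prefix.
function metadataKeyFor(documentKey: string): string {
  return documentKey + METADATA_SUFFIX;
}

// True when the key is itself a metadata file rather than a document.
function isMetadataKey(key: string): boolean {
  const start = key.length - METADATA_SUFFIX.length;
  return start > 0 && key.indexOf(METADATA_SUFFIX, start) === start;
}

// "docs/" is a hypothetical prefix used only for this example.
const exampleKey = metadataKeyFor("docs/annual-report-2024.txt");
// exampleKey is "docs/annual-report-2024.txt.metadata.json"
```

Note that the suffix goes after the original extension, so the .txt part stays in the metadata file name.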

Let's look at an example of the metadata file we used:

{
  "metadataAttributes": {
    "document_type": {
      "value": { "type": "STRING", "stringValue": "report" }
    },
    "priority": {
      "value": { "type": "NUMBER", "numberValue": 95 }
    },
    "is_public": {
      "value": { "type": "BOOLEAN", "booleanValue": false }
    },
    "department": {
      "value": { "type": "STRING", "stringValue": "finance" }
    },
    "year": {
      "value": { "type": "NUMBER", "numberValue": 2024 }
    },
    "tags": {
      "value": {
        "type": "STRING_LIST",
        "stringListValue": ["aws", "bedrock", "generative-ai"]
      }
    }
  }
}

Four types can be specified for type: STRING, NUMBER, BOOLEAN, and STRING_LIST.

Note that metadata has limits: up to 50 attribute names, string values of up to 2048 characters, and up to 10 elements in a STRING_LIST.
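Writing the nested value/type structure by hand gets tedious, so a small converter from a plain object can help. The following is only a sketch: the type and function names are hypothetical, and the limits are the ones quoted in this article, so please check them against the API reference linked below before relying on them.

```typescript
type MetadataInput = string | number | boolean | string[];

type AttributeValue =
  | { type: "STRING"; stringValue: string }
  | { type: "NUMBER"; numberValue: number }
  | { type: "BOOLEAN"; booleanValue: boolean }
  | { type: "STRING_LIST"; stringListValue: string[] };

interface MetadataFile {
  metadataAttributes: Record<string, { value: AttributeValue }>;
}

// Limits as quoted in this article (assumption: verify against the API reference).
const MAX_ATTRIBUTES = 50;
const MAX_STRING_LENGTH = 2048;
const MAX_LIST_ITEMS = 10;

// Picks the metadata type from the JavaScript type of the input value.
function toAttributeValue(name: string, input: MetadataInput): AttributeValue {
  if (typeof input === "string") {
    if (input.length > MAX_STRING_LENGTH) throw new Error(`${name}: string is too long`);
    return { type: "STRING", stringValue: input };
  }
  if (typeof input === "number") return { type: "NUMBER", numberValue: input };
  if (typeof input === "boolean") return { type: "BOOLEAN", booleanValue: input };
  if (input.length > MAX_LIST_ITEMS) throw new Error(`${name}: too many list elements`);
  return { type: "STRING_LIST", stringListValue: input };
}

// Builds the object to serialize as <document file name>.metadata.json.
function toMetadataFile(attrs: Record<string, MetadataInput>): MetadataFile {
  const names = Object.keys(attrs);
  if (names.length > MAX_ATTRIBUTES) throw new Error("too many attributes");
  const metadataAttributes: MetadataFile["metadataAttributes"] = {};
  for (const name of names) {
    metadataAttributes[name] = { value: toAttributeValue(name, attrs[name]) };
  }
  return { metadataAttributes };
}

const reportMetadata = toMetadataFile({
  document_type: "report",
  priority: 95,
  is_public: false,
  tags: ["aws", "bedrock", "generative-ai"],
});
```

Passing reportMetadata to JSON.stringify gives a file body in the same shape as the example above.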

https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_DocumentMetadata.html

https://docs.aws.amazon.com/ja_jp/bedrock/latest/APIReference/API_agent_MetadataAttributeValue.html

Deployment Steps

Prerequisites and Versions Used

  • Node.js v24.10.0

Installation

git clone https://github.com/yuu551/sample-knowledge.git
cd sample-knowledge
pnpm install

Parameter Configuration

In lib/parameter.ts, set access permissions for the OpenSearch Serverless dashboard.
Add the ARN of your IAM user or role:

lib/parameter.ts
export const parameter = {
  /**
   * IAM principals allowed to access the OpenSearch Serverless dashboard
   */
  opensearchAccessPrincipals: [
    'arn:aws:iam::123456789012:user/your-username',
    // 'arn:aws:iam::123456789012:role/Admin',
  ] as string[],

  // ... other settings
};

Deployment

If this is your first CDK deployment in the target account and Region, run bootstrap first.
Then run the deploy command:

# First time only
cdk bootstrap

# Deploy
cdk deploy

After deployment is complete, the following output will be displayed:

Outputs:
SampleKnowledgeStack.KnowledgeBaseId = XXXXXXXXXX
SampleKnowledgeStack.DataSourceId = XXXXXXXXXX
SampleKnowledgeStack.DocumentBucketName = sampleknowledgestack-docbucket...
SampleKnowledgeStack.OpenSearchCollectionArn = arn:aws:aoss:...

The infrastructure environment is now created! Let's synchronize from the Knowledge Bases screen.

Data Synchronization

After deployment, execute data source synchronization from the console.
This will vectorize the data and store it in OpenSearch.

  1. Bedrock console → Knowledge bases → Select the created Knowledge Base
  2. Data source → Execute Sync


Accessing the OpenSearch Dashboard

After data source synchronization is complete, let's check how metadata is stored by accessing the OpenSearch Serverless dashboard.

Open OpenSearch Service in the AWS console and select Serverless > Collections from the left menu. Then select the collection created by CDK and click Dashboard.


Once you access the dashboard, open Dev Tools from the left menu.
Here you can issue queries directly to OpenSearch.


Enter your query in the left editor and press the play button to see the results on the right.


If you can't access the dashboard, check whether your IAM user or role has been added to the data access policy.
In this CDK stack, access is granted to the principals listed in lib/parameter.ts.

Checking the Storage Format in OpenSearch

Now let's see how metadata is actually stored in OpenSearch.

Checking the Mapping

First, let's check the index mapping. Run the following query in Dev Tools to see which fields exist and how each one is mapped:

GET /bedrock-knowledge-base-default-index/_mapping


Looking at the results, we can see that in addition to the fields automatically created by Bedrock, attributes defined in metadata have been added as fields.

{
  "bedrock-knowledge-base-default-index": {
    "mappings": {
      "properties": {
        "AMAZON_BEDROCK_METADATA": {
          "type": "text",
          "index": false
        },
        "AMAZON_BEDROCK_TEXT_CHUNK": {
          "type": "text"
        },
        "bedrock-knowledge-base-default-vector": {
          "type": "knn_vector",
          "dimension": 1024,
          "method": {
            "engine": "faiss",
            "space_type": "l2",
            "name": "hnsw",
            "parameters": {}
          }
        },
        "document_type": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "department": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "priority": { "type": "float" },
        "is_public": { "type": "boolean" },
        "year": { "type": "float" },
        "tags": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        }
      }
    }
  }
}

I see how it's registered now.
AMAZON_BEDROCK_TEXT_CHUNK is the chunked document text, and bedrock-knowledge-base-default-vector is its vector representation. And attributes defined in metadata like document_type and priority are expanded as individual fields.

This shows that Bedrock reads the .metadata.json during data source synchronization and automatically creates fields for us.
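Since the mapping tells us which fields are text with a keyword subfield, we can derive the field name to use for exact matching from it instead of remembering it per field. Below is a rough sketch. The helper name is hypothetical, and sampleProperties is just a hand-copied subset of the mapping output above.

```typescript
// Shape of one entry under mappings.properties (only the parts used here).
interface FieldMapping {
  type: string;
  fields?: Record<string, { type: string }>;
}

// Returns the field name to use in term queries and terms aggregations.
// text fields with a keyword subfield resolve to "<name>.keyword";
// everything else (float, boolean, ...) is returned unchanged.
function exactMatchField(properties: Record<string, FieldMapping>, name: string): string {
  const mapping = properties[name];
  if (mapping === undefined) throw new Error(`unknown field: ${name}`);
  const hasKeyword = mapping.fields !== undefined && mapping.fields["keyword"] !== undefined;
  return mapping.type === "text" && hasKeyword ? name + ".keyword" : name;
}

// Subset of the mapping shown above, copied by hand for the example.
const sampleProperties: Record<string, FieldMapping> = {
  document_type: { type: "text", fields: { keyword: { type: "keyword" } } },
  tags: { type: "text", fields: { keyword: { type: "keyword" } } },
  priority: { type: "float" },
  is_public: { type: "boolean" },
};

const typeField = exactMatchField(sampleProperties, "document_type"); // "document_type.keyword"
const priorityField = exactMatchField(sampleProperties, "priority"); // "priority"
```

In a real script you would pass the properties object from the _mapping response instead of the hand-written subset.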

Checking the Actual Data

Next, let's see how the actual documents are stored.

GET /bedrock-knowledge-base-default-index/_search

We can see that metadata is stored as individual attributes in the _source field.

{
  "_source": {
    "document_type": "report",
    "department": "finance",
    "priority": 95,
    "is_public": false,
    "year": 2024,
    "AMAZON_BEDROCK_TEXT_CHUNK": "Document text...",
    "bedrock-knowledge-base-default-vector": [...],
    "tags": ["aws", "bedrock", "generative-ai", "rag"]
  }
}

The metadata values are stored directly as field values!
This means we can filter and aggregate using standard OpenSearch queries.

Type Mapping

Let's summarize how metadata types are mapped in OpenSearch.

| Metadata Type | OpenSearch Mapping |
|---|---|
| STRING | text type (with keyword subfield) |
| NUMBER | float type |
| BOOLEAN | boolean type |
| STRING_LIST | text type (with keyword subfield) |

The key point is that string-based types become text type.

OpenSearch's text type is tokenized for full-text search, so if you want to search for an exact match, you need to use the .keyword subfield. For example, to search for documents where document_type is report, you query against document_type.keyword.

Also, direct mapping to date type is not supported. If you want to handle dates, store them as NUMBER type as UNIX timestamps or in YYYYMMDD format. In our sample, we defined year as a number.
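Here is a small sketch of the two date encodings mentioned above, plus one caveat I think is worth checking. If the float mapping is a 32-bit single-precision float, as it is in standard OpenSearch, not every integer above 2^24 (16,777,216) can be represented exactly. That range includes YYYYMMDD values and UNIX timestamps in seconds. The function names are my own, and asFloat32 only simulates single-precision rounding locally. I have not verified how the index itself behaves.

```typescript
// Encodes a date as a YYYYMMDD number in UTC.
function toYyyymmdd(date: Date): number {
  return date.getUTCFullYear() * 10000 + (date.getUTCMonth() + 1) * 100 + date.getUTCDate();
}

// Encodes a date as whole seconds since the UNIX epoch.
function toUnixSeconds(date: Date): number {
  return Math.floor(date.getTime() / 1000);
}

// Rounds a number to single precision, the way a 32-bit float would hold it.
function asFloat32(value: number): number {
  return new Float32Array([value])[0];
}

const published = new Date(Date.UTC(2025, 11, 20)); // 2025-12-20 (months are 0-based)
const encoded = toYyyymmdd(published);

// Checks whether each encoding survives a round trip through single precision.
const dateIsExact = asFloat32(encoded) === encoded;
const timestamp = toUnixSeconds(published);
const timestampIsExact = asFloat32(timestamp) === timestamp;
```

If exact day or second granularity matters for your filters, it may be safer to split the value into smaller numbers such as year and month, as the sample does with year, and to test range queries around the boundaries.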

Trying Queries in Dev Tools

Now let's actually issue some queries and see if we can filter by metadata.

Exact Match Search

First, let's search for documents where document_type is report and is_public is false.

GET /bedrock-knowledge-base-default-index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "document_type.keyword": "report" } },
        { "term": { "is_public": false } }
      ]
    }
  }
}

Note that you need to add .keyword to string fields. As we saw in the mapping earlier, strings are stored as text type, so you need to use the keyword subfield for exact matching. If you forget this, you won't get the intended results.

Looking at the execution results, we got one hit for the annual report document as expected.

{
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "hits": [
      {
        "_source": {
          "x-amz-bedrock-kb-source-uri": "s3://sampleknowledgestack-docbucket.../annual-report-2024.txt",
          "year": 2024,
          "is_public": false,
          "priority": 95,
          "department": "finance",
          "AMAZON_BEDROCK_TEXT_CHUNK": "2024 Annual Report  Overview This report summarizes our business activities, financial situation, and future outlook for the 2024 fiscal year...",
          "document_type": "report"
        }
      }
    ]
  }
}
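The rule used in this query (add .keyword for strings, use the plain field name for numbers and booleans) can be captured in a small helper so that you don't forget it. This is just a sketch with hypothetical function names. It only builds the request body and leaves sending it to whatever client you use.

```typescript
type FilterValue = string | number | boolean;

interface TermClause {
  term: Record<string, FilterValue>;
}

// Builds one exact-match clause; string values target the .keyword subfield.
function termClause(field: string, value: FilterValue): TermClause {
  const target = typeof value === "string" ? field + ".keyword" : field;
  const term: Record<string, FilterValue> = {};
  term[target] = value;
  return { term };
}

// Combines the clauses under bool.filter, so every condition must match.
function buildMetadataFilter(conditions: Record<string, FilterValue>) {
  const filter = Object.keys(conditions).map((field) => termClause(field, conditions[field]));
  return { query: { bool: { filter } } };
}

// Produces the same body as the query shown above.
const reportQuery = buildMetadataFilter({ document_type: "report", is_public: false });
```

The helper decides based on the JavaScript type of the value, which matches the type mapping table in this article. If your index has string fields mapped differently, deriving the name from the actual mapping would be more robust.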


Filtering by Numeric Fields

Next, let's search for documents from 2024 using the year field.

GET /bedrock-knowledge-base-default-index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "year": 2024 } }
      ]
    }
  }
}

For numeric fields, .keyword is not needed. Metadata defined as NUMBER type is mapped as float type, so you can search directly with a number.


We successfully retrieved the document whose year is 2024.
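Numeric fields are also a natural fit for range conditions, which I did not run in this article. As a sketch under that assumption, the standard OpenSearch range query could be built like this. The function name and the threshold values (2023, 90) are made up for illustration.

```typescript
// Bounds accepted by the OpenSearch range query.
interface NumericRange {
  gte?: number;
  gt?: number;
  lte?: number;
  lt?: number;
}

// Builds a range clause for a numeric metadata field (no .keyword needed).
function rangeClause(field: string, bounds: NumericRange) {
  const range: Record<string, NumericRange> = {};
  range[field] = bounds;
  return { range };
}

// Documents from 2023 through 2024 with priority of at least 90.
const recentHighPriority = {
  query: {
    bool: {
      filter: [
        rangeClause("year", { gte: 2023, lte: 2024 }),
        rangeClause("priority", { gte: 90 }),
      ],
    },
  },
};
```

You can paste JSON.stringify(recentHighPriority) into Dev Tools as the body of a _search request to try it against your own index.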

Searching STRING_LIST

Let's also try fields like tags that are of STRING_LIST type. We'll search for documents where department is engineering and tags includes aws.

GET /bedrock-knowledge-base-default-index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "department.keyword": "engineering" } },
        { "term": { "tags.keyword": "aws" } }
      ]
    }
  }
}

STRING_LIST is also stored internally as an array of text type, so .keyword is needed for exact matching.
If any element in the array matches, it will be a hit, so we can search for "documents containing the aws tag."

{
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "hits": [
      {
        "_source": {
          "x-amz-bedrock-kb-source-uri": "s3://sampleknowledgestack-docbucket.../tech-blog-aws-bedrock.txt",
          "is_public": true,
          "priority": 70,
          "department": "engineering",
          "AMAZON_BEDROCK_TEXT_CHUNK": "Tech Blog: Getting Started with Generative AI on Amazon Bedrock  Introduction Amazon Bedrock is a fully managed generative AI service provided by AWS...",
          "document_type": "blog",
          "tags": ["aws", "bedrock", "generative-ai", "rag"]
        }
      }
    ]
  }
}
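Building on the single-tag search above, you may want "any of these tags" or "all of these tags". A minimal sketch follows, using the standard terms query for the former and one term clause per value for the latter. The function names are hypothetical, and I only ran the single-tag query in this article.

```typescript
// Matches documents whose list contains at least one of the given values.
function anyOf(field: string, values: string[]) {
  const terms: Record<string, string[]> = {};
  terms[field + ".keyword"] = values;
  return { terms };
}

// Matches documents whose list contains every given value.
// Returns one term clause per value, to be placed inside bool.filter.
function allOf(field: string, values: string[]) {
  return values.map((value) => {
    const term: Record<string, string> = {};
    term[field + ".keyword"] = value;
    return { term };
  });
}

// Documents tagged with aws OR rag.
const anyTagQuery = { query: { bool: { filter: [anyOf("tags", ["aws", "rag"])] } } };

// Documents tagged with aws AND rag.
const allTagsQuery = { query: { bool: { filter: allOf("tags", ["aws", "rag"]) } } };
```

Both variants rely on the same behavior described above: a clause hits when any element of the array matches.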


Aggregation Queries

OpenSearch also supports aggregation queries.
You can group documents and get statistical information.

GET /bedrock-knowledge-base-default-index/_search
{
  "size": 0,
  "aggs": {
    "by_type": {
      "terms": { "field": "document_type.keyword" }
    },
    "by_department": {
      "terms": { "field": "department.keyword" }
    },
    "avg_priority": {
      "avg": { "field": "priority" }
    }
  }
}

Specifying size: 0 returns only the aggregation results without the documents themselves.
This can be useful for getting information like counts by document type or department, or average priority.

Here's what the results look like:

{
  "took": 80,
  "timed_out": false,
  "_shards": {
    "total": 0,
    "successful": 0,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "avg_priority": {
      "value": 81.66666666666667
    },
    "by_department": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "engineering",
          "doc_count": 1
        },
        {
          "key": "finance",
          "doc_count": 1
        },
        {
          "key": "product",
          "doc_count": 1
        }
      ]
    },
    "by_type": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "blog",
          "doc_count": 1
        },
        {
          "key": "manual",
          "doc_count": 1
        },
        {
          "key": "report",
          "doc_count": 1
        }
      ]
    }
  }
}
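If you process this response in your own code, the buckets are usually easier to handle as a simple key-to-count object. Here is a sketch of that conversion. The interface and function names are my own, and byType is copied by hand from the by_type part of the response above.

```typescript
// Only the parts of a terms aggregation result that are used here.
interface TermsBucket {
  key: string;
  doc_count: number;
}

interface TermsAggregation {
  buckets: TermsBucket[];
}

// Converts terms aggregation buckets into a { key: count } object.
function bucketCounts(agg: TermsAggregation): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const bucket of agg.buckets) {
    counts[bucket.key] = bucket.doc_count;
  }
  return counts;
}

// Copied from the by_type aggregation in the response above.
const byType: TermsAggregation = {
  buckets: [
    { key: "blog", doc_count: 1 },
    { key: "manual", doc_count: 1 },
    { key: "report", doc_count: 1 },
  ],
};

const countsByType = bucketCounts(byType);
// countsByType is { blog: 1, manual: 1, report: 1 }
```

One thing to keep in mind: each record in the index holds one chunk (AMAZON_BEDROCK_TEXT_CHUNK), so doc_count should be read as a chunk count. In this sample each file appears to fit into a single chunk, which is why the counts line up with the number of files.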

Conclusion

Today we looked at how Bedrock Knowledge Bases metadata is stored in OpenSearch Serverless!

Knowledge Bases reads the contents of .metadata.json and automatically expands it into OpenSearch fields. It's important to note that you need to use the .keyword subfield for exact string matching.

For cases where you need to issue queries directly to OpenSearch, it's essential to understand how the data is structured.

This has been educational for me! I hope this article was helpful to you too! Thank you for reading to the end!!
