I tried Direct Ingestion of Amazon Bedrock Knowledge Base with S3 data source

2026.01.25

Introduction

Hello, I'm Kamino from the consulting department, and I'm a big fan of supermarkets.

When building RAG applications with Amazon Bedrock Knowledge Bases, have you ever run into situations where you want documents reflected in the Knowledge Base immediately, and frequently? I certainly ran into one recently.

You could simply trigger a synchronization job every time a file changes, but what happens when many people want to synchronize files at the same time? Say you want a file to be available to RAG the moment you upload it, yet hundreds of people are each updating files of their own: synchronization jobs would pile up in a queue and the experience would suffer.
Running batch jobs at fixed intervals is one approach, of course, but ideally only the specific file a user uploads would be synchronized.

Looking for a better solution, I found that Bedrock has a Direct Ingestion feature.

https://dev.classmethod.jp/articles/amazon-bedrock-custom-data-souece-and-direct-ingestion/

Using the IngestKnowledgeBaseDocuments API, you can directly import specified documents into Knowledge Bases without going through a full synchronization job.

This seems close to what we want, and it looks promising. Let's dig into the direct ingestion feature by performing create, update, and delete operations to deepen our understanding!

Prerequisites

Environment

I used the following environment for this demonstration.

Item Version/Value
aws-cdk-lib 2.235.1
Node.js 25.4.0
Region us-west-2

Direct Ingestion

Typically, documents are ingested into Knowledge Bases by uploading them to an S3 bucket and running a synchronization job via the StartIngestionJob API, after which Knowledge Bases scans, chunks, and vectorizes the documents.

With direct ingestion, you send documents directly via the IngestKnowledgeBaseDocuments API, and Knowledge Bases immediately performs chunking and vectorization. In other words, it bypasses the scanning process of the synchronization job, allowing for faster document reflection.

Both custom data sources and S3 data sources support direct ingestion, but we'll focus on S3 data sources for this demonstration.

Implementation

Building Knowledge Bases Environment with CDK

First, let's create a Knowledge Bases environment for testing with CDK.
We'll use S3 Vectors as the vector store and create an S3 type data source.

Full Code
lib/cdk-ingest-kb-stack.ts
import * as cdk from "aws-cdk-lib/core";
import { Construct } from "constructs";
import * as s3vectors from "aws-cdk-lib/aws-s3vectors";
import * as s3 from "aws-cdk-lib/aws-s3";
import * as bedrock from "aws-cdk-lib/aws-bedrock";
import * as iam from "aws-cdk-lib/aws-iam";

export interface CdkS3VectorsKbStackProps extends cdk.StackProps {
  /**
   * Vector dimension for embeddings (default: 1024 for Titan Embeddings V2)
   */
  vectorDimension?: number;
  /**
   * Embedding model ID (default: amazon.titan-embed-text-v2:0)
   */
  embeddingModelId?: string;
}

export class CdkIngestKbStack extends cdk.Stack {
  public readonly vectorBucket: s3vectors.CfnVectorBucket;
  public readonly vectorIndex: s3vectors.CfnIndex;
  public readonly knowledgeBase: bedrock.CfnKnowledgeBase;
  public readonly dataSourceBucket: s3.Bucket;

  constructor(scope: Construct, id: string, props?: CdkS3VectorsKbStackProps) {
    super(scope, id, props);

    const vectorDimension = props?.vectorDimension ?? 1024;
    const embeddingModelId =
      props?.embeddingModelId ?? "amazon.titan-embed-text-v2:0";
    const vectorBucketName = `vector-bucket-${cdk.Aws.ACCOUNT_ID}-${cdk.Aws.REGION}`;

    // ===========================================
    // S3 Vector Bucket
    // ===========================================
    this.vectorBucket = new s3vectors.CfnVectorBucket(this, "VectorBucket", {
      vectorBucketName: vectorBucketName,
    });

    // ===========================================
    // S3 Vector Index
    // ===========================================

    this.vectorIndex = new s3vectors.CfnIndex(this, "VectorIndex", {
      vectorBucketName: vectorBucketName,
      indexName: "kb-vector-index",
      dimension: vectorDimension,
      distanceMetric: "cosine",
      dataType: "float32",
    });
    this.vectorIndex.addDependency(this.vectorBucket);

    // ===========================================
    // Data Source S3 Bucket (for documents)
    // ===========================================
    this.dataSourceBucket = new s3.Bucket(this, "DataSourceBucket", {
      bucketName: `kb-datasource-${cdk.Aws.ACCOUNT_ID}-${cdk.Aws.REGION}`,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
      autoDeleteObjects: true,
    });

    // ===========================================
    // IAM Role for Knowledge Bases
    // ===========================================
    const knowledgeBaseRole = new iam.Role(this, "KnowledgeBaseRole", {
      assumedBy: new iam.ServicePrincipal("bedrock.amazonaws.com"),
      inlinePolicies: {
        BedrockKnowledgeBasePolicy: new iam.PolicyDocument({
          statements: [
            // S3 Vectors permissions
            new iam.PolicyStatement({
              effect: iam.Effect.ALLOW,
              actions: [
                "s3vectors:CreateIndex",
                "s3vectors:DeleteIndex",
                "s3vectors:GetIndex",
                "s3vectors:ListIndexes",
                "s3vectors:PutVectors",
                "s3vectors:GetVectors",
                "s3vectors:DeleteVectors",
                "s3vectors:QueryVectors",
                "s3vectors:ListVectors",
              ],
              resources: [
                // ARN format: arn:aws:s3vectors:REGION:ACCOUNT:bucket/BUCKET_NAME
                `arn:aws:s3vectors:${cdk.Aws.REGION}:${cdk.Aws.ACCOUNT_ID}:bucket/${vectorBucketName}`,
                `arn:aws:s3vectors:${cdk.Aws.REGION}:${cdk.Aws.ACCOUNT_ID}:bucket/${vectorBucketName}/index/*`,
              ],
            }),
            // S3 Data Source permissions
            new iam.PolicyStatement({
              effect: iam.Effect.ALLOW,
              actions: ["s3:GetObject", "s3:ListBucket"],
              resources: [
                this.dataSourceBucket.bucketArn,
                `${this.dataSourceBucket.bucketArn}/*`,
              ],
            }),
            // Bedrock Foundation Model permissions
            new iam.PolicyStatement({
              effect: iam.Effect.ALLOW,
              actions: ["bedrock:InvokeModel"],
              resources: [
                `arn:aws:bedrock:${cdk.Aws.REGION}::foundation-model/${embeddingModelId}`,
              ],
            }),
          ],
        }),
      },
    });

    // ===========================================
    // Bedrock Knowledge Bases with S3 Vectors
    // ===========================================
    this.knowledgeBase = new bedrock.CfnKnowledgeBase(this, "KnowledgeBase", {
      name: "S3VectorsKnowledgeBase",
      description: "Knowledge Bases using S3 Vectors as vector store",
      roleArn: knowledgeBaseRole.roleArn,
      knowledgeBaseConfiguration: {
        type: "VECTOR",
        vectorKnowledgeBaseConfiguration: {
          embeddingModelArn: `arn:aws:bedrock:${cdk.Aws.REGION}::foundation-model/${embeddingModelId}`,
        },
      },
      storageConfiguration: {
        type: "S3_VECTORS",
        s3VectorsConfiguration: {
          vectorBucketArn: this.vectorBucket.attrVectorBucketArn,
          indexName: this.vectorIndex.indexName!,
        },
      },
    });
    this.knowledgeBase.addDependency(this.vectorIndex);
    this.knowledgeBase.node.addDependency(knowledgeBaseRole);

    // ===========================================
    // Bedrock Data Source (S3 Type)
    // ===========================================
    const dataSource = new bedrock.CfnDataSource(this, "DataSource", {
      name: "S3DataSource",
      description: "S3 data source for knowledge base",
      knowledgeBaseId: this.knowledgeBase.attrKnowledgeBaseId,
      dataSourceConfiguration: {
        type: "S3",
        s3Configuration: {
          bucketArn: this.dataSourceBucket.bucketArn,
        },
      },
    });
    dataSource.addDependency(this.knowledgeBase);

    // ===========================================
    // Outputs
    // ===========================================
    new cdk.CfnOutput(this, "VectorBucketArn", {
      value: this.vectorBucket.attrVectorBucketArn,
      description: "ARN of the S3 Vector Bucket",
    });

    new cdk.CfnOutput(this, "VectorIndexArn", {
      value: this.vectorIndex.attrIndexArn,
      description: "ARN of the Vector Index",
    });

    new cdk.CfnOutput(this, "KnowledgeBaseId", {
      value: this.knowledgeBase.attrKnowledgeBaseId,
      description: "ID of the Bedrock Knowledge Bases",
    });

    new cdk.CfnOutput(this, "DataSourceId", {
      value: dataSource.attrDataSourceId,
      description: "ID of the S3 Data Source",
    });

    new cdk.CfnOutput(this, "DataSourceBucketName", {
      value: this.dataSourceBucket.bucketName,
      description: "Name of the S3 bucket for data source documents",
    });
  }
}

We're using S3 Vectors as our vector store because it's cost-effective, and we've set the data source type to S3.
We're using Titan Embeddings V2 as our Embedding Model.

This code creates an S3 bucket for the data source, Knowledge Bases, and S3 Vectors.

Deploy and Initial Synchronization

Let's deploy:

Deploy Command
pnpm dlx cdk deploy

After deployment completes, the Knowledge Base ID, Data Source ID, and bucket name will be output.

Deployment Output
CdkIngestKbStack.DataSourceBucketName = kb-datasource-123456789012-us-west-2
CdkIngestKbStack.DataSourceId = YYYYYYYYYY
CdkIngestKbStack.KnowledgeBaseId = XXXXXXXXXX
CdkIngestKbStack.VectorBucketArn = arn:aws:s3vectors:us-west-2:123456789012:bucket/vector-bucket-123456789012-us-west-2
CdkIngestKbStack.VectorIndexArn = arn:aws:s3vectors:us-west-2:123456789012:bucket/vector-bucket-123456789012-us-west-2/index/kb-vector-index

It's convenient to set these as environment variables for use in subsequent commands.

Setting Environment Variables
export KB_ID="XXXXXXXXXX"        # KnowledgeBaseId output value
export DS_ID="YYYYYYYYYY"        # DataSourceId output value
export BUCKET_NAME="kb-datasource-123456789012-us-west-2"  # DataSourceBucketName output value

Run the initial synchronization using the console or CLI:

Initial Synchronization Command
aws bedrock-agent start-ingestion-job \
  --knowledge-base-id "$KB_ID" \
  --data-source-id "$DS_ID"

Once complete, our foundation is ready, so let's proceed with testing!

Testing

Let's test common operations like creating, updating, and deleting documents!
Since we're assuming integration into an application, we'll use APIs rather than the console for our testing.

Document Creation

Let's start by using the IngestKnowledgeBaseDocuments API to ingest a document.
With an S3 data source, we specify files that already exist in S3 for ingestion.

First, upload a document to S3:

S3 Upload
echo "Amazon Bedrockは、主要なAI企業が提供する高性能な基盤モデルを単一のAPIで利用できるフルマネージドサービスです。" > bedrock-intro.txt

aws s3 cp bedrock-intro.txt s3://${BUCKET_NAME}/documents/bedrock-intro.txt

Next, run ingest-knowledge-base-documents to directly ingest the file into Knowledge Bases:

Execution Command
aws bedrock-agent ingest-knowledge-base-documents \
  --knowledge-base-id "$KB_ID" \
  --data-source-id "$DS_ID" \
  --documents '[
    {
      "content": {
        "dataSourceType": "S3",
        "s3": {
          "s3Location": {
            "uri": "s3://'"${BUCKET_NAME}"'/documents/bedrock-intro.txt"
          }
        }
      }
    }
  ]'

When executed, an asynchronous process starts and a response like the following is returned.
The status shows STARTING.

Execution Result
{
    "documentDetails": [
        {
            "knowledgeBaseId": "XXXXXXXXXX",
            "dataSourceId": "YYYYYYYYYY",
            "status": "STARTING",
            "identifier": {
                "dataSourceType": "S3",
                "s3": {
                    "uri": "s3://kb-datasource-123456789012-us-west-2/documents/bedrock-intro.txt"
                }
            },
            "updatedAt": "2026-01-25T12:21:37.536031+00:00"
        }
    ]
}

Ingestion is processed asynchronously, and the status transitions through these states:

Status Description
STARTING Ingestion process starting
IN_PROGRESS Chunking and vectorization in progress
INDEXED Successfully completed ingestion
FAILED Ingestion failed
PARTIALLY_INDEXED Only partially succeeded

You might wonder how to tell when ingestion has completed if you can't see its current status, but don't worry.
You can check the status using the GetKnowledgeBaseDocuments API with the S3 path as a key.

Let's run this after waiting a bit:

Status Check Command
aws bedrock-agent get-knowledge-base-documents \
  --knowledge-base-id "$KB_ID" \
  --data-source-id "$DS_ID" \
  --document-identifiers '[
    {
      "dataSourceType": "S3",
      "s3": {
        "uri": "s3://'"${BUCKET_NAME}"'/documents/bedrock-intro.txt"
      }
    }
  ]'

The execution returns a response in the same format as when we performed direct ingestion:

Execution Result
{
    "documentDetails": [
        {
            "knowledgeBaseId": "XXXXXXXXXX",
            "dataSourceId": "YYYYYYYYYY",
            "status": "INDEXED",
            "identifier": {
                "dataSourceType": "S3",
                "s3": {
                    "uri": "s3://kb-datasource-123456789012-us-west-2/documents/bedrock-intro.txt"
                }
            },
            "statusReason": "",
            "updatedAt": "2026-01-25T12:21:40.992274+00:00"
        }
    ]
}

The status is now INDEXED, unlike before, so the ingestion has completed successfully.
When integrating into an application, you'll need to account for asynchronous processing by periodically polling the status using GetKnowledgeBaseDocuments. You'll also want to consider completion detection and error handling.
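
For reference, here is a minimal polling sketch using the AWS SDK for JavaScript v3 (@aws-sdk/client-bedrock-agent). The helper name waitForIndexed and the interval and timeout values are assumptions for illustration, not part of the API.

Polling Sketch (TypeScript)
import {
  BedrockAgentClient,
  GetKnowledgeBaseDocumentsCommand,
} from "@aws-sdk/client-bedrock-agent";

const client = new BedrockAgentClient({ region: "us-west-2" });

// Poll GetKnowledgeBaseDocuments until the document reaches a terminal status.
async function waitForIndexed(
  knowledgeBaseId: string,
  dataSourceId: string,
  s3Uri: string,
  intervalMs = 5_000,
  timeoutMs = 300_000,
): Promise<string> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const res = await client.send(
      new GetKnowledgeBaseDocumentsCommand({
        knowledgeBaseId,
        dataSourceId,
        documentIdentifiers: [{ dataSourceType: "S3", s3: { uri: s3Uri } }],
      }),
    );
    const detail = res.documentDetails?.[0];
    // INDEXED means success; FAILED / PARTIALLY_INDEXED need error handling
    if (detail?.status === "INDEXED") return detail.status;
    if (detail?.status === "FAILED" || detail?.status === "PARTIALLY_INDEXED") {
      throw new Error(`Ingestion ended with ${detail.status}: ${detail.statusReason ?? ""}`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Timed out waiting for the document to reach INDEXED");
}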

Let's test this in the console by querying the knowledge base.
I'll simply ask: "Tell me about Bedrock."

[Screenshot: console test of the Knowledge Base answering "Tell me about Bedrock" from the ingested document]

It's responding with information from our uploaded document!

Document Update

Next, let's update the S3 file we created earlier:

S3 File Update
echo "Amazon Bedrockは、主要なAI企業が提供する高性能な基盤モデルを単一のAPIで利用できるフルマネージドサービスです。Claude、Titan、Mistralなど多数のモデルに対応しています。" > bedrock-intro.txt

aws s3 cp bedrock-intro.txt s3://${BUCKET_NAME}/documents/bedrock-intro.txt

Call IngestKnowledgeBaseDocuments again to update the document:

Execution Command
aws bedrock-agent ingest-knowledge-base-documents \
  --knowledge-base-id "$KB_ID" \
  --data-source-id "$DS_ID" \
  --documents '[
    {
      "content": {
        "dataSourceType": "S3",
        "s3": {
          "s3Location": {
            "uri": "s3://'"${BUCKET_NAME}"'/documents/bedrock-intro.txt"
          }
        }
      }
    }
  ]'

The response is returned in the same format as when creating. Timestamps are of course updated.

Execution Result
{
    "documentDetails": [
        {
            "knowledgeBaseId": "XXXXXXXXXX",
            "dataSourceId": "YYYYYYYYYY",
            "status": "STARTING",
            "identifier": {
                "dataSourceType": "S3",
                "s3": {
                    "uri": "s3://kb-datasource-123456789012-us-west-2/documents/bedrock-intro.txt"
                }
            },
            "updatedAt": "2026-01-25T12:40:40.567801+00:00"
        }
    ]
}

Because specifying the same S3 URI overwrites the existing document, updating the file on S3 and then running direct ingestion again is all it takes to update the content in Knowledge Bases.

Let's check this in the console again:

[Screenshot: console test returning the updated content]

The updated content is retrieved!

Document Deletion

Next, let's check how to delete a document.
To delete a document, use the DeleteKnowledgeBaseDocuments API:

Execution Command
aws bedrock-agent delete-knowledge-base-documents \
  --knowledge-base-id "$KB_ID" \
  --data-source-id "$DS_ID" \
  --document-identifiers '[
    {
      "dataSourceType": "S3",
      "s3": {
        "uri": "s3://'"${BUCKET_NAME}"'/documents/bedrock-intro.txt"
      }
    }
  ]'
Execution Result
{
    "documentDetails": [
        {
            "knowledgeBaseId": "XXXXXXXXXX",
            "dataSourceId": "YYYYYYYYYY",
            "status": "DELETING",
            "identifier": {
                "dataSourceType": "S3",
                "s3": {
                    "uri": "s3://kb-datasource-123456789012-us-west-2/documents/bedrock-intro.txt"
                }
            },
            "updatedAt": "2026-01-25T12:42:47.503248+00:00"
        }
    ]
}

The deletion process has started. Deletion is also processed asynchronously, so it will go through the DELETING status before completion.

Note that even if you delete from Knowledge Bases using direct ingestion, the file in the S3 bucket is not deleted.
It will be re-ingested during the next synchronization job, so you should also delete the file from the S3 bucket. Maintaining consistency is an important point to keep in mind.
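
For reference, here is a sketch of deleting a document from both places in one step, using the AWS SDK for JavaScript v3 (@aws-sdk/client-bedrock-agent and @aws-sdk/client-s3). The helper name deleteDocumentEverywhere is an assumption for illustration.

Deleting from Both Knowledge Bases and S3 (TypeScript)
import {
  BedrockAgentClient,
  DeleteKnowledgeBaseDocumentsCommand,
} from "@aws-sdk/client-bedrock-agent";
import { S3Client, DeleteObjectCommand } from "@aws-sdk/client-s3";

const bedrockAgent = new BedrockAgentClient({ region: "us-west-2" });
const s3 = new S3Client({ region: "us-west-2" });

async function deleteDocumentEverywhere(
  knowledgeBaseId: string,
  dataSourceId: string,
  bucket: string,
  key: string,
): Promise<void> {
  const uri = `s3://${bucket}/${key}`;

  // 1. Remove the document from Knowledge Bases (asynchronous: status becomes DELETING)
  await bedrockAgent.send(
    new DeleteKnowledgeBaseDocumentsCommand({
      knowledgeBaseId,
      dataSourceId,
      documentIdentifiers: [{ dataSourceType: "S3", s3: { uri } }],
    }),
  );

  // 2. Also delete the source object so the next sync job does not re-ingest it
  await s3.send(new DeleteObjectCommand({ Bucket: bucket, Key: key }));
}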

Let's check the result:

[Screenshot: console test returning no answer after the document was deleted]

Since there are no synchronized documents anymore, it can't provide an answer. That's as expected.

Batch Processing Multiple Documents

You can process up to 10 documents at once via the console, or 25 documents via API in a single call.
First, let's upload multiple documents to S3:

Multiple File Upload
echo "AWS Lambdaはサーバーレスコンピューティングサービスです。" > lambda.txt
echo "Amazon S3はオブジェクトストレージサービスです。" > s3.txt
echo "Amazon DynamoDBはフルマネージドNoSQLデータベースです。" > dynamodb.txt

aws s3 cp lambda.txt s3://${BUCKET_NAME}/documents/
aws s3 cp s3.txt s3://${BUCKET_NAME}/documents/
aws s3 cp dynamodb.txt s3://${BUCKET_NAME}/documents/

After uploading, execute ingest-knowledge-base-documents just as we did for a single file.
It's simple: just list multiple entries in the documents array.

Execution Command
aws bedrock-agent ingest-knowledge-base-documents \
  --knowledge-base-id "$KB_ID" \
  --data-source-id "$DS_ID" \
  --documents '[
    {
      "content": {
        "dataSourceType": "S3",
        "s3": {
          "s3Location": {
            "uri": "s3://'"${BUCKET_NAME}"'/documents/lambda.txt"
          }
        }
      }
    },
    {
      "content": {
        "dataSourceType": "S3",
        "s3": {
          "s3Location": {
            "uri": "s3://'"${BUCKET_NAME}"'/documents/s3.txt"
          }
        }
      }
    },
    {
      "content": {
        "dataSourceType": "S3",
        "s3": {
          "s3Location": {
            "uri": "s3://'"${BUCKET_NAME}"'/documents/dynamodb.txt"
          }
        }
      }
    }
  ]'

The execution returns multiple statuses:

Execution Result
{
    "documentDetails": [
        {
            "knowledgeBaseId": "XXXXXXXXXX",
            "dataSourceId": "YYYYYYYYYY",
            "status": "STARTING",
            "identifier": {
                "dataSourceType": "S3",
                "s3": {
                    "uri": "s3://kb-datasource-123456789012-us-west-2/documents/lambda.txt"
                }
            },
            "updatedAt": "2026-01-25T12:46:42.231326+00:00"
        },
        {
            "knowledgeBaseId": "XXXXXXXXXX",
            "dataSourceId": "YYYYYYYYYY",
            "status": "STARTING",
            "identifier": {
                "dataSourceType": "S3",
                "s3": {
                    "uri": "s3://kb-datasource-123456789012-us-west-2/documents/s3.txt"
                }
            },
            "updatedAt": "2026-01-25T12:46:42.252964+00:00"
        },
        {
            "knowledgeBaseId": "XXXXXXXXXX",
            "dataSourceId": "YYYYYYYYYY",
            "status": "STARTING",
            "identifier": {
                "dataSourceType": "S3",
                "s3": {
                    "uri": "s3://kb-datasource-123456789012-us-west-2/documents/dynamodb.txt"
                }
            },
            "updatedAt": "2026-01-25T12:46:42.275428+00:00"
        }
    ]
}

Let's try asking "Tell me about Lambda":

[Screenshot: console test answering "Tell me about Lambda" from lambda.txt]

Multiple files have been properly ingested without any issues!

Ensuring Idempotency with clientToken

When implementing direct ingestion processing in applications, you often need to implement retry logic for network errors or timeouts, right?
However, if the first request actually succeeded, a retry could end up processing the same document more than once.

To prevent such problems, the clientToken parameter is provided.

https://docs.aws.amazon.com/ja_jp/bedrock/latest/APIReference/API_agent_IngestKnowledgeBaseDocuments.html

Let's create a file and upload it as a test.

Creating a file and uploading to S3
echo "Amazon CloudWatchはAWSリソースの監視サービスです。" > cloudwatch.txt

aws s3 cp cloudwatch.txt s3://${BUCKET_NAME}/documents/

Let's execute direct ingestion with the clientToken specified.

Execute with clientToken specified
aws bedrock-agent ingest-knowledge-base-documents \
  --knowledge-base-id "$KB_ID" \
  --data-source-id "$DS_ID" \
  --client-token "user123-cloudwatch-20260125124500" \
  --documents '[
    {
      "content": {
        "dataSourceType": "S3",
        "s3": {
          "s3Location": {
            "uri": "s3://'"${BUCKET_NAME}"'/documents/cloudwatch.txt"
          }
        }
      }
    }
  ]'

If you send the same request again right after the first, you get the response below: the status is IN_PROGRESS rather than STARTING, because it is treated as the same request that is already being processed.

Execution result
{
    "documentDetails": [
        {
            "knowledgeBaseId": "XXXXXXXXXX",
            "dataSourceId": "YYYYYYYYYY",
            "status": "IN_PROGRESS",
            "identifier": {
                "dataSourceType": "S3",
                "s3": {
                    "uri": "s3://kb-datasource-123456789012-us-west-2/documents/cloudwatch.txt"
                }
            },
            "statusReason": "",
            "updatedAt": "2026-01-25T13:01:11.857718+00:00"
        }
    ]
}

When requests are sent with the same clientToken, only the first one is actually processed; subsequent ones are not executed redundantly.
For example, when implementing an API that lets users upload documents, building a unique token from a combination like user ID + file name + timestamp makes retries safer (a sketch follows the parameter table below).

Parameter Description
clientToken A unique string of 33-256 characters. Requests with the same token are idempotent
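
Here is a sketch of what that might look like in an application, again with the AWS SDK for JavaScript v3. The token format (user ID + sanitized file name + timestamp), the retry loop, and the helper name ingestWithRetry are assumptions for illustration; the key point is that every retry reuses the same clientToken.

Idempotent Retry Sketch (TypeScript)
import {
  BedrockAgentClient,
  IngestKnowledgeBaseDocumentsCommand,
} from "@aws-sdk/client-bedrock-agent";

const client = new BedrockAgentClient({ region: "us-west-2" });

async function ingestWithRetry(
  knowledgeBaseId: string,
  dataSourceId: string,
  s3Uri: string,
  userId: string,
  fileName: string,
  maxAttempts = 3,
): Promise<void> {
  // Build the token once per logical upload and reuse it for every retry.
  // Keep it to alphanumerics/hyphens and pad to satisfy the 33-character minimum.
  const safeName = fileName.replace(/[^a-zA-Z0-9]/g, "-");
  const clientToken = `${userId}-${safeName}-${Date.now()}`.padEnd(33, "0");

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await client.send(
        new IngestKnowledgeBaseDocumentsCommand({
          knowledgeBaseId,
          dataSourceId,
          clientToken,
          documents: [
            {
              content: {
                dataSourceType: "S3",
                s3: { s3Location: { uri: s3Uri } },
              },
            },
          ],
        }),
      );
      return; // accepted; a duplicate with the same token is not processed again
    } catch (err) {
      if (attempt === maxAttempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, 1_000 * attempt)); // simple backoff
    }
  }
}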

About Adding Metadata

With S3 data sources, there are restrictions on how to specify metadata.
You cannot specify metadata inline in the request body.

https://docs.aws.amazon.com/bedrock/latest/userguide/kb-direct-ingestion-add.html

Data Source Type    Inline specification    S3 location specification
CUSTOM              ✓                       ✓
S3                  ×                       ✓

If you want to add metadata to an S3 data source with direct ingestion, you upload a metadata file (.metadata.json) to S3 and set the document's metadata.type to S3_LOCATION, pointing it at that file's URI.

Let's try it. First, let's create metadata and upload it to S3.

Creating and uploading a metadata file
cat << 'EOF' > bedrock-intro.txt.metadata.json
{
  "metadataAttributes": {
    "category": "aws-service",
    "year": 2023
  }
}
EOF

aws s3 cp bedrock-intro.txt.metadata.json s3://${BUCKET_NAME}/documents/

Once the upload is complete, run the ingestion again with parameters that reference the metadata file.

Direct ingestion with metadata
aws bedrock-agent ingest-knowledge-base-documents \
  --knowledge-base-id "$KB_ID" \
  --data-source-id "$DS_ID" \
  --documents '[
    {
      "content": {
        "dataSourceType": "S3",
        "s3": {
          "s3Location": {
            "uri": "s3://'"${BUCKET_NAME}"'/documents/bedrock-intro.txt"
          }
        }
      },
      "metadata": {
        "type": "S3_LOCATION",
        "s3Location": {
          "uri": "s3://'"${BUCKET_NAME}"'/documents/bedrock-intro.txt.metadata.json"
        }
      }
    }
  ]'

Let's check the response.

Execution result
{
    "documentDetails": [
        {
            "knowledgeBaseId": "XXXXXXXXXX",
            "dataSourceId": "YYYYYYYYYY",
            "status": "STARTING",
            "identifier": {
                "dataSourceType": "S3",
                "s3": {
                    "uri": "s3://kb-datasource-123456789012-us-west-2/documents/bedrock-intro.txt"
                }
            },
            "updatedAt": "2026-01-25T13:14:29.316549+00:00"
        }
    ]
}

The response doesn't show whether metadata has been applied.

Let's try searching with metadata filtering.
You can specify filter conditions with the --retrieval-configuration parameter in the retrieve command.

Let's search for documents where category is aws-service.

Search with metadata filtering
aws bedrock-agent-runtime retrieve \
  --knowledge-base-id "$KB_ID" \
  --retrieval-query '{"text": "AWSサービスについて教えて"}' \
  --retrieval-configuration '{
    "vectorSearchConfiguration": {
      "filter": {
        "equals": {
          "key": "category",
          "value": "aws-service"
        }
      }
    }
  }'

Let's check the execution result.

Execution result
{
    "retrievalResults": [
        {
            "content": {
                "text": "Amazon Bedrockは、主要なAI企業が提供する高性能な基盤モデルを単一のAPIで利用できるフルマネージドサービスです。Claude、Titan、Mistralなど多数のモデルに対応しています。",
                "type": "TEXT"
            },
            "location": {
                "s3Location": {
                    "uri": "s3://kb-datasource-123456789012-us-west-2/documents/bedrock-intro.txt"
                },
                "type": "S3"
            },
            "metadata": {
                "x-amz-bedrock-kb-source-file-modality": "TEXT",
                "category": "aws-service",
                "year": 2023.0,
                "x-amz-bedrock-kb-chunk-id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
                "x-amz-bedrock-kb-data-source-id": "YYYYYYYYYY"
            },
            "score": 0.592372715473175
        }
    ],
    "guardrailAction": null
}

We can see that the metadata we added is reflected and retrieved!
Let's also confirm that nothing matches when we filter on year = 2024.

Filtering with year=2024
aws bedrock-agent-runtime retrieve \
  --knowledge-base-id "$KB_ID" \
  --retrieval-query '{"text": "AWSサービスについて教えて"}' \
  --retrieval-configuration '{
    "vectorSearchConfiguration": {
      "filter": {
        "equals": {
          "key": "year",
          "value": "2024"
        }
      }
    }
  }'

Let's check the execution result.

Execution result
{
    "retrievalResults": [],
    "guardrailAction": null
}

Nothing is retrieved! We've confirmed that metadata filtering is working.

Notes on Using Direct Ingestion with S3 Data Sources

When using direct ingestion with S3 data sources, keep in mind that removing documents from Knowledge Bases does not delete the corresponding files in S3, and the next synchronization job will rebuild the Knowledge Base contents from whatever is in the bucket at that point.

Since S3 and the Knowledge Base are not kept in sync automatically, you need a workflow that maintains consistency between them: add, update, or delete files in the S3 bucket, then immediately reflect the same change in Knowledge Bases via the direct ingestion APIs. It's also worth having a rollback mechanism to restore consistency when an error occurs.
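
As one possible shape for that workflow, here is a rough sketch with the AWS SDK for JavaScript v3 that uploads a file to S3, runs direct ingestion, and deletes the uploaded object again if ingestion cannot be started. The helper name uploadAndIngest and the rollback strategy are assumptions for illustration, not a prescribed pattern.

Upload-and-Ingest Sketch with Rollback (TypeScript)
import {
  BedrockAgentClient,
  IngestKnowledgeBaseDocumentsCommand,
} from "@aws-sdk/client-bedrock-agent";
import { S3Client, PutObjectCommand, DeleteObjectCommand } from "@aws-sdk/client-s3";

const bedrockAgent = new BedrockAgentClient({ region: "us-west-2" });
const s3 = new S3Client({ region: "us-west-2" });

async function uploadAndIngest(
  knowledgeBaseId: string,
  dataSourceId: string,
  bucket: string,
  key: string,
  body: string,
): Promise<void> {
  // 1. Put the document into the data source bucket first
  await s3.send(new PutObjectCommand({ Bucket: bucket, Key: key, Body: body }));

  try {
    // 2. Immediately reflect it in Knowledge Bases via direct ingestion
    await bedrockAgent.send(
      new IngestKnowledgeBaseDocumentsCommand({
        knowledgeBaseId,
        dataSourceId,
        documents: [
          {
            content: {
              dataSourceType: "S3",
              s3: { s3Location: { uri: `s3://${bucket}/${key}` } },
            },
          },
        ],
      }),
    );
  } catch (err) {
    // 3. Roll back the S3 upload so the bucket and the Knowledge Base stay consistent
    await s3.send(new DeleteObjectCommand({ Bucket: bucket, Key: key }));
    throw err;
  }
}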

Conclusion

Direct ingestion with S3 data sources requires attention to consistency between S3 and Knowledge Bases, but being able to reflect documents immediately, without waiting for a sync job, is a real benefit.
It seems particularly useful for RAG applications that need near-real-time updates, or for use cases where documents are updated frequently. I have a use case coming up where I want this, so I plan to incorporate it into my own application.

I hope this article has been helpful. Thank you for reading until the end!!
