I tried Direct Ingestion of Amazon Bedrock Knowledge Base with S3 data source
Introduction
Hello, I'm Kamino from the consulting department, and I'm a big fan of supermarkets.
When building RAG applications with Amazon Bedrock Knowledge Bases, have you ever wanted documents to be reflected in Knowledge Bases immediately, and updated frequently? I certainly ran into this myself recently.
You could simply trigger synchronization every time a file changes, but what about cases where many people want to synchronize large volumes at the same time? For example, you want files to be incorporated into RAG as soon as they are uploaded, but if hundreds of people are each updating files, synchronization jobs pile up in a queue and the experience suffers.
Of course, running batch jobs at fixed intervals is one approach, but ideally only the specific file a user uploads would be synchronized.
Looking for a better solution, I found that Bedrock has a Direct Ingestion feature.
Using the IngestKnowledgeBaseDocuments API, you can directly import specified documents into Knowledge Bases without going through a full synchronization job.
This seems close to what we want, and it looks promising. Let's dive into the direct ingestion feature by performing create, update, and delete operations to deepen our understanding!!
Prerequisites
Environment
I used the following environment for this demonstration.
| Item | Version/Value |
|---|---|
| aws-cdk-lib | 2.235.1 |
| Node.js | 25.4.0 |
| Region | us-west-2 |
Direct Ingestion
Typically, documents are ingested into Knowledge Bases by uploading them to an S3 bucket and executing a synchronization job via the StartIngestionJob API, after which Knowledge Bases scans, chunks, and vectorizes the documents.
With direct ingestion, you send documents directly via the IngestKnowledgeBaseDocuments API, and Knowledge Bases immediately performs chunking and vectorization. In other words, it bypasses the scanning process of the synchronization job, allowing for faster document reflection.
Both custom data sources and S3 data sources support direct ingestion, but we'll focus on S3 data sources for this demonstration.
Implementation
Building Knowledge Bases Environment with CDK
First, let's create a Knowledge Bases environment for testing with CDK.
We'll use S3 Vectors as the vector store and create an S3 type data source.
Full Code
import * as cdk from "aws-cdk-lib/core";
import { Construct } from "constructs";
import * as s3vectors from "aws-cdk-lib/aws-s3vectors";
import * as s3 from "aws-cdk-lib/aws-s3";
import * as bedrock from "aws-cdk-lib/aws-bedrock";
import * as iam from "aws-cdk-lib/aws-iam";
export interface CdkS3VectorsKbStackProps extends cdk.StackProps {
/**
* Vector dimension for embeddings (default: 1024 for Titan Embeddings V2)
*/
vectorDimension?: number;
/**
* Embedding model ID (default: amazon.titan-embed-text-v2:0)
*/
embeddingModelId?: string;
}
export class CdkIngestKbStack extends cdk.Stack {
public readonly vectorBucket: s3vectors.CfnVectorBucket;
public readonly vectorIndex: s3vectors.CfnIndex;
public readonly knowledgeBase: bedrock.CfnKnowledgeBase;
public readonly dataSourceBucket: s3.Bucket;
constructor(scope: Construct, id: string, props?: CdkS3VectorsKbStackProps) {
super(scope, id, props);
const vectorDimension = props?.vectorDimension ?? 1024;
const embeddingModelId =
props?.embeddingModelId ?? "amazon.titan-embed-text-v2:0";
const vectorBucketName = `vector-bucket-${cdk.Aws.ACCOUNT_ID}-${cdk.Aws.REGION}`;
// ===========================================
// S3 Vector Bucket
// ===========================================
this.vectorBucket = new s3vectors.CfnVectorBucket(this, "VectorBucket", {
vectorBucketName: vectorBucketName,
});
// ===========================================
// S3 Vector Index
// ===========================================
this.vectorIndex = new s3vectors.CfnIndex(this, "VectorIndex", {
vectorBucketName: vectorBucketName,
indexName: "kb-vector-index",
dimension: vectorDimension,
distanceMetric: "cosine",
dataType: "float32",
});
this.vectorIndex.addDependency(this.vectorBucket);
// ===========================================
// Data Source S3 Bucket (for documents)
// ===========================================
this.dataSourceBucket = new s3.Bucket(this, "DataSourceBucket", {
bucketName: `kb-datasource-${cdk.Aws.ACCOUNT_ID}-${cdk.Aws.REGION}`,
removalPolicy: cdk.RemovalPolicy.DESTROY,
autoDeleteObjects: true,
});
// ===========================================
// IAM Role for Knowledge Bases
// ===========================================
const knowledgeBaseRole = new iam.Role(this, "KnowledgeBaseRole", {
assumedBy: new iam.ServicePrincipal("bedrock.amazonaws.com"),
inlinePolicies: {
BedrockKnowledgeBasePolicy: new iam.PolicyDocument({
statements: [
// S3 Vectors permissions
new iam.PolicyStatement({
effect: iam.Effect.ALLOW,
actions: [
"s3vectors:CreateIndex",
"s3vectors:DeleteIndex",
"s3vectors:GetIndex",
"s3vectors:ListIndexes",
"s3vectors:PutVectors",
"s3vectors:GetVectors",
"s3vectors:DeleteVectors",
"s3vectors:QueryVectors",
"s3vectors:ListVectors",
],
resources: [
// ARN format: arn:aws:s3vectors:REGION:ACCOUNT:bucket/BUCKET_NAME
`arn:aws:s3vectors:${cdk.Aws.REGION}:${cdk.Aws.ACCOUNT_ID}:bucket/${vectorBucketName}`,
`arn:aws:s3vectors:${cdk.Aws.REGION}:${cdk.Aws.ACCOUNT_ID}:bucket/${vectorBucketName}/index/*`,
],
}),
// S3 Data Source permissions
new iam.PolicyStatement({
effect: iam.Effect.ALLOW,
actions: ["s3:GetObject", "s3:ListBucket"],
resources: [
this.dataSourceBucket.bucketArn,
`${this.dataSourceBucket.bucketArn}/*`,
],
}),
// Bedrock Foundation Model permissions
new iam.PolicyStatement({
effect: iam.Effect.ALLOW,
actions: ["bedrock:InvokeModel"],
resources: [
`arn:aws:bedrock:${cdk.Aws.REGION}::foundation-model/${embeddingModelId}`,
],
}),
],
}),
},
});
// ===========================================
// Bedrock Knowledge Bases with S3 Vectors
// ===========================================
this.knowledgeBase = new bedrock.CfnKnowledgeBase(this, "KnowledgeBase", {
name: "S3VectorsKnowledgeBase",
description: "Knowledge Bases using S3 Vectors as vector store",
roleArn: knowledgeBaseRole.roleArn,
knowledgeBaseConfiguration: {
type: "VECTOR",
vectorKnowledgeBaseConfiguration: {
embeddingModelArn: `arn:aws:bedrock:${cdk.Aws.REGION}::foundation-model/${embeddingModelId}`,
},
},
storageConfiguration: {
type: "S3_VECTORS",
s3VectorsConfiguration: {
vectorBucketArn: this.vectorBucket.attrVectorBucketArn,
indexName: this.vectorIndex.indexName!,
},
},
});
this.knowledgeBase.addDependency(this.vectorIndex);
this.knowledgeBase.node.addDependency(knowledgeBaseRole);
// ===========================================
// Bedrock Data Source (S3 Type)
// ===========================================
const dataSource = new bedrock.CfnDataSource(this, "DataSource", {
name: "S3DataSource",
description: "S3 data source for knowledge base",
knowledgeBaseId: this.knowledgeBase.attrKnowledgeBaseId,
dataSourceConfiguration: {
type: "S3",
s3Configuration: {
bucketArn: this.dataSourceBucket.bucketArn,
},
},
});
dataSource.addDependency(this.knowledgeBase);
// ===========================================
// Outputs
// ===========================================
new cdk.CfnOutput(this, "VectorBucketArn", {
value: this.vectorBucket.attrVectorBucketArn,
description: "ARN of the S3 Vector Bucket",
});
new cdk.CfnOutput(this, "VectorIndexArn", {
value: this.vectorIndex.attrIndexArn,
description: "ARN of the Vector Index",
});
new cdk.CfnOutput(this, "KnowledgeBaseId", {
value: this.knowledgeBase.attrKnowledgeBaseId,
description: "ID of the Bedrock Knowledge Bases",
});
new cdk.CfnOutput(this, "DataSourceId", {
value: dataSource.attrDataSourceId,
description: "ID of the S3 Data Source",
});
new cdk.CfnOutput(this, "DataSourceBucketName", {
value: this.dataSourceBucket.bucketName,
description: "Name of the S3 bucket for data source documents",
});
}
}
We're using S3 Vectors as our vector store because it's cost-effective, and we've set the data source type to S3.
We're using Titan Embeddings V2 as our Embedding Model.
This code creates an S3 bucket for the data source, Knowledge Bases, and S3 Vectors.
Deploy and Initial Synchronization
Let's deploy:
pnpm dlx cdk deploy
After deployment completes, the Knowledge Base ID, Data Source ID, and bucket name will be output.
CdkIngestKbStack.DataSourceBucketName = kb-datasource-123456789012-us-west-2
CdkIngestKbStack.DataSourceId = YYYYYYYYYY
CdkIngestKbStack.KnowledgeBaseId = XXXXXXXXXX
CdkIngestKbStack.VectorBucketArn = arn:aws:s3vectors:us-west-2:123456789012:bucket/vector-bucket-123456789012-us-west-2
CdkIngestKbStack.VectorIndexArn = arn:aws:s3vectors:us-west-2:123456789012:bucket/vector-bucket-123456789012-us-west-2/index/kb-vector-index
It's convenient to set these as environment variables for use in subsequent commands.
export KB_ID="XXXXXXXXXX" # KnowledgeBaseId output value
export DS_ID="YYYYYYYYYY" # DataSourceId output value
export BUCKET_NAME="kb-datasource-123456789012-us-west-2" # DataSourceBucketName output value
Run the initial synchronization using the console or CLI:
aws bedrock-agent start-ingestion-job \
--knowledge-base-id "$KB_ID" \
--data-source-id "$DS_ID"
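If you prefer to start and monitor the initial sync from application code instead, here is a minimal sketch using the AWS SDK for JavaScript v3. It assumes the @aws-sdk/client-bedrock-agent package and the KB_ID/DS_ID values from the stack outputs; the function name and polling interval are illustrative:
import {
  BedrockAgentClient,
  StartIngestionJobCommand,
  GetIngestionJobCommand,
} from "@aws-sdk/client-bedrock-agent";

const client = new BedrockAgentClient({ region: "us-west-2" });

async function syncDataSource(knowledgeBaseId: string, dataSourceId: string) {
  // Start the synchronization job (equivalent to start-ingestion-job in the CLI)
  const { ingestionJob } = await client.send(
    new StartIngestionJobCommand({ knowledgeBaseId, dataSourceId })
  );
  const ingestionJobId = ingestionJob!.ingestionJobId!;

  // Poll until the job reaches a terminal state
  let status = ingestionJob!.status;
  while (status === "STARTING" || status === "IN_PROGRESS") {
    await new Promise((resolve) => setTimeout(resolve, 5000));
    const { ingestionJob: job } = await client.send(
      new GetIngestionJobCommand({ knowledgeBaseId, dataSourceId, ingestionJobId })
    );
    status = job!.status;
  }
  console.log(`Ingestion job ${ingestionJobId} finished with status ${status}`);
}

syncDataSource(process.env.KB_ID!, process.env.DS_ID!);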
Once complete, our foundation is ready, so let's proceed with testing!
Testing
Let's test common operations like creating, updating, and deleting documents!
Since we're assuming integration into an application, we'll use APIs rather than the console for our testing.
Document Creation
First, let's use the IngestKnowledgeBaseDocuments API to ingest a document.
For S3 data sources, we specify S3 files for ingestion.
First, upload a document to S3:
echo "Amazon Bedrockは、主要なAI企業が提供する高性能な基盤モデルを単一のAPIで利用できるフルマネージドサービスです。" > bedrock-intro.txt
aws s3 cp bedrock-intro.txt s3://${BUCKET_NAME}/documents/bedrock-intro.txt
Next, execute direct ingestion using the ingest-knowledge-base-documents API to ingest it into Knowledge Bases:
aws bedrock-agent ingest-knowledge-base-documents \
--knowledge-base-id "$KB_ID" \
--data-source-id "$DS_ID" \
--documents '[
{
"content": {
"dataSourceType": "S3",
"s3": {
"s3Location": {
"uri": "s3://'"${BUCKET_NAME}"'/documents/bedrock-intro.txt"
}
}
}
}
]'
When executed, an asynchronous process starts and a response like the following is returned. The status shows STARTING.
{
"documentDetails": [
{
"knowledgeBaseId": "XXXXXXXXXX",
"dataSourceId": "YYYYYYYYYY",
"status": "STARTING",
"identifier": {
"dataSourceType": "S3",
"s3": {
"uri": "s3://kb-datasource-123456789012-us-west-2/documents/bedrock-intro.txt"
}
},
"updatedAt": "2026-01-25T12:21:37.536031+00:00"
}
]
}
Ingestion is processed asynchronously, and the status transitions through these states:
| Status | Description |
|---|---|
| STARTING | Ingestion process starting |
| IN_PROGRESS | Chunking and vectorization in progress |
| INDEXED | Ingestion completed successfully |
| FAILED | Ingestion failed |
| PARTIALLY_INDEXED | Only partially succeeded |
You might be wondering how to tell when ingestion has completed if you can't see the current status, but don't worry.
You can check the status using the GetKnowledgeBaseDocuments API with the S3 path as a key.
Let's run this after waiting a bit:
aws bedrock-agent get-knowledge-base-documents \
--knowledge-base-id "$KB_ID" \
--data-source-id "$DS_ID" \
--document-identifiers '[
{
"dataSourceType": "S3",
"s3": {
"uri": "s3://'"${BUCKET_NAME}"'/documents/bedrock-intro.txt"
}
}
]'
The execution returns a response in the same format as when we performed direct ingestion:
{
"documentDetails": [
{
"knowledgeBaseId": "XXXXXXXXXX",
"dataSourceId": "YYYYYYYYYY",
"status": "INDEXED",
"identifier": {
"dataSourceType": "S3",
"s3": {
"uri": "s3://kb-datasource-123456789012-us-west-2/documents/bedrock-intro.txt"
}
},
"statusReason": "",
"updatedAt": "2026-01-25T12:21:40.992274+00:00"
}
]
}
The status is now INDEXED, different from before, so the ingestion has completed successfully.
When integrating into an application, you'll need to account for asynchronous processing by periodically polling the status using GetKnowledgeBaseDocuments. You'll also want to consider completion detection and error handling.
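Here is a minimal polling sketch along those lines, again using the AWS SDK for JavaScript v3 (@aws-sdk/client-bedrock-agent). The function name, retry count, and wait interval are illustrative assumptions, not fixed values:
import {
  BedrockAgentClient,
  IngestKnowledgeBaseDocumentsCommand,
  GetKnowledgeBaseDocumentsCommand,
} from "@aws-sdk/client-bedrock-agent";

const client = new BedrockAgentClient({ region: "us-west-2" });

async function ingestAndWait(knowledgeBaseId: string, dataSourceId: string, uri: string) {
  // Start direct ingestion for a single S3 document
  await client.send(
    new IngestKnowledgeBaseDocumentsCommand({
      knowledgeBaseId,
      dataSourceId,
      documents: [{ content: { dataSourceType: "S3", s3: { s3Location: { uri } } } }],
    })
  );

  // Poll GetKnowledgeBaseDocuments until a terminal status is reached
  for (let attempt = 0; attempt < 30; attempt++) {
    const { documentDetails } = await client.send(
      new GetKnowledgeBaseDocumentsCommand({
        knowledgeBaseId,
        dataSourceId,
        documentIdentifiers: [{ dataSourceType: "S3", s3: { uri } }],
      })
    );
    const detail = documentDetails?.[0];
    if (detail?.status === "INDEXED") return;
    if (detail?.status === "FAILED" || detail?.status === "PARTIALLY_INDEXED") {
      throw new Error(`Ingestion ended with status ${detail.status}: ${detail.statusReason ?? ""}`);
    }
    await new Promise((resolve) => setTimeout(resolve, 3000));
  }
  throw new Error(`Timed out waiting for ${uri} to be indexed`);
}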
Let's test this in the console by querying the knowledge base.
I'll simply ask: "Tell me about Bedrock."

It's responding with information from our uploaded document!
Document Update
Next, let's update the S3 file we created earlier:
echo "Amazon Bedrockは、主要なAI企業が提供する高性能な基盤モデルを単一のAPIで利用できるフルマネージドサービスです。Claude、Titan、Mistralなど多数のモデルに対応しています。" > bedrock-intro.txt
aws s3 cp bedrock-intro.txt s3://${BUCKET_NAME}/documents/bedrock-intro.txt
Call IngestKnowledgeBaseDocuments again to update the document:
aws bedrock-agent ingest-knowledge-base-documents \
--knowledge-base-id "$KB_ID" \
--data-source-id "$DS_ID" \
--documents '[
{
"content": {
"dataSourceType": "S3",
"s3": {
"s3Location": {
"uri": "s3://'"${BUCKET_NAME}"'/documents/bedrock-intro.txt"
}
}
}
}
]'
The response is returned in the same format as when creating. Timestamps are of course updated.
{
"documentDetails": [
{
"knowledgeBaseId": "XXXXXXXXXX",
"dataSourceId": "YYYYYYYYYY",
"status": "STARTING",
"identifier": {
"dataSourceType": "S3",
"s3": {
"uri": "s3://kb-datasource-123456789012-us-west-2/documents/bedrock-intro.txt"
}
},
"updatedAt": "2026-01-25T12:40:40.567801+00:00"
}
]
}
Specifying the same S3 URI overwrites the existing document, so updating the file on S3 and then executing direct ingestion updates the content in Knowledge Bases.
Let's check this in the console again:

The updated content is retrieved!
Document Deletion
Next, let's check how to delete a document.
To delete a document, use the DeleteKnowledgeBaseDocuments API:
aws bedrock-agent delete-knowledge-base-documents \
--knowledge-base-id "$KB_ID" \
--data-source-id "$DS_ID" \
--document-identifiers '[
{
"dataSourceType": "S3",
"s3": {
"uri": "s3://'"${BUCKET_NAME}"'/documents/bedrock-intro.txt"
}
}
]'
{
"documentDetails": [
{
"knowledgeBaseId": "XXXXXXXXXX",
"dataSourceId": "YYYYYYYYYY",
"status": "DELETING",
"identifier": {
"dataSourceType": "S3",
"s3": {
"uri": "s3://kb-datasource-123456789012-us-west-2/documents/bedrock-intro.txt"
}
},
"updatedAt": "2026-01-25T12:42:47.503248+00:00"
}
]
}
The deletion process has started. Deletion is also processed asynchronously, so it will go through the DELETING status before completion.
Note that even if you delete from Knowledge Bases using direct ingestion, the file in the S3 bucket is not deleted.
It will be re-ingested during the next synchronization job, so you should also delete the file from the S3 bucket. Maintaining consistency is an important point to keep in mind.
Let's check the result:

Since there are no synchronized documents anymore, it can't provide an answer. That's as expected.
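To keep the two in sync from application code, a deletion helper could remove both the vectors in Knowledge Bases and the source file on S3. This is only a sketch, assuming the AWS SDK for JavaScript v3 (@aws-sdk/client-bedrock-agent and @aws-sdk/client-s3); the function name and argument shape are illustrative:
import {
  BedrockAgentClient,
  DeleteKnowledgeBaseDocumentsCommand,
} from "@aws-sdk/client-bedrock-agent";
import { S3Client, DeleteObjectCommand } from "@aws-sdk/client-s3";

const agent = new BedrockAgentClient({ region: "us-west-2" });
const s3 = new S3Client({ region: "us-west-2" });

async function deleteDocument(
  knowledgeBaseId: string,
  dataSourceId: string,
  bucket: string,
  key: string
) {
  // Remove the document's vectors from Knowledge Bases...
  await agent.send(
    new DeleteKnowledgeBaseDocumentsCommand({
      knowledgeBaseId,
      dataSourceId,
      documentIdentifiers: [{ dataSourceType: "S3", s3: { uri: `s3://${bucket}/${key}` } }],
    })
  );
  // ...and delete the source file so the next sync job doesn't re-ingest it
  await s3.send(new DeleteObjectCommand({ Bucket: bucket, Key: key }));
}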
Batch Processing Multiple Documents
You can process up to 10 documents at once via the console, or 25 documents via API in a single call.
First, let's upload multiple documents to S3:
echo "AWS Lambdaはサーバーレスコンピューティングサービスです。" > lambda.txt
echo "Amazon S3はオブジェクトストレージサービスです。" > s3.txt
echo "Amazon DynamoDBはフルマネージドNoSQLデータベースです。" > dynamodb.txt
aws s3 cp lambda.txt s3://${BUCKET_NAME}/documents/
aws s3 cp s3.txt s3://${BUCKET_NAME}/documents/
aws s3 cp dynamodb.txt s3://${BUCKET_NAME}/documents/
After uploading, execute ingest-knowledge-base-documents just as we did for a single file.
It's simple: just specify multiple entries in the documents array.
aws bedrock-agent ingest-knowledge-base-documents \
--knowledge-base-id "$KB_ID" \
--data-source-id "$DS_ID" \
--documents '[
{
"content": {
"dataSourceType": "S3",
"s3": {
"s3Location": {
"uri": "s3://'"${BUCKET_NAME}"'/documents/lambda.txt"
}
}
}
},
{
"content": {
"dataSourceType": "S3",
"s3": {
"s3Location": {
"uri": "s3://'"${BUCKET_NAME}"'/documents/s3.txt"
}
}
}
},
{
"content": {
"dataSourceType": "S3",
"s3": {
"s3Location": {
"uri": "s3://'"${BUCKET_NAME}"'/documents/dynamodb.txt"
}
}
}
}
]'
The execution returns multiple statuses:
{
"documentDetails": [
{
"knowledgeBaseId": "XXXXXXXXXX",
"dataSourceId": "YYYYYYYYYY",
"status": "STARTING",
"identifier": {
"dataSourceType": "S3",
"s3": {
"uri": "s3://kb-datasource-123456789012-us-west-2/documents/lambda.txt"
}
},
"updatedAt": "2026-01-25T12:46:42.231326+00:00"
},
{
"knowledgeBaseId": "XXXXXXXXXX",
"dataSourceId": "YYYYYYYYYY",
"status": "STARTING",
"identifier": {
"dataSourceType": "S3",
"s3": {
"uri": "s3://kb-datasource-123456789012-us-west-2/documents/s3.txt"
}
},
"updatedAt": "2026-01-25T12:46:42.252964+00:00"
},
{
"knowledgeBaseId": "XXXXXXXXXX",
"dataSourceId": "YYYYYYYYYY",
"status": "STARTING",
"identifier": {
"dataSourceType": "S3",
"s3": {
"uri": "s3://kb-datasource-123456789012-us-west-2/documents/dynamodb.txt"
}
},
"updatedAt": "2026-01-25T12:46:42.275428+00:00"
}
]
}
Let's try asking "Tell me about Lambda":

Multiple files have been properly ingested without any issues!
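If an application needs to ingest a longer list of files, the URIs can be split into batches that respect the per-request limit mentioned above. A minimal sketch with the AWS SDK for JavaScript v3; the constant and function name are illustrative:
import {
  BedrockAgentClient,
  IngestKnowledgeBaseDocumentsCommand,
} from "@aws-sdk/client-bedrock-agent";

const client = new BedrockAgentClient({ region: "us-west-2" });
const MAX_DOCS_PER_REQUEST = 25; // per-request limit noted above

async function ingestMany(knowledgeBaseId: string, dataSourceId: string, uris: string[]) {
  for (let i = 0; i < uris.length; i += MAX_DOCS_PER_REQUEST) {
    const batch = uris.slice(i, i + MAX_DOCS_PER_REQUEST);
    await client.send(
      new IngestKnowledgeBaseDocumentsCommand({
        knowledgeBaseId,
        dataSourceId,
        documents: batch.map((uri) => ({
          content: { dataSourceType: "S3" as const, s3: { s3Location: { uri } } },
        })),
      })
    );
  }
}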
Ensuring Idempotency with clientToken
When implementing direct ingestion processing in applications, you often need to implement retry logic for network errors or timeouts, right?
However, in cases where the first request actually succeeded, there's a possibility that the same document could be processed multiple times during retries.
To prevent such problems, the clientToken parameter is provided.
Let's create a file and upload it as a test.
echo "Amazon CloudWatchはAWSリソースの監視サービスです。" > cloudwatch.txt
aws s3 cp cloudwatch.txt s3://${BUCKET_NAME}/documents/
Let's execute direct ingestion with the clientToken specified.
aws bedrock-agent ingest-knowledge-base-documents \
--knowledge-base-id "$KB_ID" \
--data-source-id "$DS_ID" \
--client-token "user123-cloudwatch-20260125124500" \
--documents '[
{
"content": {
"dataSourceType": "S3",
"s3": {
"s3Location": {
"uri": "s3://'"${BUCKET_NAME}"'/documents/cloudwatch.txt"
}
}
}
}
]'
If you send the same request again right away, you get the response below, and the status is IN_PROGRESS instead of STARTING.
{
"documentDetails": [
{
"knowledgeBaseId": "XXXXXXXXXX",
"dataSourceId": "YYYYYYYYYY",
"status": "IN_PROGRESS",
"identifier": {
"dataSourceType": "S3",
"s3": {
"uri": "s3://kb-datasource-123456789012-us-west-2/documents/cloudwatch.txt"
}
},
"statusReason": "",
"updatedAt": "2026-01-25T13:01:11.857718+00:00"
}
]
}
When sending requests with the same clientToken, only the first one is processed, and subsequent ones are not executed redundantly.
For example, when implementing an API for users to upload documents, creating a unique token using a combination like userID + filename + timestamp would make retries more reliable.
| Parameter | Description |
|---|---|
| clientToken | A unique string of 33-256 characters. Requests with the same token are treated idempotently |
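In application code, one simple way to build such a token is to hash that combination, which also satisfies the 33-character minimum (a SHA-256 hex digest is 64 characters). A sketch assuming Node.js and the AWS SDK for JavaScript v3; the hashing scheme and function names are just one possible choice:
import { createHash } from "node:crypto";
import {
  BedrockAgentClient,
  IngestKnowledgeBaseDocumentsCommand,
} from "@aws-sdk/client-bedrock-agent";

const client = new BedrockAgentClient({ region: "us-west-2" });

// Deterministic token derived from userId + object key + upload timestamp
function buildClientToken(userId: string, key: string, uploadedAt: string): string {
  return createHash("sha256").update(`${userId}:${key}:${uploadedAt}`).digest("hex");
}

async function ingestWithToken(
  knowledgeBaseId: string,
  dataSourceId: string,
  uri: string,
  clientToken: string
) {
  await client.send(
    new IngestKnowledgeBaseDocumentsCommand({
      knowledgeBaseId,
      dataSourceId,
      clientToken, // retries with the same token are not processed twice
      documents: [{ content: { dataSourceType: "S3", s3: { s3Location: { uri } } } }],
    })
  );
}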
About Adding Metadata
With S3 data sources, there are restrictions on how to specify metadata.
You cannot specify metadata inline in the request body.
| Data Source Type | Inline specification | S3 location specification |
|---|---|---|
| CUSTOM | ○ | ○ |
| S3 | × | ○ |
If you want to add metadata to an S3 data source with direct ingestion, you need to upload a metadata file (.metadata.json) to S3, set metadata.type to S3_LOCATION, and specify the file's URI.
Let's try it. First, let's create metadata and upload it to S3.
cat << 'EOF' > bedrock-intro.txt.metadata.json
{
"metadataAttributes": {
"category": "aws-service",
"year": 2023
}
}
EOF
aws s3 cp bedrock-intro.txt.metadata.json s3://${BUCKET_NAME}/documents/
Once the upload is complete, specify parameters to include metadata in the ingestion.
aws bedrock-agent ingest-knowledge-base-documents \
--knowledge-base-id "$KB_ID" \
--data-source-id "$DS_ID" \
--documents '[
{
"content": {
"dataSourceType": "S3",
"s3": {
"s3Location": {
"uri": "s3://'"${BUCKET_NAME}"'/documents/bedrock-intro.txt"
}
}
},
"metadata": {
"type": "S3_LOCATION",
"s3Location": {
"uri": "s3://'"${BUCKET_NAME}"'/documents/bedrock-intro.txt.metadata.json"
}
}
}
]'
Let's check the response.
{
"documentDetails": [
{
"knowledgeBaseId": "XXXXXXXXXX",
"dataSourceId": "YYYYYYYYYY",
"status": "STARTING",
"identifier": {
"dataSourceType": "S3",
"s3": {
"uri": "s3://kb-datasource-123456789012-us-west-2/documents/bedrock-intro.txt"
}
},
"updatedAt": "2026-01-25T13:14:29.316549+00:00"
}
]
}
The response doesn't show whether metadata has been applied.
Let's try searching with metadata filtering.
You can specify filter conditions with the --retrieval-configuration parameter in the retrieve command.
Let's search for documents where category is aws-service.
aws bedrock-agent-runtime retrieve \
--knowledge-base-id "$KB_ID" \
--retrieval-query '{"text": "AWSサービスについて教えて"}' \
--retrieval-configuration '{
"vectorSearchConfiguration": {
"filter": {
"equals": {
"key": "category",
"value": "aws-service"
}
}
}
}'
Let's check the execution result.
{
"retrievalResults": [
{
"content": {
"text": "Amazon Bedrockは、主要なAI企業が提供する高性能な基盤モデルを単一のAPIで利用できるフルマネージドサービスです。Claude、Titan、Mistralなど多数のモデルに対応しています。",
"type": "TEXT"
},
"location": {
"s3Location": {
"uri": "s3://kb-datasource-123456789012-us-west-2/documents/bedrock-intro.txt"
},
"type": "S3"
},
"metadata": {
"x-amz-bedrock-kb-source-file-modality": "TEXT",
"category": "aws-service",
"year": 2023.0,
"x-amz-bedrock-kb-chunk-id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
"x-amz-bedrock-kb-data-source-id": "YYYYYYYYYY"
},
"score": 0.592372715473175
}
],
"guardrailAction": null
}
We can see that the metadata we added is reflected and retrieved!
Let's also confirm that nothing matches when we filter on year 2024.
aws bedrock-agent-runtime retrieve \
--knowledge-base-id "$KB_ID" \
--retrieval-query '{"text": "AWSサービスについて教えて"}' \
--retrieval-configuration '{
"vectorSearchConfiguration": {
"filter": {
"equals": {
"key": "year",
"value": "2024"
}
}
}
}'
Let's check the execution result.
{
"retrievalResults": [],
"guardrailAction": null
}
Nothing is retrieved! We've confirmed that metadata filtering is working.
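The same filtered search can be issued from application code. A minimal sketch with the AWS SDK for JavaScript v3 (@aws-sdk/client-bedrock-agent-runtime); the function name and the hard-coded filter key are illustrative:
import {
  BedrockAgentRuntimeClient,
  RetrieveCommand,
} from "@aws-sdk/client-bedrock-agent-runtime";

const runtime = new BedrockAgentRuntimeClient({ region: "us-west-2" });

async function searchByCategory(knowledgeBaseId: string, query: string, category: string) {
  // Retrieve only chunks whose "category" metadata matches exactly
  const { retrievalResults } = await runtime.send(
    new RetrieveCommand({
      knowledgeBaseId,
      retrievalQuery: { text: query },
      retrievalConfiguration: {
        vectorSearchConfiguration: {
          filter: { equals: { key: "category", value: category } },
        },
      },
    })
  );
  return retrievalResults ?? [];
}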
Notes on Using Direct Ingestion with S3 Data Sources
When using direct ingestion with S3 data sources, keep in mind that removing a document from Knowledge Bases does not delete the corresponding file on S3, and that the next synchronization job will overwrite the index based on whatever is in the S3 bucket at that time.
Since S3 and Knowledge Bases don't synchronize automatically, you need a workflow that keeps them consistent: add, update, or delete files in the S3 bucket, then immediately reflect those changes in Knowledge Bases with the direct ingestion APIs. It's also good to have a mechanism that rolls back and restores consistency when an error occurs, as sketched below.
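As a rough illustration of that workflow, the sketch below uploads a file and immediately ingests it, rolling back the upload if ingestion fails. It assumes the AWS SDK for JavaScript v3; the function name and the simple rollback strategy are illustrative:
import { S3Client, PutObjectCommand, DeleteObjectCommand } from "@aws-sdk/client-s3";
import {
  BedrockAgentClient,
  IngestKnowledgeBaseDocumentsCommand,
} from "@aws-sdk/client-bedrock-agent";

const s3 = new S3Client({ region: "us-west-2" });
const agent = new BedrockAgentClient({ region: "us-west-2" });

async function uploadAndIngest(
  knowledgeBaseId: string,
  dataSourceId: string,
  bucket: string,
  key: string,
  body: string
) {
  // 1. Put the file into the data source bucket
  await s3.send(new PutObjectCommand({ Bucket: bucket, Key: key, Body: body }));
  try {
    // 2. Immediately reflect it in Knowledge Bases via direct ingestion
    await agent.send(
      new IngestKnowledgeBaseDocumentsCommand({
        knowledgeBaseId,
        dataSourceId,
        documents: [
          { content: { dataSourceType: "S3", s3: { s3Location: { uri: `s3://${bucket}/${key}` } } } },
        ],
      })
    );
  } catch (err) {
    // 3. Roll back the upload so S3 and Knowledge Bases stay consistent
    await s3.send(new DeleteObjectCommand({ Bucket: bucket, Key: key }));
    throw err;
  }
}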
Conclusion
Direct ingestion using S3 data sources requires attention to consistency between S3 and Knowledge Bases, but it's beneficial to be able to reflect documents immediately without waiting for sync jobs.
It seems particularly useful for RAG applications requiring real-time updates or use cases where documents are frequently updated. I also have a case where I want to use it soon, so I plan to incorporate it into my application.
I hope this article has been helpful. Thank you for reading until the end!!

