
I tried building low-cost RAG with Gemma 4 31B + S3 Vectors + AgentCore
Introduction
Hello, I'm Kamino from the Consulting Department, and I'm trying to get better at parking.
The other day, suzuki.ryo wrote an article about the Japanese language performance and pricing of Gemma 4 31B.
As mentioned in that article, Gemma 4 31B's token price is $0.14 / 1M tokens for input and $0.40 / 1M tokens for output. Here's how it compares to other models:
| Model | Input (/ 1M tokens) | Output (/ 1M tokens) | Compared to Gemma 4 31B |
|---|---|---|---|
| Gemma 4 31B | $0.14 | $0.40 | 1x |
| Claude Haiku 4.5 | $0.80 | $4.00 | 5.7–10x |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 21–37x |
| Claude Opus 4.8 | $5.00 | $25.00 | 35–62x |
Gemma 4 is incredibly cheap...!!
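To make the "Compared to Gemma 4 31B" column easy to re-check, here is a minimal sketch that recomputes the ratios from the prices in the table. The function and variable names are my own for illustration and are not part of the repository.

```python
# Prices in USD per 1M tokens, copied from the table above.
PRICES = {
    "Gemma 4 31B": (0.14, 0.40),
    "Claude Haiku 4.5": (0.80, 4.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Opus 4.8": (5.00, 25.00),
}


def price_ratios(model: str, baseline: str = "Gemma 4 31B") -> tuple[float, float]:
    """Return (input_ratio, output_ratio) of `model` relative to `baseline`."""
    base_in, base_out = PRICES[baseline]
    model_in, model_out = PRICES[model]
    return model_in / base_in, model_out / base_out


if __name__ == "__main__":
    for name in PRICES:
        ratio_in, ratio_out = price_ratios(name)
        print(f"{name}: input {ratio_in:.1f}x, output {ratio_out:.1f}x")
```

The table lists the input ratio first and the output ratio second, so the range widens for models whose output tokens are priced relatively higher.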
If it handles Japanese reasonably well and supports tool calls, I figured it would be interesting to build RAG as an AI agent with it, so I built one using Amazon Bedrock AgentCore + S3 Vectors.
Since the LLM is cheap, I wanted to keep the vector store costs low too. S3 Vectors charges for storage, Query API calls, and the volume of vectors processed during queries, but for a small number of documents it can be quite affordable. I created a template with Gemma 4 31B + S3 Vectors to dispel the impression that RAG is expensive and hard to try. Here's the repository: https://github.com/yuu551/gemma4-31b-agentcore-sample
Here's what it looks like in action:

I built it with the help of AI and it turned out nicely.
I hope this encourages you to try RAG for yourself. Trying it out first is what matters most.
This article introduces the key aspects of the architecture and implementation. Please refer to the repository for the full code and detailed implementation.
Prerequisites
The environment used this time is as follows:
- Node.js v24.16.0
- Python v3.13.11
- AWS CLI
- Docker (used for building the AgentCore Runtime container)
- Access to Gemma 4 31B via bedrock-mantle (as of 2026-06-14 it is available in us-east-1 and some other regions, but not in the Tokyo region)
Architecture
The architecture uses Amplify for the frontend and is centered around AgentCore.

| Component | Technology |
|---|---|
| LLM | Gemma 4 31B (via bedrock-mantle's OpenAI-compatible API) |
| Agent | Strands Agents SDK |
| Knowledge Base | Bedrock Knowledge Bases + S3 data source |
| Vector Store | S3 Vectors (Titan Text Embeddings V2, 1024 dim, cosine) |
| Conversation Memory | AgentCore Memory |
| Agent Execution Environment | AgentCore Runtime |
| MCP Gateway (optional) | AgentCore Gateway + AWS Knowledge MCP Server |
| Authentication | Cognito |
| History API | API Gateway REST API + Cognito User Pool Authorizer + Lambda + AgentCore Memory + DynamoDB |
| Feedback | API Gateway REST API + Cognito User Pool Authorizer + Lambda + DynamoDB |
| Frontend | React + Vite + Tailwind CSS v4 |
| IaC | Amplify Gen 2 + CDK |
Here's a rough overview of what it can do. It implements the minimum features needed to search via chat.
- Search and answer from the knowledge base in a chat format
- Filter search targets by category (HR, accounting, security, development, operations)
- Maintain conversation history with AgentCore Memory
- Search AWS official documentation as an MCP tool via Gateway (optional)
- Record feedback (like / improve / comment) for answers to DynamoDB
- Answer quality evaluation script using LLM-as-a-Judge (includes a 15-question test set)
Deployment
Clone the repository, then run npm install followed by npx ampx sandbox --once to create all of the following resources in one go:
- Cognito (authentication)
- Bedrock Knowledge Base + S3 Vectors
- Sample document upload to S3 and KB sync
- AgentCore Runtime (Gemma 4 31B agent)
- AgentCore Memory
- REST API + Lambda + DynamoDB for conversation history retrieval
- REST API + Lambda + DynamoDB for feedback
git clone https://github.com/yuu551/gemma4-31b-agentcore-sample.git
cd gemma4-31b-agentcore-sample
npm install
npx ampx sandbox --once
Once deployment is complete, amplify_outputs.json is generated. The frontend automatically reads the Runtime ARN and Memory ID from this file, so no environment variable configuration is needed.
npm run dev
Opening the browser will display the Cognito authentication screen. After signing up, you can try knowledge search from the chat screen.
To enable AgentCore Gateway, set the ENABLE_GATEWAY environment variable when deploying:
ENABLE_GATEWAY=true npx ampx sandbox --once
The enabled/disabled switch itself is managed in amplify/parameters.ts.
Why S3 Vectors Was Chosen
Let's compare common vector store options for RAG:
| Vector Store | Minimum Monthly Cost | Billing Model |
|---|---|---|
| OpenSearch Serverless (Classic) | ~$350+ | OCU hourly billing (minimum 2 OCU) |
| S3 Vectors | $0+ | Fully pay-as-you-go |
Knowledge Bases works with OpenSearch Serverless (Classic, not NextGen), which bills hourly for at least the minimum OCU count, so even a test environment can easily cost hundreds of dollars per month. S3 Vectors charges only for what you use, which suits PoCs and internal tools with small document sets. The trade-offs are that hybrid search is not available and latency is higher than with OpenSearch, but when you just want to try things out quickly, neither is much of a concern.
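As a rough check on the ~$350 figure in the table, here is a minimal sketch of the OCU arithmetic. The rate of $0.24 per OCU-hour is my assumption for us-east-1 and is not stated in this article, so please confirm it against the current pricing page before relying on it.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def opensearch_serverless_floor(
    ocu_count: float = 2,            # minimum OCU count from the table above
    usd_per_ocu_hour: float = 0.24,  # ASSUMPTION: us-east-1 rate, verify before use
) -> float:
    """Monthly cost when `ocu_count` OCUs are billed around the clock."""
    return ocu_count * usd_per_ocu_hour * HOURS_PER_MONTH


if __name__ == "__main__":
    print(f"~${opensearch_serverless_floor():.0f} / month")
```

The point is that this floor applies even with zero queries, whereas S3 Vectors starts from $0.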
In this case, S3 Vectors is used as the backend for Bedrock Knowledge Bases. By querying through the Retrieve API, we take advantage of managed features like document ingestion and vector conversion, while keeping vector store costs limited to S3 Vectors pay-as-you-go pricing.
Building Knowledge Base + S3 Vectors with CDK
In CDK, the vector store and Knowledge Base are built using L1 constructs. CfnVectorBucket / CfnIndex creates the vector store, and CfnKnowledgeBase's storageConfiguration specifies S3_VECTORS to link them.
import * as cdk from "aws-cdk-lib";
import * as s3vectors from "aws-cdk-lib/aws-s3vectors";
import * as bedrock from "aws-cdk-lib/aws-bedrock";

// Vector bucket and index that back the Knowledge Base
// (ragStack and kbRole are defined elsewhere in the repository)
const vectorBucket = new s3vectors.CfnVectorBucket(ragStack, "VectorBucket", {});
const vectorIndex = new s3vectors.CfnIndex(ragStack, "VectorIndex", {
  vectorBucketArn: vectorBucket.attrVectorBucketArn,
  indexName: "company-docs",
  dataType: "float32",
  dimension: 1024, // must match the embedding model's output dimension
  distanceMetric: "cosine",
  metadataConfiguration: {
    // Keys used internally by Knowledge Bases; never used for filtering
    nonFilterableMetadataKeys: [
      "AMAZON_BEDROCK_TEXT", "AMAZON_BEDROCK_METADATA",
      "x-amz-bedrock-kb-source-uri", "x-amz-bedrock-kb-chunk-id",
      "x-amz-bedrock-kb-data-source-id",
    ],
  },
});

const kb = new bedrock.CfnKnowledgeBase(ragStack, "KnowledgeBase", {
  name: "agentic-rag-kb",
  roleArn: kbRole.roleArn,
  knowledgeBaseConfiguration: {
    type: "VECTOR",
    vectorKnowledgeBaseConfiguration: {
      embeddingModelArn: `arn:aws:bedrock:${cdk.Aws.REGION}::foundation-model/amazon.titan-embed-text-v2:0`,
    },
  },
  // Link the Knowledge Base to the S3 Vectors bucket and index
  storageConfiguration: {
    type: "S3_VECTORS",
    s3VectorsConfiguration: {
      vectorBucketArn: vectorBucket.attrVectorBucketArn,
      indexName: "company-docs",
    },
  },
});
The storageConfiguration type is set to S3_VECTORS, and s3VectorsConfiguration links the VectorBucket and Index. Ingestion from the S3 data source into the KB is handled by a CDK custom resource that runs a document ingestion job automatically at deploy time.
The nonFilterableMetadataKeys in metadataConfiguration lists the metadata keys that Knowledge Bases uses internally. S3 Vectors limits the size of filterable metadata, and because these keys hold large values such as the chunk text, leaving them filterable exceeds that limit and causes errors during document ingestion. Make sure every metadata key that won't be used for filtering goes into nonFilterableMetadataKeys.
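To illustrate why the chunk text has to be non-filterable, here is a minimal sketch that estimates the size of the filterable part of a vector's metadata. The 2,048-byte default and the JSON-based size estimate are my assumptions for illustration; how S3 Vectors actually measures metadata size should be confirmed in the official limits documentation.

```python
import json

# Keys listed as non-filterable in the CDK definition above.
NON_FILTERABLE_KEYS = {
    "AMAZON_BEDROCK_TEXT",
    "AMAZON_BEDROCK_METADATA",
    "x-amz-bedrock-kb-source-uri",
    "x-amz-bedrock-kb-chunk-id",
    "x-amz-bedrock-kb-data-source-id",
}


def filterable_metadata_bytes(metadata: dict, non_filterable_keys: set[str]) -> int:
    """Approximate the filterable metadata size as UTF-8 encoded JSON."""
    filterable = {k: v for k, v in metadata.items() if k not in non_filterable_keys}
    return len(json.dumps(filterable, ensure_ascii=False).encode("utf-8"))


def fits_filterable_limit(
    metadata: dict,
    non_filterable_keys: set[str],
    limit_bytes: int = 2048,  # ASSUMPTION: check the current S3 Vectors limits
) -> bool:
    """Return True if the filterable portion stays within `limit_bytes`."""
    return filterable_metadata_bytes(metadata, non_filterable_keys) <= limit_bytes
```

With a chunk of a few thousand characters stored under AMAZON_BEDROCK_TEXT, the check fails when no keys are excluded and passes once the internal keys are marked non-filterable, leaving only small values like category in the filterable part.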
Running Gemma 4 31B with a Strands Agent
Gemma 4 31B is called via bedrock-mantle's OpenAI-compatible API.
Using Strands Agents SDK's OpenAIResponsesModel, it works as an agent capable of tool execution.
from strands import Agent, tool
from strands.models.openai_responses import OpenAIResponsesModel
from aws_bedrock_token_generator import provide_token

BASE_URL = f"https://bedrock-mantle.{REGION}.api.aws/openai/v1"

model = OpenAIResponsesModel(
    model_id="google.gemma-4-31b",
    client_args={
        "api_key": provide_token(region=REGION),
        "base_url": BASE_URL,
    },
)

agent = Agent(
    model=model,
    tools=[knowledge_search, list_categories],
    system_prompt=SYSTEM_PROMPT,
)
bedrock-mantle uses a different endpoint from the regular Bedrock Converse API and authenticates with a bearer token. aws-bedrock-token-generator is a library that generates short-term tokens for bedrock-mantle from AWS credentials; calling provide_token(region=REGION) is all it takes.
How to call bedrock-mantle from Strands is also covered in the following article, which is worth reading!
Agent Search Tool
As an agent tool, a search function is implemented that calls the Retrieve API of Bedrock Knowledge Bases.
@tool
def knowledge_search(query: str, category: str = "") -> str:
    """Searches internal documents from the knowledge base."""
    kwargs = {
        "knowledgeBaseId": KNOWLEDGE_BASE_ID,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"numberOfResults": 5}
        },
    }
    # Add a metadata filter only when a known category is given
    if category and category in CATEGORIES:
        kwargs["retrievalConfiguration"]["vectorSearchConfiguration"]["filter"] = {
            "equals": {"key": "category", "value": category}
        }
    # bedrock_agent is a boto3 "bedrock-agent-runtime" client
    response = bedrock_agent.retrieve(**kwargs)
    # Format and return results (omitted here; see the repository)
Rather than querying S3 Vectors directly, searches go through the Knowledge Bases Retrieve API. Embedding conversion is handled automatically by Knowledge Bases, so the agent side just needs to pass a query string. Metadata filtering is also supported, so it's possible to narrow down search targets by category (HR, accounting, security, development, operations).
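The excerpt above leaves out the result formatting, so here is a minimal sketch of how the request could be assembled and the Retrieve response turned into text for the agent. This is not the repository's implementation: the helper names, the output format, and the category identifiers other than "hr" are my assumptions. The response fields (retrievalResults, content.text, location.s3Location.uri, score) follow the Retrieve API response shape.

```python
# ASSUMPTION: category identifiers; only "hr" appears in the article's metadata example.
CATEGORIES = {"hr", "accounting", "security", "development", "operations"}


def build_retrieve_kwargs(kb_id: str, query: str, category: str = "", top_k: int = 5) -> dict:
    """Build keyword arguments for the Retrieve API, filtering by category if valid."""
    search_config: dict = {"numberOfResults": top_k}
    if category in CATEGORIES:
        search_config["filter"] = {"equals": {"key": "category", "value": category}}
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {"vectorSearchConfiguration": search_config},
    }


def format_results(response: dict) -> str:
    """Turn a Retrieve API response into a numbered, source-annotated text block."""
    blocks = []
    for i, item in enumerate(response.get("retrievalResults", []), start=1):
        text = item.get("content", {}).get("text", "")
        uri = item.get("location", {}).get("s3Location", {}).get("uri", "unknown")
        score = item.get("score", 0.0)
        blocks.append(f"[{i}] source={uri} score={score:.3f}\n{text}")
    return "\n\n".join(blocks) if blocks else "No matching documents found."
```

Returning the source URI alongside each chunk gives the model something concrete to cite, which also helps when checking whether the correct document was referenced.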
Sending Requests from SPA to AgentCore Runtime
AgentCore Runtime supports JWT authentication, so it can be called directly from the frontend.
const runtime = new agentcore.Runtime(ragStack, "AgenticRagRuntime", {
  runtimeName: "agentic_rag_gemma4",
  agentRuntimeArtifact: agentArtifact,
  authorizerConfiguration: agentcore.RuntimeAuthorizerConfiguration.usingCognito(
    backend.auth.resources.userPool,
    [backend.auth.resources.userPoolClient],
  ),
});
Simply specifying Cognito in the Runtime's authorizerConfiguration lets the frontend call the Runtime with an Authorization: Bearer <accessToken> header.
import { fetchAuthSession } from "aws-amplify/auth";

const session = await fetchAuthSession();
const accessToken = session.tokens?.accessToken?.toString();
// The ARN contains ":" and "/", so URL-encode it before placing it in the path
// (skip this if runtimeArn is already encoded)
const response = await fetch(
  `https://bedrock-agentcore.${REGION}.amazonaws.com/runtimes/${encodeURIComponent(runtimeArn)}/invocations?qualifier=DEFAULT`,
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${accessToken}`,
      "Content-Type": "application/json",
      Accept: "text/event-stream",
      "X-Amzn-Bedrock-AgentCore-Runtime-Session-Id": sessionId,
    },
    body: JSON.stringify({ prompt, sessionId, userId }),
  },
);
The Runtime ARN can be read from amplify_outputs.json. If you output it on the CDK side with backend.addOutput({ custom: { runtime_arn: runtime.agentRuntimeArn } }), the frontend only needs to import it.
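If you want to call the Runtime from a script rather than the SPA (for example, for evaluation), the same request can be assembled in Python. This is a minimal sketch under my own assumptions: the helper names are hypothetical, and the ARN is URL-encoded because it contains ":" and "/". It only builds the URL and headers; sending the request is left to whichever HTTP client you prefer.

```python
from urllib.parse import quote


def runtime_invocation_url(region: str, runtime_arn: str, qualifier: str = "DEFAULT") -> str:
    """Build the invocations URL with the Runtime ARN URL-encoded into the path."""
    encoded_arn = quote(runtime_arn, safe="")  # encode ":" and "/" as well
    return (
        f"https://bedrock-agentcore.{region}.amazonaws.com"
        f"/runtimes/{encoded_arn}/invocations?qualifier={qualifier}"
    )


def runtime_headers(access_token: str, session_id: str) -> dict[str, str]:
    """Headers matching the frontend request shown above."""
    return {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
        "Accept": "text/event-stream",
        "X-Amzn-Bedrock-AgentCore-Runtime-Session-Id": session_id,
    }
```

Passing safe="" to quote matters here, because the default leaves "/" unescaped and the ARN would then be split across path segments.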
Frontend
Since Amplify Gen 2 is CDK-based, Cognito configuration only requires a few lines in amplify/auth/resource.ts. On the frontend side, simply wrapping the entire app with the @aws-amplify/ui-react Authenticator component completes sign-up, login, and token management.
import { Authenticator } from "@aws-amplify/ui-react";

function App() {
  return (
    <Authenticator>
      {({ user }) => <ChatApp user={user} />}
    </Authenticator>
  );
}
It's great that a login-protected screen can be put together this easily during the verification phase.
Streamdown is used for displaying chat responses.
Streamdown is a library that renders markdown flowing in via SSE in real time, suppressing flickering and layout jumps during streaming. It also supports syntax highlighting for code blocks and was simple to integrate.
<Streamdown isAnimating={!!message.isStreaming}>
{processedContent}
</Streamdown>
For frontend implementation details, please refer to the src/ directory in the repository.
Maintaining Conversation History with AgentCore Memory
AgentCore Memory and Strands are integrated to maintain conversation history across sessions.
from bedrock_agentcore.memory.integrations.strands.config import AgentCoreMemoryConfig
from bedrock_agentcore.memory.integrations.strands.session_manager import AgentCoreMemorySessionManager

config = AgentCoreMemoryConfig(
    memory_id=MEMORY_ID,
    actor_id=user_id,
    session_id=session_id,
)
session_manager = AgentCoreMemorySessionManager(
    agentcore_memory_config=config,
    region_name=REGION,
)

agent = Agent(
    model=model,
    tools=tools,
    system_prompt=SYSTEM_PROMPT,
    session_manager=session_manager,
)
Simply passing AgentCoreMemorySessionManager to Strands' session_manager automates saving and restoring conversations. Conversations are isolated per user and per session, so it's safe for multi-user environments.
Managing History List
Conversation content is saved in AgentCore Memory, and when restoring history, the ListEvents API is called via API Gateway + Cognito User Pool Authorizer + Lambda. The configuration uses the JWT sub claim as the actor_id. This ensures that only the logged-in user's own history is displayed.
Additionally, session titles and last updated times displayed in the sidebar are saved as metadata in DynamoDB. After a message is sent, the title and last message are updated, and the sidebar references this metadata.
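To make the "JWT sub claim as actor_id" idea concrete, here is a minimal sketch of the part of a history Lambda that derives the ListEvents parameters from the API Gateway event. It is not the repository's handler: the function names and the sessionId path parameter are my assumptions. The claims location (requestContext.authorizer.claims) is where a REST API Cognito User Pool Authorizer places the token claims.

```python
def actor_id_from_event(event: dict) -> str:
    """Read the verified `sub` claim that the Cognito authorizer attached to the event."""
    claims = event.get("requestContext", {}).get("authorizer", {}).get("claims", {})
    sub = claims.get("sub")
    if not sub:
        raise PermissionError("missing sub claim")
    return sub


def list_events_params(event: dict, memory_id: str) -> dict:
    """Build ListEvents parameters, always taking the actor from the token."""
    # ASSUMPTION: the session ID arrives as a path parameter named "sessionId".
    session_id = (event.get("pathParameters") or {}).get("sessionId")
    if not session_id:
        raise ValueError("sessionId is required")
    return {
        "memoryId": memory_id,
        "actorId": actor_id_from_event(event),  # never trust a client-supplied actor
        "sessionId": session_id,
    }
```

Because the actor ID never comes from the request body or query string, a user cannot read another user's history just by guessing a session ID.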
Searching AWS Documentation via AgentCore Gateway (Optional)
AWS official documentation can be searched as an MCP tool through AgentCore Gateway. It becomes available when deployed with ENABLE_GATEWAY=true.
const gateway = new agentcore.Gateway(ragStack, "KnowledgeGateway", {
  gatewayName: "agentic-rag-gateway",
  authorizerConfiguration: agentcore.GatewayAuthorizer.usingAwsIam(),
});

new agentcore.CfnGatewayTarget(ragStack, "AWSKnowledgeTarget", {
  gatewayIdentifier: gateway.gatewayId,
  name: "aws-knowledge",
  targetConfiguration: {
    mcp: { mcpServer: { endpoint: "https://knowledge-mcp.global.api.aws" } },
  },
});
On the agent side, mcp-proxy-for-aws is used to connect to the Gateway with IAM authentication.
from mcp_proxy_for_aws.client import aws_iam_streamablehttp_client
from strands.tools.mcp import MCPClient

mcp_factory = lambda: aws_iam_streamablehttp_client(
    endpoint=GATEWAY_URL,
    aws_region=REGION,
    aws_service="bedrock-agentcore",
)
tools.append(MCPClient(mcp_factory))
The AWS Knowledge MCP Server is an MCP server hosted by AWS that can be used simply by adding it as a Gateway target. It lets the agent search both internal documents and AWS official documentation, which is handy when you also want to support technical questions. It doubles as a sample of how to use Gateway, and it is entirely optional.
For the configuration using IAM authentication via Gateway, please refer to the following article as needed:
Extending the Gateway
This time only the AWS Knowledge MCP Server is added as a Gateway target, but AgentCore Gateway can also connect Lambda, custom MCP Servers, and even AgentCore Runtime itself as targets. For example, you can add tools for calling internal APIs, integrating with external services, or custom processing.
For example, to call another AgentCore Runtime as a tool via Gateway, specify the Runtime's invocations endpoint as the target and configure IAM authentication.
const runtimeEndpoint =
  `https://bedrock-agentcore.${cdk.Aws.REGION}.amazonaws.com/runtimes/` +
  encodeURIComponent(anotherRuntime.agentRuntimeArn) +
  "/invocations";

new agentcore.CfnGatewayTarget(ragStack, "MyAgentTarget", {
  gatewayIdentifier: gateway.gatewayId,
  name: "my-agent",
  description: "Internal data analysis agent",
  targetConfiguration: {
    mcp: {
      mcpServer: {
        endpoint: runtimeEndpoint,
      },
    },
  },
  credentialProviderConfigurations: [
    {
      credentialProviderType: "GATEWAY_IAM_ROLE",
      credentialProvider: {
        iamCredentialProvider: {
          service: "bedrock-agentcore",
        },
      },
    },
  ],
});
The endpoint format is https://bedrock-agentcore.{region}.amazonaws.com/runtimes/{URL-encoded ARN}/invocations. By specifying an IAM role in credentialProviderConfigurations, the Gateway will attach SigV4 signatures to requests.
Without changing the agent code itself, adding targets like this allows you to expand what's possible.
Using IAM authentication with MCP Server targets in Gateway is covered in detail in the following article, so please refer to it as needed:
Feedback Collection
When operating a RAG, user feedback is essential for improving answer quality. This implementation includes a mechanism for sending "like / needs improvement" and comments for each answer.
API Gateway REST API + Cognito User Pool Authorizer + Lambda + DynamoDB is used as the feedback destination.
const feedbackApi = new apigateway.RestApi(ragStack, "FeedbackApi", {
  restApiName: "agentic-rag-feedback",
  deployOptions: { stageName: "prod" },
});
const feedbackAuthorizer = new apigateway.CognitoUserPoolsAuthorizer(ragStack, "FeedbackAuthorizer", {
  cognitoUserPools: [backend.auth.resources.userPool],
});
// feedbackResource and feedbackFn are defined elsewhere in the repository
feedbackResource.addMethod("POST", new apigateway.LambdaIntegration(feedbackFn), {
  authorizer: feedbackAuthorizer,
  authorizationType: apigateway.AuthorizationType.COGNITO,
});
Authentication uses Cognito ID tokens, so you can track who gave feedback on which answer. Feedback saved in DynamoDB can be checked with scripts for aggregation and CSV export.
npm run feedback:summary -- --table <TABLE_NAME>
npm run feedback:export -- --table <TABLE_NAME> -o feedback.csv
In addition to automated LLM-as-a-Judge evaluation, collecting actual user feedback provides material for decisions on prompt improvements and document additions.
RAG Quality Evaluation with LLM-as-a-Judge
Once you've built a RAG, you'll be curious about its accuracy. Using Gemma 4 31B itself as the judging LLM, I automatically evaluated the answer quality for 15 test questions from three perspectives. The evaluation script is available in eval/evaluate.py.
npm run eval -- \
--runtime-arn <RUNTIME_ARN> \
--region us-east-1
The Runtime ARN is listed in amplify_outputs.json generated after deployment, under custom.runtime_arn. When run, it retrieves RAG answers for all 15 questions, and Gemma 4 31B acts as the judge, scoring each answer from 1 to 5 on three axes: Faithfulness, Relevancy, and Completeness.
[1/15] q01: How many days per week is remote work possible?...
RAG response: 289 characters, 1 tool call, 6.9 seconds
Evaluation: F=5 R=5 C=5 (6.9 seconds)
Reason: Fully covers the information in the correct answer, including "up to 3 days per week" and "apply from HR system by 5pm the day before," and references the correct document
[2/15] q02: How much does a taxi fare need to be for prior approval in expense claims?...
RAG response: 152 characters, 1 tool call, 6.1 seconds
Evaluation: F=5 R=5 C=5 (4.5 seconds)
Reason: The answer perfectly matches the correct content, correctly referencing the document to respond accurately
Here are the evaluation results for the sample documents:
| Perspective | Content | Score |
|---|---|---|
| Faithfulness | Whether the answer is based on KB content (absence of hallucinations) | 5.00 / 5.00 |
| Relevancy | Whether the correct document is referenced for the answer | 5.00 / 5.00 |
| Completeness | Whether expected information is fully covered | 4.47 / 5.00 |
| Overall | - | 4.82 / 5.00 |
All 15 questions were answered with correct document retrieval. The slightly lower Completeness score was due to a tendency to omit supplementary information such as RPO (Recovery Point Objective) and MFA requirements. Since the main answers are returned accurately, this is sufficient accuracy for a model in this price range.
Note that these results come from a simple set of single-turn questions and simple evaluation criteria, which tends to produce higher scores, so please treat them as reference values. If you replace the sample documents with your own, you can rewrite the test questions in eval/questions.json and evaluate the new documents with the same script. The script is intended to be used together with the feedback collection mechanism to drive an improvement cycle.
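To show how per-question scores roll up into the table above, here is a minimal aggregation sketch. I am assuming that each axis score is the mean over all questions and that Overall is the mean of the three axis means; this is consistent with the reported numbers, but the actual logic lives in eval/evaluate.py. The per-question score distribution in the usage example is made up.

```python
from statistics import mean

AXES = ("faithfulness", "relevancy", "completeness")


def summarize(scores: list[dict[str, int]]) -> dict[str, float]:
    """Average each axis over all questions, then average the axes for an overall score."""
    summary = {axis: mean(s[axis] for s in scores) for axis in AXES}
    summary["overall"] = mean(summary[axis] for axis in AXES)
    return summary


if __name__ == "__main__":
    # Hypothetical distribution: 7 questions with C=5 and 8 with C=4.
    sample = (
        [{"faithfulness": 5, "relevancy": 5, "completeness": 5}] * 7
        + [{"faithfulness": 5, "relevancy": 5, "completeness": 4}] * 8
    )
    for axis, value in summarize(sample).items():
        print(f"{axis}: {value:.2f} / 5.00")
```

When you swap in your own documents, tracking each axis separately like this makes it easier to tell a retrieval problem (Relevancy) from an answer that is grounded but incomplete (Completeness).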
Examples of questions used in evaluation
- How many days per week is remote work possible? Please also tell me the application method.
- How much does a taxi fare need to be for prior approval in expense claims?
- Where should I contact if I discover a security incident?
- Are there any days of the week when production deployments are prohibited?
- What level of approval is required for procurement of 5 million yen or more?
Using Your Own Documents
This sample includes 20 fictional business documents, but you can switch to your own documents simply by replacing the contents of seed/docs/.
Place text files and metadata files in pairs:
seed/docs/
├── my-document.txt # Document content
├── my-document.txt.metadata.json # Metadata (category, title)
{
  "metadataAttributes": {
    "category": {
      "value": { "type": "STRING", "stringValue": "hr" }
    },
    "title": {
      "value": { "type": "STRING", "stringValue": "Remote Work Policy" }
    }
  }
}
Upon redeployment, BucketDeployment uploads to S3, and the custom resource automatically runs the document ingestion job. After replacing documents, rewriting the test questions in eval/questions.json allows you to evaluate the new documents with LLM-as-a-Judge as well.
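If you have many documents, writing the metadata files by hand gets tedious, so here is a minimal sketch that generates the sidecar file in the format shown above. The helper name is hypothetical and is not part of the repository.

```python
import json
from pathlib import Path


def write_metadata(doc_path: Path, category: str, title: str) -> Path:
    """Write `<document>.metadata.json` next to the document and return its path."""
    metadata = {
        "metadataAttributes": {
            "category": {"value": {"type": "STRING", "stringValue": category}},
            "title": {"value": {"type": "STRING", "stringValue": title}},
        }
    }
    # my-document.txt -> my-document.txt.metadata.json
    sidecar = doc_path.with_name(doc_path.name + ".metadata.json")
    sidecar.write_text(json.dumps(metadata, ensure_ascii=False, indent=2), encoding="utf-8")
    return sidecar


if __name__ == "__main__":
    print(write_metadata(Path("seed/docs/my-document.txt"), "hr", "Remote Work Policy"))
```

ensure_ascii=False keeps Japanese titles readable in the generated file instead of escaping them.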
Cost Estimate
Here is a rough estimate for operating with 1,000 questions per month. Note that this is just an estimate and actual costs depend on usage.
This assumes us-east-1, Gateway enabled, LLM-as-a-Judge not executed, and sample documents only.
| Item | Monthly Estimate |
|---|---|
| Gemma 4 31B (inference) | ~$0.34 |
| Embedding + S3 Vectors | ~$0.05–$0.20 |
| AgentCore Runtime | ~$0.50–$2 |
| AgentCore Memory + Gateway | ~$0.10–$1 |
| Cognito | $0 (within 10k MAU) |
| Amplify Hosting | ~$0.50–$2 |
| API Gateway + Lambda + DynamoDB | ~$0.01–$0.10 |
| Total | ~$2–$5 / month |
AgentCore Runtime and Amplify Hosting are the largest line items in this estimate, and LLM inference comes to only about $0.34. Compared to the minimum cost of OpenSearch Serverless (from ~$350/month), using S3 Vectors brings vector store costs to nearly zero. For phases where you "just want to get something running," such as PoCs and internal tools, this is a very attractive configuration.
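To adapt the inference line to your own usage, here is a minimal sketch of the calculation. The per-question token counts (2,000 input, 150 output) are my own illustrative assumptions, since the article does not state them; they happen to reproduce the ~$0.34 figure at the Gemma 4 31B prices quoted at the top.

```python
INPUT_USD_PER_M = 0.14   # Gemma 4 31B input price per 1M tokens (from this article)
OUTPUT_USD_PER_M = 0.40  # Gemma 4 31B output price per 1M tokens (from this article)


def monthly_inference_cost(
    questions: int,
    input_tokens_per_question: int,
    output_tokens_per_question: int,
) -> float:
    """Estimate monthly LLM cost in USD from question volume and token usage."""
    input_cost = questions * input_tokens_per_question / 1_000_000 * INPUT_USD_PER_M
    output_cost = questions * output_tokens_per_question / 1_000_000 * OUTPUT_USD_PER_M
    return input_cost + output_cost


if __name__ == "__main__":
    # ASSUMPTION: 2,000 input tokens (prompt + retrieved chunks) and 150 output tokens.
    print(f"${monthly_inference_cost(1000, 2000, 150):.2f} / month")
```

Input tokens dominate in RAG because every question carries the retrieved chunks, so numberOfResults and chunk size are the first knobs to look at if this line grows.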
Conclusion
I think many people have the impression that building a RAG seems costly and hard to try out.
However, with the configuration introduced this time, I hope you'll find it easy to quickly get a feel for the behavior and experience without too much cost.
If you find potential for adoption here, it's perfectly fine to add features on top of this repository as a base, or to tune it. It would also be interesting to compare accuracy with frontier models like Claude.
For implementation details, please refer to the repository. If you have any feedback, please feel free to submit it via Issues!
I hope this article is helpful in some way. Thank you for reading to the end!!
