Semantic Cache Validation in Amazon ElastiCache (Valkey 8.2)
Introduction
LLM response generation is easily bottlenecked by model inference time and waiting on external API calls. For queries that are identical or semantically similar to past ones, returning a cached response eliminates both the inference and the external call. In this article, we verify a semantic cache implementation using Amazon ElastiCache running Valkey 8.2.
What is Semantic Caching?
Semantic caching is a technique that returns a cache hit for semantically similar questions even when the text does not match exactly. The question is converted into a vector called an embedding, and vector search is used to find similar stored entries. Similarity is evaluated with a metric such as cosine similarity.
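To make the metric concrete, here is a minimal sketch (not part of the verification script) of cosine similarity between two vectors:

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1].
// Identical directions give 1; the COSINE distance Valkey returns is 1 - similarity.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0, 0], [1, 0, 0])); // identical vectors -> 1
console.log(cosineSimilarity([1, 0, 0], [0, 1, 0])); // orthogonal vectors -> 0
```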
Verification Environment
- Region: ap-northeast-1
- ElastiCache: Valkey 8.2
- Node.js: v22.16.0
- iovalkey: 0.3.1
- Embedding: Titan Text Embeddings v2 (Bedrock)
Overall Flow
The application converts input text to embeddings and compares them with embeddings stored in ElastiCache using KNN search. If the similarity exceeds a threshold, it returns the cached response.
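As a rough sketch of this flow, the lookup can be expressed as a single function. `embed` and `search` are hypothetical stand-ins injected for the Bedrock and Valkey calls shown later; the names are illustrative, not an actual API:

```javascript
// Semantic-cache lookup flow (sketch): embed the query, run a KNN search,
// and return the cached answer only if similarity clears the threshold.
// `embed(text)` and `search(vector)` are injected stand-ins for the real
// Bedrock / Valkey calls.
async function semanticLookup(query, { embed, search, threshold }) {
  const vector = await embed(query);
  const best = await search(vector); // { answer, distance } or null
  if (!best) return { hit: false };
  const similarity = 1 - best.distance; // COSINE distance -> similarity
  return similarity >= threshold
    ? { hit: true, answer: best.answer, similarity }
    : { hit: false, similarity };
}

// Demo with stubbed dependencies:
semanticLookup('営業時間は?', {
  embed: async () => [0.1, 0.2], // pretend embedding
  search: async () => ({ answer: '平日9時から18時です。', distance: 0.2 }),
  threshold: 0.45,
}).then((result) => console.log(result)); // hit: similarity 0.8 clears the 0.45 threshold
```

On a miss, the caller would invoke the LLM and store the new question/answer pair together with its embedding.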
Verification
Setting up ElastiCache (Valkey 8.2)
Here's a minimal Terraform example. Subnet groups and security groups are omitted for brevity.
resource "aws_elasticache_parameter_group" "valkey8" {
  family = "valkey8"
  name   = "REPLACE_ME-valkey8-params"

  parameter {
    name  = "reserved-memory-percent"
    value = "50"
  }
}

resource "aws_elasticache_replication_group" "semantic_cache" {
  replication_group_id       = "REPLACE_ME-semantic-cache"
  description                = "Valkey 8.2 for semantic cache"
  engine                     = "valkey"
  engine_version             = "8.2"
  node_type                  = "cache.t4g.micro"
  num_cache_clusters         = 1
  parameter_group_name       = aws_elasticache_parameter_group.valkey8.name
  subnet_group_name          = "REPLACE_ME"
  security_group_ids         = ["sg-REPLACE_ME"]
  transit_encryption_enabled = true
}
Connecting from Node.js via TLS
Let's connect from Node.js to ElastiCache (Valkey) using iovalkey with TLS, and call Bedrock Runtime.
mkdir test-semantic-cache
cd test-semantic-cache
npm init -y
Install the required packages.
npm install iovalkey @aws-sdk/client-bedrock-runtime
Create the following test script:
test-vector-search.js
import Valkey from 'iovalkey';
import { BedrockRuntimeClient, InvokeModelCommand } from '@aws-sdk/client-bedrock-runtime';

// Configuration
const config = {
  elasticacheEndpoint: process.env.ELASTICACHE_ENDPOINT || 'localhost',
  elasticachePort: parseInt(process.env.ELASTICACHE_PORT || '6379'),
  awsRegion: process.env.AWS_REGION || 'ap-northeast-1',
  embeddingModelId: 'amazon.titan-embed-text-v2:0',
  indexName: 'idx:test_cache',
  similarityThreshold: 0.45,
};

// Bedrock client
const bedrockClient = new BedrockRuntimeClient({ region: config.awsRegion });

/**
 * Generate embedding from text
 */
async function generateEmbedding(text) {
  const command = new InvokeModelCommand({
    modelId: config.embeddingModelId,
    contentType: 'application/json',
    accept: 'application/json',
    body: JSON.stringify({ inputText: text }),
  });
  const response = await bedrockClient.send(command);
  const parsed = JSON.parse(new TextDecoder().decode(response.body));
  return parsed.embedding;
}

/**
 * Convert a float array to a FLOAT32 Buffer (for Valkey)
 */
function embeddingToBuffer(embedding) {
  return Buffer.from(new Float32Array(embedding).buffer);
}
/**
 * Main test
 */
async function main() {
  console.log('='.repeat(60));
  console.log('ElastiCache Valkey Vector Search Test');
  console.log('='.repeat(60));
  console.log(`Endpoint: ${config.elasticacheEndpoint}:${config.elasticachePort}`);
  console.log(`Index: ${config.indexName}`);
  console.log(`Similarity Threshold: ${config.similarityThreshold * 100}%`);
  console.log('');

  // Connect to Valkey
  console.log('[1] Connecting to Valkey...');
  const valkey = new Valkey({
    host: config.elasticacheEndpoint,
    port: config.elasticachePort,
    tls: config.elasticacheEndpoint !== 'localhost' ? {} : undefined,
    maxRetriesPerRequest: 3,
  });
  valkey.on('error', (err) => {
    console.error('Valkey connection error:', err.message);
  });

  try {
    // Check connection
    const pong = await valkey.ping();
    console.log(`  PING response: ${pong}`);

    // Check Valkey version (compare major.minor numerically so versions like "8.10" are handled correctly)
    const info = await valkey.info('server');
    const versionMatch = info.match(/valkey_version:(\S+)/);
    const version = versionMatch ? versionMatch[1] : 'unknown';
    console.log(`  Valkey version: ${version}`);
    const [major = 0, minor = 0] = version.split('.').map(Number);
    if (version !== 'unknown' && (major < 8 || (major === 8 && minor < 2))) {
      console.error('  ERROR: Valkey 8.2 or higher is required for vector search');
      process.exit(1);
    }
    console.log('');

    // [2] Cleanup & create index
    console.log('[2] Creating vector index...');
    try {
      await valkey.call('FT.DROPINDEX', config.indexName);
      console.log('  Dropped existing index');
    } catch (err) {
      // Ignore if the index doesn't exist
    }
    // Schema with only the vector field (question/answer are stored with HSET but not indexed)
    await valkey.call(
      'FT.CREATE', config.indexName,
      'ON', 'HASH',
      'PREFIX', '1', 'test:',
      'SCHEMA',
      'embedding', 'VECTOR', 'FLAT', '6',
      'TYPE', 'FLOAT32',
      'DIM', '1024',
      'DISTANCE_METRIC', 'COSINE'
    );
    console.log('  Index created successfully');
    console.log('');
    // [3] Store sample data
    console.log('[3] Storing sample data with embeddings...');
    const sampleData = [
      {
        key: 'test:q1',
        question: '営業時間は何時から何時までですか?',
        answer: '営業時間は平日9時から18時までです。土日祝日はお休みをいただいております。',
      },
      {
        key: 'test:q2',
        question: '返品はできますか?',
        answer: '商品到着後7日以内であれば、未使用品に限り返品を承っております。',
      },
      {
        key: 'test:q3',
        question: '配送料はいくらですか?',
        answer: '5,000円以上のご購入で送料無料です。5,000円未満の場合は全国一律500円です。',
      },
    ];
    for (const item of sampleData) {
      console.log(`  Generating embedding for: "${item.question.substring(0, 30)}..."`);
      const embedding = await generateEmbedding(item.question);
      const embeddingBuffer = embeddingToBuffer(embedding);
      await valkey.hset(item.key, {
        question: item.question,
        answer: item.answer,
        embedding: embeddingBuffer,
      });
      console.log(`  Stored: ${item.key}`);
    }
    console.log('');

    // [4] Test vector search
    console.log('[4] Testing vector search...');
    const testQueries = [
      '何時に開いていますか?', // similar to: 営業時間 (business hours)
      '返品の条件を教えてください', // similar to: 返品 (returns)
      '送料はかかりますか?', // similar to: 配送料 (shipping fee)
      'クレジットカードは使えますか?', // not similar: new query
    ];
    let cacheHits = 0;
    for (const query of testQueries) {
      console.log(`\n  Query: "${query}"`);
      const queryEmbedding = await generateEmbedding(query);
      const queryBuffer = embeddingToBuffer(queryEmbedding);
      // Not using SORTBY as it caused errors in this test environment
      const results = await valkey.call(
        'FT.SEARCH', config.indexName,
        '*=>[KNN 1 @embedding $vec AS score]',
        'PARAMS', '2', 'vec', queryBuffer,
        'DIALECT', '2'
      );
      // Parse results
      const numResults = results[0];
      if (numResults > 0) {
        const docId = results[1];
        const docData = results[2];
        // docData is in [field, value, field, value, ...] format
        const dataMap = {};
        for (let i = 0; i < docData.length; i += 2) {
          if (docData[i] !== 'embedding') {
            dataMap[docData[i]] = docData[i + 1];
          }
        }
        const score = parseFloat(dataMap.score);
        const similarity = 1 - score; // convert COSINE distance to similarity
        // Get question and answer (not included in the index, fetched separately)
        const storedData = await valkey.hgetall(docId);
        console.log(`  -> Best match: ${docId}`);
        console.log(`  -> Similarity: ${(similarity * 100).toFixed(2)}%`);
        console.log(`  -> Threshold: ${config.similarityThreshold * 100}%`);
        if (similarity >= config.similarityThreshold) {
          console.log(`  -> CACHE HIT`);
          console.log(`  -> Answer: "${storedData.answer.substring(0, 50)}..."`);
          cacheHits++;
        } else {
          console.log(`  -> CACHE MISS: similarity below threshold`);
        }
      } else {
        console.log('  -> No results found');
      }
    }
    console.log('');

    // [5] Cleanup
    console.log('[5] Cleaning up test data...');
    for (const item of sampleData) {
      await valkey.del(item.key);
    }
    await valkey.call('FT.DROPINDEX', config.indexName);
    console.log('  Test data and index removed');
    console.log('');
    console.log('='.repeat(60));
    console.log('Test completed successfully!');
    console.log(`Cache hits: ${cacheHits} / ${testQueries.length}`);
    console.log('='.repeat(60));
  } catch (error) {
    console.error('\nError:', error.message);
    if (error.stack) {
      console.error(error.stack);
    }
    process.exit(1);
  } finally {
    await valkey.quit();
  }
}

main();
Set environment variables:
export ELASTICACHE_ENDPOINT=REPLACE_ME
export ELASTICACHE_PORT=6379
export AWS_REGION=ap-northeast-1
Run the script and check the results:
node test-vector-search.js
Detailed explanation of test-vector-search.js
First, we create an index for vector search. In this test, the schema only defines embedding, while question and answer are just stored in the HASH.
await valkey.call(
  'FT.CREATE', config.indexName,
  'ON', 'HASH',
  'PREFIX', '1', 'test:',
  'SCHEMA',
  'embedding', 'VECTOR', 'FLAT', '6',
  'TYPE', 'FLOAT32',
  'DIM', '1024',
  'DISTANCE_METRIC', 'COSINE'
);
Next, we generate embeddings from the question text and store them in the HASH. The embeddings are assumed to be Float32 arrays and are converted to binary format for Valkey.
function embeddingToBuffer(embedding) {
  return Buffer.from(new Float32Array(embedding).buffer);
}

await valkey.hset(item.key, {
  question: item.question,
  answer: item.answer,
  embedding: embeddingBuffer,
});
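For debugging, the conversion can be reversed. A quick round-trip check (not part of the article's script; `bufferToEmbedding` is an illustrative helper, and `embeddingToBuffer` is repeated here for self-containment) confirms the FLOAT32 encoding is lossless up to float32 precision:

```javascript
// Encode a float array as a FLOAT32 buffer (as stored in the Valkey HASH)...
function embeddingToBuffer(embedding) {
  return Buffer.from(new Float32Array(embedding).buffer);
}

// ...and decode it back, e.g. when inspecting a stored hash field.
function bufferToEmbedding(buffer) {
  return Array.from(
    new Float32Array(buffer.buffer, buffer.byteOffset, buffer.length / 4)
  );
}

const original = [0.25, -0.5, 1.0]; // exactly representable in float32
const roundTrip = bufferToEmbedding(embeddingToBuffer(original));
console.log(roundTrip); // [ 0.25, -0.5, 1 ]
```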
We generate embeddings using Bedrock Runtime's InvokeModel:
async function generateEmbedding(text) {
  const command = new InvokeModelCommand({
    modelId: config.embeddingModelId,
    contentType: 'application/json',
    accept: 'application/json',
    body: JSON.stringify({ inputText: text }),
  });
  const response = await bedrockClient.send(command);
  const parsed = JSON.parse(new TextDecoder().decode(response.body));
  return parsed.embedding;
}
Then we create an embedding for the input text and perform a KNN search to find the closest match. In this environment, SORTBY caused errors, so we're not using it in the script.
const results = await valkey.call(
  'FT.SEARCH', config.indexName,
  '*=>[KNN 1 @embedding $vec AS score]',
  'PARAMS', '2', 'vec', queryBuffer,
  'DIALECT', '2'
);
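The raw reply observed here is a flat array of the form `[count, docId, [field, value, ...], ...]`. A small helper (illustrative only, not part of iovalkey; it assumes the reply shape seen in this test) makes the parsing in the script explicit:

```javascript
// Parse an FT.SEARCH reply of shape [count, docId, [field, value, ...], ...]
// into { docId, fields } objects, skipping the raw embedding bytes.
function parseSearchReply(reply) {
  const [count, ...rest] = reply;
  const docs = [];
  for (let i = 0; i < count; i++) {
    const docId = rest[i * 2];
    const pairs = rest[i * 2 + 1];
    const fields = {};
    for (let j = 0; j < pairs.length; j += 2) {
      if (pairs[j] !== 'embedding') fields[pairs[j]] = pairs[j + 1];
    }
    docs.push({ docId, fields });
  }
  return docs;
}

// Mock reply with one hit at COSINE distance 0.3 (similarity 0.7)
const mock = [1, 'test:q1', ['score', '0.3', 'question', '営業時間は?']];
console.log(parseSearchReply(mock)); // one document with its score and question fields
```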
Using the returned docId, we retrieve the answer with HGETALL. This is because we chose not to include answer in the index schema.
Finally, we convert the score to similarity and determine if it's a cache hit based on the threshold. In this test, score is treated as COSINE distance, and we convert it with similarity = 1 - score.
const score = parseFloat(dataMap.score);
const similarity = 1 - score; // Convert COSINE distance to similarity
For Japanese text, we set similarityThreshold to 0.45, which was determined from the measurements described later.
Results
| Item | Result |
|---|---|
| ElastiCache Valkey 8.2 | Vector search confirmed working |
| Node.js client | Works with iovalkey |
| Index creation (FT.CREATE) | Successful |
| Vector search (FT.SEARCH) | Successful |
| Exact match similarity | 100% |
Japanese Text Similarity
We measured similarity for Japanese text using Titan Text Embeddings v2. The stored data was "営業時間は何時から何時までですか?" (What are your business hours?).
| Query | Stored Data | Similarity |
|---|---|---|
| 営業時間は何時から何時までですか? | Same | 100% |
| 営業時間を教えてください | 営業時間は何時から何時までですか? | 48.67% |
| 何時に開いていますか | 営業時間は何時から何時までですか? | 37.56% |
| 返品はできますか | 営業時間は何時から何時までですか? | 15.01% |
Additional Experiment: Response Time with Cache Hits (ECS Fargate Measurements)
In a separate test, we measured the response-time improvement from semantic caching in a Node.js application running on ECS Fargate (ap-northeast-1). In this scenario, the first query misses the (initially empty) cache, and a second, identical or similar query hits it.
Scenario
- First query: Generate answer for "AWS支援について教えて" (Tell me about AWS support) and store the result in cache
- Second query: Same question hits the cache and returns the cached response
Similarly, negative sentiment detection (NEG detection) was also cached, and we confirmed that similar negative expressions would hit the cache.
Response Time Comparison
| Process | Without Cache | With Cache | Notes |
|---|---|---|---|
| Answer generation | 3,571 ms | 133 ms | Second time returns answer from cache |
| Negative sentiment detection | ~1,500 ms | ~90 ms | Second time returns detection result from cache |
For identical input on the second query, answer generation time was reduced from 3,571 ms to 133 ms. Semantic caching eliminates waiting time for LLM calls, directly improving conversation tempo.
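Expressed as a ratio, the measured latencies correspond to roughly a 27x speedup for the answer-generation path:

```javascript
// Speedup factor from the measured latencies (ms) in the table above
const withoutCache = 3571;
const withCache = 133;
console.log((withoutCache / withCache).toFixed(1)); // 26.8
```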
Reduction in Bedrock API Calls
| Scenario | Call Count | Breakdown |
|---|---|---|
| First time (no cache) | 3 calls | Embedding + Answer generation + Negative sentiment detection |
| Second time (with cache) | 1 call | Embedding only |
On the second query, only embedding generation was executed, eliminating the need for answer generation and negative sentiment detection calls.
Discussion
The value of this verification lies in providing evidence for deciding whether to adopt semantic caching. In voice IVR and chat systems, gaps between responses significantly affect user experience, so it is important to be able to argue quantitatively rather than merely assume that caching "should be faster." Based on these results, semantic caching is an effective way to maintain conversation tempo in scenarios where similar questions recur frequently.
Setting an appropriate similarity threshold is often challenging when implementing this approach. If the threshold is too high, cache hits will be rare; if too low, false positives will increase. Our measurements suggest that for Japanese text, a threshold between 0.40 and 0.50 is a reasonable starting point. In production, the threshold should be re-evaluated and adjusted based on the language and content of expected text.
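One way to ground the choice is to sweep candidate thresholds over measured similarities. The sketch below uses the four values from the Japanese-text table above and shows how the hit count changes with the threshold:

```javascript
// Hit counts at candidate thresholds, using the similarities measured
// against the stored question 営業時間は何時から何時までですか?
const similarities = [1.0, 0.4867, 0.3756, 0.1501];

for (const threshold of [0.35, 0.40, 0.45, 0.50]) {
  const hits = similarities.filter((s) => s >= threshold).length;
  console.log(`threshold ${threshold}: ${hits} of ${similarities.length} queries hit`);
}
```

At 0.45, the exact match and the closest paraphrase hit while the looser paraphrase and the unrelated query miss; lowering the threshold to 0.35 also admits the looser paraphrase, at the cost of more false-positive risk.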
In designs where multiple LLM calls are made per input, waiting time and costs can easily increase. Semantic caching can effectively mitigate this by eliminating multiple calls at once when a hit occurs. However, additional ElastiCache usage costs should be considered, and cost-effectiveness should be evaluated based on expected hit rates and traffic.
Conclusion
In this article, we verified semantic caching using Amazon ElastiCache's Valkey 8.2 with vector search capabilities. We confirmed that KNN search works via FT.CREATE and FT.SEARCH commands from Node.js (iovalkey) with TLS connection. For Japanese text, we found similarity scores for paraphrased expressions typically range from 0.40 to 0.50. Our additional experiment showed response time reduction from 3,571 ms to 133 ms with cache hits, providing quantitative evidence for conversation tempo improvement.



