Semantic Cache Validation in Amazon ElastiCache (Valkey 8.2)
Introduction
LLM response generation is easily bottlenecked by model inference time and waiting on external API calls. For queries that are identical or semantically similar to past ones, returning a cached response eliminates both the inference and the external call. In this article, we verify a semantic cache implementation using Amazon ElastiCache running Valkey 8.2.
What is Semantic Caching?
Semantic caching is a technique that returns a cache hit for semantically similar questions even when the text does not match exactly. The question is converted into a vector called an embedding, and vector search is used to find similar stored entries. Similarity is evaluated with a metric such as cosine similarity.
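To make the metric concrete, here is a minimal sketch (not part of the verification script) of cosine similarity between two vectors:

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1].
// Identical directions give 1; the COSINE distance Valkey returns is 1 - similarity.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0, 0], [1, 0, 0])); // identical vectors -> 1
console.log(cosineSimilarity([1, 0, 0], [0, 1, 0])); // orthogonal vectors -> 0
```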
Verification Environment
- Region: ap-northeast-1
- ElastiCache: Valkey 8.2
- Node.js: v22.16.0
- iovalkey: 0.3.1
- Embedding: Titan Text Embeddings v2 (Bedrock)
Overall Flow
The application converts input text to embeddings and compares them with embeddings stored in ElastiCache using KNN search. If the similarity exceeds a threshold, it returns the cached response.
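As a rough sketch of this flow, the lookup can be expressed as a single function. `embed` and `search` are hypothetical stand-ins injected for the Bedrock and Valkey calls shown later; the names are illustrative, not an actual API:

```javascript
// Semantic-cache lookup flow (sketch): embed the query, run a KNN search,
// and return the cached answer only if similarity clears the threshold.
// `embed(text)` and `search(vector)` are injected stand-ins for the real
// Bedrock / Valkey calls.
async function semanticLookup(query, { embed, search, threshold }) {
  const vector = await embed(query);
  const best = await search(vector); // { answer, distance } or null
  if (!best) return { hit: false };
  const similarity = 1 - best.distance; // COSINE distance -> similarity
  return similarity >= threshold
    ? { hit: true, answer: best.answer, similarity }
    : { hit: false, similarity };
}

// Demo with stubbed dependencies:
semanticLookup('営業時間は?', {
  embed: async () => [0.1, 0.2], // pretend embedding
  search: async () => ({ answer: '平日9時から18時です。', distance: 0.2 }),
  threshold: 0.45,
}).then((result) => console.log(result)); // hit: similarity 0.8 clears the 0.45 threshold
```

On a miss, the caller would invoke the LLM and store the new question/answer pair together with its embedding.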
Verification
Setting up ElastiCache (Valkey 8.2)
Here's a minimal Terraform example. Subnet groups and security groups are omitted for brevity.
resource "aws_elasticache_parameter_group" "valkey8" {
  family = "valkey8"
  name   = "REPLACE_ME-valkey8-params"

  parameter {
    name  = "reserved-memory-percent"
    value = "50"
  }
}

resource "aws_elasticache_replication_group" "semantic_cache" {
  replication_group_id       = "REPLACE_ME-semantic-cache"
  description                = "Valkey 8.2 for semantic cache"
  engine                     = "valkey"
  engine_version             = "8.2"
  node_type                  = "cache.t4g.micro"
  num_cache_clusters         = 1
  parameter_group_name       = aws_elasticache_parameter_group.valkey8.name
  subnet_group_name          = "REPLACE_ME"
  security_group_ids         = ["sg-REPLACE_ME"]
  transit_encryption_enabled = true
}
Connecting from Node.js via TLS
Let's connect from Node.js to ElastiCache (Valkey) using iovalkey with TLS, and call Bedrock Runtime.
mkdir test-semantic-cache
cd test-semantic-cache
npm init -y
Install the required packages.
npm install iovalkey @aws-sdk/client-bedrock-runtime
Create the following test script:
test-vector-search.js
import Valkey from 'iovalkey';
import { BedrockRuntimeClient, InvokeModelCommand } from '@aws-sdk/client-bedrock-runtime';

// Configuration
const config = {
  elasticacheEndpoint: process.env.ELASTICACHE_ENDPOINT || 'localhost',
  elasticachePort: parseInt(process.env.ELASTICACHE_PORT || '6379'),
  awsRegion: process.env.AWS_REGION || 'ap-northeast-1',
  embeddingModelId: 'amazon.titan-embed-text-v2:0',
  indexName: 'idx:test_cache',
  similarityThreshold: 0.45,
};

// Bedrock client
const bedrockClient = new BedrockRuntimeClient({ region: config.awsRegion });

/**
 * Generate embedding from text
 */
async function generateEmbedding(text) {
  const command = new InvokeModelCommand({
    modelId: config.embeddingModelId,
    contentType: 'application/json',
    accept: 'application/json',
    body: JSON.stringify({ inputText: text }),
  });
  const response = await bedrockClient.send(command);
  const parsed = JSON.parse(new TextDecoder().decode(response.body));
  return parsed.embedding;
}

/**
 * Convert a float array to a FLOAT32 Buffer (for Valkey)
 */
function embeddingToBuffer(embedding) {
  return Buffer.from(new Float32Array(embedding).buffer);
}
/**
 * Main test
 */
async function main() {
  console.log('='.repeat(60));
  console.log('ElastiCache Valkey Vector Search Test');
  console.log('='.repeat(60));
  console.log(`Endpoint: ${config.elasticacheEndpoint}:${config.elasticachePort}`);
  console.log(`Index: ${config.indexName}`);
  console.log(`Similarity Threshold: ${config.similarityThreshold * 100}%`);
  console.log('');

  // Connect to Valkey
  console.log('[1] Connecting to Valkey...');
  const valkey = new Valkey({
    host: config.elasticacheEndpoint,
    port: config.elasticachePort,
    tls: config.elasticacheEndpoint !== 'localhost' ? {} : undefined,
    maxRetriesPerRequest: 3,
  });
  valkey.on('error', (err) => {
    console.error('Valkey connection error:', err.message);
  });

  try {
    // Check connection
    const pong = await valkey.ping();
    console.log(`  PING response: ${pong}`);

    // Check Valkey version (compare major.minor numerically so versions like "8.10" are handled correctly)
    const info = await valkey.info('server');
    const versionMatch = info.match(/valkey_version:(\S+)/);
    const version = versionMatch ? versionMatch[1] : 'unknown';
    console.log(`  Valkey version: ${version}`);
    const [major = 0, minor = 0] = version.split('.').map(Number);
    if (version !== 'unknown' && (major < 8 || (major === 8 && minor < 2))) {
      console.error('  ERROR: Valkey 8.2 or higher is required for vector search');
      process.exit(1);
    }
    console.log('');

    // [2] Cleanup & create index
    console.log('[2] Creating vector index...');
    try {
      await valkey.call('FT.DROPINDEX', config.indexName);
      console.log('  Dropped existing index');
    } catch (err) {
      // Ignore if the index doesn't exist
    }
    // Schema with only the vector field (question/answer are stored with HSET but not indexed)
    await valkey.call(
      'FT.CREATE', config.indexName,
      'ON', 'HASH',
      'PREFIX', '1', 'test:',
      'SCHEMA',
      'embedding', 'VECTOR', 'FLAT', '6',
      'TYPE', 'FLOAT32',
      'DIM', '1024',
      'DISTANCE_METRIC', 'COSINE'
    );
    console.log('  Index created successfully');
    console.log('');
    // [3] Store sample data
    console.log('[3] Storing sample data with embeddings...');
    const sampleData = [
      {
        key: 'test:q1',
        question: '営業時間は何時から何時までですか?',
        answer: '営業時間は平日9時から18時までです。土日祝日はお休みをいただいております。',
      },
      {
        key: 'test:q2',
        question: '返品はできますか?',
        answer: '商品到着後7日以内であれば、未使用品に限り返品を承っております。',
      },
      {
        key: 'test:q3',
        question: '配送料はいくらですか?',
        answer: '5,000円以上のご購入で送料無料です。5,000円未満の場合は全国一律500円です。',
      },
    ];
    for (const item of sampleData) {
      console.log(`  Generating embedding for: "${item.question.substring(0, 30)}..."`);
      const embedding = await generateEmbedding(item.question);
      const embeddingBuffer = embeddingToBuffer(embedding);
      await valkey.hset(item.key, {
        question: item.question,
        answer: item.answer,
        embedding: embeddingBuffer,
      });
      console.log(`  Stored: ${item.key}`);
    }
    console.log('');

    // [4] Test vector search
    console.log('[4] Testing vector search...');
    const testQueries = [
      '何時に開いていますか?', // similar to: 営業時間 (business hours)
      '返品の条件を教えてください', // similar to: 返品 (returns)
      '送料はかかりますか?', // similar to: 配送料 (shipping fee)
      'クレジットカードは使えますか?', // not similar: new query
    ];
    let cacheHits = 0;
    for (const query of testQueries) {
      console.log(`\n  Query: "${query}"`);
      const queryEmbedding = await generateEmbedding(query);
      const queryBuffer = embeddingToBuffer(queryEmbedding);
      // Not using SORTBY as it caused errors in this test environment
      const results = await valkey.call(
        'FT.SEARCH', config.indexName,
        '*=>[KNN 1 @embedding $vec AS score]',
        'PARAMS', '2', 'vec', queryBuffer,
        'DIALECT', '2'
      );
      // Parse results
      const numResults = results[0];
      if (numResults > 0) {
        const docId = results[1];
        const docData = results[2];
        // docData is in [field, value, field, value, ...] format
        const dataMap = {};
        for (let i = 0; i < docData.length; i += 2) {
          if (docData[i] !== 'embedding') {
            dataMap[docData[i]] = docData[i + 1];
          }
        }
        const score = parseFloat(dataMap.score);
        const similarity = 1 - score; // convert COSINE distance to similarity
        // Get question and answer (not included in the index, fetched separately)
        const storedData = await valkey.hgetall(docId);
        console.log(`  -> Best match: ${docId}`);
        console.log(`  -> Similarity: ${(similarity * 100).toFixed(2)}%`);
        console.log(`  -> Threshold: ${config.similarityThreshold * 100}%`);
        if (similarity >= config.similarityThreshold) {
          console.log(`  -> CACHE HIT`);
          console.log(`  -> Answer: "${storedData.answer.substring(0, 50)}..."`);
          cacheHits++;
        } else {
          console.log(`  -> CACHE MISS: similarity below threshold`);
        }
      } else {
        console.log('  -> No results found');
      }
    }
    console.log('');

    // [5] Cleanup
    console.log('[5] Cleaning up test data...');
    for (const item of sampleData) {
      await valkey.del(item.key);
    }
    await valkey.call('FT.DROPINDEX', config.indexName);
    console.log('  Test data and index removed');
    console.log('');
    console.log('='.repeat(60));
    console.log('Test completed successfully!');
    console.log(`Cache hits: ${cacheHits} / ${testQueries.length}`);
    console.log('='.repeat(60));
  } catch (error) {
    console.error('\nError:', error.message);
    if (error.stack) {
      console.error(error.stack);
    }
    process.exit(1);
  } finally {
    await valkey.quit();
  }
}

main();
Set environment variables:
export ELASTICACHE_ENDPOINT=REPLACE_ME
export ELASTICACHE_PORT=6379
export AWS_REGION=ap-northeast-1
Run the script and check the results:
node test-vector-search.js
Detailed explanation of test-vector-search.js
First, we create an index for vector search. In this test, the schema only defines embedding, while question and answer are just stored in the HASH.
await valkey.call(
  'FT.CREATE', config.indexName,
  'ON', 'HASH',
  'PREFIX', '1', 'test:',
  'SCHEMA',
  'embedding', 'VECTOR', 'FLAT', '6',
  'TYPE', 'FLOAT32',
  'DIM', '1024',
  'DISTANCE_METRIC', 'COSINE'
);
Next, we generate embeddings from the question text and store them in the HASH. The embeddings are assumed to be Float32 arrays and are converted to binary format for Valkey.
function embeddingToBuffer(embedding) {
  return Buffer.from(new Float32Array(embedding).buffer);
}

await valkey.hset(item.key, {
  question: item.question,
  answer: item.answer,
  embedding: embeddingBuffer,
});
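For debugging, the conversion can be reversed. A quick round-trip check (not part of the article's script; `bufferToEmbedding` is an illustrative helper, and `embeddingToBuffer` is repeated here for self-containment) confirms the FLOAT32 encoding is lossless up to float32 precision:

```javascript
// Encode a float array as a FLOAT32 buffer (as stored in the Valkey HASH)...
function embeddingToBuffer(embedding) {
  return Buffer.from(new Float32Array(embedding).buffer);
}

// ...and decode it back, e.g. when inspecting a stored hash field.
function bufferToEmbedding(buffer) {
  return Array.from(
    new Float32Array(buffer.buffer, buffer.byteOffset, buffer.length / 4)
  );
}

const original = [0.25, -0.5, 1.0]; // exactly representable in float32
const roundTrip = bufferToEmbedding(embeddingToBuffer(original));
console.log(roundTrip); // [ 0.25, -0.5, 1 ]
```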
We generate embeddings using Bedrock Runtime's InvokeModel:
async function generateEmbedding(text) {
  const command = new InvokeModelCommand({
    modelId: config.embeddingModelId,
    contentType: 'application/json',
    accept: 'application/json',
    body: JSON.stringify({ inputText: text }),
  });
  const response = await bedrockClient.send(command);
  const parsed = JSON.parse(new TextDecoder().decode(response.body));
  return parsed.embedding;
}
Then we create an embedding for the input text and perform a KNN search to find the closest match. In this environment, SORTBY caused errors, so we're not using it in the script.
const results = await valkey.call(
  'FT.SEARCH', config.indexName,
  '*=>[KNN 1 @embedding $vec AS score]',
  'PARAMS', '2', 'vec', queryBuffer,
  'DIALECT', '2'
);
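The raw reply observed here is a flat array of the form `[count, docId, [field, value, ...], ...]`. A small helper (illustrative only, not part of iovalkey; it assumes the reply shape seen in this test) makes the parsing in the script explicit:

```javascript
// Parse an FT.SEARCH reply of shape [count, docId, [field, value, ...], ...]
// into { docId, fields } objects, skipping the raw embedding bytes.
function parseSearchReply(reply) {
  const [count, ...rest] = reply;
  const docs = [];
  for (let i = 0; i < count; i++) {
    const docId = rest[i * 2];
    const pairs = rest[i * 2 + 1];
    const fields = {};
    for (let j = 0; j < pairs.length; j += 2) {
      if (pairs[j] !== 'embedding') fields[pairs[j]] = pairs[j + 1];
    }
    docs.push({ docId, fields });
  }
  return docs;
}

// Mock reply with one hit at COSINE distance 0.3 (similarity 0.7)
const mock = [1, 'test:q1', ['score', '0.3', 'question', '営業時間は?']];
console.log(parseSearchReply(mock)); // one document with its score and question fields
```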
Using the returned docId, we retrieve the answer with HGETALL. This is because we chose not to include answer in the index schema.
Finally, we convert the score to similarity and determine if it's a cache hit based on the threshold. In this test, score is treated as COSINE distance, and we convert it with similarity = 1 - score.
const score = parseFloat(dataMap.score);
const similarity = 1 - score; // Convert COSINE distance to similarity
For Japanese text, we set similarityThreshold to 0.45, which was determined from the measurements described later.
Results
| Item | Result |
|---|---|
| ElastiCache Valkey 8.2 | Vector search confirmed working |
| Node.js client | Works with iovalkey |
| Index creation (FT.CREATE) | Successful |
| Vector search (FT.SEARCH) | Successful |
| Exact match similarity | 100% |
Japanese Text Similarity
We measured similarity for Japanese text using Titan Text Embeddings v2. The stored data was "営業時間は何時から何時までですか?" (What are your business hours?).
| Query | Stored Data | Similarity |
|---|---|---|
| 営業時間は何時から何時までですか? | Same | 100% |
| 営業時間を教えてください | 営業時間は何時から何時までですか? | 48.67% |
| 何時に開いていますか | 営業時間は何時から何時までですか? | 37.56% |
| 返品はできますか | 営業時間は何時から何時までですか? | 15.01% |
Additional Experiment: Response Time with Cache Hits (ECS Fargate Measurements)
In a separate test, we measured the response-time improvement from semantic caching in a Node.js application running on ECS Fargate (ap-northeast-1). In this scenario, the first query misses the (initially empty) cache, and a second, identical or similar query hits it.
Scenario
- First query: Generate answer for "AWS支援について教えて" (Tell me about AWS support) and store the result in cache
- Second query: Same question hits the cache and returns the cached response
Similarly, negative sentiment detection (NEG detection) was also cached, and we confirmed that similar negative expressions would hit the cache.
Response Time Comparison
| Process | Without Cache | With Cache | Notes |
|---|---|---|---|
| Answer generation | 3,571 ms | 133 ms | Second time returns answer from cache |
| Negative sentiment detection | ~1,500 ms | ~90 ms | Second time returns detection result from cache |
For identical input on the second query, answer generation time was reduced from 3,571 ms to 133 ms. Semantic caching eliminates waiting time for LLM calls, directly improving conversation tempo.
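Expressed as a ratio, the measured latencies correspond to roughly a 27x speedup for the answer-generation path:

```javascript
// Speedup factor from the measured latencies (ms) in the table above
const withoutCache = 3571;
const withCache = 133;
console.log((withoutCache / withCache).toFixed(1)); // 26.8
```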
Reduction in Bedrock API Calls
| Scenario | Call Count | Breakdown |
|---|---|---|
| First time (no cache) | 3 calls | Embedding + Answer generation + Negative sentiment detection |
| Second time (with cache) | 1 call | Embedding only |
On the second query, only embedding generation was executed, eliminating the need for answer generation and negative sentiment detection calls.
Discussion
The value of this verification lies in providing evidence for deciding whether to adopt semantic caching. In voice IVR and chat systems, gaps between responses significantly affect user experience, so it is important to be able to argue quantitatively rather than merely assume that caching "should be faster." Based on these results, semantic caching is an effective way to maintain conversation tempo in scenarios where similar questions recur frequently.
Setting an appropriate similarity threshold is often challenging when implementing this approach. If the threshold is too high, cache hits will be rare; if too low, false positives will increase. Our measurements suggest that for Japanese text, a threshold between 0.40 and 0.50 is a reasonable starting point. In production, the threshold should be re-evaluated and adjusted based on the language and content of expected text.
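One way to ground the choice is to sweep candidate thresholds over measured similarities. The sketch below uses the four values from the Japanese-text table above and shows how the hit count changes with the threshold:

```javascript
// Hit counts at candidate thresholds, using the similarities measured
// against the stored question 営業時間は何時から何時までですか?
const similarities = [1.0, 0.4867, 0.3756, 0.1501];

for (const threshold of [0.35, 0.40, 0.45, 0.50]) {
  const hits = similarities.filter((s) => s >= threshold).length;
  console.log(`threshold ${threshold}: ${hits} of ${similarities.length} queries hit`);
}
```

At 0.45, the exact match and the closest paraphrase hit while the looser paraphrase and the unrelated query miss; lowering the threshold to 0.35 also admits the looser paraphrase, at the cost of more false-positive risk.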
In designs where multiple LLM calls are made per input, waiting time and costs can easily increase. Semantic caching can effectively mitigate this by eliminating multiple calls at once when a hit occurs. However, additional ElastiCache usage costs should be considered, and cost-effectiveness should be evaluated based on expected hit rates and traffic.
Conclusion
In this article, we verified semantic caching using Amazon ElastiCache's Valkey 8.2 with vector search capabilities. We confirmed that KNN search works via FT.CREATE and FT.SEARCH commands from Node.js (iovalkey) with TLS connection. For Japanese text, we found similarity scores for paraphrased expressions typically range from 0.40 to 0.50. Our additional experiment showed response time reduction from 3,571 ms to 133 ms with cache hits, providing quantitative evidence for conversation tempo improvement.



