Understanding intention with Nova, stability with Titan. Comparing search accuracy of 4 Embedding models using LLM-as-a-Judge

We stored approximately 60,000 blog posts in Amazon S3 Vectors and thoroughly compared the search accuracy of four embedding models (Nova, Titan, and others) using LLM-as-a-Judge. This article introduces the distinct strengths and weaknesses of each model revealed by automated scoring, along with practical guidelines for choosing between them.
2026.02.17


In the previous article, we registered about 60,000 blog posts from DevelopersIO in Amazon S3 Vectors and built a serverless semantic search environment.

https://dev.classmethod.jp/articles/s3-vectors-bedrock-semantic-search/

This time, we used this environment to compare the search accuracy of four embedding models. We'll explain how we reached the conclusion to "use Nova Embed and Titan V2 in combination according to their strengths" by having an LLM-as-a-Judge automatically score each model's search results for identical queries.

Comparison Target Models

| Model | Provider | Dimensions | Execution Environment |
|---|---|---|---|
| Amazon Nova Embed | AWS (Bedrock) | 1024 | Bedrock API |
| Amazon Titan Embed V2 | AWS (Bedrock) | 1024 | Bedrock API |
| multilingual-e5-large | Microsoft (OSS) | 1024 | EC2 |
| ruri-base | Hugging Face (OSS) | 768 | EC2 |

We compared two Bedrock managed models and two OSS models.

For the approximately 60,000 blog posts mentioned above, we vectorized Japanese summaries of about 165 characters and English summaries of about 60 words with each model, and registered them as separate indices in S3 Vectors (4 models × 2 languages = 8 indices).
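As a reference, here is a minimal sketch of what the registration step might look like for one model/language pair using boto3. The bucket and index names are hypothetical; the Titan V2 model ID and the general `put_vectors` shape follow AWS's public documentation.

```python
# Minimal sketch: embed a summary with Titan V2 and store it in S3 Vectors.
# Bucket/index names below are hypothetical placeholders.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
s3vectors = boto3.client("s3vectors", region_name="us-west-2")

def embed_titan_v2(text: str) -> list[float]:
    """Vectorize a summary with Titan Embed V2 (1024 dimensions)."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text, "dimensions": 1024}),
    )
    return json.loads(resp["body"].read())["embedding"]

def register_article(article_id: str, summary_ja: str) -> None:
    """Store one article's Japanese-summary vector in its model/language index."""
    s3vectors.put_vectors(
        vectorBucketName="blog-vectors",   # hypothetical bucket name
        indexName="titan-v2-ja",           # one of the 8 model x language indices
        vectors=[{
            "key": article_id,
            "data": {"float32": embed_titan_v2(summary_ja)},
            "metadata": {"lang": "ja"},
        }],
    )
```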

Evaluation Method

Search Queries (33 queries across 12 patterns)

Assuming actual use cases, we prepared the following 12 patterns. These are broadly divided into "same-language search," where the query and search target are in the same language, and "cross-lingual search," which searches across different languages.

Same-Language Search:

| # | Test Pattern | Query Example | Purpose |
|---|---|---|---|
| 1 | Japanese→Japanese | "Lambda コールドスタート 対策" | Basic performance |
| 3 | English→English | "RAG hallucination prevention" | English basic performance |
| 4 | Single word→Japanese | "React", "Lambda" | Noise resistance |
| 5 | Single word→English | "S3", "CDK" | Noise resistance (English) |
| 6 | Synonym→Japanese | "サーバーレス関数 タイムアウト 変更" | Synonym understanding |
| 8 | Intent/Issue→Japanese | "コンテナが勝手に再起動を繰り返す" | Intent understanding |
| 10 | Niche tech→Japanese | "Projen を使ったプロジェクト管理" | Low-frequency topics |

Cross-Lingual Search:

| # | Test Pattern | Query Example | Purpose |
|---|---|---|---|
| 2 | English→Japanese | "Lambda cold start mitigation" | Cross-lingual basic |
| 7 | Synonym (EN→JP) | "Increase Lambda execution time" | Cross-lingual synonyms |
| 9 | Intent/Issue (EN→JP) | "ECS Fargate tasks restarting loop" | Cross-lingual intent understanding |
| 11 | Niche tech (EN→JP) | "Benefits of using IAM Roles Anywhere" | Cross-lingual low-frequency |
| 12 | Japanese→English | "Lambda コールドスタート 対策" | Reverse cross-lingual |

Evaluation Metric: nDCG@10 × LLM-as-a-Judge

We passed the Top-10 results of each search to an LLM for scoring on a scale of 0-100. To improve judgment accuracy, we had the LLM read not just the article titles but also the RAG summaries in both Japanese and English.

| Score | Meaning |
|---|---|
| 100 | Contains a direct answer/solution to the query |
| 80 | Closely related and very useful |
| 50 | Topic is correct but falls short of addressing the intent |
| 20 | Contains the same words but different content |
| 0 | Unrelated |
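For illustration, here is a minimal sketch of how each Top-10 result might be scored, with the rubric above embedded in the prompt. `call_judge` is a hypothetical stand-in for whichever LLM client is used (Gemini 3.0 Pro in the main run, Claude Sonnet 4.5 for reference).

```python
# Minimal sketch of the per-result judge prompt; mirrors the rubric table above.
RUBRIC = """Score the search result for the query on this scale:
100: contains a direct answer/solution to the query
 80: closely related and very useful
 50: topic is correct but falls short of addressing the intent
 20: contains the same words but different content
  0: unrelated
Answer with the score only."""

def build_judge_prompt(query: str, title: str, summary_ja: str, summary_en: str) -> str:
    # The judge reads the title plus both RAG summaries, not just the title.
    return (
        f"{RUBRIC}\n\n"
        f"Query: {query}\n"
        f"Title: {title}\n"
        f"Japanese summary: {summary_ja}\n"
        f"English summary: {summary_en}\n"
    )

def score_top10(query: str, results: list[dict], call_judge) -> list[int]:
    """Score each Top-10 result; call_judge(prompt) -> str is a stand-in client."""
    return [
        int(call_judge(build_judge_prompt(query, r["title"], r["summary_ja"], r["summary_en"])))
        for r in results[:10]
    ]
```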

We calculated nDCG (Normalized Discounted Cumulative Gain) from the scores. This metric is higher when better results appear at the top, with 1.0 being ideal.
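As a concrete reference, here is a minimal sketch of the computation, assuming linear gains over the 0-100 judge scores and an ideal ranking taken over the same Top-10 scores:

```python
# nDCG@10 from judge scores: DCG discounts each score by log2(rank + 1),
# then normalizes by the DCG of the ideal (descending) ordering.
import math

def dcg(scores: list[float]) -> float:
    # Rank 1 gets divisor log2(2) = 1, rank 2 gets log2(3), and so on.
    return sum(s / math.log2(i + 2) for i, s in enumerate(scores))

def ndcg_at_10(scores: list[float]) -> float:
    ideal = dcg(sorted(scores, reverse=True)[:10])
    return dcg(scores[:10]) / ideal if ideal > 0 else 0.0

# Example: a strong hit at rank 2 instead of rank 1 costs some nDCG.
print(ndcg_at_10([75, 100, 20, 65, 0, 0, 0, 0, 0, 0]))  # ≈ 0.93
```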

We used Gemini 3.0 Pro as the Judge. For reference, we also scored the same data with Claude Sonnet 4.5 (Amazon Bedrock) to compare between Judges.

Test Environment

| Item | Value |
|---|---|
| S3 Vectors indices | 8 (4 models × 2 languages) |
| Number of articles | About 60,000 per index |
| Search queries | 33 |
| Test patterns | 12 (7 same-language + 5 cross-lingual) |
| Searches | 180 |
| Judge | Gemini 3.0 Pro (main) / Claude Sonnet 4.5 (reference) |
| Region | Bedrock: us-east-1 / S3 Vectors: us-west-2 |

Results

Overall Ranking

[Figure: Overall nDCG@10]

| Rank | Model | nDCG@3 | nDCG@10 |
|---|---|---|---|
| 1 | E5-Large | 0.7922 | 0.7375 |
| 2 | Titan V2 | 0.7826 | 0.7157 |
| 3 | Nova Embed | 0.7438 | 0.6869 |
| 4 | Ruri-Base | 0.6600 | 0.6045 |

Overall, E5-Large ranked 1st, but the picture changes significantly when looking at individual test patterns.

Same-Language Search

[Figures: same-language radar chart and bar chart]

| Test | Content | Nova | Titan | E5 | Ruri | 1st Place |
|---|---|---|---|---|---|---|
| test1 | Basic (ja→ja) | 0.7415 | 0.8205 | 0.8711 | 0.6898 | E5 |
| test3 | Basic (en→en) | 0.8310 | 0.8529 | 0.8953 | 0.8108 | E5 |
| test4 | Single word (ja) | 0.6790 | 0.6868 | 0.7441 | 0.5775 | E5 |
| test5 | Single word (en) | 0.6254 | 0.8082 | 0.8116 | 0.5596 | E5 |
| test6 | Synonym (ja→ja) | 0.8956 | 0.7640 | 0.7510 | 0.7111 | Nova |
| test8 | Intent (ja→ja) | 0.7187 | 0.5586 | 0.6813 | 0.6022 | Nova |
| test10 | Niche (ja→ja) | 0.1576 | 0.5701 | 0.4846 | 0.3738 | Titan |

For basic keyword searches (tests 1, 3, 4, 5), E5-Large won across the board. Meanwhile, Nova dominated in synonym recognition (test6) and intent understanding (test8).

Cross-Lingual Search

[Figures: cross-lingual radar chart and bar chart]

| Test | Content | Nova | Titan | E5 | Ruri | 1st Place |
|---|---|---|---|---|---|---|
| test2 | Basic (en→ja) | 0.8302 | 0.7969 | 0.7904 | 0.6810 | Nova |
| test7 | Synonym (en→ja) | 0.8755 | 0.7837 | 0.7358 | 0.6642 | Nova |
| test9 | Intent (en→ja) | 0.6895 | 0.5746 | 0.6271 | 0.4453 | Nova |
| test11 | Niche (en→ja) | 0.3667 | 0.4224 | 0.4451 | 0.3504 | E5 |
| test12 | Basic (ja→en) | 0.7250 | 0.9511 | 0.9603 | 0.6529 | E5 |

For en→ja, Nova ranked 1st in 3 out of 4 patterns, showing outstanding ability to find Japanese articles from English queries.

However, in our additional ja→en test (test12: Japanese query → English index), the results were reversed. With the query "RAG ハルシネーション 対策" (RAG hallucination countermeasures), Nova's Top-6 were all unrelated articles (conference reports, anime introductions, etc.) resulting in nDCG@10 = 0.33, while Titan (0.95) and E5 (0.96) maintained stable accuracy. We discovered that Nova's cross-lingual capability has asymmetry between en→ja and ja→en.

Same-Language vs Cross-Lingual

[Figure: same-language vs cross-lingual comparison]

Comparing same-language and cross-lingual searches in the same category revealed interesting trends.

  • Nova: Held or even improved accuracy going from ja→ja to en→ja (Basic: 0.74→0.83, Synonym: 0.90→0.88). However, it collapsed on some ja→en queries, revealing an asymmetry in its cross-lingual capability
  • Titan / E5: Maintained stable accuracy for ja→en. Due to the higher information content in English summaries, ja→en sometimes scored higher than ja→ja
  • Ruri: Showed significant accuracy decline in cross-lingual searches (Intent: 0.60→0.45, react_hooks ja→en: nDCG=0.0)
  • Niche: All models struggled with both same-language and cross-lingual searches

Heatmap (All Patterns)

[Figure: nDCG heatmap across all 12 patterns]

Model Characteristics

Nova Embed: Strong in Intent Understanding and Synonyms, Weak in Niche Topics

Nova excelled in synonym recognition (test6: 0.8956) and intent understanding (test8: 0.7187), outperforming other models.

Here's a specific example of a synonym-type query (test6). We searched for "サーバーレス関数 タイムアウト 変更" (serverless function timeout change), which doesn't use AWS's official terminology.

Nova Embed (Top-3):

  1. Fixed an issue where Lambda functions execute multiple times by setting timeout and retry count → score: 100 ✅
  2. How to check if a Lambda function has timed out from its logs → score: 75
  3. Created a Lambda to recover Lambda functions that failed due to timeout → score: 65

Titan V2 (Top-3):

  1. How to change the execution timeout time of AWS Systems Manager RunCommand? → score: 75
  2. How to change the integration timeout in API Gateway → score: 95
  3. [New Feature] Changing timeout values for AWS Elastic Load Balancing (ELB) → score: 40

Nova understood semantically that "serverless function" means "Lambda," returning only Lambda-related articles in its Top-3. Titan matched the keywords "timeout" and "change" literally, mixing in non-Lambda services such as SSM and API Gateway.

Here's an example of intent understanding (test8). We input the natural language description of a problem situation: "コンテナが勝手に再起動を繰り返す" (containers repeatedly restart on their own).

Nova Embed (Top-3):

  1. Unintended restarts during AMI creation → score: 20
  2. Auto-restart with systemd during exceptions → score: 75
  3. ECS tasks with CannotPullContainerError during ECR image updates → score: 95 ✅

Titan V2 (Top-3):

  1. Unintended restarts during AMI creation → score: 30
  2. Handling EC2 maintenance requiring restart → score: 35
  3. How to keep Windows containers running in ECS → score: 75

Nova successfully included a high-scoring ECS-related article (score: 95) in its Top-3. Titan only matched on the words "restart" and "container," failing to bring articles that matched the intent to the top positions.

Conversely, for niche technology (test10: 0.1576), the positions were reversed. For the query "Projen を使ったプロジェクト管理" (Project management using Projen):

Nova Embed (Top-3):

  1. Backlog management using GitHub Projects (product/sprint) → score: 20
  2. How to manage Issues and sub-Issues appropriately using Project → score: 20
  3. Introducing project management to (non-engineer) staff work → score: 15

Titan V2 (Top-3):

  1. Regular AWS CDK App development became comfortable using projen → score: 95 ✅
  2. [Report + Try] projen – a CDK for software project configuration #CDK Day → score: 95 ✅
  3. Project Management #1 → score: 15

Nova was drawn to the general meaning of "project management" and failed to find any Projen-related articles. Titan accurately captured the proper noun "Projen" and successfully hit related articles in its Top-2. This is a good example where Nova's semantic understanding worked against it, and Titan complemented it.

Titan Embed V2: Stable Second Place

While Titan rarely has a standout strength, it placed 2nd or 3rd in most patterns. It took 1st place for niche technology (test10: 0.5701), where Nova failed, showing that it is a balanced model with few weaknesses.

OSS Models (E5-Large / Ruri-Base): High Basic Performance but Operational Challenges

E5-Large ranked 1st in Japanese basic search (test1: 0.8711), English basic search (test3: 0.8953), and single-word searches (test4/5). It's the most accurate model for textbook search queries. However, it fell behind Nova in synonym recognition and intent understanding.

Ruri-Base ranked last in 10 of the 12 patterns, but considering its handicap of having fewer dimensions (768) than the other models (1024), it performed reasonably well.

However, OSS models presented operational challenges. Real-time vectorization of search terms requires model loading and inference. It's difficult to achieve practical latency in CPU-only environments like Lambda, and keeping GPU instances running constantly isn't cost-effective.
While there was potential for using these models for similar article searches with pre-vectorized data, we decided against adopting OSS models since we had no plans for model fine-tuning or custom optimization, and instead focused on Bedrock's managed models.
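To illustrate the latency point, here is roughly what the query-side encoding looks like for E5 with sentence-transformers (the model ID is the public Hugging Face one; E5 expects a "query: " prefix at search time):

```python
# Why real-time OSS vectorization is hard to serve cheaply: the model must be
# loaded before the first query can be answered.
from sentence_transformers import SentenceTransformer

# Model load alone took ~58 s in our environment, which rules out
# cold-started, CPU-only Lambda functions.
model = SentenceTransformer("intfloat/multilingual-e5-large")

def embed_query(text: str) -> list[float]:
    # E5 models are trained with "query: " / "passage: " prefixes.
    return model.encode(f"query: {text}", normalize_embeddings=True).tolist()
```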

Vectorization Speed

[Figure: vectorization speed comparison]

| Model | Initialization | Inference (33 items) | Per item | Execution Environment |
|---|---|---|---|---|
| Titan V2 | 0.1 s | 5.0 s | 152 ms | Bedrock API |
| Nova Embed | 0.1 s | 8.1 s | 246 ms | Bedrock API |
| E5-Large | 57.8 s | 21.0 s | 635 ms | EC2 |
| Ruri-Base | 4.9 s | 87.0 s | 2,638 ms | EC2 |

Models via Bedrock API were fast with no initialization needed. OSS models required Docker container startup and model loading, with E5-Large taking about a minute for initialization, which was a drawback.

Reference: Sonnet's Second Opinion

To verify Judge bias, we also had Claude Sonnet 4.5 (Amazon Bedrock) score the same data.

[Figure: correlation between Gemini and Sonnet judge scores]

Gemini used the 5-level scale (0/20/50/80/100) as instructed, which clearly differentiated between models. Sonnet, however, used continuous values from 0-100 (69% of 1,640 evaluations used non-specified values), resulting in all models clustering between nDCG@10 = 0.90-0.95 with little differentiation.

| Test | Gemini 1st | Sonnet 1st | Match |
|---|---|---|---|
| test2 Cross-lingual basic | Nova | Nova | ✓ |
| test6 Synonyms | Nova | Nova | ✓ |
| test7 Synonyms (EN→JP) | Nova | Nova | ✓ |
| test8 Intent understanding | Nova | Nova | ✓ |
| test9 Intent (EN→JP) | Nova | Nova | ✓ |
| test10 Niche technology | Titan | Titan | ✓ |
| test1 Basic (JP→JP) | E5 | Titan | ✗ |
| test3 Basic (EN→EN) | E5 | Titan | ✗ |
| test4 Single word (JP) | E5 | Ruri | ✗ |
| test5 Single word (EN) | E5 | Titan | ✗ |
| test11 Niche (EN→JP) | E5 | Nova | ✗ |

The top-ranked models matched in 6/11 cases (55%). Patterns where Nova excelled (cross-lingual, synonyms, intent understanding) and where Titan excelled (niche technology) were consistent across both Judges, but basic searches and single-word searches varied between E5 and Titan. Since LLM-as-a-Judge rankings can change depending on the Judge selected, please consider these results as relative trends.

Conclusion

Bedrock Model Selection Guidelines

| Use Case | Recommended Model | Reason |
|---|---|---|
| Cross-lingual search (EN→JP) | Nova Embed | Highest accuracy for English query → Japanese content |
| Handling variants and synonyms | Nova Embed | Understands "serverless function" → "Lambda" |
| Natural language problem searches | Nova Embed | Highest intent understanding capability |
| Keyword search & niche technology | Titan V2 | Stable with few weaknesses, complements Nova's weak areas |

Overall Assessment

This comparison clearly revealed each model's strengths and weaknesses.

In terms of "overall score," the OSS model E5-Large ranked 1st, but for our semantic search platform, we prioritize use cases where users input their "problems" vaguely in natural language, rather than searches with clear keywords.
Therefore, instead of focusing on overall score, we valued strengths in "intent understanding (test8)" and "synonym recognition (test6)," and decided to adopt Nova Embed as our main model, as it's a Bedrock managed model that doesn't require infrastructure management like GPUs.

However, through testing, we also identified areas where Nova struggles, such as niche technologies and single-word keyword searches. For these areas, we're considering using Titan V2, which has few weaknesses and is stable, to complement Nova in a combined approach that leverages the strengths of each model.
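The exact combination logic is still under consideration; one simple option would be reciprocal-rank fusion (RRF) over the two indices, sketched below with hypothetical index and bucket names and the boto3 s3vectors QueryVectors API:

```python
# One possible shape for the combined approach: query both the Nova and Titan
# indices, then merge the result lists by reciprocal rank. RRF avoids comparing
# raw distances across different embedding models, which are not on the same scale.
import boto3

s3vectors = boto3.client("s3vectors", region_name="us-west-2")

def search_index(index_name: str, query_vec: list[float], top_k: int = 10) -> list[dict]:
    resp = s3vectors.query_vectors(
        vectorBucketName="blog-vectors",   # hypothetical bucket name
        indexName=index_name,
        queryVector={"float32": query_vec},
        topK=top_k,
    )
    return resp["vectors"]

def combined_search(nova_vec: list[float], titan_vec: list[float]) -> list[str]:
    """Reciprocal-rank fusion over the two result lists (k=60 is the usual constant)."""
    scores: dict[str, float] = {}
    for hits in (search_index("nova-ja", nova_vec), search_index("titan-v2-ja", titan_vec)):
        for rank, hit in enumerate(hits):
            scores[hit["key"]] = scores.get(hit["key"], 0.0) + 1.0 / (60 + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:10]
```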
