Understanding intention with Nova, stability with Titan. Comparing search accuracy of 4 Embedding models using LLM-as-a-Judge

We stored approximately 60,000 blog posts in Amazon S3 Vectors and thoroughly compared the search accuracy of four embedding models (Nova, Titan, and others) using LLM-as-a-Judge. This article introduces the distinct strengths and weaknesses of each model revealed by automated scoring, along with practical guidelines for choosing between them.
2026.02.17


In the previous article, we registered about 60,000 blog posts from DevelopersIO in Amazon S3 Vectors and built a serverless semantic search environment.

https://dev.classmethod.jp/articles/s3-vectors-bedrock-semantic-search/

This time, we used this environment to compare the search accuracy of four embedding models. We'll explain how we reached the conclusion to "use Nova Embed and Titan V2 in combination according to their strengths" by having an LLM-as-a-Judge automatically score each model's search results for identical queries.

Comparison Target Models

| Model | Provider | Dimensions | Execution Environment |
|---|---|---|---|
| Amazon Nova Embed | AWS (Bedrock) | 1024 | Bedrock API |
| Amazon Titan Embed V2 | AWS (Bedrock) | 1024 | Bedrock API |
| multilingual-e5-large | Microsoft (OSS) | 1024 | EC2 |
| ruri-base | Hugging Face (OSS) | 768 | EC2 |

We compared two Bedrock managed models and two OSS models.

For the approximately 60,000 blog posts mentioned above, we vectorized Japanese summaries of about 165 characters and English summaries of about 60 words with each model, and registered them as separate indices in S3 Vectors (4 models × 2 languages = 8 indices).
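As a reference, here is a minimal sketch of what the registration step might look like for one model/language pair using boto3. The bucket and index names are hypothetical; the Titan V2 model ID and the general `put_vectors` shape follow AWS's public documentation.

```python
# Minimal sketch: embed a summary with Titan V2 and store it in S3 Vectors.
# Bucket/index names below are hypothetical placeholders.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
s3vectors = boto3.client("s3vectors", region_name="us-west-2")

def embed_titan_v2(text: str) -> list[float]:
    """Vectorize a summary with Titan Embed V2 (1024 dimensions)."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text, "dimensions": 1024}),
    )
    return json.loads(resp["body"].read())["embedding"]

def register_article(article_id: str, summary_ja: str) -> None:
    """Store one article's Japanese-summary vector in its model/language index."""
    s3vectors.put_vectors(
        vectorBucketName="blog-vectors",   # hypothetical bucket name
        indexName="titan-v2-ja",           # one of the 8 model x language indices
        vectors=[{
            "key": article_id,
            "data": {"float32": embed_titan_v2(summary_ja)},
            "metadata": {"lang": "ja"},
        }],
    )
```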

Evaluation Method

Search Queries (33 queries across 12 patterns)

Assuming actual use cases, we prepared the following 12 patterns. These are broadly divided into "same-language search," where the query and search target are in the same language, and "cross-lingual search," which searches across different languages.

Same-Language Search:

| # | Test Pattern | Query Example | Purpose |
|---|---|---|---|
| 1 | Japanese→Japanese | "Lambda コールドスタート 対策" | Basic performance |
| 3 | English→English | "RAG hallucination prevention" | English basic performance |
| 4 | Single word→Japanese | "React", "Lambda" | Noise resistance |
| 5 | Single word→English | "S3", "CDK" | Noise resistance (English) |
| 6 | Synonym→Japanese | "サーバーレス関数 タイムアウト 変更" | Synonym understanding |
| 8 | Intent/Issue→Japanese | "コンテナが勝手に再起動を繰り返す" | Intent understanding |
| 10 | Niche tech→Japanese | "Projen を使ったプロジェクト管理" | Low-frequency topics |

Cross-Lingual Search:

| # | Test Pattern | Query Example | Purpose |
|---|---|---|---|
| 2 | English→Japanese | "Lambda cold start mitigation" | Cross-lingual basic |
| 7 | Synonym (EN→JP) | "Increase Lambda execution time" | Cross-lingual synonyms |
| 9 | Intent/Issue (EN→JP) | "ECS Fargate tasks restarting loop" | Cross-lingual intent understanding |
| 11 | Niche tech (EN→JP) | "Benefits of using IAM Roles Anywhere" | Cross-lingual low-frequency |
| 12 | Japanese→English | "Lambda コールドスタート 対策" | Reverse cross-lingual |

Evaluation Metric: nDCG@10 × LLM-as-a-Judge

We passed the Top-10 results of each search to an LLM for scoring on a scale of 0-100. To improve judgment accuracy, we had the LLM read not just the article titles but also the RAG summaries in both Japanese and English.

| Score | Meaning |
|---|---|
| 100 | Contains a direct answer/solution to the query |
| 80 | Closely related and very useful |
| 50 | Topic is correct but falls short of addressing the intent |
| 20 | Contains the same words but different content |
| 0 | Unrelated |
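For illustration, here is a minimal sketch of how each Top-10 result might be scored, with the rubric above embedded in the prompt. `call_judge` is a hypothetical stand-in for whichever LLM client is used (Gemini 3.0 Pro in the main run, Claude Sonnet 4.5 for reference).

```python
# Minimal sketch of the per-result judge prompt; mirrors the rubric table above.
RUBRIC = """Score the search result for the query on this scale:
100: contains a direct answer/solution to the query
 80: closely related and very useful
 50: topic is correct but falls short of addressing the intent
 20: contains the same words but different content
  0: unrelated
Answer with the score only."""

def build_judge_prompt(query: str, title: str, summary_ja: str, summary_en: str) -> str:
    # The judge reads the title plus both RAG summaries, not just the title.
    return (
        f"{RUBRIC}\n\n"
        f"Query: {query}\n"
        f"Title: {title}\n"
        f"Japanese summary: {summary_ja}\n"
        f"English summary: {summary_en}\n"
    )

def score_top10(query: str, results: list[dict], call_judge) -> list[int]:
    """Score each Top-10 result; call_judge(prompt) -> str is a stand-in client."""
    return [
        int(call_judge(build_judge_prompt(query, r["title"], r["summary_ja"], r["summary_en"])))
        for r in results[:10]
    ]
```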

We calculated nDCG (Normalized Discounted Cumulative Gain) from the scores. This metric is higher when better results appear at the top, with 1.0 being ideal.
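As a concrete reference, here is a minimal sketch of the computation, assuming linear gains over the 0-100 judge scores and an ideal ranking taken over the same Top-10 scores:

```python
# nDCG@10 from judge scores: DCG discounts each score by log2(rank + 1),
# then normalizes by the DCG of the ideal (descending) ordering.
import math

def dcg(scores: list[float]) -> float:
    # Rank 1 gets divisor log2(2) = 1, rank 2 gets log2(3), and so on.
    return sum(s / math.log2(i + 2) for i, s in enumerate(scores))

def ndcg_at_10(scores: list[float]) -> float:
    ideal = dcg(sorted(scores, reverse=True)[:10])
    return dcg(scores[:10]) / ideal if ideal > 0 else 0.0

# Example: a strong hit at rank 2 instead of rank 1 costs some nDCG.
print(ndcg_at_10([75, 100, 20, 65, 0, 0, 0, 0, 0, 0]))  # ≈ 0.93
```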

We used Gemini 3.0 Pro as the Judge. For reference, we also scored the same data with Claude Sonnet 4.5 (Amazon Bedrock) to compare between Judges.

Test Environment

| Item | Value |
|---|---|
| S3 Vectors indices | 8 (4 models × 2 languages) |
| Number of articles | About 60,000 per index |
| Search queries | 33 |
| Test patterns | 12 (7 same-language + 5 cross-lingual) |
| Searches | 180 |
| Judge | Gemini 3.0 Pro (main) / Claude Sonnet 4.5 (reference) |
| Region | Bedrock: us-east-1 / S3 Vectors: us-west-2 |

Results

Overall Ranking

[Figure: Overall nDCG@10]

| Rank | Model | nDCG@3 | nDCG@10 |
|---|---|---|---|
| 1 | E5-Large | 0.7922 | 0.7375 |
| 2 | Titan V2 | 0.7826 | 0.7157 |
| 3 | Nova Embed | 0.7438 | 0.6869 |
| 4 | Ruri-Base | 0.6600 | 0.6045 |

Overall, E5-Large ranked 1st, but the picture changes significantly when looking at individual test patterns.

Same-Language Search

[Figures: same-language radar chart and bar chart]

| Test | Content | Nova | Titan | E5 | Ruri | 1st Place |
|---|---|---|---|---|---|---|
| test1 | Basic (ja→ja) | 0.7415 | 0.8205 | 0.8711 | 0.6898 | E5 |
| test3 | Basic (en→en) | 0.8310 | 0.8529 | 0.8953 | 0.8108 | E5 |
| test4 | Single word (ja) | 0.6790 | 0.6868 | 0.7441 | 0.5775 | E5 |
| test5 | Single word (en) | 0.6254 | 0.8082 | 0.8116 | 0.5596 | E5 |
| test6 | Synonym (ja→ja) | 0.8956 | 0.7640 | 0.7510 | 0.7111 | Nova |
| test8 | Intent (ja→ja) | 0.7187 | 0.5586 | 0.6813 | 0.6022 | Nova |
| test10 | Niche (ja→ja) | 0.1576 | 0.5701 | 0.4846 | 0.3738 | Titan |

For basic keyword searches (tests 1, 3, 4, 5), E5-Large won across the board. Meanwhile, Nova dominated in synonym recognition (test6) and intent understanding (test8).

Cross-Lingual Search

[Figures: cross-lingual radar chart and bar chart]

| Test | Content | Nova | Titan | E5 | Ruri | 1st Place |
|---|---|---|---|---|---|---|
| test2 | Basic (en→ja) | 0.8302 | 0.7969 | 0.7904 | 0.6810 | Nova |
| test7 | Synonym (en→ja) | 0.8755 | 0.7837 | 0.7358 | 0.6642 | Nova |
| test9 | Intent (en→ja) | 0.6895 | 0.5746 | 0.6271 | 0.4453 | Nova |
| test11 | Niche (en→ja) | 0.3667 | 0.4224 | 0.4451 | 0.3504 | E5 |
| test12 | Basic (ja→en) | 0.7250 | 0.9511 | 0.9603 | 0.6529 | E5 |

For en→ja, Nova ranked 1st in 3 out of 4 patterns, showing outstanding ability to find Japanese articles from English queries.

However, in our additional ja→en test (test12: Japanese query → English index), the results were reversed. With the query "RAG ハルシネーション 対策" (RAG hallucination countermeasures), Nova's Top-6 were all unrelated articles (conference reports, anime introductions, etc.) resulting in nDCG@10 = 0.33, while Titan (0.95) and E5 (0.96) maintained stable accuracy. We discovered that Nova's cross-lingual capability has asymmetry between en→ja and ja→en.

Same-Language vs Cross-Lingual

[Figure: same-language vs cross-lingual comparison]

Comparing same-language and cross-lingual searches in the same category revealed interesting trends.

  • Nova: Held or even improved accuracy going from ja→ja to en→ja (Basic: 0.74→0.83, Synonym: 0.90→0.88). However, it collapsed on some ja→en queries, revealing an asymmetry in its cross-lingual capability
  • Titan / E5: Maintained stable accuracy for ja→en. Due to the higher information content in English summaries, ja→en sometimes scored higher than ja→ja
  • Ruri: Showed significant accuracy decline in cross-lingual searches (Intent: 0.60→0.45, react_hooks ja→en: nDCG=0.0)
  • Niche: All models struggled with both same-language and cross-lingual searches

Heatmap (All Patterns)

[Figure: nDCG heatmap across all 12 patterns]

Model Characteristics

Nova Embed: Strong in Intent Understanding and Synonyms, Weak in Niche Topics

Nova excelled in synonym recognition (test6: 0.8956) and intent understanding (test8: 0.7187), outperforming other models.

Here's a specific example of a synonym-type query (test6). We searched for "サーバーレス関数 タイムアウト 変更" (serverless function timeout change), which doesn't use AWS's official terminology.

Nova Embed (Top-3):

  1. Fixed an issue where Lambda functions execute multiple times by setting timeout and retry count → score: 100 ✅
  2. How to check if a Lambda function has timed out from its logs → score: 75
  3. Created a Lambda to recover Lambda functions that failed due to timeout → score: 65

Titan V2 (Top-3):

  1. How to change the execution timeout time of AWS Systems Manager RunCommand? → score: 75
  2. How to change the integration timeout in API Gateway → score: 95
  3. [New Feature] Changing timeout values for AWS Elastic Load Balancing (ELB) → score: 40

Nova understood semantically that "serverless function" means "Lambda," returning only Lambda-related articles in its Top-3. Titan matched the keywords "timeout" and "change" literally, mixing in non-Lambda services such as SSM and API Gateway.

Here's an example of intent understanding (test8). We input the natural language description of a problem situation: "コンテナが勝手に再起動を繰り返す" (containers repeatedly restart on their own).

Nova Embed (Top-3):

  1. Unintended restarts during AMI creation → score: 20
  2. Auto-restart with systemd during exceptions → score: 75
  3. ECS tasks with CannotPullContainerError during ECR image updates → score: 95 ✅

Titan V2 (Top-3):

  1. Unintended restarts during AMI creation → score: 30
  2. Handling EC2 maintenance requiring restart → score: 35
  3. How to keep Windows containers running in ECS → score: 75

Nova successfully included a high-scoring ECS-related article (score: 95) in its Top-3. Titan only matched on the words "restart" and "container," failing to bring articles that matched the intent to the top positions.

Conversely, for niche technology (test10: 0.1576), the positions were reversed. For the query "Projen を使ったプロジェクト管理" (Project management using Projen):

Nova Embed (Top-3):

  1. Backlog management using GitHub Projects (product/sprint) → score: 20
  2. How to manage Issues and sub-Issues appropriately using Project → score: 20
  3. Introducing project management to (non-engineer) staff work → score: 15

Titan V2 (Top-3):

  1. Regular AWS CDK App development became comfortable using projen → score: 95 ✅
  2. [Report + Try] projen – a CDK for software project configuration #CDK Day → score: 95 ✅
  3. Project Management #1 → score: 15

Nova was drawn to the general meaning of "project management" and failed to find any Projen-related articles. Titan accurately captured the proper noun "Projen" and successfully hit related articles in its Top-2. This is a good example where Nova's semantic understanding worked against it, and Titan complemented it.

Titan Embed V2: Stable Second Place

While Titan rarely has a standout strength, it placed 2nd or 3rd in most patterns. It took 1st place for niche technology (test10: 0.5701), where Nova failed, showing that it is a balanced model with few weaknesses.

OSS Models (E5-Large / Ruri-Base): High Basic Performance but Operational Challenges

E5-Large ranked 1st in Japanese basic search (test1: 0.8711), English basic search (test3: 0.8953), and single-word searches (test4/5). It's the most accurate model for textbook search queries. However, it fell behind Nova in synonym recognition and intent understanding.

Ruri-Base ranked last in 10 of the 12 patterns, but considering its handicap of having fewer dimensions (768) than the other models (1024), it performed reasonably well.

However, OSS models presented operational challenges. Real-time vectorization of search terms requires model loading and inference. It's difficult to achieve practical latency in CPU-only environments like Lambda, and keeping GPU instances running constantly isn't cost-effective.
While there was potential for using these models for similar article searches with pre-vectorized data, we decided against adopting OSS models since we had no plans for model fine-tuning or custom optimization, and instead focused on Bedrock's managed models.
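To illustrate the latency point, here is roughly what the query-side encoding looks like for E5 with sentence-transformers (the model ID is the public Hugging Face one; E5 expects a "query: " prefix at search time):

```python
# Why real-time OSS vectorization is hard to serve cheaply: the model must be
# loaded before the first query can be answered.
from sentence_transformers import SentenceTransformer

# Model load alone took ~58 s in our environment, which rules out
# cold-started, CPU-only Lambda functions.
model = SentenceTransformer("intfloat/multilingual-e5-large")

def embed_query(text: str) -> list[float]:
    # E5 models are trained with "query: " / "passage: " prefixes.
    return model.encode(f"query: {text}", normalize_embeddings=True).tolist()
```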

Vectorization Speed

[Figure: vectorization speed comparison]

| Model | Initialization | Inference (33 items) | Per item | Execution Environment |
|---|---|---|---|---|
| Titan V2 | 0.1 s | 5.0 s | 152 ms | Bedrock API |
| Nova Embed | 0.1 s | 8.1 s | 246 ms | Bedrock API |
| E5-Large | 57.8 s | 21.0 s | 635 ms | EC2 |
| Ruri-Base | 4.9 s | 87.0 s | 2,638 ms | EC2 |

Models via Bedrock API were fast with no initialization needed. OSS models required Docker container startup and model loading, with E5-Large taking about a minute for initialization, which was a drawback.

Reference: Sonnet's Second Opinion

To verify Judge bias, we also had Claude Sonnet 4.5 (Amazon Bedrock) score the same data.

[Figure: correlation between Gemini and Sonnet judge scores]

Gemini used the 5-level scale (0/20/50/80/100) as instructed, which clearly differentiated between models. Sonnet, however, used continuous values from 0-100 (69% of 1,640 evaluations used non-specified values), resulting in all models clustering between nDCG@10 = 0.90-0.95 with little differentiation.

| Test | Gemini 1st | Sonnet 1st | Match |
|---|---|---|---|
| test2 Cross-lingual basic | Nova | Nova | ✓ |
| test6 Synonyms | Nova | Nova | ✓ |
| test7 Synonyms (EN→JP) | Nova | Nova | ✓ |
| test8 Intent understanding | Nova | Nova | ✓ |
| test9 Intent (EN→JP) | Nova | Nova | ✓ |
| test10 Niche technology | Titan | Titan | ✓ |
| test1 Basic (JP→JP) | E5 | Titan | ✗ |
| test3 Basic (EN→EN) | E5 | Titan | ✗ |
| test4 Single word (JP) | E5 | Ruri | ✗ |
| test5 Single word (EN) | E5 | Titan | ✗ |
| test11 Niche (EN→JP) | E5 | Nova | ✗ |

The top-ranked models matched in 6/11 cases (55%). Patterns where Nova excelled (cross-lingual, synonyms, intent understanding) and where Titan excelled (niche technology) were consistent across both Judges, but basic searches and single-word searches varied between E5 and Titan. Since LLM-as-a-Judge rankings can change depending on the Judge selected, please consider these results as relative trends.

Conclusion

Bedrock Model Selection Guidelines

| Use Case | Recommended Model | Reason |
|---|---|---|
| Cross-lingual search (EN→JP) | Nova Embed | Highest accuracy for English query → Japanese content |
| Handling variants and synonyms | Nova Embed | Understands "serverless function" → "Lambda" |
| Natural language problem searches | Nova Embed | Highest intent understanding capability |
| Keyword search & niche technology | Titan V2 | Stable with few weaknesses, complements Nova's weak areas |

Overall Assessment

This comparison clearly revealed each model's strengths and weaknesses.

In terms of "overall score," the OSS model E5-Large ranked 1st, but for our semantic search platform, we prioritize use cases where users input their "problems" vaguely in natural language, rather than searches with clear keywords.
Therefore, instead of focusing on overall score, we valued strengths in "intent understanding (test8)" and "synonym recognition (test6)," and decided to adopt Nova Embed as our main model, as it's a Bedrock managed model that doesn't require infrastructure management like GPUs.

However, through testing, we also identified areas where Nova struggles, such as niche technologies and single-word keyword searches. For these areas, we're considering using Titan V2, which has few weaknesses and is stable, to complement Nova in a combined approach that leverages the strengths of each model.
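The exact combination logic is still under consideration; one simple option would be reciprocal-rank fusion (RRF) over the two indices, sketched below with hypothetical index and bucket names and the boto3 s3vectors QueryVectors API:

```python
# One possible shape for the combined approach: query both the Nova and Titan
# indices, then merge the result lists by reciprocal rank. RRF avoids comparing
# raw distances across different embedding models, which are not on the same scale.
import boto3

s3vectors = boto3.client("s3vectors", region_name="us-west-2")

def search_index(index_name: str, query_vec: list[float], top_k: int = 10) -> list[dict]:
    resp = s3vectors.query_vectors(
        vectorBucketName="blog-vectors",   # hypothetical bucket name
        indexName=index_name,
        queryVector={"float32": query_vec},
        topK=top_k,
    )
    return resp["vectors"]

def combined_search(nova_vec: list[float], titan_vec: list[float]) -> list[str]:
    """Reciprocal-rank fusion over the two result lists (k=60 is the usual constant)."""
    scores: dict[str, float] = {}
    for hits in (search_index("nova-ja", nova_vec), search_index("titan-v2-ja", titan_vec)):
        for rank, hit in enumerate(hits):
            scores[hit["key"]] = scores.get(hit["key"], 0.0) + 1.0 / (60 + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:10]
```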
