Investigating when a "CSV (for FAQ data)" data store is searched semantically in Vertex AI Search
Background
After uploading FAQ data to a Vertex AI Search structured data store (CSV) and trying to search it, I found the search accuracy was poor, and queries with different phrasing did not yield appropriate results.
For example, I wanted the query "Is it a problem to create multiple accounts?" to match an FAQ like "Is it against the terms to have multiple accounts per person?", but it didn't return this result.
When checking the search API response, I found semanticState: DISABLED was returned. It seems semantic search was disabled, and only keyword matching was being performed.
To solve this problem, I tested several approaches.
Test Environment
- Enterprise edition: Enabled (SEARCH_TIER_ENTERPRISE, SEARCH_ADD_ON_LLM)
- API: Discovery Engine v1alpha
- Data: FAQ data (48 entries, title/question/answer)
Test 1: Structured Data Store (CSV)
Procedure
In the console's data store creation screen, there's an option for "CSV (for FAQ data)".

I used this to import a CSV from Cloud Storage as a structured data store. The CSV has 3 columns: title, question, and answer.
title,answer,question
Multiple account usage,We ask that you use 1 account per person...,Is it against the terms to have multiple accounts per person?
WAF blocking,Your IP address was incorrectly identified by the automated threat detection service...,My posting was blocked after continuous submissions...
...(total of 48 entries)
The configuration of the created data store is as follows:
{
"name": "projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}",
"displayName": "faq-csv-structured",
"industryVertical": "GENERIC",
"createTime": "2026-03-13T08:17:53.671221Z",
"solutionTypes": [
"SOLUTION_TYPE_SEARCH"
],
"defaultSchemaId": "default_schema",
"languageInfo": {
"languageCode": "ja",
"normalizedLanguageCode": "ja",
"language": "ja"
},
"billingEstimation": {
"structuredDataSize": "26472",
"structuredDataUpdateTime": "2026-03-16T04:56:45.813085130Z"
},
"documentProcessingConfig": {
"name": "projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}/documentProcessingConfig",
"defaultParsingConfig": {
"layoutParsingConfig": {}
}
},
"servingConfigDataStore": {},
"naturalLanguageQueryUnderstandingConfig": {
"mode": "ENABLED"
},
"federatedSearchConfig": {}
}
- industryVertical: GENERIC indicates a general industry category, not one specialized for media, healthcare, etc.
- solutionTypes: SOLUTION_TYPE_SEARCH marks this as a data store for search purposes.
- naturalLanguageQueryUnderstandingConfig: ENABLED means the natural language query interpretation feature is on. (However, this doesn't necessarily enable semantic search.)
The schema automatically defined three fields corresponding to the CSV columns. keyPropertyMapping is set with FAQ CSV-specific title, question, and answer.
{
"structSchema": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"title": {
"type": "string",
"keyPropertyMapping": "title",
"retrievable": true
},
"question": {
"type": "string",
"keyPropertyMapping": "question",
"retrievable": true
},
"answer": {
"type": "string",
"keyPropertyMapping": "answer",
"retrievable": true
}
}
},
"fieldConfigs": [
{"fieldPath": "title", "fieldType": "STRING", "keyPropertyType": "TITLE"},
{"fieldPath": "question", "fieldType": "STRING", "keyPropertyType": "QUESTION"},
{"fieldPath": "answer", "fieldType": "STRING", "keyPropertyType": "ANSWER"}
]
}
Search Request
I sent a standard search request specifying this structured data store.
$ curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://discoveryengine.googleapis.com/v1alpha/projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/engines/{ENGINE_ID}/servingConfigs/default_search:search" \
-d '{
"query": "Is it a problem to create multiple accounts?",
"dataStoreSpecs": [{
"dataStore": "projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}"
}]
}'
{
"attributionToken": "...",
"guidedSearchResult": {},
"summary": {},
"queryExpansionInfo": {},
"semanticState": "DISABLED"
}
semanticState: DISABLED was returned. Even with Enterprise edition enabled, semantic search was not activated.
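The same check can be scripted against the parsed JSON response. The following is a minimal sketch: the helper name `semantic_state` is hypothetical, and treating a missing field as "UNSPECIFIED" is an assumption of this sketch rather than documented API behavior.

```python
import json


def semantic_state(response: dict) -> str:
    """Return the semanticState reported by a search response.

    Falls back to "UNSPECIFIED" when the field is absent (an assumption
    of this sketch, not documented API behavior).
    """
    return response.get("semanticState", "UNSPECIFIED")


# Response body from Test 1, abbreviated as in the article.
raw = '{"attributionToken": "...", "summary": {}, "semanticState": "DISABLED"}'
state = semantic_state(json.loads(raw))
print(state)
if state != "ENABLED":
    print("Semantic search is not active for this request.")
```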
While researching further, I found an interesting description in the Dialogflow CX documentation.
Reference: Dialogflow CX - Data store agents
Note: CSV files can also be imported as unstructured content. (...) The matching requirements are less strict compared to FAQ CSV data stores, and answers might be rewritten by the agent rather than returned verbatim.
This suggests that the same CSV data might have different matching behaviors depending on the import method (FAQ structured vs. unstructured).
Based on this, I decided to proceed with testing importing the CSV as an unstructured data store rather than directly as a structured data store.
Test 2: Unstructured Data Store (HTML Conversion)
Procedure
This time, I imported it as unstructured "documents".

I converted each CSV row (48 entries) into individual HTML files.
<!DOCTYPE html>
<html>
<head><title>Multiple account usage</title></head>
<body>
<h1>Multiple account usage</h1>
<h2>Question</h2>
<p>Is it against the terms to have multiple accounts per person?</p>
<h2>Answer</h2>
<p>...</p>
</body>
</html>
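The conversion step can be sketched as follows. This is an illustrative script, not the one actually used for the test: the function names `row_to_html` and `convert` are hypothetical, and the zero-padded file naming (001.html, 002.html, ...) is an assumption chosen to resemble the file names seen in the search results.

```python
import csv
import html
import io
import tempfile
from pathlib import Path

# Page layout matching the HTML sample above.
TEMPLATE = """<!DOCTYPE html>
<html>
<head><title>{title}</title></head>
<body>
<h1>{title}</h1>
<h2>Question</h2>
<p>{question}</p>
<h2>Answer</h2>
<p>{answer}</p>
</body>
</html>
"""


def row_to_html(row: dict) -> str:
    """Render one FAQ row as an HTML page, escaping the cell text."""
    return TEMPLATE.format(
        title=html.escape(row["title"]),
        question=html.escape(row["question"]),
        answer=html.escape(row["answer"]),
    )


def convert(csv_text: str, out_dir: Path) -> list:
    """Write one HTML file per CSV row and return the written paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    # DictReader keys each cell by header name, so column order is irrelevant.
    reader = csv.DictReader(io.StringIO(csv_text))
    for i, row in enumerate(reader, start=1):
        path = out_dir / f"{i:03d}.html"
        path.write_text(row_to_html(row), encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    sample = (
        "title,answer,question\n"
        "Multiple account usage,We ask that you use 1 account per person...,"
        "Is it against the terms to have multiple accounts per person?\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        for written in convert(sample, Path(tmp)):
            print(written.name)
```

Reading by header name means the title,answer,question column order of the CSV does not affect the output.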
I uploaded the HTML files to a GCS bucket, created an unstructured data store (Cloud Storage) from the console, and connected it to an engine.
The configuration of the created data store is as follows:
{
"name": "projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}",
"displayName": "faq-html-unstructured",
"industryVertical": "GENERIC",
"createTime": "2026-03-16T01:22:14.863267Z",
"solutionTypes": [
"SOLUTION_TYPE_SEARCH"
],
"contentConfig": "CONTENT_REQUIRED",
"defaultSchemaId": "default_schema",
"billingEstimation": {
"unstructuredDataSize": "34337",
"unstructuredDataUpdateTime": "2026-03-16T02:40:36.968677615Z"
},
"documentProcessingConfig": {
"name": "projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}/documentProcessingConfig",
"chunkingConfig": {
"layoutBasedChunkingConfig": {
"chunkSize": 500
}
},
"defaultParsingConfig": {
"layoutParsingConfig": {}
}
},
"servingConfigDataStore": {}
}
- contentConfig: CONTENT_REQUIRED means content is required for unstructured data stores. This setting wasn't present in the structured data store from Test 1.
- chunkingConfig shows that layout-based chunking is enabled (chunk size 500). In unstructured data stores, documents are automatically split into chunks.
- layoutParsingConfig shows that the layout parser is enabled. It analyzes HTML structure and reflects it in the index.
Search Request
I searched with the same query as in Test 1.
$ curl -s -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://discoveryengine.googleapis.com/v1alpha/projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/engines/{ENGINE_ID}/servingConfigs/default_search:search" \
-d '{
"query": "Is it a problem to create multiple accounts?",
"dataStoreSpecs": [{
"dataStore": "projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}"
}]
}'
{
"results": [
{
"id": "a72d4a6a75ef09d25ad8d011f0c9cc33",
"document": {
"name": "projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}/branches/0/documents/a72d4a6a75ef09d25ad8d011f0c9cc33",
"id": "a72d4a6a75ef09d25ad8d011f0c9cc33",
"derivedStructData": {
"link": "gs://{BUCKET}/faq_html/037.html",
"title": "Multiple account usage",
"snippets": [
{
"snippet_status": "SUCCESS",
"snippet": "... I'd like to <b>create</b> a company <b>account</b> separate from my personal <b>account</b>. Is having <b>multiple accounts</b> against the terms? ## Answer Having <b>multiple accounts</b> is not <b>a problem</b> according to the terms. However, later ..."
}
]
}
},
"modelScores": {
"relevance_score": { "values": [1] }
},
"rankSignals": {
"keywordSimilarityScore": 3.2140481,
"relevanceScore": 0.99168247,
"semanticSimilarityScore": 0.81432873,
"topicalityRank": 1,
"documentAge": 492680.63,
"boostingFactor": 0,
"defaultRank": 1
}
},
...
],
"totalSize": 3,
"attributionToken": "...",
"guidedSearchResult": {},
"summary": {},
"queryExpansionInfo": {},
"semanticState": "ENABLED"
}
semanticState: ENABLED was achieved.
The top result "Multiple account usage" has semanticSimilarityScore: 0.814 and relevance_score: 1 (highest score), showing that it properly matches even with different phrasing. By just changing to an unstructured data store, semantic search started working for queries that didn't return results in Test 1.
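To compare results across data stores, it can help to pull the title and semantic score out of each result. The sketch below assumes the response shape shown above; the helper name `summarize_results` is hypothetical, and the fallback from derivedStructData to structData to the document id is a choice made for this sketch.

```python
from typing import Optional


def summarize_results(response: dict) -> list:
    """Return (title, semanticSimilarityScore) pairs in ranked order.

    The score is None when rankSignals carries no semantic score.
    """
    rows = []
    for result in response.get("results", []):
        doc = result.get("document", {})
        # Unstructured stores expose the title under derivedStructData,
        # structured stores under structData; fall back to the document id.
        title = (
            doc.get("derivedStructData", {}).get("title")
            or doc.get("structData", {}).get("title")
            or doc.get("id", "")
        )
        score: Optional[float] = result.get("rankSignals", {}).get(
            "semanticSimilarityScore"
        )
        rows.append((title, score))
    return rows


# Abbreviated copy of the Test 2 response shown above.
response = {
    "results": [
        {
            "id": "a72d4a6a75ef09d25ad8d011f0c9cc33",
            "document": {
                "derivedStructData": {"title": "Multiple account usage"}
            },
            "rankSignals": {
                "keywordSimilarityScore": 3.2140481,
                "semanticSimilarityScore": 0.81432873,
            },
        }
    ],
    "semanticState": "ENABLED",
}

for title, score in summarize_results(response):
    print(title, score)
```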
This is the simplest solution, but I also tried another approach.
Test 3: Structured Data Store with Custom Embeddings
The official documentation mentions that adding custom embeddings to structured data enables vector search. I tested whether this method would also set semanticState to ENABLED.
Reference: Bring your own embeddings
Procedure
- Generated embeddings (768-dimensional) for each FAQ document (question + answer) using the gemini-embedding-001 model
- Created a JSONL file containing the embeddings
{"id": "faq-001", "structData": {"title": "...", "question": "...", "answer": "...", "embedding_vector": [0.1, 0.2, ...]}}
- Created a data store via API (contentConfig: NO_CONTENT)
- Set keyPropertyMapping: "embedding_vector" and dimension: 768 for the embedding_vector field in the schema
- Imported the data (48 successful entries)
- Connected to an engine
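The JSONL creation step might look like the sketch below. It is not the script used for the test: `build_jsonl_lines` and `fake_embed` are hypothetical names, `fake_embed` is a placeholder that returns a dummy vector and should be replaced with a real call to the embedding model, and joining question and answer with a newline is an assumption about how the two fields were combined.

```python
import json
from typing import Callable, Iterable, Iterator, List


def build_jsonl_lines(
    faqs: Iterable[dict],
    embed: Callable[[str], List[float]],
    dimension: int = 768,
) -> Iterator[str]:
    """Yield one JSONL line per FAQ entry in the import format shown above."""
    for i, faq in enumerate(faqs, start=1):
        # Assumption: question and answer are embedded together as one text.
        vector = embed(f"{faq['question']}\n{faq['answer']}")
        if len(vector) != dimension:
            raise ValueError(
                f"expected {dimension} dimensions, got {len(vector)}"
            )
        record = {
            "id": f"faq-{i:03d}",
            "structData": {
                "title": faq["title"],
                "question": faq["question"],
                "answer": faq["answer"],
                "embedding_vector": list(vector),
            },
        }
        # ensure_ascii=False keeps Japanese text readable in the file.
        yield json.dumps(record, ensure_ascii=False)


def fake_embed(text: str) -> List[float]:
    """Placeholder embedder: swap in a real embedding model call here."""
    return [0.0] * 768


faqs = [
    {
        "title": "Multiple account usage",
        "question": "Is it against the terms to have multiple accounts per person?",
        "answer": "We ask that you use 1 account per person...",
    }
]
lines = list(build_jsonl_lines(faqs, fake_embed))
print(len(lines), "line(s) ready to write")
```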
The configuration of the created data store is as follows:
{
"name": "projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}",
"displayName": "faq-custom-embeddings",
"industryVertical": "GENERIC",
"createTime": "2026-03-16T04:14:49.765248Z",
"solutionTypes": [
"SOLUTION_TYPE_SEARCH"
],
"contentConfig": "NO_CONTENT",
"defaultSchemaId": "default_schema",
"billingEstimation": {
"structuredDataSize": "829736",
"structuredDataUpdateTime": "2026-03-16T04:56:45.813085130Z"
},
"servingConfigDataStore": {},
"naturalLanguageQueryUnderstandingConfig": {
"mode": "ENABLED"
}
}
The schema is as follows:
{
"structSchema": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"title": {
"type": "string",
"retrievable": true,
"keyPropertyMapping": "title"
},
"question": {
"type": "string",
"retrievable": true
},
"answer": {
"type": "string",
"retrievable": true
},
"embedding_vector": {
"type": "array",
"items": { "type": "number" },
"keyPropertyMapping": "embedding_vector",
"dimension": 768
}
}
},
"fieldConfigs": [
{"fieldPath": "embedding_vector", "fieldType": "NUMBER", "keyPropertyType": "EMBEDDING_VECTOR"},
{"fieldPath": "title", "fieldType": "STRING", "keyPropertyType": "TITLE"},
{"fieldPath": "question", "fieldType": "STRING"},
{"fieldPath": "answer", "fieldType": "STRING"}
]
}
- contentConfig: NO_CONTENT is required for importing documents with only structData. Using CONTENT_REQUIRED would result in import errors.
- The embedding_vector field is set with keyPropertyMapping: "embedding_vector" and dimension: 768. This setting must be configured immediately after creating the data store (before importing), as it cannot be added when existing documents are present.
- Unlike Test 1, question and answer have no keyPropertyMapping set. This is because it's created as a standard structured data store, not as an FAQ CSV.
Search Request
Following the official documentation, I tried passing the query's embedding vector in embeddingSpec and using ranking_expression to incorporate vector similarity into ranking.
$ curl -s -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://discoveryengine.googleapis.com/v1alpha/projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/engines/{ENGINE_ID}/servingConfigs/default_search:search" \
-d '{
"query": "Is it a problem to create multiple accounts?",
"dataStoreSpecs": [{
"dataStore": "projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}"
}],
"embeddingSpec": {
"embeddingVectors": [{
"fieldPath": "embedding_vector",
"vector": [0.028, 0.003, -0.016, ...]
}]
},
"ranking_expression": "0.5 * relevance_score + 0.5 * dotProduct(embedding_vector)"
}'
{
"results": [
{
"id": "faq-037",
"document": {
"name": "projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}/branches/0/documents/faq-037",
"id": "faq-037",
"structData": {
"title": "Multiple account usage",
"question": "I'd like to create a company account separate from my personal account. Is having multiple accounts against the terms?",
"answer": "Having multiple accounts is not a problem according to the terms. However, later migration of content between accounts is not possible."
},
"derivedStructData": {
"snippets": [
{
"snippet_status": "NO_SNIPPET_AVAILABLE",
"snippet": "No snippet is available for this page."
}
]
}
},
"modelScores": {
"dotProduct(embedding_vector)": { "values": [0.8300950527191162] }
},
"rankSignals": {
"keywordSimilarityScore": 2.3365948,
"topicalityRank": 1,
"defaultRank": 1
}
},
...
],
"totalSize": 48,
"attributionToken": "...",
"nextPageToken": "...",
"guidedSearchResult": {},
"summary": {},
"queryExpansionInfo": {},
"semanticState": "ENABLED"
}
semanticState: ENABLED was achieved. The top three results and their dotProduct scores were:
- 1st: Multiple account usage (dotProduct: 0.830)
- 2nd: Account freeze/unfreeze (dotProduct: 0.608)
- 3rd: Account deletion/cancellation (dotProduct: 0.605)
All 48 items are included in the search results, and semantic search is functioning correctly.
Notes on Using Custom Embeddings
Here are some points I noticed during testing:
- You need to generate and pass the query's embedding vector with each search request. It won't work with just a standard search request.
- Use dotProduct(field_name) in ranking_expression to incorporate vector similarity into ranking.
- The schema's keyPropertyMapping: "embedding_vector" cannot be added when existing documents are present. It must be set immediately after creating the data store (before importing).
- The embedding dimension must be in the range of 1-768. Since gemini-embedding-001 defaults to 3072 dimensions, you need to specify output_dimensionality: 768 to stay within the limit.
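The notes above can be folded into a small request builder. This is a sketch under stated assumptions: `build_search_body` is a hypothetical helper, the 0.5/0.5 weights simply mirror the request used in Test 3, and the query vector is expected to come from the same embedding model as the documents.

```python
import json
from typing import Sequence


def build_search_body(
    query: str,
    data_store: str,
    query_vector: Sequence[float],
    field_path: str = "embedding_vector",
    keyword_weight: float = 0.5,
    vector_weight: float = 0.5,
) -> dict:
    """Build a search request body that blends keyword and vector scores."""
    # The article notes that embedding dimensions must be within 1-768.
    if not 1 <= len(query_vector) <= 768:
        raise ValueError(
            f"query vector has {len(query_vector)} dimensions; expected 1-768"
        )
    return {
        "query": query,
        "dataStoreSpecs": [{"dataStore": data_store}],
        "embeddingSpec": {
            "embeddingVectors": [
                {"fieldPath": field_path, "vector": list(query_vector)}
            ]
        },
        "ranking_expression": (
            f"{keyword_weight} * relevance_score"
            f" + {vector_weight} * dotProduct({field_path})"
        ),
    }


body = build_search_body(
    query="Is it a problem to create multiple accounts?",
    data_store="projects/{PROJECT_NUMBER}/locations/global/collections/"
    "default_collection/dataStores/{DATA_STORE_ID}",
    query_vector=[0.0] * 768,  # placeholder; use the real query embedding
)
print(body["ranking_expression"])
print(len(json.dumps(body)), "bytes of JSON")
```

The resulting dictionary can be serialized with json.dumps and sent as the -d payload of the curl command shown earlier.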
Why Do We Need to Pass the Query Vector Ourselves?
With custom embeddings, the vectors on the data side are generated by user-chosen models. Vector similarity search assumes that both data and query vectors are generated by the same model with the same settings. Therefore, it wouldn't make sense for Vertex AI Search to automatically generate query vectors with its internal model. Users must generate query vectors using the same model and pass them via embeddingSpec.
In contrast, with unstructured data stores, Vertex AI Search automatically generates embeddings for both data and queries using the same internal model, so users don't need to worry about embeddings at all.
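To make the "same model, same settings" requirement concrete, here is a toy sketch of the dot product that dotProduct(...) presumably computes between the query vector and each stored vector. The helper names are hypothetical, the two-dimensional vectors are made up for illustration, and normalizing to unit length is shown as a precaution on the assumption that reduced-dimension embeddings are not guaranteed to be unit length (check the embedding model's documentation).

```python
import math
from typing import List, Sequence


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two vectors; they must share the same dimension."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for a document embedding and a query embedding.
doc_vector = normalize([3.0, 4.0])
same_direction_query = normalize([6.0, 8.0])
orthogonal_query = normalize([-4.0, 3.0])

print(round(dot_product(doc_vector, same_direction_query), 6))
print(round(dot_product(doc_vector, orthogonal_query), 6))
```

Vectors from a different model live in a different space (and often a different dimension), so the score would be meaningless or the dimension check would fail outright.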
Side Note: Case Where Semantic Search Is Enabled for Structured Data Stores in Blended Search
After the above tests with engines connected to a single data store, I found one more interesting behavior worth mentioning.
In a blended search engine (with multiple data stores connected), when searching without specifying dataStoreSpecs to target all data stores, semanticState: ENABLED was returned and structured data store content was included in the results.
| Condition | semanticState |
|---|---|
| Blended search engine without dataStoreSpecs (searching all data stores) | ENABLED |
| Same engine with dataStoreSpecs specifying only the CSV structured data store | DISABLED |
It seems that searching alongside a data store that supports semantic search (like a website with Advanced indexing) causes semantic search to extend to the structured data store as well. However, simply adding a website data store with 0 documents didn't enable it, suggesting that the presence of a data store with indexed documents is necessary.
I'm not sure if this behavior is by design or if it's just a matter of the engine's overall search mode switching. Since restricting to just the structured data store with dataStoreSpecs reverts back to DISABLED, it doesn't appear that semantic search is being enabled for the structured data store itself. For practical purposes, I recommend using the methods summarized in the table below for reliable results.
Conclusion
Here's a summary of the test results:
| Data Store Type | semanticState |
|---|---|
| Structured Data Store (CSV / with keyPropertyMapping) | DISABLED |
| Structured Data Store (CSV / without keyPropertyMapping) | ENABLED |
| Structured + Custom Embeddings | ENABLED |
| Unstructured Data Store (HTML) | ENABLED |
| Blended Search (without dataStoreSpecs) | ENABLED |
I found that semantic search is disabled for structured data stores with certain keyPropertyMappings.
Using an unstructured data store (HTML conversion) is also effective for enabling semantic search. Although there's initial work to convert CSV to HTML, once converted, semantic search is enabled with regular search requests.
The custom embedding approach offers the advantage of choosing your own embedding model, but it requires generating and sending query vectors with each search, making the application implementation more complex.
The blended search method without specifying dataStoreSpecs is convenient, but since it reverts to DISABLED when specifying just the structured data store, it's not suitable if you want to narrow down the search target data stores.