I investigated the conditions under which "CSV (for FAQ data)" is semantically searched in Vertex AI Search

Have you ever had trouble with semantic search not being enabled when searching CSV data registered in a data store with Vertex AI Search (Discovery Engine API)? This article summarizes the process and results of investigating the conditions under which semantic search is enabled for each data store.
2026.03.17

Background

After uploading FAQ data to a Vertex AI Search structured data store (CSV) and trying to search it, I found the search accuracy was poor, and queries with different phrasing did not yield appropriate results.

For example, I wanted the query "Is it a problem to create multiple accounts?" to match an FAQ like "Is it against the terms to have multiple accounts per person?", but it didn't return this result.

When checking the search API response, I found semanticState: DISABLED was returned. It seems semantic search was disabled, and only keyword matching was being performed.

To solve this problem, I tested several approaches.

Test Environment

  • Enterprise edition: Enabled (SEARCH_TIER_ENTERPRISE, SEARCH_ADD_ON_LLM)
  • API: Discovery Engine v1alpha
  • Data: FAQ data (48 entries, title/question/answer)

Test 1: Structured Data Store (CSV)

Procedure

In the console's data store creation screen, there's an option for "CSV (for FAQ data)".

[Figure: CSV import options in the Vertex AI Search console]

I used this to import a CSV from Cloud Storage as a structured data store. The CSV has 3 columns: title, question, and answer.

faq.csv
title,answer,question
Multiple account usage,We ask that you use 1 account per person...,Is it against the terms to have multiple accounts per person?
WAF blocking,Your IP address was incorrectly identified by the automated threat detection service...,My posting was blocked after continuous submissions...
...(total of 48 entries)

The configuration of the created data store is as follows:

{
  "name": "projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}",
  "displayName": "faq-csv-structured",
  "industryVertical": "GENERIC",
  "createTime": "2026-03-13T08:17:53.671221Z",
  "solutionTypes": [
    "SOLUTION_TYPE_SEARCH"
  ],
  "defaultSchemaId": "default_schema",
  "languageInfo": {
    "languageCode": "ja",
    "normalizedLanguageCode": "ja",
    "language": "ja"
  },
  "billingEstimation": {
    "structuredDataSize": "26472",
    "structuredDataUpdateTime": "2026-03-16T04:56:45.813085130Z"
  },
  "documentProcessingConfig": {
    "name": "projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}/documentProcessingConfig",
    "defaultParsingConfig": {
      "layoutParsingConfig": {}
    }
  },
  "servingConfigDataStore": {},
  "naturalLanguageQueryUnderstandingConfig": {
    "mode": "ENABLED"
  },
  "federatedSearchConfig": {}
}
  • industryVertical: GENERIC — A general industry category, not specialized for media, healthcare, etc.
  • solutionTypes: SOLUTION_TYPE_SEARCH — Data store for search purposes
  • naturalLanguageQueryUnderstandingConfig: ENABLED — Natural language query interpretation feature is enabled. (However, this doesn't necessarily enable semantic search)

The schema automatically defined three fields corresponding to the CSV columns. Each field was also assigned an FAQ CSV-specific keyPropertyMapping (title, question, and answer).

{
  "structSchema": {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
      "title": {
        "type": "string",
        "keyPropertyMapping": "title",
        "retrievable": true
      },
      "question": {
        "type": "string",
        "keyPropertyMapping": "question",
        "retrievable": true
      },
      "answer": {
        "type": "string",
        "keyPropertyMapping": "answer",
        "retrievable": true
      }
    }
  },
  "fieldConfigs": [
    {"fieldPath": "title", "fieldType": "STRING", "keyPropertyType": "TITLE"},
    {"fieldPath": "question", "fieldType": "STRING", "keyPropertyType": "QUESTION"},
    {"fieldPath": "answer", "fieldType": "STRING", "keyPropertyType": "ANSWER"}
  ]
}

Search Request

I sent a standard search request specifying this structured data store.

$ curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://discoveryengine.googleapis.com/v1alpha/projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/engines/{ENGINE_ID}/servingConfigs/default_search:search" \
  -d '{
    "query": "Is it a problem to create multiple accounts?",
    "dataStoreSpecs": [{
      "dataStore": "projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}"
    }]
  }'
{
  "attributionToken": "...",
  "guidedSearchResult": {},
  "summary": {},
  "queryExpansionInfo": {},
  "semanticState": "DISABLED"
}

semanticState: DISABLED was returned. Even with Enterprise edition enabled, semantic search was not activated.
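
The check itself is easy to automate. The following is a minimal sketch, assuming the search response has already been parsed into a Python dict. The helper name semantic_search_enabled is hypothetical; only the semanticState field and its ENABLED / DISABLED values come from the responses shown in this article.

```python
import json


def semantic_search_enabled(response: dict) -> bool:
    """Return True only when the response reports semanticState == "ENABLED"."""
    # A missing field is treated as "not enabled" (an assumption of this sketch).
    return response.get("semanticState") == "ENABLED"


# Trimmed-down response bodies, for illustration only.
disabled = json.loads('{"attributionToken": "...", "semanticState": "DISABLED"}')
enabled = json.loads('{"totalSize": 3, "semanticState": "ENABLED"}')

print(semantic_search_enabled(disabled))  # False
print(semantic_search_enabled(enabled))   # True
```

A check like this can run once per data store after each import, which makes it easier to compare configurations side by side.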

While researching further, I found an interesting description in the Dialogflow CX documentation.

Reference: Dialogflow CX - Data store agents

Note: CSV files can also be imported as unstructured content. (...) The matching requirements are less strict compared to FAQ CSV data stores, and answers might be rewritten by the agent rather than returned verbatim.

This suggests that the same CSV data might have different matching behaviors depending on the import method (FAQ structured vs. unstructured).

Based on this, I decided to test importing the same FAQ data as an unstructured data store instead of as a structured one.

Test 2: Unstructured Data Store (HTML Conversion)

Procedure

This time, I imported it as unstructured "documents".

[Figure: unstructured data import options in the Vertex AI Search console]

I converted each CSV row (48 entries) into individual HTML files.

faq-001.html
<!DOCTYPE html>
<html>
<head><title>Multiple account usage</title></head>
<body>
  <h1>Multiple account usage</h1>
  <h2>Question</h2>
  <p>Is it against the terms to have multiple accounts per person?</p>
  <h2>Answer</h2>
  <p>...</p>
</body>
</html>
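
The conversion can be scripted. Below is a minimal sketch using only the Python standard library; it is not the script actually used for this test. The function names, the faq-NNN.html naming scheme, and the output directory are assumptions for illustration. The HTML layout mirrors the sample above, and field values are escaped so that characters such as & or < in an answer do not break the markup.

```python
import csv
import html
import io
import tempfile
from pathlib import Path

PAGE_TEMPLATE = """<!DOCTYPE html>
<html>
<head><title>{title}</title></head>
<body>
  <h1>{title}</h1>
  <h2>Question</h2>
  <p>{question}</p>
  <h2>Answer</h2>
  <p>{answer}</p>
</body>
</html>
"""


def row_to_html(row: dict) -> str:
    """Render one FAQ row as an HTML page, escaping special characters."""
    return PAGE_TEMPLATE.format(
        title=html.escape(row["title"]),
        question=html.escape(row["question"]),
        answer=html.escape(row["answer"]),
    )


def convert_csv(csv_text: str, out_dir: Path) -> list:
    """Write one faq-NNN.html file per CSV row and return the created paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    # DictReader looks columns up by header name, so column order does not matter.
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
        path = out_dir / f"faq-{i:03d}.html"
        path.write_text(row_to_html(row), encoding="utf-8")
        paths.append(path)
    return paths


sample = (
    "title,answer,question\n"
    "Multiple account usage,We ask that you use 1 account per person...,"
    "Is it against the terms to have multiple accounts per person?\n"
)
# A temporary directory keeps the demo free of side effects.
with tempfile.TemporaryDirectory() as tmp:
    for created in convert_csv(sample, Path(tmp)):
        print(created.name)
```

In practice the CSV text would be read from the real faq.csv and the output directory would be the folder that gets uploaded to the bucket.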

I uploaded the HTML files to a GCS bucket, created an unstructured data store (Cloud Storage) from the console, and connected it to an engine.

The configuration of the created data store is as follows:

{
  "name": "projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}",
  "displayName": "faq-html-unstructured",
  "industryVertical": "GENERIC",
  "createTime": "2026-03-16T01:22:14.863267Z",
  "solutionTypes": [
    "SOLUTION_TYPE_SEARCH"
  ],
  "contentConfig": "CONTENT_REQUIRED",
  "defaultSchemaId": "default_schema",
  "billingEstimation": {
    "unstructuredDataSize": "34337",
    "unstructuredDataUpdateTime": "2026-03-16T02:40:36.968677615Z"
  },
  "documentProcessingConfig": {
    "name": "projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}/documentProcessingConfig",
    "chunkingConfig": {
      "layoutBasedChunkingConfig": {
        "chunkSize": 500
      }
    },
    "defaultParsingConfig": {
      "layoutParsingConfig": {}
    }
  },
  "servingConfigDataStore": {}
}
  • contentConfig: CONTENT_REQUIRED — Content is required for unstructured data stores. This setting wasn't present in the structured data store from Test 1
  • chunkingConfig — Layout-based chunking is enabled (chunk size 500). In unstructured data stores, documents are automatically split into chunks
  • layoutParsingConfig — Layout parser is enabled. It analyzes HTML structure and reflects it in the index

Search Request

I searched with the same query as in Test 1.

$ curl -s -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://discoveryengine.googleapis.com/v1alpha/projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/engines/{ENGINE_ID}/servingConfigs/default_search:search" \
  -d '{
    "query": "Is it a problem to create multiple accounts?",
    "dataStoreSpecs": [{
      "dataStore": "projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}"
    }]
  }'
{
  "results": [
    {
      "id": "a72d4a6a75ef09d25ad8d011f0c9cc33",
      "document": {
        "name": "projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}/branches/0/documents/a72d4a6a75ef09d25ad8d011f0c9cc33",
        "id": "a72d4a6a75ef09d25ad8d011f0c9cc33",
        "derivedStructData": {
          "link": "gs://{BUCKET}/faq_html/037.html",
          "title": "Multiple account usage",
          "snippets": [
            {
              "snippet_status": "SUCCESS",
              "snippet": "... I'd like to <b>create</b> a company <b>account</b> separate from my personal <b>account</b>. Is having <b>multiple accounts</b> against the terms? ## Answer Having <b>multiple accounts</b> is not <b>a problem</b> according to the terms. However, later ..."
            }
          ]
        }
      },
      "modelScores": {
        "relevance_score": { "values": [1] }
      },
      "rankSignals": {
        "keywordSimilarityScore": 3.2140481,
        "relevanceScore": 0.99168247,
        "semanticSimilarityScore": 0.81432873,
        "topicalityRank": 1,
        "documentAge": 492680.63,
        "boostingFactor": 0,
        "defaultRank": 1
      }
    },
    ...
  ],
  "totalSize": 3,
  "attributionToken": "...",
  "guidedSearchResult": {},
  "summary": {},
  "queryExpansionInfo": {},
  "semanticState": "ENABLED"
}

This time, semanticState: ENABLED was returned.

The top result "Multiple account usage" has semanticSimilarityScore: 0.814 and relevance_score: 1 (highest score), showing that it properly matches even with different phrasing. By just changing to an unstructured data store, semantic search started working for queries that didn't return results in Test 1.
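
When comparing data stores, it helps to pull the ranking signals out of each response. The following is a small sketch, assuming the response is already parsed into a dict with the field names seen in the responses in this article. The function name summarize_results is hypothetical, and signals that are absent from a response are returned as None.

```python
def summarize_results(response: dict) -> list:
    """Return (title, semanticSimilarityScore, keywordSimilarityScore) per result."""
    rows = []
    for result in response.get("results", []):
        doc = result.get("document", {})
        # The unstructured store returns the title under derivedStructData,
        # while the structured store returns it under structData.
        title = (doc.get("derivedStructData", {}).get("title")
                 or doc.get("structData", {}).get("title"))
        signals = result.get("rankSignals", {})
        rows.append((
            title,
            signals.get("semanticSimilarityScore"),
            signals.get("keywordSimilarityScore"),
        ))
    return rows


# Values copied from the top result above.
sample = {
    "results": [{
        "document": {"derivedStructData": {"title": "Multiple account usage"}},
        "rankSignals": {
            "keywordSimilarityScore": 3.2140481,
            "semanticSimilarityScore": 0.81432873,
        },
    }],
    "semanticState": "ENABLED",
}
for row in summarize_results(sample):
    print(row)
```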

This is the simplest solution, but I also tried another approach.

Test 3: Structured Data Store with Custom Embeddings

The official documentation mentions that adding custom embeddings to structured data enables vector search. I tested whether this method would also set semanticState to ENABLED.

Reference: Bring your own embeddings

Procedure

  1. Generated embeddings (768-dimensional) for each FAQ document (question + answer) using the gemini-embedding-001 model
  2. Created a JSONL file containing the embeddings
faq-embeddings.jsonl
{"id": "faq-001", "structData": {"title": "...", "question": "...", "answer": "...", "embedding_vector": [0.1, 0.2, ...]}}
  3. Created a data store via API (contentConfig: NO_CONTENT)
  4. Set keyPropertyMapping: "embedding_vector" and dimension: 768 for the embedding_vector field in the schema
  5. Imported the data (48 successful entries)
  6. Connected to an engine
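
The JSONL step can be sketched as follows. This is an illustration only: the function name to_jsonl_line is hypothetical, and the embedding is a placeholder list of zeros standing in for a real response from the embedding model, which is not called here. The record layout follows the faq-embeddings.jsonl sample above, and the length check reflects the 768-dimension setting used in this test.

```python
import json

EMBEDDING_DIM = 768  # dimension configured in the schema in this article


def to_jsonl_line(doc_id, title, question, answer, vector):
    """Serialize one FAQ entry plus its embedding as a single JSONL line."""
    if len(vector) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(vector)}")
    record = {
        "id": doc_id,
        "structData": {
            "title": title,
            "question": question,
            "answer": answer,
            "embedding_vector": [float(x) for x in vector],
        },
    }
    # ensure_ascii=False keeps non-ASCII text (e.g. Japanese) readable in the file.
    return json.dumps(record, ensure_ascii=False)


# Placeholder vector; a real run would use the embedding model's output instead.
placeholder_vector = [0.0] * EMBEDDING_DIM
line = to_jsonl_line("faq-001", "Multiple account usage", "Q...", "A...", placeholder_vector)
print(line[:60])
```

Failing fast on a wrong vector length catches the case where the model's default output size is used by mistake, before the import job runs.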

The configuration of the created data store is as follows:

{
  "name": "projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}",
  "displayName": "faq-custom-embeddings",
  "industryVertical": "GENERIC",
  "createTime": "2026-03-16T04:14:49.765248Z",
  "solutionTypes": [
    "SOLUTION_TYPE_SEARCH"
  ],
  "contentConfig": "NO_CONTENT",
  "defaultSchemaId": "default_schema",
  "billingEstimation": {
    "structuredDataSize": "829736",
    "structuredDataUpdateTime": "2026-03-16T04:56:45.813085130Z"
  },
  "servingConfigDataStore": {},
  "naturalLanguageQueryUnderstandingConfig": {
    "mode": "ENABLED"
  }
}

The schema is as follows:

{
  "structSchema": {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
      "title": {
        "type": "string",
        "retrievable": true,
        "keyPropertyMapping": "title"
      },
      "question": {
        "type": "string",
        "retrievable": true
      },
      "answer": {
        "type": "string",
        "retrievable": true
      },
      "embedding_vector": {
        "type": "array",
        "items": { "type": "number" },
        "keyPropertyMapping": "embedding_vector",
        "dimension": 768
      }
    }
  },
  "fieldConfigs": [
    {"fieldPath": "embedding_vector", "fieldType": "NUMBER", "keyPropertyType": "EMBEDDING_VECTOR"},
    {"fieldPath": "title", "fieldType": "STRING", "keyPropertyType": "TITLE"},
    {"fieldPath": "question", "fieldType": "STRING"},
    {"fieldPath": "answer", "fieldType": "STRING"}
  ]
}
  • contentConfig: NO_CONTENT — Required for importing documents with only structData. Using CONTENT_REQUIRED would result in import errors
  • embedding_vector field is set with keyPropertyMapping: "embedding_vector" and dimension: 768. This setting must be configured immediately after creating the data store (before importing) as it cannot be added when existing documents are present
  • Unlike Test 1, question and answer have no keyPropertyMapping set. This is because it's created as a standard structured data store, not as an FAQ CSV

Search Request

Following the official documentation, I tried passing the query's embedding vector in embeddingSpec and using ranking_expression to incorporate vector similarity into ranking.

$ curl -s -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://discoveryengine.googleapis.com/v1alpha/projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/engines/{ENGINE_ID}/servingConfigs/default_search:search" \
  -d '{
    "query": "Is it a problem to create multiple accounts?",
    "dataStoreSpecs": [{
      "dataStore": "projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}"
    }],
    "embeddingSpec": {
      "embeddingVectors": [{
        "fieldPath": "embedding_vector",
        "vector": [0.028, 0.003, -0.016, ...]
      }]
    },
    "ranking_expression": "0.5 * relevance_score + 0.5 * dotProduct(embedding_vector)"
  }'
{
  "results": [
    {
      "id": "faq-037",
      "document": {
        "name": "projects/{PROJECT_NUMBER}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}/branches/0/documents/faq-037",
        "id": "faq-037",
        "structData": {
          "title": "Multiple account usage",
          "question": "I'd like to create a company account separate from my personal account. Is having multiple accounts against the terms?",
          "answer": "Having multiple accounts is not a problem according to the terms. However, later migration of content between accounts is not possible."
        },
        "derivedStructData": {
          "snippets": [
            {
              "snippet_status": "NO_SNIPPET_AVAILABLE",
              "snippet": "No snippet is available for this page."
            }
          ]
        }
      },
      "modelScores": {
        "dotProduct(embedding_vector)": { "values": [0.8300950527191162] }
      },
      "rankSignals": {
        "keywordSimilarityScore": 2.3365948,
        "topicalityRank": 1,
        "defaultRank": 1
      }
    },
    ...
  ],
  "totalSize": 48,
  "attributionToken": "...",
  "nextPageToken": "...",
  "guidedSearchResult": {},
  "summary": {},
  "queryExpansionInfo": {},
  "semanticState": "ENABLED"
}

With this request, semanticState: ENABLED was returned.

  • 1st: Multiple account usage (dotProduct: 0.830)
  • 2nd: Account freeze/unfreeze (dotProduct: 0.608)
  • 3rd: Account deletion/cancellation (dotProduct: 0.605)

All 48 items are included in the search results, and semantic search is functioning correctly.

Notes on Using Custom Embeddings

Here are some points I noticed during testing:

  • You need to generate and pass the query's embedding vector with each search request. It won't work with just a standard search request
  • Use dotProduct(field_name) in ranking_expression to incorporate vector similarity into the ranking
  • The schema's keyPropertyMapping: "embedding_vector" cannot be added when existing documents are present. It must be set immediately after creating the data store (before importing)
  • The embedding dimension must be in the range of 1-768. Since gemini-embedding-001 defaults to 3072 dimensions, you need to specify output_dimensionality: 768 to stay within the limit
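
The points above can be folded into a small request builder. This is a sketch under stated assumptions: build_search_body and its parameters are hypothetical names, the query vector is assumed to have been generated elsewhere with the same model and settings as the document vectors, and the body simply mirrors the curl example in this article.

```python
import json


def build_search_body(query, data_store, query_vector,
                      field_path="embedding_vector", max_dim=768):
    """Build a search request body that carries the query embedding."""
    # Mirrors the 1-768 dimension range noted above.
    if not 1 <= len(query_vector) <= max_dim:
        raise ValueError(
            f"vector length {len(query_vector)} is outside 1-{max_dim}"
        )
    return {
        "query": query,
        "dataStoreSpecs": [{"dataStore": data_store}],
        "embeddingSpec": {
            "embeddingVectors": [
                {"fieldPath": field_path, "vector": list(query_vector)}
            ]
        },
        # Same expression as in the curl example: equal weight on both signals.
        "ranking_expression": (
            f"0.5 * relevance_score + 0.5 * dotProduct({field_path})"
        ),
    }


body = build_search_body(
    "Is it a problem to create multiple accounts?",
    "projects/{PROJECT_NUMBER}/locations/global/collections/"
    "default_collection/dataStores/{DATA_STORE_ID}",
    [0.0] * 768,  # placeholder for a real query embedding
)
print(json.dumps(body)[:80])
```

The resulting dict can be serialized with json.dumps and sent as the -d payload of the same search endpoint.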

Why Do We Need to Pass the Query Vector Ourselves?

With custom embeddings, the vectors on the data side are generated by user-chosen models. Vector similarity search assumes that both data and query vectors are generated by the same model with the same settings. Therefore, it wouldn't make sense for Vertex AI Search to automatically generate query vectors with its internal model. Users must generate query vectors using the same model and pass them via embeddingSpec.
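
To make the comparison concrete, here is a toy sketch of a dot product between a query vector and document vectors. It assumes that dotProduct in the ranking expression is the ordinary dot product. The 2-dimensional vectors are made up for illustration and have nothing to do with real embedding values.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot_product(a, b):
    """Plain dot product; the vectors must share the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


query = normalize([1.0, 0.0])
doc_close = normalize([1.0, 1.0])   # points in a similar direction
doc_far = normalize([0.0, 1.0])     # orthogonal to the query

print(round(dot_product(query, doc_close), 3))  # 0.707
print(round(dot_product(query, doc_far), 3))    # 0.0
```

For unit-length vectors the dot product equals cosine similarity, so a higher value means a closer direction. The numbers are only meaningful when both sides come from the same embedding space, which is why a query vector from a different model or a different output dimension cannot simply be swapped in.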

In contrast, with unstructured data stores, Vertex AI Search automatically generates embeddings for both data and queries using the same internal model, so users don't need to worry about embeddings at all.

After the above tests with engines connected to a single data store, I found one more interesting behavior worth mentioning.

In a blended search engine (with multiple data stores connected), when searching without specifying dataStoreSpecs to target all data stores, semanticState: ENABLED was returned and structured data store content was included in the results.

Condition | semanticState
Blended search engine without dataStoreSpecs (searching all data stores) | ENABLED
Same engine with dataStoreSpecs specifying only the CSV structured data store | DISABLED

It seems that searching alongside a data store that supports semantic search (like a website with Advanced indexing) causes semantic search to extend to the structured data store as well. However, simply adding a website data store with 0 documents didn't enable it, suggesting that the presence of a data store with indexed documents is necessary.

I'm not sure if this behavior is by design or if it's just a matter of the engine's overall search mode switching. Since restricting to just the structured data store with dataStoreSpecs reverts back to DISABLED, it doesn't appear that semantic search is being enabled for the structured data store itself. For practical purposes, I recommend using the methods summarized in the table below for reliable results.

Conclusion

Here's a summary of the test results:

Data Store Type | semanticState
Structured Data Store (CSV / with keyPropertyMapping) | DISABLED
Structured Data Store (CSV / without keyPropertyMapping) | ENABLED
Structured + Custom Embeddings | ENABLED
Unstructured Data Store (HTML) | ENABLED
Blended Search (without dataStoreSpecs) | ENABLED

I found that semantic search is disabled for structured data stores with certain keyPropertyMappings.

Using an unstructured data store (HTML conversion) is also effective for enabling semantic search. Although there's initial work to convert CSV to HTML, once converted, semantic search is enabled with regular search requests.

The custom embedding approach offers the advantage of choosing your own embedding model, but it requires generating and sending query vectors with each search, making the application implementation more complex.

The blended search method without specifying dataStoreSpecs is convenient, but since it reverts to DISABLED when specifying just the structured data store, it's not suitable if you want to narrow down the search target data stores.
