
I tried running NVIDIA VSS 3.2.0 GA on DGX Spark
This page has been translated by machine translation. View original
Introduction
Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Division.
NVIDIA's VSS (Video Search and Summarization) was released as GA version 3.2.0 on June 16. This is the first General Availability release in the 3.X series.
To put it simply, VSS is a reference implementation suite for summarizing, searching, and performing alert detection on video using VLM (Vision Language Model). It consists of multiple microservices and agent workflows launched via Docker Compose or Helm. Previously available as EA (Early Access), with this GA release I decided to spin up the full 3.2.0 configuration on my local DGX Spark and verify everything hands-on — from the startup experience to the current state of new features.
The DGX Spark is a compact AI workstation equipped with a single GB10 (ARM64), featuring a unique architecture where the CPU and GPU share 128 GiB of Unified Memory. There are behavioral differences compared to x86 + H100 / RTX PRO 6000 configurations, so I'll note those where relevant.
I've also written articles covering previous versions of VSS, so if you're considering migration, please check those out as well.
Overview of 3.2.0 GA
First, let me summarize what changed in the 3.2.0 GA release. I've compiled the NEW / CHANGED / FIXED / BREAKING items from the release notes into a table.
| Category | Main Topics |
|---|---|
| NEW | GitHub source release for all microservices and agent workflows (Apache-2.0 + MIT), Agent Skills (EA), NemoClaw + VSS (EA), RT-CV-3D (Sparse4D v2.2) + Auto Calibration, audio-enabled video understanding (Nemotron 3 Nano Omni) |
| CHANGED | /v1/generate_captions_alerts → /v1/generate_captions rename, removal of Envoy/SDR routing from base profile, deployment structure refactored to developer-profiles/ + services/ include model |
| FIXED | HTTP 409 returned for duplicate stream/camera IDs (old behavior: silent overwrite), Riva ASR NIM removed from compose bundle |
| BREAKING | Items from the above CHANGED + FIXED that will break existing client code or compose files from previous versions |
The two changes I found most significant are "full source release" and "deployment structure overhaul." This article proceeds with hands-on verification centered around these two changes.
Note that RT-CV-3D / Auto Calibration requires a multi-camera environment and won't be covered in depth here. Audio-enabled video understanding (Nemotron 3 Nano Omni) also involves a major change with the Riva ASR NIM replacement, so I plan to verify that in a separate article.
Deployment Structure Overhaul
The 3.2 deployment structure uses a modular configuration linking deploy/docker/{developer-profiles, industry-profiles, services}/ via includes. A startup script called dev-profile.sh auto-detects the GPU, determines the HARDWARE_PROFILE, and assembles the profile-specific compose files.
The dashed lines indicate the services picked up by the base profile (agent / Cosmos-Reason2-8B / Nemotron-Nano-9B-v2 FP8 / ui · vios · infra). By tracing the dashed lines for alerts or lvs, you can read what each other profile pulls in from the same diagram.
The GPU detection logic is around line 91 of dev-profile.sh:
case "${gpu_name}" in
*gb10*) echo "DGX-SPARK" ;;
...
esac
It reads the GPU name from nvidia-smi, auto-detects DGX-SPARK, and writes HARDWARE_PROFILE=DGX-SPARK to generated.env. This means a DGX Spark with a GB10 will automatically land on the correct profile without any explicit configuration.
Also worth noting: HAProxy acts as the API Gateway on port 7777, and there's no separate vss-api-gateway image. In the 3.X series, it seems HAProxy has settled into the role of both ingress and API Gateway.
Hands-On Points When Running on DGX Spark
From here, I'll organize what becomes visible when launching the base profile on DGX Spark, covering three areas: LLM, Alert, and startup time.
Local LLM Runs in a vLLM Container
When I ran docker inspect on the LLM container while the base profile was running, what was running was not a NIM but a plain vLLM container.
$ docker inspect nvidia-nemotron-nano-9b-v2-fp8 --format '{{.Config.Image}}'
nvcr.io/nvidia/vllm:25.12.post1-py3
The actual startup command looks like this:
python3 -m vllm.entrypoints.openai.api_server \
--model nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8 \
--trust-remote-code \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.40 \
--port 8000 \
--mamba_ssm_cache_dtype float32 \
--enable-auto-tool-choice \
--tool-parser-plugin /opt/toolcall_parser/nemotron_toolcall_parser_no_streaming.py \
--tool-call-parser nemotron_json
The official compose (deploy/docker/services/nim/nvidia-nemotron-nano-9b-v2-fp8/compose.yml) even includes a comment like this:
# Nemotron-Nano-V2 and tool-parser (nemotron_toolcall_parser_no_streaming.py) require vLLM 25.12+; 25.10 does not support Nemotron.
This means you can now read the reasoning behind "why VSS chose vLLM 25.12+" directly in the source. The tool call parser for Nemotron is also handled by an init container called nvidia-nemotron-nano-9b-v2-fp8-toolcall-init, which fetches it from HuggingFace and places it in a volume — so there's no need to manually mount a tool parser yourself.
The official prerequisites still list "Fully local deployment for all agent workflows ... is planned for a future release" as a future plan, but given the existence of
hw-DGX-SPARK.envand the structure of the compose files above, Local LLM is effectively working for the base profile in practice. It's worth keeping in mind that the official stance of "Remote LLM only" and the implementation reality of "Local LLM is ready to go" coexist here.
Alerts Consolidated into a Single vss-alert-verification
The container structure for the Alert workflow in 3.2 is consolidated as a single service within deploy/docker/services/alert/compose.yml.
services:
alert-bridge:
image: nvcr.io/nvidia/vss-core/vss-alert-verification:3.2.0
container_name: vss-alert-bridge
An interesting detail: the image name is vss-alert-verification, but the container name remains vss-alert-bridge. This is a thoughtful touch to avoid breaking compose references and monitoring scripts migrated from previous versions.
While it's a single service, it differentiates between the following four functional modes via environment variables:
| Environment Variable | Function Handled |
|---|---|
ALWAYS_ON_RULES_CONFIG |
Rule configuration for Always-on alerts |
VLM_AS_VERIFIER_CONFIG_FILE |
Behavior configuration for VLM-as-Verifier |
VLM_AS_VERIFIER_ALERT_TYPE_CONFIG_FILE |
Alert type definitions for VLM-as-Verifier |
RTVI_VLM_BASE_URL / RTVI_VLM_MODEL_TO_USE |
Endpoint and model for Real-Time VLM |
The path to the VLM-as-Verifier configuration file (vlm-as-verifier/configs/config.yml) remains unchanged, so migrating settings from previous versions should be relatively straightforward.
Base Profile Deployment Time
I used time to measure the elapsed time from running dev-profile.sh up -p base -H DGX-SPARK ... until the health check passed through HAProxy.
| Phase | Duration |
|---|---|
down existing cleanup |
~2 seconds |
| Image pull (vLLM + NIM set) | ~55 seconds |
| Container startup cascade | ~13 minutes (dominated by VLM NIM compilation wait) |
| Total (measured) | 14 minutes 0 seconds |
The measured value includes about 5 minutes of port conflict recovery work during startup due to my specific environment, so subtracting that gives a pure startup cascade of about 9 minutes and a total of around 10 minutes. On a clean DGX Spark without Langfuse or other services running in parallel, you should expect values close to this.
The bottleneck is the TRT-LLM compilation time inside the NIM for the VLM (Cosmos-Reason2-8B, FP8 dynamic + KV8), which takes 8–9 minutes on a cold start. The NIM_DISABLE_CUDA_GRAPH=1 setting in hw-DGX-SPARK.env is likely intended to shorten this cold start time even slightly.
The reason the image pull completed in just 55 seconds is that layer caches from pulling different versions of vLLM and NIM images in previous testing sessions were effective. On a completely fresh environment, it's safer to budget an additional few minutes to 10 minutes.
API Behaviors to Know
For those migrating from previous versions or reusing existing client code, here's a summary of API behavioral changes. All sources can be grepped from the GitHub repository at the v3.2.0 tag.
Caption Generation Endpoint
Stream caption generation is called via /v1/generate_captions.
curl -X POST http://localhost:7777/v1/generate_captions -d '{...}'
The route implementation is around line 1013 of services/video-summarization/src/via_server.py. Any code still using the old name /v1/generate_captions_alerts will need to be migrated. Grepping the repository shows it's almost entirely gone:
$ grep -rn "generate_captions_alerts" services/ | wc -l
1
The single remaining occurrence is just a docstring in services/alert/alert-agent-web/app/api/realtime_schemas.py:218 quoting the old name as "Same as RTVI VLM generate_captions_alerts: ...". The rename is clean at the API level.
Duplicate Stream / Camera IDs Rejected with 409
Sending the same stream ID / camera ID again will be rejected with HTTP 409 + DuplicateStreamId / DuplicateCameraId. The code looks like this:
# Near services/rtvi/rt-vlm/src/utils/asset_manager.py:1265
if camera_id:
existing_asset_id = self._camera_id_map.get(camera_id)
if existing_asset_id and existing_asset_id in self._asset_map:
raise ServiceException(
f"Live stream with camera_id '{camera_id}' already exists",
"DuplicateCameraId",
409,
)
if stream_id:
asset_id = str(stream_id)
if asset_id in self._asset_map:
raise ServiceException(
f"Live stream with stream_id '{asset_id}' already exists",
"DuplicateStreamId",
409,
)
This guard is present in both rt-vlm and rt-embed. If you were under the impression that sending the same ID would overwrite silently, you'll be surprised — so it's safer to add 409 handling.
Same RTSP URL Treated as Independent Jobs
Calling /v1/generate_captions multiple times on the same RTSP URL creates a separate job with an independent request ID (UUID v4) each time. The relevant logic in asset_manager.py generates a new UUID via str(uuid.uuid4()) if no stream_id is specified, making it convenient when you want to run separate parallel processes on the same RTSP stream.
Base Profile Routing Has No Envoy/SDR
In the base profile, the configuration connects directly to Stream Processing without going through Envoy + SDR routing. This is explicitly documented with a comment in dev-profile-base/.env:
# Direct streamprocessing (no SDR/Envoy/SDRC router on :10000)
The alerts / lvs / search profiles still retain sdrc/<mode>/configs/*.yml.tmpl, so only the base profile intentionally omits the routing layer. Be aware that configurations involving custom Envoy filters/routes, Istio or Linkerd dependencies, or ENVOY_* environment variables will not work with the base profile.
DGX Spark's Official Position
The official prerequisites describe the DGX Spark's positioning as follows:
"AGX/IGX Thor and DGX Spark platforms currently support the listed remote-LLM configurations. Fully local deployment for all agent workflows (base, summarization, alerts, and search) is planned for a future release."
Officially it's Remote LLM only, but by combining hw-DGX-SPARK.env with the official vLLM compose, Local LLM works out of the box for the base profile. The alerts / search profiles still require Remote LLM.
VIOS images consolidate x86_64 / AGX Thor / DGX Spark into a single OCI image index, while RTVI / RT-CV / RT-CV-3D still require a separate SBSA-specific tag (*:3.2.0-sbsa) — an architecture dependency that remains. When dev-profile.sh detects DGX-SPARK, it automatically writes RTVI_VLM_IMAGE_TAG=3.2.0-sbsa, so you don't need to worry about this manually.
Bonus: Locally Building services/agent and Swapping the Image
Now that 3.2 has "published full source for all microservices and agent workflows on GitHub," I wanted to actually experience that benefit on real hardware.
The services/agent/ directory in the repository contains the full source for the VSS Agent, including the Dockerfile. The license is Apache 2.0.
$ head -1 services/agent/LICENSE.md
Apache License
Version 2.0, January 2004
The repository as a whole uses a dual Apache 2.0 + MIT license, with the top-level LICENSE explicitly stating: "Apache-2.0 applies to all code in the repository except the services/ui/ directory. MIT applies to the original code under the services/ui/ directory." Since this article covers services/agent, Apache 2.0 applies.
Let's try swapping the official image with one we build ourselves.
Build Command
The Dockerfile is at services/agent/docker/Dockerfile, and the build context is the services/ directory.
docker build \
-f services/agent/docker/Dockerfile \
-t my-vss-agent:local \
services/
The build is multi-stage, with each stage serving the following purpose:
| Stage | Base | Role |
|---|---|---|
| builder | python:3.13-bookworm |
Dependency resolution with uv, source-compile pycairo for ARM64 |
| security-patches | debian:bookworm |
Patch libssl3 to a CVE-fixed version |
| runtime | nvcr.io/nvidia/distroless/python:3.13-v3.1.7 |
Minimized distroless image |
| agent-runtime | Derived from runtime | ENTRYPOINT /vss-agent/.venv/bin/nat serve |
It's a production-grade build, even including FFmpeg source in the image for LGPL compliance (services/agent/3rdparty/ffmpeg/FFmpeg-n8.0.1.tar.gz, with verify_ffmpeg_tarball.py checking its presence and validity at build time). Budget 10–20 minutes for the build to be safe.
Pitfall: The security-patches Stage URL Goes Stale
This is where I encountered "a classic pitfall of the open-source era." The security-patches stage in the Dockerfile directly wgets libssl3 from the Debian security repo:
# v3.2.0 original (excerpt)
wget -O /patches/libssl3.deb http://security.debian.org/debian-security/pool/.../libssl3_3.0.19-1~deb12u2_arm64.deb
At the time of writing this article, 3.0.19-1~deb12u2 had been removed from Debian security's current pool and returned HTTP 404. The practical solution is to rewrite it to automatically fetch the current version while preserving the CVE-fix intent.
- RUN apt-get update && \
- apt-get install -y --no-install-recommends wget ca-certificates && \
- mkdir -p /patches && \
- if [ "$TARGETARCH" = "amd64" ]; then \
- wget -O /patches/libssl3.deb http://security.debian.org/debian-security/pool/.../libssl3_3.0.19-1~deb12u2_amd64.deb; \
- elif [ "$TARGETARCH" = "arm64" ]; then \
- wget -O /patches/libssl3.deb http://security.debian.org/debian-security/pool/.../libssl3_3.0.19-1~deb12u2_arm64.deb; \
- fi && \
- cd /patches && \
- dpkg-deb -x libssl3.deb /patches/libssl3-extracted && \
- rm -rf /var/lib/apt/lists/*
+ RUN apt-get update && \
+ apt-get install -y --no-install-recommends ca-certificates && \
+ mkdir -p /patches && \
+ cd /patches && \
+ apt-get download libssl3 && \
+ mv libssl3_*.deb libssl3.deb && \
+ dpkg-deb -x libssl3.deb /patches/libssl3-extracted && \
+ rm -rf /var/lib/apt/lists/*
Using apt-get download libssl3 automatically fetches the latest patched version currently available in bookworm-security. This was a great lesson that "being grateful for source releases" comes paired with "needing to deal with the aging of external dependencies."
After the fix, the build completed in about 8 minutes on the DGX Spark, and the final image size was 1.98 GB (including the distroless runtime). The official nvcr.io/nvidia/vss-core/vss-agent:3.2.0 image also shows 1.98 GB in docker images. The fact that a local build from the Dockerfile matches the official image size to the byte is clear evidence that the official image is reproducible from the Dockerfile.
Swapping the Image
The relevant line in deploy/docker/services/agent/compose.yml looks like this:
vss-agent:
# for release, change this to the versioned image from the registry
image: nvcr.io/nvidia/vss-core/vss-agent:${VSS_AGENT_VERSION}
As the comment explicitly says "change this to the versioned image from the registry," it's designed for easy substitution by users. To swap in my-vss-agent:local and restart only vss-agent:
# Edit the image: line in compose.yml
sed -i 's|nvcr.io/nvidia/vss-core/vss-agent:.*|my-vss-agent:local|' \
deploy/docker/services/agent/compose.yml
# Move to deploy/docker and restart only the vss-agent container (leave dependencies running)
cd deploy/docker
docker compose --env-file developer-profiles/dev-profile-base/generated.env \
up -d --no-deps --force-recreate vss-agent
Without --no-deps, dependent services including vLLM / NIM / Phoenix will also restart, making you wait another 10 minutes. A subtly important flag.
After startup, first verify the image has been swapped with docker inspect vss-agent:
$ docker inspect vss-agent --format '{{.Config.Image}}'
my-vss-agent:local
Then confirm the application layer responds by hitting /health directly:
$ curl -s http://localhost:8000/health
{"value":{"isAlive":true}}
And the Web UI through HAProxy is also fine:
$ curl -s -o /dev/null -w "HTTP %{http_code}\n" http://localhost:7777/
HTTP 200
Once all of these pass, you should have your own locally-built code running the Chat tab. I actually opened the Chat tab, uploaded a sample video (services/alert/warmup/test.mp4, a warehouse scene), and sent "Summarize this video in 2-3 lines." The result is shown in the screenshot below.

Opening the Reasoning Trace lets you follow the process of the agent composing a summary based on VLM output. I think this one screenshot conveys that a hybrid configuration of locally-built vss-agent + official NIM VLM + Local vLLM LLM can properly communicate with the UI through the API Gateway as my-vss-agent:local.
Next Time
Finally, let me touch on a service that quietly exited in 3.2.
Riva ASR NIM has been removed from docker-compose. It had been integrated for audio-enabled video understanding, but in 3.2 it exits with a comment # RIVA ASR is not yet supported, replaced by a design that handles audio understanding natively through the Nemotron 3 Nano Omni VLM's audio path.
# .env sample (base profile + Omni)
VLM_NAME=Nemotron-Nano-V3-Omni-GA0420-FP8
RTVI_VLM_MODEL_TO_USE=vllm-compatible
VLM_MODEL_SUPPORTS_AUDIO=true
The audio understanding design change is a fairly significant topic, so next time I plan to chase that down on real hardware.
Summary
Here's a quick summary of what became visible after launching VSS 3.2.0 GA on DGX Spark with the base profile:
| Aspect | Hands-on feel with 3.2 GA + DGX Spark |
|---|---|
| LLM | Official compose starts Nemotron-Nano-9B-v2 FP8 in a vLLM container. Tool-call parser is also auto-fetched by an init container |
| Alert | Single vss-alert-verification image differentiates Always-on / VLM-as-Verifier / Real-Time via environment variables |
| Deploy time | ~9 minutes for the pure base profile startup cascade, ~10 minutes total (dominated by VLM NIM compilation wait) |
| API | Caption endpoint is /v1/generate_captions, duplicate IDs return 409, base profile has no Envoy/SDR |
| Source release | All microservices and Agent workflows published on GitHub under Apache-2.0 + MIT, Dockerfiles included |
While the official documentation says "DGX Spark is Remote LLM only," the base profile implementation has Local LLM working out of the box, making it fully self-contained for PoC and verification use cases. With Dockerfiles now published, it's become entirely normal to build your own code, swap it in, and verify behavior firsthand.
Reference Links
- VSS docs (latest = 3.2.0)
- VSS 3.2.0 Release Notes
- VSS Prerequisites (hardware / Profile / DGX Spark Remote LLM constraints)
- VSS Helm chart deployment guide
- GitHub: video-search-and-summarization v3.2.0 release
- GitHub: LICENSE (Apache-2.0 + MIT dual)
- GitHub: services/agent full source
- GitHub: deploy/docker (compose layout)
- GitHub: services/agent/docker/Dockerfile (the Dockerfile built in this article)
- HuggingFace: NVIDIA-Nemotron-Nano-9B-v2 (source of the tool parser)
- build.nvidia.com: VSS Blueprint card
- Previous article: Investigating VSS 3.1.0 EA and the current state of manufacturing VSS as seen at Hannover Messe
- First article: Trying VSS 3.0.0 EA on DGX Spark
