
Thinking Through Practical Use Cases for NVIDIA VSS + AI Agents + Skills in Familiar Real-World Settings
Introduction
Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
A little while back, I summarized the VSS 3.1.0 EA setup differences and Hannover Messe 2026 manufacturing use cases in the following article.
On 2026-05-13, the day after that article was published, a follow-up post appeared on the NVIDIA Developer Blog.
The main topic is "handling VSS with AI Agents + Skills": the post introduces a configuration in which VSS is called in natural language from four agents (Codex, Claude Code, OpenClaw, and NemoClaw). This builds on what I described in the previous article as "VSS doesn't surface as a standalone product"; this time, the user-facing interface has become one step more accessible.
In this article, I'll step back a bit from the heavy manufacturing use cases and use hypothetical scenarios to explore where VSS Skills could make an impact in familiar settings such as retail stores, reception areas, small warehouses, and restaurants.
VSS Has Entered the Era of AI Agents + Skills
The main message of the NVIDIA Developer Blog post is that "VSS can now be handled with AI Agents + Skills." Specifically, Skills (units of functionality) are distributed to the agent side, and when a video query is given to the agent in natural language, the Skills call the VSS API and return the results.
Skills are essentially "small folders centered around SKILL.md" defined by the specification at agentskills.io, distributed in a form that agents can read and execute. VSS Skills are published on GitHub under NVIDIA-AI-Blueprints/video-search-and-summarization/tree/main/skills, and as of May 2026, 10 skills are available.
skills/
├── alerts/ ← Adding, managing, and monitoring alerts on video streams
├── deploy/ ← Deploying and removing VSS profile docker compose
├── report/ ← Generating analysis reports via /generate endpoint
├── rt-vlm/ ← Real-time VLM (caption / alert / stream / OpenAI compatible)
├── video-analytics/ ← Querying video metrics in Elasticsearch via VA-MCP
├── video-search/ ← Video search with faceted embeddings + VLM critique
├── video-summarization/ ← Video summarization via LVS microservice
├── video-understanding/ ← Q&A about video content using VLM
├── vios/ ← Video IO + Storage (recording, timeline, clip extraction)
└── vss-frag/ ← Long-form summarization, Enterprise RAG, HITL via video_search_frag extension
The VSS core selects from 5 Developer Profiles based on "which workflow to run."
| Profile | Main Role |
|---|---|
| base | VLM Q&A and report generation for short clips |
| alerts (verification) | Combination of CV pipeline + Behavior Analytics + VLM verification |
| alerts (VLM) | Continuous VLM anomaly detection on live streams |
| search | Searching video archives with natural language + faceted embeddings |
| lvs | Chunked summarization of long videos (Long Video Summarization) |
In other words, it's a two-layer structure: Skills are "the interface on the user side," and Profiles are "how VSS is run." From the agent's perspective, Skills are like the handles of drawers, and VSS Profiles are the contents inside those drawers.
Installation only requires sending a single natural language prompt to the agent, which then creates symbolic links for each entire Skill folder under ~/.claude/skills/<name>/ or ~/.codex/skills/<name>/. For generic hosts without agent-specific paths, ~/.agents/skills/<name>/, which follows the agentskills.io specification, is also supported. Because these are symbolic links, a single git pull on the repository side updates the Skills for all agents at once, which is a pleasantly clever design.
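To make the symbolic-link mechanism concrete, here is a minimal Python sketch of what such an install step might do. It is not NVIDIA's actual installer: the function name `link_skills` and the rule "a folder counts as a Skill if it contains SKILL.md" are my own assumptions, and the demo runs entirely inside a temporary directory rather than a real home directory.

```python
import tempfile
from pathlib import Path


def link_skills(repo_skills_dir: Path, agent_skills_dir: Path) -> list[str]:
    """Symlink each skill folder (assumed: one containing SKILL.md) into an agent's skills directory."""
    agent_skills_dir.mkdir(parents=True, exist_ok=True)
    linked = []
    for skill in sorted(repo_skills_dir.iterdir()):
        if not (skill / "SKILL.md").is_file():
            continue  # not a skill folder
        target = agent_skills_dir / skill.name
        if target.is_symlink() or target.exists():
            continue  # leave any existing install untouched
        target.symlink_to(skill.resolve(), target_is_directory=True)
        linked.append(skill.name)
    return linked


# Demo in a throwaway directory so nothing under the real home directory is touched.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "repo" / "skills"
    for name in ("alerts", "video-search"):
        (repo / name).mkdir(parents=True)
        (repo / name / "SKILL.md").write_text("# placeholder\n")

    agent_dir = Path(tmp) / "home" / ".claude" / "skills"
    installed = link_skills(repo, agent_dir)

    # Simulate a repository-side update: the new file is visible through the link.
    (repo / "alerts" / "NEW.md").write_text("updated\n")
    visible_through_link = (agent_dir / "alerts" / "NEW.md").exists()
    print(installed, visible_through_link)
```

Because the agent-side entries are links rather than copies, a change on the repository side shows up for every agent that links to the same folder, which is the property the paragraph above describes.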
The Positioning of the 4 Supported Agents
The official blog explicitly states that "VSS Skills can be called from any of Codex / Claude Code / OpenClaw / NemoClaw." Simply placing the same Skill folder according to each agent's conventions allows any agent to interact with VSS through the same natural language interface.
Faceted Embedding Search and the Agentic Reasoning Layer
Digging a bit into the search profile reveals the sophistication of its video search engine. The official blog prefaces with "video search is one of the most challenging areas in modern information retrieval," then highlights two core capabilities.
The first is a method combining embedding vectors by facet (referred to as Multi-Embedding Search in the official blog), which creates separate embedding indexes for different facets such as objects, events, and attributes, then integrates the results for ranking. When trying to retrieve parallel conditions like "worker in red uniform," "person climbing a ladder," or "not wearing a helmet" using only a single type of vector similarity, the priorities of each element tend to conflict. The idea is that by separating indexes by type and then combining them, you can maintain balance among conditions without sacrificing recall.
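As a rough illustration of the idea, the sketch below scores each clip separately per facet and then fuses the per-facet scores into one ranking. The official blog does not specify how results are integrated, so the weighted average used here is purely my assumption, as are the facet names, the toy 2-dimensional vectors, and the function names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def faceted_rank(query, clips, weights=None):
    """Score every clip per facet, then fuse with a weighted average (an assumed fusion rule)."""
    facets = list(query)
    weights = weights or {f: 1.0 for f in facets}
    total = sum(weights[f] for f in facets)
    scored = []
    for clip_id, facet_vecs in clips.items():
        fused = sum(weights[f] * cosine(query[f], facet_vecs[f]) for f in facets) / total
        scored.append((clip_id, round(fused, 3)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data: one embedding per facet for the query and for each clip.
query = {"object": [1, 0], "event": [1, 0], "attribute": [1, 0]}
clips = {
    "clip_a": {"object": [1, 0], "event": [1, 0], "attribute": [1, 0]},  # matches all facets
    "clip_b": {"object": [1, 0], "event": [0, 1], "attribute": [1, 0]},  # wrong event
    "clip_c": {"object": [0, 1], "event": [0, 1], "attribute": [1, 0]},  # only attribute matches
}

ranking = faceted_rank(query, clips)
print(ranking)
```

Keeping the facets separate means a clip that matches the object and attribute but not the event still gets partial credit, instead of all three conditions competing inside a single vector.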
The second is the Agentic Reasoning Layer, which decomposes complex queries into sub-queries (Query Decomposition), runs verification loops (Verification Loops) for each sub-query, and finally eliminates semantic duplicates (Semantic Deduplication). The flow runs in that order: decompose, verify, then deduplicate.
The representative example in the official blog has OpenClaw process three 10-minute warehouse videos with the following request:
I have a set of warehouse videos located at
~/warehouse_videos. I need to find any instances of a worker climbing a ladder and verify they are wearing a hardhat and safety vest. Can you do this with the VSS Search profile that is deployed?
OpenClaw calls the search profile via a Skill and uses Query Decomposition to check the three conditions ("ladder use," "wearing a hardhat," and "wearing a safety vest") as individual sub-queries. It then picks up candidates with faceted embeddings, has the VLM re-confirm them, and finally consolidates duplicate "on a ladder + missing PPE" hits into a single report. The Verification Loop's ability to filter out false positives like "wait, isn't that a cart rather than a worker?" seems quietly effective.
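The decompose, verify, deduplicate flow can be sketched in a few lines of Python. Everything here is a stand-in under my own assumptions: `decompose` returns a fixed list where a real system would use an LLM, `verify` checks pre-computed labels where a real system would re-ask the VLM, and `dedupe` merges overlapping time ranges as a simplified substitute for semantic deduplication. None of these names come from VSS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    video: str
    start: float  # seconds
    end: float
    labels: frozenset  # what a hypothetical VLM reports seeing


def decompose(query: str) -> list[str]:
    # Stand-in for Query Decomposition (a real system would use an LLM).
    return ["ladder", "hardhat", "safety vest"]


def verify(candidate: Candidate, sub_query: str) -> bool:
    # Stand-in for one Verification Loop pass (a real system would re-ask the VLM).
    return sub_query in candidate.labels


def dedupe(candidates: list[Candidate]) -> list[Candidate]:
    # Simplified stand-in for Semantic Deduplication: merge overlapping hits in the same video.
    merged: list[Candidate] = []
    for c in sorted(candidates, key=lambda c: (c.video, c.start)):
        last = merged[-1] if merged else None
        if last and last.video == c.video and c.start <= last.end:
            merged[-1] = Candidate(last.video, last.start, max(last.end, c.end), last.labels | c.labels)
        else:
            merged.append(c)
    return merged


def find_violations(query: str, candidates: list[Candidate]) -> list[Candidate]:
    anchor, *required = decompose(query)
    on_ladder = [c for c in candidates if verify(c, anchor)]
    missing_ppe = [c for c in on_ladder if not all(verify(c, r) for r in required)]
    return dedupe(missing_ppe)


hits = [
    Candidate("cam1.mp4", 10, 20, frozenset({"ladder", "safety vest"})),            # no hardhat
    Candidate("cam1.mp4", 18, 30, frozenset({"ladder", "safety vest"})),            # same incident, overlapping
    Candidate("cam2.mp4", 5, 15, frozenset({"ladder", "hardhat", "safety vest"})),  # compliant
    Candidate("cam3.mp4", 40, 50, frozenset({"cart"})),                             # false positive
]

report = find_violations("worker on a ladder without hardhat or safety vest", hits)
print(report)
```

In this toy run, the cart is dropped at the verification step, the compliant worker is excluded, and the two overlapping hits from the same camera collapse into a single report entry.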
GPU-specific latency is also officially published for the Alert Verification workflow (RT-DETR + Cosmos Reason 2, assuming 1 alert per minute).
| GPU Configuration | Max Concurrent Streams | Verification Latency |
|---|---|---|
| 1x DGX Spark + 1x AGX Thor | 14 | 0.89 sec |
| 1x H100 | 147 | 1.01 sec |
| 1x RTX PRO 6000 | 87 | 0.82 sec |
The reason it's DGX Spark paired with AGX Thor rather than DGX Spark alone is that the CV pipeline side (DeepStream + RT-DETR) is intended to offload to AGX Thor. For familiar small-scale sites, 14 streams should be more than sufficient in many cases, so it's reassuring that DGX Spark holds its own position compared to the enterprise-oriented H100 configuration.
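To get a feel for what these stream counts mean at small sites, the following back-of-the-envelope sketch divides the published maximum concurrent streams by a per-site camera count. It assumes one stream per camera and the same 1-alert-per-minute load as the benchmark; the per-site camera counts (2, 4, 10) are illustrative values of my own, and real capacity will depend on resolution, alert rate, and model settings.

```python
# Max concurrent streams for Alert Verification, taken from the table above.
MAX_STREAMS = {
    "1x DGX Spark + 1x AGX Thor": 14,
    "1x H100": 147,
    "1x RTX PRO 6000": 87,
}


def sites_supported(max_streams: int, cameras_per_site: int) -> int:
    """How many whole sites fit, assuming one stream per camera."""
    return max_streams // cameras_per_site


capacity = {
    config: {n: sites_supported(streams, n) for n in (2, 4, 10)}
    for config, streams in MAX_STREAMS.items()
}

for config, per_site in capacity.items():
    print(config, per_site)
```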
What's Possible at Familiar Worksites
Rather than the large-scale use cases like Invisible AI or Pegatron covered in the previous article, I tried to imagine scenarios where VSS Skills could make an impact at more familiar, everyday worksites, assuming a small footprint of anywhere from 1-2 to about 10 cameras per location. While I'll leave actual hands-on testing for another occasion, I think a reasonable range of applications can be inferred from the combination of Developer Profiles and the roles of each Skill.
Reviewing Misplaced Items and Customer Flow at Retail Stores
At a drugstore or convenience store in town (assuming a scale of 3-4 cameras per store), I think it would be interesting to use natural language queries to pick out things like "a customer who stood in front of the red shelf for more than 3 minutes" or "a shelf that went out of stock again right after restocking." The idea is to combine the search profile with the video-search and report Skills and ask Claude Code something like "Pull out the time slots last Saturday afternoon when a customer seemed confused in the cosmetics corner." Behaviors that POS data alone can't reveal, such as "picked up a product and put it back" or "looked for a staff member but couldn't find one and left," could become useful material for staff shift reviews.
However, from a privacy standpoint, customer faces and personally identifiable elements are an obvious challenge. Since VSS alone doesn't cover masking functionality, any real-world deployment would require separate designs for mosaic processing and data retention periods.
Recording Visitor Movement at Reception Areas and Entrances
Consider the configuration of alerts (VLM) profile + video-analytics Skill for an office or hospital reception area (a scale that can be covered with 1-2 cameras). The idea is to have the VLM make real-time judgments on conditions like "an unscheduled visitor was at the counter for more than 5 minutes" or "there's a figure at the entrance after 8 PM," and store the results as metrics in Elasticsearch.
Rather than security applications, using it for reviewing visitor interactions (average visitor dwell time, peak hour trends) seems more likely to gain agreement at the worksite. The vision is to use the report Skill to auto-generate monthly reports and connect them to operational decisions like "Tuesday mornings see concentrated delivery traffic, so let's add one more reception staff member."
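As a sketch of the kind of aggregation such a monthly report might contain, the snippet below computes average dwell time and the busiest arrival hour from a list of visit records. The record format (arrival and departure timestamps as strings) is hypothetical; how VSS actually stores these metrics in Elasticsearch is not covered here.

```python
from collections import Counter
from datetime import datetime

# Hypothetical visit records: (arrival, departure) as local timestamps.
visits = [
    ("2026-05-12 09:05", "2026-05-12 09:12"),
    ("2026-05-12 09:40", "2026-05-12 09:43"),
    ("2026-05-12 14:10", "2026-05-12 14:15"),
]


def summarize(records):
    """Return average dwell time in minutes and the busiest arrival hour."""
    fmt = "%Y-%m-%d %H:%M"
    parsed = [(datetime.strptime(a, fmt), datetime.strptime(d, fmt)) for a, d in records]
    dwell_minutes = [(d - a).total_seconds() / 60 for a, d in parsed]
    by_hour = Counter(a.hour for a, _ in parsed)
    peak_hour, peak_visits = by_hour.most_common(1)[0]
    return {
        "avg_dwell_min": sum(dwell_minutes) / len(dwell_minutes),
        "peak_hour": peak_hour,
        "peak_visits": peak_visits,
    }


summary = summarize(visits)
print(summary)
```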
Monitoring Sorting Mistakes and Safe Behavior in Small Warehouses
Let me bring the official blog's ladder + PPE example down to the scale of a mid-sized local warehouse. Not a massive fulfillment center, but a regional logistics hub with 1 camera per line (5-10 cameras across the entire site). Combining the alerts (verification) profile with rt-vlm + alerts Skills, you could set up operations like "immediately alert when a helmet isn't worn during ladder work" or "send a caution when more than 5 items are loaded simultaneously onto a picking cart."
The PCB line case at Pegatron covered in the previous article produced a major outcome of "67% defect rate reduction," but at a local warehouse scale, the target would more likely be something like "reduce near-miss incidents from 1 per month to 0" or "cut picking errors from 5 per week to 3." Even so, I believe the judgment accuracy of Cosmos Reason 2 covers this range sufficiently.
Reinforcing SOP Compliance with Video at Restaurant Kitchens
Finally, let's talk about the kitchen of a chain restaurant (1-2 cameras per store, limited to inside the kitchen). HACCP self-inspection is mostly paper-based record-keeping, but with a combination of rt-vlm + video-understanding, items like "did they wash their hands for at least 15 seconds before cooking," "did they leave the refrigerator open for more than 3 minutes," and "did they switch cutting boards after handling raw meat" could be reinforced with video evidence.
The camera placement would be limited to the kitchen only, with a clear boundary that it doesn't face dining areas or any personally identifiable zone. Since video evidence of SOP compliance would be retained, the vision is to use it for headquarters-to-store guidance or as review material for new employee training. Even if the VLM's judgment occasionally fluctuates, there's value in creating an "opportunity for humans to review after the fact."
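The SOP checks above boil down to comparing an observed duration against a threshold. The sketch below shows one way that rule layer might look once a VLM has turned video into timed events. The event format, event type names, and clip identifiers are all hypothetical; the thresholds are the ones mentioned in this section.

```python
# Thresholds from the examples above; event type names are hypothetical.
RULES = {
    "hand_wash": ("min", 15),           # at least 15 seconds
    "refrigerator_open": ("max", 180),  # no longer than 3 minutes
}


def is_compliant(event: dict) -> bool:
    """Check one timed event against its SOP threshold."""
    kind, limit = RULES[event["type"]]
    duration = event["duration_sec"]
    return duration >= limit if kind == "min" else duration <= limit


# Hypothetical events, as a VLM-based pipeline might emit them.
events = [
    {"clip": "k1", "type": "hand_wash", "duration_sec": 20},
    {"clip": "k2", "type": "hand_wash", "duration_sec": 8},
    {"clip": "k3", "type": "refrigerator_open", "duration_sec": 240},
    {"clip": "k4", "type": "refrigerator_open", "duration_sec": 45},
]

# Flag only the clips a human should review afterwards.
flagged = [e["clip"] for e in events if e["type"] in RULES and not is_compliant(e)]
print(flagged)
```

Flagging clips for review, rather than acting on them automatically, fits the point that VLM judgments may fluctuate and that the value lies in giving humans something to check after the fact.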
Aggregating Multiple Stores at Headquarters Instantly Creates an Enterprise Feel
So far I've been talking about single locations, but if you configure multiple stores and locations to send video footage to headquarters via the cloud for batch analysis, the setup instantly scales up to enterprise level. For example, a retail or restaurant chain could place a Jetson Orin Nano Super or AGX Thor at the edge in each store for real-time judgment and initial filtering, while running cross-site analysis at headquarters on DGX Spark or H100 with the search and lvs profiles, giving a two-tier Edge-to-Cloud approach.
The Edge-to-Fog-to-Cloud architecture of Fogsphere and the six-automaker deployment of Invisible AI introduced in the previous article both ultimately come down to this structure. Analyses such as cross-site aggregation of "customers who seemed confused in front of the red shelf," or a headquarters dashboard showing "stores where hand-washing non-compliance frequently occurs," can be called up in natural language with just the report Skill. Since VSS Skills use the same interface for both single-store and headquarters-aggregated use, the scale-up from PoC to production deployment is seamless, which is a nice benefit.
Summary
So far, I've explored what the AI Agents + Skills integration of VSS brings, alongside potential use cases at familiar worksites.
The biggest takeaway in the context of this series is that the distance from familiar coding agents to handling VSS has shrunk by one step. The structure I described in the previous article, "VSS doesn't surface as a standalone product," remains unchanged; I think it's more accurate to say that only the interface on the user side has become one step more accessible.
The four familiar worksite scenarios (retail, reception, small warehouse, restaurant) all assumed a small footprint, from 1-2 cameras at a reception desk or kitchen up to 5-10 across a small warehouse. The benchmark figure of 14 streams for Alert Verification with DGX Spark + AGX Thor leaves headroom for settings of this size. Stepping back from the heavy manufacturing use cases and imagining VSS Skills operating at a local shop or office feels both realistic and exciting.
Reference Links
- NVIDIA blog: Transform Video Into Instantly Searchable, Actionable Intelligence with AI Agents and Skills (2026-05-13)
- VSS Skills GitHub (NVIDIA-AI-Blueprints/video-search-and-summarization/tree/main/skills)
- agentskills.io specification
- VSS Documentation 3.0.0
- VSS Warehouse Blueprint 3.1.0 Release Notes
- VSS Brev Launchable
- Previous article: Investigating the Current State of Manufacturing VSS as Seen Through VSS 3.1.0 EA and Hannover Messe
- Running NemoClaw on DGX Spark
- Running NeMo Agent Toolkit on a Local Configuration with DGX Spark + vLLM
- An Overview of the NVIDIA NeMo Framework — Spring 2026 Ecosystem Map and DGX Spark Series Index
