I thought about practical use cases for NVIDIA VSS + AI Agents + Skills in everyday work settings

VSS has entered the era of AI Agents + Skills. Video search and analysis can now be invoked in natural language from Codex or Claude Code, which makes use cases in familiar settings such as retail, reception, warehouses, and restaurants much more realistic. In this article, I summarize how the Skills configuration changes the usability of VSS and then walk through each use case.

森茂洋 / Hiroshi Morishige

2026.05.16


Introduction

Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
A little while ago, in the article below, I summarized the VSS 3.1.0 EA setup differences and manufacturing industry cases from Hannover Messe 2026.
https://dev.classmethod.jp/articles/dgx-spark-vss-3-1-revisit/
The day after that article was published (2026-05-13), a follow-up article appeared on the NVIDIA Developer Blog.
https://developer.nvidia.com/blog/transform-video-into-instantly-searchable-actionable-intelligence-with-ai-agents-and-skills/
The main topic is "handling VSS with AI Agents + Skills," introducing a configuration where VSS is called using natural language from four types of agents: Codex, Claude Code, OpenClaw, and NemoClaw. This is an extension of what I wrote in the previous article about "VSS not appearing as a standalone product," and now the interface on the user side has become one step more accessible.
In this article, I'll step back a bit from heavy manufacturing use cases and explore how "VSS Skills might be effective" in more familiar settings such as retail, reception areas, small warehouses, and restaurants, by imagining potential use cases.
VSS Has Entered the Age of AI Agents + Skills

The main message of the NVIDIA Developer Blog is that "VSS can now be handled with AI Agents + Skills." Specifically, Skills (units of functionality) are distributed to the agent side, and when an agent issues a video query in natural language, the Skill calls the VSS API and returns the results.
In practice, a Skill is a "small folder centered around SKILL.md" as defined by the agentskills.io specification, distributed in a form that agents can read and execute. VSS Skills are published on GitHub at NVIDIA-AI-Blueprints/video-search-and-summarization/tree/main/skills, and as of May 2026, 10 skills are available.
Skills Catalog (under NVIDIA-AI-Blueprints/video-search-and-summarization)
skills/
├── alerts/              ← Add, manage, and monitor alerts on video streams
├── deploy/              ← Deploy/remove VSS profile via docker compose
├── report/              ← Generate analysis reports via /generate endpoint
├── rt-vlm/              ← Real-time VLM (caption / alert / stream / OpenAI compatible)
├── video-analytics/     ← Query video metrics in Elasticsearch via VA-MCP
├── video-search/        ← Video search with perspective-specific embeddings + VLM critique
├── video-summarization/ ← Video summarization via LVS microservice
├── video-understanding/ ← Q&A on video content via VLM
├── vios/                ← Video IO + Storage (recording, timeline, clip extraction)
└── vss-frag/            ← Extended long-form summarization, Enterprise RAG, HITL via video_search_frag
On the VSS side, you choose one of five Developer Profiles depending on which workflow you want to run.


| Profile | Primary Role |
| --- | --- |
| base | VLM Q&A and report generation for short clips |
| alerts (verification) | Combination of CV pipeline + Behavior Analytics + VLM verification |
| alerts (VLM) | Continuous VLM anomaly detection on live streams |
| search | Natural language + perspective-specific embedding search of video archives |
| lvs | Chunked summarization of long videos (Long Video Summarization) |

In other words, there is a two-layer structure: Skills are the "interface on the user side" and Profiles are "how to run VSS." As shown in the diagram below, from the agent's perspective, Skills are the drawer handles and VSS Profiles are the contents of the drawer.
Installation only requires sending a single natural language prompt to the agent, which then symlinks each Skill folder as a whole into ~/.claude/skills/<name>/ or ~/.codex/skills/<name>/. For generic hosts without agent-specific paths, ~/.agents/skills/<name>/, which follows the agentskills.io specification, is also available. Because symbolic links are used, a single git pull on the repository updates the Skills for all agents at once, which is a pleasantly elegant design.
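To make the symlink-based install concrete, here is a minimal sketch of what such a step might look like. It is not the installer the agents actually run: the function name `link_skills` and the throwaway directory layout are assumptions for illustration, and only the idea of linking each Skill folder into a per-agent skills directory comes from the description above.

```python
import tempfile
from pathlib import Path


def link_skills(repo_skills_dir: Path, agent_skills_dir: Path) -> list[str]:
    """Symlink each skill folder (one containing SKILL.md) into agent_skills_dir."""
    agent_skills_dir.mkdir(parents=True, exist_ok=True)
    linked = []
    for skill in sorted(repo_skills_dir.iterdir()):
        if not (skill / "SKILL.md").is_file():
            continue  # skip anything that is not a skill folder
        target = agent_skills_dir / skill.name
        if target.is_symlink() or target.exists():
            continue  # leave an existing install untouched
        target.symlink_to(skill.resolve(), target_is_directory=True)
        linked.append(skill.name)
    return linked


# Demo in a throwaway directory (hypothetical layout, not the real repository).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    repo_skills = root / "repo" / "skills"
    for name in ("report", "video-search"):
        (repo_skills / name).mkdir(parents=True)
        (repo_skills / name / "SKILL.md").write_text("v1")

    for agent_dir in (root / ".claude" / "skills", root / ".codex" / "skills"):
        print(agent_dir.parent.name, link_skills(repo_skills, agent_dir))

    # Simulate a `git pull` changing a skill: every agent sees it through its link.
    (repo_skills / "report" / "SKILL.md").write_text("v2")
    print((root / ".codex" / "skills" / "report" / "SKILL.md").read_text())
```

Because each agent directory holds only links, the repository stays the single source of truth. The sketch assumes a POSIX-like system where creating symbolic links needs no extra privileges.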
The Positioning of the 4 Supported Agents

The official blog explicitly states that "VSS Skills can be called from any of Codex, Claude Code, OpenClaw, or NemoClaw." Simply placing the same Skill folder according to each agent's conventions allows any agent to interact with VSS through the same natural language interface.

Perspective-Specific Embedding Search and the Agentic Reasoning Layer

Digging a bit into the search profile reveals the craftsmanship behind it as a video search engine. The official blog prefaces this by saying "video search is one of the most difficult areas in modern information retrieval," then highlights two core features.
The first is a method that combines perspective-specific embedding vectors (referred to as Multi-Embedding Search in the official blog), which creates separate embedding indexes for each perspective such as objects, events, and attributes, then integrates and ranks the results. When trying to capture parallel conditions like "worker in a red uniform," "person climbing a ladder," or "no helmet worn" using only a single type of vector similarity, the priority of each element tends to conflict. The idea is that separating indexes by type and then combining them allows conditions to be balanced without sacrificing recall.
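As a rough illustration of "separate indexes, then combine," the sketch below fuses three per-perspective rankings with reciprocal rank fusion (RRF). RRF is a technique I am substituting for illustration. The blog does not say how VSS integrates the rankings, and the clip IDs, perspective names, and the constant k = 60 are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: dict[str, list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked clip lists: each appearance adds 1 / (k + rank) to the clip's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, clip_id in enumerate(ranked_ids, start=1):
            scores[clip_id] += 1.0 / (k + rank)
    # Highest fused score first; clip ID breaks ties deterministically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical top-3 results from three separate embedding indexes.
rankings = {
    "object": ["clip_07", "clip_02", "clip_11"],     # "worker in a red uniform"
    "event": ["clip_02", "clip_07", "clip_05"],      # "person climbing a ladder"
    "attribute": ["clip_02", "clip_11", "clip_07"],  # "no helmet worn"
}

for clip_id, score in reciprocal_rank_fusion(rankings):
    print(f"{clip_id}: {score:.4f}")
```

A clip that ranks well in all three indexes rises to the top, while one that matches a single perspective stays in the list with a lower score. That is one way to balance parallel conditions without dropping candidates.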
The second is the Agentic Reasoning Layer, which decomposes complex queries into sub-queries (Query Decomposition), runs verification loops (Verification Loops) for each sub-query, and finally eliminates semantic duplicates (Semantic Deduplication). The flow looks like the diagram below.
The representative example in the official blog involves asking OpenClaw to do the following with three 10-minute warehouse videos:
I have a set of warehouse videos located at ~/warehouse_videos. I need to find any instances of a worker climbing a ladder and verify they are wearing a hardhat and safety vest. Can you do this with the VSS Search profile that is deployed?
OpenClaw calls the search profile via Skills, verifies the three conditions—"ladder use," "helmet worn," and "safety vest worn"—as individual sub-queries through Query Decomposition, retrieves candidates using perspective-specific embeddings, has the VLM re-confirm, and finally organizes duplicates of "ladder + not wearing" into a single report. The Verification Loop quietly does important work by filtering out false positives like "is this a cart rather than a worker?"
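The decompose, verify, and deduplicate flow for the ladder example can be sketched as follows. This is a toy under stated assumptions, not how VSS implements it: `decompose` returns a fixed list instead of asking a model, the `confirmed` field stands in for the VLM re-check, and duplicates are merged by overlapping time range rather than by semantic similarity. All names here are hypothetical.

```python
from dataclasses import dataclass

LADDER = "worker climbing a ladder"
HARDHAT = "wearing a hardhat"
VEST = "wearing a safety vest"


@dataclass(frozen=True)
class Segment:
    video: str
    start_s: float
    end_s: float
    confirmed: frozenset  # sub-queries the (hypothetical) VLM check confirmed


def decompose(query: str) -> list[str]:
    # Stand-in for Query Decomposition; a real agent would derive these from the query.
    return [LADDER, HARDHAT, VEST]


def verify(segments, sub_queries):
    """Keep segments matching the primary condition and list the missing ones."""
    primary, *required = sub_queries
    findings = []
    for seg in segments:
        if primary not in seg.confirmed:
            continue  # false positive, e.g. a cart rather than a worker on a ladder
        missing = tuple(q for q in required if q not in seg.confirmed)
        findings.append((seg, missing))
    return findings


def deduplicate(findings):
    """Merge overlapping segments of one video that share the same missing items."""
    merged = []
    for seg, missing in sorted(findings, key=lambda f: (f[0].video, f[0].start_s)):
        if merged:
            last, last_missing = merged[-1]
            if last.video == seg.video and last_missing == missing and seg.start_s <= last.end_s:
                span = Segment(seg.video, last.start_s, max(last.end_s, seg.end_s),
                               last.confirmed | seg.confirmed)
                merged[-1] = (span, missing)
                continue
        merged.append((seg, missing))
    return merged


segments = [
    Segment("cam1.mp4", 30, 45, frozenset({LADDER, HARDHAT, VEST})),
    Segment("cam1.mp4", 120, 135, frozenset({LADDER, VEST})),
    Segment("cam1.mp4", 130, 150, frozenset({LADDER, VEST})),
    Segment("cam2.mp4", 10, 20, frozenset({VEST})),  # no ladder: dropped by verify
]

for seg, missing in deduplicate(verify(segments, decompose("ladder + PPE check"))):
    status = "OK" if not missing else "missing: " + ", ".join(missing)
    print(f"{seg.video} {seg.start_s}-{seg.end_s}s {status}")
```

The two overlapping hardhat violations collapse into a single entry, which mirrors the way the report groups repeated sightings of the same incident.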
Latency by GPU is also officially published for the Alert Verification workflow (RT-DETR + Cosmos Reason 2, assuming 1 alert per minute).


| GPU Configuration | Max Concurrent Streams | Verification Latency |
| --- | --- | --- |
| 1x DGX Spark + 1x AGX Thor | 14 | 0.89 sec |
| 1x H100 | 147 | 1.01 sec |
| 1x RTX PRO 6000 | 87 | 0.82 sec |

The reason it's a combination with AGX Thor rather than DGX Spark alone is that the CV pipeline side (DeepStream + RT-DETR) is intended to be offloaded to AGX Thor. In familiar workplace settings, 14 streams should be sufficient in many cases, so it's reassuring to see a DGX Spark configuration listed alongside enterprise-grade H100 configurations with comparable verification latency.
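As a quick back-of-the-envelope check on "14 streams should be sufficient," the sketch below divides the published stream limits by a per-site camera count. The stream figures come from the table above; the helper name `sites_supported` and the camera counts are illustrative assumptions. Real capacity will depend on alert frequency, since the benchmark assumes 1 alert per minute.

```python
# Max concurrent streams from the published Alert Verification benchmark.
MAX_STREAMS = {
    "1x DGX Spark + 1x AGX Thor": 14,
    "1x H100": 147,
    "1x RTX PRO 6000": 87,
}


def sites_supported(config: str, cameras_per_site: int) -> int:
    """How many sites of a given size fit within one configuration's stream limit."""
    return MAX_STREAMS[config] // cameras_per_site


for cameras in (2, 4, 10):
    sites = sites_supported("1x DGX Spark + 1x AGX Thor", cameras)
    print(f"{cameras} cameras per site -> {sites} site(s)")
```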
What Can Be Done in Familiar Workplaces

Rather than large-scale cases like Invisible AI or Pegatron covered in the previous article, I tried to imagine scenarios where VSS Skills might be effective in more familiar workplaces, on the scale of a few cameras per location (1–2 in the smallest cases, up to around 10). I'll leave actual hardware verification for another occasion, but I think the range of possibilities can be reasonably inferred from the combination of Developer Profiles and the roles of each Skill.

Reviewing Incorrect Product Placement and Customer Traffic Flow in Retail Stores

In drugstores or convenience stores in town (assuming 3–4 cameras per store), it could be interesting to use natural language queries on in-store traffic cameras to find things like "customers who stood in front of the red shelf for more than 3 minutes" or "shelves that went out of stock again immediately after restocking." The idea is to combine the search profile with video-search + report Skills and ask Claude Code something like: "Find the time slots last Saturday afternoon when customers appeared confused in the cosmetics section." Behaviors that aren't visible from POS data alone, like "picked up a product and put it back" or "looked for a staff member but left without finding one," could serve as review material for staff shift debriefs.
However, from a privacy perspective, customer faces and personally identifiable elements present obvious challenges. Since VSS alone doesn't cover masking functionality, real-world deployments would require separate mosaic processing and data retention period design.
Recording Visitor Movements at Reception Areas and Entrances

For reception areas in offices or hospitals (a scale well covered by 1–2 cameras), consider a configuration using the alerts (VLM) profile + video-analytics Skill. The idea is for the VLM to make real-time judgments on conditions like "a visitor without an appointment has been at the counter for more than 5 minutes" or "a figure at the entrance after 8 PM," then store the results as metrics in Elasticsearch.
Rather than for security purposes, using it for visitor response reviews (average visitor dwell time, peak hour trends) seems more likely to gain agreement from on-site staff. The vision would be something like automatically generating monthly reports with the report Skill and using them to make operational decisions such as "deliveries concentrate on Tuesday mornings, so assign one more reception staff member."
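The kind of aggregation behind such a monthly report can be sketched in a few lines. The visit records below are invented sample data, and `summarize_visits` is a hypothetical helper. In a real deployment the timestamps would come from the metrics stored in Elasticsearch, whose schema I have not verified.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def summarize_visits(visits):
    """Average dwell time in minutes and the busiest (weekday, hour) arrival slot."""
    dwell_min = [(leave - arrive).total_seconds() / 60 for arrive, leave in visits]
    slots = Counter((DAYS[arrive.weekday()], arrive.hour) for arrive, _ in visits)
    (peak_day, peak_hour), peak_count = slots.most_common(1)[0]
    return {
        "visits": len(visits),
        "avg_dwell_min": round(mean(dwell_min), 1),
        "peak_day": peak_day,
        "peak_hour": peak_hour,
        "peak_count": peak_count,
    }


# Invented sample: (arrival, departure) pairs at a reception counter.
visits = [
    (datetime(2026, 5, 5, 9, 10), datetime(2026, 5, 5, 9, 14)),
    (datetime(2026, 5, 5, 9, 40), datetime(2026, 5, 5, 9, 47)),
    (datetime(2026, 5, 6, 14, 5), datetime(2026, 5, 6, 14, 8)),
    (datetime(2026, 5, 12, 9, 20), datetime(2026, 5, 12, 9, 26)),
]

print(summarize_visits(visits))
```

With this sample, arrivals cluster on Tuesday mornings, which is the sort of signal that would support a staffing decision like the one above.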
Monitoring Sorting Errors and Safe Operations in Small Warehouses

Let me bring the official blog's ladder + PPE example closer to a mid-sized urban warehouse scale. Think not of a huge fulfillment center but of a regional logistics hub with 1 camera per line (5–10 cameras for the entire facility). By combining the alerts (verification) profile with rt-vlm + alerts Skills, you could set up operations like "immediate alert when a helmet is not worn during ladder work" or "warning when more than 5 items are loaded onto a picking cart simultaneously."
The PCB line case at Pegatron covered in the previous article achieved the large-scale result of "67% reduction in defect rate," but at an urban warehouse scale, the targets would be more like "reduce near-miss incidents from 1 per month to 0" or "reduce picking errors from 5 per week to 3." For targets on that scale, I believe Cosmos Reason 2's judgment accuracy is sufficient.
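The two warehouse alert conditions mentioned in this section can be written down as simple rules over per-frame observations. This is only a sketch of the decision logic: the observation fields (`on_ladder`, `helmet`, `cart_items`) are hypothetical stand-ins for whatever the CV pipeline and VLM actually report, and the alerts Skill has its own way of defining rules that I am not reproducing here.

```python
# Each rule maps an observation dict to True when an alert should fire.
RULES = {
    "no helmet during ladder work": lambda o: o["on_ladder"] and not o["helmet"],
    "more than 5 items on picking cart": lambda o: o["cart_items"] > 5,
}


def triggered_alerts(observation: dict) -> list[str]:
    """Return the names of all rules that the observation violates."""
    return [name for name, rule in RULES.items() if rule(observation)]


# Hypothetical observations, one per analyzed moment.
observations = [
    {"on_ladder": True, "helmet": False, "cart_items": 3},
    {"on_ladder": False, "helmet": False, "cart_items": 6},
    {"on_ladder": True, "helmet": True, "cart_items": 5},
]

for obs in observations:
    print(triggered_alerts(obs) or "no alert")
```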
Reinforcing SOP Compliance with Video in Restaurant Kitchens

Finally, let's talk about chain restaurant kitchens (1–2 cameras per store, kitchen only). HACCP self-inspection is primarily paper-based, but combining rt-vlm + video-understanding could help supplement items like "did handwashing last at least 15 seconds before cooking?", "was the refrigerator left open for more than 3 minutes?", and "was the cutting board switched after handling raw meat?" with video evidence.
The premise is that cameras are confined to the kitchen and never pointed at customer seating or other areas where individuals can be identified, and that boundary must be drawn clearly on-site. Since video remains as a record of SOP compliance, the vision is to use it for guidance from headquarters to individual stores and as review material for training new employees. Even if the VLM's judgment occasionally wavers, the value lies in creating "an opportunity for humans to look back at it later."
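Two of the kitchen checks above (handwashing of at least 15 seconds, refrigerator open no longer than 3 minutes) reduce to measuring the time between paired events. The sketch below assumes the VLM output has already been turned into timestamped events. The event names and sample timeline are invented for illustration.

```python
def durations(events, start, end):
    """Pair each `start` event with the next `end` event; return elapsed seconds."""
    elapsed, opened_at = [], None
    for t, name in sorted(events):
        if name == start and opened_at is None:
            opened_at = t
        elif name == end and opened_at is not None:
            elapsed.append(t - opened_at)
            opened_at = None
    return elapsed


# Invented timeline: (seconds since shift start, event name).
events = [
    (0, "handwash_start"), (12, "handwash_end"),
    (300, "fridge_open"), (520, "fridge_close"),
    (900, "handwash_start"), (920, "handwash_end"),
]

short_washes = [d for d in durations(events, "handwash_start", "handwash_end") if d < 15]
long_opens = [d for d in durations(events, "fridge_open", "fridge_close") if d > 180]

print("handwashing under 15 s:", short_washes)
print("refrigerator open over 3 min:", long_opens)
```

The flagged durations are what a person would review later. The sketch makes no attempt to judge borderline cases, which fits the idea that the video mainly creates an opportunity for human follow-up.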
Aggregating Multiple Stores at Headquarters Instantly Creates an Enterprise Feel

Up to this point, I've discussed single-location scenarios, but once camera footage from multiple stores and locations is aggregated to headquarters via the cloud for collective analysis, the setup immediately takes on enterprise scale. For example, in chain retail or restaurant settings, you could place Jetson Orin Nano Super or AGX Thor at the edge in each store to handle real-time judgment and initial filtering, while headquarters runs cross-store analysis with the search and lvs profiles on DGX Spark or H100. That gives a two-tiered Edge-to-Cloud approach.
The Edge-to-Fog-to-Cloud architecture by Fogsphere and the six-automotive-manufacturer deployment by Invisible AI introduced in the previous article ultimately have this same structure. Analysis such as aggregating "customers who were confused in front of the red shelf" across all stores, or visualizing "stores with frequent handwashing non-compliance" on a headquarters dashboard, can all be called from natural language using just a single report Skill. Since VSS Skills use the same interface whether for a single store or headquarters aggregation, the seamless scale-up from PoC to production deployment is a pleasing feature.
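On the headquarters side, the cross-store view is essentially a group-and-count over events that each edge device has already filtered. The sketch below shows that step only. The event records, store names, and the `rank_stores` helper are assumptions, and a real deployment would query the aggregated store rather than an in-memory list.

```python
from collections import Counter


def rank_stores(events, event_type):
    """Count events of one type per store, most frequent first (name breaks ties)."""
    counts = Counter(e["store"] for e in events if e["type"] == event_type)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented events forwarded from edge devices at three stores.
events = [
    {"store": "Shibuya", "type": "handwash_noncompliance"},
    {"store": "Umeda", "type": "handwash_noncompliance"},
    {"store": "Shibuya", "type": "handwash_noncompliance"},
    {"store": "Sapporo", "type": "confused_at_red_shelf"},
    {"store": "Shibuya", "type": "confused_at_red_shelf"},
]

for store, count in rank_stores(events, "handwash_noncompliance"):
    print(store, count)
```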
Summary

In this article, I've explored what the AI Agents + Skills evolution of VSS brings, alongside potential use cases in familiar workplace settings.
The biggest takeaway in the context of this series is that the distance from familiar coding agents to handling VSS has been shortened by one step. I think the accurate interpretation is that the structure I wrote about in the previous article—"VSS doesn't surface as a standalone product"—remains unchanged, while only the interface on the user side has become one step more accessible.
The four familiar workplace scenarios (retail, reception, small warehouse, restaurant) were all premised on small deployments, from 1–2 cameras up to about 10 per location. The benchmark value of 14 streams for Alert Verification with DGX Spark + AGX Thor is enough to cover settings of that size. Stepping back from heavy manufacturing cases and imagining VSS Skills running in town shops and offices feels both realistic and exciting.
Reference Links

NVIDIA blog: Transform Video Into Instantly Searchable, Actionable Intelligence with AI Agents and Skills (2026-05-13)
VSS Skills GitHub (NVIDIA-AI-Blueprints/video-search-and-summarization/tree/main/skills)
agentskills.io specification
VSS Documentation 3.0.0
VSS Warehouse Blueprint 3.1.0 Release Notes
VSS Brev Launchable
Previous article: Investigating the Current State of Manufacturing VSS as Seen in VSS 3.1.0 EA and Hannover Messe
Running NemoClaw on DGX Spark
Running NeMo Agent Toolkit on a Local DGX Spark + vLLM Configuration
An Overview of the NVIDIA NeMo Framework — Spring 2026 Ecosystem Map and DGX Spark Series Index
