I thought about the retail store version of NVIDIA FOX Blueprint with VSS Skills and Hermes Agent

I will introduce the design of "Mini-Retail FOX," which solves retail store operational challenges using video AI. We adapted the factory-oriented NVIDIA FOX for retail use, and compared three implementation patterns that leverage NVIDIA's official VSS Skills and Hermes Agent to achieve shelf depletion detection and aisle obstruction detection.

森茂洋 / Hiroshi Morishige

2026.06.18


Introduction

Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
Last time, I wrote about "Mini-FOX," an approach to start small with NVIDIA Factory Operations Blueprint (commonly known as FOX) by cutting it down to 1 line, 1 camera, and 1 use case.
https://dev.classmethod.jp/articles/nvidia-fox-blueprint-mini-fox-structure/
FOX is a large-scale AI Blueprint for manufacturing, with assumed hardware at the DGX Station level. The idea was to shrink that down to 1 line and start by combining DGX Spark, PCs, Jetson, and AWS. After finishing that write-up, I thought this idea could apply just as well to settings outside of factories.
Supermarket store operations in particular share similar challenges with factories in the sense that it's difficult to continuously review long hours of video every day. Thin shelves, delayed restocking, crowded checkout areas, carts left in aisles—situations that are hard to see from POS or inventory data alone accumulate in the footage, and it's quite unrealistic for store managers or supervisors to review all of it.
In this article, I'll explore an approach called "Mini-Retail FOX," which applies the Mini-FOX concept to retail stores. At the core of the reasoning and natural language interface, I'll place a combination of NVIDIA's official VSS Skills and Hermes Agent.
Note: "Mini-Retail FOX" is a name I coined to organize the discussion in this article and is not an official NVIDIA term. The same applies to "Mini-FOX" from the previous article; neither is an official NVIDIA product name or Blueprint name.
How Does FOX Look When Applied to Retail Stores?

Here's the main topic.
FOX was a reference design that integrated factory machine signals, quality systems, work procedures, and operational alerts, with a Factory Manager AI orchestrating specialized agents and machinery. It was a structure where the "field brain" sits at the center, with domain-specialized agents for safety, quality, maintenance, and operations hanging below it.
Translating this to retail stores, the equivalent of the Factory Manager AI would be a "Store Operations Support AI." Picture agents for checking shelf thinning and restocking, monitoring checkout congestion, watching aisles and traffic flow, and writing daily reports arranged around the Store Operations Support AI, supporting the judgments of store managers and supervisors.
The granularity of events handled differs considerably between factories and retail. Factories focus on machine signals and procedure deviations, while retail centers on "changes in the state of people and shelves" read from video and sales data. On the other hand, the idea of "the field brain surveying long-form data across the board and narrowing it down to candidates for human review" applies naturally to both.
It's not the case that FOX for factories can simply be placed in retail stores as-is. The official FOX configuration is quite heavyweight, and if you try to target all stores and all cameras right away, there's a high chance of stalling before the PoC even gets moving. So, just like Mini-FOX, the approach is to start by cutting it down small.
Don't Aim for the Entire Store at Once: Carve Out a Mini-Retail FOX

Just as I framed Mini-FOX as a "trial version of the full-featured FOX," I frame Mini-Retail FOX as a "trial version of a store operations support AI." It starts with 1 store, one to a few cameras, and 1 use case.
Here's a rough mapping of FOX scaled down for retail stores.


| FOX Configuration | How to Start in Mini-Retail FOX |
| --- | --- |
| Factory Manager Agent for the entire plant | Lightweight Store Supervisor Agent for 1 store |
| Many specialized agents | Narrowed down to 3–5 such as shelf, flow, checkout, report |
| Large-scale local inference on DGX Station | Distributed across DGX Spark, PC GPU, AWS Bedrock, EC2 GPU |
| Operational twin | Start with store event timeline and a simple dashboard |
| Automated retraining and production push | Keep as a retraining candidate queue with human review by store managers/SVs |

Personally, I think carving out a small slice like this is the most realistic approach for retail as well. For store-facing AI, the first hurdle is capturing even one genuinely problematic scene, not "connecting everything." Pursuing shelf thinning, congestion, aisles, safety, and promotional effectiveness all in parallel from the start makes operations and data preparation much harder all at once.
The Minimum Configuration of Mini-Retail FOX Starts with Event Generation

The minimum configuration of Mini-Retail FOX follows almost the same flow as Mini-FOX. Rather than feeding video directly to a VLM, you thin it out at the edge and convert it to events, then pass those to LLM or VLM reasoning. Here's an overall diagram to get a sense of it.
What I want to emphasize here is the point of not feeding all frames to a VLM. If you stream store footage during business hours directly into a VLM, inference costs and bandwidth will quickly become problematic. The flow is to first thin out at the edge, use lightweight detection and rule judgment to convert only frames that might be "candidates for review" into events, and then have the LLM or VLM return a situation description, review perspectives, and recommended actions for those events.
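To make the "thin out first, then apply a rule" step concrete, here is a minimal Python sketch. It assumes a hypothetical detector that has already reduced each kept frame to a shelf density score between 0 and 1; the names `Observation`, `thin_frames`, and `low_density_events`, as well as the threshold values, are my own placeholders and not part of FOX or VSS.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Observation:
    """One thinned frame after lightweight detection (hypothetical shape)."""
    t: float              # seconds since the start of the stream
    shelf_density: float  # 0.0 (empty) to 1.0 (full), from an assumed detector


def thin_frames(timestamps: Iterable[float], every_s: float) -> Iterator[float]:
    """Keep at most one frame per `every_s` seconds."""
    last_kept: Optional[float] = None
    for t in timestamps:
        if last_kept is None or t - last_kept >= every_s:
            last_kept = t
            yield t


def low_density_events(
    observations: Iterable[Observation],
    density_threshold: float = 0.4,  # assumed value, tune per shelf
    hold_s: float = 60.0,            # "low for 60 seconds" style rule
) -> Iterator[float]:
    """Yield the time at which low density has persisted for `hold_s` seconds.

    Fires once per low-density episode and resets when the density recovers.
    """
    low_since: Optional[float] = None
    fired = False
    for obs in observations:
        if obs.shelf_density < density_threshold:
            if low_since is None:
                low_since = obs.t
            if not fired and obs.t - low_since >= hold_s:
                fired = True
                yield obs.t
        else:
            low_since, fired = None, False
```

Only the timestamps yielded by `low_density_events` would go on to become events for the LLM or VLM, so the expensive reasoning step sees a handful of candidates instead of every frame.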
As a concrete image of the store event JSON, here's an example where a shelf-thinning candidate appeared at a beverage shelf in the evening. The idea is to format the detector output and simple rule judgment results directly into JSON and pass them to the reasoning side.
{
  "timestamp": "2026-06-17T16:20:30+09:00",
  "store_id": "store-01",
  "camera_id": "store-01-beverage-aisle-01",
  "location": "Beverage Shelf",
  "event_type": "shelf_low_candidate",
  "confidence": 0.78,
  "frame_uri": "s3://example-bucket/events/2026/06/17/store-01/frame-001.jpg",
  "detected_state": ["partial_empty_section"],
  "rule_triggered": "low_shelf_density_over_60s",
  "llm_summary": "A state where products appear sparse on the right side of the beverage shelf has continued for approximately 60 seconds.",
  "recommended_action": "Please share with the store manager as a candidate for checking restocking timing before the evening peak.",
  "human_feedback": null
}
Keeping records at this level of granularity allows you to write human review results into human_feedback later, and it can also be used as a dataset for retraining. Even without finalizing a detailed schema from the start, having timestamp, store_id, camera_id, event_type, frame_uri, llm_summary, and human_feedback in place should be sufficient.
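As a rough sketch of how that minimum field set and the later review step could be handled, the Python below checks for the seven fields listed above and writes a review result into `human_feedback`. The shape of the feedback object (`verdict`, `reviewer`) and the function names are assumptions of mine for illustration.

```python
from typing import Dict, List

# The minimum fields named in the text above.
REQUIRED_FIELDS = (
    "timestamp", "store_id", "camera_id", "event_type",
    "frame_uri", "llm_summary", "human_feedback",
)

# Hypothetical review verdicts; the article does not define these labels.
VERDICTS = ("true_positive", "false_positive")


def missing_fields(event: Dict) -> List[str]:
    """Return required keys that are absent (a null value still counts as present)."""
    return [key for key in REQUIRED_FIELDS if key not in event]


def with_feedback(event: Dict, verdict: str, reviewer: str) -> Dict:
    """Return a copy of the event with the human review result filled in."""
    if verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    updated = dict(event)
    updated["human_feedback"] = {"verdict": verdict, "reviewer": reviewer}
    return updated
```

Returning a copy instead of mutating the stored event keeps the original detector output intact, which helps if the reviewed records are later exported as a retraining dataset.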
Another important point is not to stream raw events directly to the store manager or SV review UI. If you only pass through the detector and rule judgment, scenes that are operationally fine—such as a cart approaching during restocking, waiting for alignment right after stocking, or momentary queues—will also come up as "candidates." Inserting a "curation layer" between the raw events and the review UI to bundle events, remove duplicates, and drop operationally unproblematic types using rules or a lightweight VLM will significantly change the experience on the review side. There's quite a difference in how store managers and SVs use the system when they receive 200 alerts a day versus 5 alerts a day.
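A curation layer can start out very small. The sketch below, written under my own assumptions, drops event types that are known to be operationally fine and bundles repeats from the same camera and event type that arrive within a time window. The `BENIGN_TYPES` labels and the 10-minute window are hypothetical; a lightweight VLM check could replace the type filter later.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple

# Event types treated as operationally fine (hypothetical labels).
BENIGN_TYPES = {"restocking_in_progress", "momentary_queue"}


def curate(events: Iterable[Dict], window: timedelta = timedelta(minutes=10)) -> List[Dict]:
    """Drop benign types and bundle repeats of the same camera and event type."""
    bundles: List[Dict] = []
    latest: Dict[Tuple[str, str], Dict] = {}  # most recent bundle per key
    ordered = sorted(events, key=lambda e: datetime.fromisoformat(e["timestamp"]))
    for ev in ordered:
        if ev["event_type"] in BENIGN_TYPES:
            continue
        ts = datetime.fromisoformat(ev["timestamp"])
        key = (ev["camera_id"], ev["event_type"])
        current = latest.get(key)
        if current is not None and ts - current["last_seen"] <= window:
            # Same situation still ongoing: extend the bundle instead of re-alerting.
            current["last_seen"] = ts
            current["count"] += 1
            current["confidence"] = max(current["confidence"], ev["confidence"])
        else:
            current = {
                "camera_id": ev["camera_id"],
                "event_type": ev["event_type"],
                "first_seen": ts,
                "last_seen": ts,
                "count": 1,
                "confidence": ev["confidence"],
            }
            latest[key] = current
            bundles.append(current)
    return bundles
```

The review UI would then list bundles rather than raw events, which is one way to move from hundreds of alerts a day toward a handful.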
Line Up NVIDIA Official VSS Skills as a Toolbox

When building the reasoning side of Mini-Retail FOX, there's no need to build VSS integration from scratch. NVIDIA's official NVIDIA-AI-Blueprints/video-search-and-summarization repository publishes 10 VSS Skills that comply with the agentskills.io specification. Simply distributing these Skills to the agent side lets you call VSS APIs using natural language.
Pulling back one level to organize the structure of VSS Skills, there's a two-layer structure: Skills are the "interface for the user," and the VSS Developer Profile is "how to run VSS." From the agent's perspective, Skills are the drawer handles and VSS Profile is the contents inside the drawer.
There are 10 official Skills in total, but for Mini-Retail FOX in a retail store, you can narrow it down to about 5 of them to start with.


| Skill | Mainly Combined Profile | Use Cases in Retail Stores |
| --- | --- | --- |
| `video-search` | `search` | Search for relevant scenes using natural language queries for shelf thinning, restocking, congestion, abandoned carts, etc. |
| `video-understanding` | `base` | Confirm "what is happening here" in Q&A format for candidate scenes |
| `alerts` | `alerts (verification)` / `alerts (VLM)` | Alert management for aisle blockages and safety check candidates, combining CV and VLM verification |
| `report` | Depending on use case | Generate daily/weekly store operations reports via the `/generate` endpoint |
| `rt-vlm` | `alerts (VLM)` | Caption/alert for live video, to be developed into future real-time monitoring and recording |

Because Skills and Profiles simply relate as "drawer handles and contents," the table shows typical pairings rather than fixed ones. In practice, some Skills such as report are used without choosing a Profile, and others such as alerts are combined with multiple Profiles.
The installation procedure isn't difficult either. According to the NVIDIA Developer Blog explanation, by simply sending a single natural language prompt to the agent, it creates symbolic links for the entire Skill folder into ~/.claude/skills/<name>/ or ~/.codex/skills/<name>/. For generic hosts without agent-specific paths, ~/.agents/skills/<name>/ per the agentskills.io specification is also available. Since it uses symbolic links, doing git pull on the repository side simultaneously updates the Skills for all agents.
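To show what the resulting layout amounts to, here is a small Python sketch that symlinks each Skill folder into a target skills directory. This is only an illustration of the symlink idea, not the official installation flow, which the blog describes as a natural language prompt to the agent. The `skills_root` argument is a hypothetical path to wherever the Skill folders sit in your clone, and I assume each Skill is a folder containing a `SKILL.md` file.

```python
from pathlib import Path
from typing import List


def link_skills(skills_root: Path, target_dir: Path) -> List[str]:
    """Symlink every skill folder under `skills_root` into `target_dir/<name>`.

    A folder counts as a skill here if it contains SKILL.md (assumption).
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    linked: List[str] = []
    candidates = sorted(p for p in skills_root.iterdir() if (p / "SKILL.md").is_file())
    for skill in candidates:
        link = target_dir / skill.name
        if link.is_symlink():
            link.unlink()   # refresh a stale link
        elif link.exists():
            continue        # leave real files and directories untouched
        link.symlink_to(skill.resolve(), target_is_directory=True)
        linked.append(skill.name)
    return linked
```

For a generic host this might be called as `link_skills(Path("<your clone>/<skills folder>"), Path.home() / ".agents" / "skills")`, with the first argument adjusted to the actual repository layout. Because only links are created, a later `git pull` in the clone is immediately visible through every linked location.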
Reimplementing VSS Skills from scratch is something you don't need to do at the initial stage of Mini-Retail FOX. Starting by "pulling out drawers that seem to fit store use cases" within the range of official Skills will make the PoC launch considerably easier.
Place Hermes Agent as the Natural Language Interface

For the agent side that calls VSS Skills, let's place Hermes Agent here. I've written about patterns for running Hermes Agent on the DGX Spark's OpenShell sandbox in the NemoHermes series.
https://dev.classmethod.jp/articles/dgx-spark-nemohermes-openshell-hermes-agent/
By placing Hermes Agent as the natural language interface for Mini-Retail FOX, you can build a flow where Hermes receives inquiries from store managers or SVs, selects the appropriate VSS Skill and sends it to the VSS Profile, then restructures and returns the results. By keeping it running inside the OpenShell sandbox, the network permissions needed for VSS integration, credentials, and sandbox boundary operations can be carried over directly from the configuration established in the NemoHermes series.
Hermes Agent can separate personality and roles using Profile units. Here's an example of role assignments for Mini-Retail FOX.


| Profile | Role |
| --- | --- |
| Store Ops | Receives natural language inquiries from store managers/SVs, selects and calls VSS Skills |
| Night Report | Generates daily reports in batch after business hours end, distributes to Slack and daily reports |
| Triage | Sorts and handles customer interaction and campaign-related inquiries away from store operations |

For the VSS Skills installation location, if Hermes Agent is placed inside the OpenShell sandbox, it's natural to deploy them via NemoClaw or OpenClaw inside the sandbox into ~/.openclaw/workspace/skills/ or ~/.agents/skills/. Since Hermes Agent can reference the generic path per the agentskills.io specification, Skills can be shared between Codex and Claude Code without multiplying agent-specific storage locations.
As an image of the experience, let me write just one scene where a store manager talks to Hermes Agent. In actual operation, it would likely be set up so that queries can be sent from contact points the store is accustomed to, such as Slack, in-store chat tools, or a simple web UI.
[Store Manager]
From yesterday's store footage between 15:00 and 19:00, please narrow down to 5 scenes that seem concerning from a store operations perspective.
In particular, please look at shelves where products appear sparse, the presence or absence of restocking work, congestion in front of registers, and carts blocking aisles.

[Hermes Agent]
I've extracted 5 candidate scenes.

1. 16:20 Beverage Shelf
   Part of the shelf appears to have sparse products. This is a candidate for checking restocking timing.
2. 16:37 Beverage Shelf
   A staff member appears to be approaching with a cart and performing restocking work.
3. 17:45 Checkout Area
   A waiting line of multiple people appears to have formed. This is a candidate for checking supplementary register criteria.
4. 18:10 Aisle
   A cart or work item appears to remain near the center of the aisle. This is a candidate for checking aisle clearance rules.
5. 18:30 Sales Floor
   A customer appears to be standing in front of the shelf. This is a candidate for checking product findability.

Reference frames are attached to each candidate. If there's a candidate you'd like to check, please instruct me with something like "Show me the basis for number 3."
The key point is that Hermes Agent remains in a position of "narrowing down to candidates humans should review and presenting them with evidence," rather than "making automatic judgments and returning instructions." At the initial stage of launching Mini-Retail FOX, I think it's better not to overstep this boundary in order to gain acceptance on the ground.
At this point, all the components of Mini-Retail FOX—agent, VSS Skills, and VSS Profile—have been laid out. Next, let's look at which hardware and services to actually run this on, examining three configuration options.
Configuration A: Local-Centric with DGX Spark and Hermes

The first configuration option is to place one DGX Spark in the store or a nearby server room, and run VSS's search profile and Hermes Agent on it. This is a configuration that fits well when there's a requirement to keep footage within the store or to keep everything within the internal network.
The advantage of this configuration is that you can get the feel of the NVIDIA stack while keeping store footage local. Existing DGX Spark verification assets such as Cosmos-series VLMs, Nemotron-series LLMs, VSS Skills, and Hermes Agent stack up as-is, and it becomes easier to explain to the site that "the footage remains within the store."
On the other hand, there are caveats. Increasing the number of cameras on a single DGX Spark or running long-duration footage continuously will quickly hit the ceiling on GPU memory and inference costs. At the initial stage of Mini-Retail FOX, I think it's more realistic to limit the scope to 1 store, 1–3 cameras, and 1 use case. The discussion of scaling out can wait until something visible emerges from the PoC.
Configuration B: Lightweight Detection on PC, Reasoning Delegated to Cloud

When bringing in a DGX Spark is not feasible, or when you want to keep the hardware on the store side as light as possible, a configuration that separates roles between edge and cloud is realistic. The store side handles only video capture and lightweight detection on a small PC, and passes reasoning to cloud LLMs or VLMs.
The approach is to not continuously send normal footage, but to send only representative frames or short clips from around anomaly candidates to the cloud. This makes it easier to balance bandwidth, cost, and privacy, and helps avoid the situation where cloud bills end up higher than expected during the PoC stage.
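One way to decide which short clips to send is to take a window around each anomaly candidate and merge windows that overlap, so nearby candidates produce a single upload. The Python sketch below does just that; the function name and the 5-second and 10-second margins are my own assumptions.

```python
from typing import Iterable, List, Tuple


def clip_windows(
    event_times: Iterable[float],
    pre_s: float = 5.0,    # seconds of context before each candidate (assumed)
    post_s: float = 10.0,  # seconds of context after each candidate (assumed)
) -> List[Tuple[float, float]]:
    """Return merged (start, end) windows in seconds around each event time."""
    windows: List[Tuple[float, float]] = []
    for t in sorted(event_times):
        start, end = max(0.0, t - pre_s), t + post_s
        if windows and start <= windows[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new clip.
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return windows
```

Summing the window lengths also gives a quick estimate of how many seconds of video leave the store per day, which is useful when talking through bandwidth and privacy with the site.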
Here's an image of the role division between edge and cloud sides.


| Role | Store-side PC | Cloud |
| --- | --- | --- |
| Video capture | RTSP capture, frame thinning | Essentially none |
| Lightweight judgment | Shelf-thinning rules, checkout congestion counting, dwell detection | Essentially none |
| Reasoning | Convert rule output to event JSON | VSS Skills + Bedrock / OpenAI-compatible API |
| Storage | Short-term cache | S3, DynamoDB, OpenSearch |
| Notification | Local warning light, in-store handheld notification | Slack, Teams, daily reports |

Having lightweight judgment handled on the edge side means the first level of alerts can still be issued even if the network goes down. Keeping only reasoning dependent on the cloud makes it relatively straightforward to design degraded operation during outages. Hermes Agent can be placed on either the store side or the cloud side, but if you want to keep the store-side PC resources light, placing it on the cloud side and combining it with Bedrock or an OpenAI-compatible API is easier to manage.
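The degraded-operation idea can be sketched as a small store-and-forward buffer on the store-side PC: the local alert fires first, and cloud delivery is retried later if the network is down. The class below is a minimal illustration under my own assumptions; `send_to_cloud` and `notify_local` are hypothetical callables you would supply.

```python
from collections import deque
from typing import Callable, Deque, Dict


class StoreAndForward:
    """Issue local alerts immediately and queue events for the cloud."""

    def __init__(
        self,
        send_to_cloud: Callable[[Dict], None],  # should raise OSError on network failure
        notify_local: Callable[[Dict], None],   # e.g. warning light or handheld notice
        max_pending: int = 1000,
    ) -> None:
        self._send = send_to_cloud
        self._notify_local = notify_local
        # When full, the oldest pending event is dropped to bound memory use.
        self.pending: Deque[Dict] = deque(maxlen=max_pending)

    def handle(self, event: Dict) -> None:
        self._notify_local(event)  # first-level alert does not wait for the network
        self.pending.append(event)
        self.flush()

    def flush(self) -> int:
        """Try to send queued events in order and return how many were sent."""
        sent = 0
        while self.pending:
            try:
                self._send(self.pending[0])
            except OSError:  # includes ConnectionError and TimeoutError
                break        # still offline: keep the rest for the next attempt
            self.pending.popleft()
            sent += 1
        return sent
```

Calling `flush()` on a timer is enough to drain the queue once connectivity returns, and the in-order delivery keeps the store event timeline consistent on the cloud side.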
Configuration C: AWS IoT Greengrass with Multi-Store Deployment in View

The final configuration option is for when multi-store deployment is in view from the start. Use AWS IoT Greengrass as the edge runtime, receive events via IoT Core, and combine them with Bedrock and SageMaker. Greengrass is a management platform that can run Lambda functions and containers on edge devices, and combined with IoT Core it makes it easier to operate edges safely across multiple locations. AWS official blogs have also introduced configurations that use IoT Greengrass and IoT Core to perform video analysis for industrial safety from existing CCTV and edge gateways, connecting to S3 and SageMaker.
Looking at it by role, Greengrass handles distribution and management of edge applications, IoT Core receives events, Lambda and Step Functions advance the workflow, and Bedrock and Hermes Agent + VSS Skills generate descriptions and response plans. Keeping model improvement on SageMaker makes it easier to build in retraining and incorporation of store manager review results later on.
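When events from many stores flow through a single broker, a consistent topic layout makes it easier to route them per store or per event type. The Python sketch below shows one hypothetical naming scheme of my own, plus an MQTT-style filter match with `+` for one level and a trailing `#` for the remainder. It does not call any AWS API and is only meant to help reason about the layout.

```python
from typing import Dict


def event_topic(event: Dict) -> str:
    """Build a topic from the event fields (hypothetical naming scheme)."""
    return (
        f"stores/{event['store_id']}"
        f"/cameras/{event['camera_id']}"
        f"/events/{event['event_type']}"
    )


def topic_matches(topic_filter: str, topic: str) -> bool:
    """MQTT-style match: '+' matches one level, a trailing '#' matches the rest."""
    filter_parts = topic_filter.split("/")
    topic_parts = topic.split("/")
    for i, part in enumerate(filter_parts):
        if part == "#":
            return i == len(filter_parts) - 1  # '#' is only valid as the last level
        if i >= len(topic_parts):
            return False
        if part != "+" and part != topic_parts[i]:
            return False
    return len(filter_parts) == len(topic_parts)
```

With a layout like this, a chain-wide shelf workflow could subscribe to `stores/+/cameras/+/events/shelf_low_candidate`, while a store-specific dashboard subscribes to `stores/store-01/#`.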
If multi-store deployment is on the horizon, it's easier to bring in Greengrass from the start. Even for a single-store PoC, if you expect to roll out to more stores afterward, placing the control point here means you won't be scrambling later. Conversely, if the initial PoC is fully contained to 1 store, starting with Configuration A or B and shifting toward Configuration C once rollout to more stores comes into view is also a natural progression.
Narrow Down to 1 Use Case First

For the first subject to run through Mini-Retail FOX, I recommend either "discovery of shelf-thinning candidates" or "aisle blockage detection."
Shelf-thinning candidate discovery is a subject where the explanation can be completed from video alone, event generation is straightforward, and it's easy for store managers and SVs to understand as an extension of daily work. The detection side can run on simple state judgment, and the landing point of "candidate for checking restocking timing before the evening peak" is familiar to people on the ground. False positives can also be discussed with clear examples like "a shelf that looks sparse but is actually fine right after restocking."


Generative AI material depicting a state where the right side of a dairy shelf is partially sparse
For shelf-thinning candidates, being able to confirm not just a single snapshot but the full sequence through restocking work until the shelf state recovers makes it easier to fit into the context of store operations.


Generative AI material depicting a scene where a staff member approaches with a cart to perform restocking
Aisle blockage detection is a subject that connects easily to 5S activities from a safety perspective. It can be straightforwardly implemented with rules like a cart or cardboard box remaining near the center of the aisle and the dwell time exceeding a certain threshold. The point of being contained to video footage is similar to shelf-thinning candidates.
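The dwell rule described above fits in a few lines. The Python sketch below assumes a hypothetical tracker that reports, for each object, a track ID, a timestamp, and a horizontal position normalized to 0 through 1 across the aisle. The center band and the 120-second threshold are placeholder values of mine.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class TrackSample:
    """One tracker output for a cart or box (hypothetical shape)."""
    track_id: str
    t: float  # seconds
    x: float  # 0.0 = left edge of the aisle, 1.0 = right edge


def blocked_tracks(
    samples: Iterable[TrackSample],
    center: Tuple[float, float] = (0.3, 0.7),  # assumed "near the center" band
    dwell_s: float = 120.0,                    # assumed dwell threshold
) -> List[str]:
    """Return IDs of objects that stayed in the center band for `dwell_s` seconds."""
    entered_at: Dict[str, float] = {}
    flagged: List[str] = []
    for s in sorted(samples, key=lambda sample: sample.t):
        if center[0] <= s.x <= center[1]:
            start = entered_at.setdefault(s.track_id, s.t)
            if s.t - start >= dwell_s and s.track_id not in flagged:
                flagged.append(s.track_id)
        else:
            # Leaving the band resets the dwell timer for that object.
            entered_at.pop(s.track_id, None)
    return flagged
```

Resetting the timer when an object leaves the band keeps a cart that is merely passing through, or being moved aside during restocking, from being flagged.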


Generative AI material depicting a scene where a cart is left in the center of an aisle
The scene images included here are not actual store footage but are pseudo-CCTV-style materials created with generative AI. When you want to explain the Mini-Retail FOX concept but can't prepare actual store footage right away, having this kind of pseudo-material lets you advance discussions on UX and event definition ahead of time, which can reduce rework after the PoC begins.
SOP deviations—such as "a register hasn't been closed on time" or "a cart has been left in an unexpected location for an extended time"—can strongly showcase what's distinctive about Mini-Retail FOX. However, they require organizing store-specific operational rules, making them somewhat heavy as an initial PoC. The sequence of keeping shelf-thinning candidates or aisle blockages as the main examples first, staying within a flow of "what was found, how it was described, and how the store manager or SV made a judgment" rather than automatically instructing anything, and then layering on SOP verification as a development, seems like a more readable progression to me.
Splitting into Specialized Agents Can Wait

FOX's concept involves many specialized agents, but Mini-Retail FOX is easier to operate if you don't split them up too early. Start with all processing inside 1 Store Supervisor Agent, and split only after the logs clearly show that the roles are distinct.
Even when splitting, 4–5 of the following should be sufficient at first.


| Specialized Agent | Scope |
| --- | --- |
| `shelf_agent` | State description for shelf-thinning candidates, restocking timing, and out-of-stock candidates |
| `flow_agent` | Explanation of aisle blockages, dwell detection, and traffic flow obstruction candidates |
| `checkout_agent` | Organization of checkout congestion, supplementary register candidates, and self-checkout guidance candidates |
| `report_agent` | Daily/weekly store operations report generation, distribution to Slack and daily reports |
| `learning_queue` | Collection of false positives and misses, feedback loop management for human review results |

When deciding whether to split agents, keeping store-specific operational rule differences in mind will make things easier later. Keeping rules common across the chain on the Supervisor Agent side, and making it possible to switch store-specific fine differences via Profile or configuration, will make the inventory easier when expanding to more stores later.
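The split between chain-wide rules and store-specific differences can be expressed as a simple configuration overlay. In the Python sketch below, the rule names and parameter values are hypothetical; the point is only that each store supplies the parameters it changes and inherits the rest.

```python
from copy import deepcopy
from typing import Dict

# Rules shared across the chain (hypothetical names and values).
CHAIN_RULES: Dict[str, Dict] = {
    "shelf_low": {"density_threshold": 0.4, "hold_s": 60},
    "aisle_blocked": {"dwell_s": 120},
}


def rules_for_store(chain_rules: Dict[str, Dict], store_overrides: Dict[str, Dict]) -> Dict[str, Dict]:
    """Overlay store-specific parameters on a copy of the chain-wide rules."""
    merged = deepcopy(chain_rules)  # never mutate the shared defaults
    for rule_name, params in store_overrides.items():
        merged.setdefault(rule_name, {}).update(params)
    return merged
```

For example, a store with wide aisles might pass `{"aisle_blocked": {"dwell_s": 300}}` and keep every other rule at the chain default. Listing the overrides per store also doubles as the inventory of local differences when expanding to more stores.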
Summary

Here's a comparison of the 3 configurations, DGX Spark-centric, PC and cloud split, and AWS IoT Greengrass-centric, arranged by key considerations.


| Configuration | Suited For | Initial Cost | Ease of Sending Footage | Main Highlights |
| --- | --- | --- | --- | --- |
| DGX Spark local configuration | Stores where footage is hard to send out, NVIDIA-context PoC | Medium | Manageable even in stores where it's hard to send out | Combination of local VSS and Hermes Agent |
| PC + cloud reasoning | Low-cost technical verification, keeping store hardware light | Low | Design to send only anomaly candidates | Lightweight configuration starting from 1 store, 1 camera |
| AWS IoT Greengrass configuration | PoC with multi-store deployment in view | Medium | Easy to govern on AWS side | Role division and governance between Greengrass, Bedrock, and SageMaker |

For further expansion from here, possible directions include: actually installing VSS's search profile and video-search Skill on DGX Spark and running shelf-thinning detection through 1 use case; trying multi-store operation with AWS IoT Greengrass and Bedrock; building store operations automation according to Hermes Agent Profiles; and digging into checkout congestion analysis with Cosmos Reason or Cosmos 3. I'd like to track how Physical AI is gradually spreading to sites beyond factories, alongside the manufacturing edition of Mini-FOX.
Reference Links

Thinking Through a Mini-FOX Configuration for Starting Small with the NVIDIA FOX Blueprint
Thinking About Where NVIDIA VSS + AI Agents + Skills Fit in Everyday Workplaces
Running Hermes Agent on DGX Spark's OpenShell with NemoHermes
Transform video into instantly searchable, actionable intelligence with AI Agents and Skills (NVIDIA Developer Blog)
NVIDIA-AI-Blueprints/video-search-and-summarization (GitHub)
agentskills.io specification
Improving industrial safety with video analytics, AWS IoT Core, and AWS IoT Greengrass (AWS Blog)


