
Thinking Through a Retail Store Version of NVIDIA FOX Blueprint with VSS Skills and Hermes Agent
Introduction
Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
Previously, I wrote about "Mini-FOX," a framework for starting small with NVIDIA Factory Operations Blueprint (FOX for short) — trimmed down to 1 line, 1 camera, and 1 use case.
FOX is a large-scale AI Blueprint for manufacturing, with DGX Station-class hardware as the assumed target. The idea was to scale it down to 1 line and start by combining DGX Spark, PC, Jetson, and AWS. After finishing that article, I realized this concept could apply just as naturally outside of factories.
Supermarket store operations, in particular, share similar challenges with factories in the sense that it's difficult to continuously review hours of footage every day. Low shelves, delayed restocking, crowded checkout areas, carts left in aisles — situations that are hard to see from POS or inventory data alone accumulate in the footage, and it's quite tough in practice for store managers and supervisors to review all of it.
In this article, I'll think through a framework called "Mini-Retail FOX," which applies the Mini-FOX concept to retail stores. For the reasoning core and the natural language interface, I'll use a combination of NVIDIA's official VSS Skills and Hermes Agent.
How Does FOX Look When Applied to Retail Stores?
Let's get to the main topic.
FOX is a reference design that integrates factory machine signals, quality systems, work procedures, and operational alerts, with a Factory Manager AI orchestrating specialized agents and machinery. Its structure places the "field brain" at the center, with domain-specialized agents for safety, quality, maintenance, and operations beneath it.
Mapped to a retail store, the equivalent of the Factory Manager AI would be a "Store Operations Support AI." Picture it surrounded by agents that watch for low shelves and restocking, monitor checkout congestion, watch aisles and traffic flow, and write daily reports, all assisting the judgment of store managers and supervisors.
The granularity of events handled does differ somewhat between factories and retail. In factories, the focus is on machine signals and procedure deviations, while in retail, the focus is on "changes in the state of people and shelves" read from video and sales data. On the other hand, the idea of "the field brain surveying long-form data comprehensively and narrowing it down to candidates for human review" applies naturally to both.
Can you simply transplant FOX for factories directly into a retail store? Not quite. The official FOX configuration is quite heavy, and if you try to cover all stores and all cameras at once, there's a high chance you'll stall before the PoC even gets moving. So, just like with Mini-FOX, we start by cutting things down small.
Don't Aim for the Whole Store at Once — Cut It Out as Mini-Retail FOX
Just as Mini-FOX can be called a "trial version of fully-equipped FOX," Mini-Retail FOX is positioned as a "trial version of a Store Operations Support AI." The idea is to start with 1 store, one to a few cameras, and 1 use case.
Here's a rough mapping of FOX's configuration scaled down for retail stores.
| FOX Configuration | How to Start with Mini-Retail FOX |
|---|---|
| Factory Manager Agent for entire factory | Lightweight Store Supervisor Agent for 1 store |
| Many specialized agents | Limit to 3–5 such as shelf, flow, checkout, report |
| Large-scale local inference on DGX Station | Distribute across DGX Spark, PC GPU, AWS Bedrock, EC2 GPU |
| Operational twin | Start with a store event timeline and simple dashboard |
| Automated retraining and production deployment | Keep to a retraining candidate queue with store manager/SV human review |
Personally, I think this approach is the most realistic for retail as well. The first hurdle for store-facing AI is whether it can catch just one scene that's truly causing trouble, rather than "connecting everything." If you try to go after low shelves, congestion, aisles, safety, and promotional effectiveness all in parallel from the start, the difficulty of operations and data preparation jumps dramatically.
The Minimum Configuration for Mini-Retail FOX Starts with Eventification
The minimum configuration for Mini-Retail FOX follows nearly the same flow as Mini-FOX. Rather than feeding video directly to a VLM, we thin it out at the edge and turn it into events, then pass those to LLM or VLM reasoning. Here's an overall diagram to get a feel for it.
The point I want to emphasize is that not every frame is fed to a VLM. If you pipe business-hours store footage straight into a VLM, inference cost and bandwidth run out quickly. Instead, the flow is: thin out frames at the edge, use lightweight detection and rule judgment to eventify only the frames that might be "candidates for review," and then have the LLM or VLM return a situational description, review perspective, and recommended action for those events.
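The thinning and rule-judgment steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Mini-Retail FOX implementation: `thin_frames`, `SustainedLowDensityRule`, the 0.4 threshold, and the idea that a lightweight detector reports a shelf fill "density" between 0 and 1 are all hypothetical. Only the 60-second hold is taken from the `low_shelf_density_over_60s` rule name in the event example that follows.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


def thin_frames(timestamps: Iterable[float], interval_s: float) -> Iterator[float]:
    """Keep only frames that are at least interval_s seconds apart."""
    last_kept: Optional[float] = None
    for ts in timestamps:
        if last_kept is None or ts - last_kept >= interval_s:
            last_kept = ts
            yield ts


@dataclass
class SustainedLowDensityRule:
    """Fire once when shelf density stays below a threshold for hold_s seconds."""

    threshold: float = 0.4  # hypothetical fill ratio: 0.0 (empty) to 1.0 (full)
    hold_s: float = 60.0
    below_since: Optional[float] = None
    fired: bool = False

    def update(self, ts: float, density: float) -> bool:
        if density >= self.threshold:
            # Shelf looks fine again: reset so a later drop can fire anew.
            self.below_since = None
            self.fired = False
            return False
        if self.below_since is None:
            self.below_since = ts
        if not self.fired and ts - self.below_since >= self.hold_s:
            self.fired = True
            return True
        return False
```

Firing only once per sustained drop keeps a single sparse shelf from producing a new event on every sampled frame.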
As a concrete picture of a store event in JSON, here's an example where a low-shelf candidate appeared in the beverage aisle in the evening. The idea is to organize the detector output and simple rule judgment results directly into JSON and pass them to the reasoning side.
{
  "timestamp": "2026-06-17T16:20:30+09:00",
  "store_id": "store-01",
  "camera_id": "store-01-beverage-aisle-01",
  "location": "Beverage Shelf",
  "event_type": "shelf_low_candidate",
  "confidence": 0.78,
  "frame_uri": "s3://example-bucket/events/2026/06/17/store-01/frame-001.jpg",
  "detected_state": ["partial_empty_section"],
  "rule_triggered": "low_shelf_density_over_60s",
  "llm_summary": "The right side of the beverage shelf has appeared to have few products for approximately 60 seconds.",
  "recommended_action": "Please share with the store manager as a restocking timing check candidate before the evening peak.",
  "human_feedback": null
}
Keeping things at this level of granularity means you can later write human review results into human_feedback, making it usable as a dataset for retraining. Even without deciding on a detailed schema from the start, having timestamp, store_id, camera_id, event_type, frame_uri, llm_summary, and human_feedback in place should be sufficient.
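As a rough sketch of how the minimum fields and the later review write-back might be handled, the helpers below check that an event carries the seven keys listed above and attach a review result to `human_feedback`. The helper names and the shape of the feedback object (`verdict`, `reviewer`, `note`) are assumptions for illustration; the article itself only shows `human_feedback` as `null`.

```python
import copy
import json
from typing import Any, Dict, List

# The seven fields the text names as a sufficient starting schema.
MINIMUM_KEYS = (
    "timestamp", "store_id", "camera_id", "event_type",
    "frame_uri", "llm_summary", "human_feedback",
)


def missing_keys(event: Dict[str, Any]) -> List[str]:
    """Return the minimum keys that are absent from the event."""
    return [key for key in MINIMUM_KEYS if key not in event]


def with_feedback(event: Dict[str, Any], verdict: str, reviewer: str,
                  note: str = "") -> Dict[str, Any]:
    """Return a copy of the event with a (hypothetical) review result attached."""
    updated = copy.deepcopy(event)
    updated["human_feedback"] = {"verdict": verdict, "reviewer": reviewer, "note": note}
    return updated


def to_jsonl_line(event: Dict[str, Any]) -> str:
    """Serialize one event as a single JSON line, e.g. for a retraining dataset file."""
    return json.dumps(event, ensure_ascii=False)
```

Returning a copy rather than mutating the original keeps the raw event intact, which can be convenient when the same event is reviewed more than once.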
Another important point is not to feed raw events directly to the store manager or SV review UI. If you pass them through only the detector and rule judgment, scenes that are not operationally problematic — such as a cart approaching during restocking, waiting for alignment right after stocking, or momentary queues — will also come up as "candidates." Inserting a "curation layer" before the review UI that bundles raw events, removes duplicates, and filters out operationally unproblematic types using rules or a lightweight VLM makes a significant difference to the review experience. The way a store manager or SV uses the system changes considerably when they receive 200 alerts a day versus 5.
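One possible shape for such a curation layer is sketched below. It bundles raw events from the same camera and event type that arrive within a time window, and drops event types that have been marked as operationally unproblematic. The function name, the 300-second window, the `bundled_count` field, and the suppressed-type set are all assumptions; a real layer might also use a lightweight VLM as the text suggests.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, FrozenSet, Iterable, List, Tuple

Event = Dict[str, Any]


def curate(events: Iterable[Event], window_s: float = 300.0,
           suppressed_types: FrozenSet[str] = frozenset()) -> List[Event]:
    """Bundle near-duplicate raw events and drop suppressed event types."""
    window = timedelta(seconds=window_s)
    bundles: List[Event] = []
    latest: Dict[Tuple[str, str], Tuple[Event, datetime]] = {}
    ordered = sorted(events, key=lambda e: datetime.fromisoformat(e["timestamp"]))
    for ev in ordered:
        if ev["event_type"] in suppressed_types:
            continue
        ts = datetime.fromisoformat(ev["timestamp"])
        key = (ev["camera_id"], ev["event_type"])
        entry = latest.get(key)
        if entry is not None and ts - entry[1] <= window:
            # Same camera and type shortly after the previous one: fold it in.
            entry[0]["bundled_count"] += 1
            latest[key] = (entry[0], ts)
            continue
        bundle = {**ev, "bundled_count": 1}
        bundles.append(bundle)
        latest[key] = (bundle, ts)
    return bundles
```

The window is measured from the most recent event in a bundle, so a slowly repeating condition stays in one bundle instead of generating a fresh alert every few minutes.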
Line Up NVIDIA's Official VSS Skills as a Toolbox
When building the reasoning side of Mini-Retail FOX, there's no need to build VSS integrations from scratch. NVIDIA's official NVIDIA-AI-Blueprints/video-search-and-summarization repository publishes 10 VSS Skills that comply with the agentskills.io specification. Simply distributing these Skills to the agent lets it call VSS APIs through natural language.
Stepping back to organize the structure of VSS Skills: there are two layers. Skills are the user-facing interface, and the VSS Developer Profile determines how VSS runs. From the agent's perspective, a Skill is the handle of a drawer, and the VSS Profile is the contents inside.
There are 10 official Skills in total, but for Mini-Retail FOX in a retail store, you can narrow it down to about 5 that you'd want to try first.
| Skill | Mainly Combined Profile | Use Cases in Retail Stores |
|---|---|---|
| video-search | search | Search for relevant scenes using natural language queries for low shelves, restocking, congestion, abandoned carts, etc. |
| video-understanding | base | Confirm "what is happening here" in Q&A format for candidate scenes |
| alerts | alerts (verification) / (VLM) | Alert management for aisle blockages and safety check candidates, combining CV and VLM verification |
| report | Depending on use case | Generate daily/weekly store operations reports via the /generate endpoint |
| rt-vlm | alerts (VLM) | Caption/alert for live footage, to be developed into future real-time monitoring and recording |
Skills and Profiles are just "handles and contents of a drawer," so the table shows typical combinations. In practice, there are Skills like report that are used without choosing a Profile, and Skills like alerts that are combined with multiple Profiles.
The installation procedure is also not difficult. According to the NVIDIA Developer Blog explanation, sending a single natural language prompt to the agent creates symbolic links of the entire Skill folder into ~/.claude/skills/<name>/ or ~/.codex/skills/<name>/. For general-purpose hosts without agent-specific paths, ~/.agents/skills/<name>/ per the agentskills.io specification is also provided. Since it uses symbolic links, doing git pull on the repository side simultaneously updates the Skills for all agents.
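For readers who want to see what the linking amounts to, here is a minimal manual sketch of the same idea: each Skill folder is symlinked under a target skills directory such as ~/.agents/skills/. The `link_skills` function is hypothetical and is not the installer the agent runs; where the Skill folders live inside the repository is not stated here, so that location is passed in as a parameter.

```python
from pathlib import Path
from typing import List


def link_skills(skills_src: Path, target_root: Path) -> List[str]:
    """Symlink every Skill folder under skills_src into target_root/<name>/."""
    target_root.mkdir(parents=True, exist_ok=True)
    linked: List[str] = []
    for skill_dir in sorted(p for p in skills_src.iterdir() if p.is_dir()):
        link = target_root / skill_dir.name
        if link.is_symlink() or link.exists():
            # Leave anything already installed untouched.
            continue
        link.symlink_to(skill_dir.resolve(), target_is_directory=True)
        linked.append(skill_dir.name)
    return linked
```

A call might look like `link_skills(Path("<repo>/<skills folder>"), Path.home() / ".agents" / "skills")`, with the source path filled in from the actual repository layout. Because the links point at the working tree, a later git pull is picked up without relinking.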
Re-implementing VSS Skills from scratch is something you don't need to do in the initial stages of Mini-Retail FOX. Starting by "pulling the drawers that seem relevant to retail store use cases" within the official Skills makes the PoC launch dramatically easier.
Place Hermes Agent as the Natural Language Interface
As the agent side for calling VSS Skills, let's place Hermes Agent here. I've written about Hermes Agent in the NemoHermes series, covering patterns for running it on the OpenShell sandbox on DGX Spark.
Placing Hermes Agent as the natural language interface for Mini-Retail FOX enables a flow where Hermes receives inquiries from store managers and supervisors in natural language, selects the appropriate VSS Skill, sends it to the VSS Profile, and returns the reconstructed result. Keeping it running inside the OpenShell sandbox means the network permissions needed for VSS integration, credentials, and sandbox boundary operations can all be reused directly from the configuration established in the NemoHermes series.
Hermes Agent can separate personality and role using units called Profiles. Here's an example of role assignments for Mini-Retail FOX:
| Profile | Role |
|---|---|
| Store Ops | Receives natural language inquiries from store managers/SVs and selects and calls VSS Skills |
| Night Report | Generates daily reports in batch after business hours and sends them to Slack or daily reports |
| Triage | Routes customer interaction and campaign-related inquiries separately from store operations inquiries |
For the VSS Skills installation location, if Hermes Agent is placed inside an OpenShell sandbox, the natural approach is to expand them inside the sandbox via NemoClaw or OpenClaw into ~/.openclaw/workspace/skills/ or ~/.agents/skills/. Since Hermes Agent can reference the general-purpose path per the agentskills.io specification, Skills can be shared between Codex and Claude Code without multiplying agent-specific storage locations.
To picture the experience, here's a scene of a store manager talking to Hermes Agent. In actual operation, it would likely be set up so queries can be sent from touchpoints the store side is familiar with, such as Slack, in-store chat tools, or a simple web UI.
[Store Manager]
Please narrow down to 5 scenes from yesterday's store footage from 15:00 to 19:00 that seem worth noting from a store operations standpoint.
In particular, please look at shelves where products appear sparse, presence of restocking work, congestion at checkout, and carts blocking aisles.
[Hermes Agent]
I've extracted 5 review candidates.
1. 16:20 Beverage Shelf
Part of the products appear sparse. This is a restocking timing check candidate.
2. 16:37 Beverage Shelf
A staff member appears to be approaching with a cart and performing restocking work.
3. 17:45 Checkout Area
Multiple people appear to have formed a waiting queue. This is a candidate for checking the backup register criteria.
4. 18:10 Aisle
A cart or work materials appear to remain near the center of the aisle. This is a candidate for checking aisle clearance rules.
5. 18:30 Sales Floor
A customer appears to be standing still in front of a shelf. This is a candidate for checking product discoverability.
I've attached supporting frames for each candidate. If you want to look at a particular candidate, please instruct me with something like "Show me the evidence for number 3."
The key point is that Hermes Agent remains in a position of "narrowing down to candidates for human review and presenting them with supporting evidence," rather than "making automatic judgments and issuing instructions." In the early stages of launching Mini-Retail FOX, I think staying within this boundary makes it easier to gain buy-in from those on the ground.
At this point, the components of Mini-Retail FOX — the agent, VSS Skills, and VSS Profile — have all been laid out. Next, let's look at which actual hardware and services to run this on, presented across 3 configuration options.
Configuration A: Keep It Local with DGX Spark and Hermes
The first configuration option involves placing 1 DGX Spark unit in the store or in a nearby server room, and running VSS's search profile and Hermes Agent on it. This configuration is a natural fit when there's a requirement to keep footage from leaving the store or to keep everything within an internal network.
The advantage of this configuration is that store footage stays local while you work with the NVIDIA stack end to end. Assets already verified on DGX Spark, such as Cosmos-series VLMs, Nemotron-series LLMs, VSS Skills, and Hermes Agent, carry over directly, and it becomes easier to explain to those on site that "the footage stays within the store."
On the other hand, there are caveats. If you increase the number of cameras or run long-form video continuously on a single DGX Spark, GPU memory and inference costs quickly hit a ceiling. In the early stages of Mini-Retail FOX, it's more realistic to limit things to 1 store, 1–3 cameras, and 1 use case. The scale-out discussion can wait until the PoC produces visible results.
Configuration B: Lightweight Detection on PC, Reasoning Offloaded to Cloud
If the premise of bringing in a DGX Spark is difficult, or if you want to keep the store-side hardware as light as possible, a configuration that splits roles between edge and cloud is practical. The store side handles only video capture and lightweight detection on a small PC, and reasoning is passed to cloud LLMs or VLMs.
The approach is not to continuously send normal footage, but to send only representative frames or short clips corresponding to before and after anomaly candidates to the cloud. This makes it easier to balance bandwidth, cost, and privacy, and also helps avoid the surprise of cloud bills growing larger than expected during the PoC stage.
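Sending only the frames around an anomaly candidate can be sketched with a small ring buffer on the store-side PC. The `ClipBuffer` class and its parameters are assumptions for illustration; frames could be file paths, encoded bytes, or any other object, and uploading the resulting clip is left out.

```python
from collections import deque
from typing import Deque, Generic, List, Optional, TypeVar

F = TypeVar("F")


class ClipBuffer(Generic[F]):
    """Keep the last few frames so a clip can include context before a candidate."""

    def __init__(self, pre_frames: int, post_frames: int) -> None:
        self._pre: Deque[F] = deque(maxlen=pre_frames)
        self._post_frames = post_frames
        self._clip: Optional[List[F]] = None
        self._remaining = 0

    def push(self, frame: F, is_candidate: bool) -> Optional[List[F]]:
        """Feed one frame; return a finished clip when one is complete, else None."""
        if self._clip is None:
            if not is_candidate:
                # Normal footage only passes through the short ring buffer.
                self._pre.append(frame)
                return None
            # Candidate seen: start a clip from the buffered context.
            self._clip = list(self._pre) + [frame]
            self._remaining = self._post_frames
            self._pre.clear()
        else:
            # Frames after the candidate are appended until the clip is full.
            self._clip.append(frame)
            self._remaining -= 1
        if self._remaining > 0:
            return None
        clip, self._clip = self._clip, None
        return clip
```

Normal footage never leaves the buffer, so only the returned clips would need to be uploaded. In this simple version a second candidate that arrives while a clip is still being collected is just recorded as part of that clip.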
Here's the role division between edge and cloud:
| Role | Store-Side PC | Cloud |
|---|---|---|
| Video Capture | RTSP capture, frame thinning | Essentially none |
| Lightweight Judgment | Low-shelf rules, checkout congestion counting, dwell detection | Essentially none |
| Reasoning | Eventify rule output to JSON | VSS Skills + Bedrock / OpenAI-compatible API |
| Storage | Short-term cache | S3, DynamoDB, OpenSearch |
| Notification | Local warning lights, in-store handheld notifications | Slack, Teams, daily reports |
Keeping lightweight judgment on the edge side means first-tier alerts can still be issued even if the network goes down. Structuring it so that only reasoning is cloud-dependent makes degraded operation during outages relatively straightforward to implement. Hermes Agent can work either on the store side or the cloud side, but if you want to keep the store-side PC resources light, placing it on the cloud side and combining it with Bedrock or an OpenAI-compatible API is more manageable.
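The degraded-operation idea can be sketched as a small dispatcher: the local first-tier alert always fires, while events bound for cloud reasoning wait in a backlog until the connection returns. `EdgeDispatcher`, the callback signatures, and the use of `ConnectionError` to signal an outage are assumptions for illustration, not part of any specific SDK.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict

Event = Dict[str, Any]


class EdgeDispatcher:
    """Notify locally right away; forward to the cloud when it is reachable."""

    def __init__(self, send_to_cloud: Callable[[Event], None],
                 notify_local: Callable[[Event], None],
                 max_backlog: int = 1000) -> None:
        self._send = send_to_cloud
        self._notify = notify_local
        # When full, the oldest queued event is dropped to bound memory use.
        self._backlog: Deque[Event] = deque(maxlen=max_backlog)

    @property
    def pending(self) -> int:
        return len(self._backlog)

    def handle(self, event: Event) -> None:
        self._notify(event)          # first-tier alert does not need the network
        self._backlog.append(event)  # cloud reasoning is best effort
        self.flush()

    def flush(self) -> int:
        """Try to send queued events in order; stop at the first failure."""
        sent = 0
        while self._backlog:
            try:
                self._send(self._backlog[0])
            except ConnectionError:
                break
            self._backlog.popleft()
            sent += 1
        return sent
```

Calling `flush()` periodically, for example from a timer, would drain the backlog once the network is back.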
Configuration C: AWS IoT Greengrass with Multiple Stores in Sight
The final configuration option is for cases where multi-store deployment is in scope from the beginning. This uses AWS IoT Greengrass as the edge runtime, receives events via IoT Core, and combines with Bedrock and SageMaker. Greengrass is a management platform that can run Lambda and containers on edge devices, and when combined with IoT Core, it becomes easier to operate multiple edge locations securely. AWS's official blog also introduces a configuration using IoT Greengrass and IoT Core to perform video analysis for industrial safety from existing CCTV and edge gateways, connected to S3 and SageMaker.
In terms of roles: Greengrass handles distribution and management of edge applications, IoT Core receives events, Lambda and Step Functions advance the workflow, and Bedrock combined with Hermes Agent + VSS Skills generates descriptions and response suggestions. Keeping model improvements in SageMaker makes it easier to later build in retraining and the incorporation of store manager review results.
If multi-store deployment is a given premise, it's easier to include Greengrass from the start. Even for a single-store PoC, if you have the subsequent horizontal expansion in view, placing the control point here will save you from scrambling later. Conversely, if the initial PoC is self-contained within 1 store, it's also natural to start with Configuration A or B and then migrate toward Configuration C once horizontal expansion comes into view.
Narrow Down the First Use Case to One
The use case I recommend for a first run of Mini-Retail FOX is either "discovering low-shelf candidates" or "aisle blockage detection."
Discovering low-shelf candidates is a subject where the explanation can be completed using footage alone, eventification is straightforward, and it's easy for store managers and supervisors to understand as an extension of their daily work. The detection side can also run with simple state judgment, and the landing point of "restocking timing check candidate before the evening peak" is something the site is familiar with. False detections can also be discussed with clear examples such as "a shelf that looks sparse but is actually fine right after restocking."

Image: AI-generated material showing a dairy shelf whose right side is partially sparse
Low-shelf candidates become even more aligned with store operations context when you can confirm not just a single snapshot but the whole sequence through to when restocking work is completed and the shelf state recovers.

Image: AI-generated material of a staff member approaching with a cart and performing restocking work
Aisle blockage detection is a subject easily connected to 5S activities from a safety perspective. It can be built straightforwardly around rules such as a cart or cardboard remaining near the center of the aisle, or dwell time exceeding a certain threshold. The fact that it's self-contained using footage is also similar to low-shelf candidates.
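A dwell-time rule of this kind can be sketched as follows. It assumes an upstream detector or tracker already supplies stable IDs for carts or boxes inside a predefined aisle zone; the `DwellTracker` class, the ID strings, and the 120-second threshold are hypothetical.

```python
from typing import Dict, Iterable, List, Set


class DwellTracker:
    """Flag tracked objects that stay inside an aisle zone for too long."""

    def __init__(self, threshold_s: float = 120.0) -> None:
        self.threshold_s = threshold_s
        self._first_seen: Dict[str, float] = {}
        self._alerted: Set[str] = set()

    def update(self, ts: float, ids_in_zone: Iterable[str]) -> List[str]:
        """Return IDs that crossed the dwell threshold at this timestamp."""
        current = set(ids_in_zone)
        # Forget objects that left the zone so a later return starts a new dwell.
        for gone in set(self._first_seen) - current:
            del self._first_seen[gone]
            self._alerted.discard(gone)
        newly_flagged: List[str] = []
        for obj_id in sorted(current):
            start = self._first_seen.setdefault(obj_id, ts)
            if obj_id not in self._alerted and ts - start >= self.threshold_s:
                self._alerted.add(obj_id)
                newly_flagged.append(obj_id)
        return newly_flagged
```

Each object is flagged at most once per stay, which keeps a cart that sits in the aisle for ten minutes from raising a new candidate on every frame.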

Image: AI-generated material of a cart left in the center of the aisle
The scene images included here are not actual store footage but pseudo CCTV-style materials created with generative AI. When you want to explain the Mini-Retail FOX concept, being able to advance discussions on UX and event definitions using such pseudo materials even before real store footage is readily available can reduce rework after the PoC begins.
SOP deviations — such as "a register closed within the time window" or "a cart remaining in an unexpected location for an extended time" — can strongly express the Mini-Retail FOX character. However, they require organizing store-specific operational rules, making them somewhat heavy for an initial PoC. I think the more readable sequence is to first use low-shelf candidates and aisle blockages as the main examples, keep the flow limited to "what was found, how it was described, and how the store manager or SV judged it" before getting to automatic instructions, and then layer in SOP matching as a development from there.
No Need to Rush Specialized Agent Splitting
FOX's vision involves many specialized agents, but it's easier to operate without splitting too quickly right after starting Mini-Retail FOX. Start with the processing inside a single Store Supervisor Agent, and split once you can clearly see from reviewing logs that roles are naturally separating.
Even when splitting, starting with about 4–5 agents like the following should be sufficient:
| Specialized Agent | Scope |
|---|---|
| shelf_agent | Status descriptions for low-shelf candidates, restocking timing, and out-of-stock candidates |
| flow_agent | Descriptions of aisle blockages, dwell detection, and traffic flow obstruction candidates |
| checkout_agent | Organization of checkout congestion, backup register candidates, and self-checkout guidance candidates |
| report_agent | Daily/weekly store operations report generation and delivery to Slack or daily reports |
| learning_queue | Collection of false detections and misses, management of feedback loop from human review results |
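If the split does happen, routing could start as simply as a prefix lookup on `event_type`, with everything unrecognized still going to the single supervisor. The prefixes other than `shelf_` (which matches the `shelf_low_candidate` example earlier) and the `store_supervisor` default name are assumptions for illustration.

```python
from typing import Dict

# Hypothetical event_type prefixes mapped to the agent names in the table above.
AGENT_BY_PREFIX: Dict[str, str] = {
    "shelf_": "shelf_agent",
    "aisle_": "flow_agent",
    "dwell_": "flow_agent",
    "checkout_": "checkout_agent",
}


def route_event(event_type: str, default: str = "store_supervisor") -> str:
    """Pick the specialized agent for an event, falling back to the supervisor."""
    for prefix, agent in AGENT_BY_PREFIX.items():
        if event_type.startswith(prefix):
            return agent
    return default
```

Keeping the fallback means the system behaves like the unsplit Store Supervisor Agent for any event type that has not yet earned its own agent.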
Keeping operational rule differences between stores in mind as a decision axis for splitting agents will make things easier later. Keeping rules common across the chain on the Supervisor Agent side, and making store-specific differences switchable via Profiles or configuration, makes it easier to take stock when expanding to additional stores later.
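One lightweight way to keep chain-wide rules common while allowing per-store differences is a defaults-plus-overrides lookup, sketched below. The rule names and values are made up for illustration, and the merge is shallow on purpose to keep the example small.

```python
from typing import Any, Dict, Mapping


def resolve_rules(chain_defaults: Mapping[str, Any],
                  store_overrides: Mapping[str, Mapping[str, Any]],
                  store_id: str) -> Dict[str, Any]:
    """Start from chain-wide rules and apply one store's overrides on top."""
    resolved = dict(chain_defaults)
    resolved.update(store_overrides.get(store_id, {}))
    return resolved


# Hypothetical values: chain-wide thresholds and a single store-specific tweak.
CHAIN_DEFAULTS = {"low_shelf_hold_s": 60, "aisle_dwell_s": 120}
STORE_OVERRIDES = {"store-02": {"aisle_dwell_s": 300}}
```

With this shape, adding a store means adding at most one small override entry, and a store with no entry simply inherits the chain defaults.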
Summary
Here's a comparison of the 3 configurations — DGX Spark-centric, PC and cloud split, and AWS IoT Greengrass-centric — by key aspects:
| Configuration | Suitable Situations | Initial Cost | Ease of Sharing Footage | Main Highlights |
|---|---|---|---|---|
| DGX Spark Local | Stores where footage is hard to send outside, NVIDIA-context PoC | Medium | Manageable even for stores where sharing is difficult | Combination of local VSS and Hermes Agent |
| PC + Cloud Reasoning | Low-cost technical validation, want to keep store hardware light | Low | Design to only send anomaly candidates | Lightweight configuration starting from 1 store, 1 camera |
| AWS IoT Greengrass | PoC with multi-store deployment in view | Medium | Easy to govern on the AWS side | Role division and governance of Greengrass, Bedrock, and SageMaker |
For further expansion from here, possibilities include: actually installing VSS's search profile and video-search Skill on DGX Spark and running through 1 use case of low-shelf detection; trying multi-store operation with AWS IoT Greengrass and Bedrock; building store operations automation according to Hermes Agent Profile; and digging into checkout congestion analysis with Cosmos Reason or Cosmos 3. I hope to keep tracking the trend of Physical AI gradually spreading beyond factories to other real-world settings, alongside the manufacturing edition of Mini-FOX.
Reference Links
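- Thinking through a Mini-FOX configuration for starting small with NVIDIA FOX Blueprint (in Japanese)
- Thinking about where NVIDIA VSS + AI Agents + Skills can be used in familiar, everyday workplaces (in Japanese)
- Running Hermes Agent on OpenShell on DGX Spark with NemoHermes (in Japanese)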
- Transform video into instantly searchable, actionable intelligence with AI Agents and Skills (NVIDIA Developer Blog)
- NVIDIA-AI-Blueprints/video-search-and-summarization (GitHub)
- agentskills.io specification
- Improving industrial safety with video analytics, AWS IoT Core, and AWS IoT Greengrass (AWS Blog)
