Thinking Through a Retail Store Version of the NVIDIA FOX Blueprint with VSS Skills and Hermes Agent

2026.06.18

Introduction

Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.

Previously, I wrote about "Mini-FOX," a framework for starting small with NVIDIA Factory Operations Blueprint (FOX for short) — trimmed down to 1 line, 1 camera, and 1 use case.

https://dev.classmethod.jp/articles/nvidia-fox-blueprint-mini-fox-structure/

FOX is a large-scale AI Blueprint for manufacturing, with DGX Station-class hardware as the assumed target. The idea was to scale it down to 1 line and start by combining DGX Spark, PC, Jetson, and AWS. After finishing that article, I realized this concept could apply just as naturally outside of factories.

Supermarket store operations, in particular, share similar challenges with factories in the sense that it's difficult to continuously review hours of footage every day. Low shelves, delayed restocking, crowded checkout areas, carts left in aisles — situations that are hard to see from POS or inventory data alone accumulate in the footage, and it's quite tough in practice for store managers and supervisors to review all of it.

In this article, I'll think through a framework called "Mini-Retail FOX," which applies the Mini-FOX concept to retail stores. At the core of reasoning and the natural language interface, I'll place a combination of NVIDIA's official VSS Skills and Hermes Agent.

How Does FOX Look When Applied to Retail Stores?

Here's the main topic.

FOX is a reference design that integrates factory machine signals, quality systems, work procedures, and operational alerts, with a Factory Manager AI orchestrating specialized agents and machinery. Its structure puts a "field brain" at the center, with domain-specialized agents for safety, quality, maintenance, and operations beneath it.

Mapped onto a retail store, the equivalent of the Factory Manager AI would be a "Store Operations Support AI." Picture it surrounded by agents that watch for low shelves and restocking, monitor checkout congestion, watch aisles and traffic flow, and write daily reports, all assisting the judgment of store managers and supervisors.

The granularity of events handled does differ somewhat between factories and retail. In factories, the focus is on machine signals and procedure deviations, while in retail, the focus is on "changes in the state of people and shelves" read from video and sales data. On the other hand, the idea of "the field brain surveying long-form data comprehensively and narrowing it down to candidates for human review" applies naturally to both.

Can you simply transplant FOX for factories directly into a retail store? Not quite. The official FOX configuration is quite heavy, and if you try to cover all stores and all cameras at once, there's a high chance you'll stall before the PoC even gets moving. So, just like with Mini-FOX, we start by cutting things down small.

Don't Aim for the Whole Store at Once — Cut It Out as Mini-Retail FOX

Just as Mini-FOX was framed as a "trial version of fully-equipped FOX," Mini-Retail FOX is framed as a "trial version of a Store Operations Support AI." It starts with 1 store, one to a few cameras, and 1 use case.

Here's a rough mapping of FOX's configuration scaled down for retail stores.

| FOX Configuration | How to Start with Mini-Retail FOX |
| --- | --- |
| Factory Manager Agent for entire factory | Lightweight Store Supervisor Agent for 1 store |
| Many specialized agents | Limit to 3–5 such as shelf, flow, checkout, report |
| Large-scale local inference on DGX Station | Distribute across DGX Spark, PC GPU, AWS Bedrock, EC2 GPU |
| Operational twin | Start with a store event timeline and simple dashboard |
| Automated retraining and production deployment | Keep to a retraining candidate queue with store manager/SV human review |

Personally, I think this approach is the most realistic for retail as well. The first hurdle for store-facing AI is whether it can catch just one scene that's truly causing trouble, rather than "connecting everything." If you try to go after low shelves, congestion, aisles, safety, and promotional effectiveness all in parallel from the start, the difficulty of operations and data preparation jumps dramatically.

The Minimum Configuration for Mini-Retail FOX Starts with Eventification

The minimum configuration for Mini-Retail FOX follows nearly the same flow as Mini-FOX. Rather than feeding video directly to a VLM, we thin it out at the edge and turn it into events, then pass those to LLM or VLM reasoning. Here's an overall diagram to get a feel for it.

What I want to emphasize here is that not every frame is fed to a VLM. If you pipe store footage from business hours directly into a VLM, inference cost and bandwidth are quickly strained. Instead, the flow is: thin out frames at the edge, use lightweight detection and rule judgment to eventify only frames that might be "candidates for review," and then have the LLM or VLM return a situational description, review perspective, and recommended action for those events.

As a concrete example of a store event JSON, here's a case where a low-shelf candidate appeared in the beverage aisle in the evening. The idea is to organize the detector output and simple rule judgment results directly into JSON and pass them to the reasoning side.
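The thin-then-eventify flow can be sketched roughly as follows. This is a minimal illustration, not an implementation from FOX or VSS: the `Sample` type, the `shelf_density` score, and the 0.4 threshold are assumptions made up for the example, and a real edge pipeline would get its scores from an actual detector.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Sample:
    ts: float             # seconds since the start of the stream
    shelf_density: float  # 0.0 (empty) to 1.0 (full); hypothetical detector score


def thin(samples: Iterable[Sample], every_s: float) -> Iterator[Sample]:
    """Frame thinning: keep at most one sample per `every_s` seconds."""
    last_kept: Optional[float] = None
    for s in samples:
        if last_kept is None or s.ts - last_kept >= every_s:
            last_kept = s.ts
            yield s


def low_shelf_events(
    samples: Iterable[Sample],
    density_below: float = 0.4,  # assumed threshold
    hold_s: float = 60.0,
) -> Iterator[dict]:
    """Emit one event each time density stays below the threshold for hold_s."""
    low_since: Optional[float] = None
    fired = False
    for s in samples:
        if s.shelf_density >= density_below:
            # Shelf recovered: reset so a later low period can fire again.
            low_since, fired = None, False
            continue
        if low_since is None:
            low_since = s.ts
        if not fired and s.ts - low_since >= hold_s:
            fired = True
            yield {
                "timestamp_s": s.ts,
                "event_type": "shelf_low_candidate",
                "rule_triggered": "low_shelf_density_over_60s",
            }
```

Only the events yielded here would be handed to the LLM or VLM; every other frame stays at the edge.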

{
  "timestamp": "2026-06-17T16:20:30+09:00",
  "store_id": "store-01",
  "camera_id": "store-01-beverage-aisle-01",
  "location": "Beverage Shelf",
  "event_type": "shelf_low_candidate",
  "confidence": 0.78,
  "frame_uri": "s3://example-bucket/events/2026/06/17/store-01/frame-001.jpg",
  "detected_state": ["partial_empty_section"],
  "rule_triggered": "low_shelf_density_over_60s",
  "llm_summary": "The right side of the beverage shelf has appeared to have few products for approximately 60 seconds.",
  "recommended_action": "Please share with the store manager as a restocking timing check candidate before the evening peak.",
  "human_feedback": null
}

Keeping things at this level of granularity means you can later write human review results into human_feedback, making it usable as a dataset for retraining. Even without deciding on a detailed schema from the start, having timestamp, store_id, camera_id, event_type, frame_uri, llm_summary, and human_feedback in place should be sufficient.
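As a rough sketch of that minimal schema, the seven fields can be held in a small dataclass with human_feedback filled in later. The shape of the feedback (`reviewer`, `label`, `note`) and the helper names are assumptions for illustration only.

```python
import json
from dataclasses import asdict, dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class StoreEvent:
    timestamp: str
    store_id: str
    camera_id: str
    event_type: str
    frame_uri: str
    llm_summary: Optional[str] = None
    human_feedback: Optional[dict] = None  # stays None until a person reviews it


def with_feedback(event: StoreEvent, reviewer: str, label: str, note: str = "") -> StoreEvent:
    """Return a copy of the event carrying the review result (hypothetical shape)."""
    return replace(event, human_feedback={"reviewer": reviewer, "label": label, "note": note})


def to_training_rows(events: List[StoreEvent]) -> List[dict]:
    """Only reviewed events become retraining candidates."""
    return [
        {"frame_uri": e.frame_uri, "event_type": e.event_type, "label": e.human_feedback["label"]}
        for e in events
        if e.human_feedback is not None
    ]


def to_json(event: StoreEvent) -> str:
    """Serialize an event in the same flat form as the JSON example."""
    return json.dumps(asdict(event), ensure_ascii=False)
```

Keeping unreviewed events out of `to_training_rows` is the design choice that matters here: the retraining queue only ever contains labels a store manager or SV actually gave.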

Another important point is not to feed raw events directly to the store manager or SV review UI. If you pass them through only the detector and rule judgment, scenes that are not operationally problematic — such as a cart approaching during restocking, waiting for alignment right after stocking, or momentary queues — will also come up as "candidates." Inserting a "curation layer" before the review UI that bundles raw events, removes duplicates, and filters out operationally unproblematic types using rules or a lightweight VLM makes a significant difference to the review experience. The way a store manager or SV uses the system changes considerably when they receive 200 alerts a day versus 5.
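A curation layer of this kind might look like the sketch below. It assumes events shaped like the JSON example above; the 300-second bundling window, the 0.6 confidence floor, and the `restocking_in_progress` state label are hypothetical values chosen only to show the mechanism.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Hypothetical state labels treated as operationally unproblematic.
IGNORED_STATES = {"restocking_in_progress"}


def curate(events: List[dict], window_s: int = 300, min_confidence: float = 0.6) -> List[dict]:
    """Bundle near-duplicate raw events and drop harmless or weak ones."""
    curated: List[dict] = []
    open_bundles: Dict[Tuple[str, str], dict] = {}
    ordered = sorted(events, key=lambda e: datetime.fromisoformat(e["timestamp"]))
    for ev in ordered:
        if ev.get("confidence", 0.0) < min_confidence:
            continue  # too weak to show to a reviewer
        if IGNORED_STATES & set(ev.get("detected_state", [])):
            continue  # known-harmless situation
        ts = datetime.fromisoformat(ev["timestamp"])
        key = (ev["camera_id"], ev["event_type"])
        bundle = open_bundles.get(key)
        if bundle is not None and ts - bundle["_last_ts"] <= timedelta(seconds=window_s):
            # Same camera and type shortly after the previous one: fold it in.
            bundle["bundled_count"] += 1
            bundle["_last_ts"] = ts
            continue
        bundle = {**ev, "bundled_count": 1, "_last_ts": ts}
        open_bundles[key] = bundle
        curated.append(bundle)
    for bundle in curated:
        bundle.pop("_last_ts")  # internal bookkeeping, not part of the output
    return curated
```

The review UI would then read from the output of `curate` rather than from the raw event stream.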

Line Up NVIDIA's Official VSS Skills as a Toolbox

When building the reasoning side of Mini-Retail FOX, there's no need to build VSS integrations from scratch. Under the NVIDIA official NVIDIA-AI-Blueprints/video-search-and-summarization repository, 10 VSS Skills compliant with the agentskills.io specification are published. Simply distributing these Skills to the agent side enables calling VSS APIs through natural language.

Stepping back to organize the structure of VSS Skills: Skills are the user-facing interface, and the VSS Developer Profile determines how VSS runs, giving a two-layer structure. From the agent's perspective, a Skill is the handle of a drawer, and the VSS Profile is the contents inside.

There are 10 official Skills in total, but for Mini-Retail FOX in a retail store, you can narrow it down to about 5 that you'd want to try first.

| Skill | Mainly Combined Profile | Use Cases in Retail Stores |
| --- | --- | --- |
| video-search | search | Search for relevant scenes using natural language queries for low shelves, restocking, congestion, abandoned carts, etc. |
| video-understanding | base | Confirm "what is happening here" in Q&A format for candidate scenes |
| alerts | alerts (verification) / (VLM) | Alert management for aisle blockages and safety check candidates, combining CV and VLM verification |
| report | Depending on use case | Generate daily/weekly store operations reports via the /generate endpoint |
| rt-vlm | alerts (VLM) | Caption/alert for live footage, to be developed into future real-time monitoring and recording |

Skills and Profiles are just "handles and contents of a drawer," so the table shows typical combinations. In practice, there are Skills like report that are used without choosing a Profile, and Skills like alerts that are combined with multiple Profiles.

The installation procedure is also not difficult. According to the NVIDIA Developer Blog explanation, sending a single natural language prompt to the agent creates symbolic links of the entire Skill folder into ~/.claude/skills/<name>/ or ~/.codex/skills/<name>/. For general-purpose hosts without agent-specific paths, ~/.agents/skills/<name>/ per the agentskills.io specification is also provided. Since it uses symbolic links, doing git pull on the repository side simultaneously updates the Skills for all agents.
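To make the symbolic-link mechanism concrete, here is a minimal sketch of what such an install step amounts to. It is not the script NVIDIA ships: `link_skills` is a hypothetical helper, the location of the Skill folders inside the cloned repository is left as a parameter, and it assumes each Skill folder carries a SKILL.md as in the Agent Skills format.

```python
from pathlib import Path
from typing import List


def link_skills(skills_src: Path, target_root: Path) -> List[Path]:
    """Symlink every skill folder under skills_src into target_root/<name>."""
    target_root.mkdir(parents=True, exist_ok=True)
    linked: List[Path] = []
    for skill_dir in sorted(p for p in skills_src.iterdir() if p.is_dir()):
        if not (skill_dir / "SKILL.md").is_file():
            continue  # assumption: a folder without SKILL.md is not a skill
        link = target_root / skill_dir.name
        if link.is_symlink():
            link.unlink()  # refresh a link left by a previous run
        elif link.exists():
            raise FileExistsError(f"{link} exists and is not a symlink")
        link.symlink_to(skill_dir.resolve(), target_is_directory=True)
        linked.append(link)
    return linked
```

Calling it with `target_root=Path.home() / ".agents" / "skills"` would populate the general-purpose path mentioned above. Because only links are created, the files themselves still live in the cloned repository, which is why a git pull there reaches every agent at once.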

Re-implementing VSS Skills from scratch is something you don't need to do in the initial stages of Mini-Retail FOX. Starting by "pulling the drawers that seem relevant to retail store use cases" within the official Skills makes the PoC launch dramatically easier.

Place Hermes Agent as the Natural Language Interface

As the agent side for calling VSS Skills, let's place Hermes Agent here. I've written about Hermes Agent in the NemoHermes series, covering patterns for running it on the OpenShell sandbox on DGX Spark.

https://dev.classmethod.jp/articles/dgx-spark-nemohermes-openshell-hermes-agent/

Placing Hermes Agent as the natural language interface for Mini-Retail FOX enables a flow where Hermes receives inquiries from store managers and supervisors in natural language, selects the appropriate VSS Skill, sends it to the VSS Profile, and returns the reconstructed result. Keeping it running inside the OpenShell sandbox means the network permissions needed for VSS integration, credentials, and sandbox boundary operations can all be reused directly from the configuration established in the NemoHermes series.

Hermes Agent can separate personality and role using units called Profiles. Here's an example of role assignments for Mini-Retail FOX:

| Profile | Role |
| --- | --- |
| Store Ops | Receives natural language inquiries from store managers/SVs and selects and calls VSS Skills |
| Night Report | Generates daily reports in batch after business hours and sends them to Slack or daily reports |
| Triage | Routes customer interaction and campaign-related inquiries separately from store operations inquiries |

For the VSS Skills installation location, if Hermes Agent is placed inside an OpenShell sandbox, the natural approach is to expand them inside the sandbox via NemoClaw or OpenClaw into ~/.openclaw/workspace/skills/ or ~/.agents/skills/. Since Hermes Agent can reference the general-purpose path per the agentskills.io specification, Skills can be shared between Codex and Claude Code without multiplying agent-specific storage locations.

To give a feel for the experience, here's a scene of a store manager talking to Hermes Agent. In actual operation, queries would likely be sent from touchpoints the store is already familiar with, such as Slack, in-store chat tools, or a simple web UI.

[Store Manager]
Please narrow down to 5 scenes from yesterday's store footage from 15:00 to 19:00 that seem worth noting from a store operations standpoint.
In particular, please look at shelves where products appear sparse, presence of restocking work, congestion at checkout, and carts blocking aisles.

[Hermes Agent]
I've extracted 5 review candidates.

1. 16:20 Beverage Shelf
   Part of the products appear sparse. This is a restocking timing check candidate.
2. 16:37 Beverage Shelf
   A staff member appears to be approaching with a cart and performing restocking work.
3. 17:45 Checkout Area
   Multiple people appear to have formed a waiting queue. This is a candidate for checking the backup register criteria.
4. 18:10 Aisle
   A cart or work materials appear to remain near the center of the aisle. This is a candidate for checking aisle clearance rules.
5. 18:30 Sales Floor
   A customer appears to be standing still in front of a shelf. This is a candidate for checking product discoverability.

I've attached supporting frames for each candidate. If you want to look at a particular candidate, please instruct me with something like "Show me the evidence for number 3."

The key point is that Hermes Agent remains in a position of "narrowing down to candidates for human review and presenting them with supporting evidence," rather than "making automatic judgments and issuing instructions." In the early stages of launching Mini-Retail FOX, I think staying within this boundary makes it easier to gain buy-in from those on the ground.

At this point, the components of Mini-Retail FOX — the agent, VSS Skills, and VSS Profile — have all been laid out. Next, let's look at which actual hardware and services to run this on, presented across 3 configuration options.

Configuration A: Keep It Local with DGX Spark and Hermes

The first configuration option involves placing 1 DGX Spark unit in the store or in a nearby server room, and running VSS's search profile and Hermes Agent on it. This configuration is a natural fit when there's a requirement to keep footage from leaving the store or to keep everything within an internal network.

The advantage of this configuration is that store footage stays local while you still get hands-on experience with the NVIDIA stack. Existing DGX Spark verification assets such as Cosmos-series VLMs, Nemotron-series LLMs, VSS Skills, and Hermes Agent can be reused directly, and it becomes easier to explain to those on site that "the footage stays within the store."

On the other hand, there are caveats. If you increase the number of cameras or run long-form video continuously on a single DGX Spark, GPU memory and inference costs quickly hit a ceiling. In the early stages of Mini-Retail FOX, it's more realistic to limit things to 1 store, 1–3 cameras, and 1 use case. The scale-out discussion can wait until the PoC produces visible results.

Configuration B: Lightweight Detection on PC, Reasoning Offloaded to Cloud

If the premise of bringing in a DGX Spark is difficult, or if you want to keep the store-side hardware as light as possible, a configuration that splits roles between edge and cloud is practical. The store side handles only video capture and lightweight detection on a small PC, and reasoning is passed to cloud LLMs or VLMs.

The approach is not to continuously send normal footage, but to send only representative frames or short clips corresponding to before and after anomaly candidates to the cloud. This makes it easier to balance bandwidth, cost, and privacy, and also helps avoid the surprise of cloud bills growing larger than expected during the PoC stage.
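The "only the clips around candidates" idea can be sketched as interval arithmetic. The 5-second lead-in and 10-second tail are assumed values, and the function names are hypothetical; the point is that overlapping windows are merged so the same seconds are never uploaded twice.

```python
from typing import Iterable, List, Tuple


def clip_windows(
    event_times_s: Iterable[float],
    pre_s: float = 5.0,    # assumed lead-in before each candidate
    post_s: float = 10.0,  # assumed tail after each candidate
) -> List[Tuple[float, float]]:
    """Return merged (start, end) windows covering each candidate event."""
    windows: List[List[float]] = []
    for t in sorted(event_times_s):
        start, end = max(0.0, t - pre_s), t + post_s
        if windows and start <= windows[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new one.
            windows[-1][1] = max(windows[-1][1], end)
        else:
            windows.append([start, end])
    return [(s, e) for s, e in windows]


def upload_ratio(windows: List[Tuple[float, float]], total_s: float) -> float:
    """Fraction of the recording that would actually leave the store."""
    return sum(end - start for start, end in windows) / total_s
```

Tracking `upload_ratio` per day is one simple way to keep an eye on bandwidth and cloud cost during the PoC.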

Here's the role division between edge and cloud:

| Role | Store-Side PC | Cloud |
| --- | --- | --- |
| Video Capture | RTSP capture, frame thinning | Essentially none |
| Lightweight Judgment | Low-shelf rules, checkout congestion counting, dwell detection | Essentially none |
| Reasoning | Eventify rule output to JSON | VSS Skills + Bedrock / OpenAI-compatible API |
| Storage | Short-term cache | S3, DynamoDB, OpenSearch |
| Notification | Local warning lights, in-store handheld notifications | Slack, Teams, daily reports |

Keeping lightweight judgment on the edge side means first-tier alerts can still be issued even if the network goes down. Structuring it so that only reasoning is cloud-dependent makes degraded operation during outages relatively straightforward to implement. Hermes Agent can work either on the store side or the cloud side, but if you want to keep the store-side PC resources light, placing it on the cloud side and combining it with Bedrock or an OpenAI-compatible API is more manageable.
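Degraded operation of this kind is essentially store-and-forward. The sketch below assumes a `send_to_cloud` callable that raises `ConnectionError` while the network is down; both that callable and the `EdgeNotifier` class are hypothetical stand-ins, not part of any of the products mentioned here.

```python
from collections import deque
from typing import Callable


class EdgeNotifier:
    """Raise local alerts immediately and forward events to the cloud when possible."""

    def __init__(
        self,
        send_to_cloud: Callable[[dict], None],  # assumed to raise ConnectionError offline
        local_alert: Callable[[dict], None],
        max_pending: int = 1000,
    ) -> None:
        self._send = send_to_cloud
        self._local_alert = local_alert
        # Bounded queue: when full, the oldest pending event is dropped first.
        self._pending = deque(maxlen=max_pending)

    def handle(self, event: dict) -> None:
        self._local_alert(event)  # first-tier alert never waits for the network
        self._pending.append(event)
        self.flush()

    def flush(self) -> int:
        """Try to send pending events in order; return how many were sent."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # still offline: keep the rest for the next attempt
            self._pending.popleft()
            sent += 1
        return sent

    @property
    def pending(self) -> int:
        return len(self._pending)
```

Only the cloud reasoning is delayed by an outage; the warning light or handheld notification fires either way.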

Configuration C: AWS IoT Greengrass with Multiple Stores in Sight

The final configuration option is for cases where multi-store deployment is in scope from the beginning. This uses AWS IoT Greengrass as the edge runtime, receives events via IoT Core, and combines with Bedrock and SageMaker. Greengrass is a management platform that can run Lambda and containers on edge devices, and when combined with IoT Core, it becomes easier to operate multiple edge locations securely. AWS's official blog also introduces a configuration using IoT Greengrass and IoT Core to perform video analysis for industrial safety from existing CCTV and edge gateways, connected to S3 and SageMaker.

In terms of roles: Greengrass handles distribution and management of edge applications, IoT Core receives events, Lambda and Step Functions advance the workflow, and Bedrock combined with Hermes Agent + VSS Skills generates descriptions and response suggestions. Keeping model improvements in SageMaker makes it easier to later build in retraining and the incorporation of store manager review results.
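On the event-receiving side, one thing worth deciding early is the MQTT topic layout, since it determines how cloud-side rules can select events per store or per event type. The sketch below is plain Python and calls no AWS SDK; the `retail/<store_id>/<camera_id>/<event_type>` layout is a hypothetical convention, and `topic_matches` is a simplified rendering of standard MQTT wildcard matching (`+` for one level, `#` for the remainder).

```python
import json


def event_topic(event: dict, prefix: str = "retail") -> str:
    """Build a per-store topic (hypothetical layout) for a store event."""
    return f"{prefix}/{event['store_id']}/{event['camera_id']}/{event['event_type']}"


def event_payload(event: dict) -> bytes:
    """Serialize the event; frames travel by reference via frame_uri, not inline."""
    return json.dumps(event, ensure_ascii=False).encode("utf-8")


def topic_matches(topic_filter: str, topic: str) -> bool:
    """Simplified MQTT topic-filter matching with '+' and '#' wildcards."""
    f_parts, t_parts = topic_filter.split("/"), topic.split("/")
    for i, f in enumerate(f_parts):
        if f == "#":
            return i == len(f_parts) - 1  # '#' is only valid as the last level
        if i >= len(t_parts):
            return False
        if f != "+" and f != t_parts[i]:
            return False
    return len(f_parts) == len(t_parts)
```

With this layout, a filter like `retail/+/+/shelf_low_candidate` would pick up low-shelf candidates from every store, while `retail/store-01/#` would pick up everything from a single store.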

If multi-store deployment is a given premise, it's easier to include Greengrass from the start. Even for a single-store PoC, if you have the subsequent horizontal expansion in view, placing the control point here will save you from scrambling later. Conversely, if the initial PoC is self-contained within 1 store, it's also natural to start with Configuration A or B and then migrate toward Configuration C once horizontal expansion comes into view.

Narrow Down the First Use Case to One

The recommended first use case for running Mini-Retail FOX is either "discovering low-shelf candidates" or "aisle blockage detection."

Discovering low-shelf candidates can be explained using footage alone, is straightforward to eventify, and is easy for store managers and supervisors to understand as an extension of their daily work. The detection side can run on simple state judgment, and the outcome, a "restocking timing check candidate before the evening peak," is something the site is already familiar with. False detections can also be discussed with clear examples such as "a shelf that looks sparse but is actually fine right after restocking."

Generative AI material capturing a state where the right side of a dairy shelf is partially sparse

Low-shelf candidates become even more aligned with store operations context when you can confirm not just a single snapshot but the whole sequence through to when restocking work is completed and the shelf state recovers.

Generative AI material of a scene where a staff member approaches with a cart and performs restocking work

Aisle blockage detection connects easily to 5S activities from a safety perspective. It can be built straightforwardly around rules such as a cart or cardboard remaining near the center of the aisle, or dwell time exceeding a certain threshold. Like low-shelf candidates, it is self-contained in the footage.
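A dwell-time rule of this kind can be sketched as below. It assumes an upstream tracker that yields `(timestamp, track_id, label, zone)` tuples in time order; the zone name `aisle_center`, the 120-second dwell threshold, and the 10-second gap tolerance are hypothetical values for illustration.

```python
from typing import Dict, Iterable, List, Tuple

WATCHED_LABELS = {"cart", "cardboard"}

Observation = Tuple[float, int, str, str]  # (ts_s, track_id, label, zone)


def dwell_candidates(
    observations: Iterable[Observation],
    zone: str = "aisle_center",  # hypothetical zone name
    dwell_s: float = 120.0,      # assumed dwell threshold
    gap_s: float = 10.0,         # tracker dropouts up to this long are tolerated
) -> List[dict]:
    """Report each watched object once when it stays in the zone for dwell_s."""
    tracks: Dict[int, dict] = {}
    events: List[dict] = []
    for ts, track_id, label, obs_zone in observations:
        if label not in WATCHED_LABELS:
            continue  # people passing through are not blockage candidates
        if obs_zone != zone:
            tracks.pop(track_id, None)  # object left the zone: reset its timer
            continue
        entry = tracks.get(track_id)
        if entry is None or ts - entry["last"] > gap_s:
            entry = tracks[track_id] = {"first": ts, "last": ts, "reported": False}
        entry["last"] = ts
        if not entry["reported"] and ts - entry["first"] >= dwell_s:
            entry["reported"] = True
            events.append({
                "timestamp_s": ts,
                "track_id": track_id,
                "event_type": "aisle_blockage_candidate",
                "dwell_s": ts - entry["first"],
            })
    return events
```

Reporting each track only once keeps a cart that sits for ten minutes from producing a fresh raw event on every frame.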

Generative AI material of a scene where a cart is left in the center of the aisle

The scene images included here are not actual store footage but pseudo CCTV-style materials created with generative AI. When you want to explain the Mini-Retail FOX concept, being able to advance discussions on UX and event definitions using such pseudo materials even before real store footage is readily available can reduce rework after the PoC begins.

SOP deviations, such as "a register closed within the time window" or "a cart remaining in an unexpected location for an extended time," strongly express the character of Mini-Retail FOX. However, they require organizing store-specific operational rules, which makes them somewhat heavy for an initial PoC. I think the easier sequence to follow is to first use low-shelf candidates and aisle blockages as the main examples, keep the flow limited to "what was found, how it was described, and how the store manager or SV judged it" without going as far as automatic instructions, and then layer in SOP matching as a later development.

No Need to Rush Specialized Agent Splitting

FOX's vision involves many specialized agents, but it's easier to operate without splitting too quickly right after starting Mini-Retail FOX. Start with the processing inside a single Store Supervisor Agent, and split once you can clearly see from reviewing logs that roles are naturally separating.

Even when splitting, starting with about 4–5 agents like the following should be sufficient:

| Specialized Agent | Scope |
| --- | --- |
| shelf_agent | Status descriptions for low-shelf candidates, restocking timing, and out-of-stock candidates |
| flow_agent | Descriptions of aisle blockages, dwell detection, and traffic flow obstruction candidates |
| checkout_agent | Organization of checkout congestion, backup register candidates, and self-checkout guidance candidates |
| report_agent | Daily/weekly store operations report generation and delivery to Slack or daily reports |
| learning_queue | Collection of false detections and misses, management of feedback loop from human review results |

Keeping operational rule differences between stores in mind as a decision axis for splitting agents will make things easier later. Keeping rules common across the chain on the Supervisor Agent side, and making store-specific differences switchable via Profiles or configuration, makes it easier to take stock when expanding to additional stores later.
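One simple way to keep chain-wide rules common while making store differences switchable is a layered configuration: chain defaults first, store overrides on top. The sketch below is plain Python and says nothing about how Hermes Agent Profiles are actually configured; the rule names and numbers are placeholders (only the 60-second hold echoes the rule name in the event JSON).

```python
from copy import deepcopy

# Hypothetical chain-wide defaults kept on the Supervisor Agent side.
CHAIN_DEFAULTS = {
    "shelf_low": {"density_below": 0.4, "hold_s": 60},
    "aisle_blockage": {"dwell_s": 120},
    "notify": {"channel": "slack"},
}


def merge_rules(base: dict, override: dict) -> dict:
    """Return base with override applied recursively; inputs are left untouched."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_rules(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def rules_for_store(store_id: str, store_overrides: dict) -> dict:
    """Effective rules for one store: chain defaults plus that store's differences."""
    return merge_rules(CHAIN_DEFAULTS, store_overrides.get(store_id, {}))
```

For example, `rules_for_store("store-02", {"store-02": {"shelf_low": {"hold_s": 90}}})` changes only the hold time for that store and inherits everything else, so adding a store means writing down only what is different about it.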

Summary

Here's a comparison of the 3 configurations — DGX Spark-centric, PC and cloud split, and AWS IoT Greengrass-centric — by key aspects:

| Configuration | Suitable Situations | Initial Cost | Ease of Sharing Footage | Main Highlights |
| --- | --- | --- | --- | --- |
| DGX Spark Local | Stores where footage is hard to send outside, NVIDIA-context PoC | Medium | Manageable even for stores where sharing is difficult | Combination of local VSS and Hermes Agent |
| PC + Cloud Reasoning | Low-cost technical validation, want to keep store hardware light | Low | Design to only send anomaly candidates | Lightweight configuration starting from 1 store, 1 camera |
| AWS IoT Greengrass | PoC with multi-store deployment in view | Medium | Easy to govern on the AWS side | Role division and governance of Greengrass, Bedrock, and SageMaker |

There are several directions for expanding from here: actually installing VSS's search profile and the video-search Skill on DGX Spark and running through 1 use case of low-shelf detection; trying multi-store operation with AWS IoT Greengrass and Bedrock; building store operations automation per Hermes Agent Profile; and digging into checkout congestion analysis with Cosmos Reason or Cosmos 3. Alongside the manufacturing edition of Mini-FOX, I hope to keep tracking how Physical AI gradually spreads beyond factories to other real-world settings.

