Stabilizing AI Output by Using Structured Outputs with Amazon Bedrock Batch Inference as Well

This is a practical record of combining Amazon Bedrock's Batch Inference and Structured Outputs to generate article summaries in bulk with type-safe output. We kept the 50% cost reduction of batch inference while eliminating the parsing errors of our previous regular-expression approach.
2026.02.16


Our technical blog "DevelopersIO" is working on AI-generated summaries for over 60,000 articles. This time, we needed to regenerate a large number of summaries to reflect improvements to our prompts. Before processing the entire volume, we ran a pilot targeting approximately 7,000 recent articles; this article presents those results.

Previously, we introduced Amazon Bedrock's batch inference and Structured Outputs separately.

https://dev.classmethod.jp/articles/amazon-bedrock-batch-inference-efficiency/

https://dev.classmethod.jp/articles/amazon-bedrock-structured-outputs-json/

This article is a practical record of combining these two technologies.
Processing one by one on demand would take about 50 hours (about 3 seconds/item × 60,000 items), but Batch Inference promises a significant reduction in processing time and a 50% cost reduction. For basic Batch Inference setup, please refer to the first article.

Challenge: Batch Inference JSON Output Was Unstable

In our first batch inference implementation, we instructed "Please output in JSON format" in the system prompt and extracted the result with regular expressions. This made the batch prompt diverge from our production prompt, since it needed extra JSON output instructions. We also ran into issues such as JSON broken by double quotes or line breaks in Japanese text, later fields being dropped when multiple fields were requested, and missing closing brackets in long outputs.
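For illustration, the old flow looked roughly like the following simplified sketch (not our production parser):

import json
import re

def parse_legacy_output(model_text: str) -> dict:
    """Old approach (simplified): pull out a JSON-looking block with a regex
    and hope it parses. Unescaped double quotes, line breaks inside Japanese
    text, or a missing closing brace all surface here as parse failures."""
    match = re.search(r"\{.*\}", model_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))  # raises json.JSONDecodeError on broken JSON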

Along with improving our prompts, we've already migrated the on-demand side to Structured Outputs. If we could process with the same schema on the batch side, we could achieve both unified prompt management and type-safe JSON output.

Solution: Structured Outputs Worked with Batch Inference

The short answer: Structured Outputs worked as-is with Batch Inference. We initially assumed "it probably won't work with batch" and started implementing the conventional way, but the official documentation states clearly: "Batch inference - Use structured outputs within batch inference without any additional setup."

The following prompts and schema are minimal samples for reproducing functionality. Details of our production prompts remain confidential.

Sample Schema

We generate four fields in one request: summaries and details in both Japanese and English.

SCHEMA = {
    "type": "object",
    "properties": {
        "summary_ja": {"type": "string"},
        "summary_en": {"type": "string"},
        "detail_ja": {"type": "string"},
        "detail_en": {"type": "string"}
    },
    "required": ["summary_ja", "summary_en", "detail_ja", "detail_en"],
    "additionalProperties": False
}

Since output of all fields specified in required is guaranteed, missing fields or JSON syntax errors fundamentally cannot occur.

Only 3 Changes Needed in JSONL

For existing batch inference JSONL, we simply removed the JSON output instruction from the system prompt and added output_config. Result parsing also simplifies to just json.loads().

 {
   "recordId": "article_id",
   "modelInput": {
     "anthropic_version": "bedrock-2023-05-31",
     "max_tokens": 4096,
-    "system": "... Generate a JSON with summary_ja, summary_en ... Output ONLY valid JSON.",
+    "system": "... Generate summaries in both Japanese and English.",
     "messages": [{"role": "user", "content": "..."}],
+    "output_config": {
+      "format": {
+        "type": "json_schema",
+        "schema": { ... }
+      }
+    }
   }
 }
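
Concretely, building and submitting such records might look like the following sketch. SCHEMA is the sample schema above; SYSTEM_PROMPT, the S3 URIs, role ARN, and model ID are placeholders, and the output field path in parse_output_line assumes the Anthropic Messages response shape.

import json
import boto3

INPUT_S3_URI = "s3://my-batch-bucket/input/records.jsonl"   # placeholder
OUTPUT_S3_URI = "s3://my-batch-bucket/output/"              # placeholder

def build_record(article_id: str, article_body: str) -> str:
    """Build one JSONL line with output_config attached (same shape as the diff above)."""
    record = {
        "recordId": article_id,
        "modelInput": {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 4096,
            "system": SYSTEM_PROMPT,  # same prompt as on-demand, no JSON instructions
            "messages": [{"role": "user", "content": article_body}],
            "output_config": {
                "format": {"type": "json_schema", "schema": SCHEMA}
            },
        },
    }
    return json.dumps(record, ensure_ascii=False)

# Submit the job once the JSONL has been uploaded to S3.
bedrock = boto3.client("bedrock")
bedrock.create_model_invocation_job(
    jobName="summary-regeneration-batch",
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",   # placeholder
    modelId="us.anthropic.claude-3-5-haiku-20241022-v1:0",       # placeholder
    inputDataConfig={"s3InputDataConfig": {"s3Uri": INPUT_S3_URI}},
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": OUTPUT_S3_URI}},
)

def parse_output_line(line: str) -> dict:
    """Parse one line of the batch output JSONL. With Structured Outputs the
    model text is guaranteed to be valid JSON, so json.loads() is enough."""
    record = json.loads(line)
    model_text = record["modelOutput"]["content"][0]["text"]
    return json.loads(model_text)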

Using the same prompt and schema for both on-demand and batch ensures consistency in quality.

Schema Format Differences Between APIs

When using Structured Outputs across multiple APIs, how schemas are passed differs by API.

| API | Parameter | Schema type | name required |
| --- | --- | --- | --- |
| Converse API | outputConfig.textFormat.structure.jsonSchema | JSON string | Yes |
| InvokeModel API | modelInput.output_config.format | dict object | No |
| Batch Inference (JSONL) | modelInput.output_config.format | dict object | No |

It's practical to centrally manage the schema as a dict and only use json.dumps() for the Converse API.

# Converse API -- stringify schema, name required
outputConfig={"textFormat": {"type": "json_schema", "structure": {
    "jsonSchema": {"name": "my_schema", "schema": json.dumps(SCHEMA)}}}}

# Batch Inference (in JSONL) -- dict as is, no name
"output_config": {"format": {"type": "json_schema", "schema": SCHEMA}}

When Batch Is Not Suitable: Comparison with Prompt Caching

If the same large system prompt is sent repeatedly, on-demand requests with prompt caching can be more cost-efficient than batch.

Our system includes a tag master CSV (about 3,000 entries, approximately 15,000 tokens) in the system prompt for tag classification processing. For this process, we reduce input costs for the second and subsequent items by 90% using prompt caching.

system=[
    {"text": tag_prompt},
    {"cachePoint": {"type": "default"}}
]
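
A fuller on-demand call might look like the following sketch; the model ID is a placeholder and classify_tags is a hypothetical helper name.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def classify_tags(article_body: str, tag_prompt: str) -> str:
    """On-demand Converse call. The cachePoint caches everything up to the end
    of the large system prompt, so the second and later requests read it from
    cache at a fraction of the normal input price."""
    response = bedrock_runtime.converse(
        modelId="us.anthropic.claude-3-5-haiku-20241022-v1:0",  # placeholder
        system=[
            {"text": tag_prompt},                 # ~15,000-token tag master CSV
            {"cachePoint": {"type": "default"}},  # cache boundary
        ],
        messages=[{"role": "user", "content": [{"text": article_body}]}],
    )
    return response["output"]["message"]["content"][0]["text"]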

Cost comparison for 1,000 tag classifications (system prompt about 15,000 tokens, article input about 500 tokens, output about 100 tokens):

| Method | Input Cost | Output Cost | Total |
| --- | --- | --- | --- |
| Batch (50% OFF) | $6.20 | $0.20 | $6.40 |
| On-demand (no cache) | $12.40 | $0.40 | $12.80 |
| On-demand (with cache) | $1.60 | $0.40 | $2.00 |

*Pricing calculated using Claude 3.5 Haiku (input $0.80/1M, output $4.00/1M)
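
These figures come from simple token arithmetic. A quick back-of-the-envelope check (cache reads priced at roughly 10% of the normal input rate, ignoring the one-time cache-write premium):

# Claude 3.5 Haiku pricing from the note above (USD per token)
INPUT_PRICE, OUTPUT_PRICE = 0.80 / 1_000_000, 4.00 / 1_000_000
ITEMS, SYSTEM_TOKENS, ARTICLE_TOKENS, OUTPUT_TOKENS = 1_000, 15_000, 500, 100

input_cost = ITEMS * (SYSTEM_TOKENS + ARTICLE_TOKENS) * INPUT_PRICE   # $12.40
output_cost = ITEMS * OUTPUT_TOKENS * OUTPUT_PRICE                    # $0.40

on_demand = input_cost + output_cost                   # $12.80
batch = 0.5 * (input_cost + output_cost)               # $6.40 (50% OFF)
cached = (ITEMS * SYSTEM_TOKENS * INPUT_PRICE * 0.1    # cached prompt read at ~10%
          + ITEMS * ARTICLE_TOKENS * INPUT_PRICE       # per-article input, uncached
          + output_cost)                               # output is not discounted
print(f"on-demand: ${on_demand:.2f}, batch: ${batch:.2f}, cached: ${cached:.2f}")
# on-demand: $12.80, batch: $6.40, cached: $2.00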

When a large system prompt is reused repeatedly, the 90% OFF from prompt caching beats the 50% OFF from batch. However, prompt caching has a 5-minute TTL: if the interval between requests exceeds it, the cache expires and pricing reverts to normal, so this approach assumes a workload that can be processed continuously. Batch inference has no such time constraint, offering the operational convenience of "submit it anytime and forget about it."

Decision Flowchart

The minimum requirement for Batch Inference is 1,000 items. For fewer items, it's not suitable as wait time alone exceeds 15 minutes. For 1,000+ items, processing time remains nearly constant due to parallel processing (measured: 1,000 items in 17 minutes, 7,950 items in 21 minutes).

Our system's approach is as follows:

| Process | System Prompt | Method | Reason |
| --- | --- | --- | --- |
| AI Summary Generation | About 800 tokens | Batch Inference | Small prompt, 50% OFF effective |
| Article Evaluation | About 600 tokens | Batch Inference | Small prompt, 50% OFF effective |
| Tag Classification | About 15,000 tokens | On-demand + Cache | Large prompt, 90% OFF with caching is advantageous |
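
As a rough rule of thumb, this decision logic can be written down as a small function. The 10,000-token threshold below is an assumption for illustration, not a hard limit:

def choose_method(num_items: int, system_prompt_tokens: int,
                  can_run_continuously: bool) -> str:
    """Rough decision logic based on the trade-offs discussed above."""
    if num_items < 1_000:
        return "on-demand"  # below the batch minimum
    if system_prompt_tokens >= 10_000 and can_run_continuously:
        return "on-demand + prompt caching"  # 90% OFF on the big shared prompt
    return "batch inference + structured outputs"  # 50% OFF, no TTL constraint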

Implementation Points

Adding the us. prefix to model IDs enables Cross-Region Inference Profiles, which helps avoid capacity shortages in specific regions through cross-region load balancing. Additionally, checking for duplicate recordIds, ensuring file size is under 200MB, and confirming record counts are within the 1,000-50,000 range before submission helps prevent failures at the Validating stage.
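
A pre-flight check along those lines might look like this sketch, using the limits mentioned above:

import json
import os

MAX_FILE_BYTES = 200 * 1024 * 1024   # 200 MB
MIN_RECORDS, MAX_RECORDS = 1_000, 50_000

def validate_jsonl(path: str) -> None:
    """Fail fast before submission instead of at the job's Validating stage."""
    if os.path.getsize(path) > MAX_FILE_BYTES:
        raise ValueError("JSONL file exceeds 200MB")

    record_ids = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record_ids.append(json.loads(line)["recordId"])

    if len(record_ids) != len(set(record_ids)):
        raise ValueError("Duplicate recordId found")
    if not MIN_RECORDS <= len(record_ids) <= MAX_RECORDS:
        raise ValueError(f"Record count {len(record_ids)} outside the 1,000-50,000 range")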

Execution Results

First Run: Pilot (1,000 items)

| Item | Value |
| --- | --- |
| Processing Time | 17 minutes |
| Success Rate | 973/1,000 (97.3%) |
| JSON Parsing Success Rate | 100% (of successful records) |
| Error Breakdown | Grammar compilation timed out: 26 items; Content filtering: 1 item |

No JSON syntax errors occurred in successful records.
The 26 errors were not reproducible; judging from the symptoms, they may have been caused by temporary infrastructure load or cold-start timeouts.

Second Run: Scale-up (6,284 items)

| Item | Value |
| --- | --- |
| Processing Time | 18 minutes |
| Success Rate | 6,281/6,284 (99.95%) |
| Errors | 3 items |

Processing time remained almost the same despite a 6-fold increase in volume. The error rate improved from 2.7% to 0.05%, suggesting some warm-up effect on the infrastructure side.

Retrying Failed Items

The 30 failed items from the first and second runs were individually retried using the on-demand Converse API, resolving all cases. Since the minimum batch requirement is 1,000 items, it was more reasonable to retry a small number of failures immediately using on-demand.
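
We used the Converse API for these retries; as one illustration, a failed record's modelInput can also be replayed as-is through InvokeModel, since the JSONL body matches the InvokeModel request format (a sketch under that assumption, with a placeholder model ID):

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def retry_record(failed_record: dict) -> dict:
    """Replay one failed batch record on demand, reusing its modelInput unchanged."""
    response = bedrock_runtime.invoke_model(
        modelId="us.anthropic.claude-3-5-haiku-20241022-v1:0",  # placeholder
        body=json.dumps(failed_record["modelInput"], ensure_ascii=False),
    )
    body = json.loads(response["body"].read())
    return json.loads(body["content"][0]["text"])  # structured JSON output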

Summary of Benefits

| Aspect | On-demand One by One | Batch + Structured Outputs |
| --- | --- | --- |
| Processing Time (1,000 items) | About 50 minutes (3 sec/item) | 17 minutes (parallel processing) |
| Processing Time (7,000 items) | About 6 hours | About 20 minutes (2 batches) |
| Cost | 100% of on-demand pricing | 50% reduction |
| JSON Parsing | Regular expressions + validation | json.loads() only |
| Parsing Failure Rate | Could occur at several percent | 0% (schema-guaranteed) |
| Prompt Management | Managed separately for batch | Completely identical to production |

Improvements were achieved in processing time, cost, and quality. In particular, the ability to unify prompt management was a significant gain for future maintainability.

Conclusion

By combining Batch Inference and Structured Outputs, we achieved both type-safe JSON output and 50% cost reduction simultaneously. Since prompts and schemas can be completely shared between on-demand and batch processing, we avoided the operational burden of "managing separate prompts for batch."

Going forward, whenever new model releases or prompt tuning require bulk updates to existing data, we plan to rely on Batch Inference with Structured Outputs again.
