Using Structured Outputs with Amazon Bedrock Batch Inference as Well: Stabilizing AI Output
Our technical blog "DevelopersIO" generates AI summaries for over 60,000 articles. This time, we needed to reprocess a large number of summaries to reflect improvements to our prompts. Before processing the full volume at once, we ran a pilot on approximately 7,000 recent articles, and this article presents the results.
Previously, we introduced Amazon Bedrock's batch inference and Structured Outputs separately.
This article is a practical record of combining these two technologies.
Processing one by one on demand would take about 50 hours (about 3 seconds/item × 60,000 items), but Batch Inference promises a significant reduction in processing time and a 50% cost reduction. For basic Batch Inference setup, please refer to the first article.
Challenge: Batch Inference JSON Output Was Unstable
In our first batch inference implementation, we instructed "Please output in JSON format" in the system prompt and parsed the results using regular expressions. This method created a divergence from our production prompt by adding JSON output instructions for batch processing. Additionally, we encountered issues such as JSON being broken by double quotes or line breaks in Japanese text, missing latter parts of multiple fields, and missing closing brackets in long texts.
Along with improving our prompts, we've already migrated the on-demand side to Structured Outputs. If we could process with the same schema on the batch side, we could achieve both unified prompt management and type-safe JSON output.
Solution: Structured Outputs Worked with Batch Inference
In conclusion, Structured Outputs worked as-is with Batch Inference. Initially, we assumed "it probably won't work with batch" and proceeded with implementation using the conventional method, but the official documentation clearly states, "Batch inference - Use structured outputs within batch inference without any additional setup."
The following prompts and schema are minimal samples for reproducing functionality. Details of our production prompts remain confidential.
Sample Schema
We generate four fields in one request: summaries and details in both Japanese and English.
SCHEMA = {
    "type": "object",
    "properties": {
        "summary_ja": {"type": "string"},
        "summary_en": {"type": "string"},
        "detail_ja": {"type": "string"},
        "detail_en": {"type": "string"}
    },
    "required": ["summary_ja", "summary_en", "detail_ja", "detail_en"],
    "additionalProperties": False
}
Since output of all fields specified in required is guaranteed, missing fields or JSON syntax errors fundamentally cannot occur.
Only 3 Changes Needed in JSONL
For existing batch inference JSONL, we simply removed the JSON output instruction from the system prompt and added output_config. Result parsing also simplifies to just json.loads().
{
    "recordId": "article_id",
    "modelInput": {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
-       "system": "... Generate a JSON with summary_ja, summary_en ... Output ONLY valid JSON.",
+       "system": "... Generate summaries in both Japanese and English.",
        "messages": [{"role": "user", "content": "..."}],
+       "output_config": {
+           "format": {
+               "type": "json_schema",
+               "schema": { ... }
+           }
+       }
    }
}
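On the output side, parsing really does reduce to json.loads(). Here is a minimal sketch, assuming the usual batch output layout in which each line carries a recordId and a modelOutput whose first content block holds the schema-conforming JSON (the file name and variable names are illustrative):

import json

results = {}
with open("output.jsonl.out", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        model_output = record.get("modelOutput")
        if not model_output:
            # failed records carry error information instead of a modelOutput
            continue
        text = model_output["content"][0]["text"]
        # the schema guarantees valid JSON with all four fields present
        results[record["recordId"]] = json.loads(text)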
Using the same prompt and schema for both on-demand and batch ensures consistency in quality.
Schema Format Differences Between APIs
When using Structured Outputs across multiple APIs, how schemas are passed differs by API.
| API | Parameter | schema type | name required |
|---|---|---|---|
| Converse API | outputConfig.textFormat.structure.jsonSchema | JSON string | Yes |
| InvokeModel API | modelInput.output_config.format | dict object | No |
| Batch Inference (JSONL) | modelInput.output_config.format | dict object | No |
It's practical to centrally manage the schema as a dict and only use json.dumps() for the Converse API.
# Converse API -- stringify schema, name required
outputConfig={"textFormat": {"type": "json_schema", "structure": {
    "jsonSchema": {"name": "my_schema", "schema": json.dumps(SCHEMA)}}}}

# Batch Inference (in JSONL) -- dict as is, no name
"output_config": {"format": {"type": "json_schema", "schema": SCHEMA}}
When Batch Is Not Suitable: Comparison with Prompt Caching
If a large system prompt is reused across many requests, on-demand processing combined with prompt caching can be more cost-efficient than batch.
Our system includes a tag master CSV (about 3,000 entries, approximately 15,000 tokens) in the system prompt for tag classification processing. For this process, we reduce input costs for the second and subsequent items by 90% using prompt caching.
system=[
    {"text": tag_prompt},
    {"cachePoint": {"type": "default"}}
]
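In context, the cache point sits at the end of the system blocks of an ordinary Converse call, roughly like this sketch (the model ID and variable names are illustrative):

import boto3

bedrock = boto3.client("bedrock-runtime")

response = bedrock.converse(
    modelId="us.anthropic.claude-3-5-haiku-20241022-v1:0",
    system=[
        {"text": tag_prompt},                 # ~15,000-token tag master prompt
        {"cachePoint": {"type": "default"}},  # everything up to here is cached for ~5 minutes
    ],
    messages=[{"role": "user", "content": [{"text": article_text}]}],
)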
Cost comparison for 1,000 tag classifications (system prompt about 15,000 tokens, article input about 500 tokens, output about 100 tokens):
| Method | Input Cost | Output Cost | Total |
|---|---|---|---|
| Batch (50% OFF) | $6.20 | $0.20 | $6.40 |
| On-demand (no cache) | $12.40 | $0.40 | $12.80 |
| On-demand (with cache) | $1.60 | $0.40 | $2.00 |
*Pricing calculated using Claude 3.5 Haiku (input $0.80/1M, output $4.00/1M)
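As a sanity check, the table can be reproduced with simple arithmetic, assuming Bedrock's usual cache pricing of reads at 10% of the input rate and writes at 125%:

IN, OUT = 0.80 / 1e6, 4.00 / 1e6             # Claude 3.5 Haiku per-token prices
n, sys_tok, art_tok, out_tok = 1000, 15_000, 500, 100

no_cache = n * (sys_tok + art_tok) * IN + n * out_tok * OUT   # ≈ $12.80
batch = no_cache * 0.5                                        # ≈ $6.40
cached = (sys_tok * IN * 1.25                # first request writes the cache
          + (n - 1) * sys_tok * IN * 0.10    # later requests read it at 10%
          + n * art_tok * IN                 # per-article tokens are never cached
          + n * out_tok * OUT)               # output is unaffected by caching, ≈ $2.00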
When a large system prompt is used repeatedly, the 90% discount from prompt caching beats the 50% discount from batch. However, prompt caching has a 5-minute TTL: if more than 5 minutes pass between requests, the cache expires and pricing reverts to normal, so it assumes a workload that can be processed continuously. Batch inference has no such time constraint, offering the operational convenience of "submit it anytime and forget about it."
Decision Flowchart
Batch Inference requires a minimum of 1,000 records per job, and queue wait time alone can exceed 15 minutes, so it is not suited to small volumes. At 1,000+ items, processing time stays nearly constant thanks to parallel processing (measured: 1,000 items in 17 minutes, 7,950 items in 21 minutes).
Our system's approach is as follows:
| Process | System Prompt | Method | Reason |
|---|---|---|---|
| AI Summary Generation | About 800 tokens | Batch Inference | Small prompt, 50% OFF effective |
| Article Evaluation | About 600 tokens | Batch Inference | Small prompt, 50% OFF effective |
| Tag Classification | About 15,000 tokens | On-demand + Cache | Large prompt, 90% OFF with caching is advantageous |
Implementation Points
Adding the us. prefix to model IDs enables Cross-Region Inference Profiles, which helps avoid capacity shortages in specific regions through cross-region load balancing. Additionally, checking for duplicate recordIds, ensuring file size is under 200MB, and confirming record counts are within the 1,000-50,000 range before submission helps prevent failures at the Validating stage.
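A pre-submission check along these lines catches most Validating-stage failures (a sketch using only the limits mentioned above; adjust if your account's quotas differ):

import json
import os

def validate_batch_input(path: str) -> None:
    assert os.path.getsize(path) < 200 * 1024 * 1024, "file must be under 200MB"
    record_ids = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record_ids.append(json.loads(line)["recordId"])
    assert 1_000 <= len(record_ids) <= 50_000, "record count must be within 1,000-50,000"
    assert len(record_ids) == len(set(record_ids)), "duplicate recordId found"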
Execution Results
First Run: Pilot (1,000 items)
| Item | Value |
|---|---|
| Processing Time | 17 minutes |
| Success Rate | 973/1,000 (97.3%) |
| JSON Parsing Success Rate | 100% (successful records) |
| Error Breakdown | Grammar compilation timed out: 26 items, Content filtering: 1 item |
No JSON syntax errors occurred in successful records.
The 26 errors were not reproducible; judging from the symptoms, they may have been caused by temporary infrastructure load or cold-start timeouts.
Second Run: Scale-up (6,284 items)
| Item | Value |
|---|---|
| Processing Time | 18 minutes |
| Success Rate | 6,281/6,284 (99.95%) |
| Errors | 3 items |
Processing time remained almost the same despite a 6-fold increase in volume. The error rate improved from 2.7% to 0.05%, suggesting some warm-up effect on the infrastructure side.
Retrying Failed Items
The 30 failed items from the first and second runs were individually retried using the on-demand Converse API, resolving all cases. Since the minimum batch requirement is 1,000 items, it was more reasonable to retry a small number of failures immediately using on-demand.
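The retries reused the same SCHEMA through the Converse-style outputConfig shown earlier, roughly like this sketch (the model ID and variable names are ours; response parsing follows the standard Converse response shape):

summaries = {}
for record_id, system_prompt, article_text in failed_items:   # hypothetical list of failed records
    response = bedrock.converse(
        modelId="us.anthropic.claude-3-5-haiku-20241022-v1:0",
        system=[{"text": system_prompt}],
        messages=[{"role": "user", "content": [{"text": article_text}]}],
        outputConfig={"textFormat": {"type": "json_schema", "structure": {
            "jsonSchema": {"name": "summary_schema", "schema": json.dumps(SCHEMA)}}}},
    )
    summaries[record_id] = json.loads(response["output"]["message"]["content"][0]["text"])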
Summary of Benefits
| Aspect | On-demand One by One | Batch + Structured Outputs |
|---|---|---|
| Processing Time (1,000 items) | About 50 minutes (3 sec/item) | 17 minutes (parallel processing) |
| Processing Time (7,000 items) | About 6 hours | About 20 minutes (2 batches) |
| Cost | On-demand pricing 100% | 50% reduction |
| JSON Parsing | Regular expressions + validation | Only json.loads() |
| Parsing Failure Rate | Could occur at several % | 0% (schema guaranteed) |
| Prompt Management | Separately managed for batch | Completely identical to production |
Improvements were achieved in processing time, cost, and quality. In particular, the ability to unify prompt management was a significant gain for future maintainability.
Conclusion
By combining Batch Inference and Structured Outputs, we achieved both type-safe JSON output and 50% cost reduction simultaneously. Since prompts and schemas can be completely shared between on-demand and batch processing, we avoided the operational burden of "managing separate prompts for batch."
In the future, when bulk update maintenance of existing data becomes necessary due to new model releases or prompt tuning, we plan to utilize Batch Inference and Structured Outputs.