
I tried image analysis with the multimodal capabilities of Foundation Models
The Foundation Models framework has supported text generation since its introduction. However, a use case common with cloud LLMs, passing in an image for analysis, was not possible.
At WWDC26, multimodal capabilities were added to Foundation Models, making it possible to combine images with prompts. Curious about what kind of image analysis it could do, I decided to try it out.
This article introduces the steps to perform image analysis using the multimodal features of Foundation Models. I hope it serves as a useful reference for those who want to try similar experiments.
Test Environment
- MacBook Pro (Apple M2 Pro)
- macOS Tahoe 26.4.1
- Xcode 27.0 Beta
- iPhone 17 Pro Simulator (iOS 27.0 Beta)
- iPhone 16e physical device (iOS 27.0 Beta)
About Foundation Models Multimodal Features
The Foundation Models framework, which debuted at WWDC25, enables on-device inference on devices that support Apple Intelligence. Recently, Riru Ossa introduced a method using Foundation Models at "try! Swift Tokyo 2026" to replace diary content with emoji.
In addition to text generation, it also supports multimodal prompts that include images. The main use cases available for image analysis include the following:
- Caption generation that describes the content of an image
- Identification of objects appearing in an image
- Answering questions about an image (Visual Q&A)
Note that these features require a device that supports Apple Intelligence. See Apple's official page for the list of compatible devices.
Implementation Steps
Step 1: Project Setup
Create a new iOS project in Xcode and use the FoundationModels framework. No additional SPM dependencies are required, because it is available as a system framework.
No special configuration is needed in Info.plist, but you must use a device with Apple Intelligence enabled.
To run the sample code, first add a simple screen that runs a process when a button is tapped and displays the result as text. The body of action1() is a placeholder; the Foundation Models processing described in the following steps goes there.
import SwiftUI
import FoundationModels

struct ContentView: View {
    @State private var text: String = ""

    var body: some View {
        VStack {
            Text(text)
            Button("Run", action: action1)
        }
    }

    func action1() {
        // Add Foundation Models processing here
    }
}
Also, don't forget to add the image to be analyzed to the asset catalog (.xcassets). The sample code loads it by the name "SampleImage". Here, we use a photo of the Wada family's beloved dog "Maronii."

It was taken during a walk in the park and shows me in a green T-shirt, holding Maronii. Let's see what kind of response the model generates for this photo.
Step 2: Creating a Session
Retrieve the device's default model with SystemLanguageModel.default. Check whether it is available using isAvailable before using it. All subsequent code will be added inside action1().
// Check if the device supports Apple Intelligence
guard SystemLanguageModel.default.isAvailable else {
    text = "Apple Intelligence is not available"
    return
}
let session = LanguageModelSession()
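When the check fails, it can help to know why. The following is a minimal sketch, assuming the `availability` property and the unavailability reasons documented for the framework's initial release are unchanged in this beta. `unavailabilityMessage()` is a hypothetical helper name, not part of the framework, and the messages are placeholders.

```swift
import FoundationModels

/// Hypothetical helper: returns nil when the default model is ready,
/// otherwise a message explaining why it cannot be used.
func unavailabilityMessage() -> String? {
    switch SystemLanguageModel.default.availability {
    case .available:
        return nil
    case .unavailable(.deviceNotEligible):
        return "This device does not support Apple Intelligence"
    case .unavailable(.appleIntelligenceNotEnabled):
        return "Apple Intelligence is turned off in Settings"
    case .unavailable(.modelNotReady):
        return "The model is not ready yet (it may still be downloading)"
    case .unavailable(let reason):
        // Covers any reason added in later OS versions (assumption).
        return "Model unavailable: \(reason)"
    }
}
```

Inside action1(), the returned message could be assigned to `text` before returning, in place of the fixed "Apple Intelligence is not available" string.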
Step 3: Building a Prompt with an Image
Send a request using the session created in Step 2. Pass the text and an Attachment together using result builder syntax, a Swift notation that lets you list multiple elements inside a closure and compose them into a single prompt.
In the current beta version, Attachment has only two initializers: one that takes a CGImage and one that takes a file URL. To use a UIImage, convert it with the .cgImage property before passing it.
Task {
    // Image to be analyzed
    let uiImage = UIImage(named: "SampleImage")
    // Convert UIImage → CGImage (UIImage cannot be passed directly to Attachment)
    guard let cgImage = uiImage?.cgImage else {
        text = "Failed to load image"
        return
    }
    do {
        let response = try await session.respond {
            "What is in this image? Please explain in Japanese."
            Attachment(cgImage)
        }
        text = response.content
        print(response.content)
    } catch {
        text = "Error: \(error.localizedDescription)"
        print("Error: \(error)\n\(String(reflecting: error))")
    }
}
The following analysis results were returned. To check the variance in output, the same prompt was executed 4 times. The processing time is shown in milliseconds and was measured as the difference in Date() before and after execution; longer responses tended to take longer to process.
| Response | Processing Time |
|---|---|
| This image shows a person holding a small dog. The dog has black and brown fur, a black nose, blue eyes, large ears, and a slender body. In the background, grass and trees are visible. | 3840.7 ms |
| In this image, a small dog is in a person's arms. | 2736.5 ms |
| In this image, a small Chihuahua dog is in a person's arms. | 2584.1 ms |
| In this image, there is a person holding a small dog. The dog has black and brown fur. Green trees are visible in the background. | 3050.1 ms |
Even with the same image and prompt, the expression changes each time, with some instances identifying the breed as "Chihuahua" and others not. This is probabilistic behavior typical of LLMs, and it was confirmed that on-device models exhibit the same behavior.
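For reference, the Date()-based measurement behind the processing times in the table can be sketched as follows. `Stopwatch` is a hypothetical helper introduced here for illustration; the original measurement code is not shown in this article, so treat this as one possible way to do it.

```swift
import Foundation

/// Hypothetical helper: records the creation time and reports the
/// wall-clock time elapsed since then.
struct Stopwatch {
    private let start = Date()

    /// Elapsed time in milliseconds since the stopwatch was created.
    var elapsedMilliseconds: Double {
        Date().timeIntervalSince(start) * 1000
    }

    /// Elapsed time formatted with one decimal place, e.g. "3840.7 ms".
    var formatted: String {
        String(format: "%.1f ms", elapsedMilliseconds)
    }
}
```

Create the stopwatch immediately before the call to session.respond and read `formatted` right after the response arrives. Because Date() is wall-clock time, the result is a rough indication rather than a precise benchmark.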
Step 4: Analysis Using Structured Output
By attaching the @Generable macro to a Swift struct or enum, you can receive the model's output as an instance of that type. The framework converts the type information into a JSON schema and passes it to the model.
The @Guide macro conveys the meaning of a property to the model in natural language and is optional. If a property name is sufficiently clear, the model may understand the intent without it. It is useful when you want to improve output quality or constrain the range of values to be generated.
The descriptions in @Guide are written in English, following the samples in the official documentation, on the assumption that English instructions convey intent to the model more accurately.
@Generable
struct ImageAnalysisResult {
    @Guide(description: "A description of the image content")
    var description: String

    @Guide(description: "A list of detected objects in the image")
    var detectedObjects: [String]

    @Guide(description: "The dominant colors visible in the image")
    var dominantColors: [String]
}
Note that @Generable type information consumes the context window. Consumption grows with the number of properties and the length of the @Guide descriptions, so it helps to omit unnecessary properties and keep property names concise.
ImageAnalysisResult should be defined at the top level of the file (outside of ContentView).
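As a side note on constraining generated values, the sketch below shows one way it might look. It assumes that the `.count` generation guide from the framework's initial release also works with image prompts in this beta, which was not verified in this article. `SubjectKind` and `ConstrainedAnalysisResult` are hypothetical types introduced for illustration only.

```swift
import FoundationModels

// Hypothetical enum: limits the answer to a fixed set of choices.
@Generable
enum SubjectKind {
    case person
    case animal
    case object
    case scenery
}

// Hypothetical variant of ImageAnalysisResult with constrained values.
@Generable
struct ConstrainedAnalysisResult {
    @Guide(description: "The main subject of the image")
    var subject: SubjectKind

    // .count(3) asks the model for exactly three array elements.
    @Guide(description: "The dominant colors visible in the image", .count(3))
    var dominantColors: [String]
}
```

A fixed-size array and an enum also keep the output short, which fits the advice above about context window consumption.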
Next, run an analysis that returns an ImageAnalysisResult instead of plain text. Add action2() to ContentView and switch the button action from action1 to action2 to verify.
func action2() {
    // Check if the device supports Apple Intelligence
    guard SystemLanguageModel.default.isAvailable else {
        text = "Apple Intelligence is not available"
        return
    }
    let session = LanguageModelSession()
    Task {
        // Image to be analyzed
        let uiImage = UIImage(named: "SampleImage")
        // Convert UIImage → CGImage (UIImage cannot be passed directly to Attachment)
        guard let cgImage = uiImage?.cgImage else {
            text = "Failed to load image"
            return
        }
        do {
            let response = try await session.respond(
                generating: ImageAnalysisResult.self
            ) {
                "Please analyze this image. Please explain in Japanese."
                Attachment(cgImage)
            }
            print(response.content.description)
            print(response.content.detectedObjects)
            print(response.content.dominantColors)
        } catch {
            text = "Error: \(error.localizedDescription)"
            print("Error: \(error)\n\(String(reflecting: error))")
        }
    }
}
The following analysis results were obtained.
A person holding a Chihuahua dog.
["dog", "human"]
["black", "brown", "green"]
Verification
In the current beta version, image analysis does not work in the simulator, so verification must be done on a physical device (see Troubleshooting for details).
Confirm in advance that Apple Intelligence is enabled on the physical device.
- Open the Settings app → "Apple Intelligence & Siri" → Turn on "Apple Intelligence"
- Confirm that the language and region are set to a supported language such as English (US)
- Wait for the model download to complete
Once the above preparations are complete, tapping the button returns a response after a few seconds. Since processing is done on-device, no communication with external networks occurs.
Notes
Input Image Size
The framework automatically performs scaling and color conversion before sending the image to the model, so pre-conversion is not necessary. However, larger images consume more tokens, so be careful with large images in terms of both response speed and context window.
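If you do want to cap the input size yourself, one option is to redraw the CGImage into a smaller bitmap before attaching it. This is a minimal sketch using Core Graphics. `fittedSize` and `downscaled` are hypothetical helper names, and any limit you pass as `maxDimension` is an arbitrary choice; I did not measure whether this changes token consumption.

```swift
import Foundation
#if canImport(CoreGraphics)
import CoreGraphics
#endif

/// Hypothetical helper: returns a size whose longer side is at most
/// `maxDimension`, preserving the aspect ratio. Never upscales.
func fittedSize(width: Int, height: Int, maxDimension: Int) -> (width: Int, height: Int) {
    let longer = max(width, height)
    guard longer > maxDimension, longer > 0 else { return (width, height) }
    let scale = Double(maxDimension) / Double(longer)
    return (max(1, Int((Double(width) * scale).rounded())),
            max(1, Int((Double(height) * scale).rounded())))
}

#if canImport(CoreGraphics)
/// Hypothetical helper: redraws `image` into a smaller RGBA bitmap.
/// Returns nil if the bitmap context cannot be created.
func downscaled(_ image: CGImage, maxDimension: Int) -> CGImage? {
    let size = fittedSize(width: image.width, height: image.height,
                          maxDimension: maxDimension)
    guard let context = CGContext(
        data: nil,
        width: size.width,
        height: size.height,
        bitsPerComponent: 8,
        bytesPerRow: 0, // let Core Graphics choose the row stride
        space: CGColorSpaceCreateDeviceRGB(),
        bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue
    ) else { return nil }
    context.interpolationQuality = .high
    context.draw(image, in: CGRect(x: 0, y: 0, width: size.width, height: size.height))
    return context.makeImage()
}
#endif
```

For example, a 4032 × 3024 photo with `maxDimension: 1024` would be redrawn at 1024 × 768, and the result can be passed to Attachment in place of the original CGImage.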
Japanese Prompts
Prompts written in Japanese work, but the language of the response depends on the instructions in the prompt. If you want a response in Japanese, it is best to state explicitly "Please answer in Japanese."
Troubleshooting
ModelManagerError 1001 Occurs
When running on the iOS Simulator, the following error occurred.
Error Domain=FoundationModels.LanguageModelError Code=-1
└─ ModelManagerServices.ModelManagerError Code=1001
This error occurs when the Vision model component is missing. SystemLanguageModel.default.isAvailable only checks the readiness of the text generation model, so the error can occur with Vision features even when that check passes.
In the current beta version, Vision features including image analysis do not work in the Simulator. This can be resolved by testing on a physical device.
Summary
Using the multimodal features of Foundation Models, I was able to implement on-device image analysis with a simple API. Since no server upload is required, I feel it is well suited to privacy-conscious app development.
On the other hand, it requires a device compatible with Apple Intelligence and a beta version of Xcode, so setting up the development environment takes some time at this point. I expect the range of target devices to expand after the official release, and I look forward to seeing how it develops.
I hope this serves as a useful reference for those who want to try the multimodal API.