I tried image analysis with the multimodal capabilities of Foundation Models

2026.06.13


The Foundation Models framework has supported text generation since its introduction. However, the "input an image for analysis" use case commonly seen with cloud LLMs was not supported.
At WWDC26, multimodal capabilities were newly added to Foundation Models, making it possible to combine images with prompts. Curious about what kind of image analysis could be done, I decided to try it out.
This article introduces the steps for performing image analysis using the multimodal features of Foundation Models. I hope it serves as a reference for those who want to conduct similar experiments.
Testing Environment
MacBook Pro (Apple M2 Pro)
macOS Tahoe 26.4.1
Xcode 27.0 Beta
iPhone 17 Pro Simulator (iOS 27.0 Beta)
iPhone 16e physical device (iOS 27.0 Beta)
Note: This article is based on testing with beta software. Behavior and APIs may change by the official release.
About Foundation Models Multimodal Features
The Foundation Models framework, introduced at WWDC25, enables on-device inference on devices equipped with Apple Intelligence. Recently, Riroussa presented a method that uses Foundation Models to replace diary content with emoji at "try! Swift Tokyo 2026."
https://dev.classmethod.jp/articles/please-save-genmoji/
In addition to text generation, it also supports multimodal prompts that include images. The main use cases for image analysis include the following:
Caption generation describing image content
Identification of objects appearing in images
Answering questions about images (Visual Q&A)
However, an Apple Intelligence-compatible device is required. See Apple's official page for the list of compatible devices.
Implementation Steps
Step 1: Project Setup
Create a new iOS project in Xcode and import the FoundationModels framework. Because it is a system framework, no additional SPM dependencies are required.
No special configuration is needed in Info.plist, but you must use a device with Apple Intelligence enabled.
To run the sample code, first add a simple screen that runs a process when a button is tapped and displays the result as text. The processing described below goes inside action1().
import SwiftUI
import FoundationModels

struct ContentView: View {
    @State private var text: String = ""

    var body: some View {
        VStack {
            Text(text)
            Button("Run", action: action1)
        }
    }

    func action1() {
        // Add Foundation Models processing here
    }
}
Also, don't forget to add the image to be analyzed to .xcassets. Here, I'll use a photo of the Wada family's beloved dog, "Maronii."
It's a photo taken during a walk in the park, with me wearing a green T-shirt and holding Maronii. I wonder what kind of answer will be generated when this photo is analyzed.
Step 2: Creating a Session
Get the device's default model with SystemLanguageModel.default, and check isAvailable before using it. All subsequent code goes inside action1().
// Check if the device supports Apple Intelligence
guard SystemLanguageModel.default.isAvailable else {
    text = "Apple Intelligence is not available"
    return
}
let session = LanguageModelSession()
Step 3: Building a Prompt with an Image
Send a request using the session created in Step 2, passing the text and an Attachment together using result builder syntax. Result builder syntax is a Swift notation that lets you compose multiple elements into a single prompt simply by listing them inside a closure.
In the current beta version, Attachment has only two initializers: one that takes a CGImage and one that takes a file URL. To use a UIImage, convert it with the .cgImage property before passing it.
Task {
    // Image to be analyzed
    let uiImage = UIImage(named: "SampleImage")

    // Convert UIImage → CGImage (UIImage cannot be passed directly to Attachment)
    guard let cgImage = uiImage?.cgImage else {
        text = "Failed to load image"
        return
    }

    do {
        let response = try await session.respond {
            "What is in this image? Please explain in Japanese."
            Attachment(cgImage)
        }
        text = response.content
        print(response.content)
    } catch {
        text = "Error: \(error.localizedDescription)"
        print("Error: \(error)\n\(String(reflecting: error))")
    }
}
The following analysis results were returned. To check for variability in output, the same prompt was run 4 times. The processing time is the difference between Date() values taken before and after execution; longer responses tended to take longer to process.

Response	Processing Time
This image shows a person holding a small dog. The dog has black and brown fur, a black nose, blue eyes, large ears, and a slender body. A lawn and trees are visible in the background.	3840.7 ms
This image shows a small dog in a person's arms.	2736.5 ms
This image shows a small Chihuahua dog in a person's arms.	2584.1 ms
This image shows a person holding a small dog. The dog has black and brown fur. Green trees are visible in the background.	3050.1 ms
Even with the same image and prompt, the wording changes each time, and there are instances where the breed "Chihuahua" is identified and instances where it is not. This is probabilistic behavior typical of LLMs, and it was confirmed that the same behavior occurs on-device.
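If you want to put a number on this variability, one simple approach is to count how often a keyword appears across runs. The following is a small sketch using the four responses above (English translations); mentionRate is a hypothetical helper, not part of Foundation Models.

```swift
import Foundation

/// Fraction of responses that mention `keyword` (case-insensitive).
func mentionRate(of keyword: String, in responses: [String]) -> Double {
    guard !responses.isEmpty else { return 0 }
    let hits = responses.filter {
        $0.range(of: keyword, options: .caseInsensitive) != nil
    }.count
    return Double(hits) / Double(responses.count)
}

// The four responses from the runs above.
let runs = [
    "This image shows a person holding a small dog. The dog has black and brown fur, a black nose, blue eyes, large ears, and a slender body. A lawn and trees are visible in the background.",
    "This image shows a small dog in a person's arms.",
    "This image shows a small Chihuahua dog in a person's arms.",
    "This image shows a person holding a small dog. The dog has black and brown fur. Green trees are visible in the background.",
]

// Only one of the four runs names the breed.
print(mentionRate(of: "Chihuahua", in: runs))  // 0.25
```

Four runs is far too few to draw conclusions from, so treat this only as a way to organize repeated trials.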
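Full source code for Steps 1–3
import SwiftUI
import FoundationModels

struct ContentView: View {
    @State private var text: String = ""

    var body: some View {
        VStack {
            Text(text)
            // Steps 1–3 define only action1, so the button calls it here
            Button("Run", action: action1)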
        }
    }

    func action1() {
        // Check if the device supports Apple Intelligence
        guard SystemLanguageModel.default.isAvailable else {
            text = "Apple Intelligence is not available"
            return
        }
        let session = LanguageModelSession()

        Task {
            // Image to be analyzed
            let uiImage = UIImage(named: "SampleImage")

            // Convert UIImage → CGImage (UIImage cannot be passed directly to Attachment)
            guard let cgImage = uiImage?.cgImage else {
                text = "Failed to load image"
                return
            }

            do {
                let response = try await session.respond {
                    "What is in this image? Please explain in Japanese."
                    Attachment(cgImage)
                }
                text = response.content
                print(response.content)
            } catch {
                text = "Error: \(error.localizedDescription)"
                print("Error: \(error)\n\(String(reflecting: error))")
            }
        }
    }
}
Step 4: Analysis Using Structured Output
By attaching the @Generable macro to a Swift struct or enum, you can receive the model's output as an instance of that type. The framework converts the type information into a JSON schema and passes it to the model.
The @Guide macro conveys the meaning of a property to the model in natural language and is optional. If a property name is clear enough, the model may understand the intent without it, but @Guide is useful when you want to improve output quality or constrain the range of generated values.
The @Guide descriptions here are written in English, following the official documentation samples, on the assumption that English conveys intent to the model more accurately.
@Generable
struct ImageAnalysisResult {
    @Guide(description: "A description of the image content")
    var description: String
    @Guide(description: "A list of detected objects in the image")
    var detectedObjects: [String]
    @Guide(description: "The dominant colors visible in the image")
    var dominantColors: [String]
}
Note that @Generable type information consumes the context window. The more properties there are and the longer the @Guide descriptions, the more is consumed, so it is effective to omit unnecessary properties and keep property names concise.
ImageAnalysisResult should be defined at the top level of the file (outside of ContentView).
To run the analysis with ImageAnalysisResult, add action2() to ContentView and switch the button action from action1 to action2.
func action2() {
    // Check if the device supports Apple Intelligence
    guard SystemLanguageModel.default.isAvailable else {
        text = "Apple Intelligence is not available"
        return
    }
    let session = LanguageModelSession()

    Task {
        // Image to be analyzed
        let uiImage = UIImage(named: "SampleImage")

        // Convert UIImage → CGImage (UIImage cannot be passed directly to Attachment)
        guard let cgImage = uiImage?.cgImage else {
            text = "Failed to load image"
            return
        }

        do {
            let response = try await session.respond(
                generating: ImageAnalysisResult.self
            ) {
                "Please analyze this image. Please explain in Japanese."
                Attachment(cgImage)
            }
            print(response.content.description)
            print(response.content.detectedObjects)
            print(response.content.dominantColors)
        } catch {
            text = "Error: \(error.localizedDescription)"
            print("Error: \(error)\n\(String(reflecting: error))")
        }
    }
}
The following analysis results were obtained.
A person holding a Chihuahua dog.
["dog", "human"]
["black", "brown", "green"]
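action2() only prints these values to the console. To show them on screen through the text state, the three fields can be joined into one string. The following is a minimal sketch; AnalysisSummary is a hypothetical plain struct that mirrors ImageAnalysisResult so that the example runs without the framework, and displayText is a hypothetical helper.

```swift
/// Hypothetical plain mirror of ImageAnalysisResult (no @Generable),
/// used here only so the formatting logic can run anywhere.
struct AnalysisSummary {
    var description: String
    var detectedObjects: [String]
    var dominantColors: [String]
}

/// Builds a multi-line string suitable for assigning to `text`.
func displayText(for result: AnalysisSummary) -> String {
    [
        "Description: \(result.description)",
        "Objects: \(result.detectedObjects.joined(separator: ", "))",
        "Colors: \(result.dominantColors.joined(separator: ", "))",
    ].joined(separator: "\n")
}

// The values returned in the run above.
let sample = AnalysisSummary(
    description: "A person holding a Chihuahua dog.",
    detectedObjects: ["dog", "human"],
    dominantColors: ["black", "brown", "green"]
)
print(displayText(for: sample))
```

In the app, the same formatting would be applied to response.content and the result assigned to text inside the do block.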
Verification
In the current beta version, image analysis does not work in the simulator, so verify on a physical device (see Troubleshooting for details).
Confirm in advance that Apple Intelligence is enabled on the physical device.
Settings app → "Apple Intelligence & Siri" → Turn on "Apple Intelligence"
Confirm that the language and region are set to a supported language such as English (US)
Wait for the model download to complete
With the above preparations in place, a response will be returned a few seconds after tapping the button. Since processing is done on-device, no communication with external networks occurs.
Notes
Input Image Size
The framework automatically scales the image and converts its colors before sending it to the model, so pre-conversion is not necessary. However, larger images consume more tokens, so be careful with large images in terms of response speed and the context window.
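If you do decide to shrink a large photo yourself before attaching it, the target size can be computed as below. This is a sketch under assumptions: fittedSize is a hypothetical helper, and the 1024-pixel limit is an arbitrary value chosen for illustration, not a documented threshold of the framework.

```swift
/// Returns a size whose longest side is at most `maxSide`,
/// preserving the aspect ratio. Smaller images are returned unchanged.
func fittedSize(width: Int, height: Int, maxSide: Int) -> (width: Int, height: Int) {
    let longest = max(width, height)
    guard longest > maxSide, longest > 0 else { return (width, height) }
    let scale = Double(maxSide) / Double(longest)
    return (
        width: max(1, Int((Double(width) * scale).rounded())),
        height: max(1, Int((Double(height) * scale).rounded()))
    )
}

// A 4032 x 3024 photo limited to 1024 pixels on its longest side.
let target = fittedSize(width: 4032, height: 3024, maxSide: 1024)
print("\(target.width) x \(target.height)")  // 1024 x 768
```

The actual resizing (for example, drawing the CGImage into a smaller context) is omitted here. I have not measured how much this reduces token consumption.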
Japanese Prompts
Prompts written in Japanese work, but the language of the response depends on the instructions in the prompt. If you want answers in Japanese, state it explicitly, for example "Please answer in Japanese."
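To avoid forgetting the language instruction, it can be appended in one place. The following is a small sketch; makePrompt is a hypothetical helper, and the wording of the instruction is just one example.

```swift
/// Appends an explicit response-language instruction to a question.
func makePrompt(question: String, answerLanguage: String) -> String {
    "\(question) Please answer in \(answerLanguage)."
}

let prompt = makePrompt(question: "What is in this image?", answerLanguage: "Japanese")
print(prompt)  // What is in this image? Please answer in Japanese.
```

The resulting string would be placed in the prompt closure in place of the literal used in Step 3.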
Troubleshooting
ModelManagerError 1001 Occurs
When running on the iOS simulator, the following error occurred.
Error Domain=FoundationModels.LanguageModelError Code=-1
  └─ ModelManagerServices.ModelManagerError Code=1001
This occurs when the model component for Vision does not exist. Since SystemLanguageModel.default.isAvailable only checks the readiness of the text generation model, this can occur with Vision features even after passing this check.
In the current beta version, Vision features including image analysis do not work in the simulator. This is resolved by testing on a physical device.
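Because isAvailable does not catch this case, one option is to inspect the error at runtime and show a more helpful message. The following is a minimal sketch, assuming the nested error in the log above is reachable through NSUnderlyingErrorKey (I have not confirmed how the framework exposes it). errorChain and isMissingVisionModel are hypothetical helper names, and the domain and code are taken from the log output.

```swift
import Foundation

/// Collects an error and its underlying errors, outermost first.
/// The depth cap guards against an unexpected cycle.
func errorChain(_ error: Error) -> [NSError] {
    var chain: [NSError] = []
    var current: NSError? = error as NSError
    while let nsError = current, chain.count < 16 {
        chain.append(nsError)
        current = nsError.userInfo[NSUnderlyingErrorKey] as? NSError
    }
    return chain
}

/// True if any error in the chain looks like ModelManagerError 1001.
/// Assumption: domain suffix and code match the log shown above.
func isMissingVisionModel(_ error: Error) -> Bool {
    errorChain(error).contains {
        $0.domain.hasSuffix("ModelManagerError") && $0.code == 1001
    }
}

// Reconstruct the error shape from the log to exercise the check.
let inner = NSError(domain: "ModelManagerServices.ModelManagerError", code: 1001)
let outer = NSError(
    domain: "FoundationModels.LanguageModelError",
    code: -1,
    userInfo: [NSUnderlyingErrorKey: inner]
)
print(isMissingVisionModel(outer))  // true
```

In the catch block of action1(), this check could be used to replace the generic error text with a hint to run on a physical device.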
Summary
Using the multimodal features of Foundation Models, I was able to implement on-device image analysis with a simple API. Since no server upload is required, I feel it is well suited to privacy-conscious app development.
On the other hand, an Apple Intelligence-compatible device and a beta version of Xcode are required, so setting up the development environment takes some effort at this point. I look forward to the range of target devices expanding after the official release.
I hope this serves as a reference for those who want to try out the multimodal API.
References
Analyzing images with multimodal prompting | Apple Developer Documentation
Foundation Models | Apple Developer Documentation
