I tried extracting exercise records from Ring Fit Adventure's results screen using the multimodal capabilities of Foundation Models

2026.06.18
I'm developing an app called "NSEasyConnect" that uses Vision.framework OCR to read Ring Fit Adventure screenshots and turn the exercise records into numbers. However, misrecognitions such as confusing 1 and 7 occur frequently, and I have been working around them with separately implemented numerical correction logic.
Against this background, I decided to test whether the Foundation Models multimodal feature could replace this processing with something simpler. The basic usage of the multimodal feature is covered in the following article.
https://dev.classmethod.jp/articles/foundation-models-multimodal-image-analysis/
This article walks through the implementation steps for extracting specific fields from an image as structured data using @Generable. I had to revisit the design several times before getting the expected results, so I also share that trial-and-error process. I hope it serves as a reference for anyone attempting a similar verification.

Verification Environment

MacBook Pro (16-inch, 2023), Apple M2 Pro
macOS Tahoe 26.4.1
Xcode 27.0 Beta
iPhone 16e physical device (iOS 27.0 Beta)
Note: This article contains content verified with beta software; behavior and APIs may change in the official release.

Images to Analyze

Screenshots can be saved to the camera roll using the Nintendo Switch's "Send to Smartphone" feature. I used the following two images for verification.
The first image is a result screen (rfa1) displaying two fields: total activity time and total calories burned.
The second image is a result screen (rfa2) that displays total activity time and total calories burned, as well as total running distance.
In Ring Fit Adventure, the fields displayed vary depending on the type of workout. Running distance is only shown on days when running-type events are completed.

Implementation Steps

Step 1: Project Setup

Create a new iOS project in Xcode. The basic setup is the same as in the multimodal basics article.
Add the two target screenshots to .xcassets. Here they were added with the names rfa1 and rfa2.
First, prepare a simple screen that executes processing and displays text when a button is tapped.
import SwiftUI
import FoundationModels

struct ContentView: View {
    @State private var text: String = ""

    var body: some View {
        ScrollView {
            VStack(spacing: 16) {
                Text(text)
                    .frame(maxWidth: .infinity, alignment: .leading)
                    .padding()
                Button("Run", action: action1)
            }
        }
    }

    func action1() {
        // Add processing here
    }
}

Step 2: Define a Struct with @Generable

Define the fields you want to extract as a @Generable struct. RingFitResult is defined at the top level of the file (outside ContentView).
Below I document the trial and error from the initial design attempt to the final design.

Initial Design (Version That Didn't Work)

I first tried the following design. The idea was to have the model convert an activity time displayed as "13 minutes 11 seconds" into seconds, and to represent the running distance as an optional value (Double?).
@Generable
struct RingFitResult {
    @Guide(description: "Total activity time in seconds. Convert from minutes and seconds shown on screen (e.g. 8分56秒 = 536, 25分10秒 = 1510)")
    var totalActivitySeconds: Int
    @Guide(description: "Total calories burned as a decimal number in kcal (e.g. 29.48)")
    var caloriesBurned: Double
    @Guide(description: "Total running distance in km as a decimal number. Set to nil if the running distance is not displayed on the screen")
    var runningDistanceKm: Double?
}
After analyzing image 1 (13 minutes 11 seconds, 29.48 kcal, no running distance) 3 times, the same result was returned each time.
Activity time (seconds): 781
Calories burned: 29.48
Running distance: Optional(-1.0)
The calories burned were obtained correctly, but two problems were found with the activity time and running distance.
Problem 1: Misreading of activity time (781 seconds, correct answer is 791 seconds)
781 = 13 × 60 + 1 (calculated as 13 minutes 1 second)
791 = 13 × 60 + 11 (correct answer)
The model misread "11 seconds" as "1 second" and then converted the result to seconds. Delegating both reading and calculation to the model also makes it hard to tell which step introduced the error.
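The arithmetic behind this diagnosis can be checked by decomposing each total back into minutes and seconds. The following is a minimal sketch for illustration only; `splitSeconds` is a hypothetical helper name and is not part of the app code in this article.

```swift
/// Splits a total number of seconds into minutes and seconds.
/// `splitSeconds` is a hypothetical helper used only for this illustration.
func splitSeconds(_ total: Int) -> (minutes: Int, seconds: Int) {
    let (minutes, seconds) = total.quotientAndRemainder(dividingBy: 60)
    return (minutes, seconds)
}

let returned = splitSeconds(781)   // value the model returned
let expected = splitSeconds(791)   // value shown on the screen

print("returned: \(returned.minutes) min \(returned.seconds) sec") // returned: 13 min 1 sec
print("expected: \(expected.minutes) min \(expected.seconds) sec") // expected: 13 min 11 sec
```

Both totals share the same minutes value and differ only in the seconds, which is consistent with a misread of the seconds digits rather than a multiplication error.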
Problem 2: Failure to generate nil for running distance (Optional(-1.0) was returned)
Double? can represent nil at the type level, but the model seems to have a strong tendency to "return a number in numerical contexts," trying to substitute with -1.0 or 0.0.
In addition, resolving problems 1 and 2 required rewriting the description several times, and because it was written in English I found it hard to verify that the intended meaning was conveyed, which made each adjustment costly. Even with prohibitive phrases such as NEVER return -1, it was hard to get a feel for how well the model would follow them. For the final design I therefore switched to Japanese descriptions and also checked whether that would affect accuracy.

Final Design (Version That Worked)

To address problems 1 and 2 and the high cost of adjusting English descriptions, I changed the design as follows.
@Generable
struct RingFitResult {
    @Guide(description: "活動時間の「分」の部分のみを整数で（例：'13分11秒'なら13）")
    var activityMinutes: Int
    @Guide(description: "活動時間の「秒」の部分のみを整数で、0〜59の範囲（例：'13分11秒'なら11）")
    var activitySeconds: Int
    @Guide(description: "合計消費カロリーをkcal単位の小数で（例：29.48）")
    var caloriesBurned: Double
    @Guide(description: "走行距離が数値で表示されていればtrue、'－'または表示なしならfalse")
    var runningDistanceAvailable: Bool
    @Guide(description: "走行距離をkm単位の小数で。runningDistanceAvailableがtrueのときのみ有効")
    var runningDistanceKm: Double
}
Here is a summary of the three key points of the design change.
Separate reading from calculation
totalActivitySeconds was removed and split into two properties: activityMinutes and activitySeconds. The model is only responsible for reading the numbers, while the conversion to seconds (minutes × 60 + seconds) is performed on the app side. By explicitly stating the range constraint 0 to 59 in the description, the risk of misreading 11 as 1 is also reduced.
Bool + Double separation is more stable than Optional<Double>
runningDistanceKm: Double? was removed and split into the pair runningDistanceAvailable: Bool and runningDistanceKm: Double. The model can make more stable judgments with the binary choice of Bool than by generating nil. On the app side, when runningDistanceAvailable is false, it is treated as nil.
Note that I also tried several versions with prompt adjustments to fix the activity time problem, but Optional(0.0) or Optional(-1.0) still came back for the running distance. Even adding prohibitive text like NEVER return -1 to the description was not stable, so Bool separation proved effective.
Japanese is fine for @Guide description
In the multimodal basics article, I recommended writing descriptions in English, following the sample in the official documentation. This time, I confirmed that Japanese descriptions with the same content as the English version achieved equivalent accuracy for both images across 3 attempts each. If you prioritize code readability, writing in Japanese is not a problem. However, this verification centered on simple numerical reading tasks, so differences may still appear in cases that require more complex conditional branching or abstract judgment.
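The app-side half of this design (conversion to seconds and mapping the Bool + Double pair back to an optional) can be sketched as follows. This is a minimal sketch under stated assumptions: `ExtractedFields`, `ExerciseRecord`, and `makeRecord` are hypothetical names, `ExtractedFields` is a plain mirror of RingFitResult without @Generable so that the sketch runs without FoundationModels, and rejecting negative minutes is an extra check that the @Guide descriptions do not state.

```swift
/// Plain mirror of the fields in the final RingFitResult design.
/// (Hypothetical type: @Generable is omitted so the sketch runs without FoundationModels.)
struct ExtractedFields {
    var activityMinutes: Int
    var activitySeconds: Int
    var caloriesBurned: Double
    var runningDistanceAvailable: Bool
    var runningDistanceKm: Double
}

/// App-side record after conversion (hypothetical type).
struct ExerciseRecord: Equatable {
    var totalSeconds: Int
    var caloriesBurned: Double
    var runningDistanceKm: Double?
}

/// Converts the raw fields to a record.
/// Returns nil when seconds fall outside 0...59 or minutes are negative.
func makeRecord(from fields: ExtractedFields) -> ExerciseRecord? {
    guard fields.activityMinutes >= 0,
          (0...59).contains(fields.activitySeconds) else {
        return nil
    }
    return ExerciseRecord(
        totalSeconds: fields.activityMinutes * 60 + fields.activitySeconds,
        caloriesBurned: fields.caloriesBurned,
        // The Double is only meaningful when the Bool says the field is shown.
        runningDistanceKm: fields.runningDistanceAvailable ? fields.runningDistanceKm : nil
    )
}

// Values corresponding to image 1; the distance value is a placeholder because the flag is false.
let image1 = ExtractedFields(activityMinutes: 13, activitySeconds: 11,
                             caloriesBurned: 29.48,
                             runningDistanceAvailable: false, runningDistanceKm: 0)
if let record = makeRecord(from: image1) {
    print(record.totalSeconds)              // 791
    print(record.runningDistanceKm == nil)  // true
}
```

Validating the range on the app side means an out-of-range reading is surfaced as a failure instead of silently producing a wrong total.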

Step 3: Analyze Image 1

Add processing to action1() to analyze image 1.
func action1() {
    guard SystemLanguageModel.default.isAvailable else {
        text = "Apple Intelligence is not available"
        return
    }
    let session = LanguageModelSession()

    Task {
        let uiImage = UIImage(named: "rfa1")

        // Convert UIImage → CGImage (UIImage cannot be passed directly to Attachment)
        guard let cgImage = uiImage?.cgImage else {
            text = "Failed to load image"
            return
        }

        do {
            let response = try await session.respond(
                generating: RingFitResult.self
            ) {
                "リングフィットアドベンチャーのリザルト画面です。各フィールドの値を取り出してください。"
                Attachment(cgImage)
            }
            let result = response.content
            let totalSeconds = result.activityMinutes * 60 + result.activitySeconds
            let distance: Double? = result.runningDistanceAvailable ? result.runningDistanceKm : nil
            print("Activity time (seconds): \(totalSeconds)")
            print("Calories burned: \(result.caloriesBurned)")
            print("Running distance: \(distance.map { "\($0) km" } ?? "None")")
            text = """
            Activity time (seconds): \(totalSeconds)
            Calories burned: \(result.caloriesBurned) kcal
            Running distance: \(distance.map { "\($0) km" } ?? "None")
            """
        } catch {
            text = "Error: \(error.localizedDescription)"
            print("Error: \(error)\n\(String(reflecting: error))")
        }
    }
}
The analysis results for image 1 are as follows. The same values were returned all 3 times.
Activity time (seconds): 791
Calories burned: 29.48 kcal
Running distance: None
activityMinutes: 13 and activitySeconds: 11 were read, and the app converted them to 13 × 60 + 11 = 791 seconds. It was also confirmed that runningDistanceAvailable: false correctly treated the running distance as "None."

Step 4: Analyze Image 2

Add action2() to ContentView and verify the behavior with image 2, which displays the running distance field. The structure is almost identical to action1(); the duplicated code is refactored in Step 5.
func action2() {
    guard SystemLanguageModel.default.isAvailable else {
        text = "Apple Intelligence is not available"
        return
    }
    let session = LanguageModelSession()

    Task {
        let uiImage = UIImage(named: "rfa2")

        guard let cgImage = uiImage?.cgImage else {
            text = "Failed to load image"
            return
        }

        do {
            let response = try await session.respond(
                generating: RingFitResult.self
            ) {
                "リングフィットアドベンチャーのリザルト画面です。各フィールドの値を取り出してください。"
                Attachment(cgImage)
            }
            let result = response.content
            let totalSeconds = result.activityMinutes * 60 + result.activitySeconds
            let distance: Double? = result.runningDistanceAvailable ? result.runningDistanceKm : nil
            print("Activity time (seconds): \(totalSeconds)")
            print("Calories burned: \(result.caloriesBurned)")
            print("Running distance: \(distance.map { "\($0) km" } ?? "None")")
            text = """
            Activity time (seconds): \(totalSeconds)
            Calories burned: \(result.caloriesBurned) kcal
            Running distance: \(distance.map { "\($0) km" } ?? "None")
            """
        } catch {
            text = "Error: \(error.localizedDescription)"
            print("Error: \(error)\n\(String(reflecting: error))")
        }
    }
}
The analysis results for image 2 are as follows. The same values were returned all 3 times.
Activity time (seconds): 1586
Calories burned: 104.68 kcal
Running distance: 1.02 km
activityMinutes: 26 and activitySeconds: 26 were read, and the app converted them to 26 × 60 + 26 = 1586 seconds. runningDistanceAvailable: true was set, and the running distance of 1.02 km was also accurately obtained.
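The "same values all 3 times" check used throughout this verification can also be automated. The following is a minimal sketch, assuming the per-run values have already been collected into an array; `isStable` is a hypothetical helper name.

```swift
/// Returns true when every run produced the same value.
/// `isStable` is a hypothetical helper; an empty array counts as not stable.
func isStable<T: Equatable>(_ runs: [T]) -> Bool {
    guard let first = runs.first else { return false }
    return runs.allSatisfy { $0 == first }
}

// Total seconds from three runs on image 2 (values from the results above).
let finalDesignRuns = [1586, 1586, 1586]
print(isStable(finalDesignRuns))   // true

// A made-up run set in which one attempt misreads the seconds.
let mixedRuns = [1586, 1586, 1566]
print(isStable(mixedRuns))         // false
```

Identical output across runs only shows that the result is repeatable; whether it is correct still has to be checked against the screenshot.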

Step 5: Availability Check and Fallback Processing

The Foundation Models multimodal feature requires an Apple Intelligence-compatible device running iOS 27 or later. A production app needs a fallback to Vision.framework OCR for devices or settings where the feature cannot be used.
The check is performed in three layers.
func analyzeImage(named imageName: String) async -> RingFitResult? {
    // ① Attachment API requires iOS 27 or later. Devices on earlier versions fall back to Vision
    guard #available(iOS 27, *) else {
        return await fallbackToVisionOCR(named: imageName)
    }

    // ② Fall back if Apple Intelligence is disabled or model is not downloaded
    //    * isAvailable does not confirm the readiness state of Vision sub-models,
    //      so runtime errors are caught in the catch block
    guard SystemLanguageModel.default.isAvailable else {
        return await fallbackToVisionOCR(named: imageName)
    }

    guard let cgImage = UIImage(named: imageName)?.cgImage else {
        return nil
    }

    // ③ Attempt analysis with Foundation Models
    do {
        let session = LanguageModelSession()
        let response = try await session.respond(
            generating: RingFitResult.self
        ) {
            "リングフィットアドベンチャーのリザルト画面です。各フィールドの値を取り出してください。"
            Attachment(cgImage)
        }
        return response.content
    } catch {
        // Runtime error such as Vision sub-model not loaded → fall back to Vision
        print("Foundation Models failed, falling back to Vision: \(error)")
        return await fallbackToVisionOCR(named: imageName)
    }
}

// Existing OCR processing using Vision.framework
func fallbackToVisionOCR(named imageName: String) async -> RingFitResult? {
    // Existing implementation (NSEasyConnect OCR processing)
    return nil
}
The role of each layer is summarized below.

| Layer | What it checks | Cases covered |
| --- | --- | --- |
| `#available(iOS 27, *)` | Whether the `Attachment` API exists | Devices on iOS 26 or earlier |
| `isAvailable` | Readiness of the text generation model | Apple Intelligence disabled or model not downloaded |
| `catch` | Runtime errors | Vision sub-model not loaded, etc. |

Refactoring action1() and action2() to call this function allows the duplicate code to be consolidated.
func action1() {
    Task {
        if let result = await analyzeImage(named: "rfa1") {
            let totalSeconds = result.activityMinutes * 60 + result.activitySeconds
            let distance: Double? = result.runningDistanceAvailable ? result.runningDistanceKm : nil
            text = """
            Activity time (seconds): \(totalSeconds)
            Calories burned: \(result.caloriesBurned) kcal
            Running distance: \(distance.map { "\($0) km" } ?? "None")
            """
        }
    }
}
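After this refactoring, the text formatting is still repeated in each action. One way to consolidate it is a small formatting function, sketched below; `summaryText` is a hypothetical helper name, and it takes already-converted values so that it does not depend on FoundationModels.

```swift
/// Builds the display text from already-converted values.
/// `summaryText` is a hypothetical helper name.
func summaryText(totalSeconds: Int, caloriesBurned: Double, distanceKm: Double?) -> String {
    let distanceText = distanceKm.map { "\($0) km" } ?? "None"
    return """
    Activity time (seconds): \(totalSeconds)
    Calories burned: \(caloriesBurned) kcal
    Running distance: \(distanceText)
    """
}

// Values corresponding to image 2.
print(summaryText(totalSeconds: 1586, caloriesBurned: 104.68, distanceKm: 1.02))
```

With a helper like this, each action only needs to pass its image name and assign the returned string to text.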

Full source code for Steps 1–5

import SwiftUI
import FoundationModels

@Generable
struct RingFitResult {
    @Guide(description: "活動時間の「分」の部分のみを整数で（例：'13分11秒'なら13）")
    var activityMinutes: Int
    @Guide(description: "活動時間の「秒」の部分のみを整数で、0〜59の範囲（例：'13分11秒'なら11）")
    var activitySeconds: Int
    @Guide(description: "合計消費カロリーをkcal単位の小数で（例：29.48）")
    var caloriesBurned: Double
    @Guide(description: "走行距離が数値で表示されていればtrue、'－'または表示なしならfalse")
    var runningDistanceAvailable: Bool
    @Guide(description: "走行距離をkm単位の小数で。runningDistanceAvailableがtrueのときのみ有効")
    var runningDistanceKm: Double
}

struct ContentView: View {
    @State private var text: String = ""

    var body: some View {
        ScrollView {
            VStack(spacing: 16) {
                Text(text)
                    .frame(maxWidth: .infinity, alignment: .leading)
                    .padding()
                Button("Analyze Image 1", action: action1)
                Button("Analyze Image 2", action: action2)
            }
        }
    }

    func action1() {
        Task {
            if let result = await analyzeImage(named: "rfa1") {
                let totalSeconds = result.activityMinutes * 60 + result.activitySeconds
                let distance: Double? = result.runningDistanceAvailable ? result.runningDistanceKm : nil
                text = """
                Activity time (seconds): \(totalSeconds)
                Calories burned: \(result.caloriesBurned) kcal
                Running distance: \(distance.map { "\($0) km" } ?? "None")
                """
            }
        }
    }

    func action2() {
        Task {
            if let result = await analyzeImage(named: "rfa2") {
                let totalSeconds = result.activityMinutes * 60 + result.activitySeconds
                let distance: Double? = result.runningDistanceAvailable ? result.runningDistanceKm : nil
                text = """
                Activity time (seconds): \(totalSeconds)
                Calories burned: \(result.caloriesBurned) kcal
                Running distance: \(distance.map { "\($0) km" } ?? "None")
                """
            }
        }
    }

    func analyzeImage(named imageName: String) async -> RingFitResult? {
        guard #available(iOS 27, *) else {
            return await fallbackToVisionOCR(named: imageName)
        }
        guard SystemLanguageModel.default.isAvailable else {
            return await fallbackToVisionOCR(named: imageName)
        }
        guard let cgImage = UIImage(named: imageName)?.cgImage else {
            return nil
        }
        do {
            let session = LanguageModelSession()
            let response = try await session.respond(
                generating: RingFitResult.self
            ) {
                "リングフィットアドベンチャーのリザルト画面です。各フィールドの値を取り出してください。"
                Attachment(cgImage)
            }
            return response.content
        } catch {
            print("Foundation Models failed, falling back to Vision: \(error)")
            return await fallbackToVisionOCR(named: imageName)
        }
    }

    func fallbackToVisionOCR(named imageName: String) async -> RingFitResult? {
        // Existing implementation (NSEasyConnect OCR processing)
        return nil
    }
}

Comparison with Vision.framework

NSEasyConnect had implemented similar processing using Vision.framework text recognition. Here is a summary of the differences between the two approaches.

| | Vision.framework (OCR) | Foundation Models (Multimodal) |
| --- | --- | --- |
| Implementation cost | Text recognition → value conversion → correction logic required | Only `@Generable` struct definition needed |
| Time conversion | Custom implementation required to convert "13 min 11 sec" → seconds | Read minutes and seconds as separate properties, calculation performed on app side |
| Handling misrecognition | Correction heuristics needed for confusions like 1 and 7 | Context understanding reduces misrecognition |
| Optional handling | Running distance field presence must be determined manually | Stable generation achieved by separating into `Bool + Double` |
| Operating environment | All devices, offline | Requires Apple Intelligence-compatible device |
| Processing speed | Fast | Takes several seconds |

Vision.framework runs quickly on all devices, but it requires custom code to convert the recognized text into numerical values and to correct misrecognitions. The Foundation Models multimodal feature is limited to Apple Intelligence-compatible devices, but its appeal is that structured data can be extracted simply by defining a struct.
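To give a sense of the custom conversion the OCR route needs, here is a minimal sketch that pulls minutes and seconds out of a recognized string such as "13分11秒". It is an illustration only, not the NSEasyConnect implementation: `parseActivityTime` is a hypothetical helper, the input string is assumed to be OCR output, and the sketch does not attempt to correct confusions such as 1 and 7.

```swift
import Foundation

/// Extracts minutes and seconds from OCR text such as "13分11秒".
/// `parseActivityTime` is a hypothetical helper, not the NSEasyConnect implementation.
func parseActivityTime(_ ocrText: String) -> (minutes: Int, seconds: Int)? {
    // Digits before 分 (minutes) and 秒 (seconds), allowing optional spaces.
    let pattern = #"(\d+)\s*分\s*(\d+)\s*秒"#
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return nil }
    let range = NSRange(ocrText.startIndex..., in: ocrText)
    guard let match = regex.firstMatch(in: ocrText, range: range),
          let minutesRange = Range(match.range(at: 1), in: ocrText),
          let secondsRange = Range(match.range(at: 2), in: ocrText),
          let minutes = Int(ocrText[minutesRange]),
          let seconds = Int(ocrText[secondsRange]),
          (0...59).contains(seconds) else {
        return nil
    }
    return (minutes, seconds)
}

if let time = parseActivityTime("13分11秒") {
    print(time.minutes * 60 + time.seconds)   // 791
}
```

Even this small parser needs a pattern, range checks, and failure handling for one field, which is the kind of per-field work the @Generable approach hands to the model.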

Summary

By combining the Foundation Models multimodal feature with @Generable, I was able to extract exercise records from Ring Fit Adventure result screens as structured data.
I had to revisit the @Generable design before getting the expected results. The insights gained are summarized below.
Separating reading from calculation is more stable. When I tried to have the model convert activity time to seconds, misreading and calculation errors compounded. It is better to have the model only read minutes and seconds, with calculations performed on the app side.
A Bool + Double pair is more stable than Optional<Double>. Having the model generate nil causes it to try substituting with -1.0 or 0.0. Separating the existence check into a Bool yielded stable results.
@Guide description written in Japanese achieved equivalent accuracy. Japanese can be used if code readability is a priority.
Implementation cost was dramatically reduced compared to Vision.framework OCR. Misrecognition correction heuristics are also no longer needed.
I found that the 3B model on the iPhone 16e is sufficient for this level of image analysis. For use cases that do not need to support non-Apple Intelligence devices, the Foundation Models multimodal feature feels like a strong candidate. I hope this serves as a reference for anyone attempting a similar verification.

References

Analyzing images with multimodal prompting | Apple Developer Documentation
Foundation Models | Apple Developer Documentation