
A GitHub Action has been added to automatically evaluate LLM accuracy in CI with Langfuse
This page has been translated by machine translation. View original
This is Suenaga from the Retail App Co-Creation Division.
A GitHub Action called langfuse/experiment-action has been released from Langfuse, so I gave it a try.
This time, I used the TypeScript SDK and Vercel AI SDK to call Claude, and set up a CI that evaluates a small multimodal agent that guesses country names from images. I also set up a configuration to manage prompts with Git and reflect changes to Langfuse's prompts when merging to main.
The verification code used this time can be found here.
What experiment-action can do
The action handles all of the following for you:
This context is created by the GitHub Action and handles the CI-specific setup for you:
- initializes the Langfuse SDK client from the action inputs
- loads the dataset items from
dataset_nameand appliesdataset_version- adds default metadata under
langfuse.*, such as commit SHA, branch, job URL, and actor.
So all you need to write on the user side is:
- The function to evaluate (task)
- The scoring function (evaluator)
- The exception to throw when below the threshold (
RegressionError)
Just these three things. The task is simply a function that receives each dataset item, calls the LLM, and returns the output.

Trying it out in practice
The subject is a task of "passing three illustrations that evoke Japan, the United States, and France, and having the model answer the country name in one word."
The Dataset is created in advance from a TypeScript script. The relative path to the image is put in the input, and the image files themselves are placed in the repository.
// scripts/seed-dataset.ts
const items = [
{ input: { question: QUESTION, imagePath: "assets/japan.png" }, expectedOutput: "日本" },
{ input: { question: QUESTION, imagePath: "assets/usa.png" }, expectedOutput: "アメリカ" },
{ input: { question: QUESTION, imagePath: "assets/france.png" }, expectedOutput: "フランス" },
];
await langfuse.api.datasets.create({ name: "country-from-image-dataset" });
for (const item of items) {
await langfuse.api.datasetItems.create({
datasetName: "country-from-image-dataset",
...item,
});
}
The actual test images look like this.

And this is how they appear in the Langfuse UI.

Here is the function to evaluate (task). It reads the image from a file, sends it to the LLM, and returns the resulting text as output.
// experiments/support-agent-gate.ts
async function classifyCountry(input: { question: string; imagePath: string }) {
const [imageBytes, systemPrompt] = await Promise.all([
readFile(resolve(REPO_ROOT, input.imagePath)),
readFile(PROMPT_FILE, "utf-8"),
]);
const { text } = await generateText({
model: anthropic("claude-haiku-4-5-20251001"),
system: systemPrompt.trim(),
messages: [
{
role: "user",
content: [
{ type: "text", text: input.question },
{ type: "file", data: imageBytes, mediaType: "image/png" },
],
},
],
experimental_telemetry: { isEnabled: true },
});
return text.trim();
}
For scoring, I prepared two metrics: exact_match (exact match) and contains_expected (whether the expected value is included), and used the average of contains_expected (avg_accuracy) as the gate. If avg_accuracy falls below the threshold, a RegressionError is thrown to fail the CI. Since there are only 3 questions this time, the threshold is set to 1.0, making it a strict setting where "getting even one question wrong means red."
export async function experiment(context: RunnerContext) {
const result = await context.runExperiment({
name: "PR gate: country-from-image",
task: classifyTask,
evaluators: [exactMatch, containsExpected],
runEvaluators: [avgAccuracy],
});
const accuracy = result.runEvaluations.find(
(e) => e.name === "avg_accuracy",
)?.value;
if (typeof accuracy !== "number" || accuracy < THRESHOLD) {
throw new RegressionError({
result,
metric: "avg_accuracy",
value: typeof accuracy === "number" ? accuracy : 0,
threshold: THRESHOLD,
});
}
return result;
}
On the workflow side, all you need to do is call the action. The action tag is pinned by SHA to match the official README.
# .github/workflows/langfuse-experiment.yml (excerpt)
- uses: langfuse/experiment-action@887e7936bdf64a2197aa7dcfdc8a9e4afd85e229 # v1.0.3
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
with:
langfuse_public_key: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
langfuse_secret_key: ${{ secrets.LANGFUSE_SECRET_KEY }}
langfuse_base_url: https://jp.cloud.langfuse.com
experiment_path: experiments/support-agent-gate.ts
dataset_name: country-from-image-dataset
github_token: ${{ github.token }}
When a PR is created, the workflow runs and posts the score along with a link to the Langfuse experiment comparison view on the PR like this.

For example, for the image of the United States this time, the model returned "アメリカ合衆国" (United States of America). exact_match becomes 0, but since contains_expected is 1, avg_accuracy remains 1.0 and passes the gate. LLM output tends to have variations in expression, so using only exact match as a metric can cause CI to fail unintentionally. I configured it so that exact match is kept for observation purposes, while partial match is used for the gate.
Managing prompts with Git and auto-reflecting them on main merge
This is something many people may already be doing, but since I had the opportunity, I also set up prompts/country-classifier.md to be managed with Git, and configured it to automatically reflect to Langfuse's latest label when merged to main.
The mechanism is simple: it just calls a script from the workflow that calls langfuse.prompt.create({ name, prompt: body, labels: ["latest"] }) on push to main.
# .github/workflows/langfuse-prompt-promote.yml (excerpt)
on:
push:
branches: [main]
paths:
- "prompts/**"
jobs:
promote:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6
- uses: pnpm/action-setup@v4
- uses: actions/setup-node@v6
with: { node-version: "24", cache: "pnpm" }
- run: pnpm install --frozen-lockfile
- env:
LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY }}
LANGFUSE_BASE_URL: https://jp.cloud.langfuse.com
run: pnpm tsx scripts/promote-prompt.ts
Since the task on the experiment-action side also reads from the same prompts/country-classifier.md, experiments on the PR run with the "post-change prompt" as-is, and after merging, the latest version in Langfuse switches over.

Closing thoughts
Being able to review accuracy in CI while checking scores is something many people have been waiting for, so this is a welcome update. With the addition of the Japan region as well, there have been exciting updates recently and it's been fun to keep up with them.
See you 👋