A German Online Bank Improved Payment Trouble SLA from 48% to 85% by "Not Embedding" AI into the UI - AWS Summit Hamburg 2026 Report

2026.05.22

This is Konishi from the Berlin office.

This is the second session report from AWS Summit Hamburg 2026. This time I'll summarize the session "Zero Human Touch: How N26 Automated Payment Disputes with AI" by N26, a neobank based in Berlin.

Carolina Tavares (Lead Product Manager, Cards) and Alexander (Lead Software Engineer) from N26 took the stage and spoke about AI automation of chargeback processing from both a product and technical perspective.

The talk was very stimulating, not least because it covered the organizational hurdles of deploying AI to production in a heavily regulated financial industry.

What is N26

N26 is a neobank (mobile-only bank) founded in Berlin, Germany. Operating primarily in Europe, it had approximately 4.8 million active customers as of the end of 2024, with an annual transaction volume of around 140 billion euros.

When I first moved to Germany, this was essentially the only option for a "German online banking app you could use in English." The UI is polished and I still rely on it heavily today.

The Problem: Chargeback Processing Was Overwhelmed

What is a Chargeback

A chargeback is the process for handling a disputed card payment. There are broadly two types.

  • Unauthorized: Fraudulent use of a card. What you'd call fraud. The criteria for judgment are relatively clear
  • Authorized: Cases where the customer themselves approved the payment, but something went wrong. For example, "I ordered a chair online but it never arrived"

On a personal note, I once withdrew cash from a street ATM in front of a local supermarket, and that same night nearly 100,000 yen was fraudulently withdrawn from my account from some foreign country. I immediately contacted N26's chat support, and they handled the refund and other responses within one business day — I was deeply impressed (I don't think it was AI-automated back then).

This session is about automating authorized chargebacks, that is, cases where the customer made the payment themselves but something went wrong. Apparently this type is more complex in terms of verifying evidence and making judgments, making it harder to automate.

The Situation Before Implementation

As of December 2024, N26's operations team was buried in a backlog.

When a user filed a dispute through the app, the case would pile up in the backlog and sit there until an analyst manually reviewed it. The analyst would first check the submitted documents, translate any non-English documents, cross-reference them against Mastercard's criteria, and if documents were missing, ask the customer to resubmit — going back and forth multiple times.

As a result, customers waited for days without knowing whether they would get their money back. A poor experience at the very moment people most need to rely on their bank is a serious problem, so an improvement project was launched.

The goal was to automate 70% of key processes end-to-end. The aim was to ensure scalability so that the operations team wouldn't need to grow proportionally as the user base increased.

Starting with Domain Understanding

Alex's team was originally responsible for the chargeback UI submission flow and some automated rules (for fraud cases). However, covering complex judgments like authorized chargebacks with rule-based approaches has its limits.

So the first thing they tackled was a domain understanding session with all stakeholders. They gathered Backend, AI/Data Science, and the operations team analysts who manually process chargebacks every day, and carried out flow mapping and use case identification.

From this, the system was organized into four subdomains.

  1. Customer Dispute Journey: The flow by which customers apply for a chargeback and upload evidence through the app
  2. Workflow Automation: Orchestration that transitions chargeback submissions through various states (received, under review, completed, etc.)
  3. AI Decisioning: Evaluates submitted evidence against Mastercard's criteria and returns a structured judgment result
  4. Back Office: Case history management, audit trails, operational visibility

This was organized using Domain-Driven Design (DDD) context maps. The relationships between each context (Customer-Supplier, Published Language, Anti-Corruption Layer, etc.) were also defined, and I got the impression they spent a substantial amount of time on the design.
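To make the Workflow Automation subdomain a little more concrete, here is a minimal sketch of my own (not N26's code) of a state machine that only permits explicitly listed transitions. The session named states such as received, under review, and completed; the `AWAITING_CUSTOMER` state and the transition table itself are assumptions added for illustration.

```python
from enum import Enum


class State(Enum):
    RECEIVED = "received"
    UNDER_REVIEW = "under_review"
    AWAITING_CUSTOMER = "awaiting_customer"  # hypothetical state, not from the talk
    COMPLETED = "completed"


# Allowed transitions (assumed); any move not listed here is rejected.
ALLOWED = {
    State.RECEIVED: {State.UNDER_REVIEW},
    State.UNDER_REVIEW: {State.AWAITING_CUSTOMER, State.COMPLETED},
    State.AWAITING_CUSTOMER: {State.UNDER_REVIEW},
    State.COMPLETED: set(),
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the table explicit means every state change a case can go through is visible in one place, which fits the audit-trail role of the Back Office subdomain.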

How AI Was Integrated — Why It Wasn't Built Directly into the UI

When thinking about automating chargeback processing with AI, the first idea that comes to mind is having customers interact with AI directly in the app to resolve issues. Alex mentioned that "there was a temptation to try plugging AI directly into the UI submission flow."

However, N26 is a regulated financial institution. Feeding customers' multilingual free text directly into an LLM for judgment carries high risk, and auditing the output is difficult. What N26 chose instead was an approach of drawing a clear boundary between the customer and AI, and integrating AI as a decision engine within a structured workflow.

What the AI receives is the following three things:

  • Transaction information for the disputed payment
  • Documents uploaded by the customer
  • Responses to the chargeback questionnaire

In addition to these, the AI is given Mastercard's chargeback criteria and a list of evidence required to satisfy each criterion. The AI cross-references these and returns one of four decisions.

  1. Accept: All evidence meets Mastercard criteria → Proceed to refund processing
  2. Reject: Meets rejection criteria → Notify customer with details of the reason
  3. Feedback Required: Additional documents needed → Ask customer to resubmit with specific documents specified
  4. Human Specialist: Case is too complex → Escalate to a human analyst

Rather than having the LLM generate free text, the design has it output structured decision results. This is in line with the financial regulatory environment where auditability is required.
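As a rough illustration of what "structured decision results" could look like on the receiving side, the sketch below parses a JSON response into one of the four decisions. The field names (`decision`, `reason`, `missing_documents`) and the rule that anything unparseable is escalated to a human are my assumptions, not details from the session.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Decision(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    FEEDBACK_REQUIRED = "feedback_required"
    HUMAN_SPECIALIST = "human_specialist"


@dataclass(frozen=True)
class DecisionResult:
    decision: Decision
    reason: str
    missing_documents: Tuple[str, ...] = ()


def parse_decision(raw: str) -> DecisionResult:
    """Parse model output; escalate to a human on anything unexpected (assumed policy)."""
    try:
        data = json.loads(raw)
        decision = Decision(data["decision"])
        reason = str(data["reason"])
        missing = tuple(str(d) for d in data.get("missing_documents", []))
    except (ValueError, KeyError, TypeError, AttributeError):
        return DecisionResult(Decision.HUMAN_SPECIALIST, "unparseable model output")
    # A feedback request is only actionable if it names the documents to resubmit.
    if decision is Decision.FEEDBACK_REQUIRED and not missing:
        return DecisionResult(Decision.HUMAN_SPECIALIST, "feedback requested without documents")
    return DecisionResult(decision, reason, missing)
```

Because the set of possible outcomes is closed, the backend can branch on the enum alone and never has to interpret free text from the model.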

Ensuring Safety

Because N26 operates in a regulated industry, safety is ensured through three layers.

AI Guardrails: Validation rules are applied before decision requests reach the AI. Customer free text input and inappropriate documents are filtered here.

Controlled Rollout: Rather than rolling out to all customers at once, deployment is gradual, starting from specific subsets and specific chargeback types. The aim is to limit the blast radius (scope of impact) if problems arise.

Traceability: For all chargeback submissions, model output, submitted evidence, AI reasoning process, and model version are recorded. This serves not only for audit compliance, but also as a feedback loop for model improvement.
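The controlled rollout and traceability layers can be sketched roughly as follows. This is my own illustration under stated assumptions: the hash-based percentage bucket is a common way to pick a stable customer subset, not something N26 described, and the record field names are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def in_rollout(customer_id: str, chargeback_type: str,
               enabled_types: set, percent: int) -> bool:
    """Deterministically select a stable subset of customers for enabled types."""
    if chargeback_type not in enabled_types:
        return False
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # value in 0..99
    return bucket < percent


def trace_record(case_id: str, model_version: str, evidence_ids: list,
                 reasoning: str, output: dict) -> str:
    """Serialize one audit entry as a single JSON line (field names assumed)."""
    return json.dumps({
        "case_id": case_id,
        "model_version": model_version,
        "evidence_ids": evidence_ids,
        "reasoning": reasoning,
        "output": output,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }, sort_keys=True)
```

Hashing the customer ID keeps each customer in the same bucket across requests, so raising `percent` only ever adds customers to the rollout.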

Architecture

What's technically interesting is the loosely coupled design between the backend and the AI decision engine.

When a customer submits a chargeback through the N26 app, the Kotlin backend receives it. Workflow Automation manages state transitions, and when it determines that AI processing is needed, it puts a request into an SQS queue.

A Lambda function picks this up and runs the decision against a model on Amazon Bedrock. (Note: the AWS portion at the end of the session mentioned Anthropic's Opus model, but as far as I could hear, N26's side didn't explicitly name the model.)

The structured result is returned to an SQS queue, and the Kotlin backend picks it up and continues business processing. Customer data is referenced from an S3 bucket.

By using asynchronous communication, the design achieves loose coupling, enables retry handling, and allows the backend to retain control of the business process while leveraging AI's decision-making capabilities. Even if the AI service goes down, the backend is relatively unaffected.
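The queue-in, queue-out flow described above might look something like the following sketch of an SQS-triggered Lambda handler. It is an assumption-laden illustration, not N26's implementation: `decide` stands in for the Bedrock model call and `send_result` for publishing to the result queue, both injected so the example runs without AWS access, and `case_id` is a hypothetical field name. Returning `batchItemFailures` only has an effect when partial batch responses are enabled on the event source mapping.

```python
import json
from typing import Callable


def make_handler(decide: Callable[[dict], dict],
                 send_result: Callable[[str], None]):
    """Build a Lambda-style handler for SQS-delivered decision requests.

    `decide` and `send_result` are hypothetical injection points standing in
    for the model call and the result-queue publish.
    """
    def handler(event: dict, context=None) -> dict:
        failures = []
        for record in event.get("Records", []):
            try:
                request = json.loads(record["body"])
                result = decide(request)
                send_result(json.dumps({
                    "case_id": request["case_id"],
                    "result": result,
                }))
            except Exception:
                # Report only this message as failed so SQS redelivers it later.
                failures.append({"itemIdentifier": record["messageId"]})
        return {"batchItemFailures": failures}
    return handler
```

If the model call fails, the message simply becomes visible on the queue again for a retry, and the backend's own state machine is untouched in the meantime.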

Gradual Rollout

N26 divided the model deployment into four phases.

Phase 1: Feasibility Validation

The first step was to validate whether LLMs could be used for evidence analysis and chargeback decisions. The conclusion was: "Yes, they can. However, throwing unstructured data at them directly won't yield useful results. Domain knowledge of chargebacks is also necessary."

Phase 2: Shadow Mode

The AI decision engine runs in the production environment, but results are not applied to actual cases. This is the phase where the model is evaluated and improved alongside human analysts reviewing the output.

Phase 3: Recommender Mode

In this stage, AI decisions and reasoning are presented to analysts, who use them as a reference while making the final decisions themselves. Even this alone improved analyst efficiency.

Phase 4: Live Decisioning

AI decisions are actually applied. However, rather than going all-in at once, the number of cases is gradually increased while monitoring alongside analysts.
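One way to picture the difference between Phases 2 to 4 is as a single switch that controls what happens to the same AI decision. The sketch below is my own simplification: the mode names follow the session, but the flag names (`apply`, `attach_to_case`, `log`) and the idea of one routing function are assumptions.

```python
from enum import Enum


class Mode(Enum):
    SHADOW = "shadow"
    RECOMMENDER = "recommender"
    LIVE = "live"


def route(mode: Mode, ai_decision: str) -> dict:
    """Decide what happens to an AI decision in each rollout mode."""
    if mode is Mode.SHADOW:
        # Recorded for offline evaluation; the live case is left untouched.
        return {"apply": False, "attach_to_case": False, "log": True}
    if mode is Mode.RECOMMENDER:
        # Attached to the case as a suggestion; the analyst makes the final call.
        return {"apply": False, "attach_to_case": True, "log": True}
    # LIVE: applied automatically unless the model asked for a human specialist.
    needs_human = ai_decision == "human_specialist"
    return {"apply": not needs_human, "attach_to_case": needs_human, "log": True}
```

Since every mode logs the decision, the evaluation data collected in shadow mode stays comparable with what is recorded once decisions go live.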

The previous article about Deutsche Bahn also discussed a model of autonomy levels, and N26 is likewise building trust incrementally.

Lessons Learned

Here are the three lessons Alex shared:

Lesson 1: Fix the inputs first. Before implementation, the UI was just a single free text field and file upload, leading to many incomplete submissions and multiple back-and-forth exchanges with analysts. This was changed to dedicated UI flows for each chargeback type, with a format that clearly requests the necessary evidence. If you want AI to make good decisions, improve the quality of inputs before tuning the model.
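The idea behind Lesson 1 can be sketched as a per-type evidence checklist that the submission flow checks before a case is ever created. The chargeback type names and evidence names below are invented for illustration and are not Mastercard's actual criteria or N26's configuration.

```python
# Hypothetical chargeback types and evidence names, for illustration only.
REQUIRED_EVIDENCE = {
    "goods_not_received": {"order_confirmation", "merchant_contact_proof"},
    "duplicate_charge": {"both_transaction_ids"},
}


def missing_evidence(chargeback_type: str, provided: set) -> list:
    """Return the evidence still needed before the case can be submitted."""
    required = REQUIRED_EVIDENCE.get(chargeback_type)
    if required is None:
        raise KeyError(f"unknown chargeback type: {chargeback_type}")
    return sorted(required - provided)
```

For example, `missing_evidence("goods_not_received", {"order_confirmation"})` returns `["merchant_contact_proof"]`, so the app can ask for that one document up front instead of an analyst requesting it days later.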

Lesson 2: Define the AI contract from day one. In response to the temptation mentioned earlier of "let's just try connecting AI directly to the UI for now," N26 clearly defined the request/response structure (AI contract) from the start. The interface was decided upfront: decision requests would include Mastercard criteria and evidence, and responses would include one of four decision results. This allowed the backend team to focus on structuring and submitting data, while the AI/Data Science team focused on model improvement, enabling both to proceed in parallel.

Lesson 3: The hardest part wasn't the technology — it was the organization. Making Backend, AI/Data Science, Platform, and Operations function as a single product team was the greatest challenge. The message "One product team. AI became an engineering capability — not a separate technology" was particularly striking.

Results

Measurable results have been achieved.

  • SLA compliance rate: 48% → 65% (upon Recommender introduction) → over 85% (Full automation)
  • Customer satisfaction: 38% → over 51%
  • End-to-end automation rate: Over 60% of chargebacks processed without human intervention
  • Backlog completely cleared

On customer satisfaction — it was emphasized that the approval rate for chargebacks didn't increase; rather, with the same judgment criteria, satisfaction improved because the process became faster and more transparent.

Closing Thoughts

While the previous Deutsche Bahn case was about AI agent-based infrastructure operations, this one is about integrating AI into the backend of a customer-facing product. The directions are different, but there were many common themes: gradual rollout, collaboration with humans, and an emphasis on traceability.

I'm also writing reports on other sessions from AWS Summit Hamburg, so feel free to check them out.

https://dev.classmethod.jp/tags/aws-summit-hamburg/
