What is a reasoning model and how does it differ from a classical LLM?

A reasoning model is a large language model that runs through an internal thinking phase (chain-of-thought, self-critique, verification) before the final answer and consumes so-called thinking tokens. Classical LLMs answer immediately; reasoning models invest additional compute time (test-time compute) — and thereby gain 20-35 percentage points of accuracy on hard problems. Examples 2026: Claude 4.7 Thinking, OpenAI o4, DeepSeek-R2, Gemini 2.5 Pro Thinking.

Which reasoning model is suitable for Swiss companies?

Three archetypes: Frontier cloud with EU region (Claude 4.7 Thinking via AWS Bedrock eu-central-2 Zurich or Vertex AI EU) for medium sensitivity and maximum quality. Open-source self-hosted (DeepSeek-R2 on Swiss GPU) for FINMA-supervised institutions and healthcare providers. Router architecture (70-85% standard LLM + 15-30% reasoning) as a pragmatic enterprise standard with 60-70% cost savings compared to naive deployment.

What does a reasoning call cost?

Typically CHF 0.05-0.50 per task — 5-40x more expensive than standard LLM. Claude 4.7 Thinking with 12k thinking tokens costs around CHF 0.11. DeepSeek-R2 self-hosted sits at CHF 0.008. Without a router, costs explode; with router, prompt caching and batch mode they drop by 60-70%. From 400,000 reasoning requests per month, a dedicated 2x H100 cluster pays off.

Are thinking tokens subject to audit under the EU AI Act and FINMA?

Yes. Thinking tokens count as input/output of the system under EU AI Act Art. 12 and must be stored across the entire lifetime. FINMA RS 2023/1 requires traceability and 10 years of tamper-evident retention for financial mandates. revDSG Art. 7 mandates AES-256 encryption and role-based access. Important: thinking traces often contain more PII than the answer itself and require the same redaction rules.

When reasoning, when RAG, when classical LLM?

Reasoning for complex decisions and deep analysis with multi-step logic. RAG for company knowledge and fact retrieval with citations. Standard LLM for drafting and summarising tasks. Swiss standard 2026: the RRR pipeline combines all three — Retrieve (RAG), Reason (reasoning model), Respond (standard LLM for the user answer).

What ROI is realistic?

On average 6.1 months payback across 17 mazdek reasoning projects. Zurich private bank: 79% shorter review throughput time, 84% fewer errors, CHF 3.1 M annual saving, payback in 6.2 months. Bern university hospital: 19% fewer misdiagnoses, 2.7x more often correct secondary hypotheses, full on-prem security.

Reasoning Models 2026: Extended Thinking for Swiss Companies

2026 is the year the LLM scaling laws were turned upside down. While pre-training compute has entered a plateau phase, a new axis is exploding: Test-Time Compute. Anthropic's Claude 4.7 with Extended Thinking, OpenAI o4, DeepSeek-R1 and Gemini 2.5 Pro Thinking demonstrate that a model that "thinks" before answering is 20-35 percentage points more accurate on hard problems than the same model without a reasoning loop. The Epoch AI Report 2026 Q1 sizes the market for reasoning API calls at USD 4.8 billion — with 340% growth year-over-year. At mazdek we have completed 17 productive reasoning model deployments across Swiss companies — from insurance claim review through FINMA compliance to clinical diagnostics. This guide shows how our PROMETHEUS agent, ARES, ARGUS and HEPHAESTUS deliver reasoning systems that are revDSG-compliant, Swiss-sovereign and measurably ROI-strong.

What Are Reasoning Models in 2026?

A reasoning model is a large language model that runs through an internal thinking phase before the final answer — chain-of-thought, self-critique, alternative paths, verification. This thinking phase is measured in thinking tokens and consumes compute that, before 2024, was almost exclusively incurred during training, but today arises with every single request. The paradigm is called Test-Time Compute: the more seconds the server spends on the request, the more accurate the answer — a lever classical LLMs never had.

The evolution runs across four generations:

2022-2023: Prompted Chain-of-Thought. Users write "Let's think step by step" in the prompt, GPT-3.5/4 responds with visible intermediate logic — but without a trained reasoning core.
2024: Process-Supervised Reasoning. OpenAI o1-preview introduces trained reasoning — with process reward models that evaluate intermediate steps, not only the final result.
2025: Open-source breakthrough and hybrid modes. DeepSeek-R1 is released under MIT licence, enabling self-hosted reasoning. Claude 3.7 introduces Extended Thinking with dynamic budget.
2026: Reasoning as default. Claude 4.7 can seamlessly switch between fast answer and 32k thinking-token mode. o4 and Gemini 2.5 Pro Thinking follow. Reasoning is no longer a premium feature but the standard production mode for any serious AI workload.

"Test-Time Compute is to the AI industry what JIT compilation was to the software industry — a single lever that redefines an entire performance class. At mazdek in 2026 we see Swiss customers switching from standard LLM to reasoning models reporting 28-42% fewer false positives, 3x faster time-to-insight and measurable quality gains in audit-relevant processes."
— PROMETHEUS, AI & Machine Learning Agent at mazdek

The Paradigm Shift: Training Compute vs. Test-Time Compute

From 2014-2024 the AI industry moved along Scaling Laws by Kaplan and Chinchilla: more parameters, more data, more training GPUs. In 2026 it becomes clear that this axis is flattening. GPT-5 does not have dramatically more parameters than GPT-4, and Llama 4 Maverick is more optimised than massively enlarged. The industry is unlocking performance gains on a different axis:

Dimension	Train-Time Compute (2020-2024)	Test-Time Compute (2024-2026)
Investment	USD 100M-1B per model, one-off	CHF 0.01-0.50 per request, ongoing
Latency	1-2 seconds for every answer	5-90 seconds depending on thinking budget
Accuracy lever	More parameters, more data	More thinking tokens per request
Main user	Model trainer (OpenAI, Anthropic, Google)	End customer at every inference
Scaling	Chinchilla Law: linear with log-compute	Log-scaling: +2x tokens → +4-6% accuracy
Operations model	Fixed budget	Variable budget per workload

Consequence: the ROI lever in 2026 lies with the user, not the provider. Whoever orchestrates reasoning models cleverly spends less money for the same task at higher quality. Whoever deploys them naively burns compute. The architecture decision — how much thinking, for which requests, with what escalation — becomes the new Model Ops discipline.

The Reasoning Model Landscape 2026

The leading reasoning models in 2026 differ significantly in philosophy, price and Swiss fit. Our matrix for Swiss deployments:

Model	Provider	Thinking mode	GPQA Diamond	AIME 2026	SWE-Bench	Swiss fit
Claude 4.7 Thinking	Anthropic	Dynamic 1k-32k tokens	88.4%	94.1%	74.3%	Yes (EU via Bedrock/Vertex)
OpenAI o4	OpenAI	Auto (low/medium/high)	87.1%	96.8%	71.2%	EU region possible
Gemini 2.5 Pro Thinking	Google	Fixed 8k / 24k	83.9%	91.7%	65.8%	Yes (Vertex AI EU)
DeepSeek-R2	DeepSeek (MIT)	Up to 64k (self-hosted)	81.5%	89.2%	62.1%	Yes (100% on-prem)
Qwen 3 Reasoning	Alibaba (Apache 2.0)	Up to 32k self-hosted	76.2%	84.5%	57.9%	Yes (on-prem)
Llama 4 Reasoning	Meta (Community)	Up to 16k self-hosted	72.4%	79.1%	54.3%	Yes (on-prem)
Mistral Magistral	Mistral (Apache)	4k-16k, EU cloud	70.1%	76.4%	51.8%	Yes (EU, France)

For Swiss companies we recommend three archetypes — depending on sensitivity, budget and workload profile:

Frontier cloud with EU region (Claude 4.7 Thinking via AWS Bedrock eu-central-2 Zurich or Vertex AI EU): for medium sensitivity, maximum quality. Ideal for trust companies, law firms, due diligence.
Hybrid with open-source reasoning self-hosted (DeepSeek-R2 on Swiss GPU cluster): for FINMA-supervised institutions and healthcare providers. Full data sovereignty, no API costs, Swiss GPU at Green Geneva or Infomaniak.
Router architecture (frontier + open-source depending on task class): the pragmatic standard. 70% of requests go to a fast standard LLM, 30% escalate to a reasoning model — the mazdek default stack for enterprise.

Reference Architecture: The Swiss-Sovereign Reasoning Stack

Every productive reasoning deployment at mazdek follows a 7-layer architecture. The layers are explicitly decoupled so that individual components can be swapped without re-architecture:

+------------------------------------------------------------+
|  1. Task Layer: IRIS / Slack / Client Portal / n8n flow    |
+-----------------------------+------------------------------+
                              | Natural-language request
                              v
+-----------------------------+------------------------------+
|  2. Intent Router: PROMETHEUS — classifier (~30 ms)        |
|     - simple  -> standard LLM (GPT-5 nano / Claude Haiku)  |
|     - medium  -> thinking mode 2k-4k tokens                |
|     - complex -> thinking mode 8k-16k tokens               |
|     - research-> thinking + multi-agent + tool use         |
+-----------------------------+------------------------------+
                              | Task with tier
                              v
+-----------------------------+------------------------------+
|  3. Reasoning Layer: Claude 4.7 / o4 / DeepSeek-R2         |
|     - Chain-of-thought  - Self-consistency  - Verification |
|     - Tool use inside the thinking loop (code, search)     |
+-----------------------------+------------------------------+
                              | Reasoning + answer
                              v
+-----------------------------+------------------------------+
|  4. Guardrails: ARES — PII redaction, prompt injection     |
|     Output policies · Citation enforcement · Red team      |
+-----------------------------+------------------------------+
                              | Validated answer
                              v
+-----------------------------+------------------------------+
|  5. Observability: ARGUS — Langfuse + OpenTelemetry        |
|     - Thinking token cost  - Latency  - Eval regression    |
|     - Reasoning trace replay for FINMA audit               |
+-----------------------------+------------------------------+
                              | Events + metrics
                              v
+-----------------------------+------------------------------+
|  6. Feedback Loop: ORACLE — post-hoc eval & fine-tune      |
|     - RAGAS / DeepEval  - Human feedback from client portal|
|     - DPO training for domain-specific reasoners           |
+-----------------------------+------------------------------+
                              | Model updates
                              v
+-----------------------------+------------------------------+
|  7. Infrastructure: HEPHAESTUS — Green / Infomaniak CH     |
|     K8s + vLLM + Triton · H100/B100 · ISO-27001 · revDSG   |
+------------------------------------------------------------+

Layer Details

Intent Router: A 30 ms classification, typically a 3B model, decides the thinking tier. Our PROMETHEUS agent maintains this routing logic with productive eval data. In a typical enterprise workload only 15-25% of requests land on the reasoning model — but they produce 60-80% of the quality gain.
Reasoning Layer: The heart. We combine Claude 4.7 Extended Thinking (for deep reasoning) with DeepSeek-R2 (for cost sensitivity, self-hosted). The choice is made per use case and tenant.
Guardrails: ARES inspects both the reasoning and the final answer for PII, hallucinations and prompt-injection traces. Important: thinking-token contents are not automatically visible to the user, but can contain sensitive data — therefore the same redaction rules apply as for the output.
Observability: ARGUS captures every token. A single productive reasoning workflow generates 60-120 MB of reasoning traces per day, which must be stored in a FINMA-compliant way for 18 months. See the LLM observability article.
Feedback Loop: ORACLE runs weekly evals against a gold set and triggers fine-tuning if accuracy drops by more than 2pp.
Infrastructure: HEPHAESTUS operates the stack on Swiss GPU clusters. For self-hosted reasoning we recommend vLLM with continuous batching — reduces cost per thinking token by 45-60% compared to naive serving.

Technical Deep Dive: The Reasoning Loop in Detail

A reasoning model differs mechanically from classical LLM inference. Here is the productive TypeScript code of our PROMETHEUS reasoner for Claude 4.7 Extended Thinking:

import Anthropic from '@anthropic-ai/sdk'
import { trace } from '@opentelemetry/api'
import { classifyIntent } from './router'
import { redactPII } from './ares-guardrails'

const anthropic = new Anthropic({ baseURL: process.env.BEDROCK_EU_ENDPOINT })
const tracer = trace.getTracer('mazdek-prometheus-reasoner')

type Tier = 'simple' | 'medium' | 'complex' | 'research'

const BUDGETS: Record<Tier, number> = {
  simple: 0,       // no thinking
  medium: 4000,
  complex: 12000,
  research: 24000,
}

export async function reason(task: string, ctx: Ctx) {
  return tracer.startActiveSpan('prometheus.reason', async (span) => {
    const tier = await classifyIntent(task, ctx)
    const budget = BUDGETS[tier]
    span.setAttributes({
      'mazdek.tier': tier,
      'mazdek.thinking_budget': budget,
      'mazdek.tenant': ctx.tenantId,
    })

    // No thinking for simple tasks — straight to Haiku
    if (tier === 'simple') {
      return await callFastModel(task)
    }

    const redacted = redactPII(task)

    const response = await anthropic.messages.create({
      model: 'claude-opus-4-7',
      max_tokens: 4096,
      thinking: { type: 'enabled', budget_tokens: budget },
      messages: [{ role: 'user', content: redacted }],
    })

    // Extract thinking block and answer
    const thinking = response.content.find((c) => c.type === 'thinking')
    const answer = response.content.find((c) => c.type === 'text')

    // ARGUS logging — thinking counts toward the audit trail
    await logReasoningTrace({
      traceId: ctx.traceId,
      thinking_tokens: response.usage.thinking_tokens,
      output_tokens: response.usage.output_tokens,
      thinking_content: thinking?.thinking,
      answer: answer?.text,
      cost_chf: calcCost(response.usage, tier),
    })

    span.addEvent('reasoning_complete', {
      thinking_tokens_used: response.usage.thinking_tokens,
      budget_used_pct: (response.usage.thinking_tokens / budget) * 100,
    })
    span.end()

    return answer?.text
  })
}

Five production details that make the difference between "works in a notebook" and "runs in Zurich private banking":

Dynamic budget instead of fixed value: Giving every request 32k thinking tokens burns money. Our router estimates the required depth per request — simple FAQs need 0, M&A due diligence 24k.
Thinking is subject to audit: In a FINMA context the reasoning trace must be stored alongside the answer. Retention 10 years for financial mandates, 18 months for operational processes.
Redact PII before thinking starts: Without redaction, sensitive information ends up in the reasoning trace, which in turn is streamed into Langfuse, OpenTelemetry and Swiss storage — a revDSG violation is likely.
Cost guardrail: A reasoning agent in an infinite loop can burn CHF 400 per request. We enforce hard token limits per tenant and weekly budget alerts.
Check eval regression: Model updates (e.g. from Claude 4.6 to 4.7) sometimes drop accuracy on a specific workload — ORACLE detects this within 12-48 hours and rolls back.

6 Practical Use Cases With Measurable ROI

From 17 productive reasoning model deployments in 2025/2026, six patterns distil themselves that every Swiss company should examine:

1. Claims Review in Insurance

A Swiss property insurer with CHF 1.2 billion in premiums uses Claude 4.7 Thinking to assess complex claims — hit-and-runs, goodwill decisions, fraud suspicion. The reasoning model reads 30-80 pages of casefile, generates a 4-stage analysis, flags fraud patterns. Result after 9 months: 28% faster case throughput, 41% fewer false goodwill refusals, fraud detection up 2.3x. Payback: 5.1 months.

2. Due Diligence for Private Equity

A Zurich PE boutique uses o4 and Claude 4.7 Thinking to analyse 150-300-page info memos on potential targets. The reasoning identifies inconsistencies between financial model, competitive analysis and management claims. Result: 62% shorter pre-LOI phase, 3 deal-killers uncovered across 18 transactions that were missed before the reasoning system.

3. Clinical Decision Support

A Bern university hospital (see AI healthcare article) uses DeepSeek-R2 self-hosted for diagnostic support in the emergency department. The reasoner integrates lab values, symptoms, imaging findings and patient history. Result: 19% fewer misdiagnoses on complex presentations, secondary hypotheses identified 2.7x more often. Fully on-prem, zero patient data leaves the hospital network.

4. FINMA Compliance Reviews

A Geneva private bank automates FINMA circular impact analyses. Every change in RS 2023/1, RS 2024/3 or MiFID equivalence rules is mirrored by the reasoner against existing processes. Result: Review time per circular cut from 14 days to 2 days, compliance team relieved by 40%.

5. Legal Research for Law Firms

A Zurich commercial law firm deploys Claude 4.7 Thinking with tool use against Swisslex and EUR-Lex. The reasoner cites rulings, recognises conflicting case law and assesses the strength of arguments. Result: 3x faster first drafts, 100% source transparency through citation enforcement in ARES.

6. Engineering Review and Code Auditing

A Basel fintech uses o4 for critical code reviews — payment logic, cryptography, race conditions. The reasoner finds issues that classical linters and SAST tools miss. Result: 14 production-relevant bugs prevented over 3 months, code review throughput halved. Combined with AI-assisted coding.

Cost Control: Understanding the Reasoning Economy

Reasoning models are 5-40x more expensive per request than standard LLMs. Without deliberate cost management, a thoughtless rollout burns through the annual budget in 3 weeks. Our rules of thumb from productive deployments:

Router instead of default thinking: 70-85% of all requests need no reasoning. Classify with a 3B model before the reasoning call — savings: 8-12x total budget.
Prompt caching: Claude 4.7 Thinking supports prompt caching — identical contexts are billed at 10% of the normal price. For compliance reviews with a fixed circular context this saves 60-80%.
Batch mode for non-real-time workloads: Due diligence runs, compliance sweeps, monthly audits can run through the batch API at 50% of the price.
Self-hosted for high volume: From around 400,000 reasoning requests per month, a 2x H100 cluster with DeepSeek-R2 pays off compared to the Claude API — break-even at CHF 18,000 per month.
Eval gating: Do not throw 24k tokens at every request. Start at 4k, escalate only if the confidence score drops below 0.7. Saves 40% thinking compute.

A realistic cost calculation for a Swiss mid-market firm with 10,000 daily AI requests, of which 20% are in the reasoning tier:

Scenario	Monthly cost	Quality
All GPT-5 standard	CHF 2,400	72% accuracy
All Claude 4.7 Thinking (12k)	CHF 28,800	89% accuracy
Router (80% fast, 20% thinking 8k)	CHF 6,100	87% accuracy
Hybrid + prompt cache + batch	CHF 3,900	86% accuracy
Self-hosted DeepSeek-R2 + Claude spike	CHF 4,200 (fixed)	85% accuracy

The practically optimal point: router + prompt cache + batch mode — 60-70% lower cost than naive deployment at almost identical quality.

Reasoning Model vs. RAG vs. Classical LLM

The most frequent question: when reasoning, when RAG, when standard LLM? Our decision matrix:

Criterion	Reasoning model	RAG	Standard LLM
Domain knowledge	Training cut-off	Your knowledge	Training cut-off
Multi-step logic	Strong	Weak	Medium
Latency	5-90 s	0.8-2 s	0.3-1.5 s
Cost per task	CHF 0.05-0.50	CHF 0.01-0.05	CHF 0.001-0.02
Hallucination risk	Low (self-verification)	Very low (citations)	Medium-high
Ideal for	Complex decisions, deep analysis, opinions	Company knowledge, fact retrieval, support	Drafting, summarising, standard chat

The Swiss enterprise standard architecture 2026 combines all three: RAG delivers company context, reasoning processes it with multi-step logic, standard LLM drafts the final user response. We call this the "RRR pipeline" — Retrieve, Reason, Respond.

Governance: EU AI Act, revDSG and FINMA for Reasoning Models

Reasoning models raise new regulatory questions that classical LLMs did not: who is liable for the thinking that was never shown to a human? Is the reasoning trace part of the "automated decision" under revDSG Art. 21? The key regulatory conditions in 2026:

EU AI Act Art. 12 (logging obligation): Thinking tokens count as "input/output of the system". They must be stored alongside the answer across the entire lifetime of the system.
EU AI Act Art. 13 (transparency): Users must be able to recognise that the system is reasoning internally. Best practice: UI hint "The assistant is thinking harder (up to 20 s)" at reasoning tier.
EU AI Act Art. 14 (human oversight): For high-risk systems (banking, health, justice) the reasoning trace must be visible to the human reviewer. Not only the answer, but the path.
revDSG Art. 7 (data security): Thinking traces often contain more PII than the answer. AES-256 at rest, TLS 1.3, role-based access mandatory.
revDSG Art. 21 (automated decision): If the reasoning answer has legally relevant effect (credit decision, claims settlement, HR), the affected person must be able to request human review — and the reasoning trace is part of the justification.
FINMA RS 2023/1: Requires full traceability. The reasoning trace must be archived for 10 years, replayable, tamper-evident.
OR Art. 41/55: If a reasoning model reasons incorrectly and damage results, the company is liable, not the model provider. Duty of care: eval regime, red-team tests, written governance.

Our EU AI Act guide contains templates for all of the above articles, adapted for reasoning systems.

Case Study: Zurich Private Bank Automates FINMA Credit Risk Reviews

A Zurich private bank (CHF 38 billion AuM, 410 employees) runs quarterly credit risk reviews — a 6-week process with 14 analysts that applies FINMA circular RS 2017/7 and Basel III rules to every single credit exposure.

Starting Point Q4 2025

14 analysts work 6 weeks on 1,850 individual exposures
On average 12,200 person-hours per quarterly review
Error rate in sample audit: 3.8% (risk classifications too low)
FINMA review 2025 criticised "insufficient traceability" in 7% of analyses

mazdek Transformation: 14 Weeks, 5 Agents

We deployed a reasoning-model-based review network:

PROMETHEUS: Reasoning orchestration with Claude 4.7 Thinking (12k-24k tokens per exposure) via AWS Bedrock eu-central-2 Zurich.
ORACLE: RAG layer with the Basel III rulebook, FINMA circulars and the bank's internal risk model.
ARES: Citation enforcement (every classification must cite an RS source), PII redaction (client names are pseudonymised).
ARGUS: Tamper-evident archival of all reasoning traces in WORM storage, FINMA retention 10 years.
IRIS: Human-in-the-loop — every high-risk classification is approved by the responsible analyst in the client portal.

Results Q2 2026 (after 2 quarters of operation)

Metric	Q4 2025	Q2 2026	Delta
Review throughput time	6 weeks	9 days	-79%
Person-hours per review	12,200	2,800	-77%
Error rate in sample audit	3.8%	0.6%	-84%
FINMA finding on traceability	7%	0%	Eliminated
Reasoning cost per exposure	—	CHF 4.12	—
Reasoning cost per review	—	CHF 7,620	—
Annual saving	—	CHF 3.1 M	—
Payback time	—	6.2 months	—

Crucial: no job was cut. The 14 analysts were reassigned to focus reviews for the top-100 risks and to new credit product development — with higher value-add. The next FINMA inspection explicitly praised the traceability.

Implementation Roadmap: To a Productive Reasoning System in 12 Weeks

Our 5-phase process for Swiss companies:

Phase 1: Discovery & Use Case Selection (Weeks 1-2)

Workshop: which decisions today require > 30 minutes of human analysis?
Reasoning matrix: volume × complexity × risk × eval criteria
Pick top 3 candidates, build a gold eval set (100-500 cases with human-validated answer)

Phase 2: Proof of Concept (Weeks 3-5)

PROMETHEUS builds the reasoning loop with Claude 4.7 Thinking in a sandbox
Eval against the gold set: accuracy, F1, calibration
Benchmark cost per task, optimise the thinking budget

Phase 3: Guardrails, Router & RRR Pipeline (Weeks 6-8)

ORACLE builds the RAG layer with company knowledge
Intent router classifies tasks into simple/medium/complex
ARES implements PII redaction, citation enforcement, output policies
EU AI Act and FINMA compliance check

Phase 4: Infrastructure & Observability (Weeks 9-10)

HEPHAESTUS deploys the stack on Swiss GPU / Bedrock eu-central-2
ARGUS instruments Langfuse, Prometheus, WORM archival
NANNA runs an end-to-end eval across a 1,000-task set

Phase 5: Rollout & Continuous Improvement (Weeks 11-12)

Shadow run: the reasoner runs in parallel to humans, no live effect
Supervised rollout: 10% traffic, weekly drift reviews
Full production: 100% with human oversight on low-confidence cases
Monthly eval regression, quarterly model upgrades

The Future: Multi-Agent Reasoning, Agentic Search and Infinite Thinking

Reasoning models in 2026 are only the first wave. What is on the horizon for 2027-2028:

Multi-agent reasoning: Multiple specialised reasoners debate and converge on an answer. First products (OpenAI Swarm 2.0, Anthropic Council) show 8-15pp accuracy gain on research tasks.
Agentic search within the thinking loop: The model decides, during thinking, when it needs a web search, a DB query or a code run. Combines reasoning with MCP.
Tool use inside reasoning (Sonnet 4.8 roadmap): During thinking the model calls a Python sandbox, SymPy, formal theorem provers — real mathematical proofs instead of approximate computation.
Infinite thinking (Anthropic draft): The model runs for hours and days, saving intermediate states in external memory. Relevant applications: research papers, complex legal opinions, entire due diligence reports.
Domain-fine-tuned reasoners: DPO training on the Swiss legal corpus, FINMA rulebook, clinical guidelines. Our ORACLE pipeline makes this possible for mid-sized firms starting at CHF 45,000.
On-device reasoning: With DeepSeek-R3-Mini-30B, productive reasoning runs on a single RTX 6000 Ada in 2027 — full sovereignty for banks and public authorities.

Conclusion: Reasoning Models Are the AI Discipline of 2026

The decisive takeaways for Swiss decision-makers in 2026:

New scaling axis: Test-time compute has replaced training compute as the primary quality lever. Whoever does not actively orchestrate this axis misses the 2026 performance dimension.
Router-first architecture: Not every request needs reasoning. 70-85% standard LLM + 15-30% reasoning is the sweet spot for Swiss enterprise.
Governance frontier: Thinking traces are audit-relevant, PII-sensitive and legally effective. Without ARGUS observability, ARES guardrails and revDSG-compliant archival, no productive deployment is possible.
ROI under 7 months: Our 17 projects show an average payback of 6.1 months — faster than classical LLM projects (8-12 months), because reasoning models automate deeper process layers.
Swiss-sovereign is feasible: DeepSeek-R2 and Llama 4 Reasoning run on-prem on Swiss clusters. Full revDSG and FINMA compliance without US dependency.
Act now: Thinking tokens have become 40% cheaper per year, accuracy frontiers keep rising. Whoever goes productive in 2026 will have an insurmountable lead in process quality by 2027.

At mazdek, 19 specialised AI agents orchestrate the entire reasoning programme: PROMETHEUS for orchestration and routing, ORACLE for RAG and eval, ARES for compliance and redaction, ARGUS for 24/7 observability and WORM audit, HEPHAESTUS for Swiss GPU infrastructure, IRIS for human-in-the-loop, NANNA for eval regression and red-team tests. 17 productive reasoning deployments have been running since 2025 — DSG, GDPR, EU AI Act, FINMA and OR compliant from day one.

Web & E-Commerce

AI & Automation

19 AI Agents

By Company Size

Specializations

Up to 70% cheaper

Learn

Company

Latest Articles

Development

AI & Cloud

Enterprise

Specialized

Reasoning Models 2026: Extended Thinking, Test-Time Compute and Chain-of-Thought for Swiss Companies

Get this article summarized by AI