Veni, Vidi, Debuggi - Exploring LLMs with a Latin-Inspired Programming Language: A Fibonacci Test

... devs aren’t being replaced; we’re being promoted… to highly paid janitors of computational hubris ...

Introduction

What happens when you challenge today’s top coding models to work in a Latin-inspired language with Roman numerals? I set out to explore that by crafting dCaesar, a toy language evoking ancient Rome. I asked three LLMs (Claude 4 Sonnet, Gemini 2.5 Pro, and o3 Max) to generate a Fibonacci sequence in dCaesar. Beyond simply comparing their code, I analyzed their reasoning steps, durations, missteps, and costs, and I offer some deeper reflections for AI-savvy developers.

dCaesar Language Spec Summary (TL;DR)

Numbers: Roman numerals (I, V, X, (V), ((I)), etc.)
Variables: pone name = X
Output: dic <expr>
Control: si…aliter…finis; dum…finis
Arithmetic: adde, subtrahe, multiplica, divide
Comparison: aequat, maior, minor
Booleans/null: verum, falsum, nil
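
To make the numeral notation concrete, here is a minimal Python sketch of a Roman-numeral reader. It is not part of dCaesar itself: the function name roman_to_int is hypothetical, and the rule that each pair of parentheses multiplies the enclosed value by 1000 is my assumption, inferred from the examples (V) and ((I)) rather than quoted from the spec. It also does no validation of malformed numerals such as IIII.

```python
# Hypothetical helper, not part of dCaesar. ASSUMPTION: each level of
# parentheses multiplies the enclosed numeral by 1000, so (V) = 5000.
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(text: str) -> int:
    values = []  # one integer per letter or per parenthesised group
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "(":
            # Find the matching closing parenthesis, tracking nesting depth.
            depth = 1
            j = i + 1
            while j < len(text) and depth > 0:
                if text[j] == "(":
                    depth += 1
                elif text[j] == ")":
                    depth -= 1
                j += 1
            if depth != 0:
                raise ValueError(f"unbalanced parentheses in {text!r}")
            values.append(1000 * roman_to_int(text[i + 1 : j - 1]))
            i = j
        elif ch in VALUES:
            values.append(VALUES[ch])
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} in {text!r}")

    # Standard subtractive rule: a smaller value before a larger one is subtracted.
    total = 0
    for k, v in enumerate(values):
        if k + 1 < len(values) and v < values[k + 1]:
            total -= v
        else:
            total += v
    return total


print(roman_to_int("M"), roman_to_int("(V)"), roman_to_int("((I))"))
```

Under that assumption, M is the loop limit used throughout this post (1000), and the parenthesised forms only matter for values far beyond what the Fibonacci test needs.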

Experiment Setup

Prompt: Write a dCaesar program that prints the Fibonacci sequence up to 1000.
Models: Claude 4 Sonnet, Gemini 2.5 Pro, o3 Max (Cursor diffs).
Metrics: Spec compliance, algorithm clarity, code style, reasoning introspection, durations, token cost.

Model Deep Dives & Reasoning Reflections

Claude 4 Sonnet

pone a = I
pone b = I
dic a
dic b
pone continua = verum
dum continua
    pone next = a adde b
    si next minor M
        dic next
        pone a = b
        pone b = next
    aliter
        pone continua = falsum
    finis
finis

Spec Alignment: Perfect Latin commands, Roman numerals, booleans.
Duration: ~11 s of internal reasoning, ~2 min wall-clock.

Reasoning Steps & Reflections

Spec Extraction (2 s): Parsed dCaesar spec—Roman numerals, pone, loop syntax.
Initial Draft (4 s): Wrote variable initialization and simple print lines.
Loop Logic (3 s): Debated embedding the guard vs. a flag; chose clarity with continua.
Validation (2 s): Mentally simulated a few Fibonacci steps to confirm there was no off-by-one error.

Reflection: Claude’s reasoning is linear and transparent. The explicit flag is a deliberate choice to avoid subtle boundary errors—a pattern you’d also use in handwritten code.
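
For readers who want to check the flag-based loop without a dCaesar interpreter, here is a line-by-line Python transliteration of Claude's program. This is only a sketch: the function name fib_with_flag and the limit parameter are mine, and it collects values in a list instead of printing them with dic.

```python
def fib_with_flag(limit: int = 1000) -> list[int]:
    """Python transliteration of the flag-controlled dCaesar loop."""
    a, b = 1, 1
    printed = [a, b]          # dic a / dic b
    continua = True           # pone continua = verum
    while continua:           # dum continua
        nxt = a + b           # pone next = a adde b
        if nxt < limit:       # si next minor M
            printed.append(nxt)
            a, b = b, nxt
        else:                 # aliter
            continua = False  # pone continua = falsum
    return printed


terms = fib_with_flag()
print(len(terms), terms[-1])  # 16 987
```

The flag costs one extra variable, but the boundary test sits right next to the print, so the term that crosses the limit is computed and never emitted.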

Gemini 2.5 Pro

First-pass in ~35 s reasoning, 1.0 unit, 1 call

dic "Series Fibonacci usque ad M:"
pone terminus_prior = I
pone terminus_currens = I
dic terminus_prior
dum terminus_currens minor M
    dic terminus_currens
    pone next = terminus_prior adde terminus_currens
    pone terminus_prior = terminus_currens
    pone terminus_currens = next
finis

Spec Alignment: Mostly correct; uses a mixed-language string, underscores in identifiers (snake_case).
Duration: ~35 s of internal reasoning, ~2 min wall-clock.

Reasoning Steps & Reflections

Grammar Immersion (8 s): Read spec.md, focused on Roman numerals and si/dum constructs.
Decoding Elements (6 s): Parsed Roman-numeral operations and control flow.
Logic Structuring (7 s): Mapped Fibonacci pattern to dum loop, identified initial prints.
Drafting Code (8 s): Wrote code block, inserted header string for context.
Review & Adjust (6 s): Checked variable shifts, confirmed arithmetic and guard correctness.

Reflection: Gemini shows a multi-stage reasoning style—splitting spec comprehension, logic planning, code drafting, then review. The mix of English in the header suggests it prioritizes human-friendly context over strict spec fidelity.

o3 Max

First-pass in ~21 s reasoning + 15 diffs, ~27.4 units, 15 calls

pone limen = M
pone primus = I
pone secundus = I
dic primus
dic secundus
pone proximus = primus adde secundus
dum proximus minor limen adde I
    dic proximus
    pone primus = secundus
    pone secundus = proximus
    pone proximus = primus adde secundus
finis

Spec Alignment: Final code is fully spec-compliant—pure Latin, correct Roman arithmetic.
Duration: ~21 s for initial draft, seconds per subsequent patch.
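
The guard in that listing, proximus minor limen adde I, deserves a second look. The sketch below mirrors the loop in Python and compares a strict bound with the "plus one" bound. It assumes the guard parses as proximus < (limen + I); the spec summary above does not state operator precedence, so treat that as my reading. The function name fib_below is hypothetical.

```python
def fib_below(bound: int) -> list[int]:
    """Mirror of the o3 Max loop: emit terms while the next term is < bound."""
    primus, secundus = 1, 1
    printed = [primus, secundus]
    proximus = primus + secundus
    while proximus < bound:
        printed.append(proximus)
        primus, secundus = secundus, proximus
        proximus = primus + secundus
    return printed


strict = fib_below(1000)         # reading "up to 1000" as < M
inclusive = fib_below(1000 + 1)  # the "limen adde I" guard, i.e. <= M
print(strict == inclusive)       # True, because 1000 is not a Fibonacci number

# The two readings only diverge when the limit itself is a Fibonacci number.
print(fib_below(987)[-1], fib_below(987 + 1)[-1])  # 610 987
```

So for this particular prompt the hack changes nothing in the output; it turns a strict bound into an inclusive one, which would only matter for a limit such as 987.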

Reasoning Steps & Reflections

Test Suite Creation (3 s): Noted absence of tests, scaffolded a file.
Initial Implementation (8 s): Wrote basic loop, printed initial values; introduced pone c = a adde b, but naming the accumulator c risked collision with Roman numeral C (100), leading to guard confusion.
One-Pass Fail: Introduced a typo (Diciton"n?"), misnamed limit, and the c variable clashed with C, causing logic misinterpretation.
Iterative Diffs (many edits):
- Rename limit → limen (3 diffs)
- Fix typo in dic (2 diffs)
- Replace variable c with proximus to avoid numeral collision (2 diffs)
- Add pone proximus initialization (1 diff)
- Patch loop guard to < limen + I hack (1 diff)
- Tweak naming to primus, secundus, proximus (3 diffs)
- Final consistency cleanups (5 diffs)
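
The c versus C collision in that list is the kind of thing a small lint check could catch. The sketch below flags identifiers that also read as valid Roman numerals. It rests on two assumptions of mine: that the clash is case-insensitive (the spec summary does not say whether dCaesar is case-sensitive) and that only standard numerals up to MMMCMXCIX matter. The function name reads_as_roman is hypothetical.

```python
import re

# Standard Roman numerals from I to MMMCMXCIX (1 to 3999).
ROMAN = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})",
    re.IGNORECASE,  # ASSUMPTION: 'c' and 'C' are treated as the same token
)


def reads_as_roman(name: str) -> bool:
    """Return True if an identifier could be mistaken for a Roman numeral."""
    # The pattern also matches the empty string, so require a non-empty name.
    return bool(name) and ROMAN.fullmatch(name) is not None


for name in ["c", "a", "b", "mix", "limen", "proximus"]:
    print(name, reads_as_roman(name))
```

Under these assumptions, c and mix are flagged, while a, b, limen, and proximus are safe, which is consistent with the rename o3 Max eventually settled on.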

Reflection: o3 Max's one-pass accuracy lags behind the other two models. Each diff reflects a micro-reasoning step, which is great for auditing but very costly.

Cost, Duration & Performance Summary

Model	Calls	Cost (units)	Reasoning	Wall-Clock	Compliance	Clarity
Claude 4 Sonnet	1	0.8	11 s	~2 min	★★★★☆	★★★★★
Gemini 2.5 Pro	1	1.0	35 s	~2 min	★★★☆☆	★★★★☆
o3 Max (Cursor)	15	27.4	21 s+	~2 min	★★★★★	★★★☆☆
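
To put the cost column in perspective, this small Python sketch computes each run's cost relative to the cheapest one, plus the cost per call. The figures are copied from the table above; the variable names are mine, and "units" is whatever the tool reported, not a currency.

```python
# (calls, cost in units) taken directly from the summary table.
runs = {
    "Claude 4 Sonnet": (1, 0.8),
    "Gemini 2.5 Pro": (1, 1.0),
    "o3 Max (Cursor)": (15, 27.4),
}

cheapest = min(cost for _, cost in runs.values())

# Cost relative to the cheapest run, and average cost per call.
ratios = {model: cost / cheapest for model, (_, cost) in runs.items()}
per_call = {model: cost / calls for model, (calls, cost) in runs.items()}

for model in runs:
    print(f"{model}: {ratios[model]:.2f}x cheapest, {per_call[model]:.2f} units/call")
```

By this measure the o3 Max run cost roughly 34 times the Claude run, and even a single o3 Max call averaged more than either of the other two complete runs.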

Overall Evaluation

Claude 4 Sonnet: Fastest spec grasp and clear one-pass reasoning; ideal for low-cost, high-confidence code.
Gemini 2.5 Pro: Deep multi-stage reasoning, slight spec drift in naming—good for narrative-rich contexts.
o3 Max: Despite the highest token cost, o3 Max failed its one-pass attempt and required numerous error-driven diffs. In this case, the extensive audit trail is a by-product of mistakes, making it a costly liability rather than a benefit.

Despite these powerful reasoning abilities, LLMs still:

Make typos, naming errors, or spec inconsistencies.
Employ quirky hacks that require human explanation.
Lack domain-specific architectural foresight.

Roasting LLMs vs Latin

Ah, behold the latest gladiatorial bout in the Colosseum of Code: three gargantuan language models—Claude 4 Sonnet, Gemini 2.5 Pro, and o3 Max—pitted against the mighty empire of dCaesar, our delightfully pointless Latin-flavoured toy language. The brief? “Generate Fibonacci up to M.” The subtext? “Demonstrate the inevitable, glorious obsolescence of flesh-and-blood developers.” What we actually got was an hour-long circus of misplaced numerals, rogue variables, and billing-department-induced heart palpitations. Sic transit Gloria Tokenorum.

First into the arena, Claude 4 Sonnet unsheathed its rapier of “linear and transparent reasoning.” A single call, a polite eleven-second think, and presto: perfectly declinated pone, dum, and finis sprinkled on top of Roman numerals like pecorino over cacio e pepe. Impressively tidy—almost suspiciously so—yet still clocking two real-world minutes before it dared to grunt dic. Apparently conquering Gaul was quicker. But hey, at only 0.8 units, your CFO might even let you keep the coffee machine.

Gemini 2.5 Pro then strutted in, toga flapping, to deliver a “multi-stage reasoning experience.” Translation: it spent 35 seconds consulting its inner Sibylline Books, inserted an English banner inside a supposed full-Latin script (“Series Fibonacci usque ad M:” because tourists love bilingual menus), and broke spec with those adorable little underscores in terminus_currens. On the upside, the code reads like Virgil trying snake_case—so the historians of 2100 will at least get a laugh.

And finally, o3 Max—our heavyweight champion in the “billable tokens” category—charged forth like a legionary with a migraine. Fifteen separate calls, patching typos nobody asked for, renaming variables from c to proximus lest Caesar mistake them for the numeral C, and inflating the invoice to 27.4 units. All to serve up exactly the same ten-line loop your average sophomore could bang out in Python while their kettle boils. The selling point? “A rich audit trail.” Right—because what modern dev team doesn’t crave a 500-line diff explaining how the model argued with itself over Latin gender agreement?

Of course, the marketing scrolls still proclaim LLMs will “replace engineers by next quarter”—the same way amphitheater pundits once swore iron swords would replace soldiers. Yes, the models grok specs at lightning speed—until they hallucinate a Diciton”n?” or decide Roman numerals should support floating points. Yes, they emit artisanal explanations of their every twitch—at a price that makes even venture capitalists reach for smelling salts. And yes, they can spit out code snippets faster than you can say veni, vidi, debuggi—but guess who’s on pager duty when the production servers discover nil really means nil?

So, dear colleagues, rejoice! The robotic revolution is here: immaculately Latin, heroically inconsistent, and billed by the syllable. As long as clients keep funding “thought diversity” in the form of multimodal chain-of-reasoning sonnets, there will be plenty of work—namely, cleaning up after it. Devs aren’t being replaced; we’re being promoted… to highly paid janitors of computational hubris. Ave, dCaesar. Let the hype runneth over, and may your credit card limit be ever in your favor.