Veni, Vidi, Debuggi - Exploring LLMs with a Latin-inspired programming language: A Fibonacci test
... devs aren’t being replaced; we’re being promoted… to highly paid janitors of computational hubris ...

Introduction
What happens when you challenge today’s top coding models to work in a Latin-inspired language with Roman numerals? I set out to explore that by crafting dCaesar, a toy language evoking ancient Rome. I asked three LLMs (Claude 4 Sonnet, Gemini 2.5 Pro, and o3 Max) to generate a Fibonacci sequence in dCaesar. Beyond simply comparing their code, I analyzed their reasoning steps, durations, missteps, and costs, and offer some deeper reflections for AI-savvy developers.
dCaesar Language Spec Summary (TL;DR)
- Numbers: Roman numerals (I, V, X, (V), ((I)), etc.)
- Variables: pone name = X
- Output: dic <expr>
- Control: si…aliter…finis; dum…finis
- Arithmetic: adde, subtrahe, multiplica, divide
- Comparison: aequat, maior, minor
- Booleans/null: verum, falsum, nil
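To make the numeral notation concrete, here is a minimal Python sketch of how the literals above could be read. It is not part of dCaesar. The helper name roman_to_int is made up for illustration, and the meaning of the parenthesised forms is an assumption on my side: the summary lists (V) and ((I)) without defining them, so the sketch treats each level of parentheses as a multiplication by 1000.

```python
# Illustrative helper only; dCaesar's real numeral rules may differ.
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral string to an integer.

    ASSUMPTION: a parenthesised group is worth 1000 times its contents,
    so (V) reads as 5000 and ((I)) as 1000000.
    No validation of numeral well-formedness is attempted.
    """
    total = 0
    i = 0
    n = len(numeral)
    while i < n:
        ch = numeral[i]
        if ch == "(":
            # Scan forward to one past the matching closing parenthesis.
            depth, j = 1, i + 1
            while j < n and depth > 0:
                if numeral[j] == "(":
                    depth += 1
                elif numeral[j] == ")":
                    depth -= 1
                j += 1
            if depth != 0:
                raise ValueError(f"unbalanced parentheses in {numeral!r}")
            total += 1000 * roman_to_int(numeral[i + 1 : j - 1])
            i = j
        elif ch in ROMAN_VALUES:
            value = ROMAN_VALUES[ch]
            following = ROMAN_VALUES.get(numeral[i + 1]) if i + 1 < n else None
            # Subtractive notation: a smaller symbol before a larger one is subtracted.
            if following is not None and value < following:
                total -= value
            else:
                total += value
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} in {numeral!r}")
    return total


print(roman_to_int("M"))          # 1000, the bound used in the experiment
print(roman_to_int("CMLXXXVII"))  # 987
print(roman_to_int("(V)"))        # 5000 under the assumption above
```

Under this reading, the bound M in the prompt is 1000 and the largest Fibonacci term below it, 987, is written CMLXXXVII.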
Experiment Setup
- Prompt: Write a dCaesar program that prints Fibonacci up to 1000.
- Models: Claude 4, Gemini 2.5 Pro, o3 Max (Cursor diffs).
- Metrics: Spec compliance, algorithm clarity, code style, reasoning introspection, durations, token cost.
Model Deep Dives & Reasoning Reflections
Claude 4 Sonnet
pone a = I
pone b = I
dic a
dic b
pone continua = verum
dum continua
pone next = a adde b
si next minor M
dic next
pone a = b
pone b = next
aliter
pone continua = falsum
finis
finis
- Spec Alignment: Perfect Latin commands, Roman numerals, booleans.
- Duration: ~11 s of internal reasoning, ~2 min wall-clock.
Reasoning Steps & Reflections
- Spec Extraction (2 s): Parsed the dCaesar spec: Roman numerals, pone, loop syntax.
- Initial Draft (4 s): Wrote variable initialization and simple print lines.
- Loop Logic (3 s): Debated embedding the guard vs. a flag; chose clarity with continua.
- Validation (2 s): Simulated a few Fibonacci steps mentally to confirm no off-by-one.
Reflection: Claude’s reasoning is linear and transparent. The explicit flag is a deliberate choice to avoid subtle boundary errors—a pattern you’d also use in handwritten code.
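For readers who do not want to trace dCaesar by hand, the two loop shapes Claude weighed can be mirrored in Python. This is my own translation, not model output; the function names fib_with_flag and fib_with_guard are hypothetical, and the guard variant is just one plausible way the bound could have been embedded in the loop condition.

```python
def fib_with_flag(limit):
    """Mirror of the flag-controlled loop: the bound is tested inside the body."""
    a, b = 1, 1
    printed = [a, b]          # both seeds are printed unconditionally
    keep_going = True         # plays the role of `continua`
    while keep_going:
        nxt = a + b
        if nxt < limit:
            printed.append(nxt)
            a, b = b, nxt
        else:
            keep_going = False
    return printed


def fib_with_guard(limit):
    """Alternative shape: the bound is tested in the loop condition itself."""
    a, b = 1, 1
    printed = [a]             # only the first seed is printed unconditionally
    while b < limit:
        printed.append(b)
        a, b = b, a + b
    return printed


print(fib_with_flag(1000))
print(fib_with_flag(1000) == fib_with_guard(1000))  # True
```

For the bound used here both shapes print the same 16 terms, ending at 987. They only diverge for degenerate bounds such as 1, where the flag version still prints both seeds and the guard version prints one.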
Gemini 2.5 Pro
First-pass in ~35 s reasoning, 1.0 unit, 1 call
dic "Series Fibonacci usque ad M:"
pone terminus_prior = I
pone terminus_currens = I
dic terminus_prior
dum terminus_currens minor M
dic terminus_currens
pone next = terminus_prior adde terminus_currens
pone terminus_prior = terminus_currens
pone terminus_currens = next
finis
- Spec Alignment: Mostly correct; uses a mixed-language string, underscores in identifiers (snake_case).
- Duration: ~35 s of internal reasoning, ~2 min wall-clock.
Reasoning Steps & Reflections
- Grammar Immersion (8 s): Read spec.md, focused on Roman numerals and si/dum constructs.
- Decoding Elements (6 s): Parsed Roman-numeral operations and control flow.
- Logic Structuring (7 s): Mapped Fibonacci pattern to dum loop, identified initial prints.
- Drafting Code (8 s): Wrote code block, inserted header string for context.
- Review & Adjust (6 s): Checked variable shifts, confirmed arithmetic and guard correctness.
Reflection: Gemini shows a multi-stage reasoning style—splitting spec comprehension, logic planning, code drafting, then review. The mix of English in the header suggests it prioritizes human-friendly context over strict spec fidelity.
o3 Max
First-pass in ~21 s reasoning + 15 diffs, ~27.4 units, 15 calls
pone limen = M
pone primus = I
pone secundus = I
dic primus
dic secundus
pone proximus = primus adde secundus
dum proximus minor limen adde I
dic proximus
pone primus = secundus
pone secundus = proximus
pone proximus = primus adde secundus
finis
- Spec Alignment: Final code is fully spec-compliant—pure Latin, correct Roman arithmetic.
- Duration: ~21 s for initial draft, seconds per subsequent patch.
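The guard dum proximus minor limen adde I deserves a second look. Read as proximus < limen + I it is an inclusive bound in disguise, but that reading assumes adde binds tighter than minor, which the spec summary does not state. The Python sketch below is my own translation under that assumption (the helper fib_terms is hypothetical) and shows when the extra I actually changes the output.

```python
def fib_terms(keep):
    """Collect 1, 1, 2, ... using the same loop shape as the o3 Max program.

    `keep` decides whether the next term is still printed.
    """
    primus, secundus = 1, 1
    out = [primus, secundus]
    proximus = primus + secundus
    while keep(proximus):
        out.append(proximus)
        primus, secundus = secundus, proximus
        proximus = primus + secundus
    return out


def compare(limen):
    strict = fib_terms(lambda t: t < limen)      # plain `minor limen`
    hacked = fib_terms(lambda t: t < limen + 1)  # `minor limen adde I` (assumed parse)
    return strict, hacked


strict, hacked = compare(1000)
print(strict == hacked)        # True: 1000 is not a Fibonacci number

strict, hacked = compare(987)
print(strict[-1], hacked[-1])  # 610 987: the hack only matters when the bound is itself a term
```

So for a bound of M the two guards are interchangeable, and the + I only earns its keep if "up to 1000" is meant inclusively and the bound happens to be a Fibonacci number.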
Reasoning Steps & Reflections
- Test Suite Creation (3 s): Noted absence of tests, scaffolded a file.
- Initial Implementation (8 s): Wrote basic loop, printed initial values; introduced pone c = a adde b, but naming the accumulator c risked collision with Roman numeral C (100), leading to guard confusion.
- One-Pass Fail: Introduced a typo (Diciton"n?"), misnamed limit, and the c variable clashed with C, causing logic misinterpretation.
- Iterative Diffs (many edits):
  - Rename limit → limen (3 diffs)
  - Fix typo in dic (2 diffs)
  - Replace variable c with proximus to avoid numeral collision (2 diffs)
  - Add pone proximus initialization (1 diff)
  - Patch loop guard to < limen + I hack (1 diff)
  - Tweak naming to primus, secundus, proximus (3 diffs)
  - Final consistency cleanups (5 diffs)
Reflection: o3 Max's one-pass accuracy lags behind the other two models. Each diff reflects a micro-reasoning step, which is great for auditing but very costly.
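The c versus C clash is the kind of mistake a tiny lint rule could catch before any diff is spent on it. Below is a sketch of such a check in Python. It is hypothetical tooling, not something used in the experiment, and whether dCaesar treats identifiers case-insensitively is an assumption, so the check exposes that as a switch.

```python
import re

# Well-formed Roman numerals from I to MMMCMXCIX (parenthesised forms are ignored here).
ROMAN_RE = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")


def collides_with_numeral(identifier: str, case_sensitive: bool = False) -> bool:
    """Return True if the identifier could be read as a Roman numeral.

    ASSUMPTION: by default the language is treated as case-insensitive,
    so `c` is considered a clash with the numeral C.
    """
    candidate = identifier if case_sensitive else identifier.upper()
    # The pattern also matches the empty string, so rule that out explicitly.
    return bool(candidate) and ROMAN_RE.fullmatch(candidate) is not None


for name in ["c", "limen", "primus", "secundus", "proximus", "mix"]:
    print(name, collides_with_numeral(name))
```

With the default setting, c is flagged and all the names in the final program pass. An innocent-looking name such as mix would also be flagged, since MIX is 1009.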
Cost, Duration & Performance Summary
Model | Calls | Cost (units) | Reasoning | Wall-Clock | Compliance | Clarity |
---|---|---|---|---|---|---|
Claude 4 Sonnet | 1 | 0.8 | 11 s | ~2 min | ★★★★☆ | ★★★★★ |
Gemini 2.5 Pro | 1 | 1.0 | 35 s | ~2 min | ★★★☆☆ | ★★★★☆ |
o3 Max (Cursor) | 15 | 27.4 | 21 s+ | ~2 min | ★★★★★ | ★★★☆☆ |
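To put the cost column in proportion, the arithmetic can be spelled out in a few lines of Python. The numbers are copied from the table; the function name summarize and the derived metrics (cost per call, cost relative to the cheapest run) are my own framing, not something the tooling reported.

```python
# Calls and cost (in units) as listed in the table.
RUNS = {
    "Claude 4 Sonnet": {"calls": 1, "cost": 0.8},
    "Gemini 2.5 Pro": {"calls": 1, "cost": 1.0},
    "o3 Max": {"calls": 15, "cost": 27.4},
}


def summarize(runs):
    """Return cost per call and cost relative to the cheapest run for each model."""
    cheapest = min(run["cost"] for run in runs.values())
    return {
        name: {
            "per_call": run["cost"] / run["calls"],
            "relative": run["cost"] / cheapest,
        }
        for name, run in runs.items()
    }


for name, stats in summarize(RUNS).items():
    print(f"{name}: {stats['per_call']:.2f} units/call, "
          f"{stats['relative']:.2f}x the cheapest run")
```

On these figures the o3 Max run cost roughly 34 times the Claude run overall, and still about 1.8 units per call, so the bill is not explained by call count alone.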
Overall Evaluation
- Claude 4: Fastest spec grasp, clear one-pass reasoning—ideal for low-cost, high-confidence code.
- Gemini 2.5 Pro: Deep multi-stage reasoning, slight spec drift in naming—good for narrative-rich contexts.
- o3 Max: Despite the highest token cost, o3 Max failed its one-pass attempt and required numerous error-driven diffs. In this case, the extensive audit trail is a by-product of mistakes, making it a costly liability rather than a benefit.
Despite these powerful reasoning abilities, LLMs still:
- Make typos, naming errors, or spec inconsistencies.
- Employ quirky hacks that require human explanation.
- Lack domain-specific architectural foresight.
Ah, behold the latest gladiatorial bout in the Colosseum of Code: three gargantuan language models—Claude 4 Sonnet, Gemini 2.5 Pro, and o3 Max—pitted against the mighty empire of dCaesar, our delightfully pointless Latin-flavoured toy language. The brief? “Generate Fibonacci up to M.” The subtext? “Demonstrate the inevitable, glorious obsolescence of flesh-and-blood developers.” What we actually got was an hour-long circus of misplaced numerals, rogue variables, and billing-department-induced heart palpitations. Sic transit Gloria Tokenorum.
First into the arena, Claude 4 Sonnet unsheathed its rapier of “linear and transparent reasoning.” A single call, a polite eleven-second think, and presto: perfectly declinated pone, dum, and finis sprinkled on top of Roman numerals like pecorino over cacio e pepe. Impressively tidy—almost suspiciously so—yet still clocking two real-world minutes before it dared to grunt dic. Apparently conquering Gaul was quicker. But hey, at only 0.8 units, your CFO might even let you keep the coffee machine.
Gemini 2.5 Pro then strutted in, toga flapping, to deliver a “multi-stage reasoning experience.” Translation: it spent 35 seconds consulting its inner Sibylline Books, inserted an English banner inside a supposed full-Latin script (“Series Fibonacci usque ad M:” because tourists love bilingual menus), and broke spec with those adorable little underscores in terminus_currens. On the upside, the code reads like Virgil trying snake_case—so the historians of 2100 will at least get a laugh.
And finally, o3 Max—our heavyweight champion in the “billable tokens” category—charged forth like a legionary with a migraine. Fifteen separate calls, patching typos nobody asked for, renaming variables from c to proximus lest Caesar mistake them for the numeral C, and inflating the invoice to 27.4 units. All to serve up exactly the same ten-line loop your average sophomore could bang out in Python while their kettle boils. The selling point? “A rich audit trail.” Right—because what modern dev team doesn’t crave a 500-line diff explaining how the model argued with itself over Latin gender agreement?
Of course, the marketing scrolls still proclaim LLMs will “replace engineers by next quarter”—the same way amphitheater pundits once swore iron swords would replace soldiers. Yes, the models grok specs at lightning speed—until they hallucinate a Diciton”n?” or decide Roman numerals should support floating points. Yes, they emit artisanal explanations of their every twitch—at a price that makes even venture capitalists reach for smelling salts. And yes, they can spit out code snippets faster than you can say veni, vidi, debuggi—but guess who’s on pager duty when the production servers discover nil really means nil?
So, dear colleagues, rejoice! The robotic revolution is here: immaculately Latin, heroically inconsistent, and billed by the syllable. As long as clients keep funding “thought diversity” in the form of multimodal chain-of-reasoning sonnets, there will be plenty of work—namely, cleaning up after it. Devs aren’t being replaced; we’re being promoted… to highly paid janitors of computational hubris. Ave, dCaesar. Let the hype runneth over, and may your credit card limit be ever in your favor.