AI Reasoning Bakeoff Part 2 of 3
In the second instalment of this series, I describe what it was like working with a variety of popular models to "agentically" create, test and fix the code for a relatively simple task according to a specification.
The task I asked each model to undertake is described in my previous article:
In addition to the developer experience, I wanted to compare the cost of each model.
Since I started, Gemini 2.5 has come out and it blows all the other ones away. Then Claude 4! Like Douglas Adams before me, I will cover them in a part 4 of this increasingly misnamed trilogy of articles!
The workflow
Based on the goals described previously, I ended up doing the following with each LLM:
When coding with LLMs from a spec, the default LLM behavior is to create all of the code from the spec, then create a full test suite inspired by the code. This often resulted in 80% test failures as a starting point, and over time this would come down progressively, with some models better able to make steady progress.
Along the way, the LLM may see opportunities for improvement through refactoring. This is a disaster! The LLM takes already failing code, creates new files, introduces new patterns, adds generic TypeScript abstractions, and next thing you know your code is cooked a bit too much (as per the photo above). So the apparent progress - dozens of good-looking code files, dozens of good-sounding tests - crumbles like a house of cards.
I call this vibe debugging, and it led me down rabbit holes of wasted time as several LLMs seemed unable to dig themselves out of the 20% failure mark.
TDD and AI go well together
With Claude 3.7 and Gemini 2.5 being more advanced models, I decided to improve the procedure and implement a Test-Driven Development (TDD) approach. (Perhaps I could have tried it with o1, but the cost was just too high.) Contrary to what the name may lead you to believe, TDD is a development approach rather than a testing approach: it helps you write the code, rather than test code you've already written. You write one test at a time, adding just enough of the solution to make that test pass.
The interesting thing is that so much has been written by TDD purists that the LLMs are very well versed in TDD and need hardly more than a nudge to get going. And to my delight, I found that the LLMs were quite capable of adding the code increments test by test without any errors, as opposed to generating the entire module whole cloth in one shot.
Contrary to "pure" TDD, the design spec already had established the main lines of the design, so what I was doing wasn't really TDD but I didn't tell the LLM that! The design was a spec with some pseudocode, but mostly documenting rules, requirements, communications and dependencies and trying to be higher level.
I would then create a test plan from the spec, focusing on features to be added one by one rather than on individual functions, unlike the "First-order correctness" approach used with the less advanced models. This is because in this TDD approach the testing was supposed to drive the coding, not the other way around.
I then had it go through a red-green cycle where the LLM would add one test, run the tests to see it fail ("red"), then add just enough code to make it pass ("green"). It was free to alter or create functions as needed, as long as they weren't inconsistent with the original design doc. If it saw the need to change the design, it could ask me for permission to update the design.
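To make this concrete, here is roughly what a single red-green increment looked like. Note that the parseHeading function, its file paths and the expected output shape are hypothetical, invented for this illustration rather than taken from the actual project:

```typescript
// test/markdown-serialization/parse/parseHeading.test.ts (hypothetical)
// Red: the new test is added first and fails, because level handling doesn't exist yet.
import { describe, expect, it } from "vitest";
import { parseHeading } from "@/markdown-serialization/parse/parseHeading.js";

describe("parseHeading", () => {
  it("extracts the level from the leading '#' characters", () => {
    expect(parseHeading("## Section")).toEqual({ type: "heading", level: 2, text: "Section" });
  });
});
```

Once that test fails ("red"), just enough implementation is added to make it pass ("green"):

```typescript
// src/markdown-serialization/parse/parseHeading.ts (hypothetical)
// Green: the smallest increment of the solution that satisfies the new test.
export function parseHeading(line: string): { type: "heading"; level: number; text: string } {
  const match = /^(#{1,6})\s+(.*)$/.exec(line);
  if (!match) throw new Error(`Not a heading: ${line}`);
  return { type: "heading", level: match[1].length, text: match[2] };
}
```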
Miraculously, I found that the code almost always passed the first attempt at a new unit test, especially with Gemini 2.5! This made it much more effective than the default "generate everything at once" mode: even though code generation now took much longer and many more iterations, each iteration was a smaller change, almost always successful, and built on the existing code without breaking it.
I also built integration tests, which required special handling because of the additional complexity of more interdependencies; combined with the unit tests, the result was a much more effective and reliable workflow. This will be covered in part 3.
Claude Sonnet 3.5
A solid coding model, it has since been "supplanted" by 3.7 as Anthropic's "best and brightest" model, and to be fair 3.7 does some things better, especially within the Claude application. However, when it comes to troubleshooting, 3.5 tends to stay more on track and is the best of the models I tried.
Claude is definitely the most verbose LLM in its output, which is sometimes annoying, but in code generation this naturally results in very thorough examples and comments. This is important because when it comes time to do AI-driven updates, having comments helps keep the AI on track instead of going off on random tangents when it comes back later to fix things.
The actual code Claude produces has always been solid, and compared to other models, required the least troubleshooting and fixing.
Claude included a lot of strategic logging, with good descriptors and values, helping it speedily address any test failures. It was quite capable of making use of the logging as if it were a human developer using a debugger stepping through the execution. This greatly improved its ability to pinpoint errors and fix them.
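To give a flavor of this (the function and the Tracer import path below are hypothetical, not the project's actual code), the kind of strategic logging Claude added looked roughly like the following; the Tracer helper is the one referenced in the prompts later in this article, with its exact API assumed here:

```typescript
// Hypothetical illustration of strategic logging in a parsing function.
// The Tracer helper and its import path are assumptions for this sketch.
import { Tracer } from "@/test-utils/Tracer.js";

export function parseListItems(lines: string[]): string[] {
  Tracer.log("parseListItems: input lines", lines);

  const items: string[] = [];
  for (const line of lines) {
    const match = /^[-*]\s+(.*)$/.exec(line);
    if (!match) {
      Tracer.log("parseListItems: skipping non-list line", line);
      continue;
    }
    Tracer.log("parseListItems: captured item text", match[1]);
    items.push(match[1]);
  }

  Tracer.log("parseListItems: result", items);
  return items;
}
```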
In addition to improving the LLM's troubleshooting, the tracing was descriptive enough that you could understand the code without having to look at it to get a sense of what was going on, which I found important for a human dealing with unfamiliar AI-generated code:
The total cost of the code generation for Claude was $0.40. Like most vendors, Anthropic charges per API call, for both input and output tokens.
Claude Test Plans
I used the following prompt to have LLMs read the generated code and infer a test plan that would establish full coverage:
CREATE the folder test/markdown-serialization/parse if it doesn’t exist
In the folder test/markdown-serialization/parse, CREATE a document called <module name>-scenarios.md for each file in src/markdown-serialization/parse SKIPPING types.ts or index.ts, and SKIPPING any existing scenarios document
GO through the functions in the file, line by line, and identify the input values needed to exercise all of the code paths.
FOR EACH conditional statement, explain what input values can directly or indirectly determine which branch is taken
FOR EACH loop, explain what input values can directly or indirectly determine how many iterations should be done
FOR EACH external call, identify the values that need to be mocked to exercise all of the paths after the external call.
ADD a final section of scenarios consisting of:
- Synopsis
- Purpose of test
- Input values
- Mock values
- Outcome expectations
Claude finds the control flow statements and the values it needs to consider to exercise all of the possibilities. It also accurately lists the external calls which, in unit testing, will be mostly mocked in order to control the results, according to the needs of each test case.
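As a simple illustration of what that analysis covers (this function is made up, not taken from the project), consider:

```typescript
// Hypothetical function showing what the test-plan analysis has to account for.
export function renderLink(text: string, url?: string): string {
  // Conditional: the branch taken depends on whether `url` is provided.
  if (!url) {
    return text;
  }
  // Loop: the number of iterations depends on the length of `text`,
  // and the branch inside depends on whether it contains '[' characters.
  let escaped = "";
  for (const ch of text) {
    escaped += ch === "[" ? "\\[" : ch;
  }
  return `[${escaped}](${url})`;
}
// Scenarios therefore need at least: a call with no url, a call with a url and no '[',
// and a call with a url and at least one '[' in the text.
```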
Based on these significant inputs (either received as parameters, or received from external call returns), Claude creates unit test cases. Now, LLMs often recognize what we're doing and act like overly eager interns, generating actual test code directly in the test plan. In order to avoid this, I used the "AI psychological" trick of calling them "scenarios" rather than test cases.
Based on the selected input combinations, including the external call mocking, the scenarios explain what the expected results are, which then get translated into verification code ("assertions" in unit-testing lingo). Claude's scenarios are minimal but complete.
This cost me another $0.35 - almost as much as the code.
Claude Test Suites
Having completed all the test plans, I then have Claude create all of the test suites. Each test suite is a code file with all of the test cases derived from the test scenarios documented in the previous step. Here the temptation is great for the LLM to throw in unwanted design patterns or mocking techniques. I was forced to create a document describing how I wanted mocks to be created, and I gave it strict instructions to follow it! I found this wasn't always working, so I added a step where it was forced to explain what it understood, as a means of ensuring the information was in the context used by the LLM. (A rough sketch of the mocking pattern appears after the prompt below.)
I also added "I WLL BE CHECKING!" as a trick to get more complete answers. I figure the models are trained to produce a typical amount of output in order to keep computational costs under control, so we can try to counteract that with these "psychological" tricks that seem to act as signals to spend more tokens.
READ THESE INSTRUCTIONS AND PAY CLOSE ATTENTION TO WHAT IS ASKED - DO NOT ADD EXTRANEOUS THINGS NOT REQUESTED.
IMPORTANT: READ the file Mocking Directives.md and explain all the instructions to PROVE YOU UNDERSTOOD using examples of mocking a function. WAIT FOR MY APPROVAL
IN the directory test/markdown-serialization/parse, LIST all of the files of the form <module name>-scenarios.md and find which ones don’t have a corresponding test suite of the form <module name>.test.ts.
FOR EACH missing test suite, CREATE unit tests from the test plans found in that folder, SKIPPING over any that already exist.
- USE ESM imports meaning imports of a file require “.js” at the end
- USE the ‘@/’ and ‘@test/’ aliases instead of “..” in import paths.
- READ src/markdown-serialization/types.js and pay attention to the data structures defined in it.
- Add an import { MarkdownSerialization } from "@/markdown-serialization/types.js";
- Add imports for test-utils.js and vitest
IMPORTANT: use 'it' and 'test' exported by @test/test-utils.js NOT those from 'vitest' directly
import { expect, vi, beforeAll, afterAll, beforeEach, afterEach } from "vitest";
import { it, test } from "@test/test-utils.js";
FOR EACH test case in the test plan: Use ‘describe’ for each group of tests, and ‘it’ for each test, unless there are special testing requirements in which case ‘test’ may be used.
Add an import of functions from the file under test example: import { myFunc } from ‘@/markdown-serialization/render/MyComponent.js’
MOCKING: Mock any external functions used by the unit under test.
Trace through each test case and when encountering an external function to be mocked, determine what values make sense. Then return those values using overrideMock.
Provide a default implementation in the mocked.mockFn call before the tests that throws an exception in case a test case is missing a mock
IMPORTANT: Do not use JSON.stringify on objects because there could be circular references, let Tracer.log do the serialization or call Tracer.stringify as it can handle circular references.
ASK for permission before modifying any other file than the test file
IMPORTANT: NEVER use `replace_in_file`, it doesn't work. ALWAYS use `write_to_file`
VERY IMPORTANT: BE THOROUGH AND DO NOT SKIP ANY STEPS! I WILL BE CHECKING!
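Since the Mocking Directives document itself isn't reproduced in this article, here is a rough sketch of the pattern the prompt is driving at, written with vitest's built-in mocking. The module names are hypothetical, and the project's own test-utils.js helpers (overrideMock, mocked.mockFn) may well have different signatures:

```typescript
// Hypothetical sketch of the mocking pattern; module names are invented and the
// project's real test-utils.js helpers may differ from the vitest built-ins used here.
import { describe, expect, vi, beforeEach } from "vitest";
import { it } from "@test/test-utils.js";
import { MarkdownSerialization } from "@/markdown-serialization/types.js";
import { parseDocument } from "@/markdown-serialization/parse/parseDocument.js";

// Mock the external function used by the unit under test.
vi.mock("@/markdown-serialization/parse/parseHeading.js");
import { parseHeading } from "@/markdown-serialization/parse/parseHeading.js";

describe("parseDocument", () => {
  beforeEach(() => {
    // Default implementation throws, so a test case that forgot its mock fails loudly
    // instead of silently continuing with undefined values.
    vi.mocked(parseHeading).mockImplementation(() => {
      throw new Error("parseHeading called without a test-specific mock");
    });
  });

  it("delegates heading lines to parseHeading", () => {
    // Per-test override with the return value that makes sense for this scenario.
    vi.mocked(parseHeading).mockReturnValue({ type: "heading", level: 1, text: "Title" });

    const doc: MarkdownSerialization = parseDocument("# Title");

    expect(parseHeading).toHaveBeenCalledWith("# Title");
    expect(doc).toBeDefined();
  });
});
```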
Pro tip: partial updates don't always work!
One annoying failure mode of various LLMs: when they only have to change a few lines in a file, they try to avoid rewriting the entire file. In some cases, the partial update overwrites the existing content, leaving just the new lines and a pithy comment such as "rest of the code goes here...". Other times it fails to match text in the file and retries several times for nothing. When you are paying by the token, this is aggravating, and it takes longer to complete the work. Therefore I added an instruction telling it to avoid partial updates.
Sonnet 3.7 within the Claude application is much better at doing these kinds of partial updates, but in this test I'm driving everything from within VS Code, which is how most developers are using AI Coding at the moment. MCP may well change this - more in another installment! Sonnet 4 almost always gets it right the first time, and otherwise usually gets it right on the second attempt.
Sonnet went about the business of creating the tests according to the test plans, in a matter-of-fact way:
A fair amount of code was generated, which cost me $0.90 - more than double the cost of the code under test.
Claude Sonnet 3.5: Running the tests
Once all tests were generated, we could run them one by one and fix any issues.
With Sonnet, most of the tests pass on first attempt, with only maybe 1 or 2 failures per test suite.
Root Cause Analysis
For the failures, Sonnet is good at following the provided Root Cause Analysis process and fixing them in one or two attempts. I didn't have to step in to give it "hints" like with some other LLMs. The process is yet another prompt I put together: it gathers information, then validates the test, so the LLM can decide either to troubleshoot the code to make the test pass, or to troubleshoot and correct the test so it performs the validation intended by the test plan. Theoretically both the test and the code might need adjusting, but that was pretty rare.
Test Case Validation
Test case validation ensures that, when the test code was generated, it wasn't subject to a flight of fancy and actually tests what's indicated in the test case. With some models this is an issue: the test fails not because of the code under test, but because of incorrectly generated test code. So before giving the LLM permission to troubleshoot the code under test, I require that it perform this validation and correct the test if there's a discrepancy.
The procedure requires extracting both the test code and the test case and putting them in the report, to ensure they are present in the context for analysis.
Troubleshooting the code
If the test is valid, the second part of the procedure focuses on troubleshooting the code since it isn't producing the desired results.
The traces are examined as they provide important information that avoids endless cycles of guesswork and failed iterations. The default behavior of LLMs is to start with no tracing, and just add a few print statements as needed, after flailing about and guessing at what the cause of the failure is. Providing a detailed execution trace cuts this down.
In this procedure, I ask the LLM to determine a point where the expected values diverge from the actual values. The instructions say to work backwards from the end, but the LLMs mostly do it forward, in spite of several attempts to reinforce that idea. In any case, Sonnet 3.5 is very good at analyzing the log and picking out the most relevant information.
The behavior is compared with the original design to understand what is supposed to happen, so that the code isn't automatically changed just because the test case or test code says it's wrong. This avoids the "tail wagging the dog" issue.
Proposing a fix
On the heels of the log analysis, and having found a plausible root cause, the LLM can now propose a code fix.
Rather than generating the code immediately like some of the other LLMs, Claude Sonnet describes the changes in a conceptual and concise way. Once the code is generated, I can review it and approve the final change.
Detecting Test Case Errors
In some cases, the RCA review did show an error in the test and Sonnet was able to detect it. (Reminder that Sonnet is the accused, judge and jury since it created both the tests and code). It could be incorrectly generated test code, or test cases that didn't quite make sense, as per the following example:
Detecting Non-conformant Code
In other cases, there is a deviation of the code from the design. These deviations are essentially due to the probabilistic, pattern-driven way the files were generated. One way to improve the final result of AI in any application is to incorporate review steps. It may be surprising that doing this as a separate step will often detect the problem, but remember that we're dealing with a generative text-transformation process that has its idiosyncrasies, not real intelligence. There's a form of "stickiness" in its output that resembles human cognitive dissonance, which we need to take into account.
Groundhog Day
Sometimes Claude appears to cycle through the same files, seemingly doing the same analysis over and over again in a loop, spending my tokens each time. The RCA process I put together asks it to incorporate prior fixes to avoid getting stuck. However, just to give me peace of mind, if it's taking too long I can interrupt it and check whether the choices made make sense and converge towards a solution, by asking it to account for its lengthy execution:
Sometimes I find the direction that Claude is going to be incorrect. But I try to avoid telling it outright what the answer is, because if it has some sort of "blockage" about what to do, it may ignore my clarifications and continue being stuck. Instead, I force it to bubble up pertinent facts in its context by making it explain - this allows me to verify what's there, whether there's a conflict, etc.
(This is also a teaching technique I've used with human developers, having them walk themselves to a conclusion rather than putting the solution in their face. Coincidence, no doubt!)
So there's an "A-ha!" moment where the LLM recognizes it has a wrong assumption! This is the kind of thing that really resembles intelligence...
Interestingly, Claude often tries to balance the complexity of a solution with the needs of a use case, which is how a human developer would think. Of course humans have more subtlety in assigning value to the elements of a tradeoff, but this is something interesting to observe and improves the value of its recommendations.
Otherwise if the test, test plan, design and code seem coherent, then the code fixing process uses the RCA to attempt the most likely fix.
Baffled by improper mocking
One of the serious issues that can happen is when the testing falls into an infinite loop, for example because of a failure to increment a counter or some other coding error. This can easily happen due to improper mocking - the original function consumes a counter or resource that the mock doesn't take into account. This is difficult for an LLM to handle because the test suite never returns: eventually the operation times out without providing any information, so the LLM has no idea what's going on, and starts suggesting clearing the jest "cache" or even reinstalling packages it thinks are "corrupt"!
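This kind of hang is easy to reproduce. In the hypothetical sketch below (none of this is the project's actual code), the real reader consumes its input and advances a position on every call; a mock written without that detail keeps returning the same line, and the loop in the code under test never terminates:

```typescript
// Hypothetical illustration of how a careless mock produces an infinite loop.
export interface LineReader {
  // Returns the next line and advances an internal position; null at end of input.
  next(): string | null;
}

// The code under test relies on next() eventually returning null to terminate.
export function collectLines(reader: LineReader): string[] {
  const result: string[] = [];
  for (let line = reader.next(); line !== null; line = reader.next()) {
    result.push(line);
  }
  return result;
}

// A real reader consumes its input...
export function makeReader(lines: string[]): LineReader {
  let position = 0;
  return { next: () => (position < lines.length ? lines[position++] : null) };
}

// ...but a mock that ignores the position never returns null, so the test hangs:
const badMock: LineReader = { next: () => "same line" };
// collectLines(badMock); // would never return - the timeout described above
void badMock;
```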
And with the logging, an infinite loop can completely fill up the hard disk if left to run without any limits.
For this reason, I added a failsafe during testing: an exception is thrown if too many log messages are written. By default this happens after 1000 writes, although it can be adjusted as needed. The exception message reminds the LLM that there is probably an infinite loop, to counteract its natural tendency to conclude that it was normal for the test code to generate that many traces!
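A minimal sketch of such a failsafe, assuming a Tracer along the lines of the one referenced in the prompts above (the project's real implementation isn't shown in this article):

```typescript
// Minimal sketch of the log-write failsafe; the real Tracer isn't shown in this article.
export class Tracer {
  private static writeCount = 0;
  // Ceiling on log writes per test run; 1000 by default, adjustable as needed.
  static maxWrites = 1000;

  static log(message: string, value?: unknown): void {
    if (++Tracer.writeCount > Tracer.maxWrites) {
      throw new Error(
        `Tracer: more than ${Tracer.maxWrites} log writes - there is probably ` +
          `an infinite loop in the test or in the code under test.`
      );
    }
    console.log(value === undefined ? message : `${message}: ${Tracer.stringify(value)}`);
  }

  // Serializer that tolerates circular references (unlike plain JSON.stringify).
  static stringify(value: unknown): string {
    const seen = new WeakSet<object>();
    return JSON.stringify(value, (_key, v) => {
      if (typeof v === "object" && v !== null) {
        if (seen.has(v)) return "[Circular]";
        seen.add(v);
      }
      return v;
    });
  }
}
```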
Now with the nudge, the LLM is able to focus on the problems caused by the test or the code under test themselves:
In the following analysis, Claude adds comments to track the evolution of the variable's value:
The explanation isn't quite complete enough in this case, in spite of the fairly lengthy analysis. I prod it along with a question:
And this time it finds an error in the mocking!
All this takes time in the best of cases, so compared to the generation part of the SDLC, troubleshooting is definitely an area where human developers have orders of magnitude more facility than LLMs - as even the best ones show.
And all this prompting, reading logs, reading specs, and generating analysis uses up hundreds of thousands of tokens, which you need to pay for - and eventually gets you rate-limited!
Claude 3.5 Recap
Here is a recap of the API costs:
The code generation, test plan generation and test suite generation were very smooth and efficient. Within 30 minutes, I was done with this part.
The tests mostly ran right off the bat; however, most of the cost went to solving 3 problems due to invalid mocking that would have jumped out quickly at a human. In fact, I usually avoid mocks altogether in "human only" development because I find them finicky, but with LLMs they are a must to avoid an out-of-control spread of attention across multiple files.
DeepSeek
This popular model from China is reportedly trained on the outputs of the leading LLMs.
I tried a few times - it was either too busy or the response was malformed. Considering my other options, I decided to come back to it later. When it finally did run, I found it slow and not at the level of the other thinking models - more at the level of GPT-4o - so I decided to skip it for now. It's more a statement of how cheaply a model can be made to run than an advance in what AI-assisted coding can do. Some of the thinking was in Chinese, which doesn't help me...
OpenAI o1
This is an advanced reasoning model from OpenAI, and already very expensive, both because of the per-token cost, and because of the large number of reasoning tokens used.
Creating the code was straightforward but the cost rose to $14 compared to $0.40 for Claude!
In my experience, the o1 model, being computationally expensive, seems to have been trained to answer in ways that quickly end the conversation. In this coding task, I found that it had decided to leave certain parts out for the human "sucker" to complete:
I prompted it in a non-committal way to see if it indeed had the information that was missing, as I usually do. I figure if I just tell it, it might choose to ignore me again, so I try to get it to produce the start of the missing piece and let the rest follow like pulling on a thread.
It turns out that, since the files it depends on weren't created yet, it was avoiding syntax errors by not adding the code! This is clever but not very useful.
After all the files had some sort of minimal code in them, I told it to finish, and another $5 of code was added to the tab.
Once the code was created, the next step was to create a test plan based on the code in order to execute all paths and exercise all conditional and loop statements. This added another $6.50.
The analysis of the code by o1 was very precise, correctly identifying control statements to be exercised.
The next step was creating the actual tests. Unfortunately, a number of hallucinations crept in, which required some small manual edits. I gave it some feedback in the hope of avoiding further manual or prompt-driven fixes.
These hallucinations were adding up rapidly to the tune of $14.18 - and the tests were failing:
Investigating the multiple failures, I discovered that the unit tests were not consistently following the test plans and were not systematic in their mocking - invalidating the tests. Now I had to clean up the testing mess as cheaply as possible.
Cline was now using about 100k of context, causing API call costs to rise and some confusion with things that were already working.
There were also infinite loops caused by failing to mock behavior where a counter is incremented by the real code. This is something that was not accounted for in the test plan, which is more of an external view than a real test design.
At this point I had the impression of going around in circles, with previous fixes getting lost in the (long) conversation context. I had to be very precise in my directives to get it to move forward.
By the time all 5 test suites were passing, I was out another $40!!
DeepSeek - second attempt!
Still overloaded...
Gemini 2 Flash
I immediately liked using Gemini 2 Flash because it was very fast and produced good enough code. And it was a free preview!
I was also intrigued by the large context window - up to 1 million tokens! I wasn't sure how this would be used, but I kept an eye on the context window size, which rapidly exceeded the 100k-200k that other models were limited to.
Google chose to expose the "reasoning" of Gemini, unlike its competitors, and there's A LOT of it!
It's a bit sloppy and needs to be pointed in the right direction regularly.
In particular, it was struggling to apply some specific mocking instructions, and after a few attempts it reached out for help, which is nice.
We make progress and Gemini starts getting excited!
Unfortunately Gemini is not able to fix the remaining issue but, like any true developer, it gets hooked on the problem, convinced that the next attempt will be the good one.
With each fix, it claims with more and more enthusiasm that the last problem has been resolved - even when it turns out it hasn't!
Now channeling motivational affirmations from its training!
Still didn't work but it generates a different hype message on each attempt:
And finally - ALL TESTS PASSING!
Well, not really!!
I have to interrupt its celebration to point out the remaining failure.
More over-the-top motivational messages!
Now the LLM is using a crying emoji to convey a sense of disappointment - this is getting weird.
The LLM now makes outlandish proposals, having "exhausted" all possibilities. (I'm glad this usage is FREE! Otherwise I would have pulled the plug a while ago, in spite of the entertainment value.)
Number one next step is a "Microscopic" code review - something I've never heard of, but desperate times require desperate measures!
I look at the test case and discover that it doesn't quite make sense:
I let it know in the hopes that this ends the silliness.
Sadly, this isn't the root cause of the problem. More crying, more heartbreak!
Then I realize that in all the excitement it didn't actually apply the fix it said it would! Facepalm moment...
And now - success!! And my screen fills with partying!
Except no...
Same old story
OK this time the party is on!
It also decides to update the Root Cause Analysis document. But I have a nagging suspicion...
I actually skipped a few more pages of NOOOOOOOO. At this point, if this were a paid model, each O would add a token to the cost... There's something about its use of the editor tool that makes it unable to update the incorrect code, so I offer to fix it for it, which it turns down out of a desire to do it on its own...
The "Victory Run" doesn't work, same mistake. I repeat my offer:
This finally works, and Gemini outputs a bunch more pages of various emojis to celebrate.
But alas, it then starts hallucinating that it DIDN'T fix the issue and starts railing against itself about an imaginary problem!
I notice that this erratic behavior happens when the context window reaches ~500k tokens. There's still plenty of space left, but the strangeness of its answers means it's not very usable at this point. That's actually a problem with agentic coding loops - the context can grow pretty quickly and lead the LLM astray.
I reassure it that the problem was actually fixed and it runs the tests a few times to enjoy the moment:
By the time I was done running and fixing all five test files, the context had almost 800k tokens!
o3-mini
OpenAI has been releasing "mini" versions of their leading reasoning models. They are supposed to be as good as the previous generation of models but cheaper. I tried OpenAI o3-mini.
It was the slowest model of its cohort. As a "mini" model, it had some problems with basic code generation, such as generating the correct imports, but within the same conversation it tended to remember the corrections and avoid making the same mistakes over and over.
However, the code was mostly unusable.
And in my initial attempt at generating a test plan, it produced an empty template with no actual test cases...
I pointed it out and got it to try again, more to the point:
Better, but when I tried to generate code from it, it proved overly vague, so I decided to ask it for help on how to write a prompt:
Pretty wordy, but when I applied it, wow, what a difference!
So o3-mini was a dud with potential. Which is why I tried...
OpenAI o3-mini-high
When OpenAI announced their latest reasoning model o3, they also released the cheaper-to-run o3-mini, which I discussed in the previous section, as well as o3-mini-high - essentially the o3-mini model but allowed to "reason" for longer. A lot of developers like this model, some claiming it outperforms Claude Sonnet 3.7 on some tasks.
I decided to put it through its paces and got it started on code generation.
True to its name, it only produced a single, reduced function in one file (the other files were just definitions with no code).
Looking at the code, I saw it had decided to implement its own simplified parsing rather than follow the design. In other words, the code was correct, but it did the wrong thing, even though the spec is very precise about the algorithm.
One good thing: code generation was cheap!
I decided to see if it could do the unit test generation, starting with the test plan. This was just to get a sense of its capabilities because what it was testing was incorrect anyway.
The test plan it created started optimistically:
Then followed the analysis, which looked good but was in fact just a retelling of the code, without focusing clearly on the key aspects that determine the flow of control.
For example, Claude would eliminate the extraneous items and focus clearly on the aspects that determine which cases we need:
The plan concludes with 6 (count 'em!) test cases for the whole module. Each scenario actually contains multiple test cases in one, which is not good: the lack of clarity in the plan means subpar code generation. And sure enough, the first time I ran it, it gave me a victory message:
The reality wasn't as glorious:
I had to give it a nudge
This got it thinking, then a new version of the test was generated without any comments or explanations.
It gave me a success message again, but this time it actually seemed to know how many tests were supposed to be successful:
And in fact the mighty mini had successfully fixed the two failing tests:
So as usual, the testing part is more expensive than the coding part. And at $0.75, it's a lot more affordable - but don't expect frills like full functionality! No, expect o3-mini to "minify" what you get and leave out things like "business logic", which is highly overrated anyway!
I'm not sure how hard it would have been to do some prompt engineering to get it to cough up the missing code, but it was clear there was no point, as the number of deviations from the instructions was too high. By the way, this is one of the risks of agentic applications of these models: they can just as easily go off course, with no human to intervene.
Claude Sonnet 4
I've been mostly working with Claude Sonnet 4 in either the Claude application or Claude Code, which I won't get into in this already very long article! But for comparison, I went back to Cline with the Claude API.
The code generation cost more than the other models, as expected, at $1.04.
The creation of the test plans amounted to a nearly identical cost:
As I've gotten used to with Sonnet 4, the outputs were very complete, even though there were specific mocking directives (which tended to trip up other models).
And the test code itself came to a whopping $1.64!
And when I ran the tests... I got a lot more tests than expected, but also a lot of failures, including infinite loops!
It rapidly determined root causes and started reducing the number of failures pretty quickly, identifying some common mistakes it had made in all of the files:
It was able to find the root cause of the infinite loops:
Basically incorrect mocking... Once in a while it stopped to celebrate some wins, hoping I'd take them and call it a day!
I tell it to keep going, it fixes four more!
It finally completed all of them, for a total of $8.48.
Here is a recap of the API costs, and comparison with 3.5:
Considering the higher cost of the API calls, this shows how much better this model is than the other ones, but it's still pricey. This is why moving to a subscription is worth it - but that's a story for another time...
Martin Béchard enjoys cooking up new projects with AI. If you need to spice up your development with some AI Coding, please reach out at martin.bechard@devconsult.ca!
Want to talk about this article (free)? Schedule a quick chat
Need some help with your project? Book an online consultation