Why I'm Betting on LLMs for UI Testing

[Cross-post of my original 6/29/25 blog post, since that blog may sometimes be behind a paywall.]

Right now, we have two GenAI camps — the All-In-Folks and the Skeptics. Some people view GenAI as the solution to every problem in the world; others are absolutely refusing to use it or trust it and are clinging to the traditional ways of doing things. I tend to be a pragmatic dude, somewhere in the middle. I am excited to use it (every day), but I also try to understand the limitations and be realistic. The more I use it, the more excited I get about its future. Like any other new technology, I try to look ahead — a lot of the current limitations will eventually be worked out. Skate to where the puck is going, not where it has been.

For all the time we spend talking about GenAI generating code, we spend very little time talking about GenAI generating and executing tests. I believe LLMs are much better at the latter than the former.

Specifically, I have been playing with GenAI for UI testing. There’s much literature about GenAI for generating unit tests, but not nearly as much for integration and end-to-end testing. I had theorized that LLMs would be pretty good at that.

The reason UI Testing is business critical is that it tends to be the last line of defense before something is pushed to production. If your test process follows the Test Pyramid, you'll have a healthy amount of unit tests, followed by integration tests. But the UI tends to be the place where all functionality comes together, and it's often where you find those pesky problems that slipped through the cracks of unit and integration tests.

My first exposure to UI Testing was in 1997. I was working at Microsoft, testing Microsoft Office 2000 at the time. Two things I immediately noticed: the frameworks for driving UIs were clunky, and the tests themselves were brittle. Not much has changed in this space in almost 30 years. Recently, as I've been leading testing for the Amazon Store, this has been on my mind a lot.

There are lots of inputs you could feed an LLM to have it generate tests: specs, design docs, production code, pre-existing tests, code coverage information (so that it can aim at generating code that tests paths that haven't been exercised yet), operational issues (to give it some hints as to where problems may lurk), etc.

The main question for me was: in what format should those auto-generated integration tests be?

  • Path #1 is having the LLM generate tests as code (e.g., Selenium, Appium), then executing those tests as usual.
  • Path #2 is taking a bigger leap: what if we used an LLM to generate the tests as natural language, then used another LLM to execute them? (A rough sketch of both formats follows below.)
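
To make the contrast concrete, here's a rough sketch of the same test in both formats. The code half uses the standard Selenium Python bindings, but the URL, element IDs, and the wording of the natural-language steps are all made up for illustration; nothing here is tied to a real page.

```python
# Path #1: the test as code (Selenium, Python bindings). The URL and element
# IDs below are hypothetical; a real page would have its own selectors.
from selenium import webdriver
from selenium.webdriver.common.by import By

def test_add_book_to_cart():
    driver = webdriver.Chrome()
    try:
        driver.get("https://store.example.com")
        driver.find_element(By.ID, "search-box").send_keys("Harry Potter book")
        driver.find_element(By.ID, "search-submit").click()
        driver.find_element(By.CSS_SELECTOR, "[data-testid='result-0']").click()
        driver.find_element(By.ID, "add-to-cart-button").click()
        assert "1" in driver.find_element(By.ID, "cart-count").text
    finally:
        driver.quit()

# Path #2: the same test as natural language, handed to an LLM-based executor
# instead of a selector-driven framework.
NATURAL_LANGUAGE_TEST = """
1. Search for "Harry Potter book".
2. Open the matching book in the search results.
3. Add it to the cart.
4. Verify the cart now contains the book.
"""
```

The code version spells out the "how" down to individual selectors; the natural-language version states intent and leaves the "how" to whoever, or whatever, executes it.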


There are pros and cons to both paths.

  • Using LLMs to generate test code means you can inspect the code, and once it’s generated it becomes "deterministic" — meaning you know exactly what gets executed on every single test run. This is within people’s comfort zone. [Not to be pedantic but I would challenge that if you are testing a distributed system with retries, your tests are already not deterministic, but I digress…]
  • But if we're going to live in a world where both LLMs and humans write integration tests, it's much easier for a human to write a test in natural language than to write it using a traditional UI framework like Selenium or Appium. So test authoring in natural language wins.
  • Generating tests as code in a traditional framework also means that whoever is inspecting that code needs to be knowledgeable in that particular framework. In contrast, anybody can inspect a natural language test and immediately understand what it’s doing. We all speak natural languages!
  • Generating tests as code means you are now on the hook for maintaining and evolving that code for the rest of your life, like any other piece of code. Software upgrade? Your problem. Production code changes, so you have to change the test code to reflect that? Also your problem.
  • Most interestingly though, generating tests as natural language and having them executed by an LLM creates an opportunity to think differently. UI tests are notoriously flaky. Two of the reasons are: [1] unexpected message boxes can pop up randomly; [2] tests often drive a UI by navigating the DOM and looking for element IDs, and if those change, the automation doesn't know what to do. Both of those are easy for an LLM to handle.

To illustrate this, I wrote a simple UI test and fed it to an LLM for execution. As I walk you through this example, I encourage you to focus not just on the Action the LLM took, but on its Reasoning. The thinking bubbles in the screenshots I captured were verbatim the contents of the LLM's reasoning field, which gives you great insight into how LLMs reason through their world.
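
To give a sense of the mechanics, here's a simplified sketch of the execute-by-LLM pattern (this is not my actual harness): on every turn the model gets the test, the current screenshot, and the history so far, and returns its reasoning plus a proposed next action. The llm_client call and the JSON shape are hypothetical stand-ins for whatever model API you wire in.

```python
# Simplified, hypothetical sketch of one executor turn: test + screenshot +
# history go in; a reasoning string and the next UI action come out.
import json

NATURAL_LANGUAGE_TEST = """
Search for "Harry Potter book", open the matching book in the results,
add it to the cart, and verify the cart contains it.
"""

def propose_next_action(llm_client, screenshot_png: bytes, history: list[dict]) -> dict:
    """Ask the model for its reasoning and the next action to take."""
    prompt = (
        "You are executing a UI test.\n"
        f"Test: {NATURAL_LANGUAGE_TEST}\n"
        f"Actions taken so far: {json.dumps(history)}\n"
        "Look at the attached screenshot and reply with JSON of the form "
        '{"reasoning": "...", "action": {"kind": "type|click|scroll|verify|done", '
        '"target": "...", "value": "..."}}'
    )
    reply = llm_client.complete(prompt=prompt, image=screenshot_png)  # hypothetical client
    return json.loads(reply)  # the "reasoning" field is what the thinking bubbles show
```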


Step 1: It typed “Harry Potter book” into the search box.


Step 2: It clicked on the Search button.


Step 3: It verified that the right book was present in the search results.


Step 4: It clicked on the correct book. Amusingly, it did that while trying to save me money, since it picked that version because “it was the most economical option!”


Step 5: This is where it goes off the rails a bit. The “Add to cart” button is below the fold here, so it's not visible in the current screenshot. The bot found an “Add Prime to get fast, free delivery” button and theorized that clicking it would add the book to the cart. It was not correct, but to be fair, it was not a terrible guess; the bot was trying to find a path.


Step 6: It figured out on its own that that was not correct, and theorized that if it scrolled down it would find the right button, which was correct.


Step 7: It found the “Add to Cart” button, it’s happy now!


Step 8: It's validating that it added the book to the cart. Notice it validates in four different ways, which is better than how I would have validated if I were writing this test!

  • Verification #1: Cart subtotal shows the right amount of money.
  • Verification #2: The “Added to cart” confirmation message is displayed.
  • Verification #3: Cart icon shows 1 item.
  • Verification #4: The Harry Potter book is visible in the cart preview.

This simple test illustrates the power of LLMs in the domain of UI Testing.

The elephant in the room is non-determinism. How do you guarantee that the LLM takes the correct path navigating through your app as it is testing it? In my little example, the bot clicking on the “Add Prime” button was a hint that when navigating a UI, a bot could definitely get sidetracked.

More amusingly, in another execution, the bot couldn’t login with the given username and password, so it attempted to create a brand new account all on its own. When that too failed, it actually attempted to chat with customer service to work things out. This is actually really impressive — that was one determined little bot!

To address that, one idea we've been kicking around with some friends is giving each test an “execution budget.” It took 8 steps to execute my little example, so if a bot is still trying to accomplish the task after 20 steps, it has probably veered off course and is doing something it isn't supposed to be doing. Tests could have a budget as a guardrail.
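
As a minimal sketch (the names and the multiplier are illustrative, not a real implementation), the budget could simply be a multiple of the expected step count, with the runner aborting once it's exhausted:

```python
# Minimal sketch of an execution-budget guardrail: the expected step count is
# part of the test definition, and the run is aborted once the bot has burned
# through a multiple of it. The 2.5x multiplier is an arbitrary example.
class BudgetExceeded(Exception):
    pass

def run_with_budget(test, executor_step, expected_steps: int, multiplier: float = 2.5):
    budget = int(expected_steps * multiplier)    # e.g. 8 expected steps -> 20 allowed
    for step_number in range(1, budget + 1):
        done = executor_step(test, step_number)  # one screenshot -> reasoning -> action turn
        if done:
            return step_number                   # finished within budget
    raise BudgetExceeded(f"{test!r} still running after {budget} steps; likely off course")
```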

More broadly, as we shift more towards depending on LLMs for test execution, we need to spend a lot more time thinking about guardrails: what actions should the bot simply never take?

Another idea is to have a “judge LLM” that analyzes the steps the “executing LLM” took, along with its reasoning, and decides whether the run was correct. I see the LLM-as-a-Judge pattern being used more and more these days across a lot of tasks.
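
A sketch of what that judge pass could look like, assuming we keep the per-step reasoning/action trace from the executor; the prompt wording and the complete_json helper are made up for illustration:

```python
# Hypothetical LLM-as-a-Judge pass: a second model reviews the executor's
# trace and renders a verdict, independent of the executor's own opinion.
def judge_test_run(judge_llm, test_text: str, trace: list[dict]) -> dict:
    prompt = (
        "You are reviewing a UI test that was executed by another agent.\n"
        f"Test intent: {test_text}\n"
        f"Trace of (reasoning, action) pairs: {trace}\n"
        "Did the agent actually accomplish the test intent without taking "
        "irrelevant or forbidden actions (creating accounts, contacting "
        "customer service, etc.)? Reply with JSON of the form "
        '{"verdict": "pass|fail|suspicious", "explanation": "..."}'
    )
    return judge_llm.complete_json(prompt)  # hypothetical client helper
```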

We also have the ability to gather data objectively. We have hundreds of thousands of legacy tests written in frameworks like Appium, Selenium, etc. Could we auto-generate a natural language test suite with the exact same tests, and run both suites in parallel for a while, comparing their results? Ideally, at some point we would have concrete evidence that we can replace the traditional tests with natural language tests.
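
The comparison itself could be mechanically simple; here's a rough sketch, with the two runner functions standing in for whatever test infrastructure actually executes each suite:

```python
# Sketch of the side-by-side experiment: run the legacy suite and the
# generated natural-language suite against the same build and diff verdicts.
def compare_suites(test_ids, run_legacy_test, run_nl_test):
    disagreements = []
    for test_id in test_ids:
        legacy_passed = run_legacy_test(test_id)  # e.g. Selenium/Appium verdict
        nl_passed = run_nl_test(test_id)          # LLM-executed verdict
        if legacy_passed != nl_passed:
            disagreements.append((test_id, legacy_passed, nl_passed))
    agreement = 1 - len(disagreements) / max(len(test_ids), 1)
    return agreement, disagreements  # evidence for (or against) switching over
```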

Will it also replace human testers? I don't like to use the word "replace," but it will definitely shift the work they do. In some orgs, we employ armies of manual testers who perform the same repetitive tasks to certify every release candidate. Sometimes that's because writing automation is too expensive; sometimes because the product changes too quickly to even write test automation; sometimes it's a lack of forward thinking and an unwillingness to invest in engineering excellence. But any situation where a release is gated by a human being is not scalable or sustainable. It is also non-deterministic... humans are notoriously non-deterministic and make mistakes too. I would like GenAI to perform those repetitive tests, so that these testers can focus on applying their intuition and hard-earned knowledge of how products fail, exploring the surface more freely and more creatively.

One more consideration is that LLMs are slower to run these tests than traditional automation, and GPUs are expensive. So bots lose in both latency and cost today. But that will not always be the case. We're looking into ways to cache things, to process in different ways, etc, to tackle both of those. I don't want to wait until latency and cost are fully solved to take a bet on this, because if we do that, we'll simply get started way too late. Another way of thinking about this is that the savings in human authoring and maintenance are worth the latency/cost even as of right now.
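
To be clear about what "cache things" might mean, here's one purely illustrative flavor of the idea (not a description of how we actually do it): remember the action the model chose for a given test step and page fingerprint, and only pay for a model call when the page has meaningfully changed.

```python
# Illustrative caching sketch: reuse a previously chosen action when the same
# test step sees the same page fingerprint, and fall back to the LLM otherwise.
import hashlib

_action_cache: dict[tuple, dict] = {}

def cached_propose_action(test_id: str, step: int, page_source: str, propose_with_llm):
    fingerprint = hashlib.sha256(page_source.encode()).hexdigest()
    key = (test_id, step, fingerprint)
    if key not in _action_cache:
        _action_cache[key] = propose_with_llm()  # slow, GPU-backed path
    return _action_cache[key]                    # cheap path on repeat runs
```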

Lastly, LLM-based test execution opens doors for us to do things we simply couldn't do before. A fascinating whitepaper from some researchers at Amazon describes how they created agents with diverse personas and goals for testing purposes. You can have a test with a set of instructions that can be interpreted differently depending on the persona, which allows you to discover bugs that you may not otherwise find. For example, what if one of your personas was blind? We don't need to write special tests to validate whether the Amazon Store is accessible to people with disabilities; we can simply run all our tests with personas that reflect different disabilities. This detaches the test steps from the execution behavior, which is something we could never truly do before. We can add all kinds of non-functional guardrails to our already existing functional tests.
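
One way to picture that detachment (the persona wording and the runner function below are made up for illustration): the same natural-language test runs once per persona, and only the persona text changes how the executor interprets the steps.

```python
# Illustrative persona-parameterized run: same test steps, different personas,
# so accessibility (and other non-functional) checks ride along for free.
PERSONAS = {
    "default": "You are a typical shopper using a mouse and a 1080p display.",
    "blind": "You are blind and navigate only with a screen reader and keyboard; "
             "you cannot act on anything that lacks an accessible name.",
    "low_vision": "You browse at 400% zoom and cannot read small, low-contrast text.",
}

def run_test_for_all_personas(test_text: str, run_nl_test):
    results = {}
    for name, persona in PERSONAS.items():
        # The steps don't change; only how the executor is told to behave does.
        results[name] = run_nl_test(test_text, persona=persona)
    return results
```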

To me, the advancement of LLM-based UI-testing is not a matter of IF, it’s a matter of WHEN.

My first experience testing professional software was working at Microsoft in 1997. I installed a veeeery early debug build of Microsoft Office 2000. Excitedly, I double-clicked on the Word icon. The hard-drive frantically spun for 5 minutes, then I got a message box that it had crashed. The thing didn’t even start. It was unimpressive. The next day, I installed that morning’s build, and it opened, but crashed within 3 minutes of me playing with the app. Yet somehow, a couple of years later, we shipped that codebase (much improved) to millions of households. It took a lot of very determined engineers to try things out, learn, iterate, improve, relentlessly. Today’s world is no different.

I am both excited and terrified. But I think it’s time to jump in.


Andrew Spina

Software Engineer at Amazon


I'm expecting AI to do a better job of automatically diagnosing failures too. It could have knowledge of the underlying systems, and when it sees a failure (the beta tests in Listings can be flaky due to a deep stack) it could trace the issue and provide a few possible explanations (a bad change in a dependency's pipeline, an LSE, etc.). I'd also like to see such tests generated and run even before program development starts. This would allow us to use the red-green cycle for a whole project and include it as part of development progress. All too often UX development is done last, which leads to late discoveries ("What happens when the customer checks *both* checkboxes?") that extend timelines.


I loved reading this! I have been looking into similar concepts. My thought was: what if an LLM could be trained to test based on real customer usage? Have an agent monitor real customers and then develop test steps based on the common patterns customers follow. One of the hardest things is to test like a real customer; they often do things differently and not the way we would expect.

Carlos Guzman

Sr. Manager of Software Development. Creating delightful experiences for book lovers everywhere!


Couple notes: 1. There is a third camp when it comes to AI usage: the skeptics who can be convinced when they see something working. I think this is the biggest bucket at Amazon, actually, and it's great to see the click in their brains as they jump into the all-in boat! 2. Excited and Terrified is right, my friend... :) Thanks for the Deep Dive, which hopefully will convert a few more skeptics (it's real, software folks; I use it often!).

Wes McDaniel

Principal Quality Assurance Engineer at Amazon


I think one of the biggest issues QA faces is flaky test automation. We can now finally fix it, thanks to LLMs, without constant, massive KTLO. Huge win!
