How LLMs Use Monte Carlo Tree Search to Improve Their Own Thinking
The paper Planning of Heuristics, published on 17th February, 2025 by Chaoxu Mu, Xufeng Zhang and Hui Wang, talks about something really fascinating! The authors introduce an interesting idea: pairing a self-reflection-driven LLM with Monte Carlo Tree Search. I know it sounds complicated, but don't worry, this article will help you deeply understand (or at least try to) what the paper is all about!
Basic Terminologies
So let's get our basics down and understand the key terms of the paper before we dive deeper.
1) Combinatorial Optimization Problems (COPs) :- These are problems where we try to find the best solution from a huge number of possible combinations. Imagine trying to deliver 100 packages all over a big city; going at it randomly would be a waste of time, but having a method, or heuristic, lets you find a good route much faster.
2) Heuristics :- A heuristic is a rule of thumb that guides decision making instead of leaving it to blind choices. Algorithms like A* Search or Greedy Best-First Search each have a heuristic attached to them, which is why they deal with the same problem in different ways. In layman's terms, it is a "shortcut" that helps the program make better decisions.
3) Monte Carlo Tree Search (MCTS) :- It is a method for making decisions by simulating many possible outcomes and using the results to judge which choice fits best. Imagine how a chess engine like AlphaZero works: it explores promising moves for black and white, runs simulations to see how good each move is, and updates its rankings as it learns. MCTS has 4 main steps (a small code sketch follows this list) :-
Selection - Picking the most promising path explored so far.
Expansion - Extending that path with a new, untried move.
Simulation - Playing out how the new path might unfold.
Backpropagation - Feeding the result back up the tree so earlier choices remember how the path performed.
4) Optimization :- It is simply about making something work as well as possible with the least cost or effort, like how a washing machine saves time and energy compared to washing clothes by hand.
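To make those four steps concrete, here is a minimal sketch of MCTS building a tour for a tiny Traveling Salesman instance (a problem we will meet again below). This is purely my own illustration, not code from the paper; the city coordinates, the Node class and the ucb helper are all invented for the example.

```python
import math
import random

# Toy city coordinates; a real instance would have many more cities.
CITIES = {0: (0, 0), 1: (2, 5), 2: (6, 1), 3: (5, 6)}

def dist(a, b):
    (x1, y1), (x2, y2) = CITIES[a], CITIES[b]
    return math.hypot(x1 - x2, y1 - y2)

def tour_length(tour):
    # Length of the full round trip, returning to the starting city.
    return sum(dist(tour[i], tour[(i + 1) % len(tour)]) for i in range(len(tour)))

class Node:
    def __init__(self, tour, parent=None):
        self.tour = tour          # the partial tour built so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.reward = 0.0         # accumulated (negative) tour lengths

    def untried(self):
        return [c for c in CITIES if c not in self.tour]

def ucb(child, parent):
    # Upper Confidence Bound: trades off exploiting good paths vs exploring new ones.
    return (child.reward / child.visits
            + math.sqrt(2 * math.log(parent.visits) / child.visits))

def mcts(iterations=500):
    root = Node([0])
    for _ in range(iterations):
        node = root
        # 1) Selection: descend through the most promising explored paths.
        while not node.untried() and node.children:
            node = max(node.children, key=lambda c: ucb(c, node))
        # 2) Expansion: try one new, untried move from this node.
        if node.untried():
            child = Node(node.tour + [random.choice(node.untried())], parent=node)
            node.children.append(child)
            node = child
        # 3) Simulation: finish the tour randomly and score the result.
        rest = node.untried()
        random.shuffle(rest)
        reward = -tour_length(node.tour + rest)   # shorter tour => higher reward
        # 4) Backpropagation: pass the score back up so ancestors remember it.
        while node:
            node.visits += 1
            node.reward += reward
            node = node.parent
    # Read off the most-visited path; with enough iterations it is a full tour.
    node = root
    while node.children:
        node = max(node.children, key=lambda c: c.visits)
    return node.tour, round(tour_length(node.tour), 2)

print(mcts())
```

Each iteration runs Selection, Expansion, Simulation and Backpropagation exactly once, so over many iterations the tree concentrates its visits on the shorter tours.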
What is the paper about?
The paper introduces a new idea called Planning of Heuristics (PoH), which uses Large Language Models (LLMs) and Monte Carlo Tree Search (MCTS) to automatically improve heuristics. Now let's see how this actually works :-
1) Generating Heuristics :- First, the system starts with a heuristic, a decision-making shortcut for the problem at hand. Imagine you want to travel to point A; your heuristic might be to always take the most direct path towards it.
2) Evaluating Heuristics :- Next, this heuristic is checked to see how it fares in simulations run by MCTS, which test how it would perform in different scenarios.
3) Refining Heuristics :- Based on the results, the system refines its heuristic to make it as efficient as possible. For example, if the direct path to point A turns out to take more time than an indirect one, the heuristic is updated to prefer the indirect path. (A rough sketch of this loop follows below.)
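Putting the three steps together, here is a rough sketch of what such a generate-evaluate-refine loop could look like. Everything here is a placeholder of my own: llm() stands in for any chat-completion API, evaluate() for a benchmark runner, and my loop greedily keeps the best candidate, whereas the paper uses MCTS to decide which refinement branch to explore.

```python
# A rough sketch of the generate -> evaluate -> refine idea behind PoH.
# The prompts and structure are my guesses at the concept, not the
# paper's exact method.

def llm(prompt: str) -> str:
    """Placeholder: call your favourite LLM API here."""
    raise NotImplementedError

def evaluate(heuristic_code: str, instances) -> float:
    """Placeholder: run the heuristic on benchmark instances and
    return its average solution cost (lower is better)."""
    raise NotImplementedError

def plan_heuristics(instances, rounds=5):
    heuristic = llm("Write a heuristic, as Python code, for this problem: ...")
    best, best_cost = heuristic, evaluate(heuristic, instances)
    for _ in range(rounds):
        # Self-reflection: the LLM critiques its own heuristic...
        reflection = llm(f"This heuristic scored {best_cost}. "
                         f"What are its weaknesses?\n{best}")
        # ...and proposes a refined version based on that critique.
        candidate = llm(f"Improve the heuristic using this feedback:\n"
                        f"{reflection}\n{best}")
        cost = evaluate(candidate, instances)
        if cost < best_cost:            # keep the refinement only if it helps
            best, best_cost = candidate, cost
    return best
```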
What's the importance?
Traditionally, heuristics have been handcrafted, which takes time and may not always produce the best solution. What this paper does with the introduction of PoH is automate that improvement: the system keeps refining the heuristic on its own, without human input, learning from itself instead.
Key Findings of the Paper
The paper tests PoH on two classical problems :-
Traveling Salesman Problem (TSP) :- Imagine you're a salesperson trying to visit several cities while travelling the least total distance.
Flow Shop Scheduling Problem (FSSP) :- Imagine scheduling jobs that must pass through several machines in the same order, arranging them so the total completion time is as small as possible (as sketched below).
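To see what FSSP actually asks, here is a tiny sketch with made-up processing times: three jobs must pass through two machines in the same order, and we brute-force the job order with the smallest makespan (total completion time). The numbers and helper names are mine, not from the paper.

```python
from itertools import permutations

# Toy FSSP instance: (time on machine 1, time on machine 2) for each job.
jobs = [(3, 2), (2, 4), (4, 1)]

def makespan(order):
    m1_done = m2_done = 0
    for j in order:
        t1, t2 = jobs[j]
        m1_done += t1                          # machine 1 runs jobs back to back
        m2_done = max(m2_done, m1_done) + t2   # machine 2 waits for the job and for itself
    return m2_done

# With only 3 jobs we can try every ordering; heuristics matter when we can't.
best = min(permutations(range(len(jobs))), key=makespan)
print(best, makespan(best))
```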
PoH improves upon previous methods by reducing the optimality gap (how far the system's solution is from the best-known solution) from 0.483% to 0.233% on these problems. In other words, it gets measurably closer to the optimal answers. (The small example below shows how such a gap is computed.)
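For clarity, here is how an optimality gap is computed. The instance, tours and resulting percentage are toy values of my own, chosen only to show the formula; they are not the paper's numbers.

```python
import itertools
import math

# Same tiny TSP instance as earlier in the article.
CITIES = {0: (0, 0), 1: (2, 5), 2: (6, 1), 3: (5, 6)}

def dist(a, b):
    (x1, y1), (x2, y2) = CITIES[a], CITIES[b]
    return math.hypot(x1 - x2, y1 - y2)

def tour_length(tour):
    return sum(dist(tour[i], tour[(i + 1) % len(tour)]) for i in range(len(tour)))

# Brute force: with 4 cities we can simply try every ordering.
optimal = min(tour_length([0] + list(p)) for p in itertools.permutations([1, 2, 3]))

# Suppose some heuristic produced this (possibly suboptimal) tour.
found = tour_length([0, 2, 1, 3])

# Optimality gap: how far the found solution is from the optimum, in %.
gap = (found - optimal) / optimal * 100
print(f"optimal={optimal:.2f}, found={found:.2f}, gap={gap:.2f}%")
```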
Key Takeaways from the Paper
The paper combines two things: Large Language Models, which understand and generate language, and Monte Carlo Tree Search, which makes decisions through simulations. Combined, they can automatically improve heuristics for solving optimization problems.
It also focuses on Self-Reflection in LLMs, meaning the model evaluates what it has done and, on that basis, deduces whether its choice was optimal. This approach can be applied to a wide range of optimization problems where finding an optimal solution is difficult or time-consuming.
Final Thoughts
Planning of Heuristics is a powerful example of how LLMs can evolve from mere language tools into genuine problem solvers. By teaching them to plan, reflect and optimize, we can apply them to real-world scenarios for better results.