Enabling intelligent search for AI agents
At Asari AI, we are building AI systems that will co-invent the future. As a part of this mission, we are exploring new paradigms that rethink how AI agents search through vast possibility spaces and reason toward novel discoveries, enabling them to find solutions to complex problems more reliably and efficiently.
Together with MIT researchers, we developed EnCompass, a flexible framework for AI agents that separates workflow logic from search strategy. EnCompass will be presented at NeurIPS 2025 and represents an important step in building AI that can plan, explore, and discover new approaches to meaningful real-world challenges.
Search complexity in multi-step reasoning
Our work is focused on a future in which AI agents can autonomously work through multi-step tasks like designing engineering projects or building layered, interconnected systems. Such workflows require chaining together hundreds or more reasoning steps and interactions with the environment. This creates two fundamental challenges.
First, the possible paths that the agent could take grow exponentially. Most AI agents use large neural networks, including non-deterministic large language models (LLMs), which don't always give the same answer to the same question. As agents reason for longer, the number of potential paths explodes. Since agents can't feasibly evaluate every option, they need efficient search strategies to find good solutions.
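To make the explosion concrete, here is a toy calculation (illustrative only, not from the paper): if each step has a handful of plausible completions, the number of distinct execution paths grows exponentially with depth.

```python
# Toy illustration: with b plausible completions per step, the number of
# distinct execution paths after d steps grows as b ** d.
def num_paths(branching: int, depth: int) -> int:
    return branching ** depth

print(num_paths(3, 10))  # 59049 paths after only 10 steps
print(num_paths(3, 50))  # ~7.2e23 paths after 50 steps
```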
Second, errors compound. Each step adds uncertainty, and small errors accumulate quickly. When each step depends on previous ones being correct, even a 95% success rate per step means overall success plummets to virtually zero.
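The arithmetic behind this claim is simple. Assuming steps succeed independently (a toy model, not a result from the paper), the chain's success probability is the product of the per-step rates:

```python
# Toy illustration: when every step must succeed and each succeeds
# independently with probability p, the whole chain succeeds with p ** n.
def chain_success(p: float, n: int) -> float:
    return p ** n

for n in (10, 50, 100):
    print(f"{n} steps: {chain_success(0.95, n):.2%}")
# 10 steps: 59.87%, 50 steps: 7.69%, 100 steps: 0.59%
```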
Addressing these challenges requires better tools for building AI agents that can search intelligently. EnCompass focuses specifically on "program-in-control" agents, which are systems where the agent's workflow is (in part) defined programmatically in code. In these systems, developers typically hard-code search strategies directly into each agent's workflow, wrapping parts of it in for/while loops to implement resampling or iterative refinement. This approach has serious limitations: strategies like resampling don't scale well as workflows get longer, while techniques like tree search over the agent's execution paths become practically impossible to implement. The tight coupling between workflow and search also makes strategies difficult to test, swap out, or improve.
EnCompass addresses this by separating agent workflow specification from search strategy. Developers can define what their agent does in clean, readable code, then apply varied search strategies without rewriting the workflow itself. Crucially, this enables search strategies that are more sophisticated and contextual than traditional approaches, which can leverage information across the entire execution history and adapt dynamically to the agent's progress.
Understanding cascading uncertainty
Imagine an AI agent trapped in an escape room. To escape, it must solve four puzzles in order: decode a cipher, solve a logic grid, answer a riddle, and enter a final combination lock. Each puzzle depends on information from previous ones.
The agent calls an LLM to solve each puzzle. After each puzzle attempt, an evaluation function provides feedback, either graded (e.g. fraction of cipher symbols decoded, number of grid cells correct) or binary (i.e. answer is right or wrong). These scores help the agent gauge progress, but don't guarantee later success.
Because LLMs are non-deterministic, they can produce answers that seem reasonable but are actually incorrect. The cipher might be decoded in a way that scores well but is fundamentally wrong. The logic grid then builds on this flawed foundation, managing to score 80% correct by chance or partial alignment. Only when the agent reaches the riddle, which gives clear pass/fail feedback rather than partial scores, does it become obvious that something went wrong earlier in the sequence. The agent must then backtrack to reconsider earlier decisions.
This cascading uncertainty is what makes multi-step reasoning difficult: early mistakes aren't immediately obvious, but they compound and eventually cause failure in downstream steps that depend on them.
Now imagine twenty puzzles in sequence, or fifty. Each puzzle depends on previous solutions, and each has multiple possible interpretations. The search space of possible paths multiplies rapidly. Since early mistakes can be obscured by plausible partial solutions, the agent must explore many different paths before finding one that works.
To see why this matters, let's look at the escape room solution in Python pseudocode:
```python
def solve_escape_room():
    cipher = solve_cipher()
    logic = solve_logic(cipher)
    riddle = solve_riddle(cipher, logic)
    code = solve_combination(cipher, logic, riddle)
    open_door(code)
```

This clearly expresses what the agent should do. However, finding the solution using non-deterministic LLMs is where traditional programming becomes cumbersome. You would have to write complicated loops to try each puzzle multiple times, manually track which attempts were successful, and hard-code rules for when and where to backtrack.
```python
def solve_escape_room():
    # Try the cipher multiple times
    cipher_solutions = []
    for attempt in range(N):
        candidate = llm.solve_cipher()
        score = evaluate_cipher(candidate)
        cipher_solutions.append((candidate, score))
    best_cipher, best_cipher_score = max(cipher_solutions, key=lambda x: x[1])

    # Now try the logic puzzle multiple times
    logic_solutions = []
    for attempt in range(N):
        candidate = llm.solve_logic(best_cipher)
        score = evaluate_logic(candidate)
        logic_solutions.append((candidate, score))
    best_logic, best_logic_score = max(logic_solutions, key=lambda x: x[1])

    if best_logic_score == 0:
        # Backtrack to attempt cipher again
        ...
```

This approach quickly becomes unmanageable as the agent grows complex. Each additional puzzle requires more convoluted search logic for resampling and backtracking. Moreover, the search logic becomes intertwined with the problem-solving code, making it nearly impossible to swap in different search strategies or parallelize exploration without rewriting core logic. The result is a brittle, monolithic system that resists iteration and experimentation, which is exactly what you don't want when building agents with non-deterministic LLMs.
Decoupling logic from search
EnCompass solves these problems by separating the two key aspects of agent programming: the underlying workflow and the search strategy. Instead of hard-coding search logic, you simply mark decision points with `branchpoint()` and evaluation points with `record_score()`.
Branchpoints identify non-deterministic actions you may want to revisit. For example, marking each puzzle as a branchpoint lets you attempt the cipher multiple times or return to it after the logic puzzle fails. Scores provide contextual information about which paths are worth exploring. Collectively, these annotations tell EnCompass where the agent can explore alternatives and how well each path performs, enabling complex search strategies without cluttering the code.
```python
@encompass.compile  # Python decorator for EnCompass functions
def solve_escape_room():
    branchpoint()
    cipher = solve_cipher()
    record_score(evaluate_cipher(cipher))

    branchpoint()
    logic = solve_logic(cipher)
    record_score(evaluate_logic(logic))

    branchpoint()
    riddle = solve_riddle(cipher, logic)
    record_score(evaluate_riddle(riddle))

    branchpoint()
    code = solve_combination(cipher, logic, riddle)
    success = open_door(code)
    record_score(success)
```
```python
@encompass.compile
def solve_escape_room():
    ...

solve_escape_room().search(
    "beam",
    beam_width=2, branching=3,
)
```

At runtime, you can tell EnCompass which search strategy to use, and it automatically manages the execution. For example, the animation above shows beam search using a beam width of 2 and a branching factor of 3. EnCompass samples 3 different cipher decodings at the first branchpoint, scores each of them, and keeps the top 2 options. At the logic puzzle, it generates 3 attempts for both cipher paths (6 total), scores them, and keeps only the 2 highest-scoring combinations.
When one path reaches the riddle and fails, the other paths continue exploring. EnCompass might find success through a cipher interpretation that scored slightly lower initially but leads to a correct riddle answer. Other search strategies may handle exploration differently. While beam search maintains multiple parallel paths, Monte-Carlo Tree Search (MCTS) repeatedly explores and expands the tree of future actions from a given decision point.
The key advantage with EnCompass is that you can swap between these distinct strategies without changing a single line of the core escape room code. The workflow stays clean and readable because the framework handles all the complex search logic behind the scenes.
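To make the mechanics concrete, here is a framework-free sketch of the beam search behavior described above. This is illustrative only, not EnCompass API: `sample` and `score` are hypothetical stand-ins for a non-deterministic LLM call and its evaluation function.

```python
import random

# Illustrative sketch of beam search over a sequence of steps.
# Each step is a (sample, score) pair: `sample` stands in for a
# non-deterministic LLM call, `score` for an evaluation function.
def beam_search(steps, beam_width=2, branching=3):
    beams = [[]]  # each beam is a partial path of step outputs
    for sample, score in steps:
        candidates = []
        for path in beams:
            for _ in range(branching):
                extended = path + [sample(path)]  # resample this step
                candidates.append((score(extended), extended))
        # prune: keep only the top-scoring partial paths
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = [path for _, path in candidates[:beam_width]]
    return beams[0]

# Toy demo: four steps, each sampling a number; score favors larger sums.
random.seed(0)
toy_steps = [(lambda path: random.random(), lambda path: sum(path))] * 4
best = beam_search(toy_steps, beam_width=2, branching=3)
```

With `beam_width=2` and `branching=3`, this reproduces the sample-3, keep-top-2 behavior described above; EnCompass's value is that you never write this loop yourself or entangle it with the workflow.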
Code translation case study
We applied EnCompass to several practical tasks to validate these improvements. Similar to the escape room example, these tasks all involved sequential steps with dependencies and uncertainty. Across our case studies, implementing search strategies at inference time (when the agent runs) required three to six times fewer lines of code changes with EnCompass compared to standard Python implementations.
| Case Study | LoC | Added Lines | Added Words | Changed Lines | Changed Words | Removed Lines | Removed Words | New Functions | Indent Changed |
|---|---|---|---|---|---|---|---|---|---|
| Code Repo Translation (Python) | 597 | 423 | 2735 | 24 | -62/+186 | 9 | 28 | 20 | 189 |
| Code Repo Translation (EnCompass) | | 75 | 514 | 8 | -0/+40 | 0 | 0 | 1 | 0 |
| Hypothesis Search (Python) | 11 | 21 | 120 | 3 | -1/+13 | 0 | 0 | 2 | 10 |
| Hypothesis Search (EnCompass) | | 8 | 27 | 1 | -0/+9 | 0 | 0 | 0 | 0 |
| Reflexion (Python) | 20 | 27 | 181 | 6 | -13/+31 | 0 | 0 | 2 | 8 |
| Reflexion (EnCompass) | | 9 | 32 | 3 | -4/+13 | 0 | 0 | 0 | 0 |
Beyond code efficiency, EnCompass enables clearer, more maintainable code. To see how, consider a simplified version of a Java-to-Python translation agent that iterates through source functions, translating and testing each one using an LLM.
Without EnCompass, one way to support general search strategies over this agent’s workflow is to use a state machine. The states track locations in the workflow, i.e., beginning to translate a function or beginning to test a function. Transitions between states implement steps of the workflow such as translating and testing a specific function in the codebase.
```python
class State(Enum):
    BEGIN_TRANSLATE = auto()
    BEGIN_UNIT_TEST = auto()

def step(state: State, frame: dict[str, Any]):
    frame = frame.copy()
    if state == State.BEGIN_TRANSLATE:
        frame["target_fn"] = translate(frame["source_fn"])
        compile_success = compile_(frame["target_fn"])
        return State.BEGIN_UNIT_TEST, frame, compile_success
    if state == State.BEGIN_UNIT_TEST:
        unit_test_score = run_unit_test(frame["target_fn"])
        frame["source_fn"] = next(frame["source"])
        return State.BEGIN_TRANSLATE, frame, unit_test_score
```

However, this approach has multiple drawbacks that prevent this style of coding from scaling to complex workflows.
- The order of execution of different parts of the program is no longer obvious from the code, especially when the workflow employs control-flow structures like if/else statements and loops.
- Using a dictionary `frame` that carries the state and real-time program data makes it difficult to know whether a data field exists at any point, leading to runtime `KeyError`s that are hard to debug.
- Because runtime variables are accessed through the `frame` dictionary, linters and static type checkers no longer work, making it impossible to check for type errors.
- Simple changes such as increasing or decreasing the step granularity require structural changes to the state machine, creating more opportunities for bugs.
```python
@encompass.compile
def translate_functions(source):
    for source_fn in source:
        branchpoint()
        target_fn = translate(source_fn)
        compile_success = compile(target_fn)
        record_score(compile_success)

        branchpoint()
        unit_test_score = run_unit_test(target_fn)
        record_score(unit_test_score)
```

EnCompass addresses these limitations by separating search from workflow logic. We added `branchpoint()` markers to the two key phases (translation and testing) where the agent calls the LLM and we want to explore different possible solutions. With the workflow unchanged, we experimented with multiple search strategies by simply adjusting a few search parameters:
- Global best-of-N: Run the entire translation N times, keep the best result
- Local best-of-N: Apply best-of-N at each step (per Java class or method)
- Beam search: A generalization of both approaches with configurable beam width and branching factor
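In plain Python (outside EnCompass), the two best-of-N variants look roughly like this. The names `run_workflow`, `run_step`, and `score` are hypothetical stand-ins, not functions from the paper:

```python
# Hypothetical stand-ins: run_workflow() executes the whole translation
# pipeline once; run_step(u) processes one unit (e.g. one method);
# score ranks candidate outputs.
def global_best_of_n(run_workflow, score, n):
    """Run the entire workflow n times and keep the best full run."""
    return max((run_workflow() for _ in range(n)), key=score)

def local_best_of_n(run_step, score, units, n):
    """At each step, sample n candidates and commit to the best before
    moving on: cheaper than global, but greedy and single-path."""
    return [max((run_step(u) for _ in range(n)), key=score) for u in units]
```

Roughly speaking, beam search with a beam width of 1 behaves like local best-of-N, while widening the beam recovers some of global best-of-N's ability to escape locally optimal choices.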
Beam search bridges the gap between local and global strategies. By sampling and filtering at each step (like local best-of-N), it reduces compounding errors. By maintaining multiple beams (unlike local best-of-N's single path), it avoids locally optimal decisions that are globally suboptimal. Our experiments confirmed this: beam search with both beam width and branching factor greater than one achieved better scaling than either global or local best-of-N alone.
Moreover, EnCompass enables mixing search strategies at different levels of granularity within the same workflow. In our code translation example, we found that applying beam search at the class level as well as the method level yielded the best performance, demonstrating the value of fine-grained control over search implementation details.
Beam search is also the most difficult of these three strategies to implement without EnCompass. The traditional approaches might use explicit state machines or hard-coded beams of program states, which are cumbersome and suffer from the issues described earlier. By making a greater range of search strategies practical to implement and experiment with, EnCompass unlocks approaches that would otherwise be too complex to attempt, potentially revealing better scaling laws.
Conclusion
Efficient and reliable reasoning is the foundation for AI systems that can tackle long-horizon, open-ended tasks. EnCompass rethinks how AI agents search by separating what agents do from how they explore solutions, turning sophisticated search strategies into a composable part of agent design. By bridging the gap between human intent and reliable execution, EnCompass makes it easier to build AI systems that can serve as true collaborators in discovery and invention.
We view EnCompass as a building block in a broader effort to make intelligent systems more trustworthy, efficient, and creative. Designing and evaluating more general search strategies remains an open and critical area of AI research. Search at scale is fundamental to learning, discovery, invention, and ultimately how intelligent systems push the boundaries of what is possible.
If you're interested in building AI agents at the frontier or exploring these ideas further, we'd love to hear from you at hello@asari.ai.
You can read the full research paper here.