Share My Research is Synced's column that welcomes scholars to share their research advances with more than 1.5 million global AI enthusiasts. Beyond technological breakthroughs, Share My Research also calls for the interesting stories behind the work. Contact us: china.zhang@jiqizhixin.com
Meet the authors
Institutions: Penn State University, Duke University, Google DeepMind, University of Washington, Meta, Nanyang Technological University, and Oregon State University. The co-first authors are Shaokun Zhang (Penn State University) and Ming Yin (Duke University).
In recent years, LLM-based multi-agent systems have attracted wide attention for their ability to solve complex problems through collaboration. Yet despite this flurry of activity, task failures remain a common sight in these systems. That leaves developers with a critical question: which agent, at which step, was responsible for the failure? Pinpointing the root cause typically means combing through lengthy interaction logs, a needle-in-a-haystack search that is both time-consuming and labor-intensive.
This is a familiar frustration for developers. In increasingly complex multi-agent systems, failures are not only common but also notoriously hard to diagnose, owing to the free-form cooperation among agents and the long information chains they produce. Without a way to quickly localize a failure, teams are blocked from rapidly iterating on and correcting their systems.
To tackle this challenge, researchers from Penn State University and Duke University, in collaboration with institutions including Google DeepMind, have introduced a novel research problem: automated failure attribution. They have built the first benchmark dataset for this task, Who&When, and have developed and evaluated several automated attribution methods. The work not only exposes the complexity of the task but also opens a new path toward more reliable LLM multi-agent systems.
The paper has been accepted as a Spotlight presentation at ICML 2025, a top-tier machine learning conference, and the code and dataset are now fully open source.
Paper: https://arxiv.org/pdf/2505.00212
Code: https://github.com/mingyin1/Agents_Failure_Attribution
Dataset: https://huggingface.co/datasets/Kevin355/Who_and_When
Research background and challenges
LLM-powered multi-agent systems have shown immense potential across many domains, but they are also fragile: an error by a single agent, a misunderstanding between agents, or a mistake in passing information along can sink an entire task.
Currently, when a system fails, developers are usually left with manual, inefficient ways of debugging:
- Manual log archaeology: developers must comb through long interaction logs by hand to find the source of the problem.
- Reliance on expertise: the debugging process leans heavily on the developer's deep understanding of the system and on manual effort.
This needle-in-a-haystack style of debugging is not only inefficient but also a serious drag on rapid system iteration and reliability improvement. Automated, systematic methods are urgently needed to pinpoint the causes of failures and close the gap between "evaluation results" and "system improvement".
Core contributions
The paper makes several key contributions to address the challenges above.
1. Defining a new problem: the paper is the first to formally define "automated failure attribution" as a distinct research task. The task is specified as identifying the failure-responsible agent and the decisive error step that led to the task's failure.
2. Building the first benchmark dataset, Who&When: the dataset contains a wide range of failure logs collected from 127 LLM multi-agent systems, which were either algorithmically generated or hand-crafted by experts to ensure realism and diversity. Each failure log carries fine-grained human annotations (an illustrative record follows the field list below):
- Who: the agent responsible for the failure.
- When: the specific interaction step at which the decisive error occurred.
- Why: a natural-language explanation of the failure.
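To make the annotation format concrete, here is a minimal illustrative record in Python. Everything in it, including the field names (who, when, why) and the toy flight-booking log, is a hypothetical sketch of the schema described above; the released dataset's actual column names may differ, so consult the dataset card at the Hugging Face link given earlier.

```python
# Illustrative shape of one annotated failure record in Who&When.
# All names and contents here are hypothetical; check the dataset card
# on Hugging Face for the schema actually used in the release.
example_record = {
    "query": "Book the cheapest flight from NYC to London next Friday.",
    "log": [
        {"step": 0, "agent": "Planner",  "content": "Split task: search, then book."},
        {"step": 1, "agent": "Searcher", "content": "Cheapest Friday fare: $412."},
        {"step": 2, "agent": "Booker",   "content": "Booked a Saturday flight for $388."},
    ],
    "who":  "Booker",   # agent responsible for the failure
    "when": 2,          # interaction step containing the decisive error
    "why":  "The Booker optimized for price and ignored the requested date.",
}

# The attribution task: given `query` and `log`, recover `who` and `when`.
print(example_record["who"], example_record["when"])
```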
3. Designing initial "automated attribution" methods: using the Who&When dataset, the paper designs and evaluates three distinct methods for automated failure attribution (a minimal code sketch of all three follows this list):
- All-at-once: this method gives the LLM the user query and the complete failure log and asks it to identify the responsible agent and the decisive error step in a single pass. While cost-effective, it can struggle to pinpoint the precise error within long contexts.
- Step-by-step: this approach mimics manual debugging by having the LLM walk through the interaction log and render a judgment at each step until the error is located. It is more accurate at locating the error step, but it costs more and risks accumulating errors along the way.
- Binary search: a compromise between the first two, this strategy repeatedly splits the log in half and uses the LLM to decide which half contains the error, then recurses into the identified segment, striking a balance between cost and performance.
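To ground these descriptions, below is a minimal Python sketch of all three strategies. The ask_llm helper, the prompt wording, and the answer parsing are assumptions made for illustration; the authors' open-source implementation (linked above) is the authoritative reference.

```python
# Minimal sketch of the three attribution strategies. `ask_llm` is a
# hypothetical helper wrapping any chat-completion API; the real
# prompts and output parsing are more involved than shown here.

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in your chat-completion client."""
    raise NotImplementedError

def all_at_once(query: str, log: list[str]) -> str:
    """Single pass: show the LLM the whole failure log at once."""
    prompt = (
        f"User query: {query}\nFull interaction log:\n" + "\n".join(log)
        + "\nWhich agent is responsible for the failure, and at which "
          "step did the decisive error occur? Explain your reasoning."
    )
    return ask_llm(prompt)

def step_by_step(query: str, log: list[str]) -> int:
    """Walk the log one step at a time, judging each step in turn."""
    for i in range(len(log)):
        prompt = (
            f"User query: {query}\nLog so far:\n" + "\n".join(log[: i + 1])
            + f"\nDoes step {i} contain the decisive error? Answer yes or no."
        )
        if ask_llm(prompt).strip().lower().startswith("yes"):
            return i
    return len(log) - 1  # fall back to the last step if nothing is flagged

def binary_search(query: str, log: list[str]) -> int:
    """Repeatedly halve the log, asking which half contains the error."""
    lo, hi = 0, len(log)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        prompt = (
            f"User query: {query}\n"
            f"Segment A (steps {lo}-{mid - 1}):\n" + "\n".join(log[lo:mid])
            + f"\nSegment B (steps {mid}-{hi - 1}):\n" + "\n".join(log[mid:hi])
            + "\nWhich segment contains the decisive error? Answer A or B."
        )
        if ask_llm(prompt).strip().upper().startswith("A"):
            hi = mid  # recurse into the first half
        else:
            lo = mid  # recurse into the second half
    return lo  # index of the step judged to contain the decisive error
```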
Experimental results and key findings
Experiments were carried out in two settings: one where the LLM knows the ground-truth answer to the problem the multi-agent system is trying to solve (with ground truth), and one where it does not (without ground truth). The primary model used was GPT-4o, though other models were also tested. Systematically evaluating the three methods on Who&When yielded several important insights:
- A long way to go: current methods are far from perfect. Even the best-performing method reached only about 53.5% accuracy in identifying the responsible agent, and a mere 14.2% in identifying the exact error step. Some methods performed worse than random guessing, underscoring the difficulty of the task.
- No all-in-one solution: different methods shine on different parts of the problem. The all-at-once method is better at pinpointing "who", while the step-by-step method is more effective at determining "when". The binary search method delivers middle-of-the-road performance on both.
- Hybrid approaches show promise, but at higher cost: the researchers found that combining methods, such as using the all-at-once pass to nominate a suspect agent and then applying the step-by-step method to locate the error, can improve overall performance (see the sketch after this list). However, it comes with a significant increase in computational cost.
- Even state-of-the-art models struggle: surprisingly, the latest reasoning models, such as OpenAI o1 and DeepSeek R1, also find the task challenging. This highlights the inherent difficulty of automated failure attribution, which demands far more sophisticated reasoning than many conventional tasks.
- The importance of explicit reasoning: prompting the LLM to spell out its reasoning in the all-at-once and step-by-step methods was shown to improve performance.
- Context length is a limiting factor: the study also found that as failure logs grow longer, the performance of all attribution methods degrades, with the drop being more pronounced for error-step accuracy.
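As a rough illustration of the hybrid idea from the findings above, the sketch below nominates a suspect agent with a cheap all-at-once pass, then localizes the error with a step-by-step pass restricted to that agent's messages. It reuses all_at_once and step_by_step from the earlier sketch; parse_agent_name and agent_of are hypothetical helpers, and running both passes is exactly where the extra computational cost noted above comes from.

```python
# Hedged sketch of a hybrid strategy: cheap "who" pass first, then a
# focused "when" pass. Reuses all_at_once() and step_by_step() from the
# earlier sketch; the two helpers below are illustrative assumptions.

def parse_agent_name(answer: str) -> str:
    """Hypothetical parser extracting an agent name from the LLM's answer."""
    for line in answer.splitlines():
        if line.lower().startswith("responsible agent:"):
            return line.split(":", 1)[1].strip()
    return answer.strip().split()[0]  # crude fallback

def hybrid(query: str, log: list[str], agent_of) -> tuple[str, int]:
    """Return (responsible agent, decisive error step)."""
    suspect = parse_agent_name(all_at_once(query, log))
    # agent_of(step) maps a log entry to the agent that produced it.
    idx = [i for i, step in enumerate(log) if agent_of(step) == suspect]
    if not idx:  # parsing failed or agent never appears: search the full log
        return suspect, step_by_step(query, log)
    sub_log = [log[i] for i in idx]
    # step_by_step returns an index into sub_log; map it back to the full log.
    return suspect, idx[step_by_step(query, sub_log)]
```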
Future outlook: paving the way for more reliable multi-agent systems
Automated failure attribution stands to become an important link in the development lifecycle of multi-agent systems, turning the vexing question of "what went wrong and who is to blame" into a quantifiable, analyzable problem. By building a bridge between evaluation and improvement, we can ultimately create multi-agent systems that are more reliable, more intelligent, and more trustworthy.