But new benchmarks aim to better measure how well legal models work in the real world. The Professional Reasoning Benchmark, published by Scale AI in November, evaluated leading LLMs on legal and finance tasks designed by professionals in those fields. The study found that the models fell critically short of the reliability needed for professional adoption, with the best-performing model scoring only 37 percent on the most difficult legal problems, meaning it earned only about a third of the possible points on its evaluation rubric. The models often made incorrect legal judgments, and when they did reach the right conclusions, they often did so through incomplete or ambiguous reasoning.
“There are basically no substitutes [for] your lawyer,” says Afra Feyza Akyürek, lead author of the paper. “Although many people think that an LLM has a good grasp of the law, it still lags behind.”
The paper builds on other benchmarks that measure model performance on economically valuable tasks. The AI Productivity Index, published by the data firm Mercor in September and updated in December, found that models had “substantial limitations” in performing legal tasks. The best-performing model scored 77.9 percent on those tasks, meaning it satisfied nearly four out of five rubric criteria. A model scoring that well could generate considerable economic value in some industries, but in sectors where errors are costly, it may not be useful at all, an early version of the study said.
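To make those headline numbers concrete, here is a minimal, hypothetical sketch of rubric-based grading of the kind these benchmarks describe, in which an answer earns points for each criterion it satisfies and the reported score is the share of possible points earned. The criteria, point values, and function name below are invented for illustration; this is not Scale AI's or Mercor's actual grading code.

```python
# Hypothetical illustration of rubric-based scoring, where a reported score
# like "37 percent" means the answer earned 37% of the possible rubric points.

def rubric_score(criteria: list[tuple[str, float, bool]]) -> float:
    """Return the percentage of possible rubric points earned.

    Each criterion is (description, points_possible, satisfied).
    """
    possible = sum(points for _, points, _ in criteria)
    earned = sum(points for _, points, satisfied in criteria if satisfied)
    return 100.0 * earned / possible if possible else 0.0

# Example: one legal answer graded against a four-item rubric (made-up values).
answer_rubric = [
    ("Identifies the controlling statute", 3.0, True),
    ("Applies the correct legal test", 4.0, False),
    ("Cites relevant precedent", 2.0, False),
    ("Reaches a defensible conclusion", 1.0, True),
]

print(f"Rubric score: {rubric_score(answer_rubric):.1f}%")  # prints 40.0%
```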
Professional benchmarks are a big step toward evaluating the real-world abilities of LLMs, but they still don’t capture what lawyers actually do. “These questions are more challenging than past benchmarks,” says Jon Choi, a law professor at the Washington School of Law who collaborated on a study of legal benchmarks in 2023.
Unlike math or coding, where LLMs have made significant progress, legal reasoning can be difficult for models to learn. Choi says the law deals with messy real-world problems, riddled with ambiguity and subjectivity, that often have no right answer. Making matters worse, a lot of legal work isn’t recorded in ways that can be used to train models, he says. When it is, the documents can span hundreds of pages, scattered across a complex hierarchy of laws, regulations, and court cases.
But another, more fundamental limitation may be that LLMs simply aren’t trained to think like lawyers. “Reasoning models still don’t fully reason through problems the way we humans do,” says Julian Nyarko, a law professor at Stanford Law School. Models may lack a mental model of the world, he argues: the ability to simulate and predict scenarios, which may be at the heart of complex legal reasoning. Training LLMs under the current paradigm of predicting the next word may only get us so far.